Definition
A discount function $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ whose value depends on the transition triple $(s, a, s')$ rather than only on the current state, the current state-action pair, or a global constant. The return becomes
$$G_t = R_{t+1} + \gamma(S_t, A_t, S_{t+1})\, G_{t+1},$$
which subsumes constant $\gamma_c$, state-based $\gamma(s')$, and episodic-termination absorbing-state formulations as special cases.
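The per-transition recursion can be sketched in a few lines; a minimal illustration (the trajectory format and the subgoal-cutoff discount function are my own assumptions, not from the paper):

```python
# Backward computation of G_t = R_{t+1} + gamma(S_t, A_t, S_{t+1}) * G_{t+1}.
def transition_return(transitions, gamma):
    """transitions: list of (s, a, r, s_next); gamma: (s, a, s') -> [0, 1]."""
    G = 0.0
    for s, a, r, s_next in reversed(transitions):
        G = r + gamma(s, a, s_next) * G
    return G

const = lambda s, a, s_next: 0.9                           # textbook constant discount
cutoff = lambda s, a, s_next: 0.0 if s_next == 2 else 0.9  # hard termination entering state 2

traj = [(0, "a", 1.0, 1), (1, "a", 1.0, 2), (2, "a", 1.0, 0)]
transition_return(traj, const)   # ~2.71: 1 + 0.9*(1 + 0.9*1)
transition_return(traj, cutoff)  # ~1.9: the return truncates on the transition into state 2
```

The environment trajectory continues past state 2; only the prediction target truncates there, which is the "discount as task specification, not dynamics" point.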
Intuition
The discount becomes a property of the task definition rather than of the environment dynamics. Episodic termination is encoded by setting $\gamma(s, a, s') = 0$ on the terminating transitions; soft termination (preserving some value of the post-transition continuation) is encoded by setting it to a small $\epsilon > 0$. This separates “what the agent is trying to optimize” from “how the environment evolves,” so multiple agents with different objectives can share the same MDP without inducing different state spaces.
Formal notation
For policy $\pi$, the transition-discounted transition matrix is
$$P_{\pi,\gamma}(s, s') = \sum_a \pi(s, a)\, P(s, a, s')\, \gamma(s, a, s'),$$
which differs from the standard $\gamma_c P_\pi$ by a per-triple weighting. The Bellman equation becomes $v_\pi = r_\pi + P_{\pi,\gamma} v_\pi$, and the generalized λ-operator is $T^{(\lambda)} v = v + (I - P_{\pi,\gamma\lambda})^{-1}(r_\pi + P_{\pi,\gamma} v - v)$ with $P_{\pi,\gamma\lambda}(s, s') = \sum_a \pi(s, a)\, P(s, a, s')\, \gamma(s, a, s')\, \lambda(s, a, s')$.
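A numeric sketch, assuming a small random tabular MDP (all names and values here are illustrative), of building $P_{\pi,\gamma}$ and solving the linear Bellman equation directly:

```python
import numpy as np

# Toy 3-state, 2-action MDP with randomly generated dynamics and rewards.
n_s, n_a = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probabilities
r = rng.standard_normal((n_s, n_a))                # r[s, a] expected rewards
pi = np.full((n_s, n_a), 1.0 / n_a)                # uniform policy

# Transition-based discount gamma[s, a, s']: 0.9 everywhere except
# "terminating" transitions into state 2, where gamma = 0.
gamma = np.full((n_s, n_a, n_s), 0.9)
gamma[:, :, 2] = 0.0

# P_{pi,gamma}(s, s') = sum_a pi(s, a) P(s, a, s') gamma(s, a, s')
P_pi_gamma = np.einsum("sa,sap,sap->sp", pi, P, gamma)
r_pi = np.einsum("sa,sa->s", pi, r)

# Solve v = r_pi + P_pi_gamma v  <=>  (I - P_pi_gamma) v = r_pi
v = np.linalg.solve(np.eye(n_s) - P_pi_gamma, r_pi)
```

The direct solve is valid here because every row sum of `P_pi_gamma` is at most 0.9, so $I - P_{\pi,\gamma}$ is invertible.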
Variants
- Hard termination — $\gamma(s, a, s') = 0$ on terminating transitions; recovers the absorbing-state formulation without adding states.
- Soft termination — $\gamma(s, a, s') = \epsilon$ for small $\epsilon > 0$; preserves residual value of the continuation, useful when two pre-termination policies differ in post-termination value.
- State-based discounting $\gamma(s')$ — strictly less expressive: requires adding hypothetical states to represent an arbitrary transition-based MDP (White 2017, Theorem 2 / Appendix B).
- Constant discount — recovered by $\gamma(s, a, s') = \gamma_c$ everywhere; this is the textbook continuing-task case.
- Probabilistic discount — White 2017 Appendix A; treatment is unchanged after taking conditional expectations.
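A toy numeric sketch of the hard-vs-soft distinction (the numbers are mine): with $\gamma = 0$ at termination, two policies that differ only in post-termination value get identical values, while $\gamma = \epsilon$ breaks the tie.

```python
# Value of a terminating transition: reward r_term plus a gamma-weighted
# slice of the post-termination continuation value v_cont.
def term_value(r_term, v_cont, gamma_term):
    return r_term + gamma_term * v_cont

hard_a = term_value(1.0, 5.0, 0.0)    # policy A, hard termination -> 1.0
hard_b = term_value(1.0, -5.0, 0.0)   # policy B, hard termination -> 1.0 (indistinguishable)
soft_a = term_value(1.0, 5.0, 0.01)   # soft termination -> 1.05
soft_b = term_value(1.0, -5.0, 0.01)  # soft termination -> 0.95 (tie broken)
```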
Comparison
| Form | Type signature | Adds states for episodic? | Expresses options? | Soft termination? |
|---|---|---|---|---|
| Constant | scalar $\gamma_c \in [0, 1)$ | yes (absorbing state) | no | no |
| State-based | $\gamma: \mathcal{S} \to [0, 1]$ | sometimes (hypothetical states) | partial | no |
| Transition-based | $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ | no | yes (per-transition termination) | yes |
When to use
- Specifying episodic and continuing variants of the same environment without rewriting the MDP.
- Encoding options or subtasks by setting $1 - \gamma(s, a, s')$ to a termination probability per transition.
- Building general value function (GVF) demons in a Horde-style architecture, where many predictive demons share the underlying dynamics but have distinct termination structures.
- Predicting transition events (e.g., probability of hitting a wall) by setting $\gamma(s, a, s') = 0$ on the relevant self-loop transitions.
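The last bullet can be made concrete; in this toy setup (my own, not the paper's), a GVF predicts the probability that a symmetric random walk hits the left wall before exiting on the right, using cumulant 1 and $\gamma = 0$ on wall transitions and $\gamma = 1$ elsewhere:

```python
import numpy as np

n = 3                        # interior states 0, 1, 2; wall left of 0, exit right of 2
P_gamma = np.zeros((n, n))   # P_{pi,gamma} restricted to interior states
c = np.zeros(n)              # expected cumulant per state

for s in range(n):
    if s - 1 < 0:
        c[s] += 0.5 * 1.0         # wall event: cumulant 1, gamma = 0 (prediction stops)
    else:
        P_gamma[s, s - 1] += 0.5  # non-event step: cumulant 0, gamma = 1
    if s + 1 < n:
        P_gamma[s, s + 1] += 0.5  # stepping out right contributes nothing: cumulant 0, gamma = 0

v = np.linalg.solve(np.eye(n) - P_gamma, c)
# v ~ [0.75, 0.5, 0.25]: probability of hitting the wall from each interior state
```

The fixed point of this GVF is exactly the hitting probability, because every non-event step passes value through undiscounted and the event transitions contribute cumulant 1 then terminate the prediction.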
Known limitations
- The contraction-bound parameter is harder to interpret than the closed-form rate $\gamma_c$ of the constant-discount case, because $\gamma(s, a, s')$ varies across transitions (White 2017, Section 5; Table 1 gives only empirical intuition in the taxi domain).
- For some state-based discounting examples, the Bellman operator is not a contraction under the on-policy weighting $d_\pi$; the more flexible emphasis weighting $m$ is needed to guarantee contraction in fully general transition-based settings.
- Theoretical results assume finite state/action spaces and unique invariant distributions; the lift to continuous or non-ergodic settings is not addressed.
- Practical use with deep RL function approximators is unstudied — the contraction analysis depends on assumption A4 (interest-consistency of the function class), which neural networks generally violate.
Open problems
- Closed-form bounds on the contraction rate for natural classes of transition-based discount schedules (subgoal hierarchies, soft-termination chains), to recover the interpretability of the constant-$\gamma$ analysis.
- Principled choice of the soft-termination value $\epsilon$, trading bias of the truncated return against ease of credit assignment.
- Lifting the contraction analysis to nonlinear function classes used in deep RL.
Key papers
- white-2017-unifying-task-specification-rl — introduces the construct, proves the generalized Bellman contraction, demonstrates utility on the taxi domain.
My understanding
Transition-based discounting is the minimal type-signature change to $\gamma$ that absorbs options, episodic termination, and interest-weighted prediction into a single formalism. The mechanical lift of contraction proofs from scalar arithmetic on $\gamma$ to matrix-norm arithmetic on $P_{\pi,\gamma}$ (White 2017 Lemmas 1–2) is clean: as long as the row sums of $P_{\pi,\gamma}$ are strictly less than 1 somewhere along positive-probability transitions (assumption A3), the operator stays a contraction under either $d_\pi$ or the emphasis weighting $m$. Tangential parallel for the CRE asset-pricing project: regime-dependent discounting in Markov-switching pricing kernels has the same $(s, a, s')$-style transition signature, and the algebraic shape of the geometric-tail / Riccati contraction analysis on regime transition matrices echoes the analysis here.