Definition
A discount function $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ whose value depends on the transition triple $(s, a, s')$ rather than only on the current state, the current state-action pair, or a global constant. The return becomes
$$G_t = R_{t+1} + \gamma(S_t, A_t, S_{t+1})\, G_{t+1},$$
which subsumes constant $\gamma_c$, state-based $\gamma(s')$, and episodic-termination absorbing-state formulations as special cases.
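The per-transition recursion can be sketched in a few lines; a minimal illustration (the trajectory format and the subgoal-cutoff discount function are my own assumptions, not from the paper):

```python
# Backward computation of G_t = R_{t+1} + gamma(S_t, A_t, S_{t+1}) * G_{t+1}.
def transition_return(transitions, gamma):
    """transitions: list of (s, a, r, s_next); gamma: (s, a, s') -> [0, 1]."""
    G = 0.0
    for s, a, r, s_next in reversed(transitions):
        G = r + gamma(s, a, s_next) * G
    return G

const = lambda s, a, s_next: 0.9                           # textbook constant discount
cutoff = lambda s, a, s_next: 0.0 if s_next == 2 else 0.9  # hard termination entering state 2

traj = [(0, "a", 1.0, 1), (1, "a", 1.0, 2), (2, "a", 1.0, 0)]
transition_return(traj, const)   # ~2.71: 1 + 0.9*(1 + 0.9*1)
transition_return(traj, cutoff)  # ~1.9: the return truncates on the transition into state 2
```

The environment trajectory continues past state 2; only the prediction target truncates there, which is the "discount as task specification, not dynamics" point.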
Intuition
The discount becomes a property of the task definition rather than of the environment dynamics. Episodic termination is encoded by setting $\gamma(s, a, s') = 0$ on the terminating transitions; soft termination (preserving some value of the post-transition continuation) is encoded by setting it to a small $\epsilon > 0$. This separates “what the agent is trying to optimize” from “how the environment evolves,” so multiple agents with different objectives can share the same MDP without inducing different state spaces.
Formal notation
For policy $\pi$, the transition-discounted transition matrix is
$$P_{\pi,\gamma}(s, s') = \sum_a \pi(s, a)\, P(s, a, s')\, \gamma(s, a, s'),$$
which differs from the standard $\gamma_c P_\pi$ by a per-triple weighting. The Bellman equation becomes $v_\pi = r_\pi + P_{\pi,\gamma} v_\pi$, and the generalized λ-operator is $T^{(\lambda)} v = v + (I - P_{\pi,\gamma\lambda})^{-1}(r_\pi + P_{\pi,\gamma} v - v)$ with $P_{\pi,\gamma\lambda}(s, s') = \sum_a \pi(s, a)\, P(s, a, s')\, \gamma(s, a, s')\, \lambda(s, a, s')$.
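A numeric sketch, assuming a small random tabular MDP (all names and values here are illustrative), of building $P_{\pi,\gamma}$ and solving the linear Bellman equation directly:

```python
import numpy as np

# Toy 3-state, 2-action MDP with randomly generated dynamics and rewards.
n_s, n_a = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probabilities
r = rng.standard_normal((n_s, n_a))                # r[s, a] expected rewards
pi = np.full((n_s, n_a), 1.0 / n_a)                # uniform policy

# Transition-based discount gamma[s, a, s']: 0.9 everywhere except
# "terminating" transitions into state 2, where gamma = 0.
gamma = np.full((n_s, n_a, n_s), 0.9)
gamma[:, :, 2] = 0.0

# P_{pi,gamma}(s, s') = sum_a pi(s, a) P(s, a, s') gamma(s, a, s')
P_pi_gamma = np.einsum("sa,sap,sap->sp", pi, P, gamma)
r_pi = np.einsum("sa,sa->s", pi, r)

# Solve v = r_pi + P_pi_gamma v  <=>  (I - P_pi_gamma) v = r_pi
v = np.linalg.solve(np.eye(n_s) - P_pi_gamma, r_pi)
```

The direct solve is valid here because every row sum of `P_pi_gamma` is at most 0.9, so $I - P_{\pi,\gamma}$ is invertible.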
Variants
- Hard termination — $\gamma(s, a, s') = 0$ on terminating transitions; recovers the absorbing-state formulation without adding states.
- Soft termination — $\gamma(s, a, s') = \epsilon$ for small $\epsilon > 0$; preserves residual value of the continuation, useful when two pre-termination policies differ in post-termination value.
- State-based discounting $\gamma(s')$ — strictly less expressive: requires adding hypothetical states to represent an arbitrary transition-based MDP (White 2017, Theorem 2 / Appendix B).
- Constant discount — recovered by $\gamma(s, a, s') = \gamma_c$ everywhere; this is the textbook continuing-task case.
- Probabilistic discount — White 2017 Appendix A; treatment is unchanged after taking conditional expectations.
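A toy numeric sketch of the hard-vs-soft distinction (the numbers are mine): with $\gamma = 0$ at termination, two policies that differ only in post-termination value get identical values, while $\gamma = \epsilon$ breaks the tie.

```python
# Value of a terminating transition: reward r_term plus a gamma-weighted
# slice of the post-termination continuation value v_cont.
def term_value(r_term, v_cont, gamma_term):
    return r_term + gamma_term * v_cont

hard_a = term_value(1.0, 5.0, 0.0)    # policy A, hard termination -> 1.0
hard_b = term_value(1.0, -5.0, 0.0)   # policy B, hard termination -> 1.0 (indistinguishable)
soft_a = term_value(1.0, 5.0, 0.01)   # soft termination -> 1.05
soft_b = term_value(1.0, -5.0, 0.01)  # soft termination -> 0.95 (tie broken)
```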
Comparison
| Form | Type signature | Adds states for episodic? | Expresses options? | Soft termination? |
|---|---|---|---|---|
| Constant | scalar $\gamma_c \in [0, 1)$ | yes (absorbing state) | no | no |
| State-based | $\gamma: \mathcal{S} \to [0, 1]$ | sometimes (hypothetical states) | partial | no |
| Transition-based | $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ | no | yes (per-transition termination) | yes |
When to use
- Specifying episodic and continuing variants of the same environment without rewriting the MDP.
- Encoding options or subtasks by setting $1 - \gamma(s, a, s')$ to a termination probability per transition.
- Building general value function (GVF) demons in a Horde-style architecture, where many predictive demons share the underlying dynamics but have distinct termination structures.
- Predicting transition events (e.g., probability of hitting a wall) by setting $\gamma(s, a, s') = 0$ on the relevant self-loop transitions.
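The last bullet can be made concrete; in this toy setup (my own, not the paper's), a GVF predicts the probability that a symmetric random walk hits the left wall before exiting on the right, using cumulant 1 and $\gamma = 0$ on wall transitions and $\gamma = 1$ elsewhere:

```python
import numpy as np

n = 3                        # interior states 0, 1, 2; wall left of 0, exit right of 2
P_gamma = np.zeros((n, n))   # P_{pi,gamma} restricted to interior states
c = np.zeros(n)              # expected cumulant per state

for s in range(n):
    if s - 1 < 0:
        c[s] += 0.5 * 1.0         # wall event: cumulant 1, gamma = 0 (prediction stops)
    else:
        P_gamma[s, s - 1] += 0.5  # non-event step: cumulant 0, gamma = 1
    if s + 1 < n:
        P_gamma[s, s + 1] += 0.5  # stepping out right contributes nothing: cumulant 0, gamma = 0

v = np.linalg.solve(np.eye(n) - P_gamma, c)
# v ~ [0.75, 0.5, 0.25]: probability of hitting the wall from each interior state
```

The fixed point of this GVF is exactly the hitting probability, because every non-event step passes value through undiscounted and the event transitions contribute cumulant 1 then terminate the prediction.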
Known limitations
- The contraction-bound parameter is harder to interpret than the closed-form rate $\gamma_c$ of the constant-discount case, because $\gamma(s, a, s')$ varies across transitions (White 2017, Section 5; Table 1 gives only empirical intuition in the taxi domain).
- For some state-based discounting examples, the Bellman operator is not a contraction under the on-policy weighting $d_\pi$; the more flexible emphasis weighting $m$ is needed to guarantee contraction in fully general transition-based settings.
- Theoretical results assume finite state/action spaces and unique invariant distributions; the lift to continuous or non-ergodic settings is not addressed.
- Practical use with deep RL function approximators is unstudied — the contraction analysis depends on assumption A4 (interest-consistency of the function class), which neural networks generally violate.
Open problems
- Closed-form bounds on the contraction rate for natural classes of transition-based discount schedules (subgoal hierarchies, soft-termination chains), to recover the interpretability of the constant-$\gamma$ analysis.
- Principled choice of the soft-termination value $\epsilon$, trading bias of the truncated return against ease of credit assignment.
- Lifting the contraction analysis to nonlinear function classes used in deep RL.
Key papers
- white-2017-unifying-task-specification-rl — introduces the construct, proves the generalized Bellman contraction, demonstrates utility on the taxi domain.
My understanding
Transition-based discounting is the minimal type-signature change to $\gamma$ that absorbs options, episodic termination, and interest-weighted prediction into a single formalism. The mechanical lift of contraction proofs from scalar arithmetic on $\gamma$ to matrix-norm arithmetic on $P_{\pi,\gamma}$ (White 2017 Lemmas 1–2) is clean: as long as the row sums of $P_{\pi,\gamma}$ are strictly less than 1 somewhere along positive-probability transitions (assumption A3), the operator stays a contraction under either $d_\pi$ or the emphasis weighting $m$. Tangential parallel for the CRE asset-pricing project: regime-dependent discounting in Markov-switching pricing kernels has the same $(s, a, s')$-style transition signature, and the algebraic shape of the geometric-tail / Riccati contraction analysis on regime transition matrices echoes the analysis here.