Statement
A single change to the type signature of the discount function — from constant γ_c (or state-based γ_s(s)) to transition-based γ(s, a, s') : S × A × S → [0, 1] — is sufficient to express episodic, continuing, options-based, and interest-weighted RL tasks within one unified RL-task tuple (P, r, γ, i). The generalized Bellman operator T^(λ) v = r^λ_π + P^λ_π v (with P^λ_π = (1 − λ_c)(I − λ_c P_{π,γ})^{-1} P_{π,γ} for constant λ_c) remains a contraction under either the on-policy weighting D_π or the emphasis weighting M, and the classical Tsitsiklis–Van Roy approximation-error bound is recovered as a special case.
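As a concrete illustration, the operator above can be assembled numerically for a toy MDP. This is a minimal sketch, not code from the paper: the MDP, policy, and discount values are invented, and the episodic structure is encoded by setting γ(s, a, s') = 0 on transitions into one designated "terminal" state, exactly the paper's trick of expressing termination without adding hypothetical states.

```python
# Sketch: building P_{pi,gamma} and P^lambda_pi for a toy 3-state MDP.
# All names and values (P, pi, gamma, lam) are illustrative assumptions.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[s, a, s'] : transition kernel; pi[s, a] : target policy.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

# Transition-based discount: gamma(s, a, s') = 0 on transitions into
# state 2 (episode termination), 0.9 elsewhere -- an episodic task
# expressed inside the tuple (P, r, gamma, i) with no extra states.
gamma = np.full((n_states, n_actions, n_states), 0.9)
gamma[:, :, 2] = 0.0

# Discounted state-to-state matrix:
#   P_{pi,gamma}[s, s'] = sum_a pi(a|s) P(s'|s,a) gamma(s, a, s')
P_pi_gamma = np.einsum('sa,sap,sap->sp', pi, P, gamma)

# P^lambda_pi = (1 - lam) (I - lam P_{pi,gamma})^{-1} P_{pi,gamma}
lam = 0.5
I = np.eye(n_states)
P_lam = (1 - lam) * np.linalg.solve(I - lam * P_pi_gamma, P_pi_gamma)

# With gamma < 1 on reachable transitions (A3), the spectral radius
# stays below 1, so T^(lambda) v = r^lambda_pi + P^lambda_pi v contracts.
assert np.max(np.abs(np.linalg.eigvals(P_lam))) < 1.0
```

The same construction covers the continuing case (γ ≡ γ_c everywhere) and soft termination (γ small but nonzero on the terminating transitions) by changing only the `gamma` array.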
Evidence summary
The single supporting paper provides three layers of evidence:
- Definitional unification (Sections 2.1–2.3). Episodic-vs-continuing tasks, options (π, β, I), and general value functions are each shown to be expressible as choices of γ and i within the RL-task tuple, without modifying the underlying MDP state space. The taxi-domain demonstration (Section 3) further shows that the transition-based form distinguishes hard from soft termination in cases where the optimal policy depends on post-termination orientation, while state-based discounting cannot represent the desired return without adding hypothetical states.
- Theoretical contraction (Section 5.1). Lemmas 1 and 2 lift the Bellman contraction proof from scalar arithmetic on γ_c to matrix-norm arithmetic on P_{π,γ}, under assumptions A1–A3. Theorem 1 then gives ‖v − v*‖_D ≤ (1 − s_D)^{-1} ‖Π_D v* − v*‖_D, whose coefficient reduces to (1 − γ_c λ_c)/(1 − γ_c) in the constant-γ case (matching the original Tsitsiklis–Van Roy 1997 bound).
- Algorithmic propagation (Section 5.2). The paper extends previous convergence results for Emphatic TD / ELSTD to RL tasks via positive-definiteness of A under the emphasis weighting, and lifts the Tagorti–Scherrer 2015 LSTD(λ) convergence-rate result from continuing-only to episodic settings by replacing the constant-γ matrix with the general P^λ_π.
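For the record, the constant-γ reduction in the second bullet is a one-line computation, assuming s_D coincides there with the classical TD(λ) contraction factor from Tsitsiklis–Van Roy 1997:

```latex
s_D = \frac{\gamma_c (1 - \lambda_c)}{1 - \gamma_c \lambda_c},
\qquad
(1 - s_D)^{-1}
= \left( \frac{(1 - \gamma_c \lambda_c) - \gamma_c (1 - \lambda_c)}
              {1 - \gamma_c \lambda_c} \right)^{-1}
= \frac{1 - \gamma_c \lambda_c}{1 - \gamma_c},
```

since the numerator simplifies as (1 − γ_c λ_c) − γ_c(1 − λ_c) = 1 − γ_c.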
Conditions and scope
- Finite spaces (A1): finite state and action sets.
- Stationarity (A2): unique invariant distributions d_π, d_μ for the target and behavior policies.
- Reachable termination (A3): at least one transition (s, a, s') with γ(s, a, s') < 1 and π(s, a) Pr(s, a, s') > 0. Without this, s_D = 1 and the contraction degenerates.
- Function-class regularity (A4, linear FA case): the approximation subspace spans the states of positive interest, ensuring that the contraction holds under the relevant weighting.
- On-policy vs off-policy: the on-policy weighting D_π requires the additional assumption that certain state-based-discounting pathologies do not occur (Section 5.1 example); in fully general transition-based settings the emphasis weighting M is needed.
- Interpretability of s_D: closed-form bounds analogous to (1 − γ_c λ_c)/(1 − γ_c) are not available outside the constant-γ case; only empirical intuition (Table 1, taxi domain) is given.
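Even without a closed form, s_D can be estimated numerically for a given task as the operator norm of P^λ_π in the d_π-weighted norm, s_D = ‖D^{1/2} P^λ_π D^{-1/2}‖_2. The sketch below does this for an invented 4-state example with one soft-termination transition; it is illustrative only, and the matrices are assumptions rather than anything from the paper.

```python
# Sketch: empirically estimating the contraction modulus
#   s_D = || D^{1/2} P^lambda_pi D^{-1/2} ||_2
# for a transition-based discount with no closed-form bound.
import numpy as np

rng = np.random.default_rng(1)
n = 4

# State-to-state matrix under the target policy (rows sum to 1),
# and transition discounts gamma(s, s') after marginalizing actions.
P_pi = rng.dirichlet(np.ones(n), size=n)
G = np.full((n, n), 0.95)
G[0, 1] = 0.0          # one reachable terminating transition (A3)
P_pi_gamma = P_pi * G  # elementwise: P(s'|s) * gamma(s, s')

lam = 0.8
P_lam = (1 - lam) * np.linalg.solve(np.eye(n) - lam * P_pi_gamma,
                                    P_pi_gamma)

# Stationary distribution d_pi of P_pi: left eigenvector for
# eigenvalue 1 (Perron root of a positive stochastic matrix).
w, V = np.linalg.eig(P_pi.T)
d = np.real(V[:, np.argmax(np.real(w))])
d = np.abs(d) / np.abs(d).sum()

# Operator norm in ||.||_D via the similarity transform D^{1/2} . D^{-1/2}.
D_sqrt = np.sqrt(d)
s_D = np.linalg.norm((D_sqrt[:, None] * P_lam) / D_sqrt[None, :], ord=2)
assert s_D < 1.0  # A3 holds and gamma <= 0.95, so the operator contracts
```

Sweeping `G` over a family of discount schedules (subgoal hierarchies, soft-termination chains) would give exactly the kind of empirical s_D picture the paper reports in Table 1.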
Counter-evidence
None encountered in the source paper. The paper itself supplies a counterexample for state-based discounting (Section 5.1), showing that T^(λ) is not always a contraction under D_π with state-based γ_s; this is part of the motivation for the more flexible emphasis weighting M.
Linked ideas
(populated as new ideas reference this claim)
Open questions
- Can closed-form bounds on s_D be derived for natural classes of transition-based discount schedules (subgoal hierarchies, soft-termination chains)?
- Does the unification extend cleanly to continuous state/action spaces, where assumptions A1–A2 do not hold?
- Can the contraction analysis be lifted to nonlinear function classes (deep RL value approximators) that violate A4?
- Tangential cross-domain question: does the same (s, a, s')-style transition-discount analysis apply to regime-dependent discounting in regime-switching pricing kernels (e.g., the Riccati / geometric-tail contraction analysis used in Markov-switching asset-pricing models)? The mathematical shape is similar, but the application class is different.