Statement
A single change to the type signature of the discount function — from constant γ_c (or state-based γ_s(s)) to transition-based γ(s, a, s') : S × A × S → [0, 1] — is sufficient to express episodic, continuing, options-based, and interest-weighted RL tasks within one unified RL-task tuple (P, r, γ, i). The generalized Bellman operator T^(λ) v = r^λ_π + P^λ_π v (with P^λ_π = (1 − λ_c)(I − λ_c P_{π,γ})^{-1} P_{π,γ} for constant λ_c) remains a contraction under either the on-policy weighting D_π or the emphasis weighting M, and the classical Tsitsiklis–Van Roy approximation-error bound is recovered as a special case.
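As a concrete illustration, the operator above can be assembled numerically for a toy MDP. This is a minimal sketch, not code from the paper: the MDP, policy, and discount values are invented, and the episodic structure is encoded by setting γ(s, a, s') = 0 on transitions into one designated "terminal" state, exactly the paper's trick of expressing termination without adding hypothetical states.

```python
# Sketch: building P_{pi,gamma} and P^lambda_pi for a toy 3-state MDP.
# All names and values (P, pi, gamma, lam) are illustrative assumptions.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[s, a, s'] : transition kernel; pi[s, a] : target policy.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

# Transition-based discount: gamma(s, a, s') = 0 on transitions into
# state 2 (episode termination), 0.9 elsewhere -- an episodic task
# expressed inside the tuple (P, r, gamma, i) with no extra states.
gamma = np.full((n_states, n_actions, n_states), 0.9)
gamma[:, :, 2] = 0.0

# Discounted state-to-state matrix:
#   P_{pi,gamma}[s, s'] = sum_a pi(a|s) P(s'|s,a) gamma(s, a, s')
P_pi_gamma = np.einsum('sa,sap,sap->sp', pi, P, gamma)

# P^lambda_pi = (1 - lam) (I - lam P_{pi,gamma})^{-1} P_{pi,gamma}
lam = 0.5
I = np.eye(n_states)
P_lam = (1 - lam) * np.linalg.solve(I - lam * P_pi_gamma, P_pi_gamma)

# With gamma < 1 on reachable transitions (A3), the spectral radius
# stays below 1, so T^(lambda) v = r^lambda_pi + P^lambda_pi v contracts.
assert np.max(np.abs(np.linalg.eigvals(P_lam))) < 1.0
```

The same construction covers the continuing case (γ ≡ γ_c everywhere) and soft termination (γ small but nonzero on the terminating transitions) by changing only the `gamma` array.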
Evidence summary
The single supporting paper provides three layers of evidence:
- Definitional unification (Sections 2.1–2.3). Episodic-vs-continuing tasks, options (π, β, I), and general value functions are each shown to be expressible as choices of γ and i within the RL-task tuple, without modifying the underlying MDP state space. The taxi-domain demonstration (Section 3) further shows that the transition-based form distinguishes hard from soft termination in cases where the optimal policy depends on post-termination orientation, while state-based discounting cannot represent the desired return without adding hypothetical states.
- Theoretical contraction (Section 5.1). Lemmas 1 and 2 lift the Bellman contraction proof from scalar arithmetic on γ_c to matrix-norm arithmetic on P_{π,γ}, under assumptions A1–A3. Theorem 1 then gives ‖v − v*‖_D ≤ (1 − s_D)^{-1} ‖Π_D v* − v*‖_D, whose coefficient reduces to (1 − γ_c λ_c)/(1 − γ_c) in the constant-γ case (matching the original Tsitsiklis–Van Roy 1997 bound).
- Algorithmic propagation (Section 5.2). The paper extends previous convergence results for Emphatic TD / ELSTD to RL tasks via positive-definiteness of A under the emphasis weighting, and lifts the Tagorti–Scherrer 2015 LSTD(λ) convergence-rate result from continuing-only to episodic settings by replacing the constant-γ matrix with the general P^λ_π.
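For the record, the constant-γ reduction in the second bullet is a one-line computation, assuming s_D coincides there with the classical TD(λ) contraction factor from Tsitsiklis–Van Roy 1997:

```latex
s_D = \frac{\gamma_c (1 - \lambda_c)}{1 - \gamma_c \lambda_c},
\qquad
(1 - s_D)^{-1}
= \left( \frac{(1 - \gamma_c \lambda_c) - \gamma_c (1 - \lambda_c)}
              {1 - \gamma_c \lambda_c} \right)^{-1}
= \frac{1 - \gamma_c \lambda_c}{1 - \gamma_c},
```

since the numerator simplifies as (1 − γ_c λ_c) − γ_c(1 − λ_c) = 1 − γ_c.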
Conditions and scope
- Finite spaces (A1): finite state and action sets.
- Stationarity (A2): unique invariant distributions d_π, d_μ for the target and behavior policies.
- Reachable termination (A3): at least one transition (s, a, s') with γ(s, a, s') < 1 and π(s, a) Pr(s, a, s') > 0. Without this, s_D = 1 and the contraction degenerates.
- Function-class regularity (A4, linear FA case): the approximation subspace spans the states of positive interest, ensuring that the contraction holds under the relevant weighting.
- On-policy vs off-policy: the on-policy weighting D_π requires the additional assumption that certain state-based-discounting pathologies do not occur (Section 5.1 example); in fully general transition-based settings the emphasis weighting M is needed.
- Interpretability of s_D: closed-form bounds analogous to (1 − γ_c λ_c)/(1 − γ_c) are not available outside the constant-γ case; only empirical intuition (Table 1, taxi domain) is given.
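Even without a closed form, s_D can be estimated numerically for a given task as the operator norm of P^λ_π in the d_π-weighted norm, s_D = ‖D^{1/2} P^λ_π D^{-1/2}‖_2. The sketch below does this for an invented 4-state example with one soft-termination transition; it is illustrative only, and the matrices are assumptions rather than anything from the paper.

```python
# Sketch: empirically estimating the contraction modulus
#   s_D = || D^{1/2} P^lambda_pi D^{-1/2} ||_2
# for a transition-based discount with no closed-form bound.
import numpy as np

rng = np.random.default_rng(1)
n = 4

# State-to-state matrix under the target policy (rows sum to 1),
# and transition discounts gamma(s, s') after marginalizing actions.
P_pi = rng.dirichlet(np.ones(n), size=n)
G = np.full((n, n), 0.95)
G[0, 1] = 0.0          # one reachable terminating transition (A3)
P_pi_gamma = P_pi * G  # elementwise: P(s'|s) * gamma(s, s')

lam = 0.8
P_lam = (1 - lam) * np.linalg.solve(np.eye(n) - lam * P_pi_gamma,
                                    P_pi_gamma)

# Stationary distribution d_pi of P_pi: left eigenvector for
# eigenvalue 1 (Perron root of a positive stochastic matrix).
w, V = np.linalg.eig(P_pi.T)
d = np.real(V[:, np.argmax(np.real(w))])
d = np.abs(d) / np.abs(d).sum()

# Operator norm in ||.||_D via the similarity transform D^{1/2} . D^{-1/2}.
D_sqrt = np.sqrt(d)
s_D = np.linalg.norm((D_sqrt[:, None] * P_lam) / D_sqrt[None, :], ord=2)
assert s_D < 1.0  # A3 holds and gamma <= 0.95, so the operator contracts
```

Sweeping `G` over a family of discount schedules (subgoal hierarchies, soft-termination chains) would give exactly the kind of empirical s_D picture the paper reports in Table 1.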
Counter-evidence
None encountered in the source paper. The paper itself supplies a counterexample for state-based discounting (Section 5.1), showing that T^(λ) is not always a contraction under D_π with state-based γ_s; this is part of the motivation for the more flexible emphasis weighting M.
Linked ideas
(populated as new ideas reference this claim)
Open questions
- Can closed-form bounds on s_D be derived for natural classes of transition-based discount schedules (subgoal hierarchies, soft-termination chains)?
- Does the unification extend cleanly to continuous state/action spaces, where assumptions A1–A2 do not hold?
- Can the contraction analysis be lifted to nonlinear function classes (deep RL value approximators) that violate A4?
- Tangential cross-domain question: does the same (s, a, s')-style transition-discount analysis apply to regime-dependent discounting in regime-switching pricing kernels (e.g., the Riccati / geometric-tail contraction analysis used in Markov-switching asset-pricing models)? The mathematical shape is similar, but the application class is different.