Problem
Reinforcement learning tasks are typically specified as Markov decision processes (MDPs), but standard formalisms tightly couple environment dynamics with the learning objective. This coupling complicates the task specification, obfuscates connections between settings such as episodic versus continuing problems, and forces separate convergence proofs and algorithmic variants. Successive generalizations — options, state-based discounts, interest functions, general value functions — have multiplied terminology without yielding a unified treatment.
Key idea
Decouple the environment dynamics from the task by introducing the RL task tuple (P, r, γ, i) defined on top of an MDP (S, A, Pr), where γ : S × A × S → [0, 1] is a transition-based discount function rather than a state-based scalar. With this single change:
- Episodic and continuing tasks differ only in γ — no absorbing or hypothetical states are needed.
- The options framework (π, β, I) collapses to an RL task with γ(s, a, s') = 1 − β(s') and i(s) = 1[s ∈ I].
- General value functions, hard termination, and soft termination (γ set to a small ε > 0 rather than 0) all express as choices of γ.
- Bellman contraction bounds extend uniformly to both episodic and continuing tasks via the matrix P_{π,γ}(s, s') = Σ_a π(s, a) Pr(s, a, s') γ(s, a, s'), with previous constant-γ and state-γ results recovered as special cases.
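As a concrete sketch of the last bullet (not code from the paper; the array shapes and names are my own), P_{π,γ} can be assembled directly from tabular dynamics, a policy, and a transition-based discount:

```python
import numpy as np

def discounted_transition_matrix(P, pi, gamma):
    """P_{pi,gamma}(s, s') = sum_a pi(s, a) * Pr(s, a, s') * gamma(s, a, s').

    P:     (n, m, n) array, P[s, a, s2] = Pr(s2 | s, a)
    pi:    (n, m) array, pi[s, a] = probability of taking action a in state s
    gamma: (n, m, n) array, transition-based discount in [0, 1]
    """
    # Weight every (s, a, s') transition by its policy probability and its
    # own discount, then sum out the action dimension.
    return np.einsum('sa,sap,sap->sp', pi, P, gamma)
```

Setting gamma to 0 only on the transitions that end an episode (e.g., into the restart state) then reproduces episodic behavior with no absorbing states, matching the first bullet above.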
Method
Generalized Bellman operator. For trace parameter λ_c, define P^λ_π = (1 − λ_c) (I − λ_c P_{π,γ})^{-1} P_{π,γ} and r^λ_π = (I − λ_c P_{π,γ})^{-1} r_π, giving the generalized operator T^(λ) v = r^λ_π + P^λ_π v. TD updates extend by replacing γ_c (or γ_s(s_{t+1})) with γ_{t+1} = γ(s_t, a_t, s_{t+1}) in both the TD error and the eligibility trace.
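A minimal linear TD(λ) step under this convention might look as follows (a sketch, assuming accumulating traces and linear features; all names are mine, not the paper's):

```python
import numpy as np

def td_lambda_step(w, x_t, x_tp1, r_tp1, gamma_tp1, z, lam, alpha):
    """One linear TD(lambda) update with a transition-based discount.

    gamma_tp1 = gamma(s_t, a_t, s_{t+1}) replaces the constant gamma_c in
    both the TD error and the trace decay -- the only change the
    generalization requires.
    """
    delta = r_tp1 + gamma_tp1 * (w @ x_tp1) - w @ x_t  # TD error
    z = gamma_tp1 * lam * z + x_t                      # accumulating trace
    w = w + alpha * delta * z
    return w, z
```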
Emphasis weighting. With interest function i, the diagonal weighting becomes D = M = diag(m), where m⊤ = (d_μ ∘ i)⊤ (I − P^λ_π)^{-1}, recovering Emphatic TD as a special case and ensuring the key matrix A of the linear system is positive definite even off-policy.
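Under the transpose convention m⊤ = (d_μ ∘ i)⊤ (I − P^λ_π)^{-1}, the emphasis vector is one linear solve (a hypothetical helper, not code from the paper):

```python
import numpy as np

def emphasis_weights(P_lam, d_mu, interest):
    """Emphasis vector m solving (I - P_lam^T) m = d_mu * interest,
    i.e. m^T = (d_mu o i)^T (I - P_lam)^{-1}."""
    n = P_lam.shape[0]
    return np.linalg.solve(np.eye(n) - P_lam.T, d_mu * interest)
```

With interest identically 1, m accumulates discounted visitation flowing into each state, which is the limiting behavior of the follow-on trace in Emphatic TD.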
Equivalence to state-based discounting. Transition-based and state-based discounting have equal representational power, but the equivalent state-based MDP requires |A| n² + n states (one hypothetical state per (s, a, s') triple) versus the original n. The paper proves stationary-distribution equivalence (Theorem 2 / Appendix B) but shows that the transition-based form is strictly more compact, more modular across discount choices, and avoids feature-aliasing in linear function approximation.
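The blow-up is easy to quantify; with illustrative numbers (not taken from the paper):

```python
def state_based_equivalent_size(n, num_actions):
    """States needed by the equivalent state-based MDP: one hypothetical
    state per (s, a, s') triple plus the original n states."""
    return num_actions * n**2 + n

# e.g. a hypothetical 100-state, 6-action domain:
print(state_based_equivalent_size(100, 6))  # 60100 states vs. the original 100
```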
Contraction bounds. Lemma 1 proves s_D := ‖P^λ_π‖_D < 1 for both D = D_π (on-policy) and D = M (emphasis), under assumptions A1–A4 (finite spaces, unique invariant distributions, some transition with γ < 1, and an interest-consistency condition for the function class). Lemma 2 lifts contraction to T^(λ). Theorem 1 then bounds ‖v − v*‖_D ≤ (1 − s_D)^{-1} ‖Π_D v* − v*‖_D, which reduces to the Tsitsiklis–Van Roy 1997 bound (1 − γ_c λ_c)/(1 − γ_c) in the constant-discount special case.
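The quantity s_D can be checked numerically; a sketch assuming ‖·‖_D is the induced norm ‖A‖_D = ‖D^{1/2} A D^{-1/2}‖_2 and a constant trace parameter (function and variable names are mine):

```python
import numpy as np

def s_D(P_pg, lam, d):
    """s_D = ||P^lambda||_D, with
    P^lambda = (1 - lam) (I - lam P_{pi,gamma})^{-1} P_{pi,gamma}."""
    n = P_pg.shape[0]
    P_lam = (1 - lam) * np.linalg.solve(np.eye(n) - lam * P_pg, P_pg)
    sqrt_d = np.sqrt(d)
    # ||A||_D = ||D^{1/2} A D^{-1/2}||_2  (largest singular value)
    return np.linalg.norm((sqrt_d[:, None] * P_lam) / sqrt_d[None, :], ord=2)
```

As a sanity check, for a symmetric two-state chain with constant discount (P_{π,γ} = γ_c P_π) and uniform d, this returns γ_c (1 − λ_c)/(1 − γ_c λ_c), the contraction modulus in the Tsitsiklis–Van Roy constant-discount case.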
Results
Taxi-domain demonstration (Section 3). A modified taxi domain (4 platforms, orientation costs) is used to compare optimal policies under different discount choices for the pickup transition:
- Soft termination: γ(source, Pickup, P-in-Car) = 0.1, γ = 0.99 elsewhere → optimal policy, total reward −7.7, accounts for orientation after pickup.
- Hard termination: γ = 0 at the same transition → suboptimal: agent greedily minimizes turns to the passenger, ignores post-pickup orientation, total reward −8.0.
- State-based termination: γ_s(Car-in-source ∧ P-in-Car) = 0 → degenerate: agent oscillates around the source after pickup and never drops off (the state-based form cannot represent the desired return without adding hypothetical states).
- Constant γ_c sweeps → numerical imprecision causes most settings to fail to pick up at all.
The summary table in Figure 2(e) reports successful pickup/dropoff counts and turn cost over 100 steps × 5000 runs, showing that the hard- and constant-discount baselines fail outright on some settings.
Contraction bound diagnostics (Table 1). For a random policy in the taxi domain, s_D is reported across λ_c ∈ {0, 0.5, 0.9, 0.99, 0.999} for episodic discount, constant γ_c = 0.99, and four random-termination schedules. Three findings: (1) hard and soft termination at the pickup yield identical s_D; (2) episodic discount has slightly smaller s_D than constant γ_c = 0.99; (3) increasing λ_c reduces s_D more strongly than adding extra terminations does.
Algorithm-level extensions (Section 5.2). Convergence in expectation of Emphatic TD / ELSTD extends to RL tasks via positive-definiteness of A under emphasis weighting. The Tagorti–Scherrer 2015 LSTD(λ) convergence-rate result extends from continuing tasks to episodic ones by replacing (1 − λ_c)γ_c P_π (I − λ_c γ_c P_π)^{-1} with the general P^λ_π, and the rate constant becomes 1/(1 − s_D) rather than (1 − λ_c γ_c)/(1 − γ_c).
Limitations
- The bound parameter s_D is harder to interpret than (1 − γ_c λ_c)/(1 − γ_c) because γ varies across transitions; the paper offers only empirical intuition (Table 1) rather than a closed-form interpretation.
- Contraction with the on-policy weighting D_π requires Assumption A3 (some transition has γ < 1 along a positive-probability path); the paper notes that the Bellman operator is not a contraction under D_π for some state-based-discounting examples, and uses this to motivate the more flexible emphasis weighting.
- The theoretical results assume finite state and action spaces (A1) and unique invariant distributions (A2); extensions to continuous spaces are not addressed.
- The probabilistic generalization Pr(r, γ | s, a, s') is sketched in Appendix A, but the main body keeps γ deterministic.
Open questions
- Can s_D be bounded in closed form for natural classes of transition-based discount schedules (e.g., subgoal hierarchies, soft-termination chains), to recover the interpretability of the constant-γ case?
- How does the choice of soft-termination value ε trade off between bias of the truncated return and ease of credit assignment in deep RL settings?
- Can the contraction analysis be lifted to nonlinear function classes (e.g., neural value approximators) that violate the linear A4 condition?
My take
The technical novelty is concentrated and clean: a single change to the type signature of γ (state-action-state instead of state, or constant) absorbs options, episodic-vs-continuing termination, and interest functions into one formalism, and the contraction proof carries through with only a switch from scalar arithmetic to matrix-norm arithmetic on P_{π,γ}. The taxi soft-termination example is a nice demonstration that the generalization is not just bookkeeping — it changes which policies are optimal under approximation.
For this project, the relevance is tangential but worth noting: transition-based discounting γ(s, a, s') has a structural parallel to regime-dependent discounting in regime-switching pricing kernels, where the effective discount factor at time t depends on the regime transition (z_t, z_{t+1}) rather than just the level z_t. The Bellman-style contraction analysis on P_{π,γ} is the same algebraic shape as the geometric-tail analysis on regime transition matrices in Markov-switching asset pricing. There is no direct method-borrow here, but the paper is a useful reference point if a future iteration of the regime-switching pricing pipeline ever needs to argue contraction of a transition-conditioned Bellman/Riccati operator from first principles.
Related
- transition-based-discounting — concept page extending the paper’s core construct
- bellman-operator — foundational background (textbook RL operator)
- episodic-vs-continuing-tasks — foundational background (classical RL task taxonomy)
- options-framework — foundational background (Sutton, Precup, Singh 1999)
- martha-white — author
- supports: transition-discount-unifies-episodic-continuing-rl