Problem

Reinforcement learning tasks are typically specified as Markov decision processes (MDPs), but standard formalisms tightly couple environment dynamics with the learning objective. This coupling complicates the task specification, obfuscates connections between settings such as episodic versus continuing problems, and forces separate convergence proofs and algorithmic variants. Successive generalizations — options, state-based discounts, interest functions, general value functions — have multiplied terminology without yielding a unified treatment.

Key idea

Decouple the environment dynamics from the task by introducing the RL task tuple (P, r, γ, i) defined on top of an MDP (S, A, Pr), where γ : S × A × S → [0, 1] is a transition-based discount function rather than a constant or a function of the state alone. With this single change:

  • Episodic and continuing tasks differ only in γ — no absorbing or hypothetical states are needed.
  • The options framework (π, β, I) collapses to an RL task with γ(s, a, s') = 1 − β(s') and i(s) = 1[s ∈ I].
  • General value functions, hard termination, and soft termination (γ set to small ε > 0 rather than 0) all express as choices of γ.
  • Bellman contraction bounds extend uniformly to both episodic and continuing tasks via the matrix P_{π,γ}(s, s') = Σ_a π(s, a) Pr(s, a, s') γ(s, a, s'), with previous constant-γ and state-γ results recovered as special cases.
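As a concrete sketch of the last point, P_{π,γ} can be assembled directly from the tuple; all arrays, sizes, and the choice of terminal state below are illustrative placeholders, not from the paper's code:

```python
import numpy as np

# Toy MDP with n states and m actions (illustrative sizes).
n, m = 3, 2
rng = np.random.default_rng(0)

# Pr[s, a, s'] : transition probabilities, rows normalized over s'.
Pr = rng.random((n, m, n))
Pr /= Pr.sum(axis=2, keepdims=True)

# pi[s, a] : policy probabilities.
pi = rng.random((n, m))
pi /= pi.sum(axis=1, keepdims=True)

# gamma[s, a, s'] : transition-based discount. An episodic task just sets
# gamma = 0 on transitions into the terminal state (here: state 0);
# a continuing task would use a constant everywhere.
gamma = np.full((n, m, n), 0.99)
gamma[:, :, 0] = 0.0

# P_{pi,gamma}(s, s') = sum_a pi(s, a) Pr(s, a, s') gamma(s, a, s')
P_pi_gamma = np.einsum('sa,sap,sap->sp', pi, Pr, gamma)

# Constant-discount special case: gamma == gamma_c recovers gamma_c * P_pi.
P_pi = np.einsum('sa,sap->sp', pi, Pr)
assert np.allclose(
    np.einsum('sa,sap,sap->sp', pi, Pr, np.full((n, m, n), 0.99)),
    0.99 * P_pi)
```

Because every row loses discounted mass on transitions into the terminal state, the spectral radius of P_pi_gamma falls strictly below 1, with no absorbing or hypothetical state added anywhere.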

Method

Generalized Bellman operator. For trace parameter λ_c, define P^λ_π = (1 − λ_c)(I − λ_c P_{π,γ})^{-1} P_{π,γ} and r^λ_π = (I − λ_c P_{π,γ})^{-1} r_π, giving the generalized operator T^(λ) v = r^λ_π + P^λ_π v. TD updates carry over by replacing γ_c (or γ_s(s_{t+1})) with γ_{t+1} = γ(s_t, a_t, s_{t+1}) in the TD error and eligibility trace.
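A minimal numerical sketch of the operator, assuming a small hand-written P_{π,γ} and r_π (illustrative values, not the paper's domain). The fixed point of T^(λ) is the same v_π for every λ_c, which is the point of the construction:

```python
import numpy as np

# Illustrative P_{pi,gamma} (discounting already folded into the entries)
# and expected-reward vector r_pi.
P_pi_gamma = np.array([[0.0, 0.5, 0.4],
                       [0.3, 0.0, 0.6],
                       [0.2, 0.7, 0.0]])
r_pi = np.array([1.0, -1.0, 0.5])
lam = 0.9
n = len(r_pi)

I = np.eye(n)
inv = np.linalg.inv(I - lam * P_pi_gamma)
P_lam = (1 - lam) * inv @ P_pi_gamma   # P^lambda_pi
r_lam = inv @ r_pi                     # r^lambda_pi

def T_lam(v):
    """Generalized Bellman operator: T^(lambda) v = r^lambda + P^lambda v."""
    return r_lam + P_lam @ v

# Fixed point of T^(lambda) coincides with v_pi = (I - P_{pi,gamma})^{-1} r_pi,
# independent of lambda_c.
v_fix = np.linalg.solve(I - P_lam, r_lam)
v_pi = np.linalg.solve(I - P_pi_gamma, r_pi)
assert np.allclose(v_fix, v_pi)
```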

Emphasis weighting. With interest function i, the diagonal weighting becomes D = M = diag(m), where m^⊤ = (d_μ ∘ i)^⊤ (I − P^λ_π)^{-1}; this recovers Emphatic TD as a special case and ensures the key matrix A of the TD linear system is positive definite even off-policy.
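A sketch of computing the emphasis weights, written as the equivalent linear solve m = (I − (P^λ_π)^⊤)^{-1}(d_μ ∘ i), i.e. m^⊤ = (d_μ ∘ i)^⊤ (I − P^λ_π)^{-1}; the transpose reflects that emphasis accumulates forward along transitions. All matrices and the interest pattern below are illustrative placeholders:

```python
import numpy as np

# Illustrative P^lambda_pi, behavior stationary distribution d_mu,
# and interest function i (care only about states 0 and 2).
P_lam = np.array([[0.0, 0.45, 0.36],
                  [0.27, 0.0, 0.54],
                  [0.18, 0.63, 0.0]])
d_mu = np.array([0.3, 0.3, 0.4])
i = np.array([1.0, 0.0, 1.0])

n = len(d_mu)
# Follow-on weights: m solves (I - P_lam^T) m = d_mu * i.
m = np.linalg.solve(np.eye(n) - P_lam.T, d_mu * i)
M = np.diag(m)

# Since (I - P_lam^T)^{-1} = sum_k (P_lam^T)^k has nonnegative entries,
# each state's emphasis is at least its own interest-weighted visitation.
assert np.all(m >= d_mu * i - 1e-12)
```

Setting i ≡ 1 recovers the standard Emphatic TD weighting as a special case.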

Equivalence to state-based discounting. Transition-based and state-based discounting have equal representational power, but the equivalent state-based MDP requires |A| n² + n states (one hypothetical state per (s, a, s') triple) versus the original n. The paper proves stationary-distribution equivalence (Theorem 2 / Appendix B) and argues that the transition-based form is strictly more compact, more modular across discount choices, and avoids feature aliasing under linear function approximation.

Contraction bounds. Lemma 1 proves s_D := ‖P^λ_π‖_D < 1 for both D = D_π (on-policy) and D = M (emphasis), under assumptions A1–A4 (finite spaces, unique invariant distributions, some transition with γ < 1, and an interest-consistency condition for the function class). Lemma 2 lifts contraction to T^(λ). Theorem 1 then bounds ‖v − v*‖_D ≤ (1 − s_D)^{-1} ‖Π_D v* − v*‖_D, which reduces to the Tsitsiklis–Van Roy 1997 bound (1 − γ_c λ_c)/(1 − γ_c) in the constant-discount special case.
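The weighted norm s_D can be computed numerically via the similarity transform ‖P‖_D = ‖D^{1/2} P D^{-1/2}‖_2. The matrices below are illustrative (a symmetric doubly stochastic matrix scaled by a constant discount, with uniform weighting so the exact answer is easy to verify), not the paper's taxi setup; in the paper d would be d_π or the emphasis weights m:

```python
import numpy as np

# Illustrative P^lambda: symmetric doubly stochastic matrix scaled by 0.81.
P_lam = 0.81 * np.array([[0.0, 0.5, 0.5],
                         [0.5, 0.0, 0.5],
                         [0.5, 0.5, 0.0]])
d = np.full(3, 1.0 / 3.0)   # uniform weighting (stand-in for d_pi or m)

# s_D = ||P^lambda||_D = ||D^{1/2} P^lambda D^{-1/2}||_2.
D_half = np.diag(np.sqrt(d))
D_half_inv = np.diag(1.0 / np.sqrt(d))
s_D = np.linalg.norm(D_half @ P_lam @ D_half_inv, ord=2)

# Theorem 1-style bound constant, meaningful only when s_D < 1.
assert s_D < 1.0
bound_const = 1.0 / (1.0 - s_D)
```

Here uniform D makes the similarity transform a no-op, so s_D is exactly the top eigenvalue magnitude 0.81 and the bound constant is 1/0.19 ≈ 5.26.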

Results

Taxi-domain demonstration (Section 3). In a modified taxi domain (4 platforms, orientation costs), the paper compares optimal policies under four discount choices for the pickup transition:

  • Soft termination γ(source, Pickup, P-in-Car) = 0.1, γ = 0.99 elsewhere → optimal policy, total reward −7.7, accounts for orientation after pickup.
  • Hard termination γ = 0 at the same transition → suboptimal: agent greedily minimizes turns to passenger, ignores post-pickup orientation, total reward −8.0.
  • State-based termination γ_s(Car-in-source ∧ P-in-Car) = 0 → degenerate: agent oscillates around source after pickup and never drops off (the state-based form cannot represent the desired return without adding hypothetical states).
  • Constant γ_c sweeps → numerical imprecision causes most settings to fail to pick up at all.

The Figure 2(e) summary table reports successful pickup/dropoff counts and turn cost over 100 steps × 5000 runs, showing the hard- and constant-discount baselines fail outright on some settings.

Contraction bound diagnostics (Table 1). For a random policy in the taxi domain, s_D is reported across λ_c ∈ {0, 0.5, 0.9, 0.99, 0.999} for episodic discount, constant γ_c = 0.99, and four random-termination schedules. Three findings: (1) hard and soft termination at the pickup yield identical s_D; (2) episodic discount has slightly smaller s_D than constant γ_c = 0.99; (3) increasing λ_c reduces s_D more strongly than adding extra terminations does.

Algorithm-level extensions (Section 5.2). Convergence in expectation of Emphatic TD / ELSTD extends to RL tasks via positive-definiteness of A under emphasis weighting. The Tagorti–Scherrer 2015 LSTD(λ) convergence-rate result extends from continuing tasks to episodic ones by replacing (1 − λ_c)γ_c P_π (I − λ_c γ_c P_π)^{-1} with the general P^λ_π, and the rate constant becomes 1/(1 − s_D) rather than (1 − λ_c γ_c)/(1 − γ_c).
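A quick numerical sanity check (illustrative matrices, not from the paper) that in the constant-discount special case the general rate constant 1/(1 − s_D) collapses back to the classical (1 − λ_c γ_c)/(1 − γ_c):

```python
import numpy as np

gamma_c, lam = 0.9, 0.5
# Symmetric doubly stochastic P_pi, so ||.||_2 equals the D-weighted norm
# under uniform weighting and the spectral calculation is exact.
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
P_g = gamma_c * P                  # constant-discount P_{pi,gamma}
n = len(P)

# General form: P^lambda = (1 - lam) (I - lam P_g)^{-1} P_g.
P_lam = (1 - lam) * np.linalg.inv(np.eye(n) - lam * P_g) @ P_g
s_D = np.linalg.norm(P_lam, ord=2)

general_const = 1.0 / (1.0 - s_D)
classical_const = (1 - lam * gamma_c) / (1 - gamma_c)
assert np.isclose(general_const, classical_const)
```

Algebraically, s_D = γ_c(1 − λ_c)/(1 − λ_c γ_c) here, and substituting gives 1/(1 − s_D) = (1 − λ_c γ_c)/(1 − γ_c), matching the Tagorti–Scherrer constant.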

Limitations

  • The bound parameter s_D is harder to interpret than (1 − γ_c λ_c)/(1 − γ_c) because γ varies across transitions; the paper offers only empirical intuition (Table 1) rather than closed-form interpretation.
  • Contraction with the on-policy weighting D_π requires Assumption A3 (some transition has γ < 1 along a positive-probability path); the paper notes that the Bellman operator is not a contraction under D_π for some state-based-discounting examples, and uses this to motivate the more flexible emphasis weighting.
  • The theoretical results assume finite state and action spaces (A1) and unique invariant distributions (A2); extensions to continuous spaces are not addressed.
  • The probabilistic generalization Pr(r, γ | s, a, s') is sketched in Appendix A, but the main body keeps γ deterministic.

Open questions

  • Can s_D be bounded in closed form for natural classes of transition-based discount schedules (e.g., subgoal hierarchies, soft-termination chains), to recover the interpretability of the constant-γ case?
  • How does the choice of soft-termination value ε trade off between bias of the truncated return and ease of credit assignment in deep RL settings?
  • Can the contraction analysis be lifted to nonlinear function classes (e.g., neural value approximators) that violate the linear A4 condition?

My take

The technical novelty is concentrated and clean: a single change to the type signature of γ (state-action-state instead of state, or constant) absorbs options, episodic-vs-continuing termination, and interest functions into one formalism, and the contraction proof carries through with only a switch from scalar arithmetic to matrix-norm arithmetic on P_{π,γ}. The taxi soft-termination example is a nice demonstration that the generalization is not just bookkeeping — it changes which policies are optimal under approximation.

For this project, the relevance is tangential but worth noting: transition-based discounting γ(s, a, s') has a structural parallel to regime-dependent discounting in regime-switching pricing kernels, where the effective discount factor at time t depends on the regime transition (z_t, z_{t+1}) rather than just the level z_t. The Bellman-style contraction analysis on P_{π,γ} is the same algebraic shape as the geometric-tail analysis on regime transition matrices in Markov-switching asset pricing. There is no direct method-borrow here, but the paper is a useful reference point if a future iteration of the regime-switching pricing pipeline ever needs to argue contraction of a transition-conditioned Bellman/Riccati operator from first principles.