Problem
Off-policy temporal-difference (TD) learning in discounted Markov decision processes evaluates a target policy from observations generated by a different behavior policy. Existing off-policy TD algorithms suffer from very high variance because they rely on importance-sampling ratios; the eligibility-trace iterates e_t can have unbounded variance and even unbounded magnitude in many natural situations. Prior schemes (Retrace, ABQ, Tree-Backup) constrain lambda_t * rho_{t-1} <= 1 to bound traces, but this limits how aggressively lambda can be set.
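As a toy illustration (mine, not the paper's) of why uncorrected traces blow up: the product of importance-sampling ratios along a trajectory has mean 1 under the behavior policy, but its variance grows exponentially with the horizon, and the trace e_t accumulates exactly such products.

```python
import random

def is_product_variance(p_target, p_behavior, horizon, n=100_000, seed=0):
    """Empirical variance of the product rho_0 * ... * rho_{horizon-1} of
    importance-sampling ratios, for a two-action i.i.d. toy case.
    Illustrative only; not an experiment from the paper."""
    rng = random.Random(seed)
    second_moment = 0.0
    for _ in range(n):
        prod = 1.0
        for _ in range(horizon):
            # sample an action from the behavior policy, weight by the ratio
            a = 0 if rng.random() < p_behavior else 1
            rho = (p_target / p_behavior) if a == 0 else (
                (1.0 - p_target) / (1.0 - p_behavior))
            prod *= rho
        second_moment += prod * prod
    # each ratio has behavior-expectation 1, so E[prod] = 1 exactly
    return second_moment / n - 1.0
```

For p_target = 0.9 and p_behavior = 0.5 the per-step second moment is 1.64, so the variance scales like 1.64^horizon.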
Key idea
Choose the eligibility-trace parameter lambda_t as a function of the current trace e_{t-1} itself (and a Markovian “memory state” y_t summarizing past states), so that e_t can be kept inside a desired bounded set directly — for example, by scaling gamma_t * rho_{t-1} * e_{t-1} back into a Euclidean ball whenever it exits. This allows substantially larger lambda values than prior off-policy schemes while still bounding traces. Crucially, the construction makes the joint state-trace process (S_t, y_t, e_t) a weak Feller Markov chain to which Markov-chain ergodicity tools apply, and the resulting limit of the LSTD equation corresponds to a generalized Bellman equation whose Bellman operator is parameterized by a randomized stopping time on the target-policy chain.
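The ball-scaling example can be sketched in a few lines (my own instantiation; the radius B and all names are illustrative): pick lambda_t as the largest value in [0, 1] that keeps the decayed trace gamma_t * rho_{t-1} * lambda_t * e_{t-1} inside a radius-B Euclidean ball, then add the new feature vector.

```python
import numpy as np

def bounded_trace_update(e_prev, phi_t, gamma_t, rho_prev, B=10.0):
    """One step of the trace iteration e_t = lambda_t*gamma_t*rho_{t-1}*e_{t-1}
    + phi(S_t), with lambda_t chosen to scale the decayed trace back into a
    radius-B ball whenever it exits. Returns (e_t, lambda_t)."""
    a = gamma_t * rho_prev * e_prev          # decayed trace before scaling
    norm = np.linalg.norm(a)
    lam = 1.0 if norm <= B else B / norm     # scale back into the ball
    return lam * a + phi_t, lam
```

This choice satisfies both of the paper's conditions by construction: the map e -> lambda(e) e is a projection onto a convex ball (hence nonexpansive), and the scaled trace norm never exceeds B.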
Method
- Setup. Two finite-state Markov chains over the state space S: the target chain P^pi and the observable behavior chain P^{pi^o}. State-dependent discount gamma(s), importance-sampling ratio rho(s,s') = P^pi_{s,s'} / P^{pi^o}_{s,s'}, and linear value-function approximation v(s) = phi(s)^T theta.
- Bounded-trace TD scheme. Trace iteration e_t = lambda_t * gamma_t * rho_{t-1} * e_{t-1} + phi(S_t); TD error delta_t(v) = rho_t * (R_t + gamma_{t+1} v(S_{t+1}) - v(S_t)). The novelty is letting lambda_t = lambda(y_t, e_{t-1}), where y_t = g(y_{t-1}, S_t) is a finite-state memory.
- Conditions on lambda. Two conditions on the function lambda: (i) a Lipschitz-style continuity condition, ||lambda(y,e) e - lambda(y,e') e'|| <= ||e - e'||, which yields the weak Feller property and makes trajectories started from different initial traces converge to each other; (ii) gamma(s') * rho(s,s') * lambda(y,e) * ||e|| <= C_y, which explicitly bounds the trace iterate.
- Ergodicity (Theorem 2.1). Under Conditions 2.1–2.3 the joint chain (S_t, y_t, e_t) is weak Feller, has a unique invariant probability measure zeta, and its occupation measures converge weakly to zeta, P_x-a.s., for every initial condition.
- LSTD characterization (Corollary 2.1). The empirical LSTD equation (1/t) sum_k e_k delta_k(v) = 0 converges almost surely to the limiting linear equation E_zeta[e_0 delta_0(v)] = 0.
- Generalized Bellman operator (Section 3). Construct a randomized stopping time tau on the target chain {S_t}: at each t >= 1, conditional on the past, stop with probability 1 - lambda_t, where lambda_t is generated by the same rule (2.5). Then (Tv)(s) = E^pi_s[R^tau + gamma_1^tau v(S_tau)] is a generalized Bellman operator whose unique fixed point is v_pi. An equivalent expression in terms of the entire state history uses q_t(S_0^t) and q_t^+(S_0^t), the conditional distributions of tau.
- Equivalence theorem (Theorem 3.1). E_zeta[e_0 delta_0(v)] = 0 is equivalent to Tv - v being orthogonal, in the zeta_S-weighted Euclidean inner product, to the linear span L_phi of the feature components. So LSTD asymptotically solves the projected generalized Bellman equation for the operator T. The proof unfolds the trace iteration backwards in time, exploits stationarity, and applies a change of measure between the behavior and target processes.
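Putting the pieces together, a sketch of the bounded-trace LSTD accumulation (the interface and all names are my illustrative simplifications, not the paper's): since delta_k(v) is linear in theta, the empirical equation (1/t) sum_k e_k delta_k(v) = 0 reduces to solving a linear system A theta = b.

```python
import numpy as np

def off_policy_lstd(steps, d, B=10.0):
    """Solve the empirical LSTD equation for v = Phi theta, using the
    ball-scaling rule lambda_t = min(1, B/||decayed trace||) as one simple
    instance of the paper's scheme. `steps` yields tuples
    (phi_s, phi_next, reward, gamma_next, rho_t), one per transition."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    e = np.zeros(d)
    rho_prev, gamma_t = 1.0, 1.0            # conventions for t = 0 (e_{-1} = 0)
    for phi_s, phi_next, r, gamma_next, rho_t in steps:
        a = gamma_t * rho_prev * e          # decayed trace
        n = np.linalg.norm(a)
        lam = 1.0 if n <= B else B / n      # keep the trace in the ball
        e = lam * a + phi_s
        # delta_t(v) = rho_t (R_t + gamma_{t+1} v(S_{t+1}) - v(S_t)) is linear
        # in theta, so sum_k e_k delta_k(v) = b - A theta with:
        A += np.outer(e, rho_t * (phi_s - gamma_next * phi_next))
        b += e * rho_t * r
        rho_prev, gamma_t = rho_t, gamma_next
    return np.linalg.solve(A, b)
```

On a single-state chain with reward 1, discount 0.5, and rho = 1, the solution is v = 2 regardless of the trace values, which makes a convenient sanity check.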
Results
- Theorem 2.1 (ergodicity). (S_t, y_t, e_t) is weak Feller with a unique invariant measure under nonrestrictive conditions; occupation measures converge weakly to it almost surely.
- Corollary 2.1 (LSTD limit). The LSTD equations converge almost surely to E_zeta[e_0 delta_0(v)] = 0.
- Theorem 3.1 (Bellman characterization). The LSTD limit equation is equivalent to the projected generalized Bellman equation Tv - v perp_{zeta_S} L_phi, where T is the operator induced by the randomized-stopping-time interpretation.
- Compared to prior bounded-trace schemes (Retrace, ABQ, Tree-Backup, Precup-Sutton-Singh): the new scheme is more direct (it bounds e_t straight in trace space rather than through lambda * rho constraints), more flexible (any Lipschitz lambda function over a Markovian memory works), and admits much larger lambda values.
- No numerical experiments in this conference version; they are deferred to a longer version.
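The randomized-stopping-time reading of T can be sanity-checked numerically. For a constant lambda the construction reduces to the familiar lambda-return operator, and plugging v = v_pi into T must return v_pi (fixed point). A Monte Carlo sketch under those simplifying assumptions (constant lambda and gamma, my own function names, not the paper's state/trace-dependent rule):

```python
import random

def bellman_T(v, reward, gamma, lam, s0, step, n_samples=1000, seed=0):
    """Monte Carlo estimate of (Tv)(s0) = E[R^tau + gamma^tau v(S_tau)],
    where tau is a randomized stopping time: after each transition (t >= 1)
    the trajectory stops with probability 1 - lam."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        s, disc, acc = s0, 1.0, 0.0
        while True:
            acc += disc * reward(s)           # accumulate discounted reward
            s = step(s, rng)                  # one target-chain transition
            disc *= gamma
            if rng.random() < 1.0 - lam:      # stop with probability 1 - lambda
                acc += disc * v(s)            # bootstrap at the stopping state
                break
        total += acc
    return total / n_samples
```

On a single absorbing-style state with reward 1 and gamma = 0.5, v_pi = 2, and every sampled return equals 2 exactly, whatever tau turns out to be.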
Limitations
- Theory only. No empirical evaluation; numerical examples are deferred to a longer (journal) version of the paper.
- Finite state space S. All results are derived for finite S; extension to continuous state spaces is not addressed.
- LSTD only. Convergence is characterized only for the LSTD algorithm; gradient-based variants (e.g. GTD2, TDC) are mentioned as future work.
- Function class limited to L_phi. The approximation error of the projected fixed point versus v_pi is not bounded in this paper; only existence/uniqueness of the projected solution is discussed.
- The behavior policy must induce an irreducible chain and must cover the support of the target policy (Condition 2.1(ii)).
Open questions
- Convergence and finite-sample analysis for gradient-based TD algorithms (GTD2, TDC, proximal-TD) under the same bounded-trace scheme.
- Error bounds for v - v_pi when the projected generalized Bellman equation has a unique solution (referenced via [22, 32] but not derived here).
- Extension to control (state-action value functions and policy improvement) rather than pure policy evaluation.
- How to design the memory function g and the lambda function in practice for variance-reduction effectiveness.
- Continuous state and action spaces.
My take
The paper sits at the theoretical end of the off-policy TD literature: its contribution is a clean Markov-chain-ergodicity proof for a more flexible trace-bounding scheme, plus a clean stopping-time interpretation of the implicit Bellman equation it solves. For this CRE asset-pricing project the relevance is tangential but mathematically suggestive: the paper’s central object — a generalized Bellman operator parameterized by a randomized stopping time, whose fixed point is the value of an infinite-horizon discounted reward — is structurally analogous to the project’s exponential-quadratic asset-pricing recursions, where pricing factors satisfy a fixed-point equation under a risk-neutral measure with state-dependent discounting (gamma -> exp(-r)). The randomized-stopping-time decomposition and the geometric-tail accumulator in (3.9) (sum_t lambda_1^t gamma_1^t r(S_t)) parallel the geometric-tail operator eta_H * r/(1-r) discussed in Exp 09 / Exp 12 of the project. The paper does not solve a problem the project has, but it is a clean reference point for “fixed-point of an operator parameterized by a state-dependent discounting Markov chain.”
Related
- generalized-bellman-equation
- bounded-eligibility-traces-enable-off-policy
- richard-sutton
- Adjacent off-policy TD literature referenced by the paper: Retrace (Munos et al. 2016), ABQ / Mahmood-Yu-Sutton 2017, Precup-Sutton-Singh 2000 (Tree-Backup), GTD2/TDC (Maei 2011), emphatic TD (Sutton-Mahmood-White 2016), LSTD (Boyan 1999), Yu (2012, 2015, 2016).
- Mathematically adjacent but not in this wiki yet: Meyn-Tweedie 2009 Markov Chains and Stochastic Stability (used for the weak-Feller ergodicity machinery).