Problem

Off-policy temporal-difference (TD) learning in discounted Markov decision processes evaluates a target policy from observations generated by a different behavior policy. Existing off-policy TD algorithms suffer from high variance because products of importance-sampling ratios enter the eligibility trace: the trace iterates e_t can have unbounded variance, and even unbounded magnitude, in many natural situations. Prior schemes (Retrace, ABQ, Tree-Backup) constrain lambda_t * rho_{t-1} <= 1 to bound the traces, but this limits how aggressively lambda can be set.
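As a toy illustration of the variance problem (a hypothetical two-action example with made-up numbers, not from the paper): with lambda = 1, the trace iterate has per-step multiplier gamma * rho whose second moment exceeds 1, so the trace's variance diverges even though its mean stays bounded, and sample paths show large spikes.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99
# Hypothetical example: behavior picks an action w.p. 0.5 that the
# target picks w.p. 0.9, so the IS ratio rho is either 1.8 or 0.2.
rhos = np.where(rng.random(10_000) < 0.5, 0.9 / 0.5, 0.1 / 0.5)

# Second moment of the per-step trace multiplier gamma * rho:
m2 = (gamma * rhos) ** 2
# m2.mean() > 1 means Var(e_t) grows without bound, even though
# E[gamma * rho] = 0.99 < 1 keeps the mean trace bounded.

e, trace_max = 0.0, 0.0
for rho in rhos:
    e = gamma * rho * e + 1.0   # lambda = 1 trace update with phi(S_t) = 1
    trace_max = max(trace_max, abs(e))
print(m2.mean(), trace_max)
```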

Key idea

Choose the eligibility-trace parameter lambda_t as a function of the previous trace e_{t-1} itself (and a finite-state Markovian “memory state” y_t summarizing past states), so that e_t is kept inside a desired bounded set directly, for example by scaling gamma_t * rho_{t-1} * e_{t-1} back into a Euclidean ball whenever it would exit. This allows substantially larger lambda values than prior off-policy schemes while still bounding the traces. Crucially, the construction makes the joint state-trace process (S_t, y_t, e_t) a weak Feller Markov chain to which Markov-chain ergodicity tools apply, and the resulting limit of the LSTD equation corresponds to a generalized Bellman equation whose Bellman operator is parameterized by a randomized stopping time on the target-policy chain.
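A minimal sketch of the ball-scaling example just mentioned, under assumed shapes (function names and the radius C are illustrative, not the paper's notation): lambda is the largest value in [0, 1] that keeps the decayed trace inside the ball, which is exactly a Euclidean projection and hence nonexpansive.

```python
import numpy as np

def lam_ball(e_prev, gamma_t, rho_prev, C=10.0):
    """Largest lambda in [0,1] keeping lambda*gamma*rho*e_prev in the C-ball.

    Scaling back into the ball is a Euclidean projection, so the map
    e -> lambda(e) * gamma * rho * e is nonexpansive, and
    gamma * rho * lambda * ||e|| <= C holds by construction.
    """
    n = gamma_t * rho_prev * np.linalg.norm(e_prev)
    return 1.0 if n <= C else C / n

def trace_update(e_prev, phi_t, gamma_t, rho_prev, C=10.0):
    lam = lam_ball(e_prev, gamma_t, rho_prev, C)
    return lam * gamma_t * rho_prev * e_prev + phi_t
```

With this rule, ||e_t|| <= C + max_s ||phi(s)|| deterministically, no matter how large the ratios rho get.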

Method

  1. Setup. Two finite-state Markov chains over state space S: target P^pi and observable behavior P^{pi^o}. State-dependent discount gamma(s), importance-sampling ratio rho(s,s') = P^pi_{s,s'} / P^{pi^o}_{s,s'}. Linear value-function approximation v(s) = phi(s)^T theta.
  2. Bounded-trace TD scheme. Trace iteration e_t = lambda_t * gamma_t * rho_{t-1} * e_{t-1} + phi(S_t); TD error delta_t(v) = rho_t * (R_t + gamma_{t+1} v(S_{t+1}) - v(S_t)). The novelty is letting lambda_t = lambda(y_t, e_{t-1}) where y_t = g(y_{t-1}, S_t) is a finite-state memory.
  3. Conditions on lambda. Two conditions on the function lambda: (i) nonexpansiveness of the map e -> lambda(y,e) e, i.e. ||lambda(y,e) e - lambda(y,e') e'|| <= ||e - e'||, which yields the weak Feller property and makes iterates started from different initial traces converge to each other; (ii) gamma(s') * rho(s,s') * lambda(y,e) * ||e|| <= C_y, which makes boundedness of the trace iterates explicit.
  4. Ergodicity (Theorem 2.1). Under Conditions 2.1–2.3 the joint chain (S_t, y_t, e_t) is weak Feller, has a unique invariant probability measure zeta, and occupation measures converge weakly to zeta P_x-a.s. for every initial condition.
  5. LSTD characterization (Corollary 2.1). The empirical LSTD equation (1/t) sum_k e_k delta_k(v) = 0 converges almost surely to the limiting linear equation E_zeta[ e_0 delta_0(v) ] = 0.
  6. Generalized Bellman operator (Section 3). Construct a randomized stopping time tau on the target chain {S_t}: at each t >= 1, conditional on the past, stop with probability 1 - lambda_t, where lambda_t is generated by the same rule (2.5). The operator (Tv)(s) = E^pi_s[ R^tau + gamma_1^tau v(S_tau) ] is then a generalized Bellman operator whose unique fixed point is v_pi. It has an equivalent expression in terms of the entire state history via q_t(S_0^t) and q_t^+(S_0^t), the conditional distributions of tau.
  7. Equivalence theorem (Theorem 3.1). E_zeta[ e_0 delta_0(v) ] = 0 is equivalent to Tv - v being orthogonal (in the zeta_S-weighted Euclidean inner product) to the linear span L_phi of feature components. So LSTD asymptotically solves the projected generalized Bellman equation for the operator T. Proof unfolds the trace iteration backwards in time, exploits stationarity, and applies a change-of-measure between behavior and target processes.
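The steps above can be sketched end to end as a toy LSTD simulation (all problem data here are made up; lam_ball is one instance of the rule in step 2, with a trivial memory y): run the bounded-trace iteration along a behavior trajectory, accumulate the LSTD equation (1/t) sum_k e_k delta_k(v) = b - A theta, and solve it.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, d, gamma, C = 3, 2, 0.9, 10.0

# Hypothetical small problem: target/behavior transition matrices, state
# rewards, and a linear feature map phi (none of this is from the paper).
P_tgt = rng.dirichlet(np.ones(nS), size=nS)
P_beh = 0.5 * P_tgt + 0.5 * rng.dirichlet(np.ones(nS), size=nS)
R = rng.standard_normal(nS)
Phi = rng.standard_normal((nS, d))

def lam_ball(e, rho_prev):
    # Scale the decayed trace back into the C-ball (step 2's rule).
    n = gamma * rho_prev * np.linalg.norm(e)
    return 1.0 if n <= C else C / n

A = np.zeros((d, d))
b = np.zeros(d)
s, e, rho_prev, e_max = 0, np.zeros(d), 1.0, 0.0
for _ in range(50_000):
    e = lam_ball(e, rho_prev) * gamma * rho_prev * e + Phi[s]
    e_max = max(e_max, np.linalg.norm(e))
    s2 = rng.choice(nS, p=P_beh[s])
    rho = P_tgt[s, s2] / P_beh[s, s2]
    # delta_t(v) = rho * (R_t + gamma * phi(s')^T theta - phi(s)^T theta),
    # so sum_t e_t delta_t(v) = b - A theta with:
    A += np.outer(e, rho * (Phi[s] - gamma * Phi[s2]))
    b += e * rho * R[s]
    rho_prev, s = rho, s2

theta = np.linalg.solve(A, b)  # approximate root of E_zeta[e_0 delta_0(v)] = 0
```

By Theorem 3.1, theta approximates the solution of the projected generalized Bellman equation for the operator T induced by this lambda rule; the trace norm stays below C + max_s ||phi(s)|| throughout.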

Results

  • Theorem 2.1 (ergodicity). (S_t, y_t, e_t) is weak Feller with a unique invariant measure under nonrestrictive conditions; occupation measures converge weakly to it almost surely.
  • Corollary 2.1 (LSTD limit). LSTD equations converge to E_zeta[ e_0 delta_0(v) ] = 0 almost surely.
  • Theorem 3.1 (Bellman characterization). The LSTD limit equation is equivalent to the projected generalized Bellman equation Tv - v perp_{zeta_S} L_phi, where T is the operator induced by the randomized-stopping-time interpretation.
  • Compared to prior bounded-trace schemes (Retrace, ABQ, Tree-Backup of Precup-Sutton-Singh): the new scheme is more direct (it bounds e_t directly in trace space rather than through lambda * rho constraints), more flexible (any nonexpansive lambda function over a Markovian memory works), and admits much larger lambda values.
  • No numerical experiments in this conference version (deferred to a longer version).
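To make the "much larger lambda" claim concrete, a toy comparison (hypothetical ratio distribution; the prior-scheme rule is simplified to the pointwise truncation lambda_t = min(1, 1/rho_{t-1}) that the lambda * rho <= 1 constraint implies):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, C = 0.9, 10.0
rhos = np.where(rng.random(20_000) < 0.5, 1.8, 0.2)  # made-up IS ratios

# Prior-scheme style: enforce lambda_t * rho_{t-1} <= 1 pointwise.
lam_trunc = np.minimum(1.0, 1.0 / rhos)

# Ball rule: lambda_t = 1 unless the decayed trace would leave the C-ball,
# so lambda reacts to the accumulated trace, not to each ratio alone.
e, lam_vals = 0.0, []
for rho in rhos:
    n = gamma * rho * abs(e)
    lam = 1.0 if n <= C else C / n
    lam_vals.append(lam)
    e = lam * gamma * rho * e + 1.0
print(lam_trunc.mean(), np.mean(lam_vals))
```

In this toy run the ball rule leaves lambda at 1 most of the time and only shrinks it after several large ratios in a row, whereas the pointwise truncation cuts lambda on every large ratio.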

Limitations

  • Theory only. No empirical evaluation; numerical examples are deferred to a longer (journal) version of the paper.
  • Finite state space S. All results are derived for finite S; extension to continuous state spaces is not addressed.
  • LSTD only. Convergence is characterized only for the LSTD algorithm; gradient-based variants (e.g. GTD2, TDC) are mentioned as future work.
  • Function-class bounded by L_phi. Approximation error of the projected fixed point versus v_pi is not bounded in this paper — only existence/uniqueness of the projected solution is discussed.
  • Behavior policy must be irreducible and target-policy support must be covered (Condition 2.1(ii)).

Open questions

  • Convergence and finite-sample analysis for gradient-based TD algorithms (GTD2, TDC, proximal-TD) under the same bounded-trace scheme.
  • Error bounds for v - v_pi when the projected generalized Bellman equation has a unique solution (referenced via [22, 32] but not derived here).
  • Extension to control (state-action value functions and policy improvement) rather than pure policy evaluation.
  • How to design the memory function g and the lambda function in practice for variance-reduction effectiveness.
  • Continuous state and action spaces.

My take

The paper sits at the theoretical end of the off-policy TD literature: its contribution is a clean Markov-chain-ergodicity proof for a more flexible trace-bounding scheme, plus a clean stopping-time interpretation of the implicit Bellman equation it solves. For this CRE asset-pricing project the relevance is tangential but mathematically suggestive: the paper’s central object — a generalized Bellman operator parameterized by a randomized stopping time, whose fixed point is the value of an infinite-horizon discounted reward — is structurally analogous to the project’s exponential-quadratic asset-pricing recursions, where pricing factors satisfy a fixed-point equation under a risk-neutral measure with state-dependent discounting (gamma -> exp(-r)). The randomized-stopping-time decomposition and the geometric-tail accumulator in (3.9) (sum_t lambda_1^t gamma_1^t r(S_t)) parallel the geometric-tail operator eta_H * r/(1-r) discussed in Exp 09 / Exp 12 of the project. The paper does not solve a problem the project has, but it is a clean reference point for “fixed-point of an operator parameterized by a state-dependent discounting Markov chain.”

  • generalized-bellman-equation
  • bounded-eligibility-traces-enable-off-policy
  • richard-sutton
  • Adjacent off-policy TD literature referenced by the paper: Retrace (Munos et al. 2016), ABQ / Mahmood-Yu-Sutton 2017, Precup-Sutton-Singh 2000 (Tree-Backup), GTD2/TDC (Maei 2011), emphatic TD (Sutton-Mahmood-White 2016), LSTD (Boyan 1999), Yu (2012, 2015, 2016).
  • Mathematically adjacent but not in this wiki yet: Meyn-Tweedie 2009 Markov Chains and Stochastic Stability (used for the weak-Feller ergodicity machinery).