Statement
Setting the TD eligibility-trace parameter lambda_t as a function of the previous trace iterate e_{t-1} (and a Markovian memory of past states) — rather than as a constant or as a function of the current state alone — directly bounds the trace e_t while permitting substantially larger lambda values than prior off-policy schemes that constrain lambda_t * rho_{t-1} <= 1. Under mild Lipschitz and boundedness conditions on the choice of lambda, the joint state-trace Markov chain (S_t, y_t, e_t) is weak Feller and admits a unique invariant probability measure, so LSTD with this scheme converges almost surely to the solution of a projected generalized Bellman equation whose Bellman operator corresponds to a randomized stopping time on the target-policy chain.
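The trace recursion and the Euclidean-ball-scaling choice of lambda mentioned below can be sketched as follows. This is a minimal illustration, not the paper's pseudocode: the function name, the feature vector phi_t, and the radius C are illustrative, and the paper's Condition 2.3 admits other choices of the lambda function.

```python
import numpy as np

def trace_update(e_prev, phi_t, gamma_t, rho_prev, C=10.0):
    """One step of the bounded-trace update (illustrative sketch).

    lambda_t is chosen as a function of the previous trace e_{t-1}
    (the Euclidean-ball-scaling example): shrink lambda only as much
    as needed so the decayed trace stays inside a ball of radius C.
    """
    # Magnitude the decayed trace would have with lambda_t = 1.
    decay = gamma_t * rho_prev * np.linalg.norm(e_prev)
    # Ball scaling: lambda_t = min(1, C / decay), Lipschitz in e_{t-1}.
    lam_t = 1.0 if decay <= C else C / decay
    # Trace iterate: e_t = lambda_t * gamma_t * rho_{t-1} * e_{t-1} + phi(S_t)
    return lam_t * gamma_t * rho_prev * e_prev + phi_t, lam_t
```

The contrast with schemes constraining lambda_t * rho_{t-1} <= 1 is visible here: a large importance-sampling ratio rho_{t-1} forces those schemes to shrink lambda_t immediately, whereas the ball scaling shrinks lambda_t only when the trace itself has actually grown large.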
Evidence summary
- Theoretical (Yu-Mahmood-Sutton 2017, this paper). Theorem 2.1 establishes the ergodicity of (S_t, y_t, e_t) and the uniqueness of its invariant measure. Corollary 2.1 establishes almost-sure convergence of the LSTD solutions. Theorem 3.1 establishes the equivalence between the LSTD limit and the solution of a projected generalized Bellman equation. Conditions 2.1–2.3 are described as nonrestrictive, and a Euclidean-ball-scaling example shows they are easy to satisfy. The paper reports no numerical experiments; numerical examples and finite-sample comparisons are deferred to a longer journal version.
Conditions and scope
- State space: finite. Continuous S not addressed.
- Policies: behavior policy pi^o must be irreducible; target policy pi must be supported on the same transitions (Condition 2.1).
- Memory: finite-state Markovian summary y_t of past states.
- lambda function: must be Lipschitz in e (Condition 2.3(i)) and produce traces bounded in some norm (Condition 2.3(ii)).
- Algorithm: the formal convergence guarantee is for LSTD only. Gradient-TD variants are mentioned as future work and are not proved here.
- Quantitative claim: the paper does not state by how much lambda can be enlarged in any specific environment; that is left to the deferred numerical version.
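Under these conditions, the LSTD computation built on the bounded trace can be sketched as below. Assumptions to flag: the placement of rho_t (multiplying the reward and next-state terms) follows the common off-policy state-value TD error, the per-step discount gamma_{t+1} and the ball radius C are illustrative, and the trajectory format is invented for the sketch.

```python
import numpy as np

def lstd_solve(trajectory, d, C=10.0):
    """LSTD sketch with the ball-scaled bounded trace.

    trajectory yields tuples (phi_t, rho_t, R_next, gamma_next, phi_next).
    Solves A theta + b = 0, where A and b are running averages of
    e_t (rho_t gamma_{t+1} phi_{t+1} - phi_t)^T and e_t rho_t R_{t+1}.
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    e = np.zeros(d)
    rho_prev, gamma_t = 1.0, 1.0   # no decay before the first step
    for t, (phi_t, rho_t, R_next, gamma_next, phi_next) in enumerate(trajectory):
        decay = gamma_t * rho_prev * np.linalg.norm(e)
        lam_t = 1.0 if decay <= C else C / decay       # ball-scaled lambda_t
        e = lam_t * gamma_t * rho_prev * e + phi_t     # bounded trace iterate
        # Incremental running means of the LSTD statistics.
        A += (np.outer(e, rho_t * gamma_next * phi_next - phi_t) - A) / (t + 1)
        b += (e * rho_t * R_next - b) / (t + 1)
        rho_prev, gamma_t = rho_t, gamma_next
    return np.linalg.solve(-A, b)
```

On a tabular on-policy problem (rho_t = 1 throughout) the trace stays inside the ball, lambda_t remains 1, and the solver reduces to ordinary LSTD(1); the bounded-trace behavior only engages when importance-sampling ratios inflate the trace.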
Counter-evidence
None in the wiki. The “much larger lambda” qualifier is a comparative theoretical statement against Retrace / ABQ / Tree-Backup; the empirical magnitude of the improvement is not measured in this paper. A full assessment would require either the longer journal version or independent empirical replication.
Linked ideas
(none yet)
Open questions
- How much variance reduction does the bounded-trace scheme actually deliver in standard off-policy benchmarks compared to Retrace and ABQ?
- Does the convergence guarantee extend to gradient-TD algorithms (GTD2, TDC, proximal-TD)?
- Does it extend to continuous state spaces, and to control (not just policy evaluation)?
- How should the memory function g and the lambda function be designed in practice for a given problem class?