Statement
Setting the TD eligibility-trace parameter lambda_t as a function of the previous trace iterate e_{t-1} (and a Markovian memory of past states) — rather than as a constant or as a function of the current state alone — directly bounds the trace e_t while permitting substantially larger lambda values than prior off-policy schemes that constrain lambda_t * rho_{t-1} <= 1. Under mild Lipschitz and boundedness conditions on the choice of lambda, the joint state-trace Markov chain (S_t, y_t, e_t) is weak Feller and admits a unique invariant probability measure, so LSTD with this scheme converges almost surely to the solution of a projected generalized Bellman equation whose Bellman operator corresponds to a randomized stopping time on the target-policy chain.
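The trace recursion and the Euclidean-ball-scaling choice of lambda mentioned below can be sketched as follows. This is a minimal illustration, not the paper's pseudocode: the function name, the feature vector phi_t, and the radius C are illustrative, and the paper's Condition 2.3 admits other choices of the lambda function.

```python
import numpy as np

def trace_update(e_prev, phi_t, gamma_t, rho_prev, C=10.0):
    """One step of the bounded-trace update (illustrative sketch).

    lambda_t is chosen as a function of the previous trace e_{t-1}
    (the Euclidean-ball-scaling example): shrink lambda only as much
    as needed so the decayed trace stays inside a ball of radius C.
    """
    # Magnitude the decayed trace would have with lambda_t = 1.
    decay = gamma_t * rho_prev * np.linalg.norm(e_prev)
    # Ball scaling: lambda_t = min(1, C / decay), Lipschitz in e_{t-1}.
    lam_t = 1.0 if decay <= C else C / decay
    # Trace iterate: e_t = lambda_t * gamma_t * rho_{t-1} * e_{t-1} + phi(S_t)
    return lam_t * gamma_t * rho_prev * e_prev + phi_t, lam_t
```

The contrast with schemes constraining lambda_t * rho_{t-1} <= 1 is visible here: a large importance-sampling ratio rho_{t-1} forces those schemes to shrink lambda_t immediately, whereas the ball scaling shrinks lambda_t only when the trace itself has actually grown large.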
Evidence summary
- Theoretical (Yu-Mahmood-Sutton 2017, this paper). Theorem 2.1 establishes the ergodicity of (S_t, y_t, e_t) and the uniqueness of its invariant measure. Corollary 2.1 establishes almost-sure convergence of the LSTD solutions. Theorem 3.1 establishes the equivalence between the LSTD limit and the solution of a projected generalized Bellman equation. Conditions 2.1–2.3 are described as nonrestrictive, and a Euclidean-ball-scaling example shows they are easy to satisfy. The paper reports no numerical experiments; numerical examples and finite-sample comparisons are deferred to a longer journal version.
Conditions and scope
- State space: finite. Continuous S not addressed.
- Policies: behavior policy pi^o must be irreducible; target policy pi must be supported on the same transitions (Condition 2.1).
- Memory: finite-state Markovian summary y_t of past states.
- lambda function: must be Lipschitz in e (Condition 2.3(i)) and produce traces bounded in some norm (Condition 2.3(ii)).
- Algorithm: the formal convergence guarantee is for LSTD only. Gradient-TD variants are mentioned as future work and are not proved here.
- Quantitative claim: the paper does not state by how much lambda can be enlarged in any specific environment; that is left to the deferred numerical version.
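Under these conditions, the LSTD computation built on the bounded trace can be sketched as below. Assumptions to flag: the placement of rho_t (multiplying the reward and next-state terms) follows the common off-policy state-value TD error, the per-step discount gamma_{t+1} and the ball radius C are illustrative, and the trajectory format is invented for the sketch.

```python
import numpy as np

def lstd_solve(trajectory, d, C=10.0):
    """LSTD sketch with the ball-scaled bounded trace.

    trajectory yields tuples (phi_t, rho_t, R_next, gamma_next, phi_next).
    Solves A theta + b = 0, where A and b are running averages of
    e_t (rho_t gamma_{t+1} phi_{t+1} - phi_t)^T and e_t rho_t R_{t+1}.
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    e = np.zeros(d)
    rho_prev, gamma_t = 1.0, 1.0   # no decay before the first step
    for t, (phi_t, rho_t, R_next, gamma_next, phi_next) in enumerate(trajectory):
        decay = gamma_t * rho_prev * np.linalg.norm(e)
        lam_t = 1.0 if decay <= C else C / decay       # ball-scaled lambda_t
        e = lam_t * gamma_t * rho_prev * e + phi_t     # bounded trace iterate
        # Incremental running means of the LSTD statistics.
        A += (np.outer(e, rho_t * gamma_next * phi_next - phi_t) - A) / (t + 1)
        b += (e * rho_t * R_next - b) / (t + 1)
        rho_prev, gamma_t = rho_t, gamma_next
    return np.linalg.solve(-A, b)
```

On a tabular on-policy problem (rho_t = 1 throughout) the trace stays inside the ball, lambda_t remains 1, and the solver reduces to ordinary LSTD(1); the bounded-trace behavior only engages when importance-sampling ratios inflate the trace.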
Counter-evidence
None in the wiki. The “much larger lambda” qualifier is a comparative theoretical statement against Retrace / ABQ / Tree-Backup; the empirical magnitude of the improvement is not measured in this paper. A full assessment would require either the longer journal version or independent empirical replication.
Linked ideas
(none yet)
Open questions
- How much variance reduction does the bounded-trace scheme actually deliver in standard off-policy benchmarks compared to Retrace and ABQ?
- Does the convergence guarantee extend to gradient-TD algorithms (GTD2, TDC, proximal-TD)?
- Does it extend to continuous state spaces, and to control (not just policy evaluation)?
- How should the memory function g and the lambda function be designed in practice for a given problem class?