Definition
The classical taxonomy of reinforcement learning tasks distinguishes two cases by how the agent–environment interaction terminates:
- Episodic task — interaction is divided into finite-length episodes. Each episode ends when the agent reaches a designated terminal state (or absorbing state), at which point the agent is “teleported” to a start state and a new episode begins. The return is the undiscounted (or discounted) sum of rewards within the episode.
- Continuing task — interaction is unbounded; there is no terminal state. The return is an infinite discounted sum
Σ_{i=0}^∞ γ^i R_{t+1+i} with discount γ < 1 to ensure boundedness.
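The episodic return above can be sketched as a backward accumulation over a finite reward sequence. A minimal illustration, assuming a hypothetical three-step episode (the reward values are made up):

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i] for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):  # backward accumulation: G = r + gamma * G
        g = r + gamma * g
    return g

episode_rewards = [0.0, 0.0, 1.0]  # hypothetical 3-step episode
print(discounted_return(episode_rewards, gamma=0.9))  # 0.81
```

For an episodic task the sum simply truncates at the terminal step; for a continuing task the same recursion runs over an unbounded stream, which is why γ < 1 is needed.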
Intuition
Episodic tasks model goal-reaching domains (puzzles, games with win/loss conditions, robotic pick-and-place). Continuing tasks model ongoing control (process control, recommendation, lifelong learning). Historically the two settings have required different convergence proofs and different algorithm specifications, even though the underlying mathematical machinery is largely shared.
Formal notation
In the absorbing-state formulation, an episodic task is encoded as an MDP with a special state s_T such that Pr(s_T | s_T, a) = 1 and r(s_T, a, s_T) = 0 for all actions a. Once entered, the agent stays in s_T forever with zero reward, so the infinite-discounted-sum formulation reduces to the within-episode return. In practice, when this absorbing state is reached, a reset transports the agent back to a start-state distribution d_0.
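The encoding can be made concrete with a toy chain MDP. This is a hedged sketch, not from the source: the states "s0" and "s1", the transition structure, and the single-element start distribution d_0 are all hypothetical; only the absorbing-state behavior (self-loop with probability 1 and zero reward) follows the definition above.

```python
import random

S_T = "s_T"          # designated absorbing state
d_0 = ["s0"]         # start-state distribution (here a single start state)

def step(state, action):
    """One environment step; s_T absorbs with zero reward."""
    if state == S_T:
        return S_T, 0.0          # Pr(s_T | s_T, a) = 1, r = 0
    if state == "s0":
        return "s1", 0.0         # intermediate step, no reward
    return S_T, 1.0              # "s1" terminates with reward 1

def reset():
    return random.choice(d_0)    # teleport back per d_0
```

Because every post-termination reward is zero, the infinite discounted sum over this chain equals the return accumulated before entering s_T.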
In a continuing formulation, the return is G_t = Σ_{i=0}^∞ γ^i R_{t+1+i}, with γ ∈ [0, 1) to ensure convergence. Average-reward formulations (γ → 1) drop the discount and use the long-run average return lim_{T→∞} (1/T) Σ_{t=1}^T R_t instead.
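The long-run average return can be estimated incrementally over a continuing reward stream. A small sketch under assumed conditions (the alternating reward stream below is synthetic):

```python
def average_reward(reward_stream):
    """Running-mean estimate of lim_{T->inf} (1/T) sum_t R_t."""
    avg, t = 0.0, 0
    for r in reward_stream:
        t += 1
        avg += (r - avg) / t     # incremental mean update
    return avg

stream = [1.0, 0.0] * 500        # alternating rewards, long-run average 0.5
print(average_reward(stream))    # 0.5
```

Unlike the discounted return, this quantity involves no γ, which is why average-reward methods require their own convergence analysis.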
Key variants
- Absorbing-state encoding — adds a hypothetical sink state to express episodes as continuing-style infinite sums.
- State-based discounting γ_s(s) — sets γ = 0 at terminal states; sometimes requires extra hypothetical states to avoid feature aliasing.
- Transition-based discounting γ(s, a, s') — encodes terminal transitions as γ = 0 without modifying the state space; see transition-based-discounting and white-2017-unifying-task-specification-rl for the unification argument.
- Average-reward — drops γ entirely and replaces the return with a long-run average; requires different convergence analysis (Mahadevan 1996).
- Finite-horizon discounted — caps the episode at a fixed number of steps T; returns Σ_{i=0}^{T-1} γ^i R_{t+1+i}.
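Transition-based discounting can be illustrated with a tabular TD(0) update in which γ is a function of the transition rather than a constant: setting γ = 0 on terminating transitions makes the episodic case fall out of the continuing update with no separate termination logic. A hedged sketch; the constant GAMMA_C, the `terminal` flag, and the states below are all assumptions for illustration:

```python
GAMMA_C = 0.9  # discount used on non-terminating transitions (assumed)

def gamma(s, a, s_next, terminal):
    """Transition-based discount: zero on terminating transitions."""
    return 0.0 if terminal else GAMMA_C

def td0_update(V, s, a, r, s_next, terminal, alpha=0.1):
    """One TD(0) step; the same rule covers episodic and continuing tasks."""
    g = gamma(s, a, s_next, terminal)
    target = r + g * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    return V

V = {}
td0_update(V, "s1", "a", 1.0, "s_T", terminal=True)   # bootstrap term vanishes
```

On the terminating transition the bootstrap term g·V(s') is zero, so the update behaves exactly like the absorbing-state encoding without adding any hypothetical state.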
Known limitations
- The absorbing-state encoding is conceptually awkward: it introduces hypothetical states and reset transitions that are not part of the “real” environment, complicating function approximation (the absorbing state’s features can interfere with learned values elsewhere).
- Convergence proofs for episodic and continuing settings have historically been written separately (Bertsekas & Tsitsiklis 1996; Tsitsiklis & Van Roy 1997; Sutton et al. 2009), inflating the analytical and algorithmic surface area.
- The choice between episodic and continuing framings can change which policies are optimal for the same underlying environment, especially under approximation.
Open problems
- Universal convergence proofs that apply uniformly across both settings without case splits — partially addressed by the transition-based discounting unification (White 2017).
- Principled treatment of “soft termination” — situations where reaching a goal does not erase the value of subsequent transitions but only down-weights them.
- Algorithm design for tasks that mix episodic and continuing structure (e.g., a continuing task with rare hard resets).
Relevance to active research
The episodic/continuing distinction is foundational to every RL textbook and shows up in essentially every empirical paper as a setup choice. Recent theoretical work (transition-based discounting, generalized Bellman operators) aims to dissolve the distinction by treating it as a special case of a more general task specification.