Definition
The options framework introduces temporally extended actions, called options, into reinforcement learning. An option is a tuple (π, β, I) where:
- π : S × A → [0, 1] is an internal policy executed while the option is active;
- β : S → [0, 1] is a termination probability function: at each step the option terminates with probability β(s);
- I ⊆ S is an initiation set of states from which the option may be invoked.
Once invoked from a state in I, the option follows π until it terminates according to β, at which point control returns to the higher-level policy. A higher-level policy that selects among options (and possibly primitive actions) defines a semi-Markov decision process (SMDP) over the original MDP.
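As a concrete sketch, the execution loop above can be written in a few lines of Python. The `Option` dataclass, the `run_option` helper, and the `env.step(state, action) -> (next_state, reward)` interface are illustrative assumptions, not part of the original formalism:

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Set

State = Any
Action = Any

@dataclass
class Option:
    """An option (pi, beta, I): internal policy, termination function, initiation set."""
    pi: Callable[[State], Action]    # internal policy (deterministic here for brevity)
    beta: Callable[[State], float]   # per-state termination probability
    initiation_set: Set[State]       # states from which the option may be invoked

def run_option(env, option: Option, state: State, rng=random):
    """Follow the option's policy until beta says stop; return (final_state, rewards)."""
    assert state in option.initiation_set, "option invoked outside its initiation set"
    rewards = []
    while True:
        state, reward = env.step(state, option.pi(state))  # assumed env interface
        rewards.append(reward)
        if rng.random() < option.beta(state):
            return state, rewards
```

Note that termination is sampled from β(s) after every primitive step, so the number of steps τ taken by the option is itself random.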
Intuition
Options are the standard formalization of temporal abstraction in reinforcement learning: they let the agent reason and plan over coarse-grained “skills” (open-the-door, walk-to-kitchen) rather than only over single primitive actions. This can dramatically shorten effective planning horizons and supports hierarchical learning and transfer.
Formal notation
The composite system (MDP, options) becomes a semi-Markov decision process. For an option o = (π, β, I) invoked at state s ∈ I, the option’s discounted reward and transition model are
r_o(s) = E[ Σ_{k=0}^{τ-1} γ^k R_{t+k+1} | s_t = s, o ],
P_o(s, s') = E[ γ^τ 1{s_{t+τ} = s'} | s_t = s, o ],
where τ is the (random) number of steps until the option terminates. The Bellman equation over options uses these in place of single-step r and P. Sutton, Precup & Singh (1999, Theorem 1) prove that planning and learning over options is equivalent to standard SMDP planning.
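These two models can be estimated directly from their definitions by Monte Carlo rollouts of the option. A minimal sketch, assuming the option is given as callables `pi` and `beta` and the environment as a transition function `step(s, a) -> (s', reward)` (all illustrative names, not from the original paper):

```python
import random

def estimate_option_model(step, pi, beta, s0, gamma=0.9, n_rollouts=2000, seed=0):
    """Monte Carlo estimates of r_o(s0) and P_o(s0, .) from the definitions above.

    r_o(s0) ~ mean of sum_k gamma^k R_{t+k+1};
    P_o(s0, s') ~ mean of gamma^tau * 1{terminated in s'}.
    """
    rng = random.Random(seed)
    r_est, p_est = 0.0, {}
    for _ in range(n_rollouts):
        s, disc, ret = s0, 1.0, 0.0
        while True:
            s, reward = step(s, pi(s))
            ret += disc * reward
            disc *= gamma          # after the loop exits, disc == gamma^tau
            if rng.random() < beta(s):
                break
        r_est += ret / n_rollouts
        p_est[s] = p_est.get(s, 0.0) + disc / n_rollouts
    return r_est, p_est
```

The discount γ^τ inside P_o is what makes the option models compose like single-step models in the SMDP Bellman equation.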
Key variants
- Markov options — π and β depend only on the current state.
- Semi-Markov options — π and β may depend on the entire history since invocation.
- Intra-option learning — value updates that share information across options that pass through the same states, accelerating learning.
- Option-critic architecture (Bacon, Harb & Precup, 2017) — end-to-end learning of π, β, and the higher-level policy via policy gradient.
- Feudal / hierarchical RL — related but distinct frameworks that decompose tasks into subgoals without explicitly using the (π, β, I) triple.
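Intra-option Q-learning, for instance, updates every option consistent with an observed primitive transition, using the one-step target r + γ[(1 − β(s'))Q(s', o) + β(s') max_{o'} Q(s', o')]. A sketch under assumed interfaces: Q is stored as a dict keyed by (state, option index), and options are given as (pi, beta) pairs with deterministic internal policies:

```python
def intra_option_q_update(Q, options, s, a, reward, s_next, alpha=0.1, gamma=0.99):
    """One intra-option Q-learning step after a primitive transition (s, a, reward, s_next).

    Q: dict mapping (state, option_index) -> value (missing entries read as 0).
    options: list of (pi, beta) pairs; option i is updated only if pi(s) == a.
    """
    greedy = max(Q.get((s_next, j), 0.0) for j in range(len(options)))
    for i, (pi, beta) in enumerate(options):
        if pi(s) != a:
            continue  # option i would not have taken action a in s
        # value of continuing with option i vs. terminating and re-selecting greedily
        u = (1.0 - beta(s_next)) * Q.get((s_next, i), 0.0) + beta(s_next) * greedy
        q = Q.get((s, i), 0.0)
        Q[(s, i)] = q + alpha * (reward + gamma * u - q)
```

Because one primitive transition can update many option values at once, this is what lets intra-option methods learn faster than waiting for each option to run to termination.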
Known limitations
- The classical SMDP treatment requires machinery separate from primitive-action MDPs, and its connection to general value functions and interest-weighted prediction has historically been opaque (one motivation for transition-based discounting is to absorb options into a single RL-task formalism).
- Discovering useful options automatically (rather than hand-designing them) remains hard: the literature on option discovery (eigenoptions, bottleneck states, information-theoretic objectives) has produced partial solutions but no consensus method.
- Options can introduce additional bias and variance in TD-style updates; intra-option learning helps but does not eliminate this.
Open problems
- Principled, general-purpose option discovery from experience.
- Compositional reuse of options across tasks and environments.
- Theoretical bounds on the planning-depth speedup achievable from a given option set.
Relevance to active research
Options remain the canonical formalization of temporal abstraction in RL and appear in essentially every hierarchical-RL paper. Recent unification efforts (White 2017, this paper) recast options as RL subtasks defined by transition-based discounting and interest functions, dissolving the historical syntactic separation between options and general value functions.