Definition

The options framework introduces temporally extended actions, called options, into reinforcement learning. An option is a tuple (π, β, I) where:

  • π : S × A → [0, 1] is an internal policy executed while the option is active;
  • β : S → [0, 1] is a termination probability function — at each step the option terminates with probability β(s);
  • I ⊆ S is an initiation set: the states from which the option may be invoked.

Once invoked from a state in I, the option follows π until it terminates according to β, at which point control returns to the higher-level policy. A higher-level policy that selects among options (and possibly primitive actions) defines a semi-Markov decision process (SMDP) over the original MDP.
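The (π, β, I) tuple and the invoke-until-termination loop above can be sketched in a few lines of Python. This is an illustrative minimal sketch: the Option class, the run_option helper, and the toy chain environment are assumptions for demonstration, not constructs from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option (pi, beta, I): internal policy, termination probability, initiation set."""
    pi: Callable[[int], int]        # internal policy: state -> action
    beta: Callable[[int], float]    # termination probability: state -> [0, 1]
    init_set: Set[int]              # states from which the option may be invoked

def run_option(step, s, option, rng):
    """Execute an option until it terminates; return (final state, rewards, duration tau)."""
    assert s in option.init_set, "option invoked outside its initiation set"
    rewards = []
    while True:
        s, r = step(s, option.pi(s))
        rewards.append(r)
        if rng.random() < option.beta(s):           # terminate with probability beta(s)
            return s, rewards, len(rewards)

# Toy deterministic chain: states 0..4, the single action moves right, reward 1 on reaching 4.
def step(s, a):
    s_next = min(s + 1, 4)
    return s_next, 1.0 if s_next == 4 else 0.0

# A "walk to the goal" option: always move right, terminate only in state 4.
walk_right = Option(pi=lambda s: 1, beta=lambda s: 1.0 if s == 4 else 0.0,
                    init_set={0, 1, 2, 3})

s_final, rewards, tau = run_option(step, 0, walk_right, random.Random(0))
```

When the option terminates, control returns to the higher-level policy, which observes the accumulated reward and the elapsed duration τ, exactly as in the SMDP view.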

Intuition

Options are the standard formalization of temporal abstraction in reinforcement learning: they let the agent reason and plan over coarse-grained “skills” (open-the-door, walk-to-kitchen) rather than only over single primitive actions. Because one decision at the option level can cover many primitive steps, planning over a good option set can shorten the effective decision horizon dramatically, and options support hierarchical learning and transfer across tasks.

Formal notation

The composite system (MDP, options) becomes a semi-Markov decision process. For an option o = (π, β, I) invoked at state s ∈ I, the option’s discounted reward and transition model are

r_o(s) = E[ Σ_{k=0}^{τ−1} γ^k R_{t+k+1} | s_t = s, o ],

P_o(s, s') = E[ γ^τ 1{s_{t+τ} = s'} | s_t = s, o ],

where τ is the (random) number of steps until the option terminates. The Bellman equation over options uses these models in place of the single-step r and P. Sutton, Precup & Singh (1999, Theorem 1) show that an MDP together with a fixed set of options forms an SMDP, so standard SMDP planning and learning methods apply directly.
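For a deterministic option in a deterministic environment, the expectations collapse and the model (r_o, P_o) can be computed exactly by unrolling the option to termination. The sketch below reuses the toy chain from above; option_model and beta_det are hypothetical names for illustration.

```python
GAMMA = 0.9

def option_model(step, pi, beta_det, s0, gamma=GAMMA, max_steps=100):
    """Exact discounted model (r_o, P_o) for a deterministic option and environment.
    beta_det(s) returns True exactly when the option terminates in state s."""
    r_o, discount, s = 0.0, 1.0, s0
    for _ in range(max_steps):
        s, r = step(s, pi(s))
        r_o += discount * r          # accumulate gamma^k * R_{t+k+1}
        discount *= gamma
        if beta_det(s):
            return r_o, {s: discount}  # P_o(s0, s') = gamma^tau at the terminal state
    raise RuntimeError("option did not terminate")

def step(s, a):  # toy chain: states 0..4, move right, reward 1 on reaching 4
    s_next = min(s + 1, 4)
    return s_next, 1.0 if s_next == 4 else 0.0

r_o, P_o = option_model(step, lambda s: 1, lambda s: s == 4, s0=0)
# From state 0 the option takes tau = 4 steps: r_o = gamma^3 and P_o(0, 4) = gamma^4.
```

Note that P_o folds the discount γ^τ into the transition model, which is what makes the option-level Bellman equation look exactly like the one-step equation.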

Key variants

  • Markov options — π and β depend only on the current state.
  • Semi-Markov options — π and β may depend on the entire history since invocation.
  • Intra-option learning — value updates that share information across options that pass through the same states, accelerating learning.
  • Option-critic architecture (Bacon, Harb, Precup 2017) — end-to-end learning of π, β, and the higher-level policy via policy gradient.
  • Feudal / hierarchical RL — related but distinct frameworks that decompose tasks into subgoals without explicitly using the (π, β, I) triple.
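The intra-option Q-learning update for deterministic Markov options can be sketched as follows: every option whose internal policy would have chosen the observed action is updated from the same transition, which is what lets options share experience. The function and variable names here are illustrative assumptions.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9

def intra_option_update(Q, options, s, a, r, s_next):
    """One intra-option Q-learning step: every (deterministic, Markov) option whose
    internal policy would have taken action a in state s learns from this transition."""
    greedy = max(Q[(s_next, o)] for o in range(len(options)))
    for o, (pi, beta) in enumerate(options):
        if pi(s) != a:
            continue  # option inconsistent with the observed action; no update
        # Target: continue with o if it would not terminate in s_next, else act greedily.
        target = r + GAMMA * ((1 - beta(s_next)) * Q[(s_next, o)] + beta(s_next) * greedy)
        Q[(s, o)] += ALPHA * (target - Q[(s, o)])

Q = defaultdict(float)
# One option: always move right, terminate in state 4 (toy chain of states 0..4).
options = [(lambda s: 1, lambda s: 1.0 if s == 4 else 0.0)]
intra_option_update(Q, options, s=0, a=1, r=0.0, s_next=1)
```

Because updates happen on every primitive transition rather than only at option termination, intra-option learning extracts more information from the same experience.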

Known limitations

  • The classical SMDP treatment requires machinery separate from primitive-action MDPs, and its connection to general value functions and to interest-weighted prediction has historically been opaque; absorbing options into a single RL-task formalism is one motivation for transition-based discounting.
  • Discovering useful options automatically (rather than hand-designing them) remains hard: the literature on option discovery (eigenoptions, bottleneck states, information-theoretic objectives) has produced partial solutions but no consensus method.
  • Options can introduce additional bias and variance in TD-style updates; intra-option learning helps but does not eliminate this.

Open problems

  • Principled, general-purpose option discovery from experience.
  • Compositional reuse of options across tasks and environments.
  • Theoretical bounds on the planning-depth speedup achievable from a given option set.

Relevance to active research

Options remain the canonical formalization of temporal abstraction in RL and appear in essentially every hierarchical-RL paper. Recent unification efforts (White 2017, this paper) recast options as RL subtasks defined by transition-based discounting and interest functions, dissolving the historical syntactic separation between options and general value functions.