Problem
Linear Dynamical Systems (LDS) with Gaussian noise admit exact Kalman-filter inference, but most physical systems are non-linear or non-Gaussian. Two stylised extensions exist: (a) discretise the latent state into a Dynamic Bayesian Network / HMM, but the belief state becomes exponential in the number of latent variables; or (b) keep the state continuous and Gaussian, but allow a discrete switch variable (with Markov dynamics over modes) to select which dynamics matrix and/or which observation matrix is active at each time step. The second construction — a Switching Kalman Filter (SKF), also called a Switching State-Space Model, Jump Markov Linear System, or Conditional Gaussian SSM — is a universal hammer for piecewise-linear dynamics, non-Gaussian observation noise modelled as a mixture, sensor failure, manoeuvring-target tracking, and data-association ambiguity (Multiple Hypothesis Tracking).
The fundamental obstruction is that in an SKF the exact filtering posterior is a mixture of Gaussians — one component per possible regime history S_{1:t} (K^t components after t steps for K regimes). Inference is therefore intractable even though the model is conditionally linear-Gaussian. This paper unifies the prior literature on SKFs, lays out the four families of approximation, derives GPB2 in detail, and extends EM-based maximum-likelihood learning from the non-switching case ([DRO93], [GH96a]) and from the switching-but-fully-observed case ([Ham90]) to the fully-latent SKF case.
Key idea
There is essentially one inference primitive — moment-matching collapse of a mixture of Gaussians into a smaller mixture (or a single Gaussian) in closed form — and one learning primitive — EM with the inferred switch posterior playing the role of soft component weights. All previously published switching-KF algorithms (GPB1, GPB2, IMM, the Ghahramani-Hinton variational SKF, Multiple-Hypothesis Tracking) can be viewed as different choices of which subset of the history mixture to keep and how aggressively to collapse the rest. Murphy presents the unification, gives the cleanest GPB2 filter+smoother+EM derivation for the SKF case (handling the cross-variance term required by EM, which IMM cannot deliver), and notes that moment-matching is the optimal single-Gaussian approximation in KL [Lau96] and that the per-step approximation error stays bounded thanks to the contracting/spreading property proved by Boyen-Koller [BK98a, BK98b].
Method
Model. S_t ∈ {1, ..., K} is a discrete Markov chain with transition matrix Z (Z(i, j) = P(S_t = j | S_{t-1} = i)) and initial distribution pi. Conditional on S_t = j, the latent state and observation follow x_t = A_j x_{t-1} + w_t with w_t ~ N(0, Q_j), and y_t = C_j x_t + v_t with v_t ~ N(0, R_j). Special cases include the Switching AR (SAR) model (set C_j = I and R_j = 0, i.e. observe x_t directly — exact inference is then tractable since the only hidden variable is the discrete S_t) and the Ghahramani-Hinton “data-association” SKF (the switch selects which sub-process is read out into y_t).
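A minimal numpy sketch of this generative model (the function name, the array shapes, and the zero-mean initial state are my choices, not Murphy's):

```python
import numpy as np

def simulate_skf(T, Z, pi0, A, Q, C, R, rng):
    """Sample (S, x, y) from a switching linear-Gaussian SSM.

    Z[i, j] = P(S_t = j | S_{t-1} = i); pi0 is the initial regime
    distribution; A, Q, C, R are per-regime parameter lists."""
    K, nx, ny = len(A), A[0].shape[0], C[0].shape[0]
    S = np.empty(T, dtype=int)
    x = np.empty((T, nx))
    y = np.empty((T, ny))
    for t in range(T):
        # regime transition (initial draw at t = 0)
        S[t] = rng.choice(K, p=pi0 if t == 0 else Z[S[t - 1]])
        # conditionally linear-Gaussian latent dynamics
        prev = x[t - 1] if t > 0 else np.zeros(nx)   # zero-mean init (assumption)
        x[t] = A[S[t]] @ prev + rng.multivariate_normal(np.zeros(nx), Q[S[t]])
        # regime-dependent observation
        y[t] = C[S[t]] @ x[t] + rng.multivariate_normal(np.zeros(ny), R[S[t]])
    return S, x, y
```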
Inference — the four families. The exact filtering posterior is a mixture of Gaussians, one per regime path S_{1:t}. Murphy enumerates four ways to defeat this growth:
- Collapsing — keep a fixed-size mixture of Gaussians and project the rest down via moment matching. This is GPB(r): GPB1 collapses to a single Gaussian per step; GPB2 collapses two-step histories (so it runs K^2 filter banks per step); IMM is a cheaper K-filter approximation to GPB2 but cannot be used for smoothing because it does not propagate the cross-variance term required by EM.
- Selection — keep only the highest-probability paths in the regime tree (Multiple Hypothesis Tracking, [BSF88]).
- Iterative / sampling — Gibbs / MCMC over the regime sequence S_{1:T} ([CK96, BMR98]) or alternating Viterbi-style segmentation refinement.
- Variational — break the vertical links and reintroduce variational coupling parameters; EM then maximises a lower bound on the log-likelihood ([GH96b]).
The paper focuses on the collapsing approach.
GPB2 SKF filter (detailed). For each pair (i, j) Murphy runs a Kalman update conditioned on (S_{t-1} = i, S_t = j), producing x_t^{ij}, V_t^{ij}, then forms the joint regime posterior P(S_{t-1} = i, S_t = j | y_{1:t}), marginalises to P(S_t = j | y_{1:t}), and collapses the K (i, j)-conditioned Gaussians into a single j-conditioned Gaussian via moment matching with weights P(S_{t-1} = i | S_t = j, y_{1:t}). The GPB2 smoother and the backward-pass approximation ([Kim94]) follow analogously. The key algebraic objects to track for EM are the cross-variance terms V_{t,t-1} = Cov(x_t, x_{t-1} | y_{1:T}), which IMM cannot supply.
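A minimal numpy sketch of one GPB2 filter step in this notation (the helper names and the log-domain weight handling are my own; numerical guards are illustrative, not Murphy's):

```python
import numpy as np

def kalman_step(x, V, y, A, Q, C, R):
    """One Kalman predict-update; returns posterior mean, covariance,
    and the log-likelihood of y under this (i, j) hypothesis."""
    xp, Vp = A @ x, A @ V @ A.T + Q
    S = C @ Vp @ C.T + R                     # innovation covariance
    K = Vp @ C.T @ np.linalg.inv(S)          # Kalman gain
    e = y - C @ xp
    _, logdet = np.linalg.slogdet(S)
    ll = -0.5 * (len(e) * np.log(2 * np.pi) + logdet + e @ np.linalg.solve(S, e))
    return xp + K @ e, Vp - K @ C @ Vp, ll

def collapse(w, mus, Vs):
    """Moment-match a Gaussian mixture to a single Gaussian."""
    mu = sum(wi * m for wi, m in zip(w, mus))
    V = sum(wi * (Vi + np.outer(m - mu, m - mu)) for wi, m, Vi in zip(w, mus, Vs))
    return mu, V

def gpb2_step(xs, Vs, ps, y, Z, A, Q, C, R):
    """One GPB2 step: xs[i], Vs[i], ps[i] approximate the filtered belief
    conditioned on S_{t-1} = i; returns the same triple conditioned on S_t = j."""
    K = len(ps)
    xij, Vij, lw = {}, {}, np.empty((K, K))
    for i in range(K):
        for j in range(K):                   # the K^2 filter bank
            xn, Vn, ll = kalman_step(xs[i], Vs[i], y, A[j], Q[j], C[j], R[j])
            xij[i, j], Vij[i, j] = xn, Vn
            lw[i, j] = np.log(ps[i] + 1e-300) + np.log(Z[i, j]) + ll
    joint = np.exp(lw - lw.max())            # propto P(S_{t-1}=i, S_t=j | y_{1:t})
    joint /= joint.sum()
    pj = joint.sum(axis=0)                   # P(S_t = j | y_{1:t})
    xs2, Vs2 = [], []
    for j in range(K):
        w = joint[:, j] / max(pj[j], 1e-300) # P(S_{t-1}=i | S_t=j, y_{1:t})
        mu, V = collapse(w, [xij[i, j] for i in range(K)],
                            [Vij[i, j] for i in range(K)])
        xs2.append(mu)
        Vs2.append(V)
    return xs2, Vs2, pj
```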
Collapse operator. For a mixture sum_i w_i N(mu_i, Sigma_i), the moment-matched mean and covariance are mu = sum_i w_i mu_i and Sigma = sum_i w_i (Sigma_i + (mu_i - mu)(mu_i - mu)'), with an analogous formula for collapsing the cross-covariance terms. This is the closest single Gaussian (in KL) to the original mixture [Lau96].
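The collapse formulas above, as a small numpy helper (the function name is mine):

```python
import numpy as np

def collapse(w, mus, Sigmas):
    """KL-optimal single-Gaussian fit to the mixture sum_i w[i] N(mus[i], Sigmas[i]):
    match the mixture's first two moments."""
    w = np.asarray(w, dtype=float)
    mus = np.asarray(mus, dtype=float)
    mu = w @ mus                                  # mixture mean
    diffs = mus - mu
    Sigma = sum(wi * (Si + np.outer(d, d))        # within- plus between-component spread
                for wi, Si, d in zip(w, Sigmas, diffs))
    return mu, Sigma
```

For example, collapsing equal-weight N(-1, 1) and N(1, 1) gives N(0, 2): the between-component term (mu_i - mu)(mu_i - mu)' adds the spread of the means on top of the averaged covariances.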
EM learning. The complete-data log-likelihood factorises into a linear-regression-like term per regime plus an HMM-like term for Z and pi. The E-step uses the soft regime weights W_t^j = P(S_t = j | y_{1:T}) plus the smoothed first and second moments E[x_t | y_{1:T}], E[x_t x_t' | y_{1:T}], and E[x_t x_{t-1}' | y_{1:T}]. The M-step gives regime-weighted analogues of the standard LDS-EM updates: A_j and Q_j are weighted linear-regression formulas, C_j and R_j are weighted observation-equation formulas, Z and pi are the standard HMM Baum-Welch updates, and the initial regime-conditional state means and covariances are mixture-of-Gaussians updates. Hamilton’s [Ham90, Ham91] Wishart prior on the noise covariances regularises the well-known mixture-of-Gaussians singularity (Murphy reports it “works quite well in practice”). EM is initialised with deterministic annealing ([UN98], suggested by [GH96b]), raising the regime posterior to a fractional power that is annealed towards 1, to escape the local-minima landscape.
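A sketch of the weighted linear-regression M-step for one regime's (A_j, Q_j), assuming the smoothed second moments are already in hand (signature and shapes are my own; the C_j, R_j updates are analogous):

```python
import numpy as np

def mstep_dynamics(W, Pt, Pt_tm1):
    """Regime-weighted MLE for (A_j, Q_j).

    W[t]      : P(S_t = j | y_{1:T}) over the T-1 transitions   (length T-1)
    Pt[t]     : smoothed E[x_t x_t' | y_{1:T}]                  (T, nx, nx)
    Pt_tm1[t] : smoothed E[x_t x_{t-1}' | y_{1:T}] -- the cross-
                variance that GPB2 supplies and IMM does not    (T-1, nx, nx)
    """
    Sxy = sum(w * P for w, P in zip(W, Pt_tm1))   # sum_t W_t E[x_t x_{t-1}']
    Sxx = sum(w * P for w, P in zip(W, Pt[:-1]))  # sum_t W_t E[x_{t-1} x_{t-1}']
    Syy = sum(w * P for w, P in zip(W, Pt[1:]))   # sum_t W_t E[x_t x_t']
    A = Sxy @ np.linalg.inv(Sxx)                  # weighted regression slope
    Q = (Syy - A @ Sxy.T) / np.sum(W)             # weighted residual covariance
    return A, Q
```

With all weights equal to 1 this reduces to the standard non-switching LDS-EM update, which is exactly the "weighted versions of the non-switching updates" pattern described above.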
Approximation arising in EM. The expectation of the complete-data log-likelihood uses E[x_t | S_t = j, y_{1:T}] in place of E[x_t | S_{1:T}, y_{1:T}] to keep the formulas tractable; the latter is an exponential number of vectors (one per segmentation S_{1:T}).
Results
This is a synthesis paper rather than a benchmark paper. The contributions are:
- Unification. Five previously distinct algorithm families (GPB1, GPB2, IMM, variational SKF, MHT) are presented as different points in the same approximation space (which subset of the history mixture to keep, and how to collapse the rest).
- Cleanly-derived GPB2 SKF inference. The forward filter and the smoother are written in a common notation that makes the cross-variance bookkeeping explicit. This is what makes EM-based learning of SKF parameters possible — IMM, although cheaper (K filters versus K^2), cannot provide the cross-variance terms.
- Switching-EM derivation. Appendix B carries through the derivative calculations for A_j, C_j, Q_j, R_j in the switching case. The result is that every MLE update is the standard non-switching MLE update with the per-step expectations re-weighted by W_t^j = P(S_t = j | y_{1:T}), and the regime-transition update is the standard Baum-Welch update on the smoothed transition posteriors. Constrained MLE (e.g., diagonal covariance, parameter tying across regimes) is obtained by computing the unconstrained MLE and then projecting onto the allowable subspace, or by pooling sufficient statistics over the equivalence class.
- Bounded-error rationale. Murphy points to the Boyen-Koller [BK98a, BK98b] result that, although moment-matching collapse introduces error at every step, the stochasticity of the dynamics causes the true and approximate posteriors to “spread out” and overlap, so the accumulated error stays bounded.
- Wishart regularisation. Hamilton’s Wishart prior on the noise covariances prevents the SKF from collapsing to a singular covariance matrix, with a closed-form MAP update that drops out of the same EM machinery.
Limitations
- Approximation, not exact inference. All practical SKF algorithms in the collapsing family are biased: they replace a mixture of Gaussians with a much smaller mixture, and there is no easy a-priori control on the error introduced (only the bounded-but-not-quantified Boyen-Koller guarantee).
- No empirical comparison. GPB1 vs GPB2 vs IMM vs variational vs MHT are presented as design points; the paper does not benchmark them on a common problem.
- EM uses an extra approximation. The EM update replaces E[x_t | S_{1:T}, y_{1:T}] with E[x_t | S_t = j, y_{1:T}] to avoid an exponential number of vectors. The paper does not characterise when this matters.
- Local minima. EM on an SKF has K^T candidate segmentations; Murphy recommends deterministic annealing but offers no convergence guarantee.
- No handling of model selection. The number of modes K, the parameter tying scheme, and the existence of unobserved sub-processes are all assumed to be known a-priori.
- Smoothing requires two-step history. GPB1 and IMM cannot supply the cross-variance term needed by EM, so EM-based learning specifically needs GPB2 (or one of the more expensive selection / sampling schemes).
Open questions
- How tight is the Boyen-Koller “bounded error” guarantee in practice for SKF problems with weak ergodicity (slow regime mixing) or near-deterministic dynamics?
- For the EM step, when does replacing E[x_t | S_{1:T}, y_{1:T}] with E[x_t | S_t = j, y_{1:T}] produce material parameter bias?
- Can the variational SKF of [GH96b] be combined with collapsing-style filtering to get a tighter bound at lower cost than full GPB2?
- What is the right way to choose the order r in GPB(r) adaptively, based on regime entropy or filter disagreement?
My take
This is the paper to cite when you need a single, citable, modern statement of
“the SKF problem and its standard fixes.” It is the technical-report sibling of
the SKF chapter of Murphy’s later textbook, and its real contribution is
nomenclatural normalisation: GPB1, GPB2, IMM, MHT, variational SKF, and
data-association ambiguity are all the same animal viewed from different angles,
and the cross-variance bookkeeping needed for EM is explicit. For the CRE
asset-pricing model in this workspace, the relevance is direct: the Rao-Blackwellised
particle filter (RBPF) used in SimMdlPrices/src/rbpf.jl is a selection-class
member of Murphy’s taxonomy — particles carry regime histories, and conditional
on a regime path the continuous state is exact-Kalman and asset prices are
exponential-quadratic. GPB2 / IMM are the collapsing-class alternatives, and
Murphy’s framing makes clear what is gained (cheaper, no resampling variance)
and what is lost (the cross-variance approximation, the inability to represent
multimodal switch posteriors crisply). The “EM updates are weighted versions of
the non-switching updates” observation is also the right mental model for any
future MAP/MLE work over the regime parameters.