Markov Chains & Hidden Markov Models

A Markov chain is a stochastic process where the future depends on the past only through the present — the memoryless property. Hidden Markov Models extend the framework by letting the chain itself be unobserved and inferring it from noisy observations. The two together form one of the most heavily reused probabilistic tools in physics, biology, finance, speech, and modern deep state-space models.

1. Discrete-Time Markov Chains (DTMC)

1.1 Formal setup

A sequence of random variables X_0, X_1, X_2, ... on countable state space S = {1, ..., N} is a Markov chain if

P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i) = P_{ij}

The conditional distribution given the past depends only on the immediate previous state. The matrix P = (P_{ij}) is the transition matrix, row-stochastic: each row sums to 1 and each entry is in [0, 1].

The chain is time-homogeneous when P_{ij} is independent of n. Non-homogeneous chains have P^{(n)} per step and are common in adaptive MCMC and stochastic-approximation settings.

1.2 Chapman-Kolmogorov

The n-step transition is the matrix power:

P^{(n)}_{ij} = P(X_n = j | X_0 = i) = (P^n)_{ij}

The Chapman-Kolmogorov equation says

P^{(m+n)}_{ij} = sum_k P^{(m)}_{ik} P^{(n)}_{kj}

This is just matrix multiplication once you accept the Markov property. Starting from distribution mu_0 (a row vector), the marginal at step n is mu_n = mu_0 P^n.

1.3 Classification of states

  • Accessible i -> j if P^{(n)}_{ij} > 0 for some n >= 0.
  • Communicating i <-> j if both directions hold. Communicating classes partition the state space.
  • Irreducible chain — exactly one communicating class.
  • Recurrent state — returns with probability 1. Transient — finite expected number of visits.
  • Positive recurrent — finite expected return time. Null recurrent — infinite expected return time but probability-1 return.
  • Period d(i) = gcd{n : P^{(n)}_{ii} > 0}. Aperiodic when d = 1. Self-loop P_{ii} > 0 is sufficient but not necessary.

1.4 Stationary distribution

A row vector π with π_i >= 0, sum π_i = 1 is stationary when

π = π P

For finite irreducible aperiodic chains the fundamental limit theorem (Markov 1906, Doeblin 1937) gives

lim_{n -> ∞} P^{(n)}_{ij} = π_j   for all i, j

So P^n converges to the rank-1 matrix 1 π. Existence and uniqueness of π is guaranteed for finite irreducible chains; countable-state chains need positive recurrence.

Detailed balance is a stronger condition:

π_i P_{ij} = π_j P_{ji}    for all i, j

A chain satisfying detailed balance is reversible. Detailed balance implies stationarity but not conversely. Reversibility makes the transfer operator self-adjoint in L^2(π), which simplifies spectral analysis and is the key design lever for MCMC.

1.5 Perron-Frobenius and spectral gap

For an irreducible stochastic matrix P, the Perron-Frobenius theorem (Perron 1907, Frobenius 1912) gives:

  • The dominant eigenvalue is λ_1 = 1, with right eigenvector 1 = (1, 1, ..., 1)^T and left eigenvector π.
  • All eigenvalues satisfy |λ_k| <= 1.
  • For aperiodic chains, every other eigenvalue is strictly less than 1 in modulus.
  • The spectral gap is γ = 1 - |λ_2|. Larger gap → faster mixing.

1.6 Mean first passage time

Define

m_{ij} = E[T_j | X_0 = i],   T_j = inf{n >= 1 : X_n = j}

For finite irreducible chains, m_{jj} = 1 / π_j (Kac’s lemma). For i ≠ j,

m_{ij} = 1 + sum_{k ≠ j} P_{ik} m_{kj}

This is a linear system. Mean first passage matrices feed network-analysis metrics, MCMC convergence diagnostics, and PageRank-style centrality.

1.7 Mixing time

Mixing time is

t_mix(ε) = min{n : max_i || P^n(i, ·) - π ||_TV <= ε}

where ||·||_TV is total variation distance. For reversible chains the spectral gap bound gives

t_mix(ε) <= (1/γ) log(1 / (ε π_min))

A tighter geometric bound uses Cheeger’s inequality (Cheeger 1970, Jerrum-Sinclair 1989 for chains):

Φ^2 / 2 <= γ <= 2 Φ,    Φ = min_{S : π(S) <= 1/2} Q(S, S^c) / π(S)

where Q(S, S^c) is the probability flux out of set S under the stationary chain. Conductance Φ is hard to compute but easy to lower-bound combinatorially; Cheeger then gives a mixing rate. This connection between graph expansion and chain mixing drives modern analyses of MCMC and randomized algorithms.

1.8 Ergodic theorem

For irreducible positive-recurrent chains and any f with E_π[|f|] < ∞,

(1/n) sum_{k=0}^{n-1} f(X_k)  ->  E_π[f]   almost surely

This is the engine of Markov chain Monte Carlo — time averages along a trajectory equal space averages under π, and you do not need independent samples.

2. Continuous-Time Markov Chains (CTMC)

2.1 Generator matrix

For continuous time t >= 0 define P_{ij}(t) = P(X_t = j | X_0 = i). The transition matrix P(t) satisfies the semigroup property P(s + t) = P(s) P(t) and P(0) = I. The infinitesimal generator (also called rate matrix or Q-matrix) is

Q = lim_{t -> 0+} (P(t) - I) / t

with off-diagonal q_{ij} >= 0 (rate of jumps from i to j) and diagonal q_{ii} = -sum_{j ≠ i} q_{ij} (negative). Rows sum to zero.

2.2 Holding times and embedded chain

Sitting in state i, the holding time is exponential with rate -q_{ii}:

P(holding time > t | X_0 = i) = exp(q_{ii} t)

On leaving i, the next state is j with probability q_{ij} / (-q_{ii}). The sequence of visited states (ignoring durations) is the embedded discrete chain with transition matrix

P^{embed}_{ij} = q_{ij} / (-q_{ii})   for i ≠ j,   0 on diagonal

2.3 Kolmogorov equations

P(t) satisfies the Kolmogorov forward equation (also called the master equation in physics):

dP(t) / dt = P(t) Q

and the Kolmogorov backward equation:

dP(t) / dt = Q P(t)

Both are matrix ODEs with solution P(t) = exp(t Q). For probability distributions μ(t) = μ(0) P(t),

dμ/dt = μ Q

— a linear ODE, the master equation in statistical physics.

2.4 Stationary distribution and detailed balance

π stationary iff π Q = 0. Detailed balance is

π_i q_{ij} = π_j q_{ji}

implying time-reversibility. CTMCs in physics with detailed balance are at thermal equilibrium; without it they are in non-equilibrium steady state (NESS).

2.5 Classical chains

  • Ehrenfest urn (Ehrenfest 1907) — N particles in two urns, swap one at random per step. State = particles in urn 1. Used to model diffusion and to demolish naive “no recurrence in physics” arguments.
  • Wright-Fisher model (Wright 1931, Fisher 1930) — 2N gene copies; next generation samples with replacement. State = count of allele A. Foundational in population genetics. The Kingman coalescent (Kingman 1982) is the genealogical dual.
  • M/M/1 queue — Poisson arrivals rate λ, exponential service rate μ, single server. State = number in system. Stable when ρ = λ/μ < 1; stationary distribution geometric with parameter ρ. Backbone of networking-foundations traffic models.
  • Erlang Bm servers, blocked customers cleared. Erlang B formula B(m, ρ) = (ρ^m / m!) / sum_{k=0..m} (ρ^k / k!) gives the blocking probability. Erlang Cm servers with infinite queue; gives wait probability. Telecom and call-center sizing.
  • Birth-death chains — transitions only to neighbors, q_{i, i+1} = λ_i, q_{i, i-1} = μ_i. M/M/1 and Ehrenfest are birth-death; nearly always tractable analytically.
  • Random walks — Z^1, Z^2 random walks are recurrent; Z^3 and higher are transient (Pólya 1921). The classical “drunk goes home but drunk bird lost forever” result.

3. MCMC and the design of reversible chains

Markov chain Monte Carlo turns the ergodic theorem into an estimation tool: construct a chain whose stationary distribution is the target π and average along trajectories. Detailed balance is the standard design constraint because it makes verifying π trivial.

3.1 Metropolis-Hastings (Metropolis et al 1953, Hastings 1970)

Given proposal density q(y | x) and target π(x), the acceptance probability is

α(x, y) = min{1, [π(y) q(x | y)] / [π(x) q(y | x)]}

This satisfies detailed balance by construction. Random-walk Metropolis uses q(y | x) = N(x, Σ) and acceptance reduces to π(y) / π(x). Independence sampler uses q(y | x) = q(y). Hamiltonian Monte Carlo (Duane et al 1987, Neal 1993, Hoffman-Gelman NUTS 2014) uses gradient-driven Hamiltonian dynamics as the proposal; covered in mcmc-sampling.

3.2 Gibbs sampling (Geman-Geman 1984)

When the full conditionals π(x_i | x_{-i}) are easy, sweep through them. Detailed balance holds for each conditional update, and the composition (deterministic-scan or random-scan) leaves π invariant. Gibbs is a special case of Metropolis-Hastings with acceptance 1.

3.3 Reversible vs non-reversible MCMC

Non-reversible chains can mix faster than the best reversible chain — the lifted chain construction (Diaconis-Holmes-Neal 2000) duplicates each state with a direction bit and breaks reversibility. Used in modern algorithms like bouncy particle sampler (Bouchard-Côté et al 2018) and zig-zag process (Bierkens-Fearnhead-Roberts 2019).

4. Hidden Markov Models

A Hidden Markov Model is a doubly-stochastic process: a latent Markov chain X_1, ..., X_T over states {1, ..., N} and an observed sequence Y_1, ..., Y_T such that Y_t depends only on X_t. The HMM is fully specified by λ = (A, B, π):

  • A_{ij} = P(X_{t+1} = j | X_t = i) — transition matrix
  • b_j(y) = P(Y_t = y | X_t = j) — emission density / distribution
  • π_i = P(X_1 = i) — initial state distribution

Following the canonical reference Rabiner 1989 (“A Tutorial on Hidden Markov Models”, Proc. IEEE 77(2)), HMM tasks split into three problems.

4.1 Problem 1 — evaluation

Given λ and observation sequence Y_{1:T}, compute the likelihood P(Y_{1:T} | λ). Brute force sum_X P(Y, X | λ) is O(N^T) — intractable. The Forward algorithm uses

α_1(i) = π_i b_i(Y_1)
α_{t+1}(j) = [sum_i α_t(i) A_{ij}] b_j(Y_{t+1})
P(Y_{1:T} | λ) = sum_i α_T(i)

This runs in O(N^2 T). Numerical underflow is handled by scaling each α_t to sum to 1 and keeping the log-scale factor.

4.2 Problem 2 — decoding

Find argmax_X P(X | Y, λ). The Viterbi algorithm (Viterbi 1967) — originally for convolutional codes — is dynamic programming in log-space:

δ_1(i) = log π_i + log b_i(Y_1)
δ_{t+1}(j) = max_i [δ_t(i) + log A_{ij}] + log b_j(Y_{t+1})
ψ_{t+1}(j) = argmax_i [δ_t(i) + log A_{ij}]

with traceback via ψ. Cost O(N^2 T), single pass.

Two distinct notions of “decode” exist:

  • Viterbi path — joint MAP argmax_X P(X | Y).
  • Posterior decoding — per-step MAP argmax_{x_t} P(X_t = x_t | Y), computed from forward-backward marginals. May yield a path with zero probability under the model but is sometimes preferred for biology (e.g. gene finding labels per nucleotide).

4.3 Problem 3 — learning

Estimate λ from observations. Baum-Welch (Baum-Petrie 1966, Baum-Eagon 1967, Baum 1972) is the EM algorithm applied to HMMs:

  • E-step — forward-backward gives marginals γ_t(i) = P(X_t = i | Y) and pair-marginals ξ_t(i, j) = P(X_t = i, X_{t+1} = j | Y).

  • M-step

    π_i^new = γ_1(i)
    A_{ij}^new = (sum_t ξ_t(i, j)) / (sum_t γ_t(i))
    b_j(y)^new ∝ sum_{t : Y_t = y} γ_t(j)    (discrete emissions)
    

Monotone non-decrease of P(Y | λ) at each iteration follows the general EM convergence (Dempster-Laird-Rubin 1977). Local optima are the price; multiple restarts and informed initialization (e.g. k-means on observations) help in practice. Discussed in the broader EM framework at variational-inference.

4.4 Forward-Backward smoothing

The Backward variable

β_T(i) = 1
β_t(i) = sum_j A_{ij} b_j(Y_{t+1}) β_{t+1}(j)

combines with forward to give marginal posterior

γ_t(i) = α_t(i) β_t(i) / sum_k α_t(k) β_t(k)

and pairwise

ξ_t(i, j) = α_t(i) A_{ij} b_j(Y_{t+1}) β_{t+1}(j) / P(Y | λ)

Forward-backward is smoothing — using the whole sequence to estimate X_t. Forward alone is filtering — using only Y_{1:t}.

5. Continuous emissions

5.1 Gaussian emissions

b_j(y) = N(y; μ_j, Σ_j). M-step gives

μ_j^new = (sum_t γ_t(j) Y_t) / (sum_t γ_t(j))
Σ_j^new = (sum_t γ_t(j) (Y_t - μ_j)(Y_t - μ_j)^T) / (sum_t γ_t(j))

5.2 Gaussian mixture emissions

b_j(y) = sum_m c_{jm} N(y; μ_{jm}, Σ_{jm}). Treat mixture component as a second latent variable; full EM has nested forward-backward + standard GMM M-step. Used in classical HMM-GMM speech recognition (1980s–early 2010s).

5.3 Multivariate continuous, autoregressive

Coupling consecutive observations directly gives autoregressive HMMs: Y_t | X_t, Y_{t-1} ∼ N(A_{X_t} Y_{t-1} + b_{X_t}, Σ_{X_t}). Combines with Markov switching state-space models for finance and econometrics.

6. Variations and extensions

6.1 Hidden Semi-Markov Models (HSMM)

In a vanilla HMM the duration in state j is geometric, P(d) = A_{jj}^{d-1}(1 - A_{jj}), often unrealistic. HSMMs (Ferguson 1980) attach an explicit duration distribution p_j(d). Forward-backward becomes O(N^2 T D) for max-duration D. Speech, biology, animal-behavior modeling.

6.2 Hierarchical HMMs

HHMM (Fine-Singer-Tishby 1998) — nested HMMs where states themselves emit subsequences from sub-HMMs. Reformulated as a flat HMM (Murphy-Paskin 2001) it admits standard forward-backward at O(D N^2 T) for depth D.

6.3 Factorial HMMs

Multiple parallel chains share an emission distribution: X_t = (X_t^{(1)}, ..., X_t^{(M)}). State space is exponential, so exact inference requires structured variational methods (Ghahramani-Jordan 1997). A natural fit for multi-source speech and source separation.

6.4 Input-Output HMMs

Bengio-Frasconi 1996. Transitions and emissions condition on an external input sequence; precursor to attention-based seq2seq.

7. Particle filters / sequential Monte Carlo

For nonlinear / non-Gaussian state-space models exact filtering is intractable. Particle filters represent the posterior by a weighted sample.

7.1 Bootstrap filter (Gordon-Salmond-Smith 1993)

For state-space model

X_t = f(X_{t-1}) + w_t,   Y_t = h(X_t) + v_t

with particles {x_t^{(i)}, w_t^{(i)}}:

  1. Predict — propagate x_t^{(i)} ∼ p(X_t | X_{t-1} = x_{t-1}^{(i)}).
  2. Weight — w_t^{(i)} ∝ p(Y_t | X_t = x_t^{(i)}).
  3. Resample (SIR) — draw N particles with replacement from {x_t^{(i)}, w_t^{(i)}}.

Resampling cures degeneracy (one particle hogs all weight). After resampling all weights reset to 1/N.

7.2 Effective sample size (ESS)

ESS = 1 / sum_i (w_t^{(i)})^2

Resample only when ESS < N/2 to balance variance reduction against the loss from resampling. Liu 2001 (“Monte Carlo Strategies in Scientific Computing”) is the reference.

7.3 Auxiliary particle filter (Pitt-Shephard 1999)

Use a look-ahead at Y_t to pre-select particles likely to survive the upcoming weighting step. Reduces variance when likelihood is peaked relative to the prior.

7.4 Regularization and Rao-Blackwellization

  • Regularized PF (Musso-Oudjane-Le Gland 2001) — smooth the resampled empirical measure to combat sample impoverishment, often via a Gaussian kernel.
  • Rao-Blackwellized PF (Doucet-de Freitas-Murphy-Russell 2000) — marginalize out a linear-Gaussian subset of the state analytically, particle-filter only the nonlinear subset. Standard for FastSLAM (Montemerlo-Thrun 2002); see slam.

7.5 Smoothing and learning

Forward-filtering backward-smoothing for PF — Godsill-Doucet-West 2004. Particle MCMC (Andrieu-Doucet-Holenstein 2010) embeds a PF inside an MH step to do parameter inference in nonlinear state-space models.

8. Kalman filter family

For linear-Gaussian state space models the optimal Bayesian filter is closed form — the Kalman filter (Kalman 1960).

8.1 Standard Kalman filter

Model:

X_t = F X_{t-1} + B u_t + w_t,   w_t ∼ N(0, Q)
Y_t = H X_t + v_t,                v_t ∼ N(0, R)

Predict:

x̂_{t|t-1} = F x̂_{t-1|t-1} + B u_t
P_{t|t-1} = F P_{t-1|t-1} F^T + Q

Update:

y_t = Y_t - H x̂_{t|t-1}    (innovation)
S_t = H P_{t|t-1} H^T + R   (innovation covariance)
K_t = P_{t|t-1} H^T S_t^{-1} (Kalman gain)
x̂_{t|t} = x̂_{t|t-1} + K_t y_t
P_{t|t} = (I - K_t H) P_{t|t-1}

8.2 Numerical-stability forms

  • Joseph form

    P_{t|t} = (I - K_t H) P_{t|t-1} (I - K_t H)^T + K_t R K_t^T
    

    Symmetric and positive-semidefinite by construction even with floating-point error.

  • Information form — track Y_t = P^{-1} and y_t = P^{-1} x̂. Predict is harder, update is simpler; natural for sensor fusion with many measurements.

  • Square-root form — track a Cholesky factor L with P = L L^T; update via QR. Doubles the working precision in floating point.

8.3 Extended Kalman Filter (EKF)

For

X_t = f(X_{t-1}, u_t) + w_t,   Y_t = h(X_t) + v_t

linearize around the current estimate, F_t = ∂f/∂x, H_t = ∂h/∂x. First-order accurate; biased and can diverge for strong nonlinearity. The default before UKF/PF in many engineering domains.

8.4 Unscented Kalman Filter (UKF)

Julier-Uhlmann 1997. Propagate 2n + 1 deterministically chosen sigma points through f and h, fit a Gaussian to the transformed cloud. No Jacobians required and second-order accurate for Gaussian inputs. The unscented transform underlies a family of derivative-free filters.

8.5 Cubature Kalman Filter

Arasaratnam-Haykin 2009 — replace sigma points with a third-order spherical-radial cubature rule. Cleaner derivation, sometimes more stable for high-dimensional state.

8.6 Interacting Multiple Model (IMM)

Run a bank of Kalman filters under different motion models (e.g. constant-velocity vs constant-acceleration in tracking), mix the estimates by a Markov transition over model index. Standard in radar tracking and autonomous-driving target estimation.

8.7 Connection to HMM forward algorithm

The Kalman forward pass is exactly the HMM forward algorithm in the linear-Gaussian limit, with Gaussian sufficient statistics replacing the categorical α_t. The Rauch-Tung-Striebel (RTS) smoother (Rauch-Tung-Striebel 1965) is the backward pass — direct Gaussian analog of HMM forward-backward.

9. Applications

9.1 Speech recognition

The HMM era of automatic speech recognition (ASR) ran from Rabiner 1989 through the early 2010s. State = phone (or sub-phone “senone”), emission = GMM over MFCC features (HMM-GMM). The DNN-HMM hybrid era (Hinton-Deng-Yu et al 2012, “Deep Neural Networks for Acoustic Modeling in Speech Recognition”) replaced the GMM with a DNN that estimated P(state | observation). Pure end-to-end recurrent and Transformer ASR (Graves CTC 2006, Chan-Jaitly-Le-Vinyals “Listen Attend and Spell” 2016, Conformer 2020) eventually supplanted HMM decoding, but the alignment intuition from forward-backward survives in CTC (Connectionist Temporal Classification, Graves et al 2006) — itself an HMM in disguise.

9.2 Part-of-speech tagging

A first-order HMM with tag = state and word = observation gave the standard NLP baseline before CRFs (Lafferty-McCallum-Pereira 2001) and neural sequence taggers. The HMM Viterbi tagger still ships as a fallback in NLTK, spaCy, and Stanford CoreNLP.

9.3 Computational biology

  • Gene findingGENSCAN (Burge-Karlin 1997, Stanford) — generalized HMM over the eukaryotic gene grammar (exons, introns, splice sites). The standard until learning-based tools (AUGUSTUS 2006, BRAKER) caught up.
  • Profile HMMsHMMER (Sean Eddy, HHMI; HMMER 1, 2, 3 from 1995 onward) and SAM (Krogh-Brown-Mian-Sjolander-Haussler 1994) — model conserved protein/DNA family profiles. PFam database is built on HMMER profiles.
  • CpG island detection — two-state HMM (in / out of island) with nucleotide emissions; canonical first example in Durbin-Eddy-Krogh-Mitchison 1998 (“Biological Sequence Analysis”).
  • Ion channel kinetics — single-channel patch-clamp recordings analyzed by HMMs of channel-open/closed states (Chung-Moore-Xia 1990, Qin-Auerbach-Sachs 1996). MAC-ion-channel work funded by NIH for decades.

9.4 Finance and econometrics

Hamilton 1989 (“A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle”, Econometrica) introduced the Markov-switching ARMA — regime-switching AR with hidden regime via HMM filter. Hamilton filter = HMM forward pass on the regime. Extensions: MSGARCH (Markov-switching GARCH, Haas-Mittnik-Paolella 2004) for volatility regimes. Markov-modulated Poisson processes price credit and insurance.

9.5 Activity and behavior modeling

Sleep staging (Rechtschaffen-Kales 1968 staging, automated by HMM since the 1990s); animal-behavior segmentation (Anderson-Perona MoSeq 2015 for mice); wearable-sensor activity recognition.

10. Software ecosystem

Discrete and Gaussian HMM:

  • hmmlearn — Python, scikit-learn-style API, GMM + categorical emissions.
  • pomegranate — Python, PyTorch backend, supports more general emission distributions and discrete Bayes nets.
  • TensorFlow Probabilitytfp.distributions.HiddenMarkovModel, GPU-accelerated, differentiable end-to-end.
  • NumPyro / PyMC — probabilistic programming with HMM via the Scan primitive (NumPyro) or Scan in Theano/PyTensor (PyMC).
  • R: hmm, depmixS4, HiddenMarkov.
  • JAX-HMM — pure-JAX implementation, good for research.

Particle filters:

  • pfilter — minimal Python implementation.
  • particles (Chopin) — Python, comprehensive SMC including PMCMC and SMC^2.
  • NimbleSMC (R), LibBi (C++).

Kalman filters:

  • filterpy (Roger Labbe) — Python; canonical KF / EKF / UKF / IMM reference implementation.
  • FilterPy examples in the textbook “Kalman and Bayesian Filters in Python”.
  • GTSAM (Dellaert et al, Georgia Tech) — factor-graph framing; used heavily in robotics SLAM.
  • OpenCVcv::KalmanFilter for vision tracking.
  • EKF/UKF baked into ROS 2 nav2 and Cartographer; see slam.

11. Frontiers

11.1 Deep state-space models

  • Switching Linear Dynamical Systems (Linderman-Miller-Adams-Paninski-Ryden-Brubaker 2017, “Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems”) — recurrent SLDS with deep transition functions and amortized inference. Used in neuroscience to model behaving animals.
  • Structured variational inference for HMM — Johnson-Duvenaud-Wiltschko-Datta-Adams 2016 (“Composing graphical models with neural networks for structured representations and fast inference”). Couple a DNN emission with exact HMM forward-backward for posteriors.
  • Neural HMMs — Tran-Bauer-Sankaran-Daniluk-Rush-Reisinger-Burch 2016 and many follow-ups; transition and emission parametrized by neural nets, trained by gradient.
  • S4 / S5 / Mamba (Gu-Goel-Re 2022, Gu-Dao 2023) — deep linear state-space models trainable as convolutions; descend from the same continuous-time linear-Gaussian filter and have eaten part of the Transformer’s lunch for long-sequence modeling. See transformer-architecture.
  • CTC and RNN-T as HMMs — explicit derivation in Sak-Senior-Beaufays 2014 and Graves 2012. Bridge between HMM theory and modern sequence models.

11.2 Differentiable particle filters

Naesseth-Linderman-Ranganath-Blei 2018 (“Variational Sequential Monte Carlo”), Le-Igl-Rainforth-Jin-Wood 2018 (“Auto-Encoding Sequential Monte Carlo”) — backprop through PF resampling using soft / Gumbel resampling and reparameterized proposals. Enables end-to-end training of nonlinear state-space models.

11.3 Non-reversible MCMC, lifted chains

See section 3.3 — current frontier of MCMC theory. Connected to mcmc-sampling and stochastic-calculus.

11.4 Quantum Markov chains

Quantum analog where state is a density matrix and transitions are CPTP maps. Foundational for quantum Monte Carlo and dissipative quantum dynamics.

12. Worked numerical micro-examples

12.1 Two-state DTMC stationary distribution

Take

P = [[0.7, 0.3],
     [0.4, 0.6]]

Solve π P = π, π_1 + π_2 = 1:

0.7 π_1 + 0.4 π_2 = π_1   =>   -0.3 π_1 + 0.4 π_2 = 0
π_1 = 0.4 / 0.7 · π_2 = (4/7) π_2

π_1 + π_2 = 1  =>  π_2 (4/7 + 1) = 1  =>  π_2 = 7/11
π_1 = 4/11

Spectral check: trace P is 1.3, determinant is 0.7·0.6 - 0.3·0.4 = 0.30. Eigenvalues are λ_1 = 1, λ_2 = 0.3. Spectral gap γ = 0.7. Mixing is fast.

12.2 Birth-death stationary distribution

For a birth-death chain with rates q_{i,i+1} = λ_i, q_{i,i-1} = μ_i, detailed balance gives the closed-form

π_i = π_0 · prod_{k=1..i} (λ_{k-1} / μ_k)

For the M/M/1 queue (λ_i = λ, μ_i = μ), π_i = (1 - ρ) ρ^i with ρ = λ/μ. Mean queue length ρ / (1 - ρ) — the classic blow-up as ρ → 1.

12.3 Coupling-from-the-past intuition

Propp-Wilson 1996 (“Exact Sampling with Coupled Markov Chains and Applications to Statistical Mechanics”) — run chains from -T, -T+1, ..., 0 under the same random innovations; if all starting states collapse to the same state at time 0, that state is a perfect sample from π. Powerful in lattice models and graph colorings; harder to apply in high-dimensional continuous targets. The PCG variant is implemented in R-package CFTP and in lattice-physics codes.

13. Identifiability and model selection

13.1 Identifiability

HMM parameters are identifiable only up to relabeling of hidden states — any permutation of state indices gives the same likelihood. Label-switching is the central nuisance for Bayesian HMM inference; standard remedies are constrained parametrization (order means) or post-hoc relabeling (Stephens 2000).

Beyond label switching, two HMMs with different N can produce identical observation distributions if one has redundant states. Allman-Matias-Rhodes 2009 (“Identifiability of Parameters in Latent Structure Models with Many Observed Variables”, Annals of Statistics) give algebraic conditions for unique identification from finite-sample marginals.

13.2 Choosing the number of states

  • AIC, BIC, ICL — penalized likelihood; BIC is consistent for N under mild conditions.
  • Cross-validated likelihood — leave-one-sequence-out for multi-sequence data.
  • Bayesian non-parametric HMMsHDP-HMM (Teh-Jordan-Beal-Blei 2006) and sticky HDP-HMM (Fox-Sudderth-Jordan-Willsky 2008) let the number of states be inferred. Implemented in pyhsmm (Matt Johnson).
  • Information-criterion sweep + model averaging is the pragmatic default in speech and bio.

13.3 Sample-complexity lower bounds

Hsu-Kakade-Zhang 2012 (“A Spectral Algorithm for Learning Hidden Markov Models”) give the first polynomial-sample-complexity learner for HMMs via tensor decomposition of triplet marginals, sidestepping EM local optima. Sample complexity polynomial in N, 1/ε, and inverse spectral gap.

14. Numerical pitfalls

14.1 Forward-algorithm underflow

α_t(i) shrinks geometrically — for long sequences α_T underflows in IEEE double around T ≈ 750 for chains with average emission probability < 0.05. Fixes:

  • Rescaling — divide each α_t by its sum c_t, accumulate log P(Y) = sum_t log(1/c_t).
  • Log-space forward — store log α_t and use logsumexp for the recursion; standard in modern libraries.

14.2 Viterbi underflow

Same problem, simpler fix: always run Viterbi in log-space. The recursion becomes a sequence of sums and max’s — numerically clean and O(N^2 T).

14.3 Baum-Welch local optima

Initialization matters a lot. Standard heuristics: k-means on observations for emission means, uniform transitions for A, multiple random restarts (10–50) keeping the best. Spectral / tensor initialization (Hsu-Kakade-Zhang 2012) gives provable warm-starts.

14.4 Kalman gain numerical drift

The naive (I - K H) P form loses positive-definiteness in single precision after thousands of updates. Always use Joseph form or square-root form for long-horizon filtering and especially for aerospace navigation (where INS integration runs for hours).

15. Continuous-time generalizations

15.1 Markov jump processes on uncountable state spaces

Generators become integral operators; theory in Feller-Dynkin processes. Q f(x) = lim_{h -> 0} (E[f(X_h) | X_0 = x] - f(x)) / h. The class of processes encompasses CTMC, Brownian motion (Q = (1/2) Δ), Ornstein-Uhlenbeck, and Lévy processes. See stochastic-calculus.

15.2 Markov decision processes (MDPs)

Add an action and a reward to a Markov chain and you get an MDP — the foundation of reinforcement learning. The Bellman operator T_π V(x) = R(x, π(x)) + γ sum_y P(y | x, π(x)) V(y) is a contraction (Banach 1922 fixed point) and the value-iteration / policy-iteration analyses inherit directly from the Markov-chain theory above. See reinforcement-learning-theory.

15.3 Filtering for diffusions — Zakai and Kushner-Stratonovich

Continuous-time analog of Kalman / HMM filtering is the Zakai equation (Zakai 1969) — a linear SPDE for the unnormalized conditional density. The normalized version is the Kushner-Stratonovich equation (Kushner 1964, Stratonovich 1960). Numerical solution typically by projection (Brigo-Hanzon-Le Gland 1998) or by particle approximation.

16. Computational complexity summary

AlgorithmTimeMemoryNotes
Forward (HMM evaluation)O(N^2 T)O(N T)O(N) if only likelihood needed
Viterbi (HMM decoding)O(N^2 T)O(N T)log-space arithmetic
Forward-Backward (smoothing)O(N^2 T)O(N T)two passes
Baum-Welch (per EM iter)O(N^2 T)O(N T)typically 20-200 iters
HSMM Forward-BackwardO(N^2 T D)O(N T D)D = max duration
Kalman filter stepO(n^3)O(n^2)dense; sparse can be O(n)
EKF stepO(n^3)O(n^2)plus Jacobian cost
UKF stepO(n^3)O(n^2)2n+1 sigma points
Bootstrap particle filter stepO(N_p)O(N_p n)N_p = particles
HMM spectral learning (Hsu-Kakade-Zhang)O(N^3 T + N^6)O(N^2)tensor decomposition

For real-time systems:

  • Voice activity detection on edge: HMM forward pass N = 4, T = 100/s → trivial.
  • LIDAR-based localization on autonomous vehicle: EKF with n ≈ 15 state at 100 Hz; well within budget.
  • Aerospace nav: 24-state EKF at 200 Hz, square-root form mandatory.

17. The HMM-as-graphical-model view

An HMM is the simplest non-trivial dynamic Bayesian network (DBN):

X_1 -> X_2 -> X_3 -> ... -> X_T
|     |     |             |
v     v     v             v
Y_1   Y_2   Y_3           Y_T

Conditional independence: X_t ⫫ X_{<t-1} | X_{t-1} and Y_t ⫫ everything | X_t. The HMM joint factorizes as

P(X, Y) = P(X_1) prod_t P(X_t | X_{t-1}) prod_t P(Y_t | X_t)

Belief propagation on this chain graph gives forward-backward exactly. Generalizing the chain to a tree gives junction-tree inference; generalizing to a loopy graph requires approximate inference (loopy BP, variational, sampling). The connection between HMM, Kalman, CRF, and modern factor graphs is laid out in Bishop 2006 chapter 13 and Murphy 2012 chapters 17-18.

18. Linear-chain CRF — the discriminative cousin

A conditional random field (Lafferty-McCallum-Pereira 2001, ICML) models P(X | Y) directly:

P(X | Y) = (1/Z(Y)) exp( sum_t [ θ^T f(X_{t-1}, X_t, Y, t) + λ^T g(X_t, Y, t) ] )

Same chain structure as HMM but undirected and discriminative — features can depend on the entire observation sequence Y, not just Y_t. Forward-backward still applies (the partition function Z(Y) is computed by the standard O(N^2 T) recursion). Trained by maximum conditional likelihood; widely used for NER and shallow parsing before being supplanted by BiLSTM-CRFs (Huang-Xu-Yu 2015) and Transformer taggers.

The CRF / HMM contrast — discriminative vs generative on the same chain graph — mirrors the logistic-regression / naive-Bayes contrast. Discriminative wins when features are rich and data is plentiful; generative wins with small data or missing-data scenarios.

19. Bayesian HMMs and HDP-HMM

19.1 Conjugate priors

Standard Bayesian HMM uses Dirichlet priors on rows of A and on π, and conjugate priors on each emission (e.g. NIW for Gaussian emissions). Gibbs sampler alternates: sample X_{1:T} given parameters (forward-filter backward-sample), sample parameters given X. Forward-filter backward-sample (Carter-Kohn 1994, Frühwirth-Schnatter 1994) draws one sample of X_{1:T} jointly — much better mixing than per-step Gibbs.

19.2 Hierarchical Dirichlet Process HMM

HDP-HMM (Teh-Jordan-Beal-Blei 2006) — uses a hierarchical Dirichlet process prior so the number of states is countably infinite a priori but only finitely many appear in any data set. Sticky HDP-HMM (Fox-Sudderth-Jordan-Willsky 2008, “An HDP-HMM for systems with state persistence”, ICML) adds a self-transition bias for realistic dwell times. Implemented in pyhsmm (Matt Johnson). Used in speaker diarization, behavioral state segmentation.

19.3 Variational Bayes HMM

MacKay 1997 and Beal 2003 derive a tractable variational lower bound for HMMs with conjugate priors. Implemented in bnpy. Often used as a regularizer / model-size selector.

Adjacent

  • mcmc-sampling — the algorithm family that consumes Markov-chain theory; Metropolis, HMC, NUTS.
  • variational-inference — EM that Baum-Welch instantiates; structured VI for HMMs.
  • probability-fundamentals — measure-theoretic backbone (filtrations, Markov property).
  • stochastic-calculus — continuous-state continuous-time analog (SDEs); Feller processes generalize CTMC.
  • time-series-and-hmm — Tier-2 sibling on classical time-series modeling.
  • bayesian-inference — broader Bayesian framework underlying particle filtering and HMM learning.
  • transformer-architecture — successor sequence model; S4/Mamba descend from linear state-space filters.
  • slam — concrete application of Kalman, EKF, UKF, and particle filters in robot localization and mapping.