Markov Chains & Hidden Markov Models
A Markov chain is a stochastic process where the future depends on the past only through the present — the memoryless property. Hidden Markov Models extend the framework by letting the chain itself be unobserved and inferring it from noisy observations. The two together form one of the most heavily reused probabilistic tools in physics, biology, finance, speech, and modern deep state-space models.
1. Discrete-Time Markov Chains (DTMC)
1.1 Formal setup
A sequence of random variables X_0, X_1, X_2, ... on countable state space S = {1, ..., N} is a Markov chain if
P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i) = P_{ij}
The conditional distribution given the past depends only on the immediate previous state. The matrix P = (P_{ij}) is the transition matrix, row-stochastic: each row sums to 1 and each entry is in [0, 1].
The chain is time-homogeneous when P_{ij} is independent of n. Non-homogeneous chains have P^{(n)} per step and are common in adaptive MCMC and stochastic-approximation settings.
1.2 Chapman-Kolmogorov
The n-step transition is the matrix power:
P^{(n)}_{ij} = P(X_n = j | X_0 = i) = (P^n)_{ij}
The Chapman-Kolmogorov equation says
P^{(m+n)}_{ij} = sum_k P^{(m)}_{ik} P^{(n)}_{kj}
This is just matrix multiplication once you accept the Markov property. Starting from distribution mu_0 (a row vector), the marginal at step n is mu_n = mu_0 P^n.
1.3 Classification of states
- Accessible
i -> jifP^{(n)}_{ij} > 0for somen >= 0. - Communicating
i <-> jif both directions hold. Communicating classes partition the state space. - Irreducible chain — exactly one communicating class.
- Recurrent state — returns with probability 1. Transient — finite expected number of visits.
- Positive recurrent — finite expected return time. Null recurrent — infinite expected return time but probability-1 return.
- Period
d(i) = gcd{n : P^{(n)}_{ii} > 0}. Aperiodic whend = 1. Self-loopP_{ii} > 0is sufficient but not necessary.
1.4 Stationary distribution
A row vector π with π_i >= 0, sum π_i = 1 is stationary when
π = π P
For finite irreducible aperiodic chains the fundamental limit theorem (Markov 1906, Doeblin 1937) gives
lim_{n -> ∞} P^{(n)}_{ij} = π_j for all i, j
So P^n converges to the rank-1 matrix 1 π. Existence and uniqueness of π is guaranteed for finite irreducible chains; countable-state chains need positive recurrence.
Detailed balance is a stronger condition:
π_i P_{ij} = π_j P_{ji} for all i, j
A chain satisfying detailed balance is reversible. Detailed balance implies stationarity but not conversely. Reversibility makes the transfer operator self-adjoint in L^2(π), which simplifies spectral analysis and is the key design lever for MCMC.
1.5 Perron-Frobenius and spectral gap
For an irreducible stochastic matrix P, the Perron-Frobenius theorem (Perron 1907, Frobenius 1912) gives:
- The dominant eigenvalue is
λ_1 = 1, with right eigenvector1 = (1, 1, ..., 1)^Tand left eigenvectorπ. - All eigenvalues satisfy
|λ_k| <= 1. - For aperiodic chains, every other eigenvalue is strictly less than 1 in modulus.
- The spectral gap is
γ = 1 - |λ_2|. Larger gap → faster mixing.
1.6 Mean first passage time
Define
m_{ij} = E[T_j | X_0 = i], T_j = inf{n >= 1 : X_n = j}
For finite irreducible chains, m_{jj} = 1 / π_j (Kac’s lemma). For i ≠ j,
m_{ij} = 1 + sum_{k ≠ j} P_{ik} m_{kj}
This is a linear system. Mean first passage matrices feed network-analysis metrics, MCMC convergence diagnostics, and PageRank-style centrality.
1.7 Mixing time
Mixing time is
t_mix(ε) = min{n : max_i || P^n(i, ·) - π ||_TV <= ε}
where ||·||_TV is total variation distance. For reversible chains the spectral gap bound gives
t_mix(ε) <= (1/γ) log(1 / (ε π_min))
A tighter geometric bound uses Cheeger’s inequality (Cheeger 1970, Jerrum-Sinclair 1989 for chains):
Φ^2 / 2 <= γ <= 2 Φ, Φ = min_{S : π(S) <= 1/2} Q(S, S^c) / π(S)
where Q(S, S^c) is the probability flux out of set S under the stationary chain. Conductance Φ is hard to compute but easy to lower-bound combinatorially; Cheeger then gives a mixing rate. This connection between graph expansion and chain mixing drives modern analyses of MCMC and randomized algorithms.
1.8 Ergodic theorem
For irreducible positive-recurrent chains and any f with E_π[|f|] < ∞,
(1/n) sum_{k=0}^{n-1} f(X_k) -> E_π[f] almost surely
This is the engine of Markov chain Monte Carlo — time averages along a trajectory equal space averages under π, and you do not need independent samples.
2. Continuous-Time Markov Chains (CTMC)
2.1 Generator matrix
For continuous time t >= 0 define P_{ij}(t) = P(X_t = j | X_0 = i). The transition matrix P(t) satisfies the semigroup property P(s + t) = P(s) P(t) and P(0) = I. The infinitesimal generator (also called rate matrix or Q-matrix) is
Q = lim_{t -> 0+} (P(t) - I) / t
with off-diagonal q_{ij} >= 0 (rate of jumps from i to j) and diagonal q_{ii} = -sum_{j ≠ i} q_{ij} (negative). Rows sum to zero.
2.2 Holding times and embedded chain
Sitting in state i, the holding time is exponential with rate -q_{ii}:
P(holding time > t | X_0 = i) = exp(q_{ii} t)
On leaving i, the next state is j with probability q_{ij} / (-q_{ii}). The sequence of visited states (ignoring durations) is the embedded discrete chain with transition matrix
P^{embed}_{ij} = q_{ij} / (-q_{ii}) for i ≠ j, 0 on diagonal
2.3 Kolmogorov equations
P(t) satisfies the Kolmogorov forward equation (also called the master equation in physics):
dP(t) / dt = P(t) Q
and the Kolmogorov backward equation:
dP(t) / dt = Q P(t)
Both are matrix ODEs with solution P(t) = exp(t Q). For probability distributions μ(t) = μ(0) P(t),
dμ/dt = μ Q
— a linear ODE, the master equation in statistical physics.
2.4 Stationary distribution and detailed balance
π stationary iff π Q = 0. Detailed balance is
π_i q_{ij} = π_j q_{ji}
implying time-reversibility. CTMCs in physics with detailed balance are at thermal equilibrium; without it they are in non-equilibrium steady state (NESS).
2.5 Classical chains
- Ehrenfest urn (Ehrenfest 1907) —
Nparticles in two urns, swap one at random per step. State = particles in urn 1. Used to model diffusion and to demolish naive “no recurrence in physics” arguments. - Wright-Fisher model (Wright 1931, Fisher 1930) —
2Ngene copies; next generation samples with replacement. State = count of allele A. Foundational in population genetics. The Kingman coalescent (Kingman 1982) is the genealogical dual. - M/M/1 queue — Poisson arrivals rate
λ, exponential service rateμ, single server. State = number in system. Stable whenρ = λ/μ < 1; stationary distribution geometric with parameterρ. Backbone of networking-foundations traffic models. - Erlang B —
mservers, blocked customers cleared. Erlang B formulaB(m, ρ) = (ρ^m / m!) / sum_{k=0..m} (ρ^k / k!)gives the blocking probability. Erlang C —mservers with infinite queue; gives wait probability. Telecom and call-center sizing. - Birth-death chains — transitions only to neighbors,
q_{i, i+1} = λ_i,q_{i, i-1} = μ_i. M/M/1 and Ehrenfest are birth-death; nearly always tractable analytically. - Random walks — Z^1, Z^2 random walks are recurrent; Z^3 and higher are transient (Pólya 1921). The classical “drunk goes home but drunk bird lost forever” result.
3. MCMC and the design of reversible chains
Markov chain Monte Carlo turns the ergodic theorem into an estimation tool: construct a chain whose stationary distribution is the target π and average along trajectories. Detailed balance is the standard design constraint because it makes verifying π trivial.
3.1 Metropolis-Hastings (Metropolis et al 1953, Hastings 1970)
Given proposal density q(y | x) and target π(x), the acceptance probability is
α(x, y) = min{1, [π(y) q(x | y)] / [π(x) q(y | x)]}
This satisfies detailed balance by construction. Random-walk Metropolis uses q(y | x) = N(x, Σ) and acceptance reduces to π(y) / π(x). Independence sampler uses q(y | x) = q(y). Hamiltonian Monte Carlo (Duane et al 1987, Neal 1993, Hoffman-Gelman NUTS 2014) uses gradient-driven Hamiltonian dynamics as the proposal; covered in mcmc-sampling.
3.2 Gibbs sampling (Geman-Geman 1984)
When the full conditionals π(x_i | x_{-i}) are easy, sweep through them. Detailed balance holds for each conditional update, and the composition (deterministic-scan or random-scan) leaves π invariant. Gibbs is a special case of Metropolis-Hastings with acceptance 1.
3.3 Reversible vs non-reversible MCMC
Non-reversible chains can mix faster than the best reversible chain — the lifted chain construction (Diaconis-Holmes-Neal 2000) duplicates each state with a direction bit and breaks reversibility. Used in modern algorithms like bouncy particle sampler (Bouchard-Côté et al 2018) and zig-zag process (Bierkens-Fearnhead-Roberts 2019).
4. Hidden Markov Models
A Hidden Markov Model is a doubly-stochastic process: a latent Markov chain X_1, ..., X_T over states {1, ..., N} and an observed sequence Y_1, ..., Y_T such that Y_t depends only on X_t. The HMM is fully specified by λ = (A, B, π):
A_{ij} = P(X_{t+1} = j | X_t = i)— transition matrixb_j(y) = P(Y_t = y | X_t = j)— emission density / distributionπ_i = P(X_1 = i)— initial state distribution
Following the canonical reference Rabiner 1989 (“A Tutorial on Hidden Markov Models”, Proc. IEEE 77(2)), HMM tasks split into three problems.
4.1 Problem 1 — evaluation
Given λ and observation sequence Y_{1:T}, compute the likelihood P(Y_{1:T} | λ). Brute force sum_X P(Y, X | λ) is O(N^T) — intractable. The Forward algorithm uses
α_1(i) = π_i b_i(Y_1)
α_{t+1}(j) = [sum_i α_t(i) A_{ij}] b_j(Y_{t+1})
P(Y_{1:T} | λ) = sum_i α_T(i)
This runs in O(N^2 T). Numerical underflow is handled by scaling each α_t to sum to 1 and keeping the log-scale factor.
4.2 Problem 2 — decoding
Find argmax_X P(X | Y, λ). The Viterbi algorithm (Viterbi 1967) — originally for convolutional codes — is dynamic programming in log-space:
δ_1(i) = log π_i + log b_i(Y_1)
δ_{t+1}(j) = max_i [δ_t(i) + log A_{ij}] + log b_j(Y_{t+1})
ψ_{t+1}(j) = argmax_i [δ_t(i) + log A_{ij}]
with traceback via ψ. Cost O(N^2 T), single pass.
Two distinct notions of “decode” exist:
- Viterbi path — joint MAP
argmax_X P(X | Y). - Posterior decoding — per-step MAP
argmax_{x_t} P(X_t = x_t | Y), computed from forward-backward marginals. May yield a path with zero probability under the model but is sometimes preferred for biology (e.g. gene finding labels per nucleotide).
4.3 Problem 3 — learning
Estimate λ from observations. Baum-Welch (Baum-Petrie 1966, Baum-Eagon 1967, Baum 1972) is the EM algorithm applied to HMMs:
-
E-step — forward-backward gives marginals
γ_t(i) = P(X_t = i | Y)and pair-marginalsξ_t(i, j) = P(X_t = i, X_{t+1} = j | Y). -
M-step
π_i^new = γ_1(i) A_{ij}^new = (sum_t ξ_t(i, j)) / (sum_t γ_t(i)) b_j(y)^new ∝ sum_{t : Y_t = y} γ_t(j) (discrete emissions)
Monotone non-decrease of P(Y | λ) at each iteration follows the general EM convergence (Dempster-Laird-Rubin 1977). Local optima are the price; multiple restarts and informed initialization (e.g. k-means on observations) help in practice. Discussed in the broader EM framework at variational-inference.
4.4 Forward-Backward smoothing
The Backward variable
β_T(i) = 1
β_t(i) = sum_j A_{ij} b_j(Y_{t+1}) β_{t+1}(j)
combines with forward to give marginal posterior
γ_t(i) = α_t(i) β_t(i) / sum_k α_t(k) β_t(k)
and pairwise
ξ_t(i, j) = α_t(i) A_{ij} b_j(Y_{t+1}) β_{t+1}(j) / P(Y | λ)
Forward-backward is smoothing — using the whole sequence to estimate X_t. Forward alone is filtering — using only Y_{1:t}.
5. Continuous emissions
5.1 Gaussian emissions
b_j(y) = N(y; μ_j, Σ_j). M-step gives
μ_j^new = (sum_t γ_t(j) Y_t) / (sum_t γ_t(j))
Σ_j^new = (sum_t γ_t(j) (Y_t - μ_j)(Y_t - μ_j)^T) / (sum_t γ_t(j))
5.2 Gaussian mixture emissions
b_j(y) = sum_m c_{jm} N(y; μ_{jm}, Σ_{jm}). Treat mixture component as a second latent variable; full EM has nested forward-backward + standard GMM M-step. Used in classical HMM-GMM speech recognition (1980s–early 2010s).
5.3 Multivariate continuous, autoregressive
Coupling consecutive observations directly gives autoregressive HMMs: Y_t | X_t, Y_{t-1} ∼ N(A_{X_t} Y_{t-1} + b_{X_t}, Σ_{X_t}). Combines with Markov switching state-space models for finance and econometrics.
6. Variations and extensions
6.1 Hidden Semi-Markov Models (HSMM)
In a vanilla HMM the duration in state j is geometric, P(d) = A_{jj}^{d-1}(1 - A_{jj}), often unrealistic. HSMMs (Ferguson 1980) attach an explicit duration distribution p_j(d). Forward-backward becomes O(N^2 T D) for max-duration D. Speech, biology, animal-behavior modeling.
6.2 Hierarchical HMMs
HHMM (Fine-Singer-Tishby 1998) — nested HMMs where states themselves emit subsequences from sub-HMMs. Reformulated as a flat HMM (Murphy-Paskin 2001) it admits standard forward-backward at O(D N^2 T) for depth D.
6.3 Factorial HMMs
Multiple parallel chains share an emission distribution: X_t = (X_t^{(1)}, ..., X_t^{(M)}). State space is exponential, so exact inference requires structured variational methods (Ghahramani-Jordan 1997). A natural fit for multi-source speech and source separation.
6.4 Input-Output HMMs
Bengio-Frasconi 1996. Transitions and emissions condition on an external input sequence; precursor to attention-based seq2seq.
7. Particle filters / sequential Monte Carlo
For nonlinear / non-Gaussian state-space models exact filtering is intractable. Particle filters represent the posterior by a weighted sample.
7.1 Bootstrap filter (Gordon-Salmond-Smith 1993)
For state-space model
X_t = f(X_{t-1}) + w_t, Y_t = h(X_t) + v_t
with particles {x_t^{(i)}, w_t^{(i)}}:
- Predict — propagate
x_t^{(i)} ∼ p(X_t | X_{t-1} = x_{t-1}^{(i)}). - Weight —
w_t^{(i)} ∝ p(Y_t | X_t = x_t^{(i)}). - Resample (SIR) — draw
Nparticles with replacement from{x_t^{(i)}, w_t^{(i)}}.
Resampling cures degeneracy (one particle hogs all weight). After resampling all weights reset to 1/N.
7.2 Effective sample size (ESS)
ESS = 1 / sum_i (w_t^{(i)})^2
Resample only when ESS < N/2 to balance variance reduction against the loss from resampling. Liu 2001 (“Monte Carlo Strategies in Scientific Computing”) is the reference.
7.3 Auxiliary particle filter (Pitt-Shephard 1999)
Use a look-ahead at Y_t to pre-select particles likely to survive the upcoming weighting step. Reduces variance when likelihood is peaked relative to the prior.
7.4 Regularization and Rao-Blackwellization
- Regularized PF (Musso-Oudjane-Le Gland 2001) — smooth the resampled empirical measure to combat sample impoverishment, often via a Gaussian kernel.
- Rao-Blackwellized PF (Doucet-de Freitas-Murphy-Russell 2000) — marginalize out a linear-Gaussian subset of the state analytically, particle-filter only the nonlinear subset. Standard for FastSLAM (Montemerlo-Thrun 2002); see slam.
7.5 Smoothing and learning
Forward-filtering backward-smoothing for PF — Godsill-Doucet-West 2004. Particle MCMC (Andrieu-Doucet-Holenstein 2010) embeds a PF inside an MH step to do parameter inference in nonlinear state-space models.
8. Kalman filter family
For linear-Gaussian state space models the optimal Bayesian filter is closed form — the Kalman filter (Kalman 1960).
8.1 Standard Kalman filter
Model:
X_t = F X_{t-1} + B u_t + w_t, w_t ∼ N(0, Q)
Y_t = H X_t + v_t, v_t ∼ N(0, R)
Predict:
x̂_{t|t-1} = F x̂_{t-1|t-1} + B u_t
P_{t|t-1} = F P_{t-1|t-1} F^T + Q
Update:
y_t = Y_t - H x̂_{t|t-1} (innovation)
S_t = H P_{t|t-1} H^T + R (innovation covariance)
K_t = P_{t|t-1} H^T S_t^{-1} (Kalman gain)
x̂_{t|t} = x̂_{t|t-1} + K_t y_t
P_{t|t} = (I - K_t H) P_{t|t-1}
8.2 Numerical-stability forms
-
Joseph form
P_{t|t} = (I - K_t H) P_{t|t-1} (I - K_t H)^T + K_t R K_t^TSymmetric and positive-semidefinite by construction even with floating-point error.
-
Information form — track
Y_t = P^{-1}andy_t = P^{-1} x̂. Predict is harder, update is simpler; natural for sensor fusion with many measurements. -
Square-root form — track a Cholesky factor
LwithP = L L^T; update via QR. Doubles the working precision in floating point.
8.3 Extended Kalman Filter (EKF)
For
X_t = f(X_{t-1}, u_t) + w_t, Y_t = h(X_t) + v_t
linearize around the current estimate, F_t = ∂f/∂x, H_t = ∂h/∂x. First-order accurate; biased and can diverge for strong nonlinearity. The default before UKF/PF in many engineering domains.
8.4 Unscented Kalman Filter (UKF)
Julier-Uhlmann 1997. Propagate 2n + 1 deterministically chosen sigma points through f and h, fit a Gaussian to the transformed cloud. No Jacobians required and second-order accurate for Gaussian inputs. The unscented transform underlies a family of derivative-free filters.
8.5 Cubature Kalman Filter
Arasaratnam-Haykin 2009 — replace sigma points with a third-order spherical-radial cubature rule. Cleaner derivation, sometimes more stable for high-dimensional state.
8.6 Interacting Multiple Model (IMM)
Run a bank of Kalman filters under different motion models (e.g. constant-velocity vs constant-acceleration in tracking), mix the estimates by a Markov transition over model index. Standard in radar tracking and autonomous-driving target estimation.
8.7 Connection to HMM forward algorithm
The Kalman forward pass is exactly the HMM forward algorithm in the linear-Gaussian limit, with Gaussian sufficient statistics replacing the categorical α_t. The Rauch-Tung-Striebel (RTS) smoother (Rauch-Tung-Striebel 1965) is the backward pass — direct Gaussian analog of HMM forward-backward.
9. Applications
9.1 Speech recognition
The HMM era of automatic speech recognition (ASR) ran from Rabiner 1989 through the early 2010s. State = phone (or sub-phone “senone”), emission = GMM over MFCC features (HMM-GMM). The DNN-HMM hybrid era (Hinton-Deng-Yu et al 2012, “Deep Neural Networks for Acoustic Modeling in Speech Recognition”) replaced the GMM with a DNN that estimated P(state | observation). Pure end-to-end recurrent and Transformer ASR (Graves CTC 2006, Chan-Jaitly-Le-Vinyals “Listen Attend and Spell” 2016, Conformer 2020) eventually supplanted HMM decoding, but the alignment intuition from forward-backward survives in CTC (Connectionist Temporal Classification, Graves et al 2006) — itself an HMM in disguise.
9.2 Part-of-speech tagging
A first-order HMM with tag = state and word = observation gave the standard NLP baseline before CRFs (Lafferty-McCallum-Pereira 2001) and neural sequence taggers. The HMM Viterbi tagger still ships as a fallback in NLTK, spaCy, and Stanford CoreNLP.
9.3 Computational biology
- Gene finding — GENSCAN (Burge-Karlin 1997, Stanford) — generalized HMM over the eukaryotic gene grammar (exons, introns, splice sites). The standard until learning-based tools (AUGUSTUS 2006, BRAKER) caught up.
- Profile HMMs — HMMER (Sean Eddy, HHMI; HMMER 1, 2, 3 from 1995 onward) and SAM (Krogh-Brown-Mian-Sjolander-Haussler 1994) — model conserved protein/DNA family profiles. PFam database is built on HMMER profiles.
- CpG island detection — two-state HMM (in / out of island) with nucleotide emissions; canonical first example in Durbin-Eddy-Krogh-Mitchison 1998 (“Biological Sequence Analysis”).
- Ion channel kinetics — single-channel patch-clamp recordings analyzed by HMMs of channel-open/closed states (Chung-Moore-Xia 1990, Qin-Auerbach-Sachs 1996). MAC-ion-channel work funded by NIH for decades.
9.4 Finance and econometrics
Hamilton 1989 (“A New Approach to the Economic Analysis of Nonstationary Time Series and the Business Cycle”, Econometrica) introduced the Markov-switching ARMA — regime-switching AR with hidden regime via HMM filter. Hamilton filter = HMM forward pass on the regime. Extensions: MSGARCH (Markov-switching GARCH, Haas-Mittnik-Paolella 2004) for volatility regimes. Markov-modulated Poisson processes price credit and insurance.
9.5 Activity and behavior modeling
Sleep staging (Rechtschaffen-Kales 1968 staging, automated by HMM since the 1990s); animal-behavior segmentation (Anderson-Perona MoSeq 2015 for mice); wearable-sensor activity recognition.
10. Software ecosystem
Discrete and Gaussian HMM:
- hmmlearn — Python, scikit-learn-style API, GMM + categorical emissions.
- pomegranate — Python, PyTorch backend, supports more general emission distributions and discrete Bayes nets.
- TensorFlow Probability —
tfp.distributions.HiddenMarkovModel, GPU-accelerated, differentiable end-to-end. - NumPyro / PyMC — probabilistic programming with HMM via the
Scanprimitive (NumPyro) orScanin Theano/PyTensor (PyMC). - R:
hmm,depmixS4,HiddenMarkov. - JAX-HMM — pure-JAX implementation, good for research.
Particle filters:
- pfilter — minimal Python implementation.
- particles (Chopin) — Python, comprehensive SMC including PMCMC and SMC^2.
- NimbleSMC (R), LibBi (C++).
Kalman filters:
- filterpy (Roger Labbe) — Python; canonical KF / EKF / UKF / IMM reference implementation.
- FilterPy examples in the textbook “Kalman and Bayesian Filters in Python”.
- GTSAM (Dellaert et al, Georgia Tech) — factor-graph framing; used heavily in robotics SLAM.
- OpenCV —
cv::KalmanFilterfor vision tracking. - EKF/UKF baked into ROS 2 nav2 and Cartographer; see slam.
11. Frontiers
11.1 Deep state-space models
- Switching Linear Dynamical Systems (Linderman-Miller-Adams-Paninski-Ryden-Brubaker 2017, “Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems”) — recurrent SLDS with deep transition functions and amortized inference. Used in neuroscience to model behaving animals.
- Structured variational inference for HMM — Johnson-Duvenaud-Wiltschko-Datta-Adams 2016 (“Composing graphical models with neural networks for structured representations and fast inference”). Couple a DNN emission with exact HMM forward-backward for posteriors.
- Neural HMMs — Tran-Bauer-Sankaran-Daniluk-Rush-Reisinger-Burch 2016 and many follow-ups; transition and emission parametrized by neural nets, trained by gradient.
- S4 / S5 / Mamba (Gu-Goel-Re 2022, Gu-Dao 2023) — deep linear state-space models trainable as convolutions; descend from the same continuous-time linear-Gaussian filter and have eaten part of the Transformer’s lunch for long-sequence modeling. See transformer-architecture.
- CTC and RNN-T as HMMs — explicit derivation in Sak-Senior-Beaufays 2014 and Graves 2012. Bridge between HMM theory and modern sequence models.
11.2 Differentiable particle filters
Naesseth-Linderman-Ranganath-Blei 2018 (“Variational Sequential Monte Carlo”), Le-Igl-Rainforth-Jin-Wood 2018 (“Auto-Encoding Sequential Monte Carlo”) — backprop through PF resampling using soft / Gumbel resampling and reparameterized proposals. Enables end-to-end training of nonlinear state-space models.
11.3 Non-reversible MCMC, lifted chains
See section 3.3 — current frontier of MCMC theory. Connected to mcmc-sampling and stochastic-calculus.
11.4 Quantum Markov chains
Quantum analog where state is a density matrix and transitions are CPTP maps. Foundational for quantum Monte Carlo and dissipative quantum dynamics.
12. Worked numerical micro-examples
12.1 Two-state DTMC stationary distribution
Take
P = [[0.7, 0.3],
[0.4, 0.6]]
Solve π P = π, π_1 + π_2 = 1:
0.7 π_1 + 0.4 π_2 = π_1 => -0.3 π_1 + 0.4 π_2 = 0
π_1 = 0.4 / 0.7 · π_2 = (4/7) π_2
π_1 + π_2 = 1 => π_2 (4/7 + 1) = 1 => π_2 = 7/11
π_1 = 4/11
Spectral check: trace P is 1.3, determinant is 0.7·0.6 - 0.3·0.4 = 0.30. Eigenvalues are λ_1 = 1, λ_2 = 0.3. Spectral gap γ = 0.7. Mixing is fast.
12.2 Birth-death stationary distribution
For a birth-death chain with rates q_{i,i+1} = λ_i, q_{i,i-1} = μ_i, detailed balance gives the closed-form
π_i = π_0 · prod_{k=1..i} (λ_{k-1} / μ_k)
For the M/M/1 queue (λ_i = λ, μ_i = μ), π_i = (1 - ρ) ρ^i with ρ = λ/μ. Mean queue length ρ / (1 - ρ) — the classic blow-up as ρ → 1.
12.3 Coupling-from-the-past intuition
Propp-Wilson 1996 (“Exact Sampling with Coupled Markov Chains and Applications to Statistical Mechanics”) — run chains from -T, -T+1, ..., 0 under the same random innovations; if all starting states collapse to the same state at time 0, that state is a perfect sample from π. Powerful in lattice models and graph colorings; harder to apply in high-dimensional continuous targets. The PCG variant is implemented in R-package CFTP and in lattice-physics codes.
13. Identifiability and model selection
13.1 Identifiability
HMM parameters are identifiable only up to relabeling of hidden states — any permutation of state indices gives the same likelihood. Label-switching is the central nuisance for Bayesian HMM inference; standard remedies are constrained parametrization (order means) or post-hoc relabeling (Stephens 2000).
Beyond label switching, two HMMs with different N can produce identical observation distributions if one has redundant states. Allman-Matias-Rhodes 2009 (“Identifiability of Parameters in Latent Structure Models with Many Observed Variables”, Annals of Statistics) give algebraic conditions for unique identification from finite-sample marginals.
13.2 Choosing the number of states
- AIC, BIC, ICL — penalized likelihood; BIC is consistent for
Nunder mild conditions. - Cross-validated likelihood — leave-one-sequence-out for multi-sequence data.
- Bayesian non-parametric HMMs — HDP-HMM (Teh-Jordan-Beal-Blei 2006) and sticky HDP-HMM (Fox-Sudderth-Jordan-Willsky 2008) let the number of states be inferred. Implemented in
pyhsmm(Matt Johnson). - Information-criterion sweep + model averaging is the pragmatic default in speech and bio.
13.3 Sample-complexity lower bounds
Hsu-Kakade-Zhang 2012 (“A Spectral Algorithm for Learning Hidden Markov Models”) give the first polynomial-sample-complexity learner for HMMs via tensor decomposition of triplet marginals, sidestepping EM local optima. Sample complexity polynomial in N, 1/ε, and inverse spectral gap.
14. Numerical pitfalls
14.1 Forward-algorithm underflow
α_t(i) shrinks geometrically — for long sequences α_T underflows in IEEE double around T ≈ 750 for chains with average emission probability < 0.05. Fixes:
- Rescaling — divide each
α_tby its sumc_t, accumulatelog P(Y) = sum_t log(1/c_t). - Log-space forward — store
log α_tand uselogsumexpfor the recursion; standard in modern libraries.
14.2 Viterbi underflow
Same problem, simpler fix: always run Viterbi in log-space. The recursion becomes a sequence of sums and max’s — numerically clean and O(N^2 T).
14.3 Baum-Welch local optima
Initialization matters a lot. Standard heuristics: k-means on observations for emission means, uniform transitions for A, multiple random restarts (10–50) keeping the best. Spectral / tensor initialization (Hsu-Kakade-Zhang 2012) gives provable warm-starts.
14.4 Kalman gain numerical drift
The naive (I - K H) P form loses positive-definiteness in single precision after thousands of updates. Always use Joseph form or square-root form for long-horizon filtering and especially for aerospace navigation (where INS integration runs for hours).
15. Continuous-time generalizations
15.1 Markov jump processes on uncountable state spaces
Generators become integral operators; theory in Feller-Dynkin processes. Q f(x) = lim_{h -> 0} (E[f(X_h) | X_0 = x] - f(x)) / h. The class of processes encompasses CTMC, Brownian motion (Q = (1/2) Δ), Ornstein-Uhlenbeck, and Lévy processes. See stochastic-calculus.
15.2 Markov decision processes (MDPs)
Add an action and a reward to a Markov chain and you get an MDP — the foundation of reinforcement learning. The Bellman operator T_π V(x) = R(x, π(x)) + γ sum_y P(y | x, π(x)) V(y) is a contraction (Banach 1922 fixed point) and the value-iteration / policy-iteration analyses inherit directly from the Markov-chain theory above. See reinforcement-learning-theory.
15.3 Filtering for diffusions — Zakai and Kushner-Stratonovich
Continuous-time analog of Kalman / HMM filtering is the Zakai equation (Zakai 1969) — a linear SPDE for the unnormalized conditional density. The normalized version is the Kushner-Stratonovich equation (Kushner 1964, Stratonovich 1960). Numerical solution typically by projection (Brigo-Hanzon-Le Gland 1998) or by particle approximation.
16. Computational complexity summary
| Algorithm | Time | Memory | Notes |
|---|---|---|---|
| Forward (HMM evaluation) | O(N^2 T) | O(N T) | O(N) if only likelihood needed |
| Viterbi (HMM decoding) | O(N^2 T) | O(N T) | log-space arithmetic |
| Forward-Backward (smoothing) | O(N^2 T) | O(N T) | two passes |
| Baum-Welch (per EM iter) | O(N^2 T) | O(N T) | typically 20-200 iters |
| HSMM Forward-Backward | O(N^2 T D) | O(N T D) | D = max duration |
| Kalman filter step | O(n^3) | O(n^2) | dense; sparse can be O(n) |
| EKF step | O(n^3) | O(n^2) | plus Jacobian cost |
| UKF step | O(n^3) | O(n^2) | 2n+1 sigma points |
| Bootstrap particle filter step | O(N_p) | O(N_p n) | N_p = particles |
| HMM spectral learning (Hsu-Kakade-Zhang) | O(N^3 T + N^6) | O(N^2) | tensor decomposition |
For real-time systems:
- Voice activity detection on edge: HMM forward pass
N = 4,T = 100/s→ trivial. - LIDAR-based localization on autonomous vehicle: EKF with
n ≈ 15state at 100 Hz; well within budget. - Aerospace nav: 24-state EKF at 200 Hz, square-root form mandatory.
17. The HMM-as-graphical-model view
An HMM is the simplest non-trivial dynamic Bayesian network (DBN):
X_1 -> X_2 -> X_3 -> ... -> X_T
| | | |
v v v v
Y_1 Y_2 Y_3 Y_T
Conditional independence: X_t ⫫ X_{<t-1} | X_{t-1} and Y_t ⫫ everything | X_t. The HMM joint factorizes as
P(X, Y) = P(X_1) prod_t P(X_t | X_{t-1}) prod_t P(Y_t | X_t)
Belief propagation on this chain graph gives forward-backward exactly. Generalizing the chain to a tree gives junction-tree inference; generalizing to a loopy graph requires approximate inference (loopy BP, variational, sampling). The connection between HMM, Kalman, CRF, and modern factor graphs is laid out in Bishop 2006 chapter 13 and Murphy 2012 chapters 17-18.
18. Linear-chain CRF — the discriminative cousin
A conditional random field (Lafferty-McCallum-Pereira 2001, ICML) models P(X | Y) directly:
P(X | Y) = (1/Z(Y)) exp( sum_t [ θ^T f(X_{t-1}, X_t, Y, t) + λ^T g(X_t, Y, t) ] )
Same chain structure as HMM but undirected and discriminative — features can depend on the entire observation sequence Y, not just Y_t. Forward-backward still applies (the partition function Z(Y) is computed by the standard O(N^2 T) recursion). Trained by maximum conditional likelihood; widely used for NER and shallow parsing before being supplanted by BiLSTM-CRFs (Huang-Xu-Yu 2015) and Transformer taggers.
The CRF / HMM contrast — discriminative vs generative on the same chain graph — mirrors the logistic-regression / naive-Bayes contrast. Discriminative wins when features are rich and data is plentiful; generative wins with small data or missing-data scenarios.
19. Bayesian HMMs and HDP-HMM
19.1 Conjugate priors
Standard Bayesian HMM uses Dirichlet priors on rows of A and on π, and conjugate priors on each emission (e.g. NIW for Gaussian emissions). Gibbs sampler alternates: sample X_{1:T} given parameters (forward-filter backward-sample), sample parameters given X. Forward-filter backward-sample (Carter-Kohn 1994, Frühwirth-Schnatter 1994) draws one sample of X_{1:T} jointly — much better mixing than per-step Gibbs.
19.2 Hierarchical Dirichlet Process HMM
HDP-HMM (Teh-Jordan-Beal-Blei 2006) — uses a hierarchical Dirichlet process prior so the number of states is countably infinite a priori but only finitely many appear in any data set. Sticky HDP-HMM (Fox-Sudderth-Jordan-Willsky 2008, “An HDP-HMM for systems with state persistence”, ICML) adds a self-transition bias for realistic dwell times. Implemented in pyhsmm (Matt Johnson). Used in speaker diarization, behavioral state segmentation.
19.3 Variational Bayes HMM
MacKay 1997 and Beal 2003 derive a tractable variational lower bound for HMMs with conjugate priors. Implemented in bnpy. Often used as a regularizer / model-size selector.
Adjacent
- mcmc-sampling — the algorithm family that consumes Markov-chain theory; Metropolis, HMC, NUTS.
- variational-inference — EM that Baum-Welch instantiates; structured VI for HMMs.
- probability-fundamentals — measure-theoretic backbone (filtrations, Markov property).
- stochastic-calculus — continuous-state continuous-time analog (SDEs); Feller processes generalize CTMC.
- time-series-and-hmm — Tier-2 sibling on classical time-series modeling.
- bayesian-inference — broader Bayesian framework underlying particle filtering and HMM learning.
- transformer-architecture — successor sequence model; S4/Mamba descend from linear state-space filters.
- slam — concrete application of Kalman, EKF, UKF, and particle filters in robot localization and mapping.