Sampling Algorithms Catalog
A reference catalog of algorithms for drawing samples from probability distributions, organized by family. Symbols follow standard usage: target density π(x) (often unnormalized π̃(x) = Zπ(x) with Z unknown), proposal q(x’|x), state space X ⊆ R^d, acceptance probability α, Markov transition kernel K(x → x’), and stationary distribution π. Throughout, X ~ p means X is distributed according to p.
1. Direct (non-iterative) samplers
1.1 Inverse-CDF (probability integral transform)
For univariate target with CDF F: if U ~ Uniform(0,1), then X = F^{-1}(U) ~ F. Requires closed-form or numerical inverse of F. Examples: exponential X = -log(1-U)/λ; Gumbel; Cauchy X = tan(π(U - 1/2)).
Reference: Devroye, L. (1986). Non-Uniform Random Variate Generation. Springer.
1.2 Box-Muller (1958)
Two independent standard normals from two uniforms:
- Z_1 = sqrt(-2 ln U_1) cos(2π U_2)
- Z_2 = sqrt(-2 ln U_1) sin(2π U_2)
Marsaglia polar variant avoids trig calls. Reference: Box, G.E.P. and Muller, M.E. (1958). “A note on the generation of random normal deviates,” Annals of Mathematical Statistics, 29(2), 610-611.
1.3 Ziggurat (Marsaglia-Tsang 2000)
Rectangle-decomposition rejection sampler for normal and exponential. Mostly accepts on first table lookup; very fast for repeated draws. Reference: Marsaglia, G. and Tsang, W.W. (2000). “The Ziggurat Method for Generating Random Variables,” Journal of Statistical Software, 5(8).
1.4 Rejection sampling (von Neumann 1951)
To draw X ~ π given an envelope q with π(x) ≤ M q(x) for finite M ≥ 1:
- Draw X ~ q, U ~ Uniform(0,1)
- Accept X if U ≤ π(X) / (M q(X)); else repeat
Acceptance probability is 1/M. Adaptive Rejection Sampling (ARS) of Gilks and Wild (1992) builds piecewise-linear log-concave envelopes adaptively. ARS extends to log-concave densities with derivatives; ARMS (Gilks, Best, Tan 1995) adds a Metropolis step for non-log-concave targets.
References: von Neumann, J. (1951). “Various techniques used in connection with random digits,” NBS Applied Math Series, 12, 36-38. Gilks, W.R. and Wild, P. (1992). “Adaptive rejection sampling for Gibbs sampling,” Applied Statistics, 41, 337-348.
1.5 Ratio-of-Uniforms (Kinderman-Monahan 1977)
For density f(x) ∝ h(x), define A = {(u,v) : 0 < u ≤ sqrt(h(v/u))}. If (U,V) ~ Uniform(A), then V/U ~ f. The region A is bounded for a wide class of distributions. Generalizations: Wakefield, Gelfand, and Smith (1991).
Reference: Kinderman, A.J. and Monahan, J.F. (1977). “Computer generation of random variables using the ratio of uniform deviates,” ACM Transactions on Mathematical Software, 3(3), 257-260.
1.6 Composition method
Mixture sampling: to sample X ~ Σ_k w_k p_k(x), first sample index K with P(K=k) = w_k, then X ~ p_K. Useful when components are easy and mixture target is hard.
1.7 Importance sampling
For estimating E_π[f(X)] = ∫ f(x) π(x) dx using proposal q with q > 0 wherever π > 0:
- Draw X_1, …, X_N ~ q
- Compute weights w_i = π(X_i) / q(X_i)
- Estimator: μ̂ = (1/N) Σ_i w_i f(X_i) (unbiased)
Self-normalized importance sampling (SNIS) uses w̃_i = w_i / Σ_j w_j; biased but variance often lower; required when π is known only up to normalization. Effective sample size ESS = (Σ w_i)² / Σ w_i² ∈ [1, N]. Optimal proposal q*(x) ∝ |f(x)| π(x) (Kahn-Marshall 1953).
Reference: Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed., Springer.
1.8 Stratified sampling
Partition X into strata X_1, …, X_K with weights p_k = π(X_k). Allocate n_k samples per stratum (often n_k ∝ p_k σ_k for variance-optimal Neyman allocation, Neyman 1934). Estimator: Σ_k p_k μ̂_k.
1.9 Latin Hypercube Sampling (McKay-Beckman-Conover 1979)
In d dimensions with N samples, partition each axis into N equal-probability intervals; place one sample in each row and column (a Latin square arrangement). Lower variance than crude MC for additive integrands.
Reference: McKay, M.D., Beckman, R.J., and Conover, W.J. (1979). “A comparison of three methods for selecting values of input variables in the analysis of output from a computer code,” Technometrics, 21(2), 239-245.
2. Quasi-Monte Carlo (low-discrepancy sequences)
QMC replaces iid uniform points with deterministic low-discrepancy point sets {x_1, …, x_N} ⊂ [0,1]^d, achieving error O((log N)^d / N) for sufficiently smooth integrands rather than O(N^{-1/2}).
2.1 Halton sequence (1960)
x_i = (φ_{b_1}(i), …, φ_{b_d}(i)) where φ_b is the radical-inverse function in base b; the b_j are the first d primes. Suffers correlation in high dimensions.
2.2 Sobol sequence (1967)
Base-2 sequences using direction numbers from primitive polynomials over GF(2). Joe-Kuo (2008) tables provide direction numbers up to dimension 21201. Used heavily in finance, physics, and scientific computing.
2.3 Niederreiter (t,m,s)-nets and (t,s)-sequences
Generalizes Halton/Sobol to arbitrary bases; defines uniformity through equidistribution properties. Reference: Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods, SIAM.
2.4 Faure sequence (1982)
Prime-base permutation sequences with better high-dimensional behavior than Halton.
2.5 Scrambled QMC (Owen 1995, 1997)
Apply random digit permutations (Owen scrambling) to a digital net to obtain unbiased estimators with O(N^{-3/2}) variance for sufficiently smooth integrands and providing error estimates from independent replicates. Reference: Owen, A.B. (1997). “Scrambled net variance for integrals of smooth functions,” Annals of Statistics, 25(4), 1541-1562.
2.6 Randomized lattice rules
Korobov rules, rank-1 lattices: x_i = {i z / N} for generating vector z; randomized via random shifts (Sloan-Joe 1994). Component-by-component construction (Kuo-Sloan-Wozniakowski).
3. Markov chain Monte Carlo (MCMC)
A discrete-time Markov chain {X_t} with transition kernel K admitting stationary distribution π satisfies, after burn-in, X_t ~ π and ergodic averages (1/T) Σ_{t=1}^T f(X_t) → E_π[f] (a.s.) for π-integrable f.
3.1 Metropolis-Hastings (Metropolis 1953, Hastings 1970)
At state x with proposal q(x’|x):
- Propose x’ ~ q(·|x)
- Acceptance α(x, x’) = min{1, [π̃(x’) q(x|x’)] / [π̃(x) q(x’|x)]}
- With probability α set X_{t+1} = x’; else X_{t+1} = x
Symmetric proposals (q(x’|x) = q(x|x’), e.g. Gaussian random walk) give α = min{1, π̃(x’)/π̃(x)} — the original Metropolis algorithm. Independence sampler uses q(x’|x) = q(x’). Optimal scaling for d-dim Gaussian target: step size ∝ 2.38/sqrt(d) with acceptance ≈ 0.234 (Roberts-Gelman-Gilks 1997).
References: Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E. (1953). “Equation of state calculations by fast computing machines,” Journal of Chemical Physics, 21(6), 1087-1092. Hastings, W.K. (1970). “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, 57(1), 97-109.
3.2 Gibbs sampler (Geman-Geman 1984)
For X = (X_1, …, X_d), cycle through full conditionals:
- X_i^{(t+1)} ~ π(x_i | x_1^{(t+1)}, …, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, …, x_d^{(t)})
All proposals are accepted (α = 1) when full conditionals are sampled exactly. Block Gibbs samples groups (X_{B_1}, …, X_{B_K}) jointly; reduces correlation when within-block dependence is strong. Random-scan Gibbs picks coordinate at random.
References: Geman, S. and Geman, D. (1984). “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721-741. Gelfand, A.E. and Smith, A.F.M. (1990). “Sampling-based approaches to calculating marginal densities,” JASA, 85(410), 398-409.
3.3 Metropolis-within-Gibbs
When full conditionals lack closed form, replace direct sampling with a Metropolis-Hastings step targeting each conditional. Each step preserves the joint stationary distribution. Tierney (1994) gives the theoretical framework.
3.4 Slice sampling (Neal 2003)
Auxiliary variable U ~ Uniform(0, π̃(X)); then sample X | U ~ Uniform({x : π̃(x) ≥ U}). Implementation via “stepping out” and “shrinkage” to find the slice. Reference: Neal, R.M. (2003). “Slice sampling,” Annals of Statistics, 31(3), 705-767.
3.5 Hamiltonian (Hybrid) Monte Carlo — HMC (Duane et al. 1987; Neal 2011)
Augment with momentum p ~ N(0, M) and define Hamiltonian H(x, p) = -log π̃(x) + (1/2) p^T M^{-1} p. Simulate Hamilton’s equations
- dx/dt = M^{-1} p
- dp/dt = ∇ log π̃(x) using a symplectic integrator (typically leapfrog with step size ε and L steps), then Metropolis-correct with α = min{1, exp(-ΔH)}. Mass matrix M tuned to target geometry. Scales O(d^{1/4}) in dimension (Beskos et al. 2013) vs. O(d) for random-walk Metropolis. Reference: Neal, R.M. (2011). “MCMC using Hamiltonian dynamics,” Handbook of Markov Chain Monte Carlo, CRC, ch. 5.
3.6 No-U-Turn Sampler — NUTS (Hoffman-Gelman 2014)
Adaptive HMC variant: build a binary tree of leapfrog steps in alternating forward/backward directions, doubling until the trajectory turns back on itself (the U-turn criterion (x_+ - x_-) · p < 0). Selects a state from the tree by slice sampling. Auto-tunes the step size ε via dual averaging (Nesterov 2009) targeting acceptance ≈ 0.8. The default sampler in Stan and PyMC. Reference: Hoffman, M.D. and Gelman, A. (2014). “The No-U-Turn Sampler,” JMLR, 15, 1593-1623.
3.7 Riemannian-manifold HMC — RMHMC (Girolami-Calderhead 2011)
Use a position-dependent metric G(x) (Fisher information + prior) so the kinetic energy becomes (1/2) p^T G(x)^{-1} p. Trajectories follow geodesics with respect to the Fisher-Rao metric; requires implicit (Stoermer-Verlet generalized) integrator. Captures local curvature; handles strongly correlated and multi-scale posteriors. Reference: Girolami, M. and Calderhead, B. (2011). “Riemann manifold Langevin and Hamiltonian Monte Carlo,” JRSS B, 73(2), 123-214.
3.8 Langevin algorithms
Overdamped Langevin SDE: dX_t = ∇ log π(X_t) dt + sqrt(2) dW_t has π as invariant distribution.
ULA — Unadjusted Langevin Algorithm:
- X_{t+1} = X_t + (ε/2) ∇ log π(X_t) + sqrt(ε) ξ_t, ξ_t ~ N(0, I)
Has discretization bias; mixing analyzed by Durmus-Moulines (2017). MALA — Metropolis-Adjusted Langevin Algorithm (Roberts-Tweedie 1996) Metropolis-corrects this proposal. Optimal acceptance ≈ 0.574 (Roberts-Rosenthal 1998). Underdamped (kinetic) Langevin: dX_t = V_t dt, dV_t = (-∇ log π(X_t) - γ V_t) dt + sqrt(2γ) dW_t.
3.9 Parallel tempering / replica exchange (Geyer 1991, Marinari-Parisi 1992)
Run K chains at temperatures 1 = T_1 < T_2 < … < T_K with stationary distributions π_k ∝ π̃^{1/T_k}. Periodically swap configurations between adjacent chains with Metropolis probability α = min{1, [π_k(x_{k+1}) π_{k+1}(x_k)]/[π_k(x_k) π_{k+1}(x_{k+1})]}. Hot chains explore globally; the cold chain (T_1 = 1) samples π. Annealed Importance Sampling (Neal 2001) is the IS analog.
3.10 Adaptive MCMC (Haario-Saksman-Tamminen 2001)
Tune proposal covariance using running empirical covariance Σ_t of past samples: q(·|x_t) = N(x_t, (2.38^2/d) Σ_t). Diminishing-adaptation condition + containment ensures ergodicity (Roberts-Rosenthal 2007). Reference: Haario, H., Saksman, E., and Tamminen, J. (2001). “An adaptive Metropolis algorithm,” Bernoulli, 7(2), 223-242.
3.11 Reversible-jump MCMC (Green 1995)
Targets distributions on spaces of varying dimension (model selection); proposals include dimension changes with Jacobian-corrected acceptance. Used for change-point models, mixture-component selection. Reference: Green, P.J. (1995). “Reversible jump MCMC computation and Bayesian model determination,” Biometrika, 82(4), 711-732.
3.12 Pseudo-marginal MCMC (Andrieu-Roberts 2009)
Use an unbiased estimator π̂(x) of π̃(x) (e.g. from particle filter) in the MH ratio; preserves π as marginal stationary. Enables Bayesian inference for intractable likelihoods. Particle MCMC (Andrieu-Doucet-Holenstein 2010) is the canonical instance.
3.13 Stochastic-gradient MCMC
SGLD — Stochastic Gradient Langevin Dynamics (Welling-Teh 2011): replace full gradient with mini-batch in ULA; subsample-based scalable Bayesian inference. SGHMC — Stochastic Gradient HMC (Chen-Fox-Guestrin 2014): add friction term to handle gradient noise. SGNHT — Stochastic Gradient Nosé-Hoover Thermostat (Ding et al. 2014).
3.14 Bouncy Particle / Zig-Zag (piecewise-deterministic MCMC)
Continuous-time non-reversible MCMC (Bouchard-Côté et al. 2018, Bierkens-Fearnhead-Roberts 2019). Particle moves deterministically along straight lines; velocity flips at event times of a Poisson process with rate λ(x, v) = max(0, -v · ∇ log π(x)). Provably faster mixing than reversible chains in some regimes.
3.15 Implicit / Hamiltonian implicit samplers
Implicit-step variants of HMC for stiff problems (Chorin-Tu 2009, Morzfeld-Chorin); generalized HMC; Riemannian symplectic methods.
4. Sequential Monte Carlo (SMC) and particle methods
4.1 Particle filter (Gordon-Salmond-Smith 1993)
Bootstrap particle filter for state-space model X_t | X_{t-1} ~ p(·|x_{t-1}), Y_t | X_t ~ p(·|x_t):
- Propagate: X_t^{(i)} ~ p(·|X_{t-1}^{(i)})
- Weight: w_t^{(i)} ∝ p(Y_t | X_t^{(i)})
- Resample (multinomial / systematic / stratified) when ESS drops below threshold
Variants: auxiliary particle filter (Pitt-Shephard 1999), ensemble Kalman filter (Evensen 1994) hybrid Gaussian/particle, unscented particle filter (van der Merwe 2000).
4.2 Sequential Monte Carlo samplers (Del Moral-Doucet-Jasra 2006)
Sample sequence of distributions π_1, π_2, …, π_T (e.g. tempered π_t ∝ π̃^{φ_t} with 0 = φ_0 < φ_1 < … < φ_T = 1). Each step: importance reweight, resample, MCMC mutation. Provides unbiased normalizing-constant estimator log Ẑ — used for marginal likelihood / Bayes factors.
Reference: Doucet, A., de Freitas, N., and Gordon, N. (eds.) (2001). Sequential Monte Carlo Methods in Practice, Springer.
4.3 Annealed Importance Sampling — AIS (Neal 2001)
Discrete annealing 1 → 0 of inverse temperature; provides unbiased Ẑ estimator. Reverse AIS (Burda et al. 2015) bounds VAE log-likelihoods.
4.4 Population Monte Carlo (Cappé-Guillin-Marin-Robert 2004)
Adaptive multivariate importance sampling: at iteration t, draw from mixture of N proposals adapted to previous weighted population.
4.5 Nested sampling (Skilling 2004)
Bayesian evidence computation: shrink “live points” through prior-mass contours of increasing likelihood. Implementations: MultiNest (Feroz-Hobson-Bridges 2009), PolyChord (Handley-Hobson-Lasenby 2015), dynesty (Speagle 2020). Used extensively in cosmology and astrophysics.
5. Variational inference (deterministic approximation)
Approximate π by q_φ ∈ Q minimizing KL(q_φ || π) = E_{q_φ}[log q_φ - log π̃] + log Z. Equivalent to maximizing ELBO: L(φ) = E_{q_φ}[log π̃(x)] + H[q_φ].
5.1 Mean-field variational inference (Saul-Jaakkola-Jordan 1996)
Factorize q(x) = Π_i q_i(x_i); coordinate ascent updates q_i*(x_i) ∝ exp(E_{q_{-i}}[log π̃(x)]). For conjugate-exponential families (Beal-Ghahramani 2003), closed-form CAVI.
5.2 Stochastic variational inference — SVI (Hoffman-Blei-Wang-Paisley 2013)
Natural-gradient mini-batch updates for large-scale conjugate models; scales LDA, hierarchical regression to millions of documents.
5.3 Black-box variational inference — BBVI (Ranganath-Gerrish-Blei 2014)
Score-function (REINFORCE) gradient estimator for arbitrary q, π. High variance; reduced via control variates and Rao-Blackwellization.
5.4 Reparameterization trick (Kingma-Welling 2014, Rezende-Mohamed-Wierstra 2014)
For location-scale q (Gaussian, Laplace), write X = g(ε, φ) with ε ~ p_0. Then ∇φ E{q_φ}[f(X)] = E_{ε}[∇_φ f(g(ε, φ))]; low variance pathwise gradient. Foundation of VAEs and modern amortized inference.
5.5 Normalizing flows for VI (Rezende-Mohamed 2015)
Parameterize q as transform of base distribution through invertible neural network; log-density via change-of-variables formula log q(x) = log p_0(z_0) - Σ_k log |det ∂f_k/∂z_{k-1}|. Families: planar/radial flows, RealNVP (Dinh-Sohl-Dickstein-Bengio 2017), NICE, MAF (Papamakarios-Pavlakou-Murray 2017), IAF (Kingma et al. 2016), Glow (Kingma-Dhariwal 2018), neural ODE flows / FFJORD (Grathwohl et al. 2019).
5.6 Amortized inference
Train inference network q_φ(z|x) to output approximate posteriors as a function of observed x; the encoder in VAEs. Avoids per-datum optimization.
5.7 Stein Variational Gradient Descent — SVGD (Liu-Wang 2016)
Particle-based VI: iteratively update N particles {x_i} by
- x_i ← x_i + ε φ*(x_i), with φ*(x) = (1/N) Σ_j [k(x_j, x) ∇*{x_j} log π(x_j) + ∇*{x_j} k(x_j, x)]
where k is a positive-definite kernel (typically RBF). Kernelized Stein discrepancy is the local descent direction. Reference: Liu, Q. and Wang, D. (2016). “Stein variational gradient descent: A general purpose Bayesian inference algorithm,” NeurIPS.
5.8 Pathfinder (Zhang et al. 2021)
Variational pre-conditioning: run quasi-Newton (L-BFGS) optimization on -log π; along the path of iterates, fit Gaussian approximations with low-rank-plus-diagonal covariance from L-BFGS Hessian estimates. Pareto-smoothed importance resample to pick best. Strong NUTS initializer in Stan.
5.9 Normalizing-flow samplers / BayesFlow
BayesFlow (Radev et al. 2020, 2022): simulation-based inference via conditional normalizing flows; train invertible network on simulator pairs (θ, x) for amortized posterior q(θ|x). Related: SNPE/SNL/SNRE in sbi package (Tejero-Cantero et al. 2020).
6. Bayesian quadrature and probabilistic integration
6.1 Bayesian Monte Carlo (O’Hagan 1991)
Treat the integrand as a Gaussian-process prior; the integral becomes Gaussian-distributed with posterior mean μ_Q = z^T K^{-1} f and variance σ²_Q = z_0 - z^T K^{-1} z, with kernel-mean embedding z. Reference: O’Hagan, A. (1991). “Bayes-Hermite quadrature,” Journal of Statistical Planning and Inference, 29(3), 245-260.
6.2 Probabilistic numerics (Hennig-Osborne-Girolami 2015)
Treat numerical algorithms (quadrature, ODE solvers, linear solves) as inference; provides calibrated uncertainty. Tools: ProbNum, Stheno.
6.3 Quadrature rules (deterministic alternative to MC)
- Newton-Cotes (closed): trapezoidal (order 2), Simpson (order 4), Boole/Bode (order 6).
- Gauss-Legendre: n-point exact for polynomials of degree 2n-1 on [-1,1].
- Gauss-Hermite: nodes and weights for ∫ exp(-x²) f(x) dx (used for normal expectations).
- Gauss-Laguerre: ∫_0^∞ exp(-x) f(x) dx (exponential).
- Gauss-Chebyshev: ∫_{-1}^1 f(x)/sqrt(1-x²) dx.
- Clenshaw-Curtis: Chebyshev nodes; nested → efficient adaptivity.
- Tanh-sinh (Takahasi-Mori 1974): doubly exponential; handles endpoint singularities.
- Sparse-grid / Smolyak (1963): tensor-product reduction beating curse of dimensionality up to d ≈ 20.
- Cubature (Stroud 1971): exact rules on multidimensional regions.
7. Specialized samplers
7.1 Dirichlet process / Chinese restaurant / Stick-breaking
Sethuraman (1994) stick-breaking: π_k = V_k Π_{j<k} (1 - V_j) with V_k ~ Beta(1, α). Polya urn / Chinese restaurant process (CRP) for collapsed Gibbs in nonparametric Bayes (Neal 2000 algorithm 8).
7.2 Determinantal point process sampling
Hough-Krishnapur-Peres-Virág (2006) spectral algorithm; k-DPP (Kulesza-Taskar 2011). Recent: tree-based exact DPP sampling (Launay et al. 2020).
7.3 Coupling from the past — CFTP (Propp-Wilson 1996)
Exact (perfect) sampling for monotone Markov chains; uses dominated coupling to detect coalescence into stationary distribution.
7.4 Hit-and-run (Smith 1984)
Mixing on convex bodies in poly(d) time (Lovász-Vempala). Pick random direction at current point; sample uniformly on the chord.
7.5 Multilevel Monte Carlo — MLMC (Giles 2008)
Telescoping estimator Σ_l E[P_l - P_{l-1}] with decreasing samples at higher resolution; reduces cost from O(ε^{-3}) to O(ε^{-2}) for SDEs.
8. Diagnostics and convergence
8.1 Trace plots and autocorrelation
Integrated autocorrelation time τ_int = 1 + 2 Σ_{k=1}^∞ ρ(k); ESS = N / τ_int.
8.2 Gelman-Rubin R̂ (Gelman-Rubin 1992)
Run M ≥ 2 chains; compare within-chain variance W to between-chain variance B; R̂ = sqrt((W(1 - 1/N) + B)/W). Convergence claimed when R̂ < 1.01 (Vehtari et al. 2021 rank-normalized R̂).
8.3 Effective Sample Size (ESS)
Bulk-ESS and tail-ESS (Vehtari et al. 2021); both should be ≥ 400 per chain for reliable quantile estimates.
8.4 Divergence diagnostics (Stan)
Track energy E(x, p); divergent transitions (Δenergy too large) indicate biased exploration; treedepth limits indicate insufficient leapfrog steps. Pair plots in cm-divergences mode (Betancourt 2017).
8.5 Posterior predictive checks
Sample y_rep | y ~ p(y | θ) for θ ~ posterior; visual / quantitative comparison to observed y.
8.6 Simulation-Based Calibration — SBC (Talts-Betancourt-Simpson-Vehtari-Gelman 2018)
For ranks of true θ relative to L posterior draws across N simulations, ranks should be uniform; tests calibration of inference algorithm + model implementation jointly.
9. Software / Implementations
9.1 Stan (Carpenter et al. 2017)
Probabilistic programming language compiling to NUTS with dual-averaging warmup, Pathfinder initialization (2021+), variational inference (ADVI), and L-BFGS optimization. Interfaces: CmdStan, RStan, PyStan, CmdStanPy, CmdStanR. Reference: Carpenter, B. et al. (2017). “Stan: A probabilistic programming language,” Journal of Statistical Software, 76(1).
9.2 PyMC (Salvatier-Wiecki-Fonnesbeck 2016, v5 by Wiecki et al. 2023)
Python PPL with NUTS, HMC, Metropolis, SMC, ADVI; uses PyTensor backend (Aesara fork). Marginal-MAP, JAX, Numba sampling backends.
9.3 NumPyro (Phan-Pradhan-Jankowiak 2019)
JAX-based NUTS implementation; fastest pure-Python sampler for moderate dimensions. Single-program multiple-data parallelism over chains/devices. Reference: Phan, D., Pradhan, N., and Jankowiak, M. (2019). “Composable effects for flexible and accelerated probabilistic programming in NumPyro,” arXiv:1912.11554.
9.4 Turing.jl (Ge-Xu-Ghahramani 2018)
Julia PPL with NUTS, HMC, Gibbs composition, particle MCMC, AdvancedHMC backend.
9.5 BlackJAX (Cabezas et al. 2024)
JAX-native library of HMC/NUTS/MALA/SMC/Pathfinder building blocks.
9.6 emcee (Foreman-Mackey et al. 2013)
Goodman-Weare affine-invariant ensemble sampler; the de facto standard in astrophysics.
9.7 dynesty / MultiNest / PolyChord
Nested-sampling implementations for evidence + posteriors; popular in cosmology.
9.8 sbi / BayesFlow
Simulation-based inference: SNPE, SNLE, SNRE, neural-flow posterior estimation.
10. Recent advances (2020-2025)
10.1 Pathfinder (Zhang-Carpenter-Gelman-Vehtari 2022)
Variational + L-BFGS path approximations for NUTS warm-up; large speedup vs. dual-averaging-only warmup in Stan.
10.2 BayesFlow (Radev et al. 2022)
Conditional normalizing flows for amortized simulation-based posterior estimation. Reference: Radev, S.T., Mertens, U.K., Voss, A., Ardizzone, L., and Köthe, U. (2022). “BayesFlow: Learning complex stochastic models with invertible neural networks,” IEEE TNNLS, 33(4), 1452-1466.
10.3 Diffusion-based samplers
Score-matching MCMC and Annealed Langevin (Song-Ermon 2019; Song et al. 2021). Connection to sampling from data distribution learned via denoising score matching.
10.4 Continuous normalizing flow / FFJORD / Riemannian flows
Flow matching (Lipman et al. 2023), conditional flow matching, Riemannian flow matching on manifolds (Chen-Lipman 2024).
10.5 Coreset MCMC (Campbell-Broderick 2019)
Compress dataset to weighted coreset that preserves posterior; combine with off-the-shelf MCMC.
10.6 Parallel MCMC
GPU-parallel NUTS in CmdStan, JAX-NUTS, Theseus; tensor cores in latest NVIDIA / Apple Silicon. Mass-matrix adaptation in parallel (Bales-Pourzanjani-Vehtari 2019).
10.7 Tempered Markov Chain Monte Carlo (Syed-Bouchard-Côté-Deligiannidis-Doucet 2022)
Non-reversible parallel tempering with optimal swap schedules.
10.8 Sequential neural posterior estimation (SNPE-C, SNPE-D)
Greenberg-Nonnenmacher-Macke (2019); Durkan-Murray-Papamakarios (2020); Lueckmann et al. benchmarks (2021).
11. Theoretical foundations
11.1 Detailed balance and reversibility
A chain is π-reversible iff π(x) K(x, x’) = π(x’) K(x’, x). All MH chains satisfy detailed balance by construction.
11.2 Ergodic theorems
For π-irreducible aperiodic Markov chain with stationary π, (1/T) Σ f(X_t) → E_π[f] a.s. (Birkhoff); CLT under polynomial / geometric ergodicity (Meyn-Tweedie 2009).
11.3 Spectral gap and mixing time
Mixing time t_mix(ε) = min{t : sup_x ‖K^t(x, ·) - π‖_TV ≤ ε}. Bounded by spectral gap 1 - λ_2. Cheeger / conductance bounds (Lawler-Sokal 1988, Sinclair-Jerrum 1989).
11.4 Geometric ergodicity
π-irreducible, aperiodic chain with Lyapunov drift condition V(x) → ∞: sup_x ‖K^t(x, ·) - π‖_V ≤ C(x) ρ^t. MALA geometrically ergodic for super-exponential tails (Roberts-Tweedie 1996).
11.5 Optimal scaling theory
For d-dim iid product target, RWM has acceptance ≈ 0.234 (Roberts-Gelman-Gilks 1997); MALA acceptance ≈ 0.574 (Roberts-Rosenthal 1998); HMC acceptance ≈ 0.65 - 0.8 (Beskos et al. 2013).
11.6 Wasserstein / Sinkhorn analyses
Distance to stationary in W_2 / KL / TV for ULA/MALA/HMC via Talagrand T1, log-Sobolev, Poincaré, Bakry-Émery curvature criteria (Bakry-Gentil-Ledoux 2014).
11.7 Discretization error of Langevin
ULA bias O(ε); MALA O(ε²) for smooth targets; high-order integrators (Strang splitting, fourth-order leapfrog) give O(ε^p) for HMC.
12. Application domains
- Bayesian statistics: posterior + predictive + model comparison (Bayes factors via SMC/AIS/nested sampling)
- Statistical physics: Ising, Potts, lattice gauge (origin of Metropolis 1953)
- Computational chemistry: free-energy methods, replica exchange MD
- Population genetics: coalescent / ABC inference; MCMC over genealogies
- Cosmology: parameter inference (CosmoMC, MontePython, Cobaya), nested sampling for evidence
- Machine learning: Bayesian neural nets (HMC, SGLD, variational), VAEs (reparameterization), diffusion models (score-matched Langevin)
- Robotics: particle filters for SLAM; sampling-based motion planning (RRT, PRM share random-sampling primitives)
- Finance: option pricing (QMC), risk integrals, Bayesian forecasting