Sampling Algorithms Catalog

A reference catalog of algorithms for drawing samples from probability distributions, organized by family. Symbols follow standard usage: target density π(x) (often unnormalized π̃(x) = Zπ(x) with Z unknown), proposal q(x’|x), state space X ⊆ R^d, acceptance probability α, Markov transition kernel K(x → x’), and stationary distribution π. Throughout, X ~ p means X is distributed according to p.

1. Direct (non-iterative) samplers

1.1 Inverse-CDF (probability integral transform)

For univariate target with CDF F: if U ~ Uniform(0,1), then X = F^{-1}(U) ~ F. Requires closed-form or numerical inverse of F. Examples: exponential X = -log(1-U)/λ; Gumbel; Cauchy X = tan(π(U - 1/2)).

Reference: Devroye, L. (1986). Non-Uniform Random Variate Generation. Springer.

1.2 Box-Muller (1958)

Two independent standard normals from two uniforms:

Z_1 = sqrt(-2 ln U_1) cos(2π U_2)
Z_2 = sqrt(-2 ln U_1) sin(2π U_2)

Marsaglia polar variant avoids trig calls. Reference: Box, G.E.P. and Muller, M.E. (1958). “A note on the generation of random normal deviates,” Annals of Mathematical Statistics, 29(2), 610-611.

1.3 Ziggurat (Marsaglia-Tsang 2000)

Rectangle-decomposition rejection sampler for normal and exponential. Mostly accepts on first table lookup; very fast for repeated draws. Reference: Marsaglia, G. and Tsang, W.W. (2000). “The Ziggurat Method for Generating Random Variables,” Journal of Statistical Software, 5(8).

1.4 Rejection sampling (von Neumann 1951)

To draw X ~ π given an envelope q with π(x) ≤ M q(x) for finite M ≥ 1:

Draw X ~ q, U ~ Uniform(0,1)
Accept X if U ≤ π(X) / (M q(X)); else repeat

Acceptance probability is 1/M. Adaptive Rejection Sampling (ARS) of Gilks and Wild (1992) builds piecewise-linear log-concave envelopes adaptively. ARS extends to log-concave densities with derivatives; ARMS (Gilks, Best, Tan 1995) adds a Metropolis step for non-log-concave targets.

References: von Neumann, J. (1951). “Various techniques used in connection with random digits,” NBS Applied Math Series, 12, 36-38. Gilks, W.R. and Wild, P. (1992). “Adaptive rejection sampling for Gibbs sampling,” Applied Statistics, 41, 337-348.

1.5 Ratio-of-Uniforms (Kinderman-Monahan 1977)

For density f(x) ∝ h(x), define A = {(u,v) : 0 < u ≤ sqrt(h(v/u))}. If (U,V) ~ Uniform(A), then V/U ~ f. The region A is bounded for a wide class of distributions. Generalizations: Wakefield, Gelfand, and Smith (1991).

Reference: Kinderman, A.J. and Monahan, J.F. (1977). “Computer generation of random variables using the ratio of uniform deviates,” ACM Transactions on Mathematical Software, 3(3), 257-260.

1.6 Composition method

Mixture sampling: to sample X ~ Σ_k w_k p_k(x), first sample index K with P(K=k) = w_k, then X ~ p_K. Useful when components are easy and mixture target is hard.

1.7 Importance sampling

For estimating E_π[f(X)] = ∫ f(x) π(x) dx using proposal q with q > 0 wherever π > 0:

Draw X_1, …, X_N ~ q
Compute weights w_i = π(X_i) / q(X_i)
Estimator: μ̂ = (1/N) Σ_i w_i f(X_i) (unbiased)

Self-normalized importance sampling (SNIS) uses w̃_i = w_i / Σ_j w_j; biased but variance often lower; required when π is known only up to normalization. Effective sample size ESS = (Σ w_i)² / Σ w_i² ∈ [1, N]. Optimal proposal q*(x) ∝ |f(x)| π(x) (Kahn-Marshall 1953).

Reference: Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed., Springer.

1.8 Stratified sampling

Partition X into strata X_1, …, X_K with weights p_k = π(X_k). Allocate n_k samples per stratum (often n_k ∝ p_k σ_k for variance-optimal Neyman allocation, Neyman 1934). Estimator: Σ_k p_k μ̂_k.

1.9 Latin Hypercube Sampling (McKay-Beckman-Conover 1979)

In d dimensions with N samples, partition each axis into N equal-probability intervals; place one sample in each row and column (a Latin square arrangement). Lower variance than crude MC for additive integrands.

Reference: McKay, M.D., Beckman, R.J., and Conover, W.J. (1979). “A comparison of three methods for selecting values of input variables in the analysis of output from a computer code,” Technometrics, 21(2), 239-245.

2. Quasi-Monte Carlo (low-discrepancy sequences)

QMC replaces iid uniform points with deterministic low-discrepancy point sets {x_1, …, x_N} ⊂ [0,1]^d, achieving error O((log N)^d / N) for sufficiently smooth integrands rather than O(N^{-1/2}).

2.1 Halton sequence (1960)

x_i = (φ_{b_1}(i), …, φ_{b_d}(i)) where φ_b is the radical-inverse function in base b; the b_j are the first d primes. Suffers correlation in high dimensions.

2.2 Sobol sequence (1967)

Base-2 sequences using direction numbers from primitive polynomials over GF(2). Joe-Kuo (2008) tables provide direction numbers up to dimension 21201. Used heavily in finance, physics, and scientific computing.

2.3 Niederreiter (t,m,s)-nets and (t,s)-sequences

Generalizes Halton/Sobol to arbitrary bases; defines uniformity through equidistribution properties. Reference: Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods, SIAM.

2.4 Faure sequence (1982)

Prime-base permutation sequences with better high-dimensional behavior than Halton.

2.5 Scrambled QMC (Owen 1995, 1997)

Apply random digit permutations (Owen scrambling) to a digital net to obtain unbiased estimators with O(N^{-3/2}) variance for sufficiently smooth integrands and providing error estimates from independent replicates. Reference: Owen, A.B. (1997). “Scrambled net variance for integrals of smooth functions,” Annals of Statistics, 25(4), 1541-1562.

2.6 Randomized lattice rules

Korobov rules, rank-1 lattices: x_i = {i z / N} for generating vector z; randomized via random shifts (Sloan-Joe 1994). Component-by-component construction (Kuo-Sloan-Wozniakowski).

3. Markov chain Monte Carlo (MCMC)

A discrete-time Markov chain {X_t} with transition kernel K admitting stationary distribution π satisfies, after burn-in, X_t ~ π and ergodic averages (1/T) Σ_{t=1}^T f(X_t) → E_π[f] (a.s.) for π-integrable f.

3.1 Metropolis-Hastings (Metropolis 1953, Hastings 1970)

At state x with proposal q(x’|x):

Propose x’ ~ q(·|x)
Acceptance α(x, x’) = min{1, [π̃(x’) q(x|x’)] / [π̃(x) q(x’|x)]}
With probability α set X_{t+1} = x’; else X_{t+1} = x

Symmetric proposals (q(x’|x) = q(x|x’), e.g. Gaussian random walk) give α = min{1, π̃(x’)/π̃(x)} — the original Metropolis algorithm. Independence sampler uses q(x’|x) = q(x’). Optimal scaling for d-dim Gaussian target: step size ∝ 2.38/sqrt(d) with acceptance ≈ 0.234 (Roberts-Gelman-Gilks 1997).

References: Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E. (1953). “Equation of state calculations by fast computing machines,” Journal of Chemical Physics, 21(6), 1087-1092. Hastings, W.K. (1970). “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, 57(1), 97-109.

3.2 Gibbs sampler (Geman-Geman 1984)

For X = (X_1, …, X_d), cycle through full conditionals:

X_i^{(t+1)} ~ π(x_i | x_1^{(t+1)}, …, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, …, x_d^{(t)})

All proposals are accepted (α = 1) when full conditionals are sampled exactly. Block Gibbs samples groups (X_{B_1}, …, X_{B_K}) jointly; reduces correlation when within-block dependence is strong. Random-scan Gibbs picks coordinate at random.

References: Geman, S. and Geman, D. (1984). “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721-741. Gelfand, A.E. and Smith, A.F.M. (1990). “Sampling-based approaches to calculating marginal densities,” JASA, 85(410), 398-409.

3.3 Metropolis-within-Gibbs

When full conditionals lack closed form, replace direct sampling with a Metropolis-Hastings step targeting each conditional. Each step preserves the joint stationary distribution. Tierney (1994) gives the theoretical framework.

3.4 Slice sampling (Neal 2003)

Auxiliary variable U ~ Uniform(0, π̃(X)); then sample X | U ~ Uniform({x : π̃(x) ≥ U}). Implementation via “stepping out” and “shrinkage” to find the slice. Reference: Neal, R.M. (2003). “Slice sampling,” Annals of Statistics, 31(3), 705-767.

3.5 Hamiltonian (Hybrid) Monte Carlo — HMC (Duane et al. 1987; Neal 2011)

Augment with momentum p ~ N(0, M) and define Hamiltonian H(x, p) = -log π̃(x) + (1/2) p^T M^{-1} p. Simulate Hamilton’s equations

dx/dt = M^{-1} p
dp/dt = ∇ log π̃(x) using a symplectic integrator (typically leapfrog with step size ε and L steps), then Metropolis-correct with α = min{1, exp(-ΔH)}. Mass matrix M tuned to target geometry. Scales O(d^{1/4}) in dimension (Beskos et al. 2013) vs. O(d) for random-walk Metropolis. Reference: Neal, R.M. (2011). “MCMC using Hamiltonian dynamics,” Handbook of Markov Chain Monte Carlo, CRC, ch. 5.

3.6 No-U-Turn Sampler — NUTS (Hoffman-Gelman 2014)

Adaptive HMC variant: build a binary tree of leapfrog steps in alternating forward/backward directions, doubling until the trajectory turns back on itself (the U-turn criterion (x_+ - x_-) · p < 0). Selects a state from the tree by slice sampling. Auto-tunes the step size ε via dual averaging (Nesterov 2009) targeting acceptance ≈ 0.8. The default sampler in Stan and PyMC. Reference: Hoffman, M.D. and Gelman, A. (2014). “The No-U-Turn Sampler,” JMLR, 15, 1593-1623.

3.7 Riemannian-manifold HMC — RMHMC (Girolami-Calderhead 2011)

Use a position-dependent metric G(x) (Fisher information + prior) so the kinetic energy becomes (1/2) p^T G(x)^{-1} p. Trajectories follow geodesics with respect to the Fisher-Rao metric; requires implicit (Stoermer-Verlet generalized) integrator. Captures local curvature; handles strongly correlated and multi-scale posteriors. Reference: Girolami, M. and Calderhead, B. (2011). “Riemann manifold Langevin and Hamiltonian Monte Carlo,” JRSS B, 73(2), 123-214.

3.8 Langevin algorithms

Overdamped Langevin SDE: dX_t = ∇ log π(X_t) dt + sqrt(2) dW_t has π as invariant distribution.

ULA — Unadjusted Langevin Algorithm:

X_{t+1} = X_t + (ε/2) ∇ log π(X_t) + sqrt(ε) ξ_t, ξ_t ~ N(0, I)

Has discretization bias; mixing analyzed by Durmus-Moulines (2017). MALA — Metropolis-Adjusted Langevin Algorithm (Roberts-Tweedie 1996) Metropolis-corrects this proposal. Optimal acceptance ≈ 0.574 (Roberts-Rosenthal 1998). Underdamped (kinetic) Langevin: dX_t = V_t dt, dV_t = (-∇ log π(X_t) - γ V_t) dt + sqrt(2γ) dW_t.

3.9 Parallel tempering / replica exchange (Geyer 1991, Marinari-Parisi 1992)

Run K chains at temperatures 1 = T_1 < T_2 < … < T_K with stationary distributions π_k ∝ π̃^{1/T_k}. Periodically swap configurations between adjacent chains with Metropolis probability α = min{1, [π_k(x_{k+1}) π_{k+1}(x_k)]/[π_k(x_k) π_{k+1}(x_{k+1})]}. Hot chains explore globally; the cold chain (T_1 = 1) samples π. Annealed Importance Sampling (Neal 2001) is the IS analog.

3.10 Adaptive MCMC (Haario-Saksman-Tamminen 2001)

Tune proposal covariance using running empirical covariance Σ_t of past samples: q(·|x_t) = N(x_t, (2.38^2/d) Σ_t). Diminishing-adaptation condition + containment ensures ergodicity (Roberts-Rosenthal 2007). Reference: Haario, H., Saksman, E., and Tamminen, J. (2001). “An adaptive Metropolis algorithm,” Bernoulli, 7(2), 223-242.

3.11 Reversible-jump MCMC (Green 1995)

Targets distributions on spaces of varying dimension (model selection); proposals include dimension changes with Jacobian-corrected acceptance. Used for change-point models, mixture-component selection. Reference: Green, P.J. (1995). “Reversible jump MCMC computation and Bayesian model determination,” Biometrika, 82(4), 711-732.

3.12 Pseudo-marginal MCMC (Andrieu-Roberts 2009)

Use an unbiased estimator π̂(x) of π̃(x) (e.g. from particle filter) in the MH ratio; preserves π as marginal stationary. Enables Bayesian inference for intractable likelihoods. Particle MCMC (Andrieu-Doucet-Holenstein 2010) is the canonical instance.

3.13 Stochastic-gradient MCMC

SGLD — Stochastic Gradient Langevin Dynamics (Welling-Teh 2011): replace full gradient with mini-batch in ULA; subsample-based scalable Bayesian inference. SGHMC — Stochastic Gradient HMC (Chen-Fox-Guestrin 2014): add friction term to handle gradient noise. SGNHT — Stochastic Gradient Nosé-Hoover Thermostat (Ding et al. 2014).

3.14 Bouncy Particle / Zig-Zag (piecewise-deterministic MCMC)

Continuous-time non-reversible MCMC (Bouchard-Côté et al. 2018, Bierkens-Fearnhead-Roberts 2019). Particle moves deterministically along straight lines; velocity flips at event times of a Poisson process with rate λ(x, v) = max(0, -v · ∇ log π(x)). Provably faster mixing than reversible chains in some regimes.

3.15 Implicit / Hamiltonian implicit samplers

Implicit-step variants of HMC for stiff problems (Chorin-Tu 2009, Morzfeld-Chorin); generalized HMC; Riemannian symplectic methods.

4. Sequential Monte Carlo (SMC) and particle methods

4.1 Particle filter (Gordon-Salmond-Smith 1993)

Bootstrap particle filter for state-space model X_t | X_{t-1} ~ p(·|x_{t-1}), Y_t | X_t ~ p(·|x_t):

Propagate: X_t^{(i)} ~ p(·|X_{t-1}^{(i)})
Weight: w_t^{(i)} ∝ p(Y_t | X_t^{(i)})
Resample (multinomial / systematic / stratified) when ESS drops below threshold

Variants: auxiliary particle filter (Pitt-Shephard 1999), ensemble Kalman filter (Evensen 1994) hybrid Gaussian/particle, unscented particle filter (van der Merwe 2000).

4.2 Sequential Monte Carlo samplers (Del Moral-Doucet-Jasra 2006)

Sample sequence of distributions π_1, π_2, …, π_T (e.g. tempered π_t ∝ π̃^{φ_t} with 0 = φ_0 < φ_1 < … < φ_T = 1). Each step: importance reweight, resample, MCMC mutation. Provides unbiased normalizing-constant estimator log Ẑ — used for marginal likelihood / Bayes factors.

Reference: Doucet, A., de Freitas, N., and Gordon, N. (eds.) (2001). Sequential Monte Carlo Methods in Practice, Springer.

4.3 Annealed Importance Sampling — AIS (Neal 2001)

Discrete annealing 1 → 0 of inverse temperature; provides unbiased Ẑ estimator. Reverse AIS (Burda et al. 2015) bounds VAE log-likelihoods.

4.4 Population Monte Carlo (Cappé-Guillin-Marin-Robert 2004)

Adaptive multivariate importance sampling: at iteration t, draw from mixture of N proposals adapted to previous weighted population.

4.5 Nested sampling (Skilling 2004)

Bayesian evidence computation: shrink “live points” through prior-mass contours of increasing likelihood. Implementations: MultiNest (Feroz-Hobson-Bridges 2009), PolyChord (Handley-Hobson-Lasenby 2015), dynesty (Speagle 2020). Used extensively in cosmology and astrophysics.

5. Variational inference (deterministic approximation)

Approximate π by q_φ ∈ Q minimizing KL(q_φ || π) = E_{q_φ}[log q_φ - log π̃] + log Z. Equivalent to maximizing ELBO: L(φ) = E_{q_φ}[log π̃(x)] + H[q_φ].

5.1 Mean-field variational inference (Saul-Jaakkola-Jordan 1996)

Factorize q(x) = Π_i q_i(x_i); coordinate ascent updates q_i*(x_i) ∝ exp(E_{q_{-i}}[log π̃(x)]). For conjugate-exponential families (Beal-Ghahramani 2003), closed-form CAVI.

5.2 Stochastic variational inference — SVI (Hoffman-Blei-Wang-Paisley 2013)

Natural-gradient mini-batch updates for large-scale conjugate models; scales LDA, hierarchical regression to millions of documents.

5.3 Black-box variational inference — BBVI (Ranganath-Gerrish-Blei 2014)

Score-function (REINFORCE) gradient estimator for arbitrary q, π. High variance; reduced via control variates and Rao-Blackwellization.

5.4 Reparameterization trick (Kingma-Welling 2014, Rezende-Mohamed-Wierstra 2014)

For location-scale q (Gaussian, Laplace), write X = g(ε, φ) with ε ~ p_0. Then ∇φ E{q_φ}[f(X)] = E_{ε}[∇_φ f(g(ε, φ))]; low variance pathwise gradient. Foundation of VAEs and modern amortized inference.

5.5 Normalizing flows for VI (Rezende-Mohamed 2015)

Parameterize q as transform of base distribution through invertible neural network; log-density via change-of-variables formula log q(x) = log p_0(z_0) - Σ_k log |det ∂f_k/∂z_{k-1}|. Families: planar/radial flows, RealNVP (Dinh-Sohl-Dickstein-Bengio 2017), NICE, MAF (Papamakarios-Pavlakou-Murray 2017), IAF (Kingma et al. 2016), Glow (Kingma-Dhariwal 2018), neural ODE flows / FFJORD (Grathwohl et al. 2019).

5.6 Amortized inference

Train inference network q_φ(z|x) to output approximate posteriors as a function of observed x; the encoder in VAEs. Avoids per-datum optimization.

5.7 Stein Variational Gradient Descent — SVGD (Liu-Wang 2016)

Particle-based VI: iteratively update N particles {x_i} by

x_i ← x_i + ε φ*(x_i), with φ*(x) = (1/N) Σ_j [k(x_j, x) ∇*{x_j} log π(x_j) + ∇*{x_j} k(x_j, x)]

where k is a positive-definite kernel (typically RBF). Kernelized Stein discrepancy is the local descent direction. Reference: Liu, Q. and Wang, D. (2016). “Stein variational gradient descent: A general purpose Bayesian inference algorithm,” NeurIPS.

5.8 Pathfinder (Zhang et al. 2021)

Variational pre-conditioning: run quasi-Newton (L-BFGS) optimization on -log π; along the path of iterates, fit Gaussian approximations with low-rank-plus-diagonal covariance from L-BFGS Hessian estimates. Pareto-smoothed importance resample to pick best. Strong NUTS initializer in Stan.

5.9 Normalizing-flow samplers / BayesFlow

BayesFlow (Radev et al. 2020, 2022): simulation-based inference via conditional normalizing flows; train invertible network on simulator pairs (θ, x) for amortized posterior q(θ|x). Related: SNPE/SNL/SNRE in sbi package (Tejero-Cantero et al. 2020).

6. Bayesian quadrature and probabilistic integration

6.1 Bayesian Monte Carlo (O’Hagan 1991)

Treat the integrand as a Gaussian-process prior; the integral becomes Gaussian-distributed with posterior mean μ_Q = z^T K^{-1} f and variance σ²_Q = z_0 - z^T K^{-1} z, with kernel-mean embedding z. Reference: O’Hagan, A. (1991). “Bayes-Hermite quadrature,” Journal of Statistical Planning and Inference, 29(3), 245-260.

6.2 Probabilistic numerics (Hennig-Osborne-Girolami 2015)

Treat numerical algorithms (quadrature, ODE solvers, linear solves) as inference; provides calibrated uncertainty. Tools: ProbNum, Stheno.

6.3 Quadrature rules (deterministic alternative to MC)

Newton-Cotes (closed): trapezoidal (order 2), Simpson (order 4), Boole/Bode (order 6).
Gauss-Legendre: n-point exact for polynomials of degree 2n-1 on [-1,1].
Gauss-Hermite: nodes and weights for ∫ exp(-x²) f(x) dx (used for normal expectations).
Gauss-Laguerre: ∫_0^∞ exp(-x) f(x) dx (exponential).
Gauss-Chebyshev: ∫_{-1}^1 f(x)/sqrt(1-x²) dx.
Clenshaw-Curtis: Chebyshev nodes; nested → efficient adaptivity.
Tanh-sinh (Takahasi-Mori 1974): doubly exponential; handles endpoint singularities.
Sparse-grid / Smolyak (1963): tensor-product reduction beating curse of dimensionality up to d ≈ 20.
Cubature (Stroud 1971): exact rules on multidimensional regions.

7. Specialized samplers

7.1 Dirichlet process / Chinese restaurant / Stick-breaking

Sethuraman (1994) stick-breaking: π_k = V_k Π_{j<k} (1 - V_j) with V_k ~ Beta(1, α). Polya urn / Chinese restaurant process (CRP) for collapsed Gibbs in nonparametric Bayes (Neal 2000 algorithm 8).

7.2 Determinantal point process sampling

Hough-Krishnapur-Peres-Virág (2006) spectral algorithm; k-DPP (Kulesza-Taskar 2011). Recent: tree-based exact DPP sampling (Launay et al. 2020).

7.3 Coupling from the past — CFTP (Propp-Wilson 1996)

Exact (perfect) sampling for monotone Markov chains; uses dominated coupling to detect coalescence into stationary distribution.

7.4 Hit-and-run (Smith 1984)

Mixing on convex bodies in poly(d) time (Lovász-Vempala). Pick random direction at current point; sample uniformly on the chord.

7.5 Multilevel Monte Carlo — MLMC (Giles 2008)

Telescoping estimator Σ_l E[P_l - P_{l-1}] with decreasing samples at higher resolution; reduces cost from O(ε^{-3}) to O(ε^{-2}) for SDEs.

8. Diagnostics and convergence

8.1 Trace plots and autocorrelation

Integrated autocorrelation time τ_int = 1 + 2 Σ_{k=1}^∞ ρ(k); ESS = N / τ_int.

8.2 Gelman-Rubin R̂ (Gelman-Rubin 1992)

Run M ≥ 2 chains; compare within-chain variance W to between-chain variance B; R̂ = sqrt((W(1 - 1/N) + B)/W). Convergence claimed when R̂ < 1.01 (Vehtari et al. 2021 rank-normalized R̂).

8.3 Effective Sample Size (ESS)

Bulk-ESS and tail-ESS (Vehtari et al. 2021); both should be ≥ 400 per chain for reliable quantile estimates.

8.4 Divergence diagnostics (Stan)

Track energy E(x, p); divergent transitions (Δenergy too large) indicate biased exploration; treedepth limits indicate insufficient leapfrog steps. Pair plots in cm-divergences mode (Betancourt 2017).

8.5 Posterior predictive checks

Sample y_rep | y ~ p(y | θ) for θ ~ posterior; visual / quantitative comparison to observed y.

8.6 Simulation-Based Calibration — SBC (Talts-Betancourt-Simpson-Vehtari-Gelman 2018)

For ranks of true θ relative to L posterior draws across N simulations, ranks should be uniform; tests calibration of inference algorithm + model implementation jointly.

9. Software / Implementations

9.1 Stan (Carpenter et al. 2017)

Probabilistic programming language compiling to NUTS with dual-averaging warmup, Pathfinder initialization (2021+), variational inference (ADVI), and L-BFGS optimization. Interfaces: CmdStan, RStan, PyStan, CmdStanPy, CmdStanR. Reference: Carpenter, B. et al. (2017). “Stan: A probabilistic programming language,” Journal of Statistical Software, 76(1).

9.2 PyMC (Salvatier-Wiecki-Fonnesbeck 2016, v5 by Wiecki et al. 2023)

Python PPL with NUTS, HMC, Metropolis, SMC, ADVI; uses PyTensor backend (Aesara fork). Marginal-MAP, JAX, Numba sampling backends.

9.3 NumPyro (Phan-Pradhan-Jankowiak 2019)

JAX-based NUTS implementation; fastest pure-Python sampler for moderate dimensions. Single-program multiple-data parallelism over chains/devices. Reference: Phan, D., Pradhan, N., and Jankowiak, M. (2019). “Composable effects for flexible and accelerated probabilistic programming in NumPyro,” arXiv:1912.11554.

9.4 Turing.jl (Ge-Xu-Ghahramani 2018)

Julia PPL with NUTS, HMC, Gibbs composition, particle MCMC, AdvancedHMC backend.

9.5 BlackJAX (Cabezas et al. 2024)

JAX-native library of HMC/NUTS/MALA/SMC/Pathfinder building blocks.

9.6 emcee (Foreman-Mackey et al. 2013)

Goodman-Weare affine-invariant ensemble sampler; the de facto standard in astrophysics.

9.7 dynesty / MultiNest / PolyChord

Nested-sampling implementations for evidence + posteriors; popular in cosmology.

9.8 sbi / BayesFlow

Simulation-based inference: SNPE, SNLE, SNRE, neural-flow posterior estimation.

10. Recent advances (2020-2025)

10.1 Pathfinder (Zhang-Carpenter-Gelman-Vehtari 2022)

Variational + L-BFGS path approximations for NUTS warm-up; large speedup vs. dual-averaging-only warmup in Stan.

10.2 BayesFlow (Radev et al. 2022)

Conditional normalizing flows for amortized simulation-based posterior estimation. Reference: Radev, S.T., Mertens, U.K., Voss, A., Ardizzone, L., and Köthe, U. (2022). “BayesFlow: Learning complex stochastic models with invertible neural networks,” IEEE TNNLS, 33(4), 1452-1466.

10.3 Diffusion-based samplers

Score-matching MCMC and Annealed Langevin (Song-Ermon 2019; Song et al. 2021). Connection to sampling from data distribution learned via denoising score matching.

10.4 Continuous normalizing flow / FFJORD / Riemannian flows

Flow matching (Lipman et al. 2023), conditional flow matching, Riemannian flow matching on manifolds (Chen-Lipman 2024).

10.5 Coreset MCMC (Campbell-Broderick 2019)

Compress dataset to weighted coreset that preserves posterior; combine with off-the-shelf MCMC.

10.6 Parallel MCMC

GPU-parallel NUTS in CmdStan, JAX-NUTS, Theseus; tensor cores in latest NVIDIA / Apple Silicon. Mass-matrix adaptation in parallel (Bales-Pourzanjani-Vehtari 2019).

10.7 Tempered Markov Chain Monte Carlo (Syed-Bouchard-Côté-Deligiannidis-Doucet 2022)

Non-reversible parallel tempering with optimal swap schedules.

10.8 Sequential neural posterior estimation (SNPE-C, SNPE-D)

Greenberg-Nonnenmacher-Macke (2019); Durkan-Murray-Papamakarios (2020); Lueckmann et al. benchmarks (2021).

11. Theoretical foundations

11.1 Detailed balance and reversibility

A chain is π-reversible iff π(x) K(x, x’) = π(x’) K(x’, x). All MH chains satisfy detailed balance by construction.

11.2 Ergodic theorems

For π-irreducible aperiodic Markov chain with stationary π, (1/T) Σ f(X_t) → E_π[f] a.s. (Birkhoff); CLT under polynomial / geometric ergodicity (Meyn-Tweedie 2009).

11.3 Spectral gap and mixing time

Mixing time t_mix(ε) = min{t : sup_x ‖K^t(x, ·) - π‖_TV ≤ ε}. Bounded by spectral gap 1 - λ_2. Cheeger / conductance bounds (Lawler-Sokal 1988, Sinclair-Jerrum 1989).

11.4 Geometric ergodicity

π-irreducible, aperiodic chain with Lyapunov drift condition V(x) → ∞: sup_x ‖K^t(x, ·) - π‖_V ≤ C(x) ρ^t. MALA geometrically ergodic for super-exponential tails (Roberts-Tweedie 1996).

11.5 Optimal scaling theory

For d-dim iid product target, RWM has acceptance ≈ 0.234 (Roberts-Gelman-Gilks 1997); MALA acceptance ≈ 0.574 (Roberts-Rosenthal 1998); HMC acceptance ≈ 0.65 - 0.8 (Beskos et al. 2013).

11.6 Wasserstein / Sinkhorn analyses

Distance to stationary in W_2 / KL / TV for ULA/MALA/HMC via Talagrand T1, log-Sobolev, Poincaré, Bakry-Émery curvature criteria (Bakry-Gentil-Ledoux 2014).

11.7 Discretization error of Langevin

ULA bias O(ε); MALA O(ε²) for smooth targets; high-order integrators (Strang splitting, fourth-order leapfrog) give O(ε^p) for HMC.

12. Application domains

Bayesian statistics: posterior + predictive + model comparison (Bayes factors via SMC/AIS/nested sampling)
Statistical physics: Ising, Potts, lattice gauge (origin of Metropolis 1953)
Computational chemistry: free-energy methods, replica exchange MD
Population genetics: coalescent / ABC inference; MCMC over genealogies
Cosmology: parameter inference (CosmoMC, MontePython, Cobaya), nested sampling for evidence
Machine learning: Bayesian neural nets (HMC, SGLD, variational), VAEs (reparameterization), diffusion models (score-matched Langevin)
Robotics: particle filters for SLAM; sampling-based motion planning (RRT, PRM share random-sampling primitives)
Finance: option pricing (QMC), risk integrals, Bayesian forecasting

Compendium

Explorer

Sampling Algorithms Catalog

Sampling Algorithms Catalog

1. Direct (non-iterative) samplers

1.1 Inverse-CDF (probability integral transform)

1.2 Box-Muller (1958)

1.3 Ziggurat (Marsaglia-Tsang 2000)

1.4 Rejection sampling (von Neumann 1951)

1.5 Ratio-of-Uniforms (Kinderman-Monahan 1977)

1.6 Composition method

1.7 Importance sampling

1.8 Stratified sampling

1.9 Latin Hypercube Sampling (McKay-Beckman-Conover 1979)

2. Quasi-Monte Carlo (low-discrepancy sequences)

2.1 Halton sequence (1960)

2.2 Sobol sequence (1967)

2.3 Niederreiter (t,m,s)-nets and (t,s)-sequences

2.4 Faure sequence (1982)

2.5 Scrambled QMC (Owen 1995, 1997)

2.6 Randomized lattice rules

3. Markov chain Monte Carlo (MCMC)

3.1 Metropolis-Hastings (Metropolis 1953, Hastings 1970)

3.2 Gibbs sampler (Geman-Geman 1984)

3.3 Metropolis-within-Gibbs

3.4 Slice sampling (Neal 2003)

3.5 Hamiltonian (Hybrid) Monte Carlo — HMC (Duane et al. 1987; Neal 2011)

3.6 No-U-Turn Sampler — NUTS (Hoffman-Gelman 2014)

3.7 Riemannian-manifold HMC — RMHMC (Girolami-Calderhead 2011)

3.8 Langevin algorithms

3.9 Parallel tempering / replica exchange (Geyer 1991, Marinari-Parisi 1992)

3.10 Adaptive MCMC (Haario-Saksman-Tamminen 2001)

3.11 Reversible-jump MCMC (Green 1995)

3.12 Pseudo-marginal MCMC (Andrieu-Roberts 2009)

3.13 Stochastic-gradient MCMC

3.14 Bouncy Particle / Zig-Zag (piecewise-deterministic MCMC)

3.15 Implicit / Hamiltonian implicit samplers

4. Sequential Monte Carlo (SMC) and particle methods

4.1 Particle filter (Gordon-Salmond-Smith 1993)

4.2 Sequential Monte Carlo samplers (Del Moral-Doucet-Jasra 2006)

4.3 Annealed Importance Sampling — AIS (Neal 2001)

4.4 Population Monte Carlo (Cappé-Guillin-Marin-Robert 2004)

4.5 Nested sampling (Skilling 2004)

5. Variational inference (deterministic approximation)

5.1 Mean-field variational inference (Saul-Jaakkola-Jordan 1996)

5.2 Stochastic variational inference — SVI (Hoffman-Blei-Wang-Paisley 2013)

5.3 Black-box variational inference — BBVI (Ranganath-Gerrish-Blei 2014)

5.4 Reparameterization trick (Kingma-Welling 2014, Rezende-Mohamed-Wierstra 2014)

5.5 Normalizing flows for VI (Rezende-Mohamed 2015)

5.6 Amortized inference

5.7 Stein Variational Gradient Descent — SVGD (Liu-Wang 2016)

5.8 Pathfinder (Zhang et al. 2021)

5.9 Normalizing-flow samplers / BayesFlow

6. Bayesian quadrature and probabilistic integration

6.1 Bayesian Monte Carlo (O’Hagan 1991)

6.2 Probabilistic numerics (Hennig-Osborne-Girolami 2015)

6.3 Quadrature rules (deterministic alternative to MC)

7. Specialized samplers

7.1 Dirichlet process / Chinese restaurant / Stick-breaking

7.2 Determinantal point process sampling

7.3 Coupling from the past — CFTP (Propp-Wilson 1996)

7.4 Hit-and-run (Smith 1984)

7.5 Multilevel Monte Carlo — MLMC (Giles 2008)

8. Diagnostics and convergence

8.1 Trace plots and autocorrelation

8.2 Gelman-Rubin R̂ (Gelman-Rubin 1992)

8.3 Effective Sample Size (ESS)

8.4 Divergence diagnostics (Stan)

8.5 Posterior predictive checks

8.6 Simulation-Based Calibration — SBC (Talts-Betancourt-Simpson-Vehtari-Gelman 2018)

9. Software / Implementations

9.1 Stan (Carpenter et al. 2017)

9.2 PyMC (Salvatier-Wiecki-Fonnesbeck 2016, v5 by Wiecki et al. 2023)

9.3 NumPyro (Phan-Pradhan-Jankowiak 2019)

9.4 Turing.jl (Ge-Xu-Ghahramani 2018)

9.5 BlackJAX (Cabezas et al. 2024)

9.6 emcee (Foreman-Mackey et al. 2013)

9.7 dynesty / MultiNest / PolyChord

9.8 sbi / BayesFlow

10. Recent advances (2020-2025)