MCMC & Monte Carlo Sampling — Math Reference

1. At a glance

Monte Carlo (MC) methods draw samples from a probability distribution p(x) to approximate expectations, integrals, marginals, and uncertainty. Markov Chain Monte Carlo (MCMC) extends this to settings where direct sampling is infeasible — typically because the normalizing constant Z = ∫ p̃(x) dx is intractable, as in most Bayesian posteriors p(θ|D) ∝ p(D|θ)·p(θ).

Core estimator: for X_i ~ p iid,

E_p[f(X)] ≈ (1/n) · Σ_{i=1..n} f(X_i)

Strong Law of Large Numbers gives consistency; CLT gives O(1/√n) error rate regardless of dimension (the famous escape from the curse of dimensionality for integration).

Two big families:

Exact / direct sampling — when F⁻¹ is available (inverse CDF), or special tricks (Box-Muller for Gaussian) work. Fast, iid samples.
Markov-chain sampling (MCMC) — construct a Markov chain X_0 → X_1 → ... whose stationary distribution is p. After burn-in, samples are correlated draws from p. Asymptotically correct under ergodicity.

Used for: Bayesian inference, statistical physics, finance derivatives, computer graphics (path tracing), state-space models, model selection, rare-event simulation.

Cross-refs: see [[Math/bayesian-inference]] for the inferential framework MCMC supports, [[Math/probability-fundamentals]] for measure-theoretic foundations.

2. Direct sampling

2.1 Inverse CDF (probability integral transform)

If F is the CDF of X and U ~ Uniform(0,1), then X = F⁻¹(U) has distribution F. Works when F⁻¹ is computable.

Examples:

Exponential: F(x) = 1 − e^{−λx} → X = −log(1 − U)/λ.
Cauchy: X = tan(π·(U − 0.5)).
Discrete: store CDF, binary-search on U.

2.2 Box-Muller (Box + Muller 1958)

Two iid U_1, U_2 ~ Uniform(0,1) → two iid Z_1, Z_2 ~ N(0,1):

Z_1 = sqrt(−2·log(U_1)) · cos(2π·U_2)
Z_2 = sqrt(−2·log(U_1)) · sin(2π·U_2)

2.3 Marsaglia polar method (Marsaglia + Bray 1964)

Avoids the trig calls of Box-Muller. Sample (V_1, V_2) ~ Uniform(square), reject if S = V_1² + V_2² > 1, else

Z_1 = V_1 · sqrt(−2·log(S)/S)
Z_2 = V_2 · sqrt(−2·log(S)/S)

2.4 Ziggurat (Marsaglia + Tsang 2000)

State-of-the-art fast Gaussian (and exponential) sampler. Pre-tabulates a stack of rectangles covering the density; ~97% of samples accepted by table lookup, fallback for tail. Used in NumPy randn, R, MATLAB.

2.5 Rejection sampling (von Neumann 1951)

Given p(x) ≤ M·q(x) for an envelope q we can sample:

Draw X ~ q.
Draw U ~ Uniform(0, 1).
Accept X if U ≤ p(X)/(M·q(X)), else reject.

Acceptance probability is 1/M. Inefficient in high dimensions: M typically grows exponentially with d.

3. Importance sampling (IS)

Rewrite the expectation under a tractable proposal q:

E_p[f(X)] = ∫ f(x)·p(x) dx = ∫ f(x)·(p(x)/q(x))·q(x) dx = E_q[f(X)·w(X)]

where w(x) = p(x)/q(x) is the importance weight. Estimator:

μ̂_IS = (1/n) · Σ f(X_i)·w(X_i),   X_i ~ q

When p is known only up to a constant (p̃ = Z·p), use self-normalized IS (SNIS):

μ̂_SNIS = Σ f(X_i)·w̃(X_i) / Σ w̃(X_i),   w̃ = p̃/q

SNIS is biased (consistent) but works without Z.

Variance: Var(μ̂_IS) = Var_q(f·w)/n. Minimum at q*(x) ∝ |f(x)|·p(x). In practice, choose q with heavier tails than p to avoid infinite-variance weights; check the effective sample size ESS = (Σw_i)² / Σ w_i² — small ESS means a few weights dominate.

Use cases: rare-event simulation (tilted proposal); particle filters (sequential IS, see §10); variational inference reweighting (IWAE, Burda et al 2016).

4. MCMC: the core idea

Metropolis et al 1953 (Metropolis, Rosenbluth, Rosenbluth, Teller, Teller) introduced the idea at Los Alamos: simulate a Markov chain whose stationary distribution is the Boltzmann distribution. Hastings 1970 generalized to arbitrary p.

Key facts:

Chain transition kernel K(x → x') must satisfy detailed balance: p(x)·K(x → x') = p(x')·K(x' → x). This implies p is stationary.
Irreducibility + aperiodicity ⇒ unique stationary distribution p, and (1/n)·Σ f(X_i) → E_p[f] (ergodic theorem).
Samples are correlated, so effective sample size n_eff < n.
Need burn-in to forget initial state.

The art is designing fast-mixing kernels.

5. Metropolis-Hastings (MH)

Given current state x, propose x' ~ q(·|x), accept with probability

α(x → x') = min(1, [p(x')·q(x|x')] / [p(x)·q(x'|x)])

If accepted, x_{t+1} = x'; else x_{t+1} = x. The acceptance ratio uses only ratios of p, so the normalizer Z cancels.

# Random-walk Metropolis-Hastings (toy, log-target)
import numpy as np
 
def rw_mh(log_p, x0, n_iter, step):
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    samples = np.empty((n_iter, x.size))
    accept = 0
    for t in range(n_iter):
        x_prop = x + step * np.random.randn(*x.shape)
        lp_prop = log_p(x_prop)
        # Symmetric q ⇒ q-terms cancel
        if np.log(np.random.rand()) < lp_prop - lp:
            x, lp = x_prop, lp_prop
            accept += 1
        samples[t] = x
    return samples, accept / n_iter

Tuning random-walk MH: Roberts + Gelman + Gilks 1997 showed the optimal acceptance rate is 0.234 for d → ∞ (Gaussian target, isotropic Gaussian proposal). In practice tune step so acceptance ≈ 0.25–0.35 in moderate dim, ≈ 0.44 in 1D.

Independence sampler: q(x'|x) = q(x') ignores current state — essentially IS with a Markov-chain twist. Good when q ≈ p.

Adaptive MH (Haario, Saksman, Tamminen 2001) — adapt proposal covariance from sample history. Care: adaptation must diminish or break detailed balance.

6. Gibbs sampling (Geman + Geman 1984)

For multivariate x = (x_1, ..., x_d), sample one component at a time from its full conditional:

x_i^{(t+1)} ~ p(x_i | x_1^{(t+1)}, ..., x_{i−1}^{(t+1)}, x_{i+1}^{(t)}, ..., x_d^{(t)})

This is a special MH where each conditional update has acceptance 1. Works beautifully when conditionals are tractable — conjugate Bayesian hierarchical models (Normal-Inverse-Gamma, Beta-Binomial, Dirichlet-Multinomial).

Block Gibbs: update several components jointly to reduce autocorrelation when components are strongly correlated.

Collapsed Gibbs: marginalize out some variables analytically before sampling (e.g., LDA topic models, Griffiths + Steyvers 2004).

Trap: Gibbs mixes poorly when components are highly correlated — it can only move along axis-aligned directions.

7. Hamiltonian Monte Carlo (HMC)

Introduced by Duane, Kennedy, Pendleton, Roweth 1987 in lattice QCD as “Hybrid Monte Carlo”; brought to statistics by Neal 1993 (and 2011 chapter in MCMC Handbook).

Idea: augment state space with auxiliary momentum p ~ N(0, M) (mass matrix M). Define Hamiltonian

H(x, p) = U(x) + K(p) = −log p(x) + (1/2)·pᵀ M⁻¹ p

Joint density π(x, p) ∝ exp(−H(x, p)) factorizes; marginal on x is the target p. To propose, simulate Hamiltonian dynamics

dx/dt =  ∂H/∂p = M⁻¹·p
dp/dt = −∂H/∂x = −∇U(x) = ∇ log p(x)

for L steps of size ε via the leapfrog integrator (symplectic, second-order):

p ← p − (ε/2)·∇U(x)
x ← x + ε·M⁻¹·p
p ← p − (ε/2)·∇U(x)

Leapfrog preserves phase-space volume exactly and H up to O(ε²). Accept proposed (x', p') with prob min(1, exp(H(x,p) − H(x',p'))).

# Vanilla HMC step (M = I)
def hmc_step(x, log_p, grad_log_p, eps, L):
    p = np.random.randn(*x.shape)
    H0 = -log_p(x) + 0.5 * (p @ p)
    x_new, p_new = x.copy(), p.copy()
    # Leapfrog L steps
    p_new = p_new + 0.5 * eps * grad_log_p(x_new)
    for _ in range(L - 1):
        x_new = x_new + eps * p_new
        p_new = p_new + eps * grad_log_p(x_new)
    x_new = x_new + eps * p_new
    p_new = p_new + 0.5 * eps * grad_log_p(x_new)
    H1 = -log_p(x_new) + 0.5 * (p_new @ p_new)
    if np.log(np.random.rand()) < H0 - H1:
        return x_new, True
    return x, False

Because HMC uses gradients, it makes coherent long moves in high dimensions and dramatically outperforms random-walk MH on smooth targets — scaling is roughly O(d^{5/4}) work per effective sample vs O(d^2) for RW-MH (Beskos et al 2013).

8. NUTS — No-U-Turn Sampler (Hoffman + Gelman 2014)

HMC requires choosing trajectory length L·ε. Too short → diffusive. Too long → wasted gradient evals and possible U-turns back toward start.

NUTS solves this: build a balanced binary tree of leapfrog steps in both forward and backward time directions, doubling until a U-turn criterion triggers:

(x_+ − x_−) · p_+ < 0   or   (x_+ − x_−) · p_− < 0

Sample uniformly from the trajectory with care to preserve detailed balance (slice sampler over (x, p, u)).

Plus dual-averaging step-size adaptation (Nesterov 2009 applied by H+G): during warmup, adapt ε to target an acceptance statistic (default 0.8 in Stan).

NUTS is the default sampler in Stan, PyMC, NumPyro, Turing.jl. The original H+G paper is essential reading.

9. SGLD — Stochastic Gradient Langevin Dynamics (Welling + Teh 2011, ICML)

Marries SGD with Langevin diffusion to scale Bayesian inference to massive data. Discretized overdamped Langevin:

x_{t+1} = x_t + (ε_t/2)·(∇ log p(x_t) + ∇ log p(D|x_t)) + sqrt(ε_t)·η_t,   η_t ~ N(0,I)

Replace the full-data gradient with a minibatch estimate. As ε_t → 0 (Robbins-Monro schedule), the discretization error and minibatch noise both vanish, and the chain targets the posterior — no MH accept step needed.

Extensions:

SGHMC (Chen + Fox + Guestrin 2014) — stochastic-gradient HMC with friction term.
pSGLD (Li et al 2016) — preconditioned with Fisher-like matrix (RMSProp-style).
SG-NHT (Ding et al 2014) — Nosé-Hoover thermostat to absorb gradient-noise temperature drift.

These are foundational for Bayesian deep learning.

10. Sequential Monte Carlo (SMC) and particle filters

A population of weighted particles {(x_i^{(t)}, w_i^{(t)})} evolves through a sequence of distributions π_0, π_1, ..., π_T. Each step:

Propagate: draw x_i^{(t)} ~ K_t(·|x_i^{(t−1)}).
Reweight: w_i^{(t)} ∝ w_i^{(t−1)} · π_t(x_i^{(t)})/(π_{t−1}(x_i^{(t−1)})·K_t).
Resample when ESS drops below threshold (typically N/2): draw N particles with replacement proportional to weights.

Special cases:

Bootstrap particle filter (Gordon, Salmond, Smith 1993) — state-space models with π_t = p(x_t | y_{1:t}).
SMC samplers (Del Moral, Doucet, Jasra 2006) — tempered posteriors π_t ∝ p(D|x)^{β_t}·p(x).

Applications: state-space estimation, robotics SLAM (see [[Robotics/bayesian-estimation]]), rare-event simulation, model evidence estimation, tracking, finance.

11. Slice sampling (Neal 2003)

Sample uniformly from the area under the density. Introduce auxiliary u:

u ~ Uniform(0, p(x)).
x_new ~ Uniform({x' : p(x') > u}) — the horizontal “slice”.

Step 2 is the work; Neal gives “stepping out + shrinkage” rules for finding the slice without knowing it analytically. No tuning parameter for step size. Used in BUGS/JAGS for conditionally non-conjugate updates, and in emcee’s ensemble slice variants.

12. Reversible-jump MCMC (Green 1995, Biometrika)

For variable-dimension parameter spaces — model selection across different numbers of components, regression with unknown number of changepoints, mixture models with unknown K. Augments the proposal with a “dimension-matching” diffeomorphism and a Jacobian factor in the acceptance ratio. Notoriously hard to tune in practice; superseded in many cases by SMC samplers or Bayesian nonparametrics.

13. Parallel tempering / replica exchange

Run M parallel chains at inverse temperatures β_1 = 1 > β_2 > ... > β_M, each targeting p(x)^{β_m}. High-temperature chains traverse the landscape easily and “donate” mode-jumping via swap proposals:

swap(m, m+1) accept prob = min(1, exp((β_m − β_{m+1})·(U(x_{m+1}) − U(x_m))))

Critical for multimodal targets, glassy energy landscapes, deep generative model posteriors.

14. Convergence diagnostics

No diagnostic can prove a chain has converged — only flag obvious failures. Combine several.

14.1 R̂ — potential scale reduction (Gelman + Rubin 1992; refined in Vehtari et al 2021)

Run m ≥ 4 chains of length n. Compute between-chain variance B and within-chain W:

R̂ = sqrt(((n−1)/n + B/(n·W)) / 1)   (rank-normalized split version preferred)

R̂ > 1.01 ⇒ chains have not mixed. Modern practice (Vehtari 2021) uses split-R̂ + rank normalization to catch non-stationarity and fat tails.

14.2 Effective Sample Size (ESS)

ESS = n / (1 + 2·Σ_{t≥1} ρ_t)

where ρ_t is lag-t autocorrelation. Stan reports both bulk-ESS (for posterior mean) and tail-ESS (for tail quantiles). Target ESS ≥ 400 per chain for stable quantile estimates.

14.3 Trace plot

Plot x_t vs t for each chain. Stationary + well-mixed chains look like “fuzzy caterpillars” overlapping each other. Drift, sticking, multi-modal jumps are visible.

14.4 Posterior predictive check (Gelman, Meng, Stern 1996)

For each posterior sample θ_s, generate replicated data y_rep_s ~ p(·|θ_s). Compare summary statistics T(y_rep) to T(y_obs) — Bayesian p-values. Detects model mis-specification, not sampler issues.

14.5 HMC-specific: divergences and energy

A divergent transition = leapfrog error blew up (often a “funnel” region with strong curvature, e.g., hierarchical-model variance hyperparameter near 0). Even a handful of divergences biases inference.

E-BFMI (Energy Bayesian Fraction of Missing Information, Betancourt 2018): low values indicate the momentum resample isn’t exploring the energy distribution well.

15. HMC tuning tricks

Non-centered parameterization for hierarchical models. Instead of θ_i ~ N(μ, τ), write θ_i = μ + τ·z_i with z_i ~ N(0,1). Decouples the funnel and dramatically improves geometry. The “8-schools” example (Rubin 1981; Gelman BDA3) is the canonical demonstration.
Mass matrix adaptation: set M⁻¹ to the empirical posterior covariance during warmup. Diagonal mass is default in Stan; dense is available but expensive.
Dual averaging step size: adapt ε to hit target accept (Stan default 0.8; raise to 0.95–0.99 if divergences appear).
Increase max tree depth (NUTS) if you hit the cap — but first check geometry.
Rescale parameters to roughly unit scale; helps the diagonal mass matrix.
Reparameterize bounded variables with log (positive), logit (probabilities), softmax/stick-breaking (simplices), Cholesky (covariance).

16. Software

Tool	Year	Backend / language	Notes
BUGS	1989	Pascal / Component Pascal	First general-purpose Bayesian software; Gibbs-based.
WinBUGS	1997	Windows	Standard for 2000s applied Bayes.
JAGS	2007	C++	Just Another Gibbs Sampler; cross-platform.
Stan (Carpenter et al 2017, JSS)	2012	C++; R/Python/Julia front-ends	HMC + NUTS; auto-diff; the gold standard for applied Bayesian models.
PyMC v5 (Salvatier, Wiecki, Fonnesbeck)	2022 rewrite	Python on PyTensor (JAX/Numba)	NUTS via BlackJAX or internal; very ergonomic.
NumPyro (Phan, Pradhan, Jankowiak 2019)	2019	JAX	Fastest NUTS; great for vectorized models.
Pyro (Bingham et al 2018)	2018	PyTorch	Focus on VI / SVI; supports HMC/NUTS.
Turing.jl (Ge, Xu, Ghahramani 2018)	2018	Julia	Composable inference (HMC, particle, VI).
emcee (Foreman-Mackey 2013, PASP)	2013	Python	Affine-invariant ensemble sampler (Goodman + Weare 2010); standard in astronomy.
dynesty / nestle / MultiNest	various	Python / C	Nested sampling (Skilling 2006) — Bayesian evidence + multimodal.
BlackJAX	2022	JAX	Composable MCMC kernels; powers PyMC + NumPyro.
TensorFlow Probability	2017	TF / JAX	MCMC chains + VI + distributions.

Affine-invariant ensemble (emcee): walkers update via a stretch move using another walker’s position, automatically rescaling to local geometry — no gradient needed. Excellent for low-to-moderate dim physics-style problems.

17. Use cases

Bayesian parameter estimation — the default modern workflow. See [[Math/bayesian-inference]].
Bayesian hierarchical / multilevel models — 8-schools, longitudinal regression, election forecasting (Gelman + Hill 2007).
Gaussian process hyperparameters — marginal-likelihood posterior is non-Gaussian; HMC or slice sampling.
Approximate Bayesian Computation (ABC) — likelihood-free; for simulator-based models in epidemiology, population genetics, cosmology (Marin, Pudlo, Robert, Ryder 2012).
Cosmological parameter inference — emcee/MultiNest/Cobaya on Planck CMB likelihoods; LIGO gravitational-wave parameter estimation (PyCBC, Bilby).
Phylogenetics — MrBayes (Ronquist + Huelsenbeck 2003), BEAST/BEAST2 (Drummond + Rambaut 2007; Bouckaert et al 2019); tree-space MCMC.
Pharmacokinetic / pharmacodynamic modeling — population PK/PD via NONMEM, Stan, Monolix.
Statistical mechanics / lattice QCD — origin of HMC; partition function expectations.
Finance — pricing path-dependent derivatives via Monte Carlo + variance reduction; QMC for high-dim integrals.
Computer graphics — Metropolis Light Transport (Veach + Guibas 1997).
Model selection / averaging — RJMCMC, SMC samplers for evidence.
Tracking / SLAM — particle filters (see [[Robotics/bayesian-estimation]]).

Note: LLM token sampling (temperature, top-k, top-p, nucleus) is not MCMC — it’s iid categorical sampling from a learned conditional. Don’t confuse the two.

18. Variance reduction

Plain MC has O(σ/√n) error. For fixed budget, smaller σ matters.

18.1 Antithetic variates

For symmetric proposals, pair X with −X (or 1 − U for uniforms). If f is monotonic, the two estimators are negatively correlated, cutting variance.

18.2 Control variates

Find h(x) with known E_p[h] = μ_h. Estimator:

μ̂_CV = (1/n)·Σ(f(X_i) − c·(h(X_i) − μ_h)),   c* = Cov(f,h)/Var(h)

Variance reduction factor 1 − Corr(f, h)². Used in derivative pricing (use Black-Scholes closed-form as control for non-Black-Scholes payoffs).

18.3 Stratified sampling

Partition Ω = ∪ Ω_k, sample n_k from p restricted to Ω_k. Variance is Σ p_k²·σ_k²/n_k; optimal allocation n_k ∝ p_k·σ_k (Neyman allocation).

18.4 Common random numbers

When comparing two estimates that differ only in parameters, use the same U_i for both → strong positive correlation → small Var(estimator difference).

18.5 Rao-Blackwellization

Replace f(X) with conditional expectation E[f(X) | T(X)] if available — strict variance reduction by tower law.

19. Quasi-Monte Carlo (QMC)

Replace iid U_i ~ Uniform([0,1]^d) with low-discrepancy sequences that fill the cube more uniformly than random:

Halton sequence (Halton 1960) — radical-inverse base-p_i for dim i.
Sobol’ sequence (Sobol’ 1967) — direction numbers; the workhorse of QMC.
Faure sequence (Faure 1982) — base-prime ≥ d.
Niederreiter (1987) — generalization with discrepancy bounds.

Koksma-Hlawka inequality: integration error ≤ V(f)·D*(x_1,...,x_n), where D* is star discrepancy. Sobol’/Halton achieve D* = O((log n)^d / n), giving asymptotic error O((log n)^d / n) vs MC’s O(1/√n) — a huge win for smooth integrands and moderate d.

Scrambled QMC (Owen 1995) randomizes a low-discrepancy sequence to recover an unbiased estimator with confidence intervals, while keeping near-O(1/n) variance for smooth f.

Standard in: finance (basket option pricing, especially in dim 10–100), robotics path planning, computer graphics, high-dim integration. Implementations: scipy.stats.qmc, torch.quasirandom, SobolNum.

20. Modern MCMC research (selected directions)

Riemannian Manifold HMC (RMHMC) (Girolami + Calderhead 2011, JRSS-B) — let the mass matrix M(x) = G(x) be the local Fisher metric, so dynamics flow on the statistical manifold. Handles ill-conditioned posteriors but requires expensive geometry computations.
Adaptive NUTS with dense mass matrix — Stan supports; expensive but pays off for highly correlated posteriors.
HMC on Lie groups — for SO(3) and SE(3) orientation/pose posteriors (Byrne + Girolami 2013; Liu + Zhu et al 2016). See [[Math/lie-groups-so3-se3]].
Neural Spline Flow + MCMC hybrids — train a normalizing flow as the proposal in IS/MH; “transport map” approach (Marzouk et al 2016; Hoffman et al 2019).
Piecewise Deterministic Markov Processes (PDMPs) — bouncy particle sampler (Bouchard-Côté et al 2018), Zig-Zag sampler (Bierkens, Fearnhead, Roberts 2019); non-reversible chains with super-efficient mixing in some regimes.
MALA — Metropolis-adjusted Langevin algorithm (Roberts + Tweedie 1996); single-step HMC with L=1.
Stein Variational Gradient Descent (Liu + Wang 2016) — deterministic particle method; not strictly MCMC but a competitor.
Coupled MCMC (Jacob, O’Leary, Atchadé 2020) — unbiased estimators from finite-time coupled chains.
Quasi-Newton HMC, NUTS-on-flows — active areas.

21. Common pitfalls

Too short warmup / burn-in — chains haven’t forgotten initialization. Stan’s default 1000 warmup is a minimum, not a maximum.
One chain only — can’t detect non-mixing or multimodality. Always run ≥ 4 chains from dispersed initial values.
High autocorrelation — thousands of raw samples but ESS = 50. Diagnose with autocorrelation plot; remedy with better parameterization, HMC over RW-MH, block updates.
Thinning misuse — thinning saves disk space but loses information; don’t thin to “fix” autocorrelation, just report ESS.
Ignoring HMC divergences — even one divergence biases tails. Reparameterize, raise adapt_delta, shrink step size — don’t suppress the warning.
Multimodal posterior, one mode found — symptoms: chains stuck in different modes, R̂ huge between chains but tiny within. Remedies: parallel tempering, mode-finding initialization, mixture proposals, SMC samplers, nested sampling.
Improper priors → improper posterior — sampler may still “run” but the answer is meaningless. Check posterior is proper before sampling.
Initialization on a flat region of the prior — gradient is tiny, HMC drifts forever. Initialize near MAP or sensible default.
Bad mass matrix — diagonal M when posterior is strongly correlated → very small effective step. Use full mass matrix or reparameterize to decorrelate.
Confusing IS effective sample size with MCMC ESS — different quantities; both important.

22. Cross-references

[[Math/bayesian-inference]] — the inferential framework; conjugate updating, hierarchical models, model comparison.
[[Math/probability-fundamentals]] — measure-theoretic basis, conditional expectation, ergodic theorem.
[[Math/hypothesis-testing-mle]] — MLE / frequentist counterpart.
[[Math/_index]] — full Math tier index.
[[Robotics/bayesian-estimation]] — Kalman / EKF / UKF / particle filter (sequential MCMC) for robotics state estimation.
[[Math/gradient-descent-variants]] — SGD background for SGLD / SGHMC.
[[Math/lie-groups-so3-se3]] — geometry for HMC on rotations and rigid-body poses.
[[Math/convex-optimization]] — MAP estimation as the optimization counterpart of Bayesian sampling.

23. Citations

Metropolis, Rosenbluth, Rosenbluth, Teller, Teller, 1953. “Equation of State Calculations by Fast Computing Machines”. J. Chem. Phys. 21, 1087–1092.
Hastings, 1970. “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”. Biometrika 57, 97–109.
Geman + Geman, 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images”. IEEE PAMI 6, 721–741.
Box + Muller, 1958. “A Note on the Generation of Random Normal Deviates”. Ann. Math. Stat. 29, 610–611.
Marsaglia + Tsang, 2000. “The Ziggurat Method for Generating Random Variables”. J. Stat. Software.
Duane, Kennedy, Pendleton, Roweth, 1987. “Hybrid Monte Carlo”. Phys. Lett. B 195, 216–222.
Neal, 1993. “Probabilistic Inference Using Markov Chain Monte Carlo Methods”. Technical Report CRG-TR-93-1, U. Toronto.
Neal, 2011. “MCMC Using Hamiltonian Dynamics”. In Handbook of Markov Chain Monte Carlo.
Hoffman + Gelman, 2014. “The No-U-Turn Sampler”. JMLR 15, 1593–1623.
Welling + Teh, 2011. “Bayesian Learning via Stochastic Gradient Langevin Dynamics”. ICML.
Chen + Fox + Guestrin, 2014. “Stochastic Gradient Hamiltonian Monte Carlo”. ICML.
Green, 1995. “Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination”. Biometrika 82, 711–732.
Gelman + Rubin, 1992. “Inference from Iterative Simulation Using Multiple Sequences”. Stat. Sci. 7, 457–472.
Vehtari, Gelman, Simpson, Carpenter, Bürkner, 2021. “Rank-Normalization, Folding, and Localization: An Improved R̂”. Bayesian Analysis 16, 667–718.
Gordon, Salmond, Smith, 1993. “Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation”. IEE Proc. F 140, 107–113.
Del Moral, Doucet, Jasra, 2006. “Sequential Monte Carlo Samplers”. JRSS-B 68, 411–436.
Neal, 2003. “Slice Sampling”. Ann. Stat. 31, 705–767.
Girolami + Calderhead, 2011. “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods”. JRSS-B 73, 123–214.
Roberts + Tweedie, 1996. “Exponential Convergence of Langevin Distributions and Their Discrete Approximations”. Bernoulli 2, 341–363.
Goodman + Weare, 2010. “Ensemble Samplers with Affine Invariance”. CAMCoS 5, 65–80.
Foreman-Mackey, Hogg, Lang, Goodman, 2013. “emcee: The MCMC Hammer”. PASP 125, 306–312.
Skilling, 2006. “Nested Sampling for General Bayesian Computation”. Bayesian Analysis 1, 833–860.
Sobol’, 1967. “On the Distribution of Points in a Cube and the Approximate Evaluation of Integrals”. USSR Comp. Math. 7, 86–112.
Halton, 1960. “On the Efficiency of Certain Quasi-Random Sequences of Points”. Num. Math. 2, 84–90.
Owen, 1995. “Randomly Permuted (t,m,s)-Nets and (t,s)-Sequences”. In Monte Carlo and Quasi-Monte Carlo Methods.
Roberts, Gelman, Gilks, 1997. “Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms”. Ann. Appl. Prob. 7, 110–120.
Beskos, Pillai, Roberts, Sanz-Serna, Stuart, 2013. “Optimal Tuning of the Hybrid Monte Carlo Algorithm”. Bernoulli 19, 1501–1534.
Haario, Saksman, Tamminen, 2001. “An Adaptive Metropolis Algorithm”. Bernoulli 7, 223–242.
Carpenter et al, 2017. “Stan: A Probabilistic Programming Language”. J. Stat. Software 76, 1–32.
Phan, Pradhan, Jankowiak, 2019. “Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro”. arXiv:1912.11554.
Bingham et al, 2018. “Pyro: Deep Universal Probabilistic Programming”. JMLR 20, 1–6.
Ge, Xu, Ghahramani, 2018. “Turing: A Language for Flexible Probabilistic Inference”. AISTATS.
Bouchard-Côté, Vollmer, Doucet, 2018. “The Bouncy Particle Sampler”. JASA 113, 855–867.
Bierkens, Fearnhead, Roberts, 2019. “The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data”. Ann. Stat. 47, 1288–1320.
Veach + Guibas, 1997. “Metropolis Light Transport”. SIGGRAPH.
Brooks, Gelman, Jones, Meng (eds), 2011. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
Robert + Casella, 2004. Monte Carlo Statistical Methods, 2nd ed. Springer.
Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, 2013. Bayesian Data Analysis, 3rd ed. (BDA3). Chapman & Hall/CRC.
Betancourt, 2018. “A Conceptual Introduction to Hamiltonian Monte Carlo”. arXiv:1701.02434.
Liu + Wang, 2016. “Stein Variational Gradient Descent”. NeurIPS.
Jacob, O’Leary, Atchadé, 2020. “Unbiased Markov Chain Monte Carlo Methods with Couplings”. JRSS-B 82, 543–600.

Compendium

Explorer

MCMC & Monte Carlo Sampling — Math Reference

MCMC & Monte Carlo Sampling — Math Reference

1. At a glance

2. Direct sampling

2.1 Inverse CDF (probability integral transform)

2.2 Box-Muller (Box + Muller 1958)

2.3 Marsaglia polar method (Marsaglia + Bray 1964)

2.4 Ziggurat (Marsaglia + Tsang 2000)

2.5 Rejection sampling (von Neumann 1951)

3. Importance sampling (IS)

4. MCMC: the core idea

5. Metropolis-Hastings (MH)

6. Gibbs sampling (Geman + Geman 1984)

7. Hamiltonian Monte Carlo (HMC)

8. NUTS — No-U-Turn Sampler (Hoffman + Gelman 2014)

9. SGLD — Stochastic Gradient Langevin Dynamics (Welling + Teh 2011, ICML)

10. Sequential Monte Carlo (SMC) and particle filters

11. Slice sampling (Neal 2003)

12. Reversible-jump MCMC (Green 1995, Biometrika)

13. Parallel tempering / replica exchange

14. Convergence diagnostics

14.1 R̂ — potential scale reduction (Gelman + Rubin 1992; refined in Vehtari et al 2021)

14.2 Effective Sample Size (ESS)

14.3 Trace plot

14.4 Posterior predictive check (Gelman, Meng, Stern 1996)

14.5 HMC-specific: divergences and energy

15. HMC tuning tricks

16. Software

17. Use cases

18. Variance reduction

18.1 Antithetic variates

18.2 Control variates

18.3 Stratified sampling

18.4 Common random numbers

18.5 Rao-Blackwellization

19. Quasi-Monte Carlo (QMC)

20. Modern MCMC research (selected directions)

21. Common pitfalls

22. Cross-references

23. Citations

Graph View

Table of Contents