MCMC & Monte Carlo Sampling — Math Reference

1. At a glance

Monte Carlo (MC) methods draw samples from a probability distribution p(x) to approximate expectations, integrals, marginals, and uncertainty. Markov Chain Monte Carlo (MCMC) extends this to settings where direct sampling is infeasible — typically because the normalizing constant Z = ∫ p̃(x) dx is intractable, as in most Bayesian posteriors p(θ|D) ∝ p(D|θ)·p(θ).

Core estimator: for X_i ~ p iid,

E_p[f(X)] ≈ (1/n) · Σ_{i=1..n} f(X_i)

Strong Law of Large Numbers gives consistency; CLT gives O(1/√n) error rate regardless of dimension (the famous escape from the curse of dimensionality for integration).

Two big families:

  • Exact / direct sampling — when F⁻¹ is available (inverse CDF), or special tricks (Box-Muller for Gaussian) work. Fast, iid samples.
  • Markov-chain sampling (MCMC) — construct a Markov chain X_0 → X_1 → ... whose stationary distribution is p. After burn-in, samples are correlated draws from p. Asymptotically correct under ergodicity.

Used for: Bayesian inference, statistical physics, finance derivatives, computer graphics (path tracing), state-space models, model selection, rare-event simulation.

Cross-refs: see [[Math/bayesian-inference]] for the inferential framework MCMC supports, [[Math/probability-fundamentals]] for measure-theoretic foundations.

2. Direct sampling

2.1 Inverse CDF (probability integral transform)

If F is the CDF of X and U ~ Uniform(0,1), then X = F⁻¹(U) has distribution F. Works when F⁻¹ is computable.

Examples:

  • Exponential: F(x) = 1 − e^{−λx}X = −log(1 − U)/λ.
  • Cauchy: X = tan(π·(U − 0.5)).
  • Discrete: store CDF, binary-search on U.

2.2 Box-Muller (Box + Muller 1958)

Two iid U_1, U_2 ~ Uniform(0,1) → two iid Z_1, Z_2 ~ N(0,1):

Z_1 = sqrt(−2·log(U_1)) · cos(2π·U_2)
Z_2 = sqrt(−2·log(U_1)) · sin(2π·U_2)

2.3 Marsaglia polar method (Marsaglia + Bray 1964)

Avoids the trig calls of Box-Muller. Sample (V_1, V_2) ~ Uniform(square), reject if S = V_1² + V_2² > 1, else

Z_1 = V_1 · sqrt(−2·log(S)/S)
Z_2 = V_2 · sqrt(−2·log(S)/S)

2.4 Ziggurat (Marsaglia + Tsang 2000)

State-of-the-art fast Gaussian (and exponential) sampler. Pre-tabulates a stack of rectangles covering the density; ~97% of samples accepted by table lookup, fallback for tail. Used in NumPy randn, R, MATLAB.

2.5 Rejection sampling (von Neumann 1951)

Given p(x) ≤ M·q(x) for an envelope q we can sample:

  1. Draw X ~ q.
  2. Draw U ~ Uniform(0, 1).
  3. Accept X if U ≤ p(X)/(M·q(X)), else reject.

Acceptance probability is 1/M. Inefficient in high dimensions: M typically grows exponentially with d.

3. Importance sampling (IS)

Rewrite the expectation under a tractable proposal q:

E_p[f(X)] = ∫ f(x)·p(x) dx = ∫ f(x)·(p(x)/q(x))·q(x) dx = E_q[f(X)·w(X)]

where w(x) = p(x)/q(x) is the importance weight. Estimator:

μ̂_IS = (1/n) · Σ f(X_i)·w(X_i),   X_i ~ q

When p is known only up to a constant (p̃ = Z·p), use self-normalized IS (SNIS):

μ̂_SNIS = Σ f(X_i)·w̃(X_i) / Σ w̃(X_i),   w̃ = p̃/q

SNIS is biased (consistent) but works without Z.

Variance: Var(μ̂_IS) = Var_q(f·w)/n. Minimum at q*(x) ∝ |f(x)|·p(x). In practice, choose q with heavier tails than p to avoid infinite-variance weights; check the effective sample size ESS = (Σw_i)² / Σ w_i² — small ESS means a few weights dominate.

Use cases: rare-event simulation (tilted proposal); particle filters (sequential IS, see §10); variational inference reweighting (IWAE, Burda et al 2016).

4. MCMC: the core idea

Metropolis et al 1953 (Metropolis, Rosenbluth, Rosenbluth, Teller, Teller) introduced the idea at Los Alamos: simulate a Markov chain whose stationary distribution is the Boltzmann distribution. Hastings 1970 generalized to arbitrary p.

Key facts:

  • Chain transition kernel K(x → x') must satisfy detailed balance: p(x)·K(x → x') = p(x')·K(x' → x). This implies p is stationary.
  • Irreducibility + aperiodicity ⇒ unique stationary distribution p, and (1/n)·Σ f(X_i) → E_p[f] (ergodic theorem).
  • Samples are correlated, so effective sample size n_eff < n.
  • Need burn-in to forget initial state.

The art is designing fast-mixing kernels.

5. Metropolis-Hastings (MH)

Given current state x, propose x' ~ q(·|x), accept with probability

α(x → x') = min(1, [p(x')·q(x|x')] / [p(x)·q(x'|x)])

If accepted, x_{t+1} = x'; else x_{t+1} = x. The acceptance ratio uses only ratios of p, so the normalizer Z cancels.

# Random-walk Metropolis-Hastings (toy, log-target)
import numpy as np
 
def rw_mh(log_p, x0, n_iter, step):
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    samples = np.empty((n_iter, x.size))
    accept = 0
    for t in range(n_iter):
        x_prop = x + step * np.random.randn(*x.shape)
        lp_prop = log_p(x_prop)
        # Symmetric q ⇒ q-terms cancel
        if np.log(np.random.rand()) < lp_prop - lp:
            x, lp = x_prop, lp_prop
            accept += 1
        samples[t] = x
    return samples, accept / n_iter

Tuning random-walk MH: Roberts + Gelman + Gilks 1997 showed the optimal acceptance rate is 0.234 for d → ∞ (Gaussian target, isotropic Gaussian proposal). In practice tune step so acceptance ≈ 0.25–0.35 in moderate dim, ≈ 0.44 in 1D.

Independence sampler: q(x'|x) = q(x') ignores current state — essentially IS with a Markov-chain twist. Good when q ≈ p.

Adaptive MH (Haario, Saksman, Tamminen 2001) — adapt proposal covariance from sample history. Care: adaptation must diminish or break detailed balance.

6. Gibbs sampling (Geman + Geman 1984)

For multivariate x = (x_1, ..., x_d), sample one component at a time from its full conditional:

x_i^{(t+1)} ~ p(x_i | x_1^{(t+1)}, ..., x_{i−1}^{(t+1)}, x_{i+1}^{(t)}, ..., x_d^{(t)})

This is a special MH where each conditional update has acceptance 1. Works beautifully when conditionals are tractable — conjugate Bayesian hierarchical models (Normal-Inverse-Gamma, Beta-Binomial, Dirichlet-Multinomial).

Block Gibbs: update several components jointly to reduce autocorrelation when components are strongly correlated.

Collapsed Gibbs: marginalize out some variables analytically before sampling (e.g., LDA topic models, Griffiths + Steyvers 2004).

Trap: Gibbs mixes poorly when components are highly correlated — it can only move along axis-aligned directions.

7. Hamiltonian Monte Carlo (HMC)

Introduced by Duane, Kennedy, Pendleton, Roweth 1987 in lattice QCD as “Hybrid Monte Carlo”; brought to statistics by Neal 1993 (and 2011 chapter in MCMC Handbook).

Idea: augment state space with auxiliary momentum p ~ N(0, M) (mass matrix M). Define Hamiltonian

H(x, p) = U(x) + K(p) = −log p(x) + (1/2)·pᵀ M⁻¹ p

Joint density π(x, p) ∝ exp(−H(x, p)) factorizes; marginal on x is the target p. To propose, simulate Hamiltonian dynamics

dx/dt =  ∂H/∂p = M⁻¹·p
dp/dt = −∂H/∂x = −∇U(x) = ∇ log p(x)

for L steps of size ε via the leapfrog integrator (symplectic, second-order):

p ← p − (ε/2)·∇U(x)
x ← x + ε·M⁻¹·p
p ← p − (ε/2)·∇U(x)

Leapfrog preserves phase-space volume exactly and H up to O(ε²). Accept proposed (x', p') with prob min(1, exp(H(x,p) − H(x',p'))).

# Vanilla HMC step (M = I)
def hmc_step(x, log_p, grad_log_p, eps, L):
    p = np.random.randn(*x.shape)
    H0 = -log_p(x) + 0.5 * (p @ p)
    x_new, p_new = x.copy(), p.copy()
    # Leapfrog L steps
    p_new = p_new + 0.5 * eps * grad_log_p(x_new)
    for _ in range(L - 1):
        x_new = x_new + eps * p_new
        p_new = p_new + eps * grad_log_p(x_new)
    x_new = x_new + eps * p_new
    p_new = p_new + 0.5 * eps * grad_log_p(x_new)
    H1 = -log_p(x_new) + 0.5 * (p_new @ p_new)
    if np.log(np.random.rand()) < H0 - H1:
        return x_new, True
    return x, False

Because HMC uses gradients, it makes coherent long moves in high dimensions and dramatically outperforms random-walk MH on smooth targets — scaling is roughly O(d^{5/4}) work per effective sample vs O(d^2) for RW-MH (Beskos et al 2013).

8. NUTS — No-U-Turn Sampler (Hoffman + Gelman 2014)

HMC requires choosing trajectory length L·ε. Too short → diffusive. Too long → wasted gradient evals and possible U-turns back toward start.

NUTS solves this: build a balanced binary tree of leapfrog steps in both forward and backward time directions, doubling until a U-turn criterion triggers:

(x_+ − x_−) · p_+ < 0   or   (x_+ − x_−) · p_− < 0

Sample uniformly from the trajectory with care to preserve detailed balance (slice sampler over (x, p, u)).

Plus dual-averaging step-size adaptation (Nesterov 2009 applied by H+G): during warmup, adapt ε to target an acceptance statistic (default 0.8 in Stan).

NUTS is the default sampler in Stan, PyMC, NumPyro, Turing.jl. The original H+G paper is essential reading.

9. SGLD — Stochastic Gradient Langevin Dynamics (Welling + Teh 2011, ICML)

Marries SGD with Langevin diffusion to scale Bayesian inference to massive data. Discretized overdamped Langevin:

x_{t+1} = x_t + (ε_t/2)·(∇ log p(x_t) + ∇ log p(D|x_t)) + sqrt(ε_t)·η_t,   η_t ~ N(0,I)

Replace the full-data gradient with a minibatch estimate. As ε_t → 0 (Robbins-Monro schedule), the discretization error and minibatch noise both vanish, and the chain targets the posterior — no MH accept step needed.

Extensions:

  • SGHMC (Chen + Fox + Guestrin 2014) — stochastic-gradient HMC with friction term.
  • pSGLD (Li et al 2016) — preconditioned with Fisher-like matrix (RMSProp-style).
  • SG-NHT (Ding et al 2014) — Nosé-Hoover thermostat to absorb gradient-noise temperature drift.

These are foundational for Bayesian deep learning.

10. Sequential Monte Carlo (SMC) and particle filters

A population of weighted particles {(x_i^{(t)}, w_i^{(t)})} evolves through a sequence of distributions π_0, π_1, ..., π_T. Each step:

  1. Propagate: draw x_i^{(t)} ~ K_t(·|x_i^{(t−1)}).
  2. Reweight: w_i^{(t)} ∝ w_i^{(t−1)} · π_t(x_i^{(t)})/(π_{t−1}(x_i^{(t−1)})·K_t).
  3. Resample when ESS drops below threshold (typically N/2): draw N particles with replacement proportional to weights.

Special cases:

  • Bootstrap particle filter (Gordon, Salmond, Smith 1993) — state-space models with π_t = p(x_t | y_{1:t}).
  • SMC samplers (Del Moral, Doucet, Jasra 2006) — tempered posteriors π_t ∝ p(D|x)^{β_t}·p(x).

Applications: state-space estimation, robotics SLAM (see [[Robotics/bayesian-estimation]]), rare-event simulation, model evidence estimation, tracking, finance.

11. Slice sampling (Neal 2003)

Sample uniformly from the area under the density. Introduce auxiliary u:

  1. u ~ Uniform(0, p(x)).
  2. x_new ~ Uniform({x' : p(x') > u}) — the horizontal “slice”.

Step 2 is the work; Neal gives “stepping out + shrinkage” rules for finding the slice without knowing it analytically. No tuning parameter for step size. Used in BUGS/JAGS for conditionally non-conjugate updates, and in emcee’s ensemble slice variants.

12. Reversible-jump MCMC (Green 1995, Biometrika)

For variable-dimension parameter spaces — model selection across different numbers of components, regression with unknown number of changepoints, mixture models with unknown K. Augments the proposal with a “dimension-matching” diffeomorphism and a Jacobian factor in the acceptance ratio. Notoriously hard to tune in practice; superseded in many cases by SMC samplers or Bayesian nonparametrics.

13. Parallel tempering / replica exchange

Run M parallel chains at inverse temperatures β_1 = 1 > β_2 > ... > β_M, each targeting p(x)^{β_m}. High-temperature chains traverse the landscape easily and “donate” mode-jumping via swap proposals:

swap(m, m+1) accept prob = min(1, exp((β_m − β_{m+1})·(U(x_{m+1}) − U(x_m))))

Critical for multimodal targets, glassy energy landscapes, deep generative model posteriors.

14. Convergence diagnostics

No diagnostic can prove a chain has converged — only flag obvious failures. Combine several.

14.1 R̂ — potential scale reduction (Gelman + Rubin 1992; refined in Vehtari et al 2021)

Run m ≥ 4 chains of length n. Compute between-chain variance B and within-chain W:

R̂ = sqrt(((n−1)/n + B/(n·W)) / 1)   (rank-normalized split version preferred)

R̂ > 1.01 ⇒ chains have not mixed. Modern practice (Vehtari 2021) uses split-R̂ + rank normalization to catch non-stationarity and fat tails.

14.2 Effective Sample Size (ESS)

ESS = n / (1 + 2·Σ_{t≥1} ρ_t)

where ρ_t is lag-t autocorrelation. Stan reports both bulk-ESS (for posterior mean) and tail-ESS (for tail quantiles). Target ESS ≥ 400 per chain for stable quantile estimates.

14.3 Trace plot

Plot x_t vs t for each chain. Stationary + well-mixed chains look like “fuzzy caterpillars” overlapping each other. Drift, sticking, multi-modal jumps are visible.

14.4 Posterior predictive check (Gelman, Meng, Stern 1996)

For each posterior sample θ_s, generate replicated data y_rep_s ~ p(·|θ_s). Compare summary statistics T(y_rep) to T(y_obs) — Bayesian p-values. Detects model mis-specification, not sampler issues.

14.5 HMC-specific: divergences and energy

A divergent transition = leapfrog error blew up (often a “funnel” region with strong curvature, e.g., hierarchical-model variance hyperparameter near 0). Even a handful of divergences biases inference.

E-BFMI (Energy Bayesian Fraction of Missing Information, Betancourt 2018): low values indicate the momentum resample isn’t exploring the energy distribution well.

15. HMC tuning tricks

  1. Non-centered parameterization for hierarchical models. Instead of θ_i ~ N(μ, τ), write θ_i = μ + τ·z_i with z_i ~ N(0,1). Decouples the funnel and dramatically improves geometry. The “8-schools” example (Rubin 1981; Gelman BDA3) is the canonical demonstration.
  2. Mass matrix adaptation: set M⁻¹ to the empirical posterior covariance during warmup. Diagonal mass is default in Stan; dense is available but expensive.
  3. Dual averaging step size: adapt ε to hit target accept (Stan default 0.8; raise to 0.95–0.99 if divergences appear).
  4. Increase max tree depth (NUTS) if you hit the cap — but first check geometry.
  5. Rescale parameters to roughly unit scale; helps the diagonal mass matrix.
  6. Reparameterize bounded variables with log (positive), logit (probabilities), softmax/stick-breaking (simplices), Cholesky (covariance).

16. Software

ToolYearBackend / languageNotes
BUGS1989Pascal / Component PascalFirst general-purpose Bayesian software; Gibbs-based.
WinBUGS1997WindowsStandard for 2000s applied Bayes.
JAGS2007C++Just Another Gibbs Sampler; cross-platform.
Stan (Carpenter et al 2017, JSS)2012C++; R/Python/Julia front-endsHMC + NUTS; auto-diff; the gold standard for applied Bayesian models.
PyMC v5 (Salvatier, Wiecki, Fonnesbeck)2022 rewritePython on PyTensor (JAX/Numba)NUTS via BlackJAX or internal; very ergonomic.
NumPyro (Phan, Pradhan, Jankowiak 2019)2019JAXFastest NUTS; great for vectorized models.
Pyro (Bingham et al 2018)2018PyTorchFocus on VI / SVI; supports HMC/NUTS.
Turing.jl (Ge, Xu, Ghahramani 2018)2018JuliaComposable inference (HMC, particle, VI).
emcee (Foreman-Mackey 2013, PASP)2013PythonAffine-invariant ensemble sampler (Goodman + Weare 2010); standard in astronomy.
dynesty / nestle / MultiNestvariousPython / CNested sampling (Skilling 2006) — Bayesian evidence + multimodal.
BlackJAX2022JAXComposable MCMC kernels; powers PyMC + NumPyro.
TensorFlow Probability2017TF / JAXMCMC chains + VI + distributions.

Affine-invariant ensemble (emcee): walkers update via a stretch move using another walker’s position, automatically rescaling to local geometry — no gradient needed. Excellent for low-to-moderate dim physics-style problems.

17. Use cases

  • Bayesian parameter estimation — the default modern workflow. See [[Math/bayesian-inference]].
  • Bayesian hierarchical / multilevel models — 8-schools, longitudinal regression, election forecasting (Gelman + Hill 2007).
  • Gaussian process hyperparameters — marginal-likelihood posterior is non-Gaussian; HMC or slice sampling.
  • Approximate Bayesian Computation (ABC) — likelihood-free; for simulator-based models in epidemiology, population genetics, cosmology (Marin, Pudlo, Robert, Ryder 2012).
  • Cosmological parameter inferenceemcee/MultiNest/Cobaya on Planck CMB likelihoods; LIGO gravitational-wave parameter estimation (PyCBC, Bilby).
  • Phylogenetics — MrBayes (Ronquist + Huelsenbeck 2003), BEAST/BEAST2 (Drummond + Rambaut 2007; Bouckaert et al 2019); tree-space MCMC.
  • Pharmacokinetic / pharmacodynamic modeling — population PK/PD via NONMEM, Stan, Monolix.
  • Statistical mechanics / lattice QCD — origin of HMC; partition function expectations.
  • Finance — pricing path-dependent derivatives via Monte Carlo + variance reduction; QMC for high-dim integrals.
  • Computer graphics — Metropolis Light Transport (Veach + Guibas 1997).
  • Model selection / averaging — RJMCMC, SMC samplers for evidence.
  • Tracking / SLAM — particle filters (see [[Robotics/bayesian-estimation]]).

Note: LLM token sampling (temperature, top-k, top-p, nucleus) is not MCMC — it’s iid categorical sampling from a learned conditional. Don’t confuse the two.

18. Variance reduction

Plain MC has O(σ/√n) error. For fixed budget, smaller σ matters.

18.1 Antithetic variates

For symmetric proposals, pair X with −X (or 1 − U for uniforms). If f is monotonic, the two estimators are negatively correlated, cutting variance.

18.2 Control variates

Find h(x) with known E_p[h] = μ_h. Estimator:

μ̂_CV = (1/n)·Σ(f(X_i) − c·(h(X_i) − μ_h)),   c* = Cov(f,h)/Var(h)

Variance reduction factor 1 − Corr(f, h)². Used in derivative pricing (use Black-Scholes closed-form as control for non-Black-Scholes payoffs).

18.3 Stratified sampling

Partition Ω = ∪ Ω_k, sample n_k from p restricted to Ω_k. Variance is Σ p_k²·σ_k²/n_k; optimal allocation n_k ∝ p_k·σ_k (Neyman allocation).

18.4 Common random numbers

When comparing two estimates that differ only in parameters, use the same U_i for both → strong positive correlation → small Var(estimator difference).

18.5 Rao-Blackwellization

Replace f(X) with conditional expectation E[f(X) | T(X)] if available — strict variance reduction by tower law.

19. Quasi-Monte Carlo (QMC)

Replace iid U_i ~ Uniform([0,1]^d) with low-discrepancy sequences that fill the cube more uniformly than random:

  • Halton sequence (Halton 1960) — radical-inverse base-p_i for dim i.
  • Sobol’ sequence (Sobol’ 1967) — direction numbers; the workhorse of QMC.
  • Faure sequence (Faure 1982) — base-prime ≥ d.
  • Niederreiter (1987) — generalization with discrepancy bounds.

Koksma-Hlawka inequality: integration error ≤ V(f)·D*(x_1,...,x_n), where D* is star discrepancy. Sobol’/Halton achieve D* = O((log n)^d / n), giving asymptotic error O((log n)^d / n) vs MC’s O(1/√n) — a huge win for smooth integrands and moderate d.

Scrambled QMC (Owen 1995) randomizes a low-discrepancy sequence to recover an unbiased estimator with confidence intervals, while keeping near-O(1/n) variance for smooth f.

Standard in: finance (basket option pricing, especially in dim 10–100), robotics path planning, computer graphics, high-dim integration. Implementations: scipy.stats.qmc, torch.quasirandom, SobolNum.

20. Modern MCMC research (selected directions)

  • Riemannian Manifold HMC (RMHMC) (Girolami + Calderhead 2011, JRSS-B) — let the mass matrix M(x) = G(x) be the local Fisher metric, so dynamics flow on the statistical manifold. Handles ill-conditioned posteriors but requires expensive geometry computations.
  • Adaptive NUTS with dense mass matrix — Stan supports; expensive but pays off for highly correlated posteriors.
  • HMC on Lie groups — for SO(3) and SE(3) orientation/pose posteriors (Byrne + Girolami 2013; Liu + Zhu et al 2016). See [[Math/lie-groups-so3-se3]].
  • Neural Spline Flow + MCMC hybrids — train a normalizing flow as the proposal in IS/MH; “transport map” approach (Marzouk et al 2016; Hoffman et al 2019).
  • Piecewise Deterministic Markov Processes (PDMPs) — bouncy particle sampler (Bouchard-Côté et al 2018), Zig-Zag sampler (Bierkens, Fearnhead, Roberts 2019); non-reversible chains with super-efficient mixing in some regimes.
  • MALA — Metropolis-adjusted Langevin algorithm (Roberts + Tweedie 1996); single-step HMC with L=1.
  • Stein Variational Gradient Descent (Liu + Wang 2016) — deterministic particle method; not strictly MCMC but a competitor.
  • Coupled MCMC (Jacob, O’Leary, Atchadé 2020) — unbiased estimators from finite-time coupled chains.
  • Quasi-Newton HMC, NUTS-on-flows — active areas.

21. Common pitfalls

  1. Too short warmup / burn-in — chains haven’t forgotten initialization. Stan’s default 1000 warmup is a minimum, not a maximum.
  2. One chain only — can’t detect non-mixing or multimodality. Always run ≥ 4 chains from dispersed initial values.
  3. High autocorrelation — thousands of raw samples but ESS = 50. Diagnose with autocorrelation plot; remedy with better parameterization, HMC over RW-MH, block updates.
  4. Thinning misuse — thinning saves disk space but loses information; don’t thin to “fix” autocorrelation, just report ESS.
  5. Ignoring HMC divergences — even one divergence biases tails. Reparameterize, raise adapt_delta, shrink step size — don’t suppress the warning.
  6. Multimodal posterior, one mode found — symptoms: chains stuck in different modes, R̂ huge between chains but tiny within. Remedies: parallel tempering, mode-finding initialization, mixture proposals, SMC samplers, nested sampling.
  7. Improper priors → improper posterior — sampler may still “run” but the answer is meaningless. Check posterior is proper before sampling.
  8. Initialization on a flat region of the prior — gradient is tiny, HMC drifts forever. Initialize near MAP or sensible default.
  9. Bad mass matrix — diagonal M when posterior is strongly correlated → very small effective step. Use full mass matrix or reparameterize to decorrelate.
  10. Confusing IS effective sample size with MCMC ESS — different quantities; both important.

22. Cross-references

  • [[Math/bayesian-inference]] — the inferential framework; conjugate updating, hierarchical models, model comparison.
  • [[Math/probability-fundamentals]] — measure-theoretic basis, conditional expectation, ergodic theorem.
  • [[Math/hypothesis-testing-mle]] — MLE / frequentist counterpart.
  • [[Math/_index]] — full Math tier index.
  • [[Robotics/bayesian-estimation]] — Kalman / EKF / UKF / particle filter (sequential MCMC) for robotics state estimation.
  • [[Math/gradient-descent-variants]] — SGD background for SGLD / SGHMC.
  • [[Math/lie-groups-so3-se3]] — geometry for HMC on rotations and rigid-body poses.
  • [[Math/convex-optimization]] — MAP estimation as the optimization counterpart of Bayesian sampling.

23. Citations

  • Metropolis, Rosenbluth, Rosenbluth, Teller, Teller, 1953. “Equation of State Calculations by Fast Computing Machines”. J. Chem. Phys. 21, 1087–1092.
  • Hastings, 1970. “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”. Biometrika 57, 97–109.
  • Geman + Geman, 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images”. IEEE PAMI 6, 721–741.
  • Box + Muller, 1958. “A Note on the Generation of Random Normal Deviates”. Ann. Math. Stat. 29, 610–611.
  • Marsaglia + Tsang, 2000. “The Ziggurat Method for Generating Random Variables”. J. Stat. Software.
  • Duane, Kennedy, Pendleton, Roweth, 1987. “Hybrid Monte Carlo”. Phys. Lett. B 195, 216–222.
  • Neal, 1993. “Probabilistic Inference Using Markov Chain Monte Carlo Methods”. Technical Report CRG-TR-93-1, U. Toronto.
  • Neal, 2011. “MCMC Using Hamiltonian Dynamics”. In Handbook of Markov Chain Monte Carlo.
  • Hoffman + Gelman, 2014. “The No-U-Turn Sampler”. JMLR 15, 1593–1623.
  • Welling + Teh, 2011. “Bayesian Learning via Stochastic Gradient Langevin Dynamics”. ICML.
  • Chen + Fox + Guestrin, 2014. “Stochastic Gradient Hamiltonian Monte Carlo”. ICML.
  • Green, 1995. “Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination”. Biometrika 82, 711–732.
  • Gelman + Rubin, 1992. “Inference from Iterative Simulation Using Multiple Sequences”. Stat. Sci. 7, 457–472.
  • Vehtari, Gelman, Simpson, Carpenter, Bürkner, 2021. “Rank-Normalization, Folding, and Localization: An Improved R̂”. Bayesian Analysis 16, 667–718.
  • Gordon, Salmond, Smith, 1993. “Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation”. IEE Proc. F 140, 107–113.
  • Del Moral, Doucet, Jasra, 2006. “Sequential Monte Carlo Samplers”. JRSS-B 68, 411–436.
  • Neal, 2003. “Slice Sampling”. Ann. Stat. 31, 705–767.
  • Girolami + Calderhead, 2011. “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods”. JRSS-B 73, 123–214.
  • Roberts + Tweedie, 1996. “Exponential Convergence of Langevin Distributions and Their Discrete Approximations”. Bernoulli 2, 341–363.
  • Goodman + Weare, 2010. “Ensemble Samplers with Affine Invariance”. CAMCoS 5, 65–80.
  • Foreman-Mackey, Hogg, Lang, Goodman, 2013. “emcee: The MCMC Hammer”. PASP 125, 306–312.
  • Skilling, 2006. “Nested Sampling for General Bayesian Computation”. Bayesian Analysis 1, 833–860.
  • Sobol’, 1967. “On the Distribution of Points in a Cube and the Approximate Evaluation of Integrals”. USSR Comp. Math. 7, 86–112.
  • Halton, 1960. “On the Efficiency of Certain Quasi-Random Sequences of Points”. Num. Math. 2, 84–90.
  • Owen, 1995. “Randomly Permuted (t,m,s)-Nets and (t,s)-Sequences”. In Monte Carlo and Quasi-Monte Carlo Methods.
  • Roberts, Gelman, Gilks, 1997. “Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms”. Ann. Appl. Prob. 7, 110–120.
  • Beskos, Pillai, Roberts, Sanz-Serna, Stuart, 2013. “Optimal Tuning of the Hybrid Monte Carlo Algorithm”. Bernoulli 19, 1501–1534.
  • Haario, Saksman, Tamminen, 2001. “An Adaptive Metropolis Algorithm”. Bernoulli 7, 223–242.
  • Carpenter et al, 2017. “Stan: A Probabilistic Programming Language”. J. Stat. Software 76, 1–32.
  • Phan, Pradhan, Jankowiak, 2019. “Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro”. arXiv:1912.11554.
  • Bingham et al, 2018. “Pyro: Deep Universal Probabilistic Programming”. JMLR 20, 1–6.
  • Ge, Xu, Ghahramani, 2018. “Turing: A Language for Flexible Probabilistic Inference”. AISTATS.
  • Bouchard-Côté, Vollmer, Doucet, 2018. “The Bouncy Particle Sampler”. JASA 113, 855–867.
  • Bierkens, Fearnhead, Roberts, 2019. “The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data”. Ann. Stat. 47, 1288–1320.
  • Veach + Guibas, 1997. “Metropolis Light Transport”. SIGGRAPH.
  • Brooks, Gelman, Jones, Meng (eds), 2011. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
  • Robert + Casella, 2004. Monte Carlo Statistical Methods, 2nd ed. Springer.
  • Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, 2013. Bayesian Data Analysis, 3rd ed. (BDA3). Chapman & Hall/CRC.
  • Betancourt, 2018. “A Conceptual Introduction to Hamiltonian Monte Carlo”. arXiv:1701.02434.
  • Liu + Wang, 2016. “Stein Variational Gradient Descent”. NeurIPS.
  • Jacob, O’Leary, Atchadé, 2020. “Unbiased Markov Chain Monte Carlo Methods with Couplings”. JRSS-B 82, 543–600.