MCMC & Monte Carlo Sampling — Math Reference
1. At a glance
Monte Carlo (MC) methods draw samples from a probability distribution p(x) to approximate expectations, integrals, marginals, and uncertainty. Markov Chain Monte Carlo (MCMC) extends this to settings where direct sampling is infeasible — typically because the normalizing constant Z = ∫ p̃(x) dx is intractable, as in most Bayesian posteriors p(θ|D) ∝ p(D|θ)·p(θ).
Core estimator: for X_i ~ p iid,
E_p[f(X)] ≈ (1/n) · Σ_{i=1..n} f(X_i)
Strong Law of Large Numbers gives consistency; CLT gives O(1/√n) error rate regardless of dimension (the famous escape from the curse of dimensionality for integration).
Two big families:
- Exact / direct sampling — when
F⁻¹is available (inverse CDF), or special tricks (Box-Muller for Gaussian) work. Fast, iid samples. - Markov-chain sampling (MCMC) — construct a Markov chain
X_0 → X_1 → ...whose stationary distribution isp. After burn-in, samples are correlated draws fromp. Asymptotically correct under ergodicity.
Used for: Bayesian inference, statistical physics, finance derivatives, computer graphics (path tracing), state-space models, model selection, rare-event simulation.
Cross-refs: see [[Math/bayesian-inference]] for the inferential framework MCMC supports, [[Math/probability-fundamentals]] for measure-theoretic foundations.
2. Direct sampling
2.1 Inverse CDF (probability integral transform)
If F is the CDF of X and U ~ Uniform(0,1), then X = F⁻¹(U) has distribution F. Works when F⁻¹ is computable.
Examples:
- Exponential:
F(x) = 1 − e^{−λx}→X = −log(1 − U)/λ. - Cauchy:
X = tan(π·(U − 0.5)). - Discrete: store CDF, binary-search on
U.
2.2 Box-Muller (Box + Muller 1958)
Two iid U_1, U_2 ~ Uniform(0,1) → two iid Z_1, Z_2 ~ N(0,1):
Z_1 = sqrt(−2·log(U_1)) · cos(2π·U_2)
Z_2 = sqrt(−2·log(U_1)) · sin(2π·U_2)
2.3 Marsaglia polar method (Marsaglia + Bray 1964)
Avoids the trig calls of Box-Muller. Sample (V_1, V_2) ~ Uniform(square), reject if S = V_1² + V_2² > 1, else
Z_1 = V_1 · sqrt(−2·log(S)/S)
Z_2 = V_2 · sqrt(−2·log(S)/S)
2.4 Ziggurat (Marsaglia + Tsang 2000)
State-of-the-art fast Gaussian (and exponential) sampler. Pre-tabulates a stack of rectangles covering the density; ~97% of samples accepted by table lookup, fallback for tail. Used in NumPy randn, R, MATLAB.
2.5 Rejection sampling (von Neumann 1951)
Given p(x) ≤ M·q(x) for an envelope q we can sample:
- Draw
X ~ q. - Draw
U ~ Uniform(0, 1). - Accept
XifU ≤ p(X)/(M·q(X)), else reject.
Acceptance probability is 1/M. Inefficient in high dimensions: M typically grows exponentially with d.
3. Importance sampling (IS)
Rewrite the expectation under a tractable proposal q:
E_p[f(X)] = ∫ f(x)·p(x) dx = ∫ f(x)·(p(x)/q(x))·q(x) dx = E_q[f(X)·w(X)]
where w(x) = p(x)/q(x) is the importance weight. Estimator:
μ̂_IS = (1/n) · Σ f(X_i)·w(X_i), X_i ~ q
When p is known only up to a constant (p̃ = Z·p), use self-normalized IS (SNIS):
μ̂_SNIS = Σ f(X_i)·w̃(X_i) / Σ w̃(X_i), w̃ = p̃/q
SNIS is biased (consistent) but works without Z.
Variance: Var(μ̂_IS) = Var_q(f·w)/n. Minimum at q*(x) ∝ |f(x)|·p(x). In practice, choose q with heavier tails than p to avoid infinite-variance weights; check the effective sample size ESS = (Σw_i)² / Σ w_i² — small ESS means a few weights dominate.
Use cases: rare-event simulation (tilted proposal); particle filters (sequential IS, see §10); variational inference reweighting (IWAE, Burda et al 2016).
4. MCMC: the core idea
Metropolis et al 1953 (Metropolis, Rosenbluth, Rosenbluth, Teller, Teller) introduced the idea at Los Alamos: simulate a Markov chain whose stationary distribution is the Boltzmann distribution. Hastings 1970 generalized to arbitrary p.
Key facts:
- Chain transition kernel
K(x → x')must satisfy detailed balance:p(x)·K(x → x') = p(x')·K(x' → x). This impliespis stationary. - Irreducibility + aperiodicity ⇒ unique stationary distribution
p, and(1/n)·Σ f(X_i) → E_p[f](ergodic theorem). - Samples are correlated, so effective sample size
n_eff < n. - Need burn-in to forget initial state.
The art is designing fast-mixing kernels.
5. Metropolis-Hastings (MH)
Given current state x, propose x' ~ q(·|x), accept with probability
α(x → x') = min(1, [p(x')·q(x|x')] / [p(x)·q(x'|x)])
If accepted, x_{t+1} = x'; else x_{t+1} = x. The acceptance ratio uses only ratios of p, so the normalizer Z cancels.
# Random-walk Metropolis-Hastings (toy, log-target)
import numpy as np
def rw_mh(log_p, x0, n_iter, step):
x = np.asarray(x0, dtype=float)
lp = log_p(x)
samples = np.empty((n_iter, x.size))
accept = 0
for t in range(n_iter):
x_prop = x + step * np.random.randn(*x.shape)
lp_prop = log_p(x_prop)
# Symmetric q ⇒ q-terms cancel
if np.log(np.random.rand()) < lp_prop - lp:
x, lp = x_prop, lp_prop
accept += 1
samples[t] = x
return samples, accept / n_iterTuning random-walk MH: Roberts + Gelman + Gilks 1997 showed the optimal acceptance rate is 0.234 for d → ∞ (Gaussian target, isotropic Gaussian proposal). In practice tune step so acceptance ≈ 0.25–0.35 in moderate dim, ≈ 0.44 in 1D.
Independence sampler: q(x'|x) = q(x') ignores current state — essentially IS with a Markov-chain twist. Good when q ≈ p.
Adaptive MH (Haario, Saksman, Tamminen 2001) — adapt proposal covariance from sample history. Care: adaptation must diminish or break detailed balance.
6. Gibbs sampling (Geman + Geman 1984)
For multivariate x = (x_1, ..., x_d), sample one component at a time from its full conditional:
x_i^{(t+1)} ~ p(x_i | x_1^{(t+1)}, ..., x_{i−1}^{(t+1)}, x_{i+1}^{(t)}, ..., x_d^{(t)})
This is a special MH where each conditional update has acceptance 1. Works beautifully when conditionals are tractable — conjugate Bayesian hierarchical models (Normal-Inverse-Gamma, Beta-Binomial, Dirichlet-Multinomial).
Block Gibbs: update several components jointly to reduce autocorrelation when components are strongly correlated.
Collapsed Gibbs: marginalize out some variables analytically before sampling (e.g., LDA topic models, Griffiths + Steyvers 2004).
Trap: Gibbs mixes poorly when components are highly correlated — it can only move along axis-aligned directions.
7. Hamiltonian Monte Carlo (HMC)
Introduced by Duane, Kennedy, Pendleton, Roweth 1987 in lattice QCD as “Hybrid Monte Carlo”; brought to statistics by Neal 1993 (and 2011 chapter in MCMC Handbook).
Idea: augment state space with auxiliary momentum p ~ N(0, M) (mass matrix M). Define Hamiltonian
H(x, p) = U(x) + K(p) = −log p(x) + (1/2)·pᵀ M⁻¹ p
Joint density π(x, p) ∝ exp(−H(x, p)) factorizes; marginal on x is the target p. To propose, simulate Hamiltonian dynamics
dx/dt = ∂H/∂p = M⁻¹·p
dp/dt = −∂H/∂x = −∇U(x) = ∇ log p(x)
for L steps of size ε via the leapfrog integrator (symplectic, second-order):
p ← p − (ε/2)·∇U(x)
x ← x + ε·M⁻¹·p
p ← p − (ε/2)·∇U(x)
Leapfrog preserves phase-space volume exactly and H up to O(ε²). Accept proposed (x', p') with prob min(1, exp(H(x,p) − H(x',p'))).
# Vanilla HMC step (M = I)
def hmc_step(x, log_p, grad_log_p, eps, L):
p = np.random.randn(*x.shape)
H0 = -log_p(x) + 0.5 * (p @ p)
x_new, p_new = x.copy(), p.copy()
# Leapfrog L steps
p_new = p_new + 0.5 * eps * grad_log_p(x_new)
for _ in range(L - 1):
x_new = x_new + eps * p_new
p_new = p_new + eps * grad_log_p(x_new)
x_new = x_new + eps * p_new
p_new = p_new + 0.5 * eps * grad_log_p(x_new)
H1 = -log_p(x_new) + 0.5 * (p_new @ p_new)
if np.log(np.random.rand()) < H0 - H1:
return x_new, True
return x, FalseBecause HMC uses gradients, it makes coherent long moves in high dimensions and dramatically outperforms random-walk MH on smooth targets — scaling is roughly O(d^{5/4}) work per effective sample vs O(d^2) for RW-MH (Beskos et al 2013).
8. NUTS — No-U-Turn Sampler (Hoffman + Gelman 2014)
HMC requires choosing trajectory length L·ε. Too short → diffusive. Too long → wasted gradient evals and possible U-turns back toward start.
NUTS solves this: build a balanced binary tree of leapfrog steps in both forward and backward time directions, doubling until a U-turn criterion triggers:
(x_+ − x_−) · p_+ < 0 or (x_+ − x_−) · p_− < 0
Sample uniformly from the trajectory with care to preserve detailed balance (slice sampler over (x, p, u)).
Plus dual-averaging step-size adaptation (Nesterov 2009 applied by H+G): during warmup, adapt ε to target an acceptance statistic (default 0.8 in Stan).
NUTS is the default sampler in Stan, PyMC, NumPyro, Turing.jl. The original H+G paper is essential reading.
9. SGLD — Stochastic Gradient Langevin Dynamics (Welling + Teh 2011, ICML)
Marries SGD with Langevin diffusion to scale Bayesian inference to massive data. Discretized overdamped Langevin:
x_{t+1} = x_t + (ε_t/2)·(∇ log p(x_t) + ∇ log p(D|x_t)) + sqrt(ε_t)·η_t, η_t ~ N(0,I)
Replace the full-data gradient with a minibatch estimate. As ε_t → 0 (Robbins-Monro schedule), the discretization error and minibatch noise both vanish, and the chain targets the posterior — no MH accept step needed.
Extensions:
- SGHMC (Chen + Fox + Guestrin 2014) — stochastic-gradient HMC with friction term.
- pSGLD (Li et al 2016) — preconditioned with Fisher-like matrix (RMSProp-style).
- SG-NHT (Ding et al 2014) — Nosé-Hoover thermostat to absorb gradient-noise temperature drift.
These are foundational for Bayesian deep learning.
10. Sequential Monte Carlo (SMC) and particle filters
A population of weighted particles {(x_i^{(t)}, w_i^{(t)})} evolves through a sequence of distributions π_0, π_1, ..., π_T. Each step:
- Propagate: draw
x_i^{(t)} ~ K_t(·|x_i^{(t−1)}). - Reweight:
w_i^{(t)} ∝ w_i^{(t−1)} · π_t(x_i^{(t)})/(π_{t−1}(x_i^{(t−1)})·K_t). - Resample when ESS drops below threshold (typically
N/2): drawNparticles with replacement proportional to weights.
Special cases:
- Bootstrap particle filter (Gordon, Salmond, Smith 1993) — state-space models with
π_t = p(x_t | y_{1:t}). - SMC samplers (Del Moral, Doucet, Jasra 2006) — tempered posteriors
π_t ∝ p(D|x)^{β_t}·p(x).
Applications: state-space estimation, robotics SLAM (see [[Robotics/bayesian-estimation]]), rare-event simulation, model evidence estimation, tracking, finance.
11. Slice sampling (Neal 2003)
Sample uniformly from the area under the density. Introduce auxiliary u:
u ~ Uniform(0, p(x)).x_new ~ Uniform({x' : p(x') > u})— the horizontal “slice”.
Step 2 is the work; Neal gives “stepping out + shrinkage” rules for finding the slice without knowing it analytically. No tuning parameter for step size. Used in BUGS/JAGS for conditionally non-conjugate updates, and in emcee’s ensemble slice variants.
12. Reversible-jump MCMC (Green 1995, Biometrika)
For variable-dimension parameter spaces — model selection across different numbers of components, regression with unknown number of changepoints, mixture models with unknown K. Augments the proposal with a “dimension-matching” diffeomorphism and a Jacobian factor in the acceptance ratio. Notoriously hard to tune in practice; superseded in many cases by SMC samplers or Bayesian nonparametrics.
13. Parallel tempering / replica exchange
Run M parallel chains at inverse temperatures β_1 = 1 > β_2 > ... > β_M, each targeting p(x)^{β_m}. High-temperature chains traverse the landscape easily and “donate” mode-jumping via swap proposals:
swap(m, m+1) accept prob = min(1, exp((β_m − β_{m+1})·(U(x_{m+1}) − U(x_m))))
Critical for multimodal targets, glassy energy landscapes, deep generative model posteriors.
14. Convergence diagnostics
No diagnostic can prove a chain has converged — only flag obvious failures. Combine several.
14.1 R̂ — potential scale reduction (Gelman + Rubin 1992; refined in Vehtari et al 2021)
Run m ≥ 4 chains of length n. Compute between-chain variance B and within-chain W:
R̂ = sqrt(((n−1)/n + B/(n·W)) / 1) (rank-normalized split version preferred)
R̂ > 1.01 ⇒ chains have not mixed. Modern practice (Vehtari 2021) uses split-R̂ + rank normalization to catch non-stationarity and fat tails.
14.2 Effective Sample Size (ESS)
ESS = n / (1 + 2·Σ_{t≥1} ρ_t)
where ρ_t is lag-t autocorrelation. Stan reports both bulk-ESS (for posterior mean) and tail-ESS (for tail quantiles). Target ESS ≥ 400 per chain for stable quantile estimates.
14.3 Trace plot
Plot x_t vs t for each chain. Stationary + well-mixed chains look like “fuzzy caterpillars” overlapping each other. Drift, sticking, multi-modal jumps are visible.
14.4 Posterior predictive check (Gelman, Meng, Stern 1996)
For each posterior sample θ_s, generate replicated data y_rep_s ~ p(·|θ_s). Compare summary statistics T(y_rep) to T(y_obs) — Bayesian p-values. Detects model mis-specification, not sampler issues.
14.5 HMC-specific: divergences and energy
A divergent transition = leapfrog error blew up (often a “funnel” region with strong curvature, e.g., hierarchical-model variance hyperparameter near 0). Even a handful of divergences biases inference.
E-BFMI (Energy Bayesian Fraction of Missing Information, Betancourt 2018): low values indicate the momentum resample isn’t exploring the energy distribution well.
15. HMC tuning tricks
- Non-centered parameterization for hierarchical models. Instead of
θ_i ~ N(μ, τ), writeθ_i = μ + τ·z_iwithz_i ~ N(0,1). Decouples the funnel and dramatically improves geometry. The “8-schools” example (Rubin 1981; Gelman BDA3) is the canonical demonstration. - Mass matrix adaptation: set
M⁻¹to the empirical posterior covariance during warmup. Diagonal mass is default in Stan; dense is available but expensive. - Dual averaging step size: adapt
εto hit target accept (Stan default 0.8; raise to 0.95–0.99 if divergences appear). - Increase max tree depth (NUTS) if you hit the cap — but first check geometry.
- Rescale parameters to roughly unit scale; helps the diagonal mass matrix.
- Reparameterize bounded variables with
log(positive),logit(probabilities),softmax/stick-breaking(simplices), Cholesky (covariance).
16. Software
| Tool | Year | Backend / language | Notes |
|---|---|---|---|
| BUGS | 1989 | Pascal / Component Pascal | First general-purpose Bayesian software; Gibbs-based. |
| WinBUGS | 1997 | Windows | Standard for 2000s applied Bayes. |
| JAGS | 2007 | C++ | Just Another Gibbs Sampler; cross-platform. |
| Stan (Carpenter et al 2017, JSS) | 2012 | C++; R/Python/Julia front-ends | HMC + NUTS; auto-diff; the gold standard for applied Bayesian models. |
| PyMC v5 (Salvatier, Wiecki, Fonnesbeck) | 2022 rewrite | Python on PyTensor (JAX/Numba) | NUTS via BlackJAX or internal; very ergonomic. |
| NumPyro (Phan, Pradhan, Jankowiak 2019) | 2019 | JAX | Fastest NUTS; great for vectorized models. |
| Pyro (Bingham et al 2018) | 2018 | PyTorch | Focus on VI / SVI; supports HMC/NUTS. |
| Turing.jl (Ge, Xu, Ghahramani 2018) | 2018 | Julia | Composable inference (HMC, particle, VI). |
| emcee (Foreman-Mackey 2013, PASP) | 2013 | Python | Affine-invariant ensemble sampler (Goodman + Weare 2010); standard in astronomy. |
| dynesty / nestle / MultiNest | various | Python / C | Nested sampling (Skilling 2006) — Bayesian evidence + multimodal. |
| BlackJAX | 2022 | JAX | Composable MCMC kernels; powers PyMC + NumPyro. |
| TensorFlow Probability | 2017 | TF / JAX | MCMC chains + VI + distributions. |
Affine-invariant ensemble (emcee): walkers update via a stretch move using another walker’s position, automatically rescaling to local geometry — no gradient needed. Excellent for low-to-moderate dim physics-style problems.
17. Use cases
- Bayesian parameter estimation — the default modern workflow. See
[[Math/bayesian-inference]]. - Bayesian hierarchical / multilevel models — 8-schools, longitudinal regression, election forecasting (Gelman + Hill 2007).
- Gaussian process hyperparameters — marginal-likelihood posterior is non-Gaussian; HMC or slice sampling.
- Approximate Bayesian Computation (ABC) — likelihood-free; for simulator-based models in epidemiology, population genetics, cosmology (Marin, Pudlo, Robert, Ryder 2012).
- Cosmological parameter inference —
emcee/MultiNest/Cobayaon Planck CMB likelihoods; LIGO gravitational-wave parameter estimation (PyCBC,Bilby). - Phylogenetics — MrBayes (Ronquist + Huelsenbeck 2003), BEAST/BEAST2 (Drummond + Rambaut 2007; Bouckaert et al 2019); tree-space MCMC.
- Pharmacokinetic / pharmacodynamic modeling — population PK/PD via NONMEM, Stan, Monolix.
- Statistical mechanics / lattice QCD — origin of HMC; partition function expectations.
- Finance — pricing path-dependent derivatives via Monte Carlo + variance reduction; QMC for high-dim integrals.
- Computer graphics — Metropolis Light Transport (Veach + Guibas 1997).
- Model selection / averaging — RJMCMC, SMC samplers for evidence.
- Tracking / SLAM — particle filters (see
[[Robotics/bayesian-estimation]]).
Note: LLM token sampling (temperature, top-k, top-p, nucleus) is not MCMC — it’s iid categorical sampling from a learned conditional. Don’t confuse the two.
18. Variance reduction
Plain MC has O(σ/√n) error. For fixed budget, smaller σ matters.
18.1 Antithetic variates
For symmetric proposals, pair X with −X (or 1 − U for uniforms). If f is monotonic, the two estimators are negatively correlated, cutting variance.
18.2 Control variates
Find h(x) with known E_p[h] = μ_h. Estimator:
μ̂_CV = (1/n)·Σ(f(X_i) − c·(h(X_i) − μ_h)), c* = Cov(f,h)/Var(h)
Variance reduction factor 1 − Corr(f, h)². Used in derivative pricing (use Black-Scholes closed-form as control for non-Black-Scholes payoffs).
18.3 Stratified sampling
Partition Ω = ∪ Ω_k, sample n_k from p restricted to Ω_k. Variance is Σ p_k²·σ_k²/n_k; optimal allocation n_k ∝ p_k·σ_k (Neyman allocation).
18.4 Common random numbers
When comparing two estimates that differ only in parameters, use the same U_i for both → strong positive correlation → small Var(estimator difference).
18.5 Rao-Blackwellization
Replace f(X) with conditional expectation E[f(X) | T(X)] if available — strict variance reduction by tower law.
19. Quasi-Monte Carlo (QMC)
Replace iid U_i ~ Uniform([0,1]^d) with low-discrepancy sequences that fill the cube more uniformly than random:
- Halton sequence (Halton 1960) — radical-inverse base-
p_ifor dimi. - Sobol’ sequence (Sobol’ 1967) — direction numbers; the workhorse of QMC.
- Faure sequence (Faure 1982) — base-prime ≥ d.
- Niederreiter (1987) — generalization with discrepancy bounds.
Koksma-Hlawka inequality: integration error ≤ V(f)·D*(x_1,...,x_n), where D* is star discrepancy. Sobol’/Halton achieve D* = O((log n)^d / n), giving asymptotic error O((log n)^d / n) vs MC’s O(1/√n) — a huge win for smooth integrands and moderate d.
Scrambled QMC (Owen 1995) randomizes a low-discrepancy sequence to recover an unbiased estimator with confidence intervals, while keeping near-O(1/n) variance for smooth f.
Standard in: finance (basket option pricing, especially in dim 10–100), robotics path planning, computer graphics, high-dim integration. Implementations: scipy.stats.qmc, torch.quasirandom, SobolNum.
20. Modern MCMC research (selected directions)
- Riemannian Manifold HMC (RMHMC) (Girolami + Calderhead 2011, JRSS-B) — let the mass matrix
M(x) = G(x)be the local Fisher metric, so dynamics flow on the statistical manifold. Handles ill-conditioned posteriors but requires expensive geometry computations. - Adaptive NUTS with dense mass matrix — Stan supports; expensive but pays off for highly correlated posteriors.
- HMC on Lie groups — for
SO(3)andSE(3)orientation/pose posteriors (Byrne + Girolami 2013; Liu + Zhu et al 2016). See[[Math/lie-groups-so3-se3]]. - Neural Spline Flow + MCMC hybrids — train a normalizing flow as the proposal in IS/MH; “transport map” approach (Marzouk et al 2016; Hoffman et al 2019).
- Piecewise Deterministic Markov Processes (PDMPs) — bouncy particle sampler (Bouchard-Côté et al 2018), Zig-Zag sampler (Bierkens, Fearnhead, Roberts 2019); non-reversible chains with super-efficient mixing in some regimes.
- MALA — Metropolis-adjusted Langevin algorithm (Roberts + Tweedie 1996); single-step HMC with
L=1. - Stein Variational Gradient Descent (Liu + Wang 2016) — deterministic particle method; not strictly MCMC but a competitor.
- Coupled MCMC (Jacob, O’Leary, Atchadé 2020) — unbiased estimators from finite-time coupled chains.
- Quasi-Newton HMC, NUTS-on-flows — active areas.
21. Common pitfalls
- Too short warmup / burn-in — chains haven’t forgotten initialization. Stan’s default 1000 warmup is a minimum, not a maximum.
- One chain only — can’t detect non-mixing or multimodality. Always run ≥ 4 chains from dispersed initial values.
- High autocorrelation — thousands of raw samples but
ESS = 50. Diagnose with autocorrelation plot; remedy with better parameterization, HMC over RW-MH, block updates. - Thinning misuse — thinning saves disk space but loses information; don’t thin to “fix” autocorrelation, just report ESS.
- Ignoring HMC divergences — even one divergence biases tails. Reparameterize, raise
adapt_delta, shrink step size — don’t suppress the warning. - Multimodal posterior, one mode found — symptoms: chains stuck in different modes, R̂ huge between chains but tiny within. Remedies: parallel tempering, mode-finding initialization, mixture proposals, SMC samplers, nested sampling.
- Improper priors → improper posterior — sampler may still “run” but the answer is meaningless. Check posterior is proper before sampling.
- Initialization on a flat region of the prior — gradient is tiny, HMC drifts forever. Initialize near MAP or sensible default.
- Bad mass matrix — diagonal
Mwhen posterior is strongly correlated → very small effective step. Use full mass matrix or reparameterize to decorrelate. - Confusing IS effective sample size with MCMC ESS — different quantities; both important.
22. Cross-references
[[Math/bayesian-inference]]— the inferential framework; conjugate updating, hierarchical models, model comparison.[[Math/probability-fundamentals]]— measure-theoretic basis, conditional expectation, ergodic theorem.[[Math/hypothesis-testing-mle]]— MLE / frequentist counterpart.[[Math/_index]]— full Math tier index.[[Robotics/bayesian-estimation]]— Kalman / EKF / UKF / particle filter (sequential MCMC) for robotics state estimation.[[Math/gradient-descent-variants]]— SGD background for SGLD / SGHMC.[[Math/lie-groups-so3-se3]]— geometry for HMC on rotations and rigid-body poses.[[Math/convex-optimization]]— MAP estimation as the optimization counterpart of Bayesian sampling.
23. Citations
- Metropolis, Rosenbluth, Rosenbluth, Teller, Teller, 1953. “Equation of State Calculations by Fast Computing Machines”. J. Chem. Phys. 21, 1087–1092.
- Hastings, 1970. “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”. Biometrika 57, 97–109.
- Geman + Geman, 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images”. IEEE PAMI 6, 721–741.
- Box + Muller, 1958. “A Note on the Generation of Random Normal Deviates”. Ann. Math. Stat. 29, 610–611.
- Marsaglia + Tsang, 2000. “The Ziggurat Method for Generating Random Variables”. J. Stat. Software.
- Duane, Kennedy, Pendleton, Roweth, 1987. “Hybrid Monte Carlo”. Phys. Lett. B 195, 216–222.
- Neal, 1993. “Probabilistic Inference Using Markov Chain Monte Carlo Methods”. Technical Report CRG-TR-93-1, U. Toronto.
- Neal, 2011. “MCMC Using Hamiltonian Dynamics”. In Handbook of Markov Chain Monte Carlo.
- Hoffman + Gelman, 2014. “The No-U-Turn Sampler”. JMLR 15, 1593–1623.
- Welling + Teh, 2011. “Bayesian Learning via Stochastic Gradient Langevin Dynamics”. ICML.
- Chen + Fox + Guestrin, 2014. “Stochastic Gradient Hamiltonian Monte Carlo”. ICML.
- Green, 1995. “Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination”. Biometrika 82, 711–732.
- Gelman + Rubin, 1992. “Inference from Iterative Simulation Using Multiple Sequences”. Stat. Sci. 7, 457–472.
- Vehtari, Gelman, Simpson, Carpenter, Bürkner, 2021. “Rank-Normalization, Folding, and Localization: An Improved R̂”. Bayesian Analysis 16, 667–718.
- Gordon, Salmond, Smith, 1993. “Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation”. IEE Proc. F 140, 107–113.
- Del Moral, Doucet, Jasra, 2006. “Sequential Monte Carlo Samplers”. JRSS-B 68, 411–436.
- Neal, 2003. “Slice Sampling”. Ann. Stat. 31, 705–767.
- Girolami + Calderhead, 2011. “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods”. JRSS-B 73, 123–214.
- Roberts + Tweedie, 1996. “Exponential Convergence of Langevin Distributions and Their Discrete Approximations”. Bernoulli 2, 341–363.
- Goodman + Weare, 2010. “Ensemble Samplers with Affine Invariance”. CAMCoS 5, 65–80.
- Foreman-Mackey, Hogg, Lang, Goodman, 2013. “emcee: The MCMC Hammer”. PASP 125, 306–312.
- Skilling, 2006. “Nested Sampling for General Bayesian Computation”. Bayesian Analysis 1, 833–860.
- Sobol’, 1967. “On the Distribution of Points in a Cube and the Approximate Evaluation of Integrals”. USSR Comp. Math. 7, 86–112.
- Halton, 1960. “On the Efficiency of Certain Quasi-Random Sequences of Points”. Num. Math. 2, 84–90.
- Owen, 1995. “Randomly Permuted (t,m,s)-Nets and (t,s)-Sequences”. In Monte Carlo and Quasi-Monte Carlo Methods.
- Roberts, Gelman, Gilks, 1997. “Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms”. Ann. Appl. Prob. 7, 110–120.
- Beskos, Pillai, Roberts, Sanz-Serna, Stuart, 2013. “Optimal Tuning of the Hybrid Monte Carlo Algorithm”. Bernoulli 19, 1501–1534.
- Haario, Saksman, Tamminen, 2001. “An Adaptive Metropolis Algorithm”. Bernoulli 7, 223–242.
- Carpenter et al, 2017. “Stan: A Probabilistic Programming Language”. J. Stat. Software 76, 1–32.
- Phan, Pradhan, Jankowiak, 2019. “Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro”. arXiv:1912.11554.
- Bingham et al, 2018. “Pyro: Deep Universal Probabilistic Programming”. JMLR 20, 1–6.
- Ge, Xu, Ghahramani, 2018. “Turing: A Language for Flexible Probabilistic Inference”. AISTATS.
- Bouchard-Côté, Vollmer, Doucet, 2018. “The Bouncy Particle Sampler”. JASA 113, 855–867.
- Bierkens, Fearnhead, Roberts, 2019. “The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data”. Ann. Stat. 47, 1288–1320.
- Veach + Guibas, 1997. “Metropolis Light Transport”. SIGGRAPH.
- Brooks, Gelman, Jones, Meng (eds), 2011. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
- Robert + Casella, 2004. Monte Carlo Statistical Methods, 2nd ed. Springer.
- Gelman, Carlin, Stern, Dunson, Vehtari, Rubin, 2013. Bayesian Data Analysis, 3rd ed. (BDA3). Chapman & Hall/CRC.
- Betancourt, 2018. “A Conceptual Introduction to Hamiltonian Monte Carlo”. arXiv:1701.02434.
- Liu + Wang, 2016. “Stein Variational Gradient Descent”. NeurIPS.
- Jacob, O’Leary, Atchadé, 2020. “Unbiased Markov Chain Monte Carlo Methods with Couplings”. JRSS-B 82, 543–600.