Bayesian Inference — Math Reference

1. At a glance

Bayesian inference treats unknown parameters as random variables with probability distributions encoding belief. Given data, beliefs are updated via Bayes’ rule. This is the philosophical split with frequentist statistics, which treats parameters as fixed-but-unknown constants and probability as long-run frequency under repeated sampling.

Both paradigms are useful in practice; the split matters less than understanding which inferential question is being answered. Bayesians can quote a probability that a parameter lies in an interval given the observed data (“credible interval”); frequentists quote coverage probability across hypothetical repetitions (“confidence interval”). When priors are weak and data are large, the two often agree numerically (Bernstein–von Mises theorem). They diverge when sample sizes are small, when prior knowledge is genuine, when sequential decisions matter, or when nested model comparison is required.

Three structural advantages of the Bayesian approach:

  • Coherent uncertainty quantification — the posterior is a full distribution over parameters, not a point estimate plus a derived interval.
  • Principled incorporation of prior information — domain experts, historical data, and regularization all enter through the prior.
  • Model comparison and averaging — posterior probabilities of models, marginal likelihoods, and predictive scores all fit naturally in one framework.

Costs: computational expense (MCMC, VI), sensitivity to prior choice, and the need to specify a full generative model rather than a discriminative one.

Cross-references: see [[Math/probability-fundamentals]] for distributions and expectation, [[Math/convex-optimization]] for the optimization tools VI and MAP rely on, and [[Robotics/bayesian-estimation]] for sequential state estimation (Kalman, particle filters).

2. Bayes’ rule

For parameters θ and data D,

P(θ | D) = P(D | θ) · P(θ) / P(D)

Names:

  • P(θ | D) — posterior. Belief about θ after seeing D.
  • P(D | θ) — likelihood. Generative model evaluated at the observed data. As a function of θ with D fixed, it is not a probability density over θ.
  • P(θ) — prior. Belief about θ before seeing D.
  • P(D) — marginal likelihood or evidence. Normalizing constant: P(D) = ∫ P(D | θ) · P(θ) dθ.

The proportionality form

P(θ | D) ∝ P(D | θ) · P(θ)

is the workhorse — MCMC and MAP need only the unnormalized posterior. The normalizing integral P(D) is the hard part: it is rarely tractable in closed form for non-conjugate models, and computing it is what model comparison via Bayes factors and the marginal-likelihood term in some criteria requires.

Sequential updating: today’s posterior is tomorrow’s prior. If data arrive in two batches D₁ then D₂,

P(θ | D₁, D₂) ∝ P(D₂ | θ) · P(θ | D₁)

assuming conditional independence given θ. This decomposition underlies online Bayesian learning, Kalman filtering, and particle filtering.

3. Priors

The prior P(θ) is where the modeler injects information. Choices fall on a spectrum from very informative to deliberately uninformative.

Informative priors

Encode genuine domain knowledge. A drug-trial example: prior on response rate centered at the historical mean from similar drugs with variance reflecting between-drug variability. Informative priors are powerful when correctly specified and dangerous when wrong; report sensitivity analyses.

Weakly informative priors

The default recommendation in modern Bayesian workflow. Examples: N(0, 1) on standardized regression coefficients, half-Normal or half-Cauchy on scale parameters. The intent is regularization — keeping the posterior in a sensible region without committing to a specific value. Avoids the pathologies of genuinely flat priors (unboundedly large parameter values can dominate the posterior in models with sparse data).

Uninformative / objective priors

Attempts to encode “no information” while remaining principled. Variants:

  • Uniform priors. Not invariant under reparameterization: a uniform prior on σ is not uniform on log σ. So “ignorance” depends on parameterization.
  • Jeffreys priors (Jeffreys 1946) — proportional to √det(I(θ)) where I is the Fisher information. Invariant under reparameterization by construction; for a binomial proportion p, this gives Beta(½, ½). Can be improper.
  • Reference priors (Bernardo 1979) — maximize the expected KL divergence between prior and posterior; coincide with Jeffreys in one dimension but differ in higher dimensions where Jeffreys can behave badly.
  • Maximum entropy priors (Jaynes) — pick the distribution of maximum entropy subject to known constraints (e.g., known mean and variance → Normal).

Improper priors

Do not integrate to one. Example: a uniform prior on the real line for a location parameter. Sometimes admissible if the resulting posterior is proper. Always verify posterior propriety before using an improper prior — improper posterior means inference is undefined, MCMC will wander forever, and the resulting estimates are nonsense.

Hierarchical priors

The prior itself has parameters (hyperparameters) drawn from a hyperprior.

θ_j | μ, τ ~ N(μ, τ²),    μ, τ ~ p(μ, τ)

This is the engine of partial pooling — observations within a group inform the group’s parameter directly, and across groups they jointly inform the hyperparameters that anchor every group. The 8-schools problem (Rubin 1981) is the canonical illustration.

4. Conjugate priors

A prior is conjugate to a likelihood when the posterior is in the same family as the prior. This produces closed-form updates, which is invaluable for analytic work, fast online updates, and Gibbs samplers.

The standard conjugate pairs:

LikelihoodConjugate priorPosterior
Bernoulli / BinomialBeta(α, β)Beta(α + successes, β + failures)
PoissonGamma(α, β)Gamma(α + Σx, β + n)
Normal, σ² knownNormal on μNormal
Normal, μ knownInverse-Gamma on σ²Inverse-Gamma
Normal, both unknownNormal-Inverse-GammaNormal-Inverse-Gamma
MultinomialDirichlet(α)Dirichlet(α + counts)
Multivariate Normal, Σ knownMultivariate Normal on μMVN
Multivariate Normal, μ knownInverse-Wishart on ΣInverse-Wishart
Exponential / GammaGammaGamma
Uniform(0, θ)ParetoPareto

The Beta-Binomial update is the textbook starter: prior Beta(α, β), observe k successes in n trials, posterior is Beta(α + k, β + n − k). The hyperparameters have an interpretation as pseudo-counts.

Conjugate analyses are limited — most realistic models lack closed-form conjugacy — but they appear inside larger algorithms. Gibbs sampling for a hierarchical model often updates each block by its conjugate full conditional; the Inverse-Wishart on Σ appears inside Bayesian mixture models; the Dirichlet appears in topic models.

5. MAP estimate

The maximum a posteriori estimate is the mode of the posterior:

θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) · P(θ)

Equivalently, working in log space,

θ_MAP = argmin_θ [ −log P(D | θ) − log P(θ) ]

which is penalized maximum likelihood. The prior contributes a penalty term:

  • N(0, σ²) prior → L2 penalty (ridge regression / weight decay).
  • Laplace(0, b) prior → L1 penalty (lasso); produces sparsity.
  • Horseshoe prior (Carvalho 2010) → adaptive sparsity, better than lasso for many sparse problems.

MAP is a point estimate. It is cheap (just an optimization problem) and often a useful starting point for MCMC initialization, but it discards all uncertainty. For multimodal posteriors, MAP can land on a sharp local mode that is unrepresentative of probability mass — a well-known pathology in latent- variable and mixture models. Prefer full posterior or at least Laplace approximation around the MAP when uncertainty matters.

6. Full posterior and uncertainty

The posterior is a distribution, not a number. Summary tools:

  • Posterior mean, median, mode.
  • Posterior variance and standard deviation.
  • Credible intervals — an interval that contains θ with stated posterior probability. Two variants:
    • Equal-tailed (ETI): the 2.5th and 97.5th percentile bracket a 95% interval. Simple, but for skewed posteriors can exclude high-density regions.
    • Highest posterior density (HPD): the smallest interval containing 95% of the posterior mass; every point inside has higher density than every point outside. Coincides with ETI for symmetric unimodal posteriors.

A 95% credible interval has the direct interpretation “given the model, the posterior probability that θ lies in this interval is 0.95.” Contrast with a 95% confidence interval, which has frequentist meaning over hypothetical repetitions: 95% of intervals so constructed would contain the true θ.

Reporting practice: report a posterior point summary plus an interval, plus a density plot or histogram. For decisions, derive posterior expectations of the loss rather than thresholding on the interval.

7. Predictive distributions

The posterior predictive integrates parameter uncertainty into predictions:

P(x_new | D) = ∫ P(x_new | θ) · P(θ | D) dθ

Each prediction is a mixture over parameter values weighted by their posterior probability. With samples θ⁽ⁱ⁾ from the posterior, the predictive is approximated by

P(x_new | D) ≈ (1/M) Σᵢ P(x_new | θ⁽ⁱ⁾)

which is trivial to implement: sample θ⁽ⁱ⁾, simulate x_new given θ⁽ⁱ⁾, repeat.

Predictive distributions are preferred over plug-in estimates P(x_new | θ̂). Plug-in ignores parameter uncertainty and produces overconfident predictions, especially with small samples or many parameters. The difference is large in hierarchical models, time-series forecasting, and Gaussian processes.

The prior predictive — P(x_new) = ∫ P(x_new | θ) · P(θ) dθ — is the distribution of data implied by the prior alone, before seeing any. Prior predictive checks (simulate data from the prior, inspect for plausibility) are a standard step in Bayesian workflow.

Posterior predictive checks compare replicates simulated from the posterior predictive to the observed data. Discrepancies flag model misspecification.

8. MCMC

Markov chain Monte Carlo constructs a Markov chain whose stationary distribution is the target posterior. Drawing a long enough chain yields samples that, after burn-in, approximate samples from the posterior.

Metropolis–Hastings

(Metropolis et al. 1953, J Chem Phys; Hastings 1970, Biometrika.) The general recipe:

  1. Start at θ.
  2. Propose θ’ from a proposal distribution q(θ’ | θ).
  3. Compute acceptance ratio α = min(1, [P(D|θ’) P(θ’) q(θ|θ’)] / [P(D|θ) P(θ) q(θ’|θ)])
  4. Accept θ’ with probability α; otherwise keep θ.

Requires only the unnormalized posterior. Proposal choice is the engineering problem: too narrow → slow mixing, too wide → low acceptance. The simplest proposal is a Gaussian random walk centered at the current θ; targets acceptance around 0.234 in high dimensions (Roberts–Gelman–Gilks 1997).

Gibbs sampling

(Geman + Geman 1984.) A special case of Metropolis–Hastings where the proposal is the exact full conditional. For parameter blocks (θ₁, θ₂, …, θ_k), iterate:

θ₁⁽ᵗ⁺¹⁾ ~ P(θ₁ | θ₂⁽ᵗ⁾, …, θ_k⁽ᵗ⁾, D)
θ₂⁽ᵗ⁺¹⁾ ~ P(θ₂ | θ₁⁽ᵗ⁺¹⁾, θ₃⁽ᵗ⁾, …, θ_k⁽ᵗ⁾, D)
…

Acceptance is always 1. Requires that full conditionals are tractable to sample from, which conjugate priors often provide. Suffers when parameters are correlated — the chain moves coordinate-wise and explores ridges slowly.

Hamiltonian Monte Carlo

(Duane et al. 1987 Phys Lett B; Neal 1993, 2011 in Brooks–Gelman–Jones–Meng “Handbook of Markov Chain Monte Carlo”.) Uses gradients of the log posterior. Augment θ with momentum r and simulate Hamiltonian dynamics on the joint (θ, r) using the leapfrog integrator, then Metropolis-correct for integrator error. The dynamics conserve a Hamiltonian H = U(θ) + K(r) where U = −log P(θ|D) and K = ½ rᵀ M⁻¹ r, so proposals can travel long distances while staying in high-probability regions.

Two tuning parameters: step size ε and number of leapfrog steps L. Both critical and both finicky to tune by hand.

NUTS

(Hoffman + Gelman 2014, JMLR.) No-U-Turn Sampler. Adaptive HMC that picks L automatically by running the leapfrog forward and backward until the trajectory “u-turns” (starts returning toward its origin). Step size ε is tuned during a warmup phase using dual averaging.

NUTS is the default sampler in Stan, PyMC (since v3), and NumPyro. For most continuous models with computable gradients, it is the right starting point. Limitations: requires differentiable log posterior, struggles with discrete parameters (marginalize them out if possible), and can fail on funnel-shaped posteriors (the Neal funnel) without reparameterization.

Sequential Monte Carlo

Particle filters for state-space and temporal models. Maintain a weighted particle set approximating the posterior, propagate through time, resample to avoid weight degeneracy. SMC samplers generalize this to arbitrary targets via tempering — sequence of distributions interpolating between prior and posterior. Useful when MCMC is hard to mix (multimodal posteriors, label switching).

Diagnostics

A chain that has not converged is not a sample. Mandatory checks:

  • R-hat (Gelman–Rubin potential scale reduction factor). Compares within-chain to between-chain variance across multiple independent chains. R-hat ≈ 1.00 indicates convergence; > 1.01 is a warning, > 1.05 is a failure. The improved rank-normalized split-R-hat (Vehtari et al. 2021) is the modern standard.
  • Effective sample size (ESS). Number of independent samples worth of information; small ESS means autocorrelation is high and inferences are imprecise. Aim for ESS > 400 for stable interval estimates.
  • Trace plots. Visual check for mixing, stationarity, and obvious failures.
  • Energy diagnostics (HMC/NUTS): E-BFMI flags poor energy mixing.
  • Divergent transitions (HMC/NUTS): nonzero count indicates the integrator is failing in regions of high curvature; reparameterize or shrink step size.
  • Posterior predictive checks. Simulate replicates; compare to data.

The Bayesian workflow (Gelman, Vehtari, Simpson et al. 2020, arXiv) codifies this as a multi-step iterative process: prior predictive check, fit, diagnose, posterior predictive check, sensitivity analysis, model comparison.

9. Variational inference

VI replaces the sampling problem with an optimization problem. Pick a family of distributions Q, find q* ∈ Q minimizing KL divergence to the posterior:

q* = argmin_{q ∈ Q} KL(q(θ) || p(θ | D))

KL is intractable directly because it involves the unknown evidence, but minimizing KL is equivalent to maximizing the evidence lower bound:

ELBO(q) = E_q[log P(D, θ)] − E_q[log q(θ)]
        = log P(D) − KL(q || p(·|D))

Since log P(D) is constant, maximizing the ELBO minimizes the KL. The ELBO is also a lower bound on log marginal likelihood, sometimes used (cautiously) for model comparison.

VI is faster than MCMC and scales to large data, but biased: the posterior approximation is whatever the chosen family Q can express. Mean-field VI in particular tends to underestimate posterior variance — KL(q||p) penalizes mass where q is high and p is low, so q shrinks to fit modes tightly.

Mean-field VI

Assume q factorizes: q(θ) = ∏_i q_i(θ_i). Coordinate-ascent updates each q_i optimally given the others — closed form for conjugate-exponential models. Standard in classical Bayesian text (Bishop 2006 ch 10).

Stochastic VI

(Hoffman et al. 2013 JMLR.) Subsamples data minibatches and uses noisy gradient estimates of the ELBO. Scales to billions of observations. Drove the use of VI in industrial-scale topic modeling.

Black-box VI

(Ranganath, Gerrish, Blei 2014 AISTATS.) Estimates ELBO gradients via Monte Carlo, avoiding model-specific math. Variance reduction (control variates, Rao-Blackwellization) is necessary in practice.

ADVI

(Kucukelbir et al. 2017 JMLR.) Automatic Differentiation Variational Inference. Transforms constrained parameters to the real line, fits a Gaussian in the transformed space, transforms back. Universal — works on any differentiable model. Implemented in Stan and PyMC.

VAE

(Kingma + Welling 2013 ICLR.) Variational autoencoder. Amortized VI: an encoder network outputs variational parameters q(z | x; φ), trained jointly with a decoder network parameterizing P(x | z; θ). The reparameterization trick (z = μ + σ · ε with ε standard normal) allows pathwise gradients through sampling. Beyond density modeling — central to representation learning.

Normalizing flows

(Rezende + Mohamed 2015 ICML.) Express q as a chain of invertible transformations applied to a base distribution, computing the change-of- variables Jacobian to keep tractable density. Real NVP, Masked Autoregressive Flow, Inverse Autoregressive Flow, neural spline flows. Used for flexible variational families and density estimation.

10. Bayesian model comparison

Comparing two models M₁ and M₂ via posterior probabilities:

P(M_k | D) ∝ P(D | M_k) · P(M_k)

The marginal likelihood P(D | M_k) = ∫ P(D | θ_k, M_k) · P(θ_k | M_k) dθ_k.

Bayes factor

(Jeffreys 1961, “Theory of Probability”.) BF₁₂ = P(D | M₁) / P(D | M₂). Interpretation guides (Kass + Raftery 1995): BF > 10 strong, > 100 decisive.

Practical issues: marginal likelihoods are hard to compute (bridge sampling, nested sampling, thermodynamic integration), and Bayes factors are extremely sensitive to prior choice. Diffuse priors penalize complex models more than they “should” — Lindley’s paradox (Lindley 1957) shows the posterior of a null hypothesis can approach 1 as the prior on the alternative becomes vague, even when the likelihood ratio favors the alternative.

WAIC

(Watanabe 2010 JMLR; Watanabe–Akaike Information Criterion.) Estimates out-of- sample expected log predictive density:

WAIC = −2 (lppd − p_WAIC)

where lppd is the log pointwise predictive density and p_WAIC is an effective- parameters correction. Computable from posterior samples, no extra fits required. Works for hierarchical and singular models where AIC/DIC fail.

LOO-CV with PSIS

(Vehtari, Gelman, Gabry 2017 Stat Comput.) Leave-one-out cross-validation approximated via Pareto-smoothed importance sampling (PSIS-LOO). Reweights posterior samples to estimate held-out predictive density without re-fitting. Pareto-k diagnostic flags observations where the approximation is unreliable. Currently the recommended default for comparing Bayesian models.

DIC

(Spiegelhalter et al. 2002 JRSSB.) Deviance Information Criterion. Historically common (default in WinBUGS), but criticized: not invariant to reparameterization and ill-behaved for singular and hierarchical models. Prefer WAIC or LOO-CV.

11. Empirical Bayes

Estimate hyperparameters of a hierarchical model from the data — Type II maximum likelihood — instead of placing a hyperprior on them. Computationally cheaper than a fully Bayesian treatment; loses some uncertainty quantification at the top level of the hierarchy.

James-Stein shrinkage (James + Stein 1961, “Estimation with quadratic loss”): when estimating ≥ 3 means jointly, shrinking each individual mean toward the grand mean dominates the MLE in total mean squared error. The seminal empirical-Bayes result; the shrinkage factor is data-estimated.

In practice, empirical Bayes is often a fine approximation to full Bayes when the top-level prior has many groups informing it (so the hyperparameter uncertainty is genuinely small). With few groups, treat hyperparameters properly — give them a weakly informative hyperprior and sample them.

12. Hierarchical and multilevel models

The structure: observations within groups, groups within a population.

y_ij | θ_j ~ p(y | θ_j)              # observations in group j
θ_j | μ, τ ~ p(θ | μ, τ)             # group-level parameters
μ, τ ~ p(μ, τ)                       # population-level hyperprior

Three estimation strategies:

  • No pooling: fit each θ_j independently. Wastes the structure; high variance for small groups.
  • Complete pooling: collapse all data, estimate a single θ. Ignores group variation; biased.
  • Partial pooling: hierarchical model learns how much to shrink each group toward the population mean based on within-group and between-group variance.

The amount of shrinkage is data-driven: groups with little data get pulled toward the population mean; groups with abundant data dominate their own estimates. Hierarchical Bayes is one of the most consequential ideas in applied statistics — it is how you fit “small-area” estimates, “random effects” in mixed models, “transfer learning” across tasks.

The 8-schools problem (Rubin 1981, Educational and Psychological Measurement; Gelman BDA3 ch 5) is the canonical example. Eight schools each ran an SAT coaching program; effects (with standard errors) varied. No-pooling and complete-pooling estimates both feel wrong. Partial pooling produces effect estimates that are shrunken toward the average by an amount depending on the estimated between-school variance τ. The non-centered parameterization is the standard workaround for the Neal funnel that arises when τ is poorly identified.

13. Gaussian processes

(Rasmussen + Williams 2006, “Gaussian Processes for Machine Learning”, MIT Press.) A Gaussian process is a distribution over functions f: X → ℝ such that any finite collection of function values is jointly Gaussian. Specified by a mean function m(x) and a covariance (kernel) function k(x, x’).

GP regression: prior is GP(0, k); observations y = f(x) + ε with ε ~ N(0, σ²); posterior over function values at test points x* is

f* | y ~ N( k*ᵀ (K + σ²I)⁻¹ y, k(x*, x*) − k*ᵀ (K + σ²I)⁻¹ k* )

with K_{ij} = k(x_i, x_j) and kthe cross-covariance to x. Closed form, no MCMC needed for the function — but the matrix inverse is O(n³).

Kernel choice is the modeling decision:

  • RBF / squared exponential: k(x, x’) = σ² exp(−‖x − x’‖² / 2ℓ²). Smooth (infinitely differentiable). Defaults to oversmoothness for real data.
  • Matérn ν=3/2 or ν=5/2: rougher, often more realistic.
  • Periodic kernels for cyclic structure.
  • Linear kernel recovers Bayesian linear regression.
  • ARD (Automatic Relevance Determination): separate length scale per input dimension; large length scale → irrelevant input.
  • Composition: kernels can be added and multiplied; structure search picks combinations (Duvenaud 2013).

Scalability: O(n³) memory and compute is the bottleneck. Sparse variational GPs (Titsias 2009 AISTATS) introduce m inducing points and reduce to O(nm²). SVGP (Hensman 2013), stochastic variational extensions, KISS-GP, BBMM (black-box matrix-matrix), and others extend to large data.

Libraries: GPflow (TensorFlow), GPy (numpy/scipy, older), GPyTorch (PyTorch, current state-of-the-art for scalable inference).

14. Bayesian deep learning

Neural networks trained by maximum likelihood overfit and produce overconfident predictions. Bayesian treatment adds a prior on weights and an approximation to the posterior. Several practical approximations:

  • MC dropout (Gal + Ghahramani 2016 ICML). Dropout at test time approximates variational inference with a particular Bernoulli variational family. Run many forward passes; average for prediction; variance estimates uncertainty.
  • SWAG — Stochastic Weight Averaging Gaussian (Maddox et al. 2019 NeurIPS). Collect SGD iterates near convergence, fit a Gaussian to the trajectory, sample from it for predictions.
  • Laplace approximation (MacKay 1992 Neural Comput, “A practical Bayesian framework for backpropagation networks”). Fit a Gaussian centered at the MAP with covariance equal to the inverse Hessian (or a tractable approximation like K-FAC or last-layer-only). Post-hoc; no retraining.
  • Mean-field VI on neural-net weights. Bayes by Backprop (Blundell 2015). Doubles parameter count (mean + variance per weight); variational posteriors are restrictive.
  • HMC over weights. Theoretically clean but practically infeasible at scale.
  • Deep ensembles (Lakshminarayanan 2017 NeurIPS). Train M networks from different initializations; average predictions. Not strictly Bayesian but empirically competitive on calibration and uncertainty; can be reinterpreted as approximate Bayesian (Wilson + Izmailov 2020).

For most production use cases, deep ensembles or SWAG are the pragmatic choice; full Bayesian DL remains a research area.

15. Probabilistic programming languages

PPLs let you express a generative model in code and obtain inference for free (or near-free). Major systems:

  • Stan (Carpenter et al. 2017 J Stat Softw, “Stan: A probabilistic programming language”). HMC/NUTS, ADVI, optimization. Statically typed model language compiled to C++. Considered the gold standard for accuracy and diagnostics in mid-sized models. Interfaces: RStan, PyStan, CmdStanPy, CmdStanR.
  • PyMC. Started as PyMC3 with Theano, migrated to PyMC4 on TensorFlow, then to PyMC v5 (since 2022) on PyTensor (Theano fork) with JAX/Numba backends. Pythonic API, broad sampler support (NUTS, SMC, VI), strong diagnostics ecosystem (ArviZ).
  • NumPyro (Phan, Pradhan, Jankowiak 2019). JAX-based, very fast NUTS thanks to XLA compilation and GPU support. Same model-spec API as Pyro.
  • Pyro (Bingham et al. 2018 JMLR, “Pyro: Deep universal probabilistic programming”). PyTorch-based, focused on VI, amortized inference, and deep probabilistic models. Companion to NumPyro for non-deep work.
  • Turing.jl (Ge, Xu, Ghahramani 2018 AISTATS). Julia. Composable inference algorithms; strong for research and custom samplers.
  • Edward, TensorFlow Probability. Edward (Tran 2016) was an early effort, superseded by TFP. TFP remains active for low-level distribution primitives but is less commonly used as a full PPL today; alternatives include Bean Machine (Meta) and Bambi (PyMC-based formula interface).
  • BUGS / WinBUGS / OpenBUGS / JAGS. Gibbs-based, legacy. JAGS (Plummer 2003) is still maintained and useful for models where conjugate Gibbs is natural; for general-purpose continuous models, NUTS-based systems outperform.

Workflow recommendation in 2026: NumPyro or Stan for serious modeling on larger problems; PyMC v5 for ease of use and integration with the Python data ecosystem; Pyro for deep probabilistic models with amortized inference.

16. Applications

A/B testing

Bayesian A/B testing models conversion rates as Beta-Binomial. Posterior on the difference p_B − p_A is computable from posterior samples; decisions can incorporate priors (last-quarter’s baseline), allow continuous monitoring without spending α the way sequential frequentist tests must, and quote posterior probability of lift directly. Decision-theoretic stopping rules balance expected gain against test cost.

Time series and forecasting

State-space models, dynamic linear models, structural time series. Prophet (Taylor + Letham 2017, Facebook) wraps a Stan/PyMC time-series model in an accessible API. Bayesian structural time series (BSTS, Brodersen et al. 2015, Google) for causal-impact analysis. ARIMA-like models also have Bayesian analogues; benefits include direct forecast intervals and easy handling of missing data.

Clinical trials

Adaptive Bayesian designs allow dose-finding, sample-size adjustment, and early stopping based on posterior probability of efficacy or futility. FDA’s 2010 (and updated 2019) guidance on the use of Bayesian statistics in medical- device clinical trials provides a regulatory foundation.

Topic modeling

Latent Dirichlet Allocation (Blei, Ng, Jordan 2003 JMLR). Documents are mixtures over topics; topics are distributions over words; Dirichlet priors on both. Inference traditionally by Gibbs (Griffiths + Steyvers 2004) or variational methods (collapsed VI). Foundational generative model in NLP pre-transformer era; still used for interpretable corpus analysis.

Recommendation

Bayesian probabilistic matrix factorization (Salakhutdinov + Mnih 2008). Latent user and item factors with Gaussian priors; observations with Gaussian likelihood; Gibbs sampling. Provides uncertainty over predictions, useful in cold-start and exploration settings.

Robotics

SLAM (simultaneous localization and mapping) is fundamentally a Bayesian filtering problem: maintain a posterior over robot pose and map given sensor observations. Kalman filter (linear Gaussian), extended Kalman filter (local linearization), unscented Kalman filter (Julier + Uhlmann 1997, deterministic sigma points), and particle filter / Monte Carlo localization (Thrun, Burgard, Fox 2005, “Probabilistic Robotics”). See [[Robotics/bayesian-estimation]].

ML hyperparameter tuning

Bayesian optimization (Snoek, Larochelle, Adams 2012 NeurIPS, “Practical Bayesian optimization of machine learning algorithms”). GP surrogate model over the response surface (validation loss as a function of hyperparameters); acquisition function (Expected Improvement, Upper Confidence Bound, Thompson sampling, knowledge gradient) balances exploration and exploitation; query the expensive function (training a model) where the acquisition is highest. Libraries: BoTorch (Meta, PyTorch + GPyTorch), Ax (Meta, higher-level over BoTorch), scikit-optimize, GPyOpt, Optuna with its TPE sampler (a non-GP Bayesian alternative).

17. Common pitfalls

  • Improper prior leading to improper posterior. Always check. Improper posterior means MCMC will run forever and report meaningless quantities.
  • Unconverged MCMC. R-hat near 1.00 is necessary, not sufficient. Look at ESS, divergences, trace plots, and predictive checks before trusting estimates. Multiple chains from dispersed starts.
  • Confusing the posterior P(θ|D) with the posterior predictive P(x_new | D). The former is about parameters; the latter is about future data and is what you should report for forecasting tasks.
  • Over-narrow priors that dominate the likelihood and produce regularized estimates without the modeler realizing. Always do a sensitivity analysis: refit with wider/narrower priors and check stability.
  • Computing Bayes factors with diffuse priors (Lindley’s paradox). Marginal likelihoods are sensitive to priors on parameters that the likelihood hardly identifies. Prefer LOO-CV or WAIC for routine comparison.
  • Treating posterior mode (MAP) as representative of posterior mass. Especially bad in high dimensions where mass concentrates on a typical set far from the mode (the “concentration of measure” phenomenon).
  • Discrete parameters in NUTS-based PPLs. Marginalize them out analytically if possible, or use mixed Gibbs / HMC schemes (Stan’s int<lower=0> parameters are not supported in HMC for this reason).
  • Ignoring divergences in HMC. Even a handful indicates regions of the posterior the sampler cannot reach correctly. Reparameterize (centered vs non-centered), increase target acceptance, shrink step size.
  • Pareto-k > 0.7 in PSIS-LOO. The LOO estimate is unreliable for those observations; either refit without them or accept the imprecision.

18. Cross-references

  • [[Math/probability-fundamentals]] — distributions, expectation, conditional probability, the building blocks.
  • [[Math/_index]] — the math subdomain table of contents.
  • [[Math/convex-optimization]] — MAP and VI both reduce to optimization problems; understanding convexity, gradient methods, and stochastic optimization is necessary for tuning Bayesian software.
  • [[Math/linear-algebra-essentials]] — Gaussian processes and Bayesian linear models need Cholesky, eigendecomposition, and structured matrix inversion. Multivariate Normal manipulations are the workhorse.
  • [[Robotics/bayesian-estimation]] — Kalman, EKF, UKF, particle filters as sequential Bayesian updates on state-space models.
  • [[Compute/transformer-architecture]] — uncertainty quantification in large neural models intersects with Bayesian DL approximations.

19. Citations

Books:

  • Gelman, Carlin, Stern, Dunson, Vehtari, Rubin. “Bayesian Data Analysis”, 3rd ed., Chapman & Hall/CRC, 2013. The reference text.
  • Bishop. “Pattern Recognition and Machine Learning”, Springer, 2006. Ch 10 on approximate inference, ch 8 on graphical models.
  • Rasmussen, Williams. “Gaussian Processes for Machine Learning”, MIT Press, 2006. Free online.
  • Murphy. “Probabilistic Machine Learning: An Introduction” (2022) and “Advanced Topics” (2023), MIT Press.
  • Thrun, Burgard, Fox. “Probabilistic Robotics”, MIT Press, 2005.
  • Brooks, Gelman, Jones, Meng (eds.). “Handbook of Markov Chain Monte Carlo”, Chapman & Hall/CRC, 2011.
  • Jeffreys. “Theory of Probability”, 3rd ed., Oxford, 1961.
  • McElreath. “Statistical Rethinking”, 2nd ed., Chapman & Hall/CRC, 2020.

Foundational papers:

  • Metropolis, Rosenbluth, Rosenbluth, Teller, Teller. “Equation of state calculations by fast computing machines.” J Chem Phys, 1953.
  • Hastings. “Monte Carlo sampling methods using Markov chains and their applications.” Biometrika, 1970.
  • Geman, Geman. “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.” IEEE PAMI, 1984.
  • Duane, Kennedy, Pendleton, Roweth. “Hybrid Monte Carlo.” Physics Letters B, 1987.
  • Neal. “Probabilistic Inference Using Markov Chain Monte Carlo Methods.” Tech report, 1993.
  • Hoffman, Gelman. “The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo.” JMLR, 2014.
  • Jeffreys. “An invariant form for the prior probability in estimation problems.” Proc Roy Soc A, 1946.
  • Bernardo. “Reference posterior distributions for Bayesian inference.” JRSS B, 1979.
  • James, Stein. “Estimation with quadratic loss.” Berkeley Symp, 1961.
  • Rubin. “Estimation in parallel randomized experiments.” JEBS, 1981.
  • Watanabe. “Asymptotic equivalence of Bayes cross validation and widely applicable information criterion.” JMLR, 2010.
  • Vehtari, Gelman, Gabry. “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.” Stat Comput, 2017.
  • Spiegelhalter, Best, Carlin, van der Linde. “Bayesian measures of model complexity and fit.” JRSS B, 2002.

VI and modern inference:

  • Hoffman, Blei, Wang, Paisley. “Stochastic variational inference.” JMLR, 2013.
  • Ranganath, Gerrish, Blei. “Black box variational inference.” AISTATS, 2014.
  • Kucukelbir, Tran, Ranganath, Gelman, Blei. “Automatic differentiation variational inference.” JMLR, 2017.
  • Kingma, Welling. “Auto-encoding variational Bayes.” ICLR, 2013.
  • Rezende, Mohamed. “Variational inference with normalizing flows.” ICML, 2015.
  • Titsias. “Variational learning of inducing variables in sparse Gaussian processes.” AISTATS, 2009.

Bayesian DL:

  • MacKay. “A practical Bayesian framework for backpropagation networks.” Neural Comput, 1992.
  • Gal, Ghahramani. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning.” ICML, 2016.
  • Maddox, Garipov, Izmailov, Vetrov, Wilson. “A simple baseline for Bayesian uncertainty in deep learning” (SWAG). NeurIPS, 2019.
  • Blundell, Cornebise, Kavukcuoglu, Wierstra. “Weight uncertainty in neural networks” (Bayes by Backprop). ICML, 2015.
  • Lakshminarayanan, Pritzel, Blundell. “Simple and scalable predictive uncertainty estimation using deep ensembles.” NeurIPS, 2017.

PPLs:

  • Carpenter et al. “Stan: A probabilistic programming language.” J Stat Softw, 2017.
  • Bingham et al. “Pyro: Deep universal probabilistic programming.” JMLR, 2018.
  • Phan, Pradhan, Jankowiak. “Composable effects for flexible and accelerated probabilistic programming in NumPyro.” arXiv, 2019.
  • Ge, Xu, Ghahramani. “Turing: A language for flexible probabilistic inference.” AISTATS, 2018.
  • Plummer. “JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling.” DSC, 2003.

Applications:

  • Blei, Ng, Jordan. “Latent Dirichlet Allocation.” JMLR, 2003.
  • Salakhutdinov, Mnih. “Bayesian probabilistic matrix factorization using Markov chain Monte Carlo.” ICML, 2008.
  • Snoek, Larochelle, Adams. “Practical Bayesian optimization of machine learning algorithms.” NeurIPS, 2012.
  • Taylor, Letham. “Forecasting at scale” (Prophet). The American Statistician, 2018.
  • Brodersen et al. “Inferring causal impact using Bayesian structural time-series models.” Annals of Applied Statistics, 2015.
  • Gelman, Vehtari, Simpson et al. “Bayesian Workflow.” arXiv, 2020.

Documentation:

  • Stan Reference Manual — mc-stan.org.
  • PyMC documentation — pymc.io.
  • NumPyro documentation — num.pyro.ai.
  • ArviZ documentation — arviz-devs.github.io.
  • GPyTorch documentation — gpytorch.ai.