Variational Inference — ELBO, Mean-Field, VAEs, Normalizing Flows
Variational inference (VI) recasts Bayesian posterior computation as an optimization problem: rather than draw exact samples from the intractable posterior p(z|x), one selects a tractable family Q of densities q(z) and finds the member closest in Kullback–Leibler (KL) divergence to the true posterior. Compared to Markov chain Monte Carlo (MCMC) — which is asymptotically exact but mixes slowly on high-dimensional or large-data problems — VI trades exactness for scalability and deterministic convergence. The 1999 PhD thesis of Michael Jordan’s group at Berkeley (Jaakkola, Saul, Ghahramani) systematized the approach; the renaissance came in the 2010s when stochastic gradients, the reparameterization trick, and amortized neural encoders combined to make VI tractable for models with millions of parameters and billions of observations.
This note covers the analytic foundation (evidence lower bound, mean-field, conjugate-exponential families), the modern algorithmic toolkit (stochastic VI, black-box VI, reparameterization, normalizing flows, Stein variational gradient descent), the variational autoencoder family and its descendants, comparisons against MCMC and EM, diagnostic practice, and current software stacks. SI units throughout; logs natural unless stated.
1. Motivation — why the posterior is intractable
Given a probabilistic generative model with latent variable z and observed x,
p(z | x) = p(x | z) p(z) / p(x), p(x) = ∫ p(x | z) p(z) dz.
The marginal likelihood p(x) (the “evidence”) is an integral over the latent space. For all but the simplest conjugate models — Beta–Bernoulli, Gaussian–Gaussian with known variance, Dirichlet–Multinomial, the linear Gaussian state-space model — this integral admits no closed form. Three strategies exist:
- Exact enumeration or Gaussian conjugacy — restricted to small discrete state spaces or strictly linear-Gaussian models.
- MCMC — Metropolis–Hastings (1953/1970), Gibbs sampling (Geman & Geman 1984), Hamiltonian Monte Carlo (Duane 1987, Neal 2011), the No-U-Turn Sampler (Hoffman & Gelman 2014). Asymptotically exact, but requires many serial transitions and scales poorly when each likelihood evaluation touches the full dataset.
- Variational inference — approximate p(z|x) with q(z) ∈ Q by optimizing a tractable surrogate. Cost is one optimization run; no asymptotic exactness guarantee, but tractable on large data and amenable to GPU / stochastic gradient pipelines.
A 2017 review by Blei, Kucukelbir & McAuliffe (JASA, “Variational Inference: A Review for Statisticians”) frames VI vs MCMC as the central methodological choice in modern applied Bayesian computation.
2. The evidence lower bound (ELBO)
Start from the log-evidence and introduce any density q(z) over the latent:
log p(x) = log ∫ p(x, z) dz
= log ∫ q(z) · p(x, z) / q(z) dz
≥ ∫ q(z) log [ p(x, z) / q(z) ] dz (Jensen, log is concave)
= E_q[log p(x, z)] − E_q[log q(z)]
≡ ELBO(q).
A tighter, exact decomposition:
log p(x) = ELBO(q) + KL( q(z) ‖ p(z | x) ).
Because KL is non-negative and zero iff q = p almost everywhere, maximizing ELBO in q is equivalent to minimizing KL(q ‖ p(z|x)) — and log p(x) is a constant in q, so the gap closes from below. The ELBO has two readable forms:
ELBO = E_q[log p(x | z)] − KL( q(z) ‖ p(z) ) (reconstruction − prior-KL)
= E_q[log p(x, z)] + H[q] (energy + entropy)
The first form makes the VAE objective transparent: maximize expected log-likelihood of x given a sample z from the encoder, regularized by how far the encoder strays from the prior.
Asymmetry of the KL. VI as commonly stated minimizes KL(q ‖ p), not KL(p ‖ q). The former is mode-seeking / zero-forcing: q places mass where p has mass, but is willing to ignore modes of p (mass-covering failures). The latter — minimized by expectation propagation (Minka 2001) and by α-divergence VI with α → 1 — is mass-covering / zero-avoiding. For multi-modal posteriors KL(q ‖ p) with a unimodal q under-estimates posterior variance, a well-documented pathology.
Information-theoretic reading. Rewrite the ELBO as
ELBO(q) = log p(x) − KL(q(z) ‖ p(z | x)).
The bound is tight when q matches the posterior exactly. The slack equals the KL gap — the “amortization + family + optimization gaps” decomposition of Cremer et al. (2018) attributes residual error respectively to a global encoder being unable to match any per-datum optimum, to Q being too restrictive to contain p(z|x), and to the optimizer failing to find the best q ∈ Q. All three are addressable independently: amortization with larger encoders or semi-amortized refinement (Kim, Wiseman, Miller, Sontag & Rush, ICML 2018); family with flows or hierarchical priors; optimization with natural gradients or per-datum SGD warm starts.
ELBO as free energy. The “energy + entropy” form
ELBO(q) = − F[q] = E_q[log p(x, z)] + H[q]
borrows physics nomenclature: −ELBO is a variational free energy. The link is more than analogy — Feynman’s path-integral formulation and statistical mechanics motivated Helmholtz’s free-energy principle in computational neuroscience (Friston, Nature Reviews Neuroscience 2010), which posits the brain as an organ that minimizes variational free energy over generative models of sensory input. The mathematics is identical.
3. Mean-field VI and CAVI
The simplest variational family is the mean-field factorization:
q(z) = ∏_{i=1}^M q_i(z_i).
Each latent dimension (or block) has its own free density. Plugging into the ELBO and isolating q_i gives the celebrated coordinate ascent variational inference (CAVI) update:
log q_i*(z_i) ∝ E_{q_{−i}}[ log p(x, z) ] + const,
i.e. the optimal q_i is the exponentiated expected log-joint with respect to the other factors. In a conjugate-exponential model — every complete conditional p(z_i | z_{−i}, x) is in an exponential family — this expectation is closed-form, and CAVI reduces to repeatedly updating natural parameters of each q_i until the ELBO plateaus.
Worked exemplars in the conjugate-exponential family.
- Bayesian Gaussian mixture model. Latents are component assignments and Gaussian–Wishart cluster parameters; CAVI updates resemble a soft-EM iteration but maintain full posterior distributions over means, covariances, and mixing weights (Bishop, Pattern Recognition and Machine Learning, 2006, §10.2).
- Latent Dirichlet Allocation. Blei, Ng & Jordan (JMLR 2003, “Latent Dirichlet Allocation”) gave the canonical CAVI derivation for topic models — q over topic assignments per token (multinomial) and per-document topic proportions (Dirichlet), updated alternately. The variational EM version remains a workhorse for industrial topic modeling.
- Bayesian linear regression with Gaussian prior on weights and inverse-gamma on noise variance — CAVI gives Gaussian × inverse-gamma updates; closed-form ELBO suitable for model selection.
- Hidden Markov models and state-space models — structured mean-field with q(z_{1:T}) factorized over time but keeping the forward–backward structure within each factor; Beal’s 2003 thesis (Cambridge) gave the variational Bayesian HMM.
CAVI converges to a local optimum of the ELBO. It is sensitive to initialization (especially for non-identifiable latents like mixture components) and to the granularity of the factorization — too aggressive a factorization underestimates posterior covariance; structured mean-field keeps key couplings.
Sketch of the CAVI derivation. For fixed q_{−i}, the ELBO as a functional of q_i is
ELBO(q_i) = ∫ q_i(z_i) E_{q_{−i}}[log p(x, z)] dz_i − ∫ q_i(z_i) log q_i(z_i) dz_i + const.
Add a Lagrange multiplier for the normalization ∫ q_i = 1 and take the functional derivative δ/δq_i; setting it to zero yields
log q_i*(z_i) = E_{q_{−i}}[log p(x, z)] + const,
i.e. the unnormalized optimal factor is the geometric mean (in log space) of the joint with respect to the other factors. When p is conjugate-exponential, this expectation is linear in the sufficient statistics of z_i and the optimal q_i lies in the same exponential family — yielding a closed-form natural-parameter update. The proof generalizes to any free-form q_i; no parametric assumption is needed.
Granularity trade-off. Fully factorized mean-field (one factor per scalar latent) is fastest but discards all posterior correlations. Block mean-field — one factor per logically coupled group, e.g. all weights of one neural-network layer — captures intra-block correlations. Structured mean-field retains a chosen subgraph: in hidden Markov models, q(z_{1:T}) = ∏n q_n(z{1:T}) factorizes across documents but preserves the within-document Markov chain (forward–backward inside each factor). The choice is dataset- and model-specific; there is no free lunch.
4. Stochastic variational inference (SVI)
When N (number of data points) is large, each CAVI sweep touches the entire dataset to update global parameters β (e.g. cluster parameters, topic distributions). Hoffman, Blei, Wang & Paisley (JMLR 2013, “Stochastic Variational Inference”) observed that the optimal natural-parameter update for β is a sum over data points, hence amenable to noisy unbiased estimation from a minibatch.
The recipe:
- Sample a minibatch S ⊂ {1, …, N} of size |S|.
- For each i ∈ S, optimize local variational parameters φ_i (the per-data factors) to convergence given the current global β.
- Compute the natural gradient of the ELBO with respect to global natural parameters, rescaled by N / |S| to remain unbiased.
- Take a step: β ← β + ρ_t · (intermediate estimate − β), with Robbins–Monro step size ρ_t satisfying Σ ρ_t = ∞, Σ ρ_t² < ∞.
The natural gradient (Amari 1998) corrects for the curvature of the manifold of probability distributions — for exponential families, it is simply the difference in natural parameters, dramatically faster than vanilla gradient. SVI made variational LDA practical on 3.3 million Wikipedia articles in the JMLR paper, a step change in scale.
Why natural gradients win for exponential families. The Euclidean gradient ascends in the direction steepest in parameter space, but parameter space has no canonical metric — rescaling a parameter changes which direction is “steepest.” The natural gradient (Amari, Neural Computation 1998, “Natural Gradient Works Efficiently in Learning”) uses the Fisher information matrix F(λ) as the local metric:
nat-grad = F(λ)^{−1} · ∇_λ ELBO.
For exponential-family q with natural parameter λ and sufficient statistic T(z), the Fisher information equals the Hessian of the log-partition function, and the natural gradient simplifies dramatically — in many cases to a difference between the current natural parameter and a closed-form target. The resulting updates are invariant to reparameterization and converge in O(1) to O(10) sweeps where Euclidean SGD takes hundreds.
Doubly stochastic VI. SVI subsamples data; later work (Titsias & Lázaro-Gredilla, ICML 2014, “Doubly Stochastic Variational Bayes for non-Conjugate Inference”) subsamples both data and Monte Carlo samples from q. The combination is the standard recipe behind every modern PPL: minibatch indices selected, reparameterized samples drawn, gradient backpropagated through both sources of noise.
5. Black-box VI and amortized inference
CAVI and SVI both require analytic conditional expectations. Ranganath, Gerrish & Blei (AISTATS 2014, “Black Box Variational Inference”) removed this requirement by writing the ELBO gradient with the score-function (REINFORCE) estimator (Williams 1992):
∇_λ ELBO(λ) = E_{q_λ}[ ∇_λ log q_λ(z) · ( log p(x, z) − log q_λ(z) ) ].
Sampling z from q_λ gives a Monte Carlo estimate. The estimator is unbiased and requires only the ability to sample from q and evaluate its log-density — applicable to discrete latents, where reparameterization fails.
The catch: variance is high. Control variates (subtracting a baseline b that does not bias the estimator) and Rao–Blackwellization (analytically integrating out subsets of z) are essential. Even so, score-function gradients can be one to three orders of magnitude noisier than pathwise gradients for continuous z, motivating reparameterization (next section) wherever continuous variables permit.
Amortized VI. Rather than maintain per-datum variational parameters φ_i, train a single neural network — the inference network or encoder q_φ(z | x) — that outputs φ as a function of x. Inference is then a single forward pass instead of a per-datum optimization. The cost: the amortization gap — the encoder cannot in general match the per-datum optimum, and a global encoder may underfit complex posteriors (Cremer, Li & Duvenaud, ICML 2018, “Inference Suboptimality in Variational Autoencoders”).
6. The reparameterization trick
Kingma & Welling (ICLR 2014, “Auto-Encoding Variational Bayes”) and Rezende, Mohamed & Wierstra (ICML 2014, “Stochastic Backpropagation and Approximate Inference in Deep Generative Models”) independently introduced the reparameterization trick. If q_λ(z) is location-scale — e.g. Gaussian q_λ(z) = N(z; μ_λ, σ_λ²) — write
z = μ_λ + σ_λ ⊙ ε, ε ~ N(0, I).
Now z is a deterministic function of λ and an auxiliary noise variable independent of λ. Gradients flow through the sample:
∇_λ E_{q_λ}[ f(z) ] = E_ε[ ∇_λ f(μ_λ + σ_λ ⊙ ε) ].
This pathwise estimator typically has variance one to two orders of magnitude lower than the score-function estimator for continuous latents — empirically the single most important reason VAEs train at all.
Eligible distributions. Reparameterization works directly for Gaussian, Laplace, Logistic, Gumbel, uniform-on-the-hypercube, and any distribution expressible as g(λ, ε) with ε having a fixed parameterless density. It does not directly cover discrete latents (Categorical, Bernoulli) — addressed in §11 below.
Implicit reparameterization (Figurnov, Mohamed & Mnih, NeurIPS 2018) extends the trick to distributions defined only via tractable CDFs (e.g. Gamma, Beta, von Mises) by differentiating the inverse-CDF map implicitly.
7. Variational autoencoders
A VAE combines amortized VI, the reparameterization trick, and a deep neural-network likelihood. The standard setup:
-
Prior p(z) = N(0, I_d), latent dimension d (typically 2–256 for images, larger for high-resolution generation).
-
Decoder p_θ(x | z): for binary x, factorized Bernoulli; for real-valued x, factorized Gaussian with learned mean (often learned variance too, with care — see “posterior collapse” below). The decoder is a deep network θ.
-
Encoder q_φ(z | x) = N(z; μ_φ(x), diag(σ_φ²(x))). Two heads on a shared trunk emit mean and log-variance.
-
Objective. Per-datum ELBO:
L(θ, φ; x) = E_{q_φ(z|x)}[ log p_θ(x | z) ] − KL( q_φ(z | x) ‖ p(z) ).The KL is closed-form for two diagonal Gaussians:
KL(N(μ, σ²) ‖ N(0, I)) = ½ Σ_j ( μ_j² + σ_j² − log σ_j² − 1 ). -
Training. Sample ε, compute z = μ + σ ⊙ ε, decode, compute reconstruction log-likelihood and KL, sum over minibatch, take Adam step.
Benchmarks. Standard binarized MNIST baselines report a test ELBO of roughly −85 nats (Burda, Grosse & Salakhutdinov, ICLR 2016, “Importance Weighted Autoencoders”). Sub-80 nats requires more expressive posteriors (flows, IWAE, hierarchical).
Posterior collapse. When the decoder is sufficiently flexible (e.g. autoregressive PixelCNN decoder; Bowman et al. CoNLL 2016 for LSTM language VAEs), the optimum drives q_φ(z|x) → p(z) and ignores z entirely — the KL term goes to zero and the decoder explains x without latent information. Mitigations: KL annealing (warm-up from β=0 to β=1 over training), free bits (Kingma et al. NeurIPS 2016 IAF paper — bound KL from below per dimension), δ-VAE (Razavi 2019 — restricted prior), or weakening the decoder.
β-VAE and disentanglement. Higgins et al. (ICLR 2017, “β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework”) introduced a Lagrange-style coefficient β > 1 on the KL term to encourage factorized, disentangled latents. Later work (Locatello et al. ICML 2019, “Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations”) showed unsupervised disentanglement is fundamentally underdetermined without inductive biases or weak supervision.
Variants. InfoVAE (Zhao, Song & Ermon 2017) replaces the standard KL with a mutual-information–maximizing objective; Wasserstein autoencoders (Tolstikhin et al. ICLR 2018) replace KL with a Wasserstein penalty on the aggregate posterior; VQ-VAE (van den Oord, Vinyals & Kavukcuoglu, NeurIPS 2017) replaces continuous z with a discrete codebook learned by straight-through estimation — backbone of subsequent neural speech and image codecs.
Aggregate posterior and the prior gap. Define the aggregate posterior
q_φ(z) = (1/N) Σ_{n=1}^N q_φ(z | x_n).
A trained VAE is well-calibrated when q_φ(z) ≈ p(z); the gap is sometimes called the prior hole (Rosca, Lakshminarayanan & Mohamed 2018) and causes poor unconditional samples even with high ELBO. Fixes include learning the prior — VampPrior (Tomczak & Welling AISTATS 2018) parameterizes p(z) as a mixture of encoded pseudo-inputs — or fitting a second-stage flow on q_φ(z) post hoc (Bauer & Mnih AISTATS 2019).
Reconstruction-loss choice matters. A Gaussian decoder with fixed unit variance reduces the reconstruction term to ½‖x − μ_θ(z)‖², encouraging blurry pixel-averaged outputs. Learning the per-pixel variance, switching to a discretized mixture of logistics (PixelCNN++ — Salimans, Karpathy, Chen & Kingma ICLR 2017), or replacing the pixel-space likelihood with a perceptual / LPIPS loss (Zhang et al. CVPR 2018) all dramatically alter sample fidelity at fixed ELBO. The ELBO is not a faithful measure of sample quality when likelihoods are reweighted.
8. Importance-weighted autoencoder (IWAE)
Burda, Grosse & Salakhutdinov (ICLR 2016) gave a strictly tighter ELBO by averaging K importance-weighted samples inside the log:
L_K(θ, φ; x) = E_{z_1, ..., z_K ~ q_φ}[ log ( (1/K) Σ_{k=1}^K p_θ(x, z_k) / q_φ(z_k | x) ) ].
L_K ≥ L_{K−1} ≥ … ≥ L_1 = ELBO, and L_K → log p(x) as K → ∞. The signal-to-noise ratio of the inference-network gradient nonetheless decreases as 1/√K (Rainforth et al. ICML 2018, “Tighter Variational Bounds are Not Necessarily Better”), so K should be tuned, not just maximized; reweighted wake–sleep (Bornschein & Bengio ICLR 2015) and doubly reparameterized estimators (Tucker et al. ICLR 2019) address this.
9. Conditional and supervised VAEs
The CVAE (Sohn, Lee & Yan, NeurIPS 2015, “Learning Structured Output Representation using Deep Conditional Generative Models”) conditions encoder and decoder on a label y: q_φ(z | x, y) and p_θ(x | z, y), with prior p(z | y). Used for structured prediction, image-to-image, and any conditional generation. The semi-supervised variant (Kingma, Rezende, Mohamed & Welling, NeurIPS 2014, “Semi-supervised Learning with Deep Generative Models”) jointly treats y as latent for unlabeled data — a foundational deep-SSL technique.
10. Hierarchical and structured VAEs
A single Gaussian latent layer is a weak prior. Hierarchical VAEs (Sønderby et al. NeurIPS 2016, “Ladder Variational Autoencoders”) chain L latent layers z_1, …, z_L with a top-down generative model and bottom-up + top-down inference. NVAE (Vahdat & Kautz, NeurIPS 2020) and Very Deep VAE (Child, ICLR 2021) demonstrated that hierarchical VAEs at scale match GAN sample quality on high-resolution faces while retaining likelihood. These designs prefigure the diffusion-as-hierarchical-VI connection (§17).
11. Discrete latents
The reparameterization trick fails for discrete z. Options:
- REINFORCE / score-function with control variates — the default fallback; high variance.
- NVIL (Mnih & Gregor ICML 2014, “Neural Variational Inference and Learning”) — input-dependent baseline neural network trained jointly.
- REBAR (Tucker et al. NeurIPS 2017) and RELAX (Grathwohl et al. ICLR 2018) — control variates that combine the score-function estimator with a continuous reparameterized surrogate; unbiased.
- Gumbel-Softmax / Concrete distribution (Jang, Gu & Poole, ICLR 2017; Maddison, Mnih & Teh, ICLR 2017) — relax a Categorical to a continuous distribution on the simplex parameterized by temperature τ; reparameterizable via Gumbel noise. As τ → 0 recovers the discrete sample. Biased for τ > 0 but low-variance.
- Straight-through estimator (Bengio, Léonard & Courville 2013 tech report) — forward pass uses discrete sample (argmax); backward pass treats it as identity. Biased but empirically effective, used in VQ-VAE.
12. Normalizing flows
A fixed Gaussian q_φ(z | x) caps the expressivity of the encoder. Normalizing flows transform a simple base z_0 through a sequence of invertible, differentiable maps:
z_K = f_K ∘ f_{K−1} ∘ ... ∘ f_1(z_0), z_0 ~ q_0.
Change of variables gives
log q_K(z_K) = log q_0(z_0) − Σ_{k=1}^K log | det J_{f_k}(z_{k−1}) |.
The Jacobian determinant must be tractable for the flow to be useful at scale. Major families:
- Planar and radial flows (Rezende & Mohamed, ICML 2015, “Variational Inference with Normalizing Flows”) — element-wise perturbations; cheap Jacobian but limited expressivity.
- NICE (Dinh, Krueger & Bengio, ICLR 2015) and Real NVP (Dinh, Sohl-Dickstein & Bengio, ICLR 2017) — affine coupling layers split z into two halves (z_a, z_b), pass z_a through unchanged, scale-and-shift z_b conditional on z_a. Jacobian is triangular hence determinant is the product of diagonal entries. Invertible by construction.
- Glow (Kingma & Dhariwal, NeurIPS 2018) — Real NVP plus 1×1 invertible convolutions for channel mixing; the first flow to produce competitive high-resolution image samples.
- Autoregressive flows. Inverse Autoregressive Flow (Kingma et al. NeurIPS 2016) and Masked Autoregressive Flow (Papamakarios, Pavlakou & Murray, NeurIPS 2017) — z_i is an affine function of z_{<i} parameterized by a MADE-style masked network. IAF: fast sampling, slow density evaluation; MAF: fast density evaluation, slow sampling. Choose by use case.
- Spline flows (Durkan, Bekasov, Murray & Papamakarios, NeurIPS 2019, “Neural Spline Flows”) — replace affine coupling with monotonic rational-quadratic splines; substantially more expressive per layer.
- Continuous-time flows. Neural ODEs (Chen, Rubanova, Bettencourt & Duvenaud, NeurIPS 2018) parameterize dz/dt with a neural network and integrate. FFJORD (Grathwohl et al. ICLR 2019, “Free-form Jacobian of Reversible Dynamics”) uses Hutchinson’s trace estimator to avoid the explicit Jacobian determinant, enabling unrestricted architectures.
In VI, flows compose with the encoder: q_0 = N(μ_φ(x), σ_φ²(x)), z_K = flow(z_0), and the encoder log-density is the change-of-variables formula above. The ELBO becomes
E_{q_0}[ log p_θ(x | z_K) + log p(z_K) − log q_0(z_0) + Σ log | det J_{f_k} | ].
Flow-based posteriors close much of the gap between mean-field VI and true posteriors, at the price of K extra forward / Jacobian evaluations.
Sylvester and Householder flows. Van den Berg, Hasenclever, Tomczak & Welling (UAI 2018, “Sylvester Normalizing Flows for Variational Inference”) generalize planar flows from a rank-1 to a rank-M perturbation using Sylvester’s determinant identity, allowing larger expressivity at fixed parameter count. Tomczak & Welling (NeurIPS 2016) used Householder reflections as orthogonal transformations with unit Jacobian determinant — useful as a “free” rotation between expensive flow steps.
Surjective and stochastic flows. Classical flows preserve dimensionality. Nielsen, Jaini, Hoogeboom, Winther & Welling (NeurIPS 2020, “SurVAE Flows”) unify flows, VAEs, and bijective dimension changes under a single framework — encoder steps that throw away information (surjective) appear as a generalized change-of-variables with a stochastic inverse.
13. Stein variational gradient descent
Liu & Wang (NeurIPS 2016, “Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm”) proposed a particle-based VI: maintain n particles {z^(i)}_{i=1}^n and update them with the velocity
φ*(z) = (1/n) Σ_j [ k(z_j, z) ∇_{z_j} log p(z_j | x) + ∇_{z_j} k(z_j, z) ],
where k is a positive-definite kernel (RBF the default). The first term pulls particles toward high-density regions; the second is a repulsion that maintains diversity. SVGD is deterministic, requires no parametric q, and interpolates between MAP (n=1) and full posterior approximation (n → ∞). It pairs well with neural-network parameter posteriors (Bayesian deep nets) but scales as O(n²) per step in kernel evaluations.
14. Connections to other inference methods
- EM algorithm (Dempster, Laird & Rubin, JRSS-B 1977) — the E-step computes the exact posterior q(z) = p(z | x, θ_old) when tractable; the M-step maximizes E_q[log p(x, z; θ)] over θ. Equivalently, EM is coordinate ascent on the ELBO with q unconstrained. Variational EM relaxes the E-step to constrained q ∈ Q.
- Expectation propagation (Minka, UAI 2001, “Expectation Propagation for Approximate Bayesian Inference”) — minimizes KL(p ‖ q) site-by-site by moment matching. Mass-covering rather than mode-seeking. Power EP and α-divergence variants (Minka 2004; Hernández-Lobato, Li, Hernández-Lobato, Bui & Turner, ICML 2016) interpolate between VI (α=0) and EP (α=1).
- Variational message passing (Winn & Bishop, JMLR 2005) — automates CAVI for conjugate-exponential graphical models by passing natural-parameter messages along the factor graph. Implemented in Infer.NET.
- Black-box VI = REINFORCE on the ELBO (§5).
- Laplace approximation — Gaussian q centered at the MAP with covariance the inverse Hessian; a fixed, ELBO-free alternative useful for shallow posteriors.
15. VI vs MCMC — practical trade-off
| Property | Variational inference | MCMC |
|---|---|---|
| Asymptotic exactness | No (limited by Q) | Yes (under detailed balance + ergodicity) |
| Convergence guarantees | Local optimum of ELBO | Geometric ergodicity to true posterior |
| Wall-clock on large N | Fast (minibatch + amortized) | Slow (each step uses full likelihood; HMC needs gradients per step) |
| Uncertainty calibration | Often under-estimated (mode-seeking) | Well-calibrated at convergence |
| Multi-modal posteriors | Captures one mode unless flow / mixture q | Mixes between modes (slowly) |
| GPU / vectorization | Native | Possible with batched HMC (Chains in parallel) |
| Hyperparameter tuning | Learning rate, family choice | Step size, mass matrix, adaptation |
| Diagnostics | ELBO trace, k-hat | R-hat, ESS, divergent transitions |
A typical applied pipeline: fit VI for fast iteration and prototyping; finalize a critical model with NUTS or HMC; or use VI to initialize MCMC chains.
Mode-seeking failure mode. Because KL(q ‖ p) penalizes q(z) > 0 where p(z|x) = 0 much more than the converse, mean-field q with a Gaussian factor will fit one mode of a multi-modal posterior and ignore the others, dramatically underestimating posterior variance and predictive uncertainty.
16. Diagnostics
- ELBO trace. Plot ELBO vs iteration. Plateau is necessary but not sufficient; ELBO can plateau at a local optimum.
- Pareto-smoothed importance sampling (PSIS-LOO). Vehtari, Gelman & Gabry (Statistics and Computing 2017, “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC”) provides leave-one-out predictive density via importance sampling from q to p. The fitted Pareto tail-shape k̂ diagnoses both LOO accuracy and the quality of q as an importance proposal: k̂ < 0.5 is fine, 0.5–0.7 borderline, > 0.7 indicates q is a poor proposal (VI likely biased).
- Yao, Vehtari, Simpson & Gelman (Bayesian Analysis 2018, “Yes, but Did It Work?”) propose VSBC and PSIS-based VI diagnostics specifically.
- Posterior predictive checks. Sample x_new ~ p(x | z), z ~ q(z); compare to held-out data with summary statistics.
- Simulation-based calibration (Talts, Betancourt, Simpson, Vehtari & Gelman 2018) — repeatedly simulate (z*, x*) ~ joint, fit q on x*, check that the rank of z* under q is uniform.
17. Diffusion models as hierarchical VI
Sohl-Dickstein, Weiss, Maheswaranathan & Ganguli (ICML 2015, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics”) and Ho, Jain & Abbeel (NeurIPS 2020, “Denoising Diffusion Probabilistic Models”, DDPM) formulated denoising diffusion as a deep hierarchical VAE with T fixed Gaussian-noise transitions q(x_t | x_{t−1}) (the forward / encoder process is not learned) and learned reverse transitions p_θ(x_{t−1} | x_t) (the decoder). The training objective is the variational bound
L = E_q[ log p_θ(x_0 | x_1) − Σ_{t>1} KL( q(x_{t−1} | x_t, x_0) ‖ p_θ(x_{t−1} | x_t) ) − KL( q(x_T | x_0) ‖ p(x_T) ) ],
simplified by Ho et al. to a noise-prediction MSE loss equivalent up to weighting. The mathematics is exactly that of a T-layer hierarchical VAE with fixed encoder. Score-matching (Song & Ermon NeurIPS 2019, “Generative Modeling by Estimating Gradients of the Data Distribution”) and stochastic differential equations (Song, Sohl-Dickstein, Kingma, Kumar, Ermon & Poole, ICLR 2021, “Score-Based Generative Modeling Through Stochastic Differential Equations”) gave the continuous-time reformulation; Lipman, Chen, Ben-Hamu, Nickel & Le (ICLR 2023, “Flow Matching for Generative Modeling”) generalized further to deterministic conditional flow matching. The unifying VI perspective is now standard pedagogy.
18. Bayesian deep learning
VI also approximates posteriors over neural-network weights w given data D:
- Bayes by Backprop (Blundell, Cornebise, Kavukcuoglu & Wierstra, ICML 2015, “Weight Uncertainty in Neural Networks”) — diagonal Gaussian q(w), reparameterized; ELBO over w.
- MC dropout (Gal & Ghahramani, ICML 2016, “Dropout as a Bayesian Approximation”) — interprets training with dropout + L2 as approximate VI with a Bernoulli q(w); test-time uncertainty by averaging stochastic forward passes.
- Functional VI (Sun, Zhang, Shi & Grosse, ICLR 2019) — variational distribution over functions rather than weights, sidestepping symmetries of weight space.
These methods give calibrated uncertainty for safety-critical applications (autonomous driving, medical imaging) at modest compute overhead — typically 2× to 10× training cost versus point estimates.
19. Applications
- Generative modeling. VAEs for images (DALL·E 1’s discrete VAE prior, 2021), audio (WaveNet-style VAE-VQs), molecular design (Junction Tree VAE — Jin, Barzilay & Jaakkola ICML 2018).
- Topic modeling. Variational LDA (Blei/Ng/Jordan 2003) on document corpora; ProdLDA (Srivastava & Sutton ICLR 2017) replaces the Dirichlet with a logistic-normal for amortized neural inference.
- Recommender systems. Multinomial VAE for collaborative filtering (Liang et al. WWW 2018) — strong baseline on MovieLens / Netflix.
- Computational neuroscience. Population-spike-train VAEs (LFADS — Pandarinath et al. Nature Methods 2018) infer latent neural dynamics; Gershman (Trends in Cognitive Sciences 2014, “Computational and Cognitive Foundations of Categorization”) reviews VI’s role as a normative model of perception.
- Survey nonresponse imputation and missing-data Bayesian models — VI lets practitioners fit multilevel models on millions of records (PyMC, NumPyro).
- Phylogenetics and population genetics. Variational Bayesian phylogenetic inference (Zhang & Matsen IV ICLR 2019).
- Climate and Earth system science. VI for parameter calibration of stochastic climate emulators (Cleary, Garbuno-Iñigo, Lan, Schneider & Stuart, Journal of Computational Physics 2021).
- Causal inference. Deep latent-variable models for treatment-effect estimation under unobserved confounding (CEVAE — Louizos, Shalit, Mooij, Sontag, Zemel & Welling NeurIPS 2017); identifiability caveats (D’Amour, AISTATS 2019).
- Robotics. VAE-based world models for model-based reinforcement learning (Ha & Schmidhuber NeurIPS 2018, “World Models”; Hafner et al. ICLR 2020, Dreamer).
- Particle physics. Surrogate VI for likelihood-free / simulation-based inference at the LHC (Cranmer, Brehmer & Louppe, PNAS 2020 review).
20. Software
| Package | Backend | Strengths |
|---|---|---|
| Stan | C++ (NUTS, ADVI) | Reference HMC implementation; ADVI for VI (Kucukelbir et al. JMLR 2017) |
| PyMC | PyTensor / JAX | Idiomatic Python Bayesian modeling; ADVI, SVGD, full-rank VI |
| NumPyro | JAX | GPU/TPU; reparameterized + autoguide VI; very fast SVI |
| Pyro | PyTorch | Eli Bingham et al. (Uber, 2018); deep generative models, flexible guides |
| TensorFlow Probability | TF / JAX | Reparameterized distributions, flows, JointDistribution API |
| Edward2 | TF | Tran et al. 2018; functional programming style, deep PPL |
| Turing.jl | Julia | HMC + variational, idiomatic in Julia |
ADVI (Automatic Differentiation Variational Inference, Kucukelbir, Tran, Ranganath, Gelman & Blei JMLR 2017) automates VI for any differentiable model: transform constrained latents to ℝ^d via bijections, fit a mean-field or full-rank Gaussian on the transformed space, use the reparameterization trick. Behind Stan’s advi, PyMC’s fit(method="advi"), and NumPyro’s AutoNormal / AutoMultivariateNormal guides.
21. Frontier directions (2024–2026)
- Amortized inference for flow matching and rectified flow. Generalizing diffusion’s hierarchical-VI scaffold; CFM (Tong et al. 2023) and Rectified Flow (Liu, Gong & Liu 2022) replace the noising process with arbitrary couplings.
- Latent diffusion + VAE encoders. Stable Diffusion (Rombach et al. CVPR 2022) trains a VAE on raw images, then diffuses in latent space; the VAE encoder is a continuous variational compressor.
- Variational LLM alignment. Variational view of RLHF / direct preference optimization (Rafailov et al. 2023) — DPO’s closed-form solution corresponds to a KL-regularized objective with the reference model as prior, structurally identical to the prior-KL term of an ELBO.
- Implicit and energy-based VI. Generative implicit models with no tractable q-density; adversarial VI (Mescheder, Nowozin & Geiger ICML 2017).
- Variational inference on neural fields and PDE solvers — uncertainty quantification for scientific machine learning.
- Path-space VI for diffusion-bridge sampling. Recent work on continuous-time stochastic-control formulations of generative diffusion (Berner, Richter & Ullrich 2024) re-interprets sampling as a variational problem over path measures, with applications to molecular sampling and Bayesian inverse problems.
- Variational quantum inference. Quantum-circuit variational ansätze (VQE — Peruzzo et al. Nature Communications 2014) share the ELBO scaffolding with classical VI; recent crossover work uses classical VI to train parameter priors for quantum circuits.
22. Worked example — Gaussian latent model end-to-end
To anchor the abstractions, consider the canonical toy: x ∈ ℝ^D observed, z ∈ ℝ^d latent, with
z ~ N(0, I_d), x | z ~ N(W z + b, σ² I_D).
This is probabilistic PCA (Tipping & Bishop, JRSS-B 1999). The posterior is closed-form Gaussian — useful as a unit test for VI implementations.
Exact posterior. With M = W^⊤ W + σ² I_d,
p(z | x) = N(M^{−1} W^⊤ (x − b), σ² M^{−1}).
Mean-field VI. q(z) = ∏_j N(z_j; m_j, s_j²). CAVI updates iterate m_j and s_j to convergence; because the exact posterior has non-diagonal covariance unless W has orthogonal columns, mean-field under-estimates posterior covariance by ignoring off-diagonal terms. The ELBO at convergence falls short of log p(x) by precisely the KL between the diagonal q and the full-covariance posterior.
Full-rank Gaussian VI. q(z) = N(z; m, S) with S ∈ ℝ^{d×d} positive-definite (parameterized via Cholesky factor L, S = L L^⊤). The optimum is exact: q* = p(z | x). Fitting it via reparameterized SGD recovers the closed-form answer up to optimization noise — a clean sanity check.
Amortized VAE. Replace m, S with neural-net outputs μ_φ(x), σ_φ(x). One forward pass per inference; ELBO close to but not equal to log p(x) because the encoder cannot perfectly fit every x. Comparing the three on a synthetic dataset is the standard pedagogical exercise (Murphy, Probabilistic Machine Learning: Advanced Topics, MIT Press 2023, Ch. 10).
23. Theoretical foundations and convergence
Consistency of VI estimators. For parametric q families with parameter λ ∈ Λ ⊂ ℝ^p, fitting λ̂ on N i.i.d. observations yields an M-estimator. Under regularity (compact Λ, continuity of ELBO, identifiability of the optimal q∈ Q closest to p(z|x)), λ̂ → λ almost surely as N → ∞ (Wang & Blei, JASA 2019, “Frequentist Consistency of Variational Bayes”). Asymptotic normality holds with a sandwich covariance — important for VI-based standard errors, which are otherwise systematically too small.
Rate of convergence. For mean-field VB in well-specified parametric models, contraction of the variational posterior to the truth occurs at the parametric rate 1/√N (Yang, Pati & Bhattacharya, Bernoulli 2020, “α-Variational Inference with Statistical Guarantees”). For non-parametric problems (Gaussian processes, infinite mixtures), rates are slower and depend on prior concentration — see Zhang & Gao (Annals of Statistics 2020).
Where VI provably fails. Two settings demand caution:
- Multimodal posteriors with comparable mode mass. Mean-field VI captures one mode and reports near-zero variance toward the others. Bayesian model averaging, multi-modal MCMC (parallel tempering, nested sampling), or explicit mixture-of-Gaussians q-families are the fix.
- Heavy-tailed posteriors. A Gaussian q on a Student-t or Cauchy-tailed posterior produces vanishingly small importance weights in the tails; PSIS-LOO k̂ statistic flags this immediately.
Identifiability and label-switching. Posteriors over mixture-component parameters are invariant to permutation of component indices, so the true posterior has K! symmetric modes. CAVI and SGD-based VI break the symmetry arbitrarily at initialization and lock in to one mode; this is usually desired (predictive distributions are invariant), but interpretability of per-component parameters requires post-hoc relabeling (Stephens, JRSS-B 2000).
24. Computational and numerical considerations
- Numerical stability of the KL term. Parameterize σ² in log space: emit log σ² from the encoder and exponentiate inside the KL formula. Clipping log σ² to e.g. [−10, +10] avoids NaNs from saturating linear layers.
- Cholesky factor of full-rank Gaussian guides. Parameterize the lower-triangular L with positive diagonal via
softplusorexp; this guarantees positive-definite covariance. - Mixed precision. VAEs and flows train cleanly in bfloat16; KL and Jacobian-determinant computations should remain in float32.
- Random-seed handling. The ε draws inside reparameterization are the only stochastic input; fixing the seed makes ELBO evaluation deterministic at inference time — useful for unit tests and for reproducible benchmark numbers.
- Per-example variance vs minibatch variance. ELBO is a sum over examples; the gradient variance per example dominates when batch size is small. Increasing batch size lowers ELBO gradient variance with diminishing returns — typical sweet spots are 64–256 for image VAEs.
- KL annealing schedules. Linear warm-up (β: 0 → 1 over the first 10–20% of training) is the standard remedy for posterior collapse. Cyclical annealing (Fu et al. NAACL 2019) repeatedly cycles β to maintain pressure on the latent throughout training; consistently outperforms one-shot annealing on language VAEs.
- Memory cost of flows. A K-step flow on a d-dimensional latent requires storing K intermediate activations for backprop — O(K · d · batch) memory. Gradient checkpointing or reversible architectures (Real NVP is exactly reversible) trade compute for memory.
- Vectorized inference over particles in SVGD. The O(n²) kernel-evaluation cost is amortized by batching all n particles into a single forward pass through the log-density network; modern accelerators handle n ∈ [10², 10³] particles trivially.
25. Pedagogical resources
- Blei, Kucukelbir & McAuliffe (JASA 2017, “Variational Inference: A Review for Statisticians”) — the canonical pedagogical entry point; mean-field, CAVI, SVI worked end-to-end.
- Kingma & Welling (Foundations and Trends in Machine Learning 2019, “An Introduction to Variational Autoencoders”) — definitive monograph from the VAE inventors.
- Murphy (Probabilistic Machine Learning: Advanced Topics, MIT Press 2023), chapters 10–11 — modern textbook treatment integrating VI, normalizing flows, and diffusion.
- Papamakarios, Nalisnick, Rezende, Mohamed & Lakshminarayanan (JMLR 2021, “Normalizing Flows for Probabilistic Modeling and Inference”) — comprehensive flow review.
- Bishop (Pattern Recognition and Machine Learning, Springer 2006), chapter 10 — the pre-deep-learning treatment, still definitive on conjugate-exponential CAVI.
- MacKay (Information Theory, Inference, and Learning Algorithms, Cambridge 2003), chapters 33–34 — sets the information-theoretic stage for the ELBO.
26. Glossary
- ELBO — evidence lower bound; surrogate objective maximized in VI.
- CAVI — coordinate ascent variational inference; closed-form updates for mean-field q.
- SVI — stochastic variational inference; minibatch + natural-gradient.
- BBVI — black-box variational inference; REINFORCE-style gradients.
- ADVI — automatic differentiation variational inference; mean-field/full-rank Gaussian guide in transformed space.
- IAF / MAF — inverse / masked autoregressive flow.
- SVGD — Stein variational gradient descent; particle-based VI.
- PSIS-LOO — Pareto-smoothed importance sampling LOO; held-out predictive metric.
- k̂ — Pareto tail-shape parameter; VI quality diagnostic.
- Amortization gap — performance loss from using a shared encoder versus per-datum optimization.