Gaussian Processes — Math Reference
A Tier 2 reference for Gaussian processes (GPs): distributions over functions, kernel construction, exact and approximate inference, scalable variants, the deep-GP / neural-tangent-kernel bridge to deep learning, Bayesian optimization, and the geostatistical kriging tradition. Intended as a working reference for engineers using GPs as surrogate models, BayesOpt acquisitions, calibrated regressors, and theoretical lenses on wide networks.
1. At a glance
A Gaussian process is a distribution over functions rather than over finite-dimensional parameter vectors. The defining property is simple but powerful: any finite collection of function values is jointly Gaussian. Pick any set of input points x_1, …, x_n; then the vector [f(x_1), …, f(x_n)] has a multivariate normal distribution.
A GP is fully specified by two functions:
- A mean function m(x) = E[f(x)] — often taken to be zero after centering.
- A kernel (covariance function) k(x, x’) = Cov(f(x), f(x’)) — encodes assumptions about smoothness, periodicity, length scales, and feature relevance.
Two characteristics define the modern GP toolkit:
- Nonparametric. The model has effectively infinite parameters; complexity grows with data. The kernel encodes a prior over all smooth functions, not a fixed function class.
- Bayesian. Outputs are full posterior distributions with calibrated uncertainty, not point estimates. This is the dominant reason to reach for a GP rather than a random forest or neural net.
GPs underpin a remarkably wide swath of applied methods:
- Bayesian regression with calibrated uncertainty (Rasmussen + Williams 2006).
- Bayesian optimization for hyperparameter tuning, materials discovery, robotics gain auto-tuning (Mockus 1978; Snoek et al. 2012).
- Geostatistics / kriging (Krige 1951; Matheron 1962).
- Computer experiments / surrogate modeling — replacing expensive simulators (CFD, FEA, molecular dynamics) with cheap GP proxies.
- Function-space view of neural networks — wide nets behave as GPs (Neal 1996; Lee 2018; Jacot 2018).
The trade-offs are equally well known: O(n^3) inference cost, sensitivity to kernel choice, brittleness in high dimensions, and difficulty composing arbitrary likelihoods. The bulk of GP research since 2005 attacks these problems via sparse approximations, variational inference, structured kernels, and tensor decompositions.
2. Definition and properties
A real-valued stochastic process {f(x) : x in X} indexed by an input domain X is a Gaussian process if and only if for every finite subset {x_1, …, x_n} of X, the random vector
[f(x_1), f(x_2), …, f(x_n)]
has a multivariate Gaussian distribution. We write f ~ GP(m, k), where:
- m : X → R is the mean function, m(x) = E[f(x)].
- k : X x X → R is the covariance (kernel), k(x, x’) = E[(f(x) - m(x))(f(x’) - m(x’))].
The joint distribution at finite locations is then
[f(x_1), …, f(x_n)]^T ~ N(mu, K),
with mu_i = m(x_i) and K_{ij} = k(x_i, x_j).
Positive semi-definiteness. The kernel k must be symmetric and positive semi-definite (PSD): for every finite set of points {x_1, …, x_n} and every vector c in R^n,
sum_{i,j} c_i c_j k(x_i, x_j) >= 0.
Equivalently, every Gram matrix K it produces is PSD. PSD kernels and GPs are in one-to-one correspondence (Moore-Aronszajn theorem; Aronszajn 1950).
Consistency (Kolmogorov extension theorem). If a family of finite-dimensional Gaussian distributions over all finite point sets satisfies consistency (marginals agree under permutation and restriction), there exists a stochastic process realizing them. PSD kernels satisfy this automatically, so any PSD kernel defines a valid GP.
Sample paths. The smoothness of f drawn from GP(m, k) is dictated by the smoothness of k near the diagonal. The Matern-nu kernel produces sample paths that are ceil(nu) - 1 times mean-square differentiable; the squared-exponential kernel produces analytic (infinitely differentiable) paths.
Stationarity. A kernel k is stationary if k(x, x’) = k(x - x’), depending only on the lag. Isotropic kernels depend only on the distance ||x - x’||. Stationarity buys you the Wiener-Khinchin theorem (kernel and spectral density are Fourier pairs) and is essential for spectral kernels and random Fourier features.
Reproducing kernel Hilbert space (RKHS). Each PSD kernel induces an RKHS H_k of functions for which f → f(x) is bounded. The GP posterior mean for noise-free regression is the minimum-norm interpolant in H_k. This connects GPs to splines, kernel ridge regression, and SVMs (Wahba 1990; Scholkopf + Smola 2002).
3. Standard kernels
The choice of kernel is the principal modeling decision in GP work. Below are the canonical families.
3.1 Squared exponential (RBF, Gaussian)
k_SE(x, x’) = sigma^2 exp(- ||x - x’||^2 / (2 ell^2))
- The smoothest kernel; sample paths are analytic (C-infinity).
- Hyperparameters: signal variance sigma^2 and length scale ell (one global, or one per dimension for ARD).
- Default when the underlying function is genuinely smooth.
- Often too smooth — it underpredicts roughness in real signals and overestimates extrapolation confidence. Matern is almost always a safer default.
3.2 Matern family
k_nu(r) = sigma^2 (2^{1 - nu} / Gamma(nu)) (sqrt(2nu) * r / ell)^nu * K_nu(sqrt(2nu) * r / ell),
where r = ||x - x’||, K_nu is the modified Bessel function of the second kind, and nu > 0 controls smoothness.
Closed forms exist at half-integer nu:
- nu = 1/2 (Ornstein-Uhlenbeck / exponential): k(r) = sigma^2 * exp(-r / ell). Nowhere differentiable. Continuous but rough sample paths.
- nu = 3/2: k(r) = sigma^2 *(1 + sqrt(3)*r/ell) *exp(-sqrt(3)*r/ell). Once mean-square differentiable.
- nu = 5/2: k(r) = sigma^2 (1 + sqrt(5)r/ell + 5r^2/(3ell^2)) *exp(-sqrt(5)*r/ell). Twice mean-square differentiable. Common practical default.
- nu → infinity: Matern reduces to the squared exponential.
Stein (1999) argues Matern with finite nu is almost always preferable to SE for physical data — real signals are rarely analytic.
3.3 Rational quadratic
k_RQ(x, x’) = sigma^2 (1 + ||x - x’||^2 / (2 alpha * ell^2))^{-alpha}
Equivalent to an infinite mixture of squared-exponential kernels with a gamma distribution over length scales (parameter alpha). Captures variation across multiple scales. Reduces to SE as alpha → infinity.
3.4 Periodic (MacKay 1998)
k_per(x, x’) = sigma^2 exp(- 2 sin^2(pi * |x - x’| / p) / ell^2)
Sample paths repeat with period p. Useful for modeling diurnal cycles, seasonal climate signals, periodic mechanical loads. Typically composed (product or sum) with a decaying kernel (Matern or SE) to allow quasi-periodicity.
3.5 Linear
k_lin(x, x’) = sigma^2 * (x - c) . (x’ - c)
Encodes a linear prior over functions. The GP posterior under this kernel is identical to Bayesian linear regression. Often combined with stationary kernels for “linear trend plus residual” modeling.
3.6 Polynomial
k_poly(x, x’) = (x . x’ + c)
Degree-d polynomial features. Rarely used standalone but appears in classical kernel methods (SVMs).
3.7 Automatic Relevance Determination (ARD)
A simple but high-impact extension: replace a single length scale ell with a per-dimension vector (ell_1, …, ell_D):
k_ARD(x, x’) = sigma^2 exp(- 0.5 sum_d (x_d - x’_d)^2 / ell_d^2)
After marginal-likelihood optimization, dimensions with very large ell_d effectively drop out — the kernel becomes insensitive to them. This is an automatic feature-importance mechanism (Neal 1996; MacKay 1992). The trade-off is more hyperparameters to fit (curse of dimensionality on hyperparameter optimization).
3.8 Spectral mixture (Wilson + Adams 2013)
Bochner’s theorem says every stationary kernel is the Fourier transform of a non-negative spectral density. The spectral mixture kernel parameterizes the spectral density as a mixture of Gaussians:
k_SM(tau) = sum_{q=1}^Q w_q prod_d exp(- 2pi^2 tau_d^2 v_{q,d}) cos(2pi* tau_d* mu_{q,d}),
where tau = x - x’. This is in some sense a universal stationary kernel — sufficient Q approximates any stationary covariance — and is widely used for extrapolation tasks (Wilson + Adams 2013 ICML).
3.9 Other notable kernels
- String kernels, graph kernels, tree kernels for structured inputs (Lodhi et al. 2002).
- Diffusion kernels on graphs (Kondor + Lafferty 2002).
- Convolutional kernels for invariance (van der Wilk 2017).
- Tanimoto / molecular fingerprints kernels for chemistry (Ralaivola 2005).
4. Kernel composition
PSD kernels are closed under several operations, allowing compositional prior design:
- Sum: k = k_1 + k_2 is PSD if both k_1 and k_2 are. Models additive contributions (linear trend + periodic + residual).
- Product: k = k_1 * k_2 is PSD. Captures interaction (e.g., RBF over time times RBF over latitude for spatiotemporal data).
- Scaling: k = c * k_1 for c >= 0 is PSD.
- Warping / pullback: if phi : X → Z and k_Z is a PSD kernel on Z, then k_X(x, x’) = k_Z(phi(x), phi(x’)) is PSD. This is how deep kernels work (Wilson 2016): phi is a neural net.
Compositional patterns:
- Quasi-periodic: k_per * k_SE — periodic envelope with slow decay.
- Damped oscillation: k_per * k_Matern(1/2).
- Local linear trend: k_lin + k_Matern.
- Hierarchical scales: k_RQ or sum of k_SE with different ell.
Duvenaud et al. (2013) propose automatic statistician workflows that search over kernel compositions to find structure (trend, periodicity, change-points) in time-series data.
5. GP regression
Given training inputs X = (x_1, …, x_n)^T and noisy outputs y_i = f(x_i) + epsilon_i with epsilon_i ~ N(0, sigma_n^2) i.i.d., assume f ~ GP(0, k) (centered for simplicity).
The joint distribution of training outputs y and test outputs f_at test locations X_ is:
[y] [0] [ K + sigma_n^2 I K_ ] [ ] ~ N( , [ ]) [f_] [0] [ K_^T K_** ]
where K = k(X, X) is n x n, K_= k(X, X_) is n x m, and K_** = k(X_, X_) is m x m.
Conditioning on observed y by the standard Gaussian conditional formula:
f_| X, y, X_ ~ N(mean_, cov_),
with
mean_* = K_^T * (K + sigma_n^2 * I)^{-1} * y, cov_ = K_** - K_^T * (K + sigma_n^2 * I)^{-1} * K_.
Computational structure. Compute the Cholesky factor L of (K + sigma_n^2 * I) once in O(n^3) time and O(n^2) memory. Then each predictive mean and variance is O(n^2) per test point.
Numerical practice:
- Cholesky on (K + sigma_n^2 I + jitter I) with jitter ~ 1e-6 to handle near-singularities.
- Solve L alpha = y, then L^T v = alpha, giving alpha = (K + sigma_n^2 I)^{-1} y.
- Predictive mean = K_^T * alpha; predictive variance = diag(K_**) - sum-of-squares of L^{-1} K_.
Noise-free limit. As sigma_n^2 → 0, the posterior mean interpolates the training data exactly: mean_*(x_i) = y_i. The posterior variance at training points goes to zero.
6. GP classification
For binary classification with labels y_i in {0, 1} (or {-1, +1}), the likelihood is non-Gaussian:
p(y_i | f(x_i)) = sigmoid(y_i f(x_i)) (logit link) or p(y_i | f(x_i)) = Phi(y_i f(x_i)) (probit link)
The posterior p(f | X, y) is no longer Gaussian, and the marginal likelihood has no closed form. Approximations:
- Laplace approximation (MacKay 1992; Williams + Barber 1998). Fit a Gaussian at the posterior mode via Newton’s method.
- Expectation Propagation (EP) (Minka 2001). Iteratively match marginal moments of factor approximations; more accurate than Laplace for skewed posteriors.
- Variational inference (VI) (Hensman et al. 2015). Optimize an ELBO over a Gaussian variational posterior.
For multi-class problems, use softmax likelihoods with similar approximation machinery, or one-vs-rest decompositions.
7. Hyperparameter learning
The kernel hyperparameters theta (length scales, signal variance, noise variance, periods, ARD weights) are typically learned by maximizing the marginal likelihood (a.k.a. evidence) p(y | X, theta).
For Gaussian likelihoods this is closed form:
log p(y | X, theta) = - 0.5 y^T (K_theta + sigma_n^2 I)^{-1} y - 0.5 log |K_theta + sigma_n^2 I| - (n / 2) log(2 pi).
The three terms have intuitive interpretations:
- Data fit: -0.5 y^T K^{-1} * y rewards explaining y.
- Complexity penalty: -0.5 * log|K| penalizes flexible kernels. Encodes automatic Occam’s razor (MacKay 1992) — high-capacity kernels are penalized even with no held-out data.
- Normalization constant.
Optimization. Gradient-based optimizers — L-BFGS, Adam, or trust-region methods — are standard. Analytical gradients are available since the kernel and likelihood are differentiable in theta. The objective is non-convex; multiple restarts from random initializations are routine.
MCMC over hyperparameters. Fully Bayesian treatment integrates over theta with HMC or NUTS. This is essential when the marginal-likelihood surface is flat or multimodal. Pyro, NumPyro, PyMC, and Stan all support GP-with-MCMC workflows.
Cross-validation. For non-Gaussian likelihoods or model selection across kernel families, leave-one-out CV (closed form for Gaussian likelihood — Sundararajan + Keerthi 2001) or k-fold CV are alternatives.
8. Scalability
Naive GP inference is O(n^3) compute and O(n^2) memory. This is intractable beyond roughly n ~ 10,000 points. The literature has produced a rich set of approximations:
8.1 Sparse GP with inducing points
Introduce m << n inducing variables u = f(Z) at pseudo-input locations Z. Approximate the joint distribution by replacing exact f-conditionals with conditionals depending only on u.
- DTC (Deterministic Training Conditional) — Seeger et al. 2003.
- FITC (Fully Independent Training Conditional) — Snelson + Ghahramani 2006. Replaces the training conditional with a fully factorized approximation. O(n * m^2) train, O(m^2) per test point. Pathologically over-confident in regions far from inducing points.
- VFE (Variational Free Energy) — Titsias 2009. ELBO-based; introduces a variational distribution q(u) and lower-bounds the marginal likelihood. The bound penalizes poor inducing-point coverage automatically, fixing the FITC over-confidence problem.
VFE is the default sparse method in modern GP libraries.
8.2 SVGP (Stochastic Variational GP) — Hensman et al. 2013
Extends VFE with mini-batch training:
- Variational posterior q(u) = N(mu_u, S_u) explicitly parameterized.
- ELBO decomposes over data points, allowing stochastic gradient estimation.
- O(m^3 + b * m^2) per step for mini-batch size b.
- Scales to millions of training points.
- Default for large-scale GPs in GPflow and GPyTorch.
Particularly powerful with non-Gaussian likelihoods: SVGP-classification with mini-batch SGD makes GP classification on imagenet-scale data feasible.
8.3 KISS-GP / SKI — Wilson + Nickisch 2015
Structured Kernel Interpolation exploits a grid structure. Place inducing points on a multi-dimensional grid, use local (e.g., cubic) interpolation weights to express data kernels in terms of grid kernels:
K ~= W K_UU W^T,
where K_UU has Kronecker / Toeplitz structure, enabling O(n) matvec via FFT. Combined with conjugate gradients (CG), gives near-linear training. Best when input dimensionality is small (⇐ 5).
SKIP extends to higher dimensions via product kernels.
8.4 Random Fourier features (Rahimi + Recht 2007)
For a stationary kernel k with spectral density p(omega):
k(x - x’) = E_omega ~ p[ exp(i omega^T (x - x’)) ] ~= (1/D) sum_{d=1}^D cos(omega_d^T x + b_d) * cos(omega_d^T x’ + b_d),
with D random frequencies sampled from p. This converts the GP into approximate Bayesian linear regression in D random features — O(n * D^2) training, no Cholesky needed. Works well for moderately smooth kernels (RBF, Matern) but fails for kernels with heavy spectral tails.
8.5 Nystrom approximation
Low-rank factorization K ~= U Lambda U^T using m sampled rows. The sparse Nystrom GP (Williams + Seeger 2001) was historically important; superseded by VFE/SVGP, but variants (e.g., adaptive Nystrom) remain useful.
8.6 Structured / Kronecker kernels
When inputs lie on a Cartesian product structure (spatial-temporal grids, images), product kernels k = k_1 x k_2 x … admit Kronecker factorization:
K = K_1 (x) K_2 (x) … (x) K_D.
Inversion and log-determinant reduce to per-factor operations: O(D * n^{1 + 1/D}) total (Saatci 2011). Combined with missing-data EM, this scales GPs to gigapixel images.
8.7 Iterative methods + conjugate gradients
Rather than Cholesky, use CG to solve (K + sigma_n^2 I) alpha = y in O(k * n^2) for k CG steps. With preconditioning (e.g., pivoted Cholesky preconditioner — Gardner et al. 2018), k stays small. This is the default in GPyTorch, enabling exact GP inference on n ~ 1M with multi-GPU implementations (Wang et al. 2019).
9. Deep and neural GPs
9.1 Deep Gaussian processes (Damianou + Lawrence 2013)
Compose multiple GPs in series:
f(x) = f_L(f_{L-1}( … f_1(x) … ))
with each f_l ~ GP independently. Provides a richer, hierarchical function class than a single GP — can model non-stationarity and abrupt changes.
Inference is hard:
- Variational inference with doubly stochastic deep GPs (Salimbeni + Deisenroth 2017).
- Importance-weighted variational bounds.
- Most modern deep GP work uses inducing points + reparameterized sampling through the layers.
In practice deep GPs are rarely competitive with neural networks on large datasets, but they remain a fertile area for theoretical work and for problems where uncertainty matters more than raw performance.
9.2 NNGP — Neural Network Gaussian Process (Neal 1996; Lee 2018)
Neal (1996) showed that a single-layer neural network with i.i.d. random weights converges, as the hidden width goes to infinity, to a GP whose kernel depends on the activation function. Lee et al. (2018) and Matthews et al. (2018) extended this to deep networks: an infinitely wide untrained deep net is a GP. The kernel can be computed layer by layer via recursive formulas (the NNGP kernel).
9.3 NTK — Neural Tangent Kernel (Jacot et al. 2018)
For a wide neural network trained by gradient descent on a squared-loss objective, the dynamics of predictions on the training set are governed in the infinite-width limit by the neural tangent kernel:
Theta(x, x’) = E[ <grad_theta f(x; theta), grad_theta f(x’; theta)> ],
where the expectation is over the initialization. Predictions evolve as kernel regression with kernel Theta. The NTK is different from the NNGP kernel — NNGP describes the function-space prior at initialization; NTK describes the trained predictor.
This was a major theoretical advance: it bridges deep learning and kernel methods. Practical consequences:
- Closed-form predictions for infinite-width nets (no training needed).
- Explanation of double descent, lazy training, and feature learning regimes (Chizat + Bach 2018; Yang + Hu 2021).
- Diagnostic tool for analyzing finite-width network behavior.
Libraries: Neural Tangents (Google) computes NNGP and NTK kernels for arbitrary architectures.
9.4 Deep kernel learning (Wilson et al. 2016)
A practical hybrid: use a neural network phi(x; w) as a feature extractor, then a GP on the features:
k_DKL(x, x’) = k_base(phi(x; w), phi(x’; w)).
Train w jointly with kernel hyperparameters via marginal likelihood. Combines the inductive biases of deep nets with the uncertainty quantification of GPs.
10. Bayesian optimization (BayesOpt)
Sequentially optimize an expensive-to-evaluate black-box function f : X → R by maintaining a GP surrogate and choosing the next query point to maximize an acquisition function that balances exploration and exploitation.
Generic loop:
- Fit GP to current observations D_t = {(x_i, y_i)}.
- Optimize acquisition function alpha(x; D_t) over X to get next x_{t+1}.
- Evaluate y_{t+1} = f(x_{t+1}) + noise.
- Append, repeat.
10.1 Acquisition functions
- Expected Improvement (EI) — Mockus 1978. EI(x) = E_{f(x)}[max(0, f_best - f(x))] for minimization. Closed form for GP posterior. Default for many practical systems.
- Upper Confidence Bound (UCB) — Srinivas et al. 2010. UCB(x) = mu(x) + beta * sigma(x). Tunable beta controls exploration. Has theoretical regret bounds.
- Probability of Improvement (PI) — Kushner 1964. P(f(x) < f_best). Tends to over-exploit.
- Thompson Sampling — Sample f ~ posterior GP; pick its argmin. Naturally parallelizable for batch BO.
- Knowledge Gradient (KG) — Frazier 2018. Expected value of the best decision after one more observation. Better for noisy / risk-averse settings.
- Entropy Search / Predictive Entropy Search / Max-value Entropy Search (Hennig + Schuler 2012; Hernandez-Lobato 2014; Wang + Jegelka 2017). Information-theoretic; expensive to compute but sample-efficient.
10.2 Practical BayesOpt
- Hyperparameter tuning: scikit-optimize (skopt), GPyOpt, Hyperopt, Optuna (with GP backend), BoTorch + Ax (Meta), Google Vizier (internal + Vertex AI), Spearmint (Snoek et al. 2012 reference impl).
- Chemistry + materials: active learning over molecular libraries with Tanimoto / fingerprint kernels (Griffiths + Hernandez-Lobato 2020); high-throughput experimentation systems (Atomistic ML, Aionics).
- Robotics: SafeOpt (Berkenkamp et al. 2016) for safety-constrained controller tuning; PILCO (Deisenroth + Rasmussen 2011) for sample-efficient policy search.
- A/B testing: adaptive sequential design — Meta uses BoTorch + Ax in production.
- Reinforcement learning hyperparameters — used by DeepMind, OpenAI internally.
10.3 Limitations of vanilla BO
- High-dimensional spaces (typically degrades beyond ~20 dims). Address with random embeddings (REMBO, Wang et al. 2016), additive structure (Add-GP-UCB), or local trust-region methods (TuRBO — Eriksson et al. 2019).
- Mixed (continuous + categorical) inputs require specialized kernels (one-hot or transformer-style; SMAC uses random forests instead).
- Non-stationary objective functions — input warping (Snoek et al. 2014) helps.
- Constraints — constrained EI, SafeOpt, or use a separate GP for each constraint.
11. Geostatistics and kriging
GP regression in the spatial-statistics literature is called kriging, named after South African mining engineer D. G. Krige whose master’s thesis (1951) proposed empirical regression-based ore-grade estimation. Georges Matheron (1962) formalized the theory at the Ecole des Mines de Paris, coining “geostatistics” and “kriging” and developing the variogram framework.
11.1 Variants
- Simple kriging: known mean, equivalent to zero-mean GP regression with known kernel.
- Ordinary kriging: unknown constant mean, marginalized analytically.
- Universal kriging: unknown mean with regression structure m(x) = sum beta_j * phi_j(x).
- Co-kriging: multiple correlated variables (multi-output GP).
- Indicator kriging: for categorical / threshold-exceedance probabilities.
- Disjunctive kriging: nonlinear functions of the field.
11.2 Variogram
Instead of a kernel, classical geostatistics works with the variogram:
gamma(h) = 0.5 * E[ (Z(x) - Z(x + h))^2 ].
For a second-order stationary process this is related to the covariance by gamma(h) = C(0) - C(h), so kernel and variogram are equivalent descriptors. The variogram is more robust to non-stationarity (it does not require knowing the mean) and is the historical workhorse of mining and petroleum geostatistics.
Variogram models: spherical, exponential, Gaussian, Matern, nugget effect (discontinuity at h = 0 representing measurement noise and microscale variability).
11.3 Software
- gstat (R) — Pebesma 2004, the standard R geostatistics package.
- PyKrige (Python) — Matheron-style kriging in Python.
- GSLIB — Fortran-based, the classical reference codebase (Deutsch + Journel 1992).
- SGeMS — open-source geostatistical toolkit.
- Leapfrog Geo (Seequent / Bentley) — commercial 3D geological modeling with implicit surfaces + kriging.
- Petrel (Schlumberger) — commercial petroleum geomodeling.
- GeostatsPy — Python ports of GSLIB.
11.4 Applications
- Mine planning and reserve estimation (the original use case).
- Petroleum reservoir characterization.
- Hydrology (water-table interpolation, contaminant plume mapping).
- Soil science and agronomy (precision agriculture).
- Air-quality interpolation (ozone, NO_2, PM_2.5).
- Climate reanalysis (kriging surface temperatures from sparse station data).
- Public health (disease prevalence mapping).
12. Software
Mature GP libraries as of 2026:
Python
- GPyTorch (Cornell; Gardner + Pleiss + Bindel + Wilson, since 2018). Built on PyTorch. Exact GP inference via CG + Lanczos on GPUs; SKI, SKIP, deep kernel learning, multi-output, variational. Fastest for large-scale exact and approximate GPs. Default for most modern research.
- BoTorch (Meta, since 2019). Bayesian optimization on top of GPyTorch. Differentiable acquisition functions, batch BO, multi-objective (qEHVI), constrained BO. Pairs with Ax (Meta’s experimentation platform) for production BO.
- GPflow (since 2016; originally James Hensman, ST John, et al.). Built on TensorFlow 2. Strong SVGP / variational implementations, multi-output GPs, deep GPs via GPflux.
- scikit-learn GaussianProcessRegressor / GaussianProcessClassifier. Pure-Python, dense Cholesky. Good for small data and pedagogy; do not use beyond ~1000 points.
- Pyro / NumPyro Gaussian Processes (Uber, then Linux Foundation). PyTorch / JAX-based probabilistic programming with built-in GP modules; integrates with NUTS / SVI.
- PyMC (since 2009). GP modules with HMC / NUTS sampling; ergonomic API.
- Stan + brms / cmdstanpy — full-Bayes GPs via NUTS; reference-grade MCMC.
- GPy (Sheffield; predecessor of GPflow). Still used in some pipelines, no longer actively developed.
- Neural Tangents (Google JAX). NNGP + NTK kernels for arbitrary networks.
Julia
- AbstractGPs.jl — common interface for GP libraries.
- KernelFunctions.jl — kernel construction.
- Stheno.jl — GP modeling with multi-output and process algebra.
- Turing.jl — probabilistic programming with GP integration.
R
- gstat — geostatistics (kriging, variograms).
- GPfit, kernlab, mlegp — GP regression.
- DiceKriging + DiceOptim — kriging-based BO for computer experiments.
Specialized
- GPstuff (Aalto / Vehtari). MATLAB / Octave; strong on EP and Laplace methods.
- MOGAPS — multi-output GP for spatial-temporal data.
13. Comparison vs other methods
GP vs random forest
- GP: smooth interpolation, calibrated uncertainty, kernel encodes prior, expensive (O(n^3)).
- RF: handles mixed features and large n trivially, no calibrated uncertainty out of the box (conformal or quantile RF needed), no smoothness.
- Use GP when n is small (< 10K) and uncertainty matters. Use RF (or gradient boosting — XGBoost, LightGBM, CatBoost) for medium-to-large tabular data without uncertainty needs.
GP vs neural networks
- GP: small data, calibrated uncertainty, interpretable kernel, exact posterior conditional on hyperparameters.
- NN: large data, learn features, scale to high dimensions, point estimates (Bayesian NN possible but expensive).
- Hybrid: deep kernel learning (GP head on NN features); NTK / NNGP for theoretical analysis.
GP vs Bayesian neural networks
- GP gives exact posterior for Gaussian likelihood; BNN needs VI / MCMC / Laplace approximation.
- GP scales worse in n; BNN scales worse in parameters.
- For wide architectures the two are connected via NTK / NNGP.
GP vs splines / kernel ridge regression
- A spline is the posterior mean of a GP with a specific kernel (e.g., natural cubic spline = thin-plate kernel posterior).
- Kernel ridge regression is the GP posterior mean with no uncertainty quantification.
- GP generalizes splines: any PSD kernel, plus uncertainty, plus hyperparameter learning.
GP vs polynomial chaos / surrogate models
- Polynomial chaos expansions (Wiener 1938; Xiu + Karniadakis 2002) — global polynomial basis with spectral coefficients. Best for smooth functions of low-dim inputs with known input distributions.
- GP — local kernel basis, more flexible, handles arbitrary smoothness, uncertainty included.
- In computer experiments (CFD, FEA surrogates), GP is the dominant choice for moderate-dim, expensive evaluations.
14. Applications
14.1 Hyperparameter optimization
- BoTorch + Ax (Meta) — production BO inside Meta and external products.
- Google Vizier — internal hyperparameter tuning service, now Vertex AI Vizier.
- Spearmint (Snoek et al. 2012) — historic reference impl that popularized BO for ML.
- scikit-optimize, GPyOpt, Hyperopt with GP backend, Optuna.
14.2 Robotics + control
- PILCO (Deisenroth + Rasmussen 2011) — sample-efficient model-based RL with GP dynamics.
- GP-MPC — Model Predictive Control with GP error models on residual dynamics; common in autonomous driving and quadrotor control (Hewing et al. 2019).
- SafeOpt (Berkenkamp et al. 2016) — Bayesian optimization with safety constraints, used for autotuning impedance controllers.
- Cartesian impedance gain auto-tuning, friction model identification, manipulation skill learning.
14.3 Computer experiments / surrogate modeling
- CFD and FEA simulations are expensive (hours per run). A GP surrogate trained on a Latin hypercube design enables design optimization, uncertainty quantification, and sensitivity analysis (Sobol indices via GP).
- Aerospace: airfoil shape optimization at NASA, Airbus.
- Energy: gas-turbine combustor optimization at GE, Rolls-Royce.
- Building design: HVAC layout, daylighting (Lawrence Berkeley National Laboratory).
- Climate: emulators of Earth-system models (UKESM, CESM) — climate scientists fit GPs to a small ensemble of expensive model runs.
14.4 Geospatial + environmental science
- Soil-property mapping (SoilGrids).
- Precipitation downscaling, sea-surface temperature interpolation.
- Air-quality monitoring (US EPA ozone, NO_2 surfaces).
- Public-health disease mapping (PrevMap).
14.5 Finance
- Yield-curve fitting with GP priors (Cousin et al. 2016).
- Option pricing surrogate models for Monte Carlo accelerators.
- Volatility surface smoothing.
- Credit risk + macroeconomic factor modeling.
14.6 Drug and materials discovery
- Bayesian active learning over molecular fingerprint / SMILES libraries (Griffiths + Hernandez-Lobato 2020).
- High-throughput experimentation: each wet-lab run is expensive, BO is natural.
- Atomistic GP potentials (FLARE, GAP — Bartok et al. 2010) for accelerated MD.
- Insilico Medicine, Atomwise, Isomorphic Labs use GP-based active learning in pipelines.
14.7 Clinical trials + adaptive design
- Sequential dose-finding (continuous reassessment method, Bayesian).
- Adaptive randomization (e.g., Berry’s group-sequential designs).
- Platform trials (I-SPY 2, BATTLE for cancer).
14.8 Time series
- GP with composite kernel (linear + periodic + Matern) captures trend + seasonality + residual. The original Mauna Loa CO_2 example in Rasmussen + Williams (2006) is the canonical demonstration.
- State-space representation: a GP with Matern kernel has an exact stochastic differential equation (SDE) representation; combine with Kalman filtering for O(n) inference (Hartikainen + Sarkka 2010).
14.9 Function-space theory of deep learning
- NNGP + NTK explain wide-network behavior.
- Generalization bounds via RKHS norm of NTK.
- Lazy vs feature-learning regimes (Chizat + Bach 2018).
- Tensor programs (Yang 2019) extending NTK to more architectures.
15. Common pitfalls
- Defaulting to RBF. RBF gives unrealistically smooth posteriors; switch to Matern-5/2 unless you genuinely believe the function is analytic. Stein (1999): “use Matern.”
- Failing to standardize inputs. Length scales are dimensional; if x_1 is in meters and x_2 in millimeters, ARD will struggle. Always rescale to zero mean and unit variance (or to [0, 1]) before fitting.
- Overfitting hyperparameters. With n = 50 points and a kernel having 10 hyperparameters, marginal-likelihood optimization is brittle. Use priors over hyperparameters (light gamma / log-normal priors on length scales are standard) or fully Bayesian treatment via HMC.
- Numerical instability in Cholesky. Add jitter ~ 1e-6 (in standardized output scale) to the diagonal. If problems persist, use double precision and consider reparameterizing.
- Naive GP on > 10K points. Cholesky blows up. Use SVGP, KISS-GP, or CG-based methods.
- BO in high dimensions. Beyond ~20 dimensions a single GP loses calibration. Use TuRBO (Eriksson 2019), additive GP-UCB (Kandasamy 2015), random embeddings (REMBO), or random forests (SMAC) instead.
- Ignoring noise variance. Fitting sigma_n^2 = 0 (“noise-free GP”) can give numerically singular matrices and dangerously overconfident predictions. Always fit at least a small noise term unless interpolating known measurements exactly.
- Wrong posterior variance for derivative outputs. Posterior variance of f’ is not the derivative of posterior variance of f. Use proper kernel derivatives if you care about gradient uncertainty (relevant in GP-MPC).
- Forgetting the prior mean. A zero-mean GP biases extrapolation to zero. Either subtract a fitted linear trend (universal-kriging style) or use a non-zero mean function.
- Stationarity assumed where it does not hold. Real signals are often non-stationary (smooth in some regions, rough elsewhere). Use deep GPs, input warping, or local GPs.
- Mis-specified likelihood for classification. Mixing Laplace-approximated GPC with EI for BayesOpt over discrete outcomes can give weird acquisitions; use posterior samples or specialized acquisitions (e.g., information-based).
- Treating posterior variance as a real-world confidence interval. It is the model’s uncertainty under its prior. If the prior is wrong, the intervals are wrong. Always sanity-check on held-out data and compare to alternative models.
16. Cross-references
- bayesian-inference — broader Bayesian framework; GP is one prior family.
- probability-fundamentals — multivariate Gaussian, conditioning.
- probability-distributions — Gaussian properties used throughout.
- linear-algebra-essentials — Cholesky factorization, conditioning, positive definiteness.
- numerical-linear-algebra — CG, low-rank approximations, preconditioning relevant to scalable GPs.
- mcmc-sampling — HMC / NUTS for fully Bayesian GPs.
- hypothesis-testing-mle — marginal likelihood as a model-selection criterion.
- convex-optimization — kernel hyperparameter optimization (L-BFGS).
- information-theory — entropy-based acquisition functions in BO.
- _index — math reference root.
- transformer-architecture — NTK connection to deep models.
- control-algorithms — GP-MPC, PILCO, SafeOpt in robotic control.
17. Citations
Foundational books
- Rasmussen, C. E. + Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. The canonical reference; free PDF at gaussianprocess.org.
- MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. Chapter 45 on Gaussian processes.
- Stein, M. L. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.
- Cressie, N. Statistics for Spatial Data. Wiley, 1993.
Kernels and theory
- Aronszajn, N. Theory of reproducing kernels. Trans. AMS, 1950.
- Neal, R. M. Bayesian Learning for Neural Networks. PhD thesis, Toronto, 1996.
- MacKay, D. J. C. Bayesian interpolation. Neural Computation, 1992.
- Wilson, A. G. + Adams, R. P. Gaussian process kernels for pattern discovery and extrapolation. ICML, 2013.
- Duvenaud, D. et al. Structure discovery in nonparametric regression through compositional kernel search. ICML, 2013.
Sparse and variational GPs
- Seeger, M. et al. Fast forward selection to speed up sparse GP regression. AISTATS, 2003.
- Snelson, E. + Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs (FITC). NIPS, 2006.
- Titsias, M. Variational learning of inducing variables in sparse Gaussian processes (VFE). AISTATS, 2009.
- Hensman, J. + Fusi, N. + Lawrence, N. D. Gaussian processes for big data (SVGP). UAI, 2013.
- Hensman, J. + Matthews, A. G. + Ghahramani, Z. Scalable variational Gaussian process classification. AISTATS, 2015.
- Wilson, A. G. + Nickisch, H. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). ICML, 2015.
- Gardner, J. R. et al. GPyTorch: blackbox matrix-matrix Gaussian process inference with GPU acceleration. NeurIPS, 2018.
- Wang, K. A. et al. Exact Gaussian processes on a million data points. NeurIPS, 2019.
Approximate inference
- Williams, C. K. I. + Barber, D. Bayesian classification with Gaussian processes. PAMI, 1998.
- Minka, T. Expectation propagation for approximate Bayesian inference. UAI, 2001.
- Rahimi, A. + Recht, B. Random features for large-scale kernel machines. NIPS, 2007.
- Williams, C. K. I. + Seeger, M. Using the Nystrom method to speed up kernel machines. NIPS, 2001.
Deep and neural GPs
- Damianou, A. + Lawrence, N. D. Deep Gaussian processes. AISTATS, 2013.
- Salimbeni, H. + Deisenroth, M. Doubly stochastic variational inference for deep Gaussian processes. NeurIPS, 2017.
- Wilson, A. G. et al. Deep kernel learning. AISTATS, 2016.
- Lee, J. et al. Deep neural networks as Gaussian processes. ICLR, 2018.
- Matthews, A. G. et al. Gaussian process behaviour in wide deep neural networks. ICLR, 2018.
- Jacot, A. + Gabriel, F. + Hongler, C. Neural tangent kernel: convergence and generalization in neural networks. NeurIPS, 2018.
- Yang, G. Tensor programs I. arXiv, 2019.
- Chizat, L. + Bach, F. On the global convergence of gradient descent for over-parameterized models. NeurIPS, 2018.
Bayesian optimization
- Mockus, J. On Bayesian methods for seeking the extremum. Toward Global Optimization, 1978.
- Kushner, H. A new method of locating the maximum of an arbitrary multipeak curve. J. Basic Engineering, 1964.
- Srinivas, N. et al. Gaussian process optimization in the bandit setting: no regret and experimental design (GP-UCB). ICML, 2010.
- Snoek, J. + Larochelle, H. + Adams, R. P. Practical Bayesian optimization of machine learning algorithms. NeurIPS, 2012.
- Hennig, P. + Schuler, C. J. Entropy search for information-efficient global optimization. JMLR, 2012.
- Hernandez-Lobato, J. M. et al. Predictive entropy search for efficient global optimization. NeurIPS, 2014.
- Frazier, P. I. A tutorial on Bayesian optimization. arXiv, 2018.
- Eriksson, D. et al. Scalable global optimization via local Bayesian optimization (TuRBO). NeurIPS, 2019.
- Berkenkamp, F. + Schoellig, A. P. + Krause, A. Safe controller optimization for quadrotors with Gaussian processes (SafeOpt). ICRA, 2016.
- Deisenroth, M. + Rasmussen, C. E. PILCO: a model-based and data-efficient approach to policy search. ICML, 2011.
Geostatistics
- Krige, D. G. A statistical approach to some basic mine valuation problems. J. Chem. Metall. Min. Soc. South Africa, 1951.
- Matheron, G. Traite de Geostatistique Appliquee. Editions Technip, 1962.
- Deutsch, C. V. + Journel, A. G. GSLIB: Geostatistical Software Library and Users Guide. Oxford, 1992.
- Chiles, J.-P. + Delfiner, P. Geostatistics: Modeling Spatial Uncertainty. Wiley, 2012.
Time series / state-space
- Hartikainen, J. + Sarkka, S. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. MLSP, 2010.
- Sarkka, S. + Solin, A. Applied Stochastic Differential Equations. Cambridge, 2019.
Software documentation
- GPyTorch — gpytorch.ai.
- GPflow — gpflow.github.io.
- BoTorch — botorch.org.
- Pyro Gaussian Processes — pyro.ai.
- PyMC — pymc.io.
- Stan — mc-stan.org.
- Neural Tangents — github.com/google/neural-tangents.
- scikit-learn — scikit-learn.org/stable/modules/gaussian_process.html.
- AbstractGPs.jl — github.com/JuliaGaussianProcesses/AbstractGPs.jl.
- gstat (R) — cran.r-project.org/package=gstat.
- PyKrige — github.com/GeoStat-Framework/PyKrige.