Statistical Distributions Catalog (Extended)

Encyclopedic catalogue of probability distributions, beyond the introductory zoo. Each entry: support, parameters, density/PMF, moments where compact, primary historical attribution, canonical applications, and software locator. SI units used where physical; otherwise dimensionless.

This note extends prob-distribution-zoo with the heavier-tailed, multivariate, directional, extreme-value, Bayesian-nonparametric, and survival families that appear in modern statistical practice.

1. Continuous Univariate — Location-Scale + Exponential Family

1.1 Normal / Gaussian

Support ℝ. f(x) = (2π σ²)^(−1/2) exp(−(x−μ)²/(2σ²)). Gauss 1809 (Theoria Motus); de Moivre 1733 (binomial approximation); Laplace independently. Mean μ, variance σ², skew 0, kurtosis 3 (excess 0). Universal default, justified by the central limit theorem (CLT). scipy.stats.norm, R dnorm.

1.2 Log-normal

Support (0, ∞). Y = exp(X), X ∼ N(μ, σ²). Galton (1879) and McAlister. Mean e^(μ+σ²/2), variance (e^σ² − 1) e^(2μ+σ²). Multiplicative CLT. Particle sizes, income distributions, biological growth, geometric Brownian motion in finance. scipy.stats.lognorm.

1.3 Half-normal, Folded Normal, Truncated Normal

  • Half-normal: |X|, X ∼ N(0, σ²)
  • Folded: |X|, X ∼ N(μ, σ²) — μ ≠ 0
  • Truncated: N conditioned on a < X < b; closed-form moments via Mills ratio φ/Φ

Used as priors (half-normal scale prior in Bayesian HLMs; Stan / PyMC convention), measurement-error truncation.
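The Mills-ratio form of the truncated-normal mean can be sketched with the standard library alone. The function names below are illustrative, not taken from any package; the formula used is E[X | a < X < b] = μ + σ (φ(α) − φ(β)) / (Φ(β) − Φ(α)) with α = (a−μ)/σ and β = (b−μ)/σ.

```python
import math

def phi(z):
    """Standard normal density; evaluates to 0.0 at +/- infinity."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_mean(mu, sigma, a=-math.inf, b=math.inf):
    """E[X | a < X < b] for X ~ N(mu, sigma^2), using the ratio phi / Phi."""
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    return mu + sigma * (phi(alpha) - phi(beta)) / (Phi(beta) - Phi(alpha))

# Half-normal as a special case: N(0, sigma^2) truncated to (0, inf)
half_normal_mean = truncated_normal_mean(0.0, 2.0, a=0.0)
```

For the half-normal case this reduces to σ√(2/π). The sketch does not guard against truncation regions so far in the tail that Φ(β) − Φ(α) underflows.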

1.4 Inverse Gaussian (Wald)

Support (0, ∞). First passage time of Brownian motion with drift to fixed level. Schrödinger 1915; Tweedie 1957. Mean μ, variance μ³/λ. Reaction times, failure times.

1.5 Generalized Inverse Gaussian (GIG)

Mixing distribution for generalized hyperbolic (§1.17); Bessel-K density. Barndorff-Nielsen + Halgreen 1977.

1.6 Beta

Support [0, 1]. f(x) = x^(α−1)(1−x)^(β−1) / B(α, β). Pearson system Type I (Pearson 1895). Conjugate prior for Bernoulli/Binomial. Mean α/(α+β). Jeffreys prior Beta(1/2, 1/2). scipy.stats.beta.

1.7 Gamma

Support (0, ∞). f(x) = x^(k−1) e^(−x/θ) / (θ^k Γ(k)). Shape k, scale θ. Mean kθ, variance kθ². Waiting time for k Poisson events. Conjugate prior for Poisson rate (shape-rate parameterisation).
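As a small worked example of the conjugacy: with a Gamma(shape a, rate b) prior on a Poisson rate and observed counts x_1, …, x_n, the posterior is Gamma(a + ∑x_i, b + n). The helper names below are illustrative only.

```python
def gamma_poisson_update(shape, rate, counts):
    """Posterior (shape, rate) of a Poisson rate under a Gamma(shape, rate) prior."""
    return shape + sum(counts), rate + len(counts)

def gamma_mean_var(shape, rate):
    """Mean and variance in the shape-rate form (scale theta = 1 / rate)."""
    return shape / rate, shape / rate ** 2

# Prior Gamma(2, 1), then three observed counts
post_shape, post_rate = gamma_poisson_update(2.0, 1.0, [3, 5, 4])
post_mean, post_var = gamma_mean_var(post_shape, post_rate)
```

Note the parameterisation switch: the density above is written with scale θ, while the conjugate update is cleanest in the rate form b = 1/θ.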

1.8 Inverse-Gamma

X ∼ Gamma → 1/X ∼ Inverse-Gamma. Conjugate for normal variance (with known mean) in classical Bayesian — though half-Cauchy / half-normal scale priors are preferred today (Gelman 2006).

1.9 Generalized Gamma (Stacy 1962)

Three-parameter family containing Gamma and Weibull as special cases and the log-normal as a limit. Useful in survival analysis when the hazard shape is uncertain.

1.10 Erlang

Special Gamma with integer shape k. Sum of k i.i.d. Exp(λ). Telephone-traffic / queueing origin (Erlang at Copenhagen Telephone 1909–1929).

1.11 Chi-squared (χ²)

Sum of k squared standard normals. Helmert 1875, popularised Pearson 1900 (goodness-of-fit). DoF k; mean k, variance 2k. Test statistics: Pearson χ², LR test asymptotic.

1.12 Chi (χ)

√χ². k-DoF magnitude of k-vector of i.i.d. N(0,1). χ_2 = Rayleigh, χ_3 = Maxwell.

1.13 Student’s t

Gosset 1908 (pen-name “Student”; Guinness brewery confidentiality). k DoF. Heavier tails than normal; → normal as k → ∞. Test statistic for small-sample mean inference. scipy.stats.t.

1.14 F (Fisher–Snedecor)

Ratio of two scaled χ². Fisher (1924) + Snedecor (1934). ANOVA, regression F-tests.

1.15 Cauchy

Support ℝ. f(x) = (πγ(1 + ((x−x₀)/γ)²))^(−1). No mean, no variance (pathological). Lorentz/Cauchy 1853; resonance spectroscopy. Stable index α = 1.

1.16 Lévy α-stable

Stable family parameterised by (α, β, γ, δ): characteristic function ψ(t) = exp(iδt − γ^α |t|^α (1 − iβ sign(t) Φ(α, t))), where Φ(α, t) = tan(πα/2) for α ≠ 1 and Φ(1, t) = −(2/π) log|t|. α ∈ (0, 2]. α = 2 ⇒ Normal; α = 1, β = 0 ⇒ Cauchy; α = 1/2, β = 1 ⇒ Lévy distribution. Lévy 1925; Mandelbrot 1963 (cotton prices). Infinite variance for α < 2. Heavy-tailed finance, anomalous diffusion. R: stable; Python: scipy.stats.levy_stable.

1.17 Generalized Hyperbolic

Mean-variance mixture of Normal with GIG mixing. Barndorff-Nielsen 1977. Subfamilies: NIG (Normal Inverse Gaussian), Variance Gamma, Hyperbolic, Skew-t. Finance returns.

1.18 NIG (Normal Inverse Gaussian)

Special case of GH; popular in option pricing (Barndorff-Nielsen). R: GeneralizedHyperbolic.

1.19 Variance Gamma

Madan + Seneta 1990. Subordinated Brownian motion with Gamma time change. The CGMY model (Carr-Geman-Madan-Yor 2002) extends it to a four-parameter Lévy process.

1.20 Pareto (Type I, II, III, IV; generalized Pareto)

Pareto 1896 (income, the 80/20 rule). Type I: f(x) = α x_m^α / x^(α+1), x ≥ x_m. Type II = Lomax (location shift). Power-law tail. Mandelbrot, Newman 2005 reviews power laws.

1.21 Generalized Pareto Distribution (GPD)

Pickands 1975. CDF F(x) = 1 − (1 + ξx/σ)^(−1/ξ). Threshold exceedance limit in extreme value theory (POT method). Three regimes:

  • ξ > 0: heavy tail (Pareto)
  • ξ = 0: exponential tail (Gumbel domain)
  • ξ < 0: bounded tail (reverse-Weibull)
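A minimal sketch of the GPD CDF and its inverse, covering all three regimes listed above. The ξ = 0 branch uses the exponential limit, and for ξ < 0 the support ends at −σ/ξ. Function names are illustrative.

```python
import math

def gpd_cdf(x, xi, sigma):
    """F(x) = 1 - (1 + xi*x/sigma)^(-1/xi), with the exponential limit at xi = 0."""
    if x <= 0.0:
        return 0.0
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-x / sigma)
    z = 1.0 + xi * x / sigma
    if z <= 0.0:
        # Only reachable for xi < 0: x lies beyond the upper endpoint -sigma/xi
        return 1.0
    return 1.0 - z ** (-1.0 / xi)

def gpd_quantile(p, xi, sigma):
    """Inverse of gpd_cdf for 0 <= p < 1."""
    if abs(xi) < 1e-12:
        return -sigma * math.log1p(-p)
    return sigma / xi * ((1.0 - p) ** (-xi) - 1.0)

# 90% quantile of the exceedance distribution in each regime
q_heavy = gpd_quantile(0.9, 0.3, 2.0)
q_exp = gpd_quantile(0.9, 0.0, 2.0)
q_bounded = gpd_quantile(0.9, -0.3, 2.0)
```

Here x is the excess over the threshold, so x ≥ 0 throughout.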

1.22 Generalized Extreme Value (GEV)

Jenkinson 1955; unifies Gumbel + Fréchet + Weibull. CDF F(x) = exp{−[1 + ξ(x−μ)/σ]^(−1/ξ)} for ξ ≠ 0; F(x) = exp{−exp[−(x−μ)/σ]} for ξ = 0. Block-maxima limit (Fisher-Tippett-Gnedenko theorem). Coles 2001 textbook.

1.23 Gumbel

GEV with ξ = 0. Maxima of light-tailed RVs. Flood return levels, NLP softmax sampling (Gumbel-Max trick).
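The Gumbel-Max trick can be illustrated in a few lines: adding independent standard Gumbel noise to each logit and taking the argmax yields a draw from the softmax distribution over those logits. This is a sketch using only the standard library; the function names are illustrative.

```python
import math
import random

def sample_gumbel(rng):
    """Inverse-CDF draw from the standard Gumbel, F(x) = exp(-exp(-x))."""
    u = rng.random()
    while u == 0.0:          # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max_sample(logits, rng):
    """Index of the largest Gumbel-perturbed logit."""
    noisy = [l + sample_gumbel(rng) for l in logits]
    return max(range(len(logits)), key=noisy.__getitem__)

def softmax(logits):
    """Reference probabilities the sampler should reproduce."""
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(0)
draws = [gumbel_max_sample([0.0, 1.0, 2.0], rng) for _ in range(5)]
```

Over many draws the empirical class frequencies should approach softmax(logits).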

1.24 Fréchet

GEV with ξ > 0. Maxima of heavy-tailed RVs.

1.25 Weibull (reversed Weibull = GEV ξ < 0)

Waloddi Weibull 1951. Survival: F(t) = 1 − exp(−(t/η)^β). Shape β:

  • β < 1 decreasing hazard (infant mortality)
  • β = 1 exponential
  • β > 1 increasing hazard (wear-out)

Wind speed, material fatigue.
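The three hazard regimes follow directly from the hazard function h(t) = f(t)/S(t) = (β/η)(t/η)^(β−1), obtained from the CDF above. A short sketch with illustrative names:

```python
import math

def weibull_survival(t, beta, eta):
    """S(t) = 1 - F(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """h(t) = (beta/eta) * (t/eta)^(beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

# Hazard at t = 1 and t = 2 for each regime (eta = 2)
for beta in (0.5, 1.0, 2.0):
    print(beta, weibull_hazard(1.0, beta, 2.0), weibull_hazard(2.0, beta, 2.0))
```

With β = 1 the hazard is the constant 1/η, which recovers the exponential distribution.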

1.26 Burr (Type XII) and Lomax

Burr 1942. f(x) ∝ ckx^(c−1)/(1+x^c)^(k+1). Pareto generalisation. Singh-Maddala = Burr Type XII.

1.27 Logistic + Generalized Logistic

Symmetric, similar to normal but heavier tails. CDF F(x) = 1/(1+exp(−(x−μ)/s)). Logit link in GLM. Hosking 1990 generalized logistic in EVT.

1.28 Hyperbolic Secant / Half-Logistic

Niche.

1.29 Laplace + Asymmetric Laplace

f(x) = (1/2b) exp(−|x−μ|/b). Laplace 1774. Heavier-tailed than normal. L1 regression / median estimation MLE. Asymmetric Laplace = quantile regression likelihood (Yu + Moyeed 2001).

1.30 Hypoexponential + Hyperexponential

Built from exponential phases. Hypoexponential = sum of independent Exp with possibly different rates (a convolution, not a mixture). Hyperexponential = mixture (random selection of one Exp). Erlang is hypoexponential with equal rates.

1.31 Phase-Type

General class: absorbing-time of finite-state continuous-time Markov chain. Includes Erlang, hypoexponential, hyperexponential, Coxian. Neuts 1981. Insurance, queueing.

1.32 Exponentially Modified Gaussian (EMG)

Convolution of Gaussian and Exponential. Skewed-right peaks in chromatography, neural reaction times.

1.33 Voigt

Convolution of Gaussian + Lorentzian. Spectroscopy line profiles (Doppler + pressure broadening). Approximations: pseudo-Voigt (Thompson-Cox-Hastings).

1.34 Skew-Normal (Azzalini 1985)

f(x) = 2 φ(x) Φ(αx). Adds asymmetry to Gaussian. α = 0 ⇒ Normal.
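A quick numerical check of the standard skew-normal density, written with the standard library. The mean formula E[X] = δ√(2/π) with δ = α/√(1+α²) is the usual result for the standardised form (location 0, scale 1); function names are illustrative.

```python
import math

def skew_normal_pdf(x, alpha):
    """f(x) = 2 * phi(x) * Phi(alpha * x) for the standardised skew-normal."""
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + math.erf(alpha * x / math.sqrt(2.0)))
    return 2.0 * phi * Phi

def skew_normal_mean(alpha):
    """E[X] = delta * sqrt(2/pi), delta = alpha / sqrt(1 + alpha^2)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    return delta * math.sqrt(2.0 / math.pi)

# Riemann-sum check on a fine grid: total mass and first moment
step = 0.01
grid = [-10.0 + step * i for i in range(2001)]
mass = sum(skew_normal_pdf(x, 3.0) for x in grid) * step
mean = sum(x * skew_normal_pdf(x, 3.0) for x in grid) * step
```

Setting alpha = 0 makes Φ(αx) = 1/2, so the density collapses to the standard normal.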

1.35 Skew-t

Azzalini + Capitanio 2003. Heavy-tailed + asymmetric. Standard in robust finance modelling.

1.36 Generalized Error Distribution (GED) / Subbotin / Generalized Normal

f(x) ∝ exp(−(|x|/α)^β / β). β = 2 ⇒ Normal; β = 1 ⇒ Laplace; β → ∞ ⇒ Uniform. GARCH residuals option (Nelson 1991 EGARCH).

1.37 Tukey Lambda

Symmetric, parameter-shape-only; useful for Q-Q plot diagnosis of unknown distribution shape.

1.38 Power-law Tail (Pareto-type)

P(X > x) ∝ x^(−α). Newman 2005 review. Estimation: Hill estimator, Clauset-Shalizi-Newman 2009 ML method with KS-based xmin selection.

2. Continuous Multivariate

2.1 Multivariate Normal (MVN)

f(x) = (2π)^(−k/2) |Σ|^(−1/2) exp(−(1/2)(x−μ)’ Σ^(−1) (x−μ)). Σ positive-definite. Marginals + conditionals + linear transforms all Gaussian.

2.2 Multivariate t

f(x) ∝ (1 + (x−μ)’ Σ^(−1) (x−μ)/ν)^(−(ν+k)/2). Heavier-tailed. ν = 1 ⇒ multivariate Cauchy.

2.3 Dirichlet

Support open simplex Δ^(k−1). Conjugate to Multinomial. f(p) ∝ ∏ p_i^(α_i−1). Topic models (LDA), Bayesian softmax priors.

2.4 Wishart

Wishart 1928. Distribution of S = X’X where the n rows of X (n × p) are i.i.d. N_p(0, Σ). Conjugate prior for inverse-covariance (precision matrix) in Bayesian multivariate normal.

2.5 Inverse-Wishart

Conjugate for covariance matrix Σ in Bayesian MVN. Critique (Gelman): induces correlation in priors; modern preference is LKJ prior on correlation matrices + separate scale priors (Lewandowski-Kurowicka-Joe 2009).

2.6 Matrix Normal

X (n × p) with row and column covariance structure (Σ_row ⊗ Σ_col separable Kronecker). Vec(X) ∼ N(vec(M), Σ_col ⊗ Σ_row).

2.7 Matrix-variate t

2.8 Multivariate Skew-Normal (Azzalini-Capitanio 1999)

2.9 Copulas

See copulas-and-dependence. Sklar (1959) theorem: any joint distribution is marginals + copula. Gaussian, t, Archimedean (Clayton, Gumbel, Frank, Joe), Vine copulas (Bedford-Cooke). Risk modelling, hydrology, finance dependence.

2.10 LKJ Distribution (Correlation Matrices)

f(R) ∝ |R|^(η−1). Lewandowski, Kurowicka, Joe 2009. Standard in PyMC, Stan, NumPyro for correlation priors.

3. Discrete

3.1 Bernoulli

X ∈ {0,1}, p. Building block.

3.2 Binomial

Sum of n i.i.d. Bernoulli(p). Mean np, variance np(1−p).

3.3 Negative Binomial (Polya)

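Number of failures before the r-th success in a Bernoulli sequence (some texts count total trials instead; scipy.stats.nbinom counts failures). Arises as a Gamma mixture of Poissons, hence overdispersed relative to Poisson; common in count regression (gene expression, insurance claim counts).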
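Under the convention that the variable counts failures before the r-th success (integer r, success probability p), the PMF is C(k+r−1, k) p^r (1−p)^k with mean r(1−p)/p and variance r(1−p)/p². The sketch below checks the overdispersion numerically; names are illustrative.

```python
import math

def nbinom_pmf(k, r, p):
    """P(K = k): k failures before the r-th success (integer r)."""
    return math.comb(k + r - 1, k) * p ** r * (1.0 - p) ** k

def nbinom_mean_var(r, p):
    """Closed-form mean and variance for the failures-count convention."""
    return r * (1.0 - p) / p, r * (1.0 - p) / p ** 2

r, p = 3, 0.4
pmf = [nbinom_pmf(k, r, p) for k in range(400)]   # tail beyond 400 is negligible
mean = sum(k * q for k, q in enumerate(pmf))
var = sum(k * k * q for k, q in enumerate(pmf)) - mean ** 2
```

The variance exceeds the mean by the factor 1/p, which is the overdispersion relative to a Poisson with the same mean.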

3.4 Geometric

Special case Negative Binomial r=1. Memoryless.

3.5 Hypergeometric

Sampling without replacement; K successes in population N, sample size n.

3.6 Poisson

Limit of Binomial as n → ∞, np = λ. Rare event count. Mean = variance = λ.

3.7 Generalized Poisson (Consul + Jain 1973)

Two-parameter; allows over- and underdispersion.

3.8 Skellam

Difference of two independent Poisson(λ₁), Poisson(λ₂). Sports scoring differences (Karlis + Ntzoufras).

3.9 Conway-Maxwell-Poisson (COM-Poisson)

Conway + Maxwell 1962. Adds dispersion parameter ν; under (ν > 1) and over (ν < 1) dispersion vs Poisson (ν = 1).

3.10 Categorical / Multinoulli

Single-trial K-class.

3.11 Multinomial

Sum of n Categorical. Generalises Binomial to K classes.

3.12 Dirichlet-Multinomial (DM)

Multinomial with probabilities themselves drawn Dirichlet. Overdispersed multinomial (microbiome compositional data).

3.13 Beta-Binomial

Binomial with Beta-distributed p; overdispersed. Toxicology, clustered trials.

3.14 Zipf / Zipfian

P(X = k) ∝ 1/k^s. Zipf 1949 (word frequencies). Limit of size-biased sampling.

3.15 Yule-Simon

Preferential attachment (Simon 1955). Power-law tail.

3.16 Discrete Uniform

3.17 Logarithmic (Fisher series)

Fisher 1943. Insect species abundance.

3.18 Borel + Borel-Tanner

Branching process family sizes.

3.19 Riemann Zeta / Hurwitz Zeta

P(X = k) = k^(−s)/ζ(s). Probabilistic number theory.

3.20 Polya Urn

Discrete-time Markov chain modelling reinforcement; underlies Dirichlet process (de Finetti’s theorem for exchangeable sequences).

4. Mixtures + Bayesian Nonparametric

4.1 Gaussian Mixture Model (GMM)

∑ π_k N(μ_k, Σ_k). Fit via EM (Dempster-Laird-Rubin 1977) or VI; model selection via BIC/AIC/cross-val. scikit-learn: sklearn.mixture.GaussianMixture.

4.2 Mixture of t / Mixture of factor analyzers

Robust extensions; Ghahramani + Hinton MFA, McLachlan + Peel mixture of t.

4.3 Dirichlet Process (DP) — Ferguson 1973

Nonparametric prior over probability measures. DP(α, G_0). Stick-breaking construction (Sethuraman 1994), Pólya urn / CRP representation.

4.4 Chinese Restaurant Process (CRP)

Exchangeable partition probability function (EPPF). Aldous + Pitman. Customer k+1 joins existing table j with prob n_j/(α+k) or new table with prob α/(α+k).
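The seating rule above translates directly into a simulation. This is a minimal sketch with an illustrative function name; it returns only the table sizes, not the customer-to-table assignment.

```python
import random

def crp_partition(n, alpha, rng):
    """Seat n customers by the CRP rule; returns the list of table sizes."""
    tables = []
    for k in range(n):                     # k customers are already seated
        u = rng.random() * (alpha + k)     # uniform on [0, alpha + k)
        acc = 0.0
        for j, n_j in enumerate(tables):
            acc += n_j
            if u < acc:                    # existing table j, prob n_j / (alpha + k)
                tables[j] += 1
                break
        else:
            tables.append(1)               # new table, prob alpha / (alpha + k)
    return tables

rng = random.Random(7)
sizes = crp_partition(200, 2.0, rng)
```

Summing the new-table probabilities gives the expected number of tables, ∑_{k=0}^{n−1} α/(α+k), which grows roughly like α log n.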

4.5 Pitman-Yor Process (PYP)

Pitman 1995; two-parameter extension of DP (discount d ∈ [0, 1), concentration α). Power-law cluster sizes — natural language modelling (Teh 2006 HPYP n-gram).

4.6 Indian Buffet Process (IBP)

Griffiths + Ghahramani 2005. Latent feature model — infinite binary features per data point. Beta-Bernoulli stick-breaking.

4.7 Hierarchical Dirichlet Process (HDP)

Teh-Jordan-Beal-Blei 2006. Sharing across groups; HDP-LDA (nonparametric topic model).

4.8 Latent Dirichlet Allocation (LDA)

Blei-Ng-Jordan 2003. Topic model; Dirichlet priors over per-document topic distributions and per-topic word distributions. Variational + Gibbs samplers (Griffiths + Steyvers).

4.9 Mixture of Experts (MoE)

Jacobs et al 1991. Gating network selects expert subnetwork. Foundation of sparse-MoE LLMs (Mixtral 8×7B, DeepSeek-V3, GPT-4 rumored).

5. Survival + Reliability

5.1 Exponential — constant hazard; memoryless

5.3 Log-normal — non-monotone hazard

5.4 Gompertz — log-hazard linear in time; biological aging

5.5 Gompertz-Makeham — Gompertz + constant baseline; actuarial mortality

5.6 Beta-survival

5.7 Generalized Gamma — encompasses Exp, Weibull, log-normal (Stacy 1962)

5.8 Generalized Weibull / Exponentiated Weibull (Mudholkar + Srivastava 1993)

5.9 Bathtub-curve models — Hjorth, modified Weibull (Lai-Xie-Murthy), additive Weibull

5.10 AFT (Accelerated Failure Time) models — log T = β’X + σε; ε from chosen base

5.11 Cox PH — semi-parametric: h(t|X) = h_0(t) exp(β’X). Cox 1972. Partial likelihood. Default in clinical biostatistics

5.12 Piecewise Exponential — flexible nonparametric baseline

5.13 Royston-Parmar / Flexible Parametric — splines on log cumulative hazard

5.14 Kaplan-Meier — nonparametric survival estimator

5.15 Nelson-Aalen — nonparametric cumulative hazard estimator

Tools: R survival, flexsurv, rms (Frank Harrell), Python lifelines (Davidson-Pilon), scikit-survival, PyMC for Bayesian survival.

6. Heavy-Tailed Families (Finance + Risk)

6.1 α-Stable (§1.16)

Mandelbrot 1963 cotton prices; Fama 1965 stock returns. Heavy tails + stability under sums.

6.2 Generalized Hyperbolic + sub-families (§1.17–1.19)

6.3 Jump-Diffusion (Merton 1976)

dS_t / S_t = μ dt + σ dW_t + (Y − 1) dN_t. N_t Poisson jump count, Y log-normal jump size.

6.4 Kou Double-Exponential Jump-Diffusion (2002)

Asymmetric Laplace jump-size; analytical option pricing.

6.5 Stochastic Volatility — Heston (1993)

dS = μ S dt + √v S dW₁, dv = κ(θ − v)dt + σ_v √v dW₂, corr(dW₁, dW₂) = ρ. The characteristic function is available in closed form; option prices follow by Fourier inversion, e.g. the Carr-Madan FFT method.

6.6 SABR (Hagan, Kumar, Lesniewski, Woodward 2002)

dF = α F^β dW₁, dα = ν α dW₂, corr = ρ. Volatility surface in fixed-income.

6.7 Rough Volatility — Gatheral-Jaisson-Rosenbaum 2018

Fractional Brownian motion with Hurst H ≈ 0.1. Volatility is “rougher” than diffusive (H = 0.5). rBergomi popular.

6.8 CGMY / Variance Gamma — pure-jump Lévy (Carr-Geman-Madan-Yor 2002)

6.9 Tempered Stable — α-stable with exponentially dampened tails for finite moments

6.10 Subordinated Brownian Motion — general framework: B(τ(t)) with random time change τ

7. Circular / Directional Statistics

7.1 von Mises

Circular analogue of the normal; support [0, 2π). f(θ) ∝ exp(κ cos(θ − μ)). von Mises 1918. Wind direction, oriented data. Mardia + Jupp 2000 textbook.

7.2 Wrapped Normal, Wrapped Cauchy, Wrapped Lévy

“Wrap” a real-line distribution onto circle by summing over translates.

7.3 von Mises-Fisher (vMF)

S^(p−1) analogue of von Mises. Directional clustering (Banerjee-Dhillon-Ghosh-Sra 2005). Used in spherical text embeddings, hyperspherical normalisation in deep learning (CosFace, ArcFace).

7.4 Bingham

Antipodally symmetric; orientation data without sign.

7.5 Matrix Bingham + Fisher-Bingham

Distributions on Stiefel manifold (orthonormal matrices). Bayesian PCA orientation, cryo-EM reconstructions.

7.6 Watson

Axial data analogue of vMF.

7.7 Kent

Generalisation of vMF on sphere with elliptical concentration (5 parameters); geology orientation data.

8. Empirical / Nonparametric

8.1 Empirical Distribution Function

F̂_n(x) = (1/n) ∑ I(X_i ≤ x). Glivenko-Cantelli theorem.

8.2 Bootstrap (Efron 1979)

Resampling with replacement from empirical distribution. CI + standard errors. Nonparametric, parametric, smoothed, bootstrap-t, BCa, double bootstrap.
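A sketch of the simplest variant, the nonparametric percentile interval, using only the standard library. The function name and defaults are illustrative; bootstrap-t or BCa intervals need additional steps not shown here.

```python
import random
import statistics

def bootstrap_percentile_ci(data, stat, n_boot=2000, level=0.95, rng=None):
    """Percentile bootstrap CI: resample with replacement, take empirical quantiles."""
    if rng is None:
        rng = random.Random()
    n = len(data)
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot) - 1)
    return reps[lo_idx], reps[hi_idx]

# Synthetic data for illustration only
data_rng = random.Random(3)
data = [data_rng.gauss(10.0, 2.0) for _ in range(200)]
ci_low, ci_high = bootstrap_percentile_ci(
    data, statistics.mean, n_boot=1000, rng=random.Random(4)
)
```

Any statistic that maps a list to a number can be passed as stat, for example statistics.median.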

8.3 Kernel Density Estimation (KDE)

f̂(x) = (1/nh) ∑ K((x − X_i)/h). Silverman 1986. Bandwidth selection: Silverman’s rule, Scott’s rule, cross-val, plug-in.
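A minimal Gaussian-kernel version of the estimator, with bandwidth from Silverman's rule of thumb h = 0.9 · min(s, IQR/1.34) · n^(−1/5). This is a sketch with illustrative names; it evaluates the sum directly, so cost is O(n) per query point.

```python
import math
import random
import statistics

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 0.9 * min(sd, (q3 - q1) / 1.34) * n ** (-0.2)

def gaussian_kde(data, h):
    """Return f_hat(x) = (1/(n*h)) * sum K((x - X_i)/h) with Gaussian K."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    def f_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return f_hat

# Synthetic standard-normal sample for illustration only
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
h = silverman_bandwidth(data)
f_hat = gaussian_kde(data, h)
density_at_zero = f_hat(0.0)
```

The rule of thumb is tuned for roughly Gaussian data; for multimodal or heavy-tailed samples it tends to oversmooth, which is where cross-validation or plug-in selectors come in.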

8.4 Spline Density

P-splines (Eilers + Marx 1996), penalised B-splines.

8.5 Histogram + Bayesian Histogram

8.6 Bayesian Nonparametric Density (DPM, BNP)

9. Extreme Value Theory (EVT)

9.1 Fisher-Tippett-Gnedenko Theorem

Block maxima → GEV (only three possible limits: Gumbel, Fréchet, Weibull).

9.2 Pickands-Balkema-de Haan Theorem

Threshold exceedances → GPD.

9.3 Hill Estimator

Estimates the reciprocal tail index: ξ̂ = 1/α̂ = (1/k) ∑_{i=1}^k log(X_{(n−i+1)}/X_{(n−k)}), using the k largest order statistics. Bias-variance tradeoff in k.
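A direct implementation of the sum above, as a sketch with illustrative names. The average log-ratio of the k largest observations to the (k+1)-th largest estimates ξ = 1/α, so the tail index itself is the reciprocal of the returned value. The check uses synthetic Pareto data generated by inverse transform.

```python
import math
import random

def hill_xi(sample, k):
    """Mean log-excess over X_(n-k); estimates xi = 1/alpha for positive data."""
    xs = sorted(sample)
    n = len(xs)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    threshold = xs[n - k - 1]              # X_(n-k) in 1-indexed order statistics
    return sum(math.log(xs[n - i] / threshold) for i in range(1, k + 1)) / k

def pareto_sample(alpha, n, rng):
    """Inverse-transform draws with P(X > x) = x^(-alpha), x >= 1."""
    return [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

rng = random.Random(42)
sample = pareto_sample(2.0, 20000, rng)
xi_hat = hill_xi(sample, 2000)
alpha_hat = 1.0 / xi_hat
```

On exact Pareto data the estimate is stable across k; on real data one would evaluate hill_xi over a range of k and look for a plateau.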

9.4 Pickands Estimator

Estimator for ξ (shape parameter) of GPD/GEV.

9.5 Return Levels

x_p = μ + [σ/ξ]((−log(1−p))^(−ξ) − 1). “100-year flood” = level exceeded with prob 0.01 in any year (return period 100 years).
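The return-level formula is the GEV quantile at probability 1 − p, which can be verified by plugging it back into the CDF from §1.22. A sketch with illustrative names and made-up parameter values:

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV CDF with the Gumbel form at xi = 0."""
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-(x - mu) / sigma))
    z = 1.0 + xi * (x - mu) / sigma
    if z <= 0.0:
        # Outside the support: below the lower endpoint (xi > 0) or above the upper (xi < 0)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-(z ** (-1.0 / xi)))

def gev_return_level(p, mu, sigma, xi):
    """Level exceeded with probability p per block (return period 1/p blocks)."""
    y = -math.log1p(-p)                    # -log(1 - p)
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)
    return mu + sigma / xi * (y ** (-xi) - 1.0)

# Hypothetical annual-maximum fit: mu = 10, sigma = 2, xi = 0.2
level_100yr = gev_return_level(0.01, 10.0, 2.0, 0.2)
```

By construction gev_cdf(level_100yr, 10.0, 2.0, 0.2) equals 0.99 up to rounding.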

9.6 Multivariate EVT

Spectral measure on simplex; max-stable processes (Brown-Resnick, Schlather, Smith).

9.7 VaR + Expected Shortfall + Cornish-Fisher

Quantile-based risk; ES = E[X | X > VaR]. Cornish-Fisher expansion approximates quantiles with skew + kurt correction.
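An empirical version of both quantities, as a sketch: VaR is taken as an order statistic of the loss sample and ES as the mean of losses strictly above it. The quantile convention (ceil(level · n)-th smallest loss) is one of several in use and is an assumption here.

```python
import math

def empirical_var_es(losses, level=0.99):
    """Empirical VaR (order-statistic quantile) and ES = mean of losses above VaR."""
    xs = sorted(losses)
    n = len(xs)
    idx = min(n - 1, max(0, math.ceil(level * n) - 1))
    var = xs[idx]
    tail = [x for x in xs if x > var]
    es = sum(tail) / len(tail) if tail else var   # fall back to VaR if no exceedances
    return var, es

# Toy loss sample for illustration only
var_50, es_50 = empirical_var_es([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], level=0.5)
```

With few tail observations the ES estimate is noisy, which is one motivation for the GPD-based tail fits described earlier in this section.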

Software: R evd, evir, extRemes, ismev, POT; Python pyextremes.

10. Tests + Diagnostics

10.1 Goodness-of-Fit

  • Q-Q plot — visual
  • P-P plot — visual
  • Anderson-Darling — emphasises tails (tail-weighted Cramér-von Mises statistic)
  • Kolmogorov-Smirnov — sup |F̂ − F|; typically less powerful than AD, especially in the tails
  • Cramér-von Mises — integrated squared difference
  • Shapiro-Wilk — normality (small n)
  • Jarque-Bera — moment-based normality test
  • Lilliefors — KS with estimated parameters
  • D’Agostino-Pearson — skew + kurtosis combined

10.2 Two-Sample

  • KS two-sample
  • Anderson-Darling k-sample
  • Energy distance (Székely + Rizzo 2004)
  • Maximum Mean Discrepancy (MMD) (Gretton et al 2007) — kernel-based; widely used in ML
  • Permutation tests

10.3 Discrepancies

  • Wasserstein W_p — optimal transport
  • Sinkhorn divergence — entropy-regularised OT (Cuturi 2013)
  • KL divergence — Kullback-Leibler 1951
  • JS divergence — symmetrised KL
  • f-divergences — Csiszár class
  • Hellinger distance — bounded [0, 1]

11. Software Locator

  • Python scipy.stats — 100+ continuous, 20+ discrete; rvs/pdf/pmf/cdf/ppf/sf/isf/logpdf/…
  • statsmodels — fit + tests + diagnostics
  • scikit-learn mixture — GMM, BayesianGaussianMixture (DP-GMM via VI)
  • PyMC — Bayesian; rich distribution library
  • NumPyro + Distrax + TensorFlow Probability — JAX/TF stacks
  • PyTorch torch.distributions — autograd-compatible
  • hmmlearn, pomegranate — HMM + general Bayesian
  • R stats + MASS — base + classical
  • R fitdistrplus — fitting any continuous + discrete with diagnostics
  • R EnvStats — environmental statistics distributions (Censored)
  • R evd, evir, extRemes, ismev, POT — EVT
  • R copula, VineCopula — copula models
  • R survival, flexsurv, rms — survival
  • Python bnpy / R dirichletprocess — BNP
  • Stan / RStan / CmdStanR — Bayesian HMC; complete distribution language
  • Julia Distributions.jl — type-stable, extensive

12. Common Pitfalls

  • Heavy-tailed data: sample mean + standard deviation may not exist or converge slowly; use trimmed mean, median, MAD instead
  • Q-Q plot ≠ proof of distribution; only suggests
  • KS critical values assume a fully specified null; when parameters are estimated from the same data the test becomes conservative (use Lilliefors for normality); for tail sensitivity use Anderson-Darling
  • Bootstrap CIs are not exact; coverage depends on regularity and tail behaviour
  • Hill estimator is sensitive to threshold k; plot tail-index vs k and pick stable plateau
  • Conjugate priors are convenient but can be informative; check sensitivity
  • Inverse-Wishart on covariance ⇒ correlated priors on variance and correlation — prefer LKJ + scale priors
  • Discrete distributions in continuous-data toolkits: scipy exposes pmf/logpmf for them instead of pdf/logpdf, so code written against continuous distributions does not run unchanged
  • Mixed continuous-discrete (zero-inflated count models, censored response) — use Tobit, hurdle, ZIP/ZINB, Tweedie compound Poisson-Gamma
  • Sampling from heavy-tailed distributions in MCMC: HMC step-size adaptation struggles; reparameterise (centred ⇄ non-centred) or use NUTS warm-up + step-size jitter
  • Multivariate distributions: number of parameters grows O(p²) for covariance; structured priors (LKJ, Cholesky, factor-analytic) essential at p > 20

13. Exponential Family + GLM Backbone

Many of the above belong to the natural exponential family, which determines the canonical GLM (Generalized Linear Model — Nelder + Wedderburn 1972):

| Distribution | Link (canonical) | Variance function V(μ) | GLM use |
|---|---|---|---|
| Normal | identity | 1 | OLS regression |
| Bernoulli / Binomial | logit | μ(1−μ) | Logistic regression |
| Poisson | log | μ | Count regression |
| Negative Binomial | log | μ + μ²/k | Overdispersed counts |
| Gamma | inverse | μ² | Positive continuous |
| Inverse Gaussian | inverse-square | μ³ | Failure times |
| Multinomial | logit (softmax) | diag(μ) − μμ’ | Categorical regression |
| Tweedie | power | μ^p | Insurance claims, semi-continuous |

GLMs unify these; MLE via IRLS (iteratively reweighted least squares). Extensions: GLMM (random effects), GAM (smooth terms, Wood 2017), Bayesian GLM (brms, rstanarm).
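To make the IRLS step concrete, here is a pure-Python sketch for Poisson regression with log link and a single covariate. Each iteration forms weights w = μ and working response z = η + (y − μ)/μ, then solves the 2×2 weighted least-squares system in closed form. The function name is illustrative and there is no convergence check or step control.

```python
import math

def irls_poisson(x, y, n_iter=25):
    """Fit log E[y] = b0 + b1 * x by IRLS (Fisher scoring)."""
    b0 = math.log(sum(y) / len(y))         # start at the intercept-only fit
    b1 = 0.0
    for _ in range(n_iter):
        # Accumulate the entries of X'WX and X'Wz for design columns [1, x]
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            w = mu                         # Poisson with log link: weight = mu
            z = eta + (yi - mu) / mu       # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        det = s_w * s_wxx - s_wx ** 2
        b0, b1 = ((s_wxx * s_wz - s_wx * s_wxz) / det,
                  (s_w * s_wxz - s_wx * s_wz) / det)
    return b0, b1

# Noise-free responses on a known curve, so the fit should recover (0.5, 0.3)
x = [0.5 * i for i in range(9)]
y = [math.exp(0.5 + 0.3 * xi) for xi in x]
b0_hat, b1_hat = irls_poisson(x, y)
```

Real fitting code would use a matrix library, handle more covariates, and stop on a deviance tolerance rather than a fixed iteration count.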

14. Process-Level Distributions (Time-Indexed)

Generalising static distributions to stochastic processes:

14.1 Gaussian Processes

  • Mean function m(x) + kernel k(x, x’)
  • Common kernels: squared exponential (RBF), Matérn (ν = 1/2 exponential, ν = 3/2, 5/2, ∞ RBF), periodic, linear, spectral mixture
  • Bayesian regression, optimisation (BO; Mockus 1989, Snoek-Larochelle-Adams 2012 spearmint)
  • Tools: GPy, GPflow, GPyTorch, BoTorch, PyMC GP module

14.2 Poisson Process + Cox Process

  • Homogeneous PP: events with rate λ; inter-arrival ∼ Exp(λ)
  • Inhomogeneous PP: rate λ(t)
  • Cox (doubly stochastic) Process: λ(t) is itself random (e.g., log-Gaussian Cox process for spatial point patterns; Møller-Syversveen-Waagepetersen 1998)

14.3 Hawkes Process (self-exciting)

Conditional intensity λ*(t) = μ + ∑_{t_i < t} α exp(−β(t − t_i)). Hawkes 1971. Seismology aftershocks, finance order arrivals, contagion.
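The conditional intensity is simple to evaluate for a given event history. A sketch with illustrative names follows; the branching ratio α/β is the integral of the exponential kernel, i.e. the expected number of direct offspring per event, and must be below 1 for a stationary process.

```python
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    """lambda*(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)

def branching_ratio(alpha, beta):
    """Integral of alpha * exp(-beta * s) over s >= 0."""
    return alpha / beta

# Intensity just after a burst of events versus long afterwards
history = [0.0, 0.1, 0.2]
lam_near = hawkes_intensity(0.3, history, mu=0.2, alpha=0.5, beta=1.0)
lam_far = hawkes_intensity(10.0, history, mu=0.2, alpha=0.5, beta=1.0)
```

The intensity decays back toward the baseline μ as the excitation from past events fades.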

14.4 Markov Chains + HMM

  • Discrete-time / continuous-time
  • Hidden Markov Model — emissions conditional on hidden state; Baum-Welch + Viterbi
  • LDS / Kalman filter — linear-Gaussian HMM
  • Tools: hmmlearn, pomegranate, pyhsmm, edward, NumPyro

14.5 Lévy Processes

  • Independent + stationary increments; càdlàg paths
  • Brownian motion (continuous), Poisson, compound Poisson, Gamma process, variance gamma, NIG, CGMY, α-stable
  • Lévy-Khintchine triplet (drift, Gaussian volatility, jump measure)

14.6 Diffusions + SDEs

dX_t = μ(X_t, t) dt + σ(X_t, t) dW_t. Itô + Stratonovich calculi.

  • Ornstein-Uhlenbeck (mean-reverting Gaussian) — interest rates Vasicek
  • CIR (Cox-Ingersoll-Ross) — non-negative square-root process; short rates
  • Heston volatility (§6.5)
  • SABR (§6.6)
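The general SDE above can be simulated with the Euler-Maruyama scheme, X_{t+Δt} = X_t + μ(X_t, t)Δt + σ(X_t, t)ΔW with ΔW ∼ N(0, Δt). The sketch below applies it to an Ornstein-Uhlenbeck process; the function name and parameter values are illustrative, and the scheme is only first-order (strong order 1/2).

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, t_end, n_steps, rng):
    """Simulate one path of dX = drift(X, t) dt + diffusion(X, t) dW (Ito form)."""
    dt = t_end / n_steps
    x, t = x0, 0.0
    path = [x0]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))     # Brownian increment over dt
        x = x + drift(x, t) * dt + diffusion(x, t) * dw
        t += dt
        path.append(x)
    return path

# Ornstein-Uhlenbeck: dX = kappa * (theta - X) dt + sigma dW
kappa, theta, sigma = 2.0, 1.0, 0.3
ou_path = euler_maruyama(
    lambda x, t: kappa * (theta - x),
    lambda x, t: sigma,
    x0=0.0, t_end=1.0, n_steps=100, rng=random.Random(0),
)
```

For square-root processes such as CIR, plain Euler-Maruyama can step below zero, so practical code uses truncation or an exact scheme instead.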

14.7 Fractional + Rough Processes

  • Fractional Brownian motion fBm(H) — H Hurst exponent
  • Rough volatility H ≈ 0.1 (Gatheral et al)

15. Distance + Divergence Reference

Beyond the goodness-of-fit tests in §10, key divergences used in modern probabilistic ML:

  • KL D_KL(P‖Q) = ∫ p log(p/q) — variational inference, mutual information, ELBO
  • Reverse-KL D_KL(Q‖P) — mode-seeking; VAE encoder objective
  • JS = (KL(P‖M) + KL(Q‖M))/2, M=(P+Q)/2 — symmetric, bounded
  • Wasserstein W_p — cost of moving mass; stays finite and informative when supports do not overlap (where KL and JS saturate); basis of WGAN (Arjovsky-Chintala-Bottou 2017); Sinkhorn approximation in roughly O(n² log n) (Cuturi 2013)
  • Hellinger H²(P,Q) = (1/2)∫(√p − √q)²
  • Total Variation TV(P,Q) = (1/2) ∫ |p − q|
  • MMD ‖μ_P − μ_Q‖_H_k — kernel two-sample (Gretton 2007); used in autoencoder testing
  • Energy distance — characteristic kernel MMD with negative-distance kernel
  • Stein Discrepancy + KSD — sample-vs-distribution goodness-of-fit (Gorham + Mackey 2017); used in posterior sampler diagnostics
  • Bregman divergences — generalisation; KL is the Bregman induced by entropy
  • α-divergences (Rényi, Tsallis) — interpolate KL ↔ Hellinger
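For discrete distributions several of the divergences listed above reduce to short sums. The sketch below assumes probability vectors of equal length and, for KL, that q is positive wherever p is; function names are illustrative.

```python
import math

def kl(p, q):
    """D_KL(P || Q) = sum p log(p/q); terms with p = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum |p - q|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    """H^2(P, Q) = (1/2) * sum (sqrt(p) - sqrt(q))^2."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

# Partially overlapping supports: KL(P || Q) would be infinite, JS stays bounded
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
js_pq = js(p, q)
```

JS is bounded by log 2 in nats, and H² ≤ TV always holds, which the example can be used to confirm.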

16. Reference Texts

  • Johnson, Kotz, Balakrishnan — Continuous Univariate Distributions Vols 1 + 2 (Wiley)
  • Johnson, Kemp, Kotz — Univariate Discrete Distributions
  • Kotz, Balakrishnan, Johnson — Continuous Multivariate Distributions
  • Coles — An Introduction to Statistical Modeling of Extreme Values (2001)
  • Embrechts, Klüppelberg, Mikosch — Modelling Extremal Events (1997)
  • Mardia + Jupp — Directional Statistics (2000)
  • McLachlan + Peel — Finite Mixture Models (2000)
  • Klein + Moeschberger — Survival Analysis (Springer)
  • Therneau + Grambsch — Modeling Survival Data: Extending the Cox Model
  • Cont + Tankov — Financial Modelling with Jump Processes (2003)
  • Hogg + McKean + Craig — Introduction to Mathematical Statistics
