Statistical Distributions Catalog (Extended)
Encyclopedic catalogue of probability distributions, beyond the introductory zoo. Each entry: support, parameters, density/PMF, moments where compact, primary historical attribution, canonical applications, and software locator. SI units used where physical; otherwise dimensionless.
This note extends prob-distribution-zoo with the heavier-tailed, multivariate, directional, extreme-value, Bayesian-nonparametric, and survival families that appear in modern statistical practice.
1. Continuous Univariate — Location-Scale + Exponential Family
1.1 Normal / Gaussian
Support ℝ. f(x) = (2π σ²)^(−1/2) exp(−(x−μ)²/(2σ²)). de Moivre 1733 (binomial approximation); Gauss 1809 (Theoria Motus); Laplace independently. Mean μ, variance σ², skewness 0, kurtosis 3 (excess 0). Universal default, justified by the central limit theorem (CLT). scipy.stats.norm, R dnorm.
1.2 Log-normal
Support (0, ∞). Y = exp(X), X ∼ N(μ, σ²). Galton (1879) and McAlister. Mean e^(μ+σ²/2), variance (e^σ² − 1) e^(2μ+σ²). Multiplicative CLT. Particle sizes, income distributions, biological growth, geometric Brownian motion in finance. scipy.stats.lognorm.
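The moment formulas above can be checked numerically. The following is a minimal sketch using only the standard library; the helper name `lognormal_moments` and the parameter values are illustrative assumptions, not part of any package.

```python
import math
import random

def lognormal_moments(mu: float, sigma: float) -> tuple[float, float]:
    """Closed-form mean and variance of Y = exp(X), X ~ N(mu, sigma^2)."""
    mean = math.exp(mu + sigma**2 / 2)
    var = (math.exp(sigma**2) - 1) * math.exp(2 * mu + sigma**2)
    return mean, var

mu, sigma = 0.5, 0.4          # illustrative values
mean, var = lognormal_moments(mu, sigma)

# Monte Carlo check with the standard-library sampler (seeded for repeatability).
rng = random.Random(0)
draws = [rng.lognormvariate(mu, sigma) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / (len(draws) - 1)
```

The simulated mean and variance should agree with the closed forms up to Monte Carlo error, which grows quickly as σ increases because the tail becomes heavier.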
1.3 Half-normal, Folded Normal, Truncated Normal
- Half-normal: |X|, X ∼ N(0, σ²)
- Folded: |X|, X ∼ N(μ, σ²) — μ ≠ 0
- Truncated: N conditioned on a < X < b; closed-form moments via Mills ratio φ/Φ
Used as priors (half-normal scale prior in Bayesian HLMs; Stan / PyMC convention), measurement-error truncation.
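As an illustration of the closed-form truncated moments, the mean of N(μ, σ²) conditioned on a < X < b is μ + σ(φ(α) − φ(β))/(Φ(β) − Φ(α)) with α = (a−μ)/σ and β = (b−μ)/σ. The sketch below assumes this standard formula and uses `statistics.NormalDist` for φ and Φ; the function name is hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal: .pdf is phi, .cdf is Phi

def truncated_normal_mean(mu: float, sigma: float, a: float, b: float) -> float:
    """Mean of N(mu, sigma^2) conditioned on a < X < b."""
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    mass = _STD.cdf(beta) - _STD.cdf(alpha)      # probability of the interval
    return mu + sigma * (_STD.pdf(alpha) - _STD.pdf(beta)) / mass

# Half-normal as the special case mu = 0, a = 0, b = +infinity:
# the mean should equal sigma * sqrt(2 / pi).
half_normal_mean = truncated_normal_mean(0.0, 1.0, 0.0, math.inf)
```

Symmetric truncation around μ leaves the mean unchanged, which gives a quick sanity check for any implementation.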
1.4 Inverse Gaussian (Wald)
Support (0, ∞). First passage time of Brownian motion with drift to fixed level. Schrödinger 1915; Tweedie 1957. Mean μ, variance μ³/λ. Reaction times, failure times.
1.5 Generalized Inverse Gaussian (GIG)
Mixing distribution for generalized hyperbolic (§1.17); Bessel-K density. Barndorff-Nielsen + Halgreen 1977.
1.6 Beta
Support [0, 1]. f(x) = x^(α−1)(1−x)^(β−1) / B(α, β). Pearson System Type I (1916). Conjugate prior for Bernoulli/Binomial. Mean α/(α+β). Jeffreys prior Beta(1/2, 1/2). scipy.stats.beta.
1.7 Gamma
Support (0, ∞). f(x) = x^(k−1) e^(−x/θ) / (θ^k Γ(k)). Shape k, scale θ. Mean kθ, variance kθ². Waiting time for k Poisson events. Conjugate prior for Poisson rate (shape-rate parameterisation).
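The Poisson conjugacy mentioned above reduces to simple arithmetic in the shape-rate form: a Gamma(a, b) prior on the rate, combined with counts x_1, …, x_n, gives a Gamma(a + Σx_i, b + n) posterior. A minimal sketch follows; the function name and the data are illustrative assumptions.

```python
def gamma_poisson_update(a: float, b: float, counts: list[int]) -> tuple[float, float]:
    """Posterior (shape, rate) for a Poisson rate under a Gamma(shape=a, rate=b) prior."""
    return a + sum(counts), b + len(counts)

counts = [3, 5, 4, 6, 2]                      # made-up observations
a_post, b_post = gamma_poisson_update(2.0, 1.0, counts)

post_mean = a_post / b_post                   # shape / rate
post_scale = 1.0 / b_post                     # theta in the shape-scale form used above
post_var = a_post * post_scale**2             # k * theta^2
```

Note the conversion between conventions: the density in this entry uses scale θ, while the conjugate update is usually written with rate b = 1/θ.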
1.8 Inverse-Gamma
X ∼ Gamma → 1/X ∼ Inverse-Gamma. Conjugate for normal variance (with known mean) in classical Bayesian — though half-Cauchy / half-normal scale priors are preferred today (Gelman 2006).
1.9 Generalized Gamma (Stacy 1962)
Three-parameter family unifying Gamma and Weibull, with Log-normal as a limiting case. Useful in survival analysis when the hazard shape is uncertain.
1.10 Erlang
Special Gamma with integer shape k. Sum of k i.i.d. Exp(λ). Telephone-traffic / queueing origin (Erlang at Copenhagen Telephone 1909–1929).
1.11 Chi-squared (χ²)
Sum of k squared standard normals. Helmert 1875, popularised Pearson 1900 (goodness-of-fit). DoF k; mean k, variance 2k. Test statistics: Pearson χ², LR test asymptotic.
1.12 Chi (χ)
√χ². k-DoF magnitude of k-vector of i.i.d. N(0,1). χ_2 = Rayleigh, χ_3 = Maxwell.
1.13 Student’s t
Gosset 1908 (pen-name “Student”; Guinness brewery confidentiality). k DoF. Heavier tails than normal; → normal as k → ∞. Test statistic for small-sample mean inference. scipy.stats.t.
1.14 F (Fisher–Snedecor)
Ratio of two scaled χ². Fisher (1924) + Snedecor (1934). ANOVA, regression F-tests.
1.15 Cauchy
Support ℝ. f(x) = (πγ(1 + ((x−x₀)/γ)²))^(−1). No mean, no variance (pathological). Lorentz/Cauchy 1853; resonance spectroscopy. Stable index α = 1.
1.16 Lévy α-stable
Stable family parameterised by (α, β, γ, δ): characteristic function ψ(t) = exp(iδt − γ^α |t|^α (1 − iβ sign(t) Φ(α, t))). α ∈ (0, 2]. α = 2 ⇒ Normal; α = 1, β = 0 ⇒ Cauchy; α = 1/2, β = 1 ⇒ Lévy distribution. Mandelbrot 1963 (cotton prices); Lévy 1925. Infinite variance for α < 2. Heavy-tailed finance, anomalous diffusion. R: stable; Python: scipy.stats.levy_stable.
1.17 Generalized Hyperbolic
Mean-variance mixture of Normal with GIG mixing. Barndorff-Nielsen 1977. Subfamilies: NIG (Normal Inverse Gaussian), Variance Gamma, Hyperbolic, Skew-t. Finance returns.
1.18 NIG (Normal Inverse Gaussian)
Special case of GH; popular in option pricing (Barndorff-Nielsen). R: GeneralizedHyperbolic.
1.19 Variance Gamma
Madan + Seneta 1990. Subordinated Brownian motion with Gamma time change. Carr-Geman-Madan-Yor (CGMY 2002) extends it to a four-parameter Lévy process.
1.20 Pareto (Type I, II, III, IV; generalized Pareto)
Pareto 1896 (income, the 80/20 rule). Type I: f(x) = α x_m^α / x^(α+1), x ≥ x_m. Type II = Lomax (Type I shifted so the support starts at 0). Power-law tail. Newman 2005 reviews power laws; see also Mandelbrot.
1.21 Generalized Pareto Distribution (GPD)
Pickands 1975. CDF F(x) = 1 − (1 + ξx/σ)^(−1/ξ). Threshold exceedance limit in extreme value theory (POT method). Three regimes:
- ξ > 0: heavy tail (Pareto)
- ξ = 0: exponential tail (Gumbel domain)
- ξ < 0: bounded tail (reverse-Weibull)
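The three regimes can be seen directly from the CDF. The sketch below evaluates F(x) = 1 − (1 + ξx/σ)^(−1/ξ) with the ξ = 0 case handled as the exponential limit; the function name and the numerical threshold for "ξ ≈ 0" are assumptions made for illustration.

```python
import math

def gpd_cdf(x: float, sigma: float, xi: float) -> float:
    """CDF of the generalized Pareto distribution for exceedances x >= 0."""
    if x < 0:
        return 0.0
    if abs(xi) < 1e-12:                 # xi = 0: exponential tail
        return 1.0 - math.exp(-x / sigma)
    z = 1.0 + xi * x / sigma
    if z <= 0.0:                        # only possible for xi < 0: beyond endpoint -sigma/xi
        return 1.0
    return 1.0 - z ** (-1.0 / xi)

heavy = gpd_cdf(2.0, 1.0, 0.5)          # xi > 0: polynomial tail
light = gpd_cdf(2.0, 1.0, 0.0)          # xi = 0: exponential tail
bounded = gpd_cdf(2.5, 1.0, -0.5)       # xi < 0: upper endpoint at sigma/|xi| = 2
```

For the same σ, the heavy-tailed case leaves more mass above any fixed level than the exponential case, and the bounded case reaches 1 at the finite endpoint.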
1.22 Generalized Extreme Value (GEV)
Jenkinson 1955; unifies Gumbel + Fréchet + Weibull. CDF F(x) = exp{−[1 + ξ(x−μ)/σ]^(−1/ξ)} for ξ ≠ 0; F(x) = exp{−exp[−(x−μ)/σ]} for ξ = 0. Block-maxima limit (Fisher-Tippett-Gnedenko theorem). Coles 2001 textbook.
1.23 Gumbel
GEV with ξ = 0. Maxima of light-tailed RVs. Flood return levels, NLP softmax sampling (Gumbel-Max trick).
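The Gumbel-Max trick samples a categorical variable by adding independent standard Gumbel noise to the log-weights and taking the argmax. A minimal standard-library sketch follows; the function name and the example probabilities are illustrative assumptions.

```python
import math
import random
from collections import Counter

def gumbel_max_sample(log_weights: list[float], rng: random.Random) -> int:
    """Return index i with probability proportional to exp(log_weights[i])."""
    best_index, best_value = -1, -math.inf
    for i, lw in enumerate(log_weights):
        u = rng.random()
        while u == 0.0:                      # avoid log(0)
            u = rng.random()
        g = -math.log(-math.log(u))          # standard Gumbel(0, 1) draw
        if lw + g > best_value:
            best_index, best_value = i, lw + g
    return best_index

probs = [0.2, 0.5, 0.3]
log_weights = [math.log(p) for p in probs]
rng = random.Random(0)
n_draws = 50_000
counts = Counter(gumbel_max_sample(log_weights, rng) for _ in range(n_draws))
freqs = [counts[i] / n_draws for i in range(len(probs))]
```

The empirical frequencies should match `probs` up to sampling error. The log-weights do not need to be normalised, which is why the trick is convenient for softmax outputs.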
1.24 Fréchet
GEV with ξ > 0. Maxima of heavy-tailed RVs.
1.25 Weibull (reversed Weibull = GEV ξ < 0)
Waloddi Weibull 1951. CDF F(t) = 1 − exp(−(t/η)^β), so the survival function is S(t) = exp(−(t/η)^β). Shape β:
- β < 1 decreasing hazard (infant mortality)
- β = 1 exponential
- β > 1 increasing hazard (wear-out)
Wind speed, material fatigue.
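The hazard regimes listed above follow from h(t) = f(t)/S(t) = (β/η)(t/η)^(β−1). The sketch below evaluates this hazard for three shape values with η = 1; the function names are illustrative.

```python
import math

def weibull_survival(t: float, eta: float, beta: float) -> float:
    """S(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t: float, eta: float, beta: float) -> float:
    """h(t) = (beta/eta) * (t/eta)^(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

times = [0.5, 1.0, 2.0]
# Map each shape value to the hazard evaluated along the time grid.
hazards = {b: [weibull_hazard(t, 1.0, b) for t in times] for b in (0.5, 1.0, 2.0)}
```

With β = 0.5 the hazard falls along the grid, with β = 1 it is constant (the exponential case), and with β = 2 it rises linearly.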
1.26 Burr (Type XII) and Lomax
Burr 1942. f(x) ∝ ckx^(c−1)/(1+x^c)^(k+1). Pareto generalisation. Singh-Maddala = Burr Type XII.
1.27 Logistic + Generalized Logistic
Symmetric, similar to normal but heavier tails. CDF F(x) = 1/(1+exp(−(x−μ)/s)). Logit link in GLM. Hosking 1990 generalized logistic in EVT.
1.28 Hyperbolic Secant / Half-Logistic
Niche.
1.29 Laplace + Asymmetric Laplace
f(x) = (1/(2b)) exp(−|x−μ|/b). Laplace 1774. Heavier-tailed than normal. The MLE of location is the sample median, which links it to L1 regression. Asymmetric Laplace = quantile regression likelihood (Yu + Moyeed 2001).
1.30 Hypoexponential + Hyperexponential
Sums and mixtures of exponentials. Hypoexponential = sum of independent Exp with possibly different rates. Hyperexponential = mixture (random selection of one Exp). Erlang is hypoexponential with equal rates.
1.31 Phase-Type
General class: absorbing-time of finite-state continuous-time Markov chain. Includes Erlang, hypoexponential, hyperexponential, Coxian. Neuts 1981. Insurance, queueing.
1.32 Exponentially Modified Gaussian (EMG)
Convolution of Gaussian and Exponential. Skewed-right peaks in chromatography, neural reaction times.
1.33 Voigt
Convolution of Gaussian + Lorentzian. Spectroscopy line profiles (Doppler + pressure broadening). Approximations: pseudo-Voigt (Thompson-Cox-Hastings).
1.34 Skew-Normal (Azzalini 1985)
f(x) = 2 φ(x) Φ(αx). Adds asymmetry to Gaussian. α = 0 ⇒ Normal.
1.35 Skew-t
Azzalini + Capitanio 2003. Heavy-tailed + asymmetric. Standard in robust finance modelling.
1.36 Generalized Error Distribution (GED) / Subbotin / Generalized Normal
f(x) ∝ exp(−(|x|/α)^β / β). β = 2 ⇒ Normal; β = 1 ⇒ Laplace; β → ∞ ⇒ Uniform. GARCH residuals option (Nelson 1991 EGARCH).
1.37 Tukey Lambda
Symmetric, parameter-shape-only; useful for Q-Q plot diagnosis of unknown distribution shape.
1.38 Power-law Tail (Pareto-type)
P(X > x) ∝ x^(−α). Newman 2005 review. Estimation: Hill estimator, Clauset-Shalizi-Newman 2009 ML method with KS-based xmin selection.
2. Continuous Multivariate
2.1 Multivariate Normal (MVN)
f(x) = (2π)^(−k/2) |Σ|^(−1/2) exp(−(1/2)(x−μ)’ Σ^(−1) (x−μ)). Σ positive-definite. Marginals + conditionals + linear transforms all Gaussian.
2.2 Multivariate t
f(x) ∝ (1 + (x−μ)’ Σ^(−1) (x−μ)/ν)^(−(ν+k)/2). Heavier-tailed. ν = 1 ⇒ multivariate Cauchy.
2.3 Dirichlet
Support open simplex Δ^(k−1). Conjugate to Multinomial. f(p) ∝ ∏ p_i^(α_i−1). Topic models (LDA), bayesian softmax priors.
2.4 Wishart
Wishart 1928. Distribution of S = X’X for X (n × p) i.i.d. N(0, Σ). Conjugate prior for inverse-covariance (precision matrix) in Bayesian multivariate normal.
2.5 Inverse-Wishart
Conjugate for covariance matrix Σ in Bayesian MVN. Critique (Gelman): induces correlation in priors; modern preference is LKJ prior on correlation matrices + separate scale priors (Lewandowski-Kurowicka-Joe 2009).
2.6 Matrix Normal
X (n × p) with separable row and column covariance: vec(X) ∼ N(vec(M), Σ_col ⊗ Σ_row), a Kronecker product, with vec stacking columns.
2.7 Matrix-variate t
2.8 Multivariate Skew-Normal (Azzalini-Capitanio 1999)
2.9 Copulas
See copulas-and-dependence. Sklar (1959) theorem: any joint distribution is marginals + copula. Gaussian, t, Archimedean (Clayton, Gumbel, Frank, Joe), Vine copulas (Bedford-Cooke). Risk modelling, hydrology, finance dependence.
2.10 LKJ Distribution (Correlation Matrices)
f(R) ∝ |R|^(η−1). Lewandowski, Kurowicka, Joe 2009. Standard in PyMC, Stan, NumPyro for correlation priors.
3. Discrete
3.1 Bernoulli
X ∈ {0,1}, p. Building block.
3.2 Binomial
Sum of n i.i.d. Bernoulli(p). Mean np, variance np(1−p).
3.3 Negative Binomial (Polya)
Number of failures before the r-th success in a Bernoulli sequence (some texts count total trials instead). Overdispersed Poisson; common in count regression (gene expression, insurance claim counts).
3.4 Geometric
Special case Negative Binomial r=1. Memoryless.
3.5 Hypergeometric
Sampling without replacement; K successes in population N, sample size n.
3.6 Poisson
Limit of Binomial as n → ∞ and p → 0 with np = λ fixed. Rare event count. Mean = variance = λ.
3.7 Generalized Poisson (Consul + Jain 1973)
Two-parameter; allows over- and underdispersion.
3.8 Skellam
Difference of two independent Poisson(λ₁), Poisson(λ₂). Sports scoring differences (Karlis + Ntzoufras).
3.9 Conway-Maxwell-Poisson (COM-Poisson)
Conway + Maxwell 1962. Adds dispersion parameter ν: underdispersion (ν > 1) and overdispersion (ν < 1) relative to Poisson (ν = 1).
3.10 Categorical / Multinoulli
Single-trial K-class.
3.11 Multinomial
Sum of n Categorical. Generalises Binomial to K classes.
3.12 Dirichlet-Multinomial (DM)
Multinomial with probabilities themselves drawn Dirichlet. Overdispersed multinomial (microbiome compositional data).
3.13 Beta-Binomial
Binomial with Beta-distributed p; overdispersed. Toxicology, clustered trials.
3.14 Zipf / Zipfian
P(X = k) ∝ 1/k^s. Zipf 1949 (word frequencies). Limit of size-biased sampling.
3.15 Yule-Simon
Preferential attachment (Simon 1955). Power-law tail.
3.16 Discrete Uniform
3.17 Logarithmic (Fisher series)
Fisher 1943. Insect species abundance.
3.18 Borel + Borel-Tanner
Branching process family sizes.
3.19 Riemann Zeta / Hurwitz Zeta
P(X = k) = k^(−s)/ζ(s). Probabilistic number theory.
3.20 Polya Urn
Discrete-time Markov chain modelling reinforcement; underlies Dirichlet process (de Finetti’s theorem for exchangeable sequences).
4. Mixtures + Bayesian Nonparametric
4.1 Gaussian Mixture Model (GMM)
∑ π_k N(μ_k, Σ_k). Fit via EM (Dempster-Laird-Rubin 1977) or VI; model selection via BIC/AIC/cross-val. scikit-learn: sklearn.mixture.GaussianMixture.
4.2 Mixture of t / Mixture of factor analyzers
Robust extensions; Ghahramani + Hinton MFA, McLachlan + Peel mixture of t.
4.3 Dirichlet Process (DP) — Ferguson 1973
Nonparametric prior over probability measures. DP(α, G_0). Stick-breaking construction (Sethuraman 1994), Pólya urn / CRP representation.
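A truncated version of the stick-breaking construction can be sketched as follows: draw β_k ∼ Beta(1, α) and set π_k = β_k ∏_{j<k}(1 − β_j). The truncation level `n_atoms` and the function name are assumptions for illustration; the exact construction uses infinitely many atoms, each paired with an independent draw from G_0.

```python
import random

def stick_breaking_weights(alpha: float, n_atoms: int, rng: random.Random):
    """Return the first n_atoms stick-breaking weights and the unbroken remainder."""
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        frac = rng.betavariate(1.0, alpha)   # fraction of the remaining stick to break off
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return weights, remaining

rng = random.Random(1)
weights, leftover = stick_breaking_weights(alpha=2.0, n_atoms=50, rng=rng)
```

Smaller α concentrates mass on the first few atoms; larger α spreads it out, so a higher truncation level is needed before the leftover mass is negligible.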
4.4 Chinese Restaurant Process (CRP)
Exchangeable partition probability function (EPPF). Aldous + Pitman. Customer k+1 joins existing table j with prob n_j/(α+k) or new table with prob α/(α+k).
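The seating rule can be simulated directly. In this sketch, k customers are already seated when the next one arrives, matching the probabilities n_j/(α+k) and α/(α+k) above; the function name is hypothetical.

```python
import random

def crp_partition(n_customers: int, alpha: float, rng: random.Random) -> list[int]:
    """Simulate table sizes under a Chinese Restaurant Process with concentration alpha."""
    tables: list[int] = []                   # tables[j] = customers at table j
    for k in range(n_customers):             # k customers already seated
        r = rng.random() * (alpha + k)       # uniform on [0, alpha + k)
        cumulative = 0.0
        for j, n_j in enumerate(tables):
            cumulative += n_j
            if r < cumulative:               # existing table j, probability n_j / (alpha + k)
                tables[j] += 1
                break
        else:                                # remaining slice of width alpha: new table
            tables.append(1)
    return tables

tables = crp_partition(1000, alpha=2.0, rng=random.Random(0))
```

The expected number of occupied tables after n customers is ∑_{i=0}^{n−1} α/(α+i), which grows roughly like α log n.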
4.5 Pitman-Yor Process (PYP)
Pitman 1995; two-parameter extension of DP (discount d ∈ [0, 1), concentration α). Power-law cluster sizes — natural language modelling (Teh 2006 HPYP n-gram).
4.6 Indian Buffet Process (IBP)
Griffiths + Ghahramani 2005. Latent feature model — infinite binary features per data point. Beta-Bernoulli stick-breaking.
4.7 Hierarchical Dirichlet Process (HDP)
Teh-Jordan-Beal-Blei 2006. Sharing across groups; HDP-LDA (nonparametric topic model).
4.8 Latent Dirichlet Allocation (LDA)
Blei-Ng-Jordan 2003. Topic model; Dirichlet priors over per-document topic distributions and per-topic word distributions. Variational + Gibbs samplers (Griffiths + Steyvers).
4.9 Mixture of Experts (MoE)
Jacobs et al 1991. Gating network selects expert subnetwork. Foundation of sparse-MoE LLMs (Mixtral 8×7B, DeepSeek-V3, GPT-4 rumored).
5. Survival + Reliability
5.1 Exponential — constant hazard; memoryless
5.2 Weibull — most popular parametric (see §1.25)
5.3 Log-normal — non-monotone hazard
5.4 Gompertz — log-hazard linear in time; biological aging
5.5 Gompertz-Makeham — Gompertz + constant baseline; actuarial mortality
5.6 Beta-survival
5.7 Generalized Gamma — encompasses Exp, Weibull, log-normal (Stacy 1962)
5.8 Generalized Weibull / Exponentiated Weibull (Mudholkar + Srivastava 1993)
5.9 Bathtub-curve models — Hjorth, modified Weibull (Lai-Xie-Murthy), additive Weibull
5.10 AFT (Accelerated Failure Time) models — log T = β’X + σε; ε from chosen base
5.11 Cox PH — semi-parametric: h(t|X) = h_0(t) exp(β’X). Cox 1972. Partial likelihood. Default in clinical biostatistics
5.12 Piecewise Exponential — flexible nonparametric baseline
5.13 Royston-Parmar / Flexible Parametric — splines on log cumulative hazard
5.14 Kaplan-Meier — nonparametric survival estimator
5.15 Nelson-Aalen — nonparametric cumulative hazard estimator
Tools: R survival, flexsurv, rms (Frank Harrell), Python lifelines (Davidson-Pilon), scikit-survival, PyMC for Bayesian survival.
6. Heavy-Tailed Families (Finance + Risk)
6.1 α-Stable (§1.16)
Mandelbrot 1963 cotton prices; Fama 1965 stock returns. Heavy tails + stability under sums.
6.2 Generalized Hyperbolic + sub-families (§1.17–1.19)
6.3 Jump-Diffusion (Merton 1976)
dS_t / S_t = μ dt + σ dW_t + (Y − 1) dN_t. N_t Poisson jump count, Y log-normal jump size.
6.4 Kou Double-Exponential Jump-Diffusion (2002)
Asymmetric Laplace jump-size; analytical option pricing.
6.5 Stochastic Volatility — Heston (1993)
dS = μ S dt + √v S dW₁, dv = κ(θ − v)dt + σ_v √v dW₂, corr(dW₁, dW₂) = ρ. Characteristic function available in closed form; option prices follow by Fourier inversion, e.g. the Carr-Madan FFT method.
6.6 SABR (Hagan, Kumar, Lesniewski, Woodward 2002)
dF = α F^β dW₁, dα = ν α dW₂, corr = ρ. Volatility surface in fixed-income.
6.7 Rough Volatility — Gatheral-Jaisson-Rosenbaum 2018
Fractional Brownian motion with Hurst H ≈ 0.1. Volatility is “rougher” than diffusive (H = 0.5). rBergomi popular.
6.8 CGMY / Variance Gamma — pure-jump Lévy (Carr-Geman-Madan-Yor 2002)
6.9 Tempered Stable — α-stable with exponentially dampened tails for finite moments
6.10 Subordinated Brownian Motion — general framework: B(τ(t)) with random time change τ
7. Circular / Directional Statistics
7.1 von Mises
Circular analogue of normal on the circle [0, 2π). f(θ) ∝ exp(κ cos(θ − μ)). von Mises 1918. Wind direction, oriented data. Mardia + Jupp 2000 textbook.
7.2 Wrapped Normal, Wrapped Cauchy, Wrapped Lévy
“Wrap” a real-line distribution onto circle by summing over translates.
7.3 von Mises-Fisher (vMF)
S^(p−1) analogue of von Mises. Directional clustering (Banerjee-Dhillon-Ghosh-Sra 2005). Used in spherical text embeddings, hyperspherical normalisation in deep learning (CosFace, ArcFace).
7.4 Bingham
Antipodally symmetric; orientation data without sign.
7.5 Matrix Bingham + Fisher-Bingham
Distributions on Stiefel manifold (orthonormal matrices). Bayesian PCA orientation, cryo-EM reconstructions.
7.6 Watson
Axial data analogue of vMF.
7.7 Kent
Generalisation of vMF on sphere with elliptical concentration (5 parameters); geology orientation data.
8. Empirical / Nonparametric
8.1 Empirical Distribution Function
F̂_n(x) = (1/n) ∑ I(X_i ≤ x). Glivenko-Cantelli theorem.
8.2 Bootstrap (Efron 1979)
Resampling with replacement from empirical distribution. CI + standard errors. Nonparametric, parametric, smoothed, bootstrap-t, BCa, double bootstrap.
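A minimal sketch of the nonparametric percentile bootstrap, here for the median of a simulated exponential sample. The function name, defaults, and data are illustrative assumptions; the percentile interval is only the simplest of the variants listed above.

```python
import random
import statistics

def bootstrap_percentile_ci(data, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return reps[lo_idx], reps[hi_idx]

sample_rng = random.Random(42)
sample = [sample_rng.expovariate(1.0) for _ in range(200)]
ci_low, ci_high = bootstrap_percentile_ci(sample, statistics.median)
```

For skewed statistics or small samples, BCa or bootstrap-t intervals usually have better coverage than the plain percentile interval.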
8.3 Kernel Density Estimation (KDE)
f̂(x) = (1/(nh)) ∑ K((x − X_i)/h). Silverman 1986. Bandwidth selection: Silverman’s rule, Scott’s rule, cross-val, plug-in.
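The estimator can be written in a few lines with a Gaussian kernel. This sketch assumes the common form of Silverman's rule, h = 0.9 · min(sd, IQR/1.34) · n^(−1/5); the function names are illustrative.

```python
import math
import random
import statistics

def silverman_bandwidth(data: list[float]) -> float:
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 0.9 * min(sd, (q3 - q1) / 1.34) * n ** (-0.2)

def gaussian_kde(data: list[float], h: float):
    """Return a function x -> (1/(n h)) * sum K((x - X_i)/h) with a Gaussian K."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    def density(x: float) -> float:
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return density

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]
h = silverman_bandwidth(data)
density = gaussian_kde(data, h)
```

The rule of thumb is tuned for roughly Gaussian data and tends to oversmooth multimodal densities, where cross-validation or plug-in selectors are safer.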
8.4 Spline Density
P-splines (Eilers + Marx 1996), penalised B-splines.
8.5 Histogram + Bayesian Histogram
8.6 Bayesian Nonparametric Density (DPM, BNP)
9. Extreme Value Theory (EVT)
9.1 Fisher-Tippett-Gnedenko Theorem
Block maxima → GEV (only three possible limit types: Gumbel, Fréchet, reversed Weibull).
9.2 Pickands-Balkema-de Haan Theorem
Threshold exceedances → GPD.
9.3 Hill Estimator
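ξ̂ = 1/α̂ = (1/k) ∑_{i=1}^k log(X_{(n−i+1)}/X_{(n−k)}), where α is the tail index, ξ = 1/α the extreme-value shape, and X_{(1)} ≤ … ≤ X_{(n)} are the order statistics. Bias-variance tradeoff in k.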
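A sketch of the Hill estimator on simulated Pareto data. The average of log-spacings over the top k order statistics estimates the reciprocal of the tail index, so the function returns both quantities; the names and the simulation settings are illustrative assumptions.

```python
import math
import random

def hill_estimator(data: list[float], k: int) -> tuple[float, float]:
    """Return (xi_hat, alpha_hat) from the k largest observations; alpha_hat = 1 / xi_hat."""
    xs = sorted(data)
    n = len(xs)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    threshold = xs[n - k - 1]                 # X_(n-k) in 1-based order-statistic notation
    xi_hat = sum(math.log(x / threshold) for x in xs[n - k:]) / k
    return xi_hat, 1.0 / xi_hat

# Exact Pareto(alpha = 3, x_m = 1) sample by inverse-CDF: X = x_m * U^(-1/alpha).
rng = random.Random(0)
alpha_true = 3.0
data = [(1.0 - rng.random()) ** (-1.0 / alpha_true) for _ in range(20_000)]
xi_hat, alpha_hat = hill_estimator(data, k=2000)
```

On exact Pareto data any k works; on real data only the tail is Pareto-like, so the estimate is usually plotted against k to look for a stable region.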
9.4 Pickands Estimator
Estimator for ξ (shape parameter) of GPD/GEV.
9.5 Return Levels
x_p = μ + [σ/ξ]((−log(1−p))^(−ξ) − 1). “100-year flood” = level exceeded with prob 0.01 in any year (return period 100 years).
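The return-level formula is the GEV quantile at probability 1 − p, which can be verified by plugging it back into the CDF. The sketch below uses made-up parameter values and illustrative function names, and treats |ξ| below a small threshold as the Gumbel case.

```python
import math

def gev_cdf(x: float, mu: float, sigma: float, xi: float) -> float:
    """GEV CDF, with the Gumbel form used when xi is (numerically) zero."""
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-(x - mu) / sigma))
    z = 1.0 + xi * (x - mu) / sigma
    if z <= 0.0:                     # outside the support
        return 0.0 if xi > 0 else 1.0
    return math.exp(-z ** (-1.0 / xi))

def gev_return_level(p: float, mu: float, sigma: float, xi: float) -> float:
    """Level exceeded with probability p per block (return period 1/p blocks)."""
    y = -math.log(1.0 - p)
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)
    return mu + (sigma / xi) * (y ** (-xi) - 1.0)

mu, sigma, xi = 30.0, 5.0, 0.1       # illustrative annual-maximum parameters
x_100 = gev_return_level(0.01, mu, sigma, xi)
check = gev_cdf(x_100, mu, sigma, xi)   # should equal 1 - p = 0.99
```

Because the formula extrapolates beyond the data for long return periods, the estimate is very sensitive to ξ, and interval estimates (e.g. profile likelihood) are normally reported with it.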
9.6 Multivariate EVT
Spectral measure on simplex; max-stable processes (Brown-Resnick, Schlather, Smith).
9.7 VaR + Expected Shortfall + Cornish-Fisher
Quantile-based risk; ES = E[X | X > VaR]. Cornish-Fisher expansion approximates quantiles with skew + kurt correction.
Software: R evd, evir, extRemes, ismev, POT; Python pyextremes.
10. Tests + Diagnostics
10.1 Goodness-of-Fit
- Q-Q plot — visual
- P-P plot — visual
- Anderson-Darling — emphasises tails (weighted Cramér-von Mises)
- Kolmogorov-Smirnov — sup |F̂ − F|; less powerful than AD
- Cramér-von Mises — integrated squared difference
- Shapiro-Wilk — normality (small n)
- Jarque-Bera — moment-based normality test
- Lilliefors — KS with estimated parameters
- D’Agostino-Pearson — skew + kurtosis combined
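As a concrete instance of the sup |F̂ − F| statistic, the sketch below computes the one-sample Kolmogorov-Smirnov distance against a fully specified CDF. The function name is illustrative, and no p-value is computed; standard KS p-values assume the reference parameters were not estimated from the same data.

```python
import random
from statistics import NormalDist

def ks_statistic(data: list[float], cdf) -> float:
    """D_n = sup_x |F_hat_n(x) - F(x)|, evaluated at the jumps of the empirical CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

rng = random.Random(0)
std_normal = NormalDist(0.0, 1.0)
good = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]
d_good = ks_statistic(good, std_normal.cdf)        # small: sample matches reference
d_shifted = ks_statistic(shifted, std_normal.cdf)  # large: location is off by 1
```

The same loop structure, with a different weighting of the discrepancy, underlies the Cramér-von Mises and Anderson-Darling statistics.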
10.2 Two-Sample
- KS two-sample
- Anderson-Darling k-sample
- Energy distance (Székely + Rizzo 2004)
- Maximum Mean Discrepancy (MMD) (Gretton et al 2007) — kernel-based; widely used in ML
- Permutation tests
10.3 Discrepancies
- Wasserstein W_p — optimal transport
- Sinkhorn divergence — entropy-regularised OT (Cuturi 2013)
- KL divergence — Kullback-Leibler 1951
- JS divergence — symmetrised KL
- f-divergences — Csiszár class
- Hellinger distance — bounded [0, 1]
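In one dimension the Wasserstein distance between two empirical distributions with equal sample sizes has a closed form: sort both samples and average the p-th powers of the paired differences. The sketch below assumes equal sizes for simplicity; the function name is hypothetical.

```python
def wasserstein_1d(xs: list[float], ys: list[float], p: int = 1) -> float:
    """W_p between two equal-size empirical distributions on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(xs), sorted(ys)
    # Optimal transport in 1D pairs the i-th smallest x with the i-th smallest y.
    cost = sum(abs(x - y) ** p for x, y in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)

shift_w1 = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0], p=1)
shift_w2 = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0], p=2)
```

A pure location shift by c gives W_p = |c| for every p, which is why the distance stays informative even when the two samples share no support. In higher dimensions no sorting shortcut exists, which motivates entropy-regularised solvers.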
11. Software Locator
- Python scipy.stats — 100+ continuous, 20+ discrete; rvs/pdf/pmf/cdf/ppf/sf/isf/logpdf/…
- statsmodels — fit + tests + diagnostics
- scikit-learn mixture — GMM, BayesianGaussianMixture (DP-GMM via VI)
- PyMC — Bayesian; rich distribution library
- NumPyro + Distrax + TensorFlow Probability — JAX/TF stacks
- PyTorch torch.distributions — autograd-compatible
- hmmlearn, pomegranate — HMM + general Bayesian
- R stats + MASS — base + classical
- R fitdistrplus — fitting any continuous + discrete with diagnostics
- R EnvStats — environmental statistics distributions (Censored)
- R evd, evir, extRemes, ismev, POT — EVT
- R copula, VineCopula — copula models
- R survival, flexsurv, rms — survival
- Python bnpy; R dirichletprocess — BNP
- Stan / RStan / CmdStanR — Bayesian HMC; complete distribution language
- Julia Distributions.jl — type-stable, extensive
12. Common Pitfalls
- Heavy-tailed data: sample mean + standard deviation may not exist or converge slowly; use trimmed mean, median, MAD instead
- Q-Q plot ≠ proof of distribution; only suggests
- KS test with parameters estimated from the same data gives conservative p-values and loses power (use Lilliefors for normality); for heavy tails use Anderson-Darling
- Bootstrap CIs are not exact; coverage depends on regularity and tail behaviour
- Hill estimator is sensitive to threshold k; plot tail-index vs k and pick stable plateau
- Conjugate priors are convenient but can be informative; check sensitivity
- Inverse-Wishart on covariance ⇒ correlated priors on variance and correlation — prefer LKJ + scale priors
- Discrete distributions in continuous-data toolkits: scipy.stats exposes pmf rather than pdf for them, so code written against continuous distributions needs adapting
- Mixed continuous-discrete (zero-inflated count models, censored response) — use Tobit, hurdle, ZIP/ZINB, Tweedie compound Poisson-Gamma
- Sampling from heavy-tailed distributions in MCMC: HMC step-size adaptation struggles; reparameterise (centred ⇄ non-centred) or use NUTS warm-up + step-size jitter
- Multivariate distributions: number of parameters grows O(p²) for covariance; structured priors (LKJ, Cholesky, factor-analytic) essential at p > 20
13. Exponential Family + GLM Backbone
Many of the above belong to the natural exponential family, which determines the canonical GLM (Generalized Linear Model — Nelder + Wedderburn 1972):
| Distribution | Link (canonical) | Variance function V(μ) | GLM use |
|---|---|---|---|
| Normal | identity | 1 | OLS regression |
| Bernoulli / Binomial | logit | μ(1−μ) | Logistic regression |
| Poisson | log | μ | Count regression |
| Negative Binomial | log (conventional; not the canonical link) | μ + μ²/k | Overdispersed counts |
| Gamma | inverse | μ² | Positive continuous |
| Inverse Gaussian | inverse-square | μ³ | Failure times |
| Multinomial | logit (softmax) | diag(μ) − μμ’ | Categorical regression |
| Tweedie | power | μ^p | Insurance claims, semi-continuous |
GLMs unify these; MLE via IRLS (iteratively reweighted least squares). Extensions: GLMM (random effects), GAM (smooth terms, Wood 2017), Bayesian GLM (brms, rstanarm).
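To make the IRLS step concrete, here is a minimal sketch for Poisson regression with log link, one covariate, and an intercept. Each iteration forms weights w = μ and working response z = η + (y − μ)/μ, then solves the 2×2 weighted least-squares system. The function name and the noise-free test data are assumptions for illustration; production code would use a library fitter.

```python
import math

def irls_poisson(x, y, n_iter=25, tol=1e-10):
    """Fit log E[y] = b0 + b1 * x by iteratively reweighted least squares."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0      # start from the intercept-only fit
    for _ in range(n_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            w = mu                               # (dmu/deta)^2 / V(mu) = mu^2 / mu
            z = eta + (yi - mu) / mu             # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det   # solve the 2x2 normal equations
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [math.exp(0.3 + 0.8 * xi) for xi in x]       # noise-free means, for checking only
b0_hat, b1_hat = irls_poisson(x, y)
```

With responses set exactly to the model means, the score equations are solved at the generating coefficients, so the fit should recover (0.3, 0.8). For other families only the weight and working-response lines change, following the variance functions in the table.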
14. Process-Level Distributions (Time-Indexed)
Generalising static distributions to stochastic processes:
14.1 Gaussian Processes
- Mean function m(x) + kernel k(x, x’)
- Common kernels: squared exponential (RBF), Matérn (ν = 1/2 exponential, ν = 3/2, 5/2, ∞ RBF), periodic, linear, spectral mixture
- Bayesian regression, optimisation (BO; Mockus 1989, Snoek-Larochelle-Adams 2012 spearmint)
- Tools: GPy, GPflow, GPyTorch, BoTorch, PyMC GP module
14.2 Poisson Process + Cox Process
- Homogeneous PP: events with rate λ; inter-arrival ∼ Exp(λ)
- Inhomogeneous PP: rate λ(t)
- Cox (doubly stochastic) Process: λ(t) is itself random (e.g., log-Gaussian Cox process for spatial point patterns; Møller-Syversveen-Waagepetersen 1998)
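The first two items above can be simulated with a few lines: exponential inter-arrival times for the homogeneous case, and thinning of a dominating homogeneous process for the inhomogeneous case. The function names and the example rate function are assumptions; thinning requires rate_fn(t) ≤ rate_max on the whole horizon.

```python
import math
import random

def simulate_homogeneous_pp(rate: float, horizon: float, rng: random.Random) -> list[float]:
    """Event times on [0, horizon] built from Exp(rate) inter-arrival times."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return times
        times.append(t)

def simulate_inhomogeneous_pp(rate_fn, rate_max: float, horizon: float, rng: random.Random):
    """Thinning: keep each candidate at time t with probability rate_fn(t) / rate_max."""
    candidates = simulate_homogeneous_pp(rate_max, horizon, rng)
    return [t for t in candidates if rng.random() * rate_max < rate_fn(t)]

rng = random.Random(0)
events = simulate_homogeneous_pp(rate=5.0, horizon=1000.0, rng=rng)
wave = simulate_inhomogeneous_pp(lambda t: 2.0 + 2.0 * math.sin(t), 4.0, 1000.0, rng)
```

A Cox process adds one more layer: first draw a random rate function, then run the inhomogeneous simulation conditional on it.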
14.3 Hawkes Process (self-exciting)
Conditional intensity λ*(t) = μ + ∑_{t_i < t} α exp(−β(t − t_i)). Hawkes 1971. Seismology aftershocks, finance order arrivals, contagion.
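The conditional intensity is straightforward to evaluate given an event history. The sketch below also computes the branching ratio α/β, the integral of the exponential kernel, which must be below 1 for a stationary process with mean rate μ/(1 − α/β). Function names and parameter values are illustrative.

```python
import math

def hawkes_intensity(t: float, events: list[float], mu: float, alpha: float, beta: float) -> float:
    """lambda*(t) = mu + sum over past events of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)

def branching_ratio(alpha: float, beta: float) -> float:
    """Expected number of direct offspring per event (integral of the kernel)."""
    return alpha / beta

mu, alpha, beta = 0.5, 0.8, 1.2
history = [1.0, 1.5, 3.0]
just_after = hawkes_intensity(3.0001, history, mu, alpha, beta)   # spike after an event
much_later = hawkes_intensity(30.0, history, mu, alpha, beta)     # decays back towards mu
ratio = branching_ratio(alpha, beta)
stationary_rate = mu / (1.0 - ratio)
```

Each event raises the intensity by α, and the excitation decays at rate β, which is what produces clustered arrivals.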
14.4 Markov Chains + HMM
- Discrete-time / continuous-time
- Hidden Markov Model — emissions conditional on hidden state; Baum-Welch + Viterbi
- LDS / Kalman filter — linear-Gaussian HMM
- Tools: hmmlearn, pomegranate, pyhsmm, edward, NumPyro
14.5 Lévy Processes
- Independent + stationary increments; càdlàg paths
- Brownian motion (continuous), Poisson, compound Poisson, Gamma process, variance gamma, NIG, CGMY, α-stable
- Lévy-Khintchine triplet (drift, Gaussian volatility, jump measure)
14.6 Diffusions + SDEs
dX_t = μ(X_t, t) dt + σ(X_t, t) dW_t. Itô + Stratonovich calculi.
- Ornstein-Uhlenbeck (mean-reverting Gaussian) — interest rates Vasicek
- CIR (Cox-Ingersoll-Ross) — non-negative square-root process; short rates
- Heston volatility (§6.5)
- SABR (§6.6)
14.7 Fractional + Rough Processes
- Fractional Brownian motion fBm(H) — H Hurst exponent
- Rough volatility H ≈ 0.1 (Gatheral et al)
15. Distance + Divergence Reference
Beyond the goodness-of-fit tests in §10, key divergences used in modern probabilistic ML:
- KL D_KL(P‖Q) = ∫ p log(p/q) — variational inference, mutual information, ELBO
- Reverse-KL D_KL(Q‖P) — mode-seeking; VAE encoder objective
- JS = (KL(P‖M) + KL(Q‖M))/2, M=(P+Q)/2 — symmetric, bounded
- Wasserstein W_p — cost of moving mass; stays informative when supports do not overlap; basis of WGAN (Arjovsky-Chintala-Bottou 2017); Sinkhorn approximation O(n² log n) (Cuturi 2013)
- Hellinger H²(P,Q) = (1/2)∫(√p − √q)²
- Total Variation TV(P,Q) = (1/2) ∫ |p − q|
- MMD ‖μ_P − μ_Q‖_H_k — kernel two-sample (Gretton 2007); used in autoencoder testing
- Energy distance — characteristic kernel MMD with negative-distance kernel
- Stein Discrepancy + KSD — sample-vs-distribution goodness-of-fit (Gorham + Mackey 2017); used in posterior sampler diagnostics
- Bregman divergences — generalisation; KL is the Bregman divergence induced by negative entropy
- α-divergences (Rényi, Tsallis) — interpolate KL ↔ Hellinger
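For discrete distributions on a common finite support, several of the divergences above reduce to short sums. The sketch below follows the definitions in this list; the function names are illustrative, and inputs are assumed to be probability vectors of equal length.

```python
import math

def kl(p: list[float], q: list[float]) -> float:
    """D_KL(P || Q) = sum p log(p/q); infinite if P puts mass where Q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def js(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence with mixture M = (P + Q) / 2; bounded by log 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def total_variation(p: list[float], q: list[float]) -> float:
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def hellinger_sq(p: list[float], q: list[float]) -> float:
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

p, q = [0.5, 0.5], [0.9, 0.1]
forward, reverse = kl(p, q), kl(q, p)      # KL is asymmetric
```

On disjoint supports KL is infinite while JS, total variation, and squared Hellinger all reach their finite maxima, which is one practical reason to prefer the bounded divergences for comparing samples.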
16. Reference Texts
- Johnson, Kotz, Balakrishnan — Continuous Univariate Distributions Vols 1 + 2 (Wiley)
- Johnson, Kemp, Kotz — Univariate Discrete Distributions
- Kotz, Balakrishnan, Johnson — Continuous Multivariate Distributions
- Coles — An Introduction to Statistical Modeling of Extreme Values (2001)
- Embrechts, Klüppelberg, Mikosch — Modelling Extremal Events (1997)
- Mardia + Jupp — Directional Statistics (2000)
- McLachlan + Peel — Finite Mixture Models (2000)
- Klein + Moeschberger — Survival Analysis (Springer)
- Therneau + Grambsch — Modeling Survival Data: Extending the Cox Model
- Cont + Tankov — Financial Modelling with Jump Processes (2003)
- Hogg + McKean + Craig — Introduction to Mathematical Statistics