Probability Distribution Zoo

Comprehensive catalog of 60+ probability distributions: their PMFs/PDFs, moments, MGFs, parameter relationships, applications, conjugate priors, and family relationships (transformations, limits, compoundings, mixtures).

Tier 3 family index. Cousin of [[Math/probability-distributions]] (Tier 1 conceptual). Read Tier 1 for “what is a distribution”; read this for “which one do I use and how does it relate to the others”.

Notation: support S, parameters listed, density f (continuous) or PMF p (discrete), mean μ = E[X], variance σ² = Var[X], MGF M(t) = E[exp(tX)], characteristic function φ(t) = E[exp(itX)].

1. Discrete distributions

1.1 Bernoulli(p)

Support: {0, 1}
PMF: p(x) = pˣ (1-p)^(1-x), p ∈ [0,1]
Mean: p · Var: p(1-p) · MGF: 1 - p + p·eᵗ
Use: single binary trial; coin flip; success/failure indicator.
Conjugate prior: Beta.
Relationship: Binomial(1, p) = Bernoulli(p). Sum of n iid Bernoulli(p) is Binomial(n, p).
Originator: Jacob Bernoulli (1713, Ars Conjectandi).

1.2 Binomial(n, p)

Support: {0, 1, …, n}
PMF: p(k) = C(n,k) pᵏ (1-p)^(n-k)
Mean: np · Var: np(1-p) · MGF: (1 - p + p·eᵗ)ⁿ
Use: number of successes in n independent Bernoulli trials.
Conjugate prior (for p, n fixed): Beta.
Relationship: limit Binomial(n, λ/n) → Poisson(λ) as n → ∞ (Poisson limit theorem). Mixture over Beta gives Beta-binomial. Standardized → Normal as n → ∞ (de Moivre-Laplace).
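
The Poisson limit can be checked numerically. The sketch below is illustrative only: the helper names and the truncation point kmax are assumptions, not part of this note. It computes the total variation distance between Binomial(n, λ/n) and Poisson(λ) for growing n.

```python
import math

def binom_pmf(k, n, p):
    # Binomial(n, p) probability of exactly k successes
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    # Poisson(lam) probability of exactly k events
    return math.exp(-lam) * lam**k / math.factorial(k)

def tv_distance(n, lam, kmax=60):
    # Total variation distance, truncated at kmax (assumed large enough
    # that the neglected tail mass is negligible for small lam)
    p = lam / n
    total = 0.0
    for k in range(kmax + 1):
        b = binom_pmf(k, n, p) if k <= n else 0.0
        total += abs(b - poisson_pmf(k, lam))
    return 0.5 * total

lam = 3.0
distances = [tv_distance(n, lam) for n in (10, 100, 1000)]
```

The distances should shrink as n grows, consistent with the limit theorem.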

1.3 Geometric(p) — number of trials until first success

Support: {1, 2, 3, …} (or {0, 1, 2, …} for “failures before success” variant)
PMF: p(k) = (1-p)^(k-1) p (1-indexed)
Mean: 1/p · Var: (1-p)/p² · MGF: p eᵗ / (1 - (1-p)eᵗ) for t < -log(1-p)
Use: waiting time (discrete) until first success.
Memoryless property (unique among discrete distributions): P(X > m+n | X > m) = P(X > n).
Relationship: Geometric(p) = NegBin(1, p). Continuous analog: Exponential.

1.4 Negative binomial NB(r, p)

Support: {r, r+1, …} (trials variant) or {0, 1, 2, …} (failures variant)
PMF (failures): p(k) = C(k + r - 1, k) (1-p)ᵏ pʳ
Mean (failures): r(1-p)/p · Var: r(1-p)/p² · MGF: (p / (1 - (1-p)eᵗ))ʳ
Use: number of failures before r-th success; overdispersed counts.
Relationship: Poisson-Gamma mixture: if X | λ ~ Poisson(λ) and λ ~ Gamma(shape r, rate p/(1-p)), then X ~ NB(r, p) (failures variant). Generalizes geometric (r = 1). Non-integer r > 0 is allowed (Pólya distribution).
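
A numerical check of the Poisson-Gamma mixture, as a rough sketch. The midpoint-rule quadrature, its upper limit and step count, and all function names are assumptions chosen for illustration. The Gamma uses shape r and rate p/(1-p), and the NB PMF is the failures variant.

```python
import math

def nb_pmf(k, r, p):
    # NB(r, p), failures variant, valid for real r > 0
    log_coef = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coef + k * math.log(1 - p) + r * math.log(p))

def gamma_pdf(x, shape, rate):
    return math.exp(shape * math.log(rate) + (shape - 1) * math.log(x)
                    - rate * x - math.lgamma(shape))

def poisson_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def mixture_pmf(k, r, p, upper=60.0, steps=60_000):
    # Midpoint rule for the integral of Poisson(k | lam) * Gamma(lam) over lam
    rate = p / (1 - p)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += poisson_pmf(k, lam) * gamma_pdf(lam, r, rate)
    return total * h

r, p = 2.5, 0.4
max_gap = max(abs(mixture_pmf(k, r, p) - nb_pmf(k, r, p)) for k in range(6))
```

The gap max_gap should be at the level of quadrature error if the two parameterizations line up.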

1.5 Hypergeometric(N, K, n)

Support: {max(0, n-N+K), …, min(n, K)}
PMF: p(k) = C(K,k) C(N-K, n-k) / C(N, n)
Mean: nK/N · Var: n(K/N)(1-K/N)(N-n)/(N-1)
Use: sampling without replacement from a finite population (K “successes” among N).
Relationship: limit N, K → ∞ with K/N → p: Hyper → Binomial(n, p). Multivariate version: multivariate hypergeometric.

1.6 Poisson(λ)

Support: {0, 1, 2, …}
PMF: p(k) = e^(-λ) λᵏ / k!, λ > 0
Mean: λ · Var: λ · MGF: exp(λ(eᵗ - 1)) · Char. fn: exp(λ(e^(it) - 1))
Use: counts of rare events in fixed interval; arrivals; defects.
Conjugate prior: Gamma.
Equidispersion: mean = variance. Departure signals overdispersion (use NB) or underdispersion (use Conway-Maxwell-Poisson).
Relationship: sum of independent Poissons is Poisson. Difference is Skellam. Limit of Binomial(n, λ/n).
Originator: Siméon Denis Poisson (1837).

1.7 Categorical(p₁, …, pₖ)

Support: {1, …, K} (or one-hot vectors)
PMF: p(x = i) = pᵢ, ∑pᵢ = 1
Use: single multi-class outcome (generalization of Bernoulli).
Conjugate prior: Dirichlet.

1.8 Multinomial(n, p₁, …, pₖ)

Support: {(n₁, …, nₖ) : ∑nᵢ = n, nᵢ ≥ 0}
PMF: p(n₁,…,nₖ) = n! / (n₁! … nₖ!) ∏ pᵢ^(nᵢ)
Mean: E[nᵢ] = n pᵢ · Cov: Cov(nᵢ, nⱼ) = -n pᵢ pⱼ (i ≠ j); Var(nᵢ) = n pᵢ(1-pᵢ)
Use: counts across K categories in n independent trials.
Conjugate prior: Dirichlet.

1.9 Multivariate hypergeometric

Support: (k₁, …, kₘ) with ∑kᵢ = n, kᵢ ≤ Kᵢ
PMF: p(k) = ∏ C(Kᵢ, kᵢ) / C(N, n) where N = ∑Kᵢ
Use: draw n from urn with categories of sizes K₁, …, Kₘ without replacement.

1.10 Discrete uniform({a, …, b})

Support: {a, a+1, …, b}, N = b - a + 1
PMF: 1/N
Mean: (a+b)/2 · Var: (N²-1)/12
Use: maximum-entropy distribution on a finite set; randomization.

1.11 Zipf(s, N) — Zipfian / discrete Pareto

Support: {1, 2, …, N} (or N → ∞ if s > 1)
PMF: p(k) = k^(-s) / H_{N,s} where H_{N,s} = ∑_{i=1}^N i^(-s) (generalized harmonic number)
Mean (finite N): H_{N,s-1} / H_{N,s}; diverges for s ≤ 2 as N → ∞.
Use: rank-frequency in natural language (Zipf’s law), city sizes, citation counts, file-size distributions.
Originator: George Kingsley Zipf (1949).
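
A small sketch of the finite-N Zipf PMF and its mean, assuming the generalized harmonic number is computed by direct summation. The function names are illustrative.

```python
def generalized_harmonic(n, s):
    # H_{n,s} = sum_{i=1}^{n} i^(-s)
    return sum(i ** (-s) for i in range(1, n + 1))

def zipf_pmf(s, n):
    # Probabilities for ranks 1..n
    h = generalized_harmonic(n, s)
    return [k ** (-s) / h for k in range(1, n + 1)]

s, n = 2.0, 1000
pmf = zipf_pmf(s, n)
mean_from_pmf = sum(k * q for k, q in enumerate(pmf, start=1))
mean_from_formula = generalized_harmonic(n, s - 1) / generalized_harmonic(n, s)
```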

1.12 Yule-Simon(ρ)

Support: {1, 2, 3, …}
PMF: p(k) = ρ · B(k, ρ + 1) where B is Beta function
Mean: ρ/(ρ-1) for ρ > 1; Var: ρ²/((ρ-1)²(ρ-2)) for ρ > 2
Use: preferential attachment models; gene families; word frequencies.
Originator: G. Udny Yule (1925), formalized by Herbert Simon (1955).

1.13 Discrete Pareto (Pareto-II discrete / zeta-shift)

Support: {0, 1, 2, …}
PMF: p(k) ∝ (k + α)^(-s-1) (various parameterizations)
Use: heavy-tailed discrete data with a shift; alternative to Zipf.

1.14 Beta-binomial(n, α, β)

Support: {0, 1, …, n}
PMF: p(k) = C(n,k) B(k+α, n-k+β) / B(α, β)
Mean: nα/(α+β) · Var: nαβ(α+β+n) / ((α+β)²(α+β+1))
Use: overdispersed binomial; clustered Bernoulli trials.
Relationship: marginal of Binomial(n, p) when p ~ Beta(α, β). As α, β → ∞ with fixed α/(α+β): → Binomial.
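
The PMF and moments above can be evaluated stably in log space. This is a minimal sketch with assumed helper names; it uses math.lgamma for both the binomial coefficient and the Beta function.

```python
import math

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binom_pmf(k, n, a, b):
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + log_beta(k + a, n - k + b) - log_beta(a, b))

n, a, b = 20, 2.0, 5.0
pmf = [beta_binom_pmf(k, n, a, b) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

# Closed forms from this entry, and the plain binomial variance for comparison
mean_formula = n * a / (a + b)
var_formula = n * a * b * (a + b + n) / ((a + b) ** 2 * (a + b + 1))
binom_var = n * (a / (a + b)) * (b / (a + b))
```

Comparing var with binom_var shows the overdispersion relative to Binomial(n, α/(α+β)).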

1.15 Pólya-Eggenberger urn

Support: depends on urn dynamics.
Description: draw a ball, replace it, and add c more of the same color. Generalizes hypergeometric (c = -1), binomial (c = 0), and beta-binomial (c > 0: with initial counts a and b, the number of successes in n draws is Beta-binomial(n, a/c, b/c)).
Use: contagion models; reinforced random processes.

1.16 Gibbs / Boltzmann

Support: state space S (often combinatorial).
PMF: p(x) = exp(-E(x)/T) / Z, Z = ∑ exp(-E(x)/T)
Use: equilibrium statistical mechanics; energy-based models; simulated annealing target.

1.17 Logarithmic series Log(p)

Support: {1, 2, 3, …}
PMF: p(k) = -pᵏ / (k log(1-p)), 0 < p < 1
Mean: -p / ((1-p) log(1-p))
Use: species abundance (Fisher, 1943); building block for NB: a Poisson-distributed number of iid logarithmic variables sums to a negative binomial (compound Poisson representation).

1.18 Skellam(μ₁, μ₂) — difference of two Poissons

Support: ℤ
PMF: p(k) = e^(-(μ₁+μ₂)) (μ₁/μ₂)^(k/2) I_|k|(2√(μ₁μ₂)) (I = modified Bessel)
Mean: μ₁ - μ₂ · Var: μ₁ + μ₂
Use: difference of two independent Poisson counts (e.g., soccer score differences).

1.19 Conway-Maxwell-Poisson CMP(λ, ν)

Support: {0, 1, 2, …}
PMF: p(k) = λᵏ / ((k!)^ν Z(λ, ν)) where Z(λ, ν) = ∑ λʲ / (j!)^ν
Use: counts with flexible dispersion: ν > 1 → underdispersed (subsumes Bernoulli as ν → ∞); ν = 1 → Poisson; ν < 1 → overdispersed (subsumes geometric at ν = 0, which requires λ < 1 for Z to converge).
Originator: Conway & Maxwell (1962).
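
Z(λ, ν) has no closed form in general, so a common workaround is to truncate the series. The sketch below assumes a fixed truncation kmax (an illustrative choice that is adequate only when the tail is negligible) and uses hypothetical helper names.

```python
import math

def cmp_pmf(lam, nu, kmax=200):
    # log of the unnormalized weights lam^k / (k!)^nu
    log_w = [k * math.log(lam) - nu * math.lgamma(k + 1) for k in range(kmax + 1)]
    shift = max(log_w)                      # subtract the max for stability
    w = [math.exp(v - shift) for v in log_w]
    z = sum(w)                              # truncated normalizer (rescaled)
    return [v / z for v in w]

def mean_and_variance(pmf):
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, var

poisson_case = mean_and_variance(cmp_pmf(4.0, 1.0))   # nu = 1
under_case = mean_and_variance(cmp_pmf(4.0, 2.0))     # nu > 1
over_case = mean_and_variance(cmp_pmf(1.5, 0.5))      # nu < 1
```

Each result is a (mean, variance) pair, so the dispersion regime can be read off by comparing the two entries.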

1.20 Zero-inflated Poisson ZIP(π, λ)

Support: {0, 1, 2, …}
PMF: p(0) = π + (1-π) e^(-λ); p(k) = (1-π) e^(-λ) λᵏ / k! for k ≥ 1
Use: count data with excess zeros (insurance claims, defect counts, ecological surveys).

1.21 Zero-truncated Poisson ZTP(λ)

Support: {1, 2, 3, …}
PMF: p(k) = e^(-λ) λᵏ / (k! (1 - e^(-λ)))
Use: count data conditioned on at least one event (zeros are unobservable by design). Not the same as the size-biased Poisson, which is 1 + Poisson(λ).

2. Continuous distributions on R

2.1 Normal / Gaussian N(μ, σ²)

Support: ℝ
PDF: f(x) = (2πσ²)^(-1/2) exp(-(x-μ)²/(2σ²))
Mean: μ · Var: σ² · MGF: exp(μt + σ²t²/2) · Char. fn: exp(iμt - σ²t²/2)
Use: CLT limit; measurement noise; linear regression errors; everything.
Conjugate prior (for μ, σ² known): Normal. For (μ, σ²): Normal-Inverse-Gamma.
Maximum-entropy: distribution on ℝ with given mean and variance.
Relationship: standardized (X - μ)/σ ~ N(0, 1). Sum of independent normals is normal. Square of a standard normal ~ Chi-square(1). exp(X) ~ Lognormal. Ratio of two iid centered normals ~ Cauchy.
Originator: de Moivre (1733), Laplace, Gauss (1809).

2.2 Cauchy(x₀, γ) / Lorentz

Support: ℝ
PDF: f(x) = 1 / (πγ (1 + ((x-x₀)/γ)²))
Mean: undefined · Var: undefined · MGF: does not exist · Char. fn: exp(ix₀t - γ|t|)
Use: heavy-tailed noise; resonance line shapes (physics); ratio of two iid centered normals; t-distribution with 1 df.
Property: stable with index 1; sum of n iid Cauchy is also Cauchy (scaled).

2.3 Laplace(μ, b) / double exponential

Support: ℝ
PDF: f(x) = (1/(2b)) exp(-|x-μ|/b)
Mean: μ · Var: 2b² · Char. fn: exp(iμt) / (1 + b²t²)
Use: robust regression (L1 loss = MAP under Laplace); LASSO prior; differential privacy noise.
Relationship: difference of two iid Exp(1/b) is Laplace(0, b).

2.4 Logistic(μ, s)

Support: ℝ
PDF: f(x) = e^(-(x-μ)/s) / (s (1 + e^(-(x-μ)/s))²)
Mean: μ · Var: s²π²/3 · MGF: e^(μt) B(1 - st, 1 + st) for |st| < 1
Use: logistic regression latent variable; growth curves; Bradley-Terry model.
Relationship: difference of two iid Gumbel.

2.5 Student’s t — t_ν(μ, σ)

Support: ℝ
PDF: f(x) = Γ((ν+1)/2) / (Γ(ν/2)√(νπ)σ) · (1 + ((x-μ)/σ)²/ν)^(-(ν+1)/2)
Mean: μ (for ν > 1) · Var: σ² ν/(ν-2) (for ν > 2)
Use: robust regression; heavy-tailed errors; t-test; Bayesian posterior of mean with unknown variance.
Relationship: ν → ∞: → Normal. ν = 1: Cauchy. For independent Z ~ N(0,1) and V ~ Chi²(ν), Z/√(V/ν) ~ t_ν.

2.6 Hyperbolic secant

Support: ℝ
PDF: f(x) = (1/2) sech(πx/2)
Mean: 0 · Var: 1
Use: alternative bell-shape; arises in some neural network analyses.
Char. fn: sech(t).

2.7 Generalized normal / exponential power GN(μ, α, β)

Support: ℝ
PDF: f(x) = β / (2α Γ(1/β)) · exp(-(|x-μ|/α)^β)
Use: family containing Laplace (β=1), Normal (β=2), uniform (β→∞).
Relationship: Lₚ-norm penalties: β = p.

2.8 Skew-normal SN(μ, σ, α)

Support: ℝ
PDF: f(x) = (2/σ) φ((x-μ)/σ) Φ(α(x-μ)/σ)
Use: asymmetric continuous data; α = 0 recovers Normal.
Originator: Adelchi Azzalini (1985).

2.9 Skew-t

Support: ℝ
PDF: Azzalini-Capitanio form: f(x) = 2 t_ν(x) T_{ν+1}(α x √((ν+1)/(ν+x²)))
Use: heavy-tailed asymmetric data; finance, environmetrics.

2.10 Variance-Gamma VG(μ, σ, ν, θ)

Support: ℝ
Description: normal variance-mean mixture with a Gamma-distributed variance. Subclass of generalized hyperbolic.
Use: option pricing (Madan-Seneta); semi-heavy tails.

2.11 Stable / Lévy stable S(α, β, c, μ)

Support: ℝ in general; a half-line when α < 1 and β = ±1 (e.g., the Lévy case α = 1/2, β = 1 lives on (μ, ∞)).
Char. fn: exp(iμt - |ct|^α [1 - iβ sign(t) tan(πα/2)]) for α ≠ 1
Properties: closed under sum (with rescaling). α = 2: Normal. α = 1, β = 0: Cauchy. α = 1/2, β = 1: Lévy distribution.
Use: heavy-tailed sums; α-stable noise; financial returns.
No closed-form PDF in general (except 3 special cases above).

2.12 Asymmetric Laplace AL(μ, λ, κ)

Support: ℝ
PDF: piecewise exponential with different rates on either side of μ.
Use: quantile regression (κ encodes quantile); financial returns.

2.13 Generalized hyperbolic GH

Support: ℝ
PDF involves modified Bessel function K_λ.
Use: parent family of NIG, hyperbolic, Variance-Gamma, t, normal; finance applications.
Originator: Ole Barndorff-Nielsen (1977).

2.14 Normal Inverse Gaussian NIG(α, β, μ, δ)

Support: ℝ
PDF: closed-form with Bessel K₁.
Use: semi-heavy tails with explicit MGF; finance.

3. Continuous distributions on R+ (positive reals)

3.1 Exponential(λ) — rate parameterization

Support: [0, ∞)
PDF: f(x) = λ e^(-λx)
Mean: 1/λ · Var: 1/λ² · MGF: λ/(λ-t) for t < λ
Use: waiting time between Poisson events; lifetime with constant hazard.
Memoryless (unique continuous): P(X > s+t | X > s) = P(X > t).
Conjugate prior (for λ): Gamma.
Relationship: sum of n iid Exp(λ) ~ Gamma(n, λ) = Erlang(n, λ). Min of n independent Exp(λᵢ) ~ Exp(∑λᵢ). -log(U)/λ ~ Exp(λ) where U ~ Uniform(0,1).
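
Two of the relationships above, inverse-transform sampling and the minimum of independent exponentials, can be illustrated by simulation. This is a sketch under assumptions: the seed, sample sizes, and function name are arbitrary choices.

```python
import math
import random

def sample_exponential(rate, rng):
    # Inverse transform: -log(U)/rate; 1 - random() lies in (0, 1], avoiding log(0)
    u = 1.0 - rng.random()
    return -math.log(u) / rate

rng = random.Random(0)
rate = 2.0
draws = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)          # should be near 1/rate

# Minimum of independent exponentials with different rates
rates = [0.5, 1.0, 1.5]
mins = [min(sample_exponential(r, rng) for r in rates) for _ in range(100_000)]
min_mean = sum(mins) / len(mins)               # should be near 1/sum(rates)
```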

3.2 Gamma(α, β) — shape α, rate β (or scale θ = 1/β)

Support: (0, ∞)
PDF: f(x) = β^α x^(α-1) e^(-βx) / Γ(α)
Mean: α/β · Var: α/β² · MGF: (β/(β-t))^α for t < β
Use: waiting time for α-th Poisson event (integer α = Erlang); prior on rates/precisions; insurance claim sizes.
Conjugate prior for Poisson rate, Exponential rate, Normal precision.
Relationship: Gamma(1, β) = Exp(β). Sum of independent Gammas with same rate: shape sums. Gamma(ν/2, 1/2) = Chi²(ν). 1/X ~ Inverse-Gamma.

3.3 Erlang(k, λ)

Support: (0, ∞), integer k ≥ 1
PDF: f(x) = λᵏ xᵏ⁻¹ e^(-λx) / (k-1)!
Use: sum of k iid Exp(λ); telephony queueing; k-stage service times (e.g., M/E_k/1 queues).
Relationship: Erlang(k, λ) = Gamma(k, λ) with integer shape.
Originator: A.K. Erlang (1909, queueing theory at Copenhagen Telephone).

3.4 Chi-square χ²(ν)

Support: (0, ∞)
PDF: f(x) = x^(ν/2-1) e^(-x/2) / (2^(ν/2) Γ(ν/2))
Mean: ν · Var: 2ν · MGF: (1 - 2t)^(-ν/2) for t < 1/2
Use: sum of squares of iid standard normals; goodness-of-fit; variance estimator; t and F denominators.
Relationship: Chi²(ν) = Gamma(ν/2, 1/2). Sum of ν iid N(0,1)².

3.5 Inverse-Gamma IG(α, β)

Support: (0, ∞)
PDF: f(x) = β^α x^(-α-1) e^(-β/x) / Γ(α)
Mean: β/(α-1) for α > 1; Var: β²/((α-1)²(α-2)) for α > 2
Use: conjugate prior on Normal variance σ².
Relationship: 1/X ~ Gamma(α, β) ⇔ X ~ InvGamma(α, β).

3.6 Inverse-Gaussian / Wald IG(μ, λ)

Support: (0, ∞)
PDF: f(x) = √(λ/(2πx³)) exp(-λ(x-μ)²/(2μ²x))
Mean: μ · Var: μ³/λ · MGF: exp((λ/μ)(1 - √(1 - 2μ²t/λ)))
Use: first-passage time of Brownian motion with drift; reaction-time models.

3.7 Weibull(k, λ)

Support: [0, ∞), shape k > 0, scale λ > 0
PDF: f(x) = (k/λ) (x/λ)^(k-1) exp(-(x/λ)^k)
CDF: F(x) = 1 - exp(-(x/λ)^k)
Mean: λ Γ(1 + 1/k) · Var: λ²[Γ(1 + 2/k) - Γ(1 + 1/k)²]
Use: lifetime modeling with monotone hazard (k > 1 increasing, k < 1 decreasing, k = 1 constant = Exp); extreme value (min); wind speed.
Relationship: k = 1 is Exp. k = 2 is Rayleigh. Limit of min of iid samples (Fisher-Tippett-Gnedenko).
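
The hazard behavior follows from h(x) = f(x) / (1 - F(x)) = (k/λ)(x/λ)^(k-1). The sketch below uses illustrative function names and grid points to evaluate the hazard for the three regimes of k.

```python
import math

def weibull_pdf(x, k, lam):
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

def weibull_survival(x, k, lam):
    # 1 - F(x)
    return math.exp(-((x / lam) ** k))

def weibull_hazard(x, k, lam):
    # Closed form of pdf / survival
    return (k / lam) * (x / lam) ** (k - 1)

xs = [0.5, 1.0, 1.5, 2.0]
increasing = [weibull_hazard(x, 2.0, 1.0) for x in xs]   # k > 1
decreasing = [weibull_hazard(x, 0.5, 1.0) for x in xs]   # k < 1
constant = [weibull_hazard(x, 1.0, 3.0) for x in xs]     # k = 1, rate 1/lam
```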

3.8 Lognormal LN(μ, σ²)

Support: (0, ∞)
PDF: f(x) = (xσ√(2π))^(-1) exp(-(log x - μ)²/(2σ²))
Mean: exp(μ + σ²/2) · Var: (e^(σ²) - 1) e^(2μ+σ²) · MGF: does not exist (heavy tail)
Use: multiplicative noise; income; particle sizes; biological growth.
Relationship: X ~ LN(μ, σ²) ⇔ log X ~ N(μ, σ²). Product of iid lognormals is lognormal.
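
A frequent practical task is converting between (μ, σ²) and the mean and variance on the original scale. The sketch below inverts the moment formulas: Var/Mean² = e^(σ²) - 1 gives σ² = log(1 + Var/Mean²), then μ = log(Mean) - σ²/2. Function names are illustrative.

```python
import math

def lognormal_moments(mu, sigma2):
    # Mean and variance of LN(mu, sigma2) on the original scale
    mean = math.exp(mu + sigma2 / 2)
    var = (math.exp(sigma2) - 1) * math.exp(2 * mu + sigma2)
    return mean, var

def lognormal_params_from_moments(mean, var):
    # Moment matching: recover (mu, sigma2) from a target mean and variance
    sigma2 = math.log(1 + var / mean**2)
    mu = math.log(mean) - sigma2 / 2
    return mu, sigma2

mu, sigma2 = lognormal_params_from_moments(mean=10.0, var=25.0)
round_trip = lognormal_moments(mu, sigma2)
```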

3.9 Pareto(x_m, α) — Pareto Type I

Support: [x_m, ∞)
PDF: f(x) = α x_m^α / x^(α+1)
Mean: α x_m / (α-1) for α > 1; Var: x_m² α / ((α-1)²(α-2)) for α > 2
Use: 80/20 rule; wealth distribution; city sizes (above truncation); file sizes.
Relationship: log(X/x_m) ~ Exp(α). Tail of Generalized Pareto.
Originator: Vilfredo Pareto (1895).

3.10 Gumbel(μ, β) — Type I extreme value

Support: ℝ
PDF: f(x) = (1/β) exp(-((x-μ)/β + e^(-(x-μ)/β)))
CDF: F(x) = exp(-e^(-(x-μ)/β))
Mean: μ + βγ (γ = Euler-Mascheroni) · Var: β²π²/6
Use: extreme value (maximum) of light-tailed samples; choice modeling (Gumbel max trick).
Relationship: limit of max of iid Exp/Normal. Gumbel(0,1) - Gumbel(0,1) ~ Logistic(0,1).

3.11 Fréchet(α, s, m) — Type II extreme value

Support: (m, ∞)
PDF: f(x) = (α/s) ((x-m)/s)^(-1-α) exp(-((x-m)/s)^(-α))
Use: extreme value of heavy-tailed samples (Pareto-like parents).
Relationship: for m = 0, if X ~ Fréchet(α, s, 0) then 1/X ~ Weibull(α, 1/s).

3.12 Rayleigh(σ)

Support: [0, ∞)
PDF: f(x) = (x/σ²) exp(-x²/(2σ²))
Mean: σ√(π/2) · Var: (4-π)σ²/2
Use: magnitude of 2D Gaussian; wind speed; MRI noise.
Relationship: Rayleigh(σ) = Weibull(2, σ√2) = √(X²+Y²) for X, Y ~ N(0, σ²) iid. χ(2) scaled.

3.13 Rice(ν, σ)

Support: [0, ∞)
PDF: f(x) = (x/σ²) exp(-(x²+ν²)/(2σ²)) I₀(xν/σ²) (I₀ = modified Bessel)
Use: magnitude of 2D Gaussian with nonzero mean; MRI signal in noise; communications fading.
Relationship: ν = 0 recovers Rayleigh.

3.14 F-distribution F(d₁, d₂)

Support: [0, ∞)
PDF: f(x) = √((d₁x)^d₁ d₂^d₂ / (d₁x+d₂)^(d₁+d₂)) / (x B(d₁/2, d₂/2))
Mean: d₂/(d₂-2) for d₂ > 2 · Var: 2d₂²(d₁+d₂-2) / (d₁(d₂-2)²(d₂-4)) for d₂ > 4
Use: ratio of two chi-squared / df; ANOVA; regression overall F-test.
Relationship: F(d₁, d₂) = (Chi²(d₁)/d₁) / (Chi²(d₂)/d₂). 1/F(d₁, d₂) ~ F(d₂, d₁). t_ν² ~ F(1, ν).

3.15 Generalized Gamma GG(a, d, p)

Support: (0, ∞)
PDF: f(x) = (p/a^d) x^(d-1) exp(-(x/a)^p) / Γ(d/p)
Use: nests Gamma (p=1), Weibull (d=p), Exp (d=p=1), Half-normal, lognormal limit. Flexible survival modeling.

4. Continuous distributions on [0, 1] (bounded)

4.1 Beta(α, β)

Support: [0, 1]
PDF: f(x) = x^(α-1) (1-x)^(β-1) / B(α, β), B(α,β) = Γ(α)Γ(β)/Γ(α+β)
Mean: α/(α+β) · Var: αβ/((α+β)²(α+β+1)) · Mode: (α-1)/(α+β-2) if α, β > 1
Use: probabilities; proportions; conjugate prior to Bernoulli, Binomial, Geometric, NB.
Relationship: Beta(1,1) = Uniform(0,1). α = β: symmetric around 1/2; both → ∞: concentrates at 1/2. If X ~ Gamma(α, θ), Y ~ Gamma(β, θ), then X/(X+Y) ~ Beta(α, β).

4.2 Kumaraswamy(a, b)

Support: [0, 1]
PDF: f(x) = a b x^(a-1) (1 - x^a)^(b-1)
CDF: F(x) = 1 - (1 - x^a)^b (closed form, unlike Beta)
Use: alternative to Beta with tractable CDF; hydrology; classification calibration.
Originator: Ponnambalam Kumaraswamy (1980).
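
Because the CDF inverts in closed form, sampling needs only a uniform draw: solving u = 1 - (1 - x^a)^b gives x = (1 - (1 - u)^(1/b))^(1/a). A minimal sketch, with assumed function names and an arbitrary seed:

```python
import random

def kumaraswamy_cdf(x, a, b):
    return 1.0 - (1.0 - x**a) ** b

def kumaraswamy_quantile(u, a, b):
    # Inverse of the CDF above
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def sample_kumaraswamy(a, b, rng):
    # Inverse-transform sampling from a Uniform(0, 1) draw
    return kumaraswamy_quantile(rng.random(), a, b)

rng = random.Random(0)
a, b = 2.0, 3.0
samples = [sample_kumaraswamy(a, b, rng) for _ in range(10_000)]
round_trip = [kumaraswamy_cdf(kumaraswamy_quantile(u, a, b), a, b)
              for u in (0.1, 0.5, 0.9)]
```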

4.3 Logit-normal

Support: (0, 1)
Description: logit(X) ~ N(μ, σ²)
PDF: f(x) = (1/(σ√(2π))) (1/(x(1-x))) exp(-(logit(x) - μ)²/(2σ²))
Use: Normal-on-the-logit; arises in random-effects probability modeling; not conjugate to anything.

4.4 Beta-binomial (already in §1.14)

Marginal of Binomial(n, p) with p ~ Beta.

4.5 Power-function distribution

Support: [0, 1] (or [0, b])
PDF: f(x) = α x^(α-1)
Use: simple monotone density on a bounded interval. On [0, 1] it equals Beta(α, 1).

4.6 Truncated normal on [0, 1] — TN(μ, σ²; 0, 1)

Support: [0, 1]
PDF: f(x) = φ((x-μ)/σ) / (σ (Φ((1-μ)/σ) - Φ(-μ/σ)))
Use: bounded continuous quantities; censored data; HMC for constrained variables.

5. Multivariate distributions

5.1 Multivariate Normal N_d(μ, Σ)

Support: ℝ^d
PDF: f(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-(x-μ)ᵀ Σ⁻¹ (x-μ)/2)
Mean: μ · Cov: Σ · Char. fn: exp(iμᵀt - tᵀΣt/2)
Use: linear-Gaussian models; Kalman filter; Gaussian processes (finite-dim slice); copulas.
Conjugate prior (for μ, Σ known): Normal. For (μ, Σ): Normal-Inverse-Wishart.
Properties: marginals normal, conditionals normal, linear combinations normal.
Relationship: limit of MV-t as df → ∞.

5.2 Dirichlet(α₁, …, αₖ)

Support: simplex {(p₁,…,pₖ) : pᵢ ≥ 0, ∑pᵢ = 1}
PDF: f(p) = (∏ pᵢ^(αᵢ-1)) Γ(∑αᵢ) / ∏ Γ(αᵢ)
Mean: αᵢ / α₀ where α₀ = ∑αᵢ · Var(pᵢ): αᵢ(α₀-αᵢ)/(α₀²(α₀+1))
Use: prior over categorical/multinomial probabilities; topic models (LDA); Bayesian smoothing.
Relationship: marginals are Beta. Limit α₀ → 0: sparse (concentrates on vertices). α₀ → ∞: concentrates on mean. Dirichlet Process is the infinite-dimensional analog.

5.3 Multinomial — already in §1.8

5.4 Wishart W_d(ν, V)

Support: d×d positive-definite matrices
PDF: f(X) = |X|^((ν-d-1)/2) exp(-tr(V⁻¹X)/2) / (2^(νd/2) |V|^(ν/2) Γ_d(ν/2))
Mean: νV
Use: distribution of the scatter matrix X = ZᵀZ, where the ν rows of Z are iid N_d(0, V) (so X/ν estimates V); conjugate prior on precision (inverse covariance) matrices.
Originator: John Wishart (1928).

5.5 Inverse-Wishart IW(ν, Ψ)

Support: d×d PD matrices
Description: X ~ Wishart ⇒ X⁻¹ ~ Inverse-Wishart.
Use: conjugate prior on covariance matrix Σ.

5.6 Multivariate-t MVT_ν(μ, Σ)

Support: ℝ^d
PDF: f(x) ∝ (1 + (x-μ)ᵀ Σ⁻¹ (x-μ)/ν)^(-(ν+d)/2)
Use: robust analog of MVN; Bayesian posterior of MVN mean with unknown covariance.

5.7 Matrix-normal MN_{n,p}(M, U, V)

Support: n×p matrices
PDF: f(X) ∝ exp(-tr(V⁻¹(X-M)ᵀU⁻¹(X-M))/2)
Use: prior over coefficient matrices in multivariate regression.

5.8 Multivariate Laplace

Support: ℝ^d
Description: scale-mixture of MVN over Gamma scale.
Use: robust multivariate modeling; sparse priors.

5.9 Copulas — coupling of margins

A copula C: [0,1]^d → [0,1] is the joint CDF of (F₁(X₁), …, F_d(X_d)) (Sklar’s theorem). Separates dependence from marginals.

Gaussian copula: C(u) = Φ_d(Φ⁻¹(u₁), …, Φ⁻¹(u_d); Σ). No tail dependence (notoriously misused for CDOs).
Clayton copula: C(u₁, u₂; θ) = (u₁^(-θ) + u₂^(-θ) - 1)^(-1/θ), θ > 0. Lower-tail dependence.
Frank copula: C(u; θ) = -(1/θ) log(1 + (e^(-θu₁) - 1)(e^(-θu₂) - 1)/(e^(-θ) - 1)). Symmetric, no tail dependence.
Gumbel copula: C(u; θ) = exp(-((-log u₁)^θ + (-log u₂)^θ)^(1/θ)), θ ≥ 1. Upper-tail dependence.
Marginally-normal copulas: any copula with normal margins (= MVN if Gaussian copula, else non-MVN with normal marginals).
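
Lower-tail dependence can be made concrete with the standard coefficient λ_L = lim_{u→0} C(u, u)/u, a definition not stated in this note. For Clayton, C(u, u)/u = (2 - u^θ)^(-1/θ), so the limit is 2^(-1/θ). The sketch below uses illustrative names and checks this numerically along with the uniform-margin property C(u, 1) = u.

```python
def clayton_copula(u1, u2, theta):
    # Bivariate Clayton copula, theta > 0
    return (u1 ** (-theta) + u2 ** (-theta) - 1.0) ** (-1.0 / theta)

def lower_tail_ratio(theta, u):
    # C(u, u) / u, which approaches the lower-tail dependence coefficient as u -> 0
    return clayton_copula(u, u, theta) / u

theta = 2.0
ratios = [lower_tail_ratio(theta, u) for u in (1e-2, 1e-4, 1e-6)]
limit = 2.0 ** (-1.0 / theta)
margin_check = clayton_copula(0.3, 1.0, theta)   # should equal 0.3
```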

5.10 Multivariate hypergeometric — already in §1.9

6. Heavy-tailed / extreme-value distributions

6.1 Generalized Extreme Value GEV(μ, σ, ξ)

Support: depends on ξ.
CDF: F(x) = exp(-(1 + ξ(x-μ)/σ)^(-1/ξ)) for ξ ≠ 0; exp(-e^(-(x-μ)/σ)) for ξ = 0.
Use: block maxima (Fisher-Tippett-Gnedenko theorem). ξ > 0: Fréchet (heavy tail). ξ = 0: Gumbel (light tail). ξ < 0: Reverse Weibull (bounded tail).
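
A sketch of the GEV CDF that handles the ξ = 0 branch and the support boundary explicitly. The tolerance for treating ξ as zero and the function name are assumptions. Outside the support, where 1 + ξ(x-μ)/σ ≤ 0, the CDF is 0 for ξ > 0 and 1 for ξ < 0.

```python
import math

def gev_cdf(x, mu, sigma, xi, tol=1e-12):
    z = (x - mu) / sigma
    if abs(xi) < tol:
        # Gumbel branch
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # Below the lower endpoint (xi > 0) or above the upper endpoint (xi < 0)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-(t ** (-1.0 / xi)))

gumbel_value = gev_cdf(1.0, 0.0, 1.0, 0.0)
near_gumbel = gev_cdf(1.0, 0.0, 1.0, 1e-6)     # small xi should be close to Gumbel
frechet_below = gev_cdf(-3.0, 0.0, 1.0, 0.5)   # below the Frechet-type lower endpoint
weibull_above = gev_cdf(3.0, 0.0, 1.0, -0.5)   # above the reverse-Weibull upper endpoint
```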

6.2 Generalized Pareto GPD(μ, σ, ξ)

Support: [μ, ∞) (ξ ≥ 0) or [μ, μ - σ/ξ] (ξ < 0)
CDF: F(x) = 1 - (1 + ξ(x-μ)/σ)^(-1/ξ) for ξ ≠ 0; 1 - e^(-(x-μ)/σ) for ξ = 0.
Use: peaks-over-threshold (Pickands-Balkema-de Haan theorem). ξ = 0: Exp. ξ > 0: Pareto-tail. ξ < 0: bounded.

6.3 Fréchet — already in §3.11

6.4 Gumbel — already in §3.10

6.5 Weibull (reverse) — the law of -X for X ~ Weibull (§3.7), supported on (-∞, 0]

6.6 Tukey lambda

Support: ℝ (or bounded depending on λ)
Description: defined via quantile function Q(p) = (p^λ - (1-p)^λ)/λ. Family contains uniform (λ=1), logistic (λ=0), approx Normal (λ ≈ 0.14).
Use: distributional shape exploration; goodness-of-fit benchmarking.

7. Circular / directional distributions

7.1 von Mises VM(μ, κ)

Support: [0, 2π)
PDF: f(θ) = exp(κ cos(θ - μ)) / (2π I₀(κ))
Mean direction: μ · Concentration: κ
Use: circular analog of normal; wind directions; phase angles.
Originator: Richard von Mises (1918).

7.2 von Mises-Fisher VMF(μ, κ) on S^(d-1)

Support: unit sphere in ℝ^d
PDF: f(x) = C_d(κ) exp(κ μᵀ x) where C_d(κ) = κ^(d/2-1) / ((2π)^(d/2) I_{d/2-1}(κ))
Use: directional data; clustering of L2-normalized embeddings.

7.3 Bingham(M, Z)

Support: unit sphere
PDF: f(x) ∝ exp(xᵀ M Z Mᵀ x)
Use: antipodally-symmetric directional data; orientations (where +x and -x are indistinguishable).

7.4 Wrapped Cauchy(μ, ρ)

Support: [0, 2π)
PDF: f(θ) = (1/(2π)) (1 - ρ²)/(1 + ρ² - 2ρ cos(θ - μ))
Use: heavy-tailed circular data.

7.5 Wrapped Normal WN(μ, σ²)

Support: [0, 2π)
PDF: f(θ) = ∑_{k=-∞}^∞ (1/√(2πσ²)) exp(-(θ - μ + 2πk)²/(2σ²))
Use: small-σ approximation of von Mises.

8. Bayesian non-parametric distributions

8.1 Dirichlet Process DP(α, H)

Description: prior over discrete probability measures. Draws G ~ DP(α, H) are almost surely discrete: G = ∑_{k=1}^∞ π_k δ_{θ_k}, with (π_k) from stick-breaking, θ_k ~ H.
Use: infinite mixture models; non-parametric Bayes; clustering with unknown number of clusters.
Stick-breaking (Sethuraman 1994): β_k ~ Beta(1, α), π_k = β_k ∏_{j<k} (1 - β_j).
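
The stick-breaking construction translates directly into code once it is truncated at a finite number of sticks. The truncation level, seed, and function name below are assumptions for illustration; the untruncated process has infinitely many weights.

```python
import random

def stick_breaking_weights(alpha, num_sticks, rng):
    # pi_k = beta_k * prod_{j<k} (1 - beta_j), with beta_k ~ Beta(1, alpha)
    weights = []
    remaining = 1.0                      # length of stick not yet broken off
    for _ in range(num_sticks):
        beta_k = rng.betavariate(1.0, alpha)
        weights.append(beta_k * remaining)
        remaining *= 1.0 - beta_k
    return weights, remaining

rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, num_sticks=200, rng=rng)
```

The leftover mass is what the truncation discards; larger α spreads mass over more sticks and so needs a higher truncation level.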

8.2 Chinese Restaurant Process CRP(α)

Description: predictive distribution of a DP. Customer n+1 joins table k with probability n_k / (n + α) or starts new table with probability α/(n + α).
Use: exchangeable partitions; cluster assignment in non-parametric mixtures.
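
The seating rule can be simulated sequentially. In this sketch (names and seed are illustrative), customer n+1 draws a point in [0, n + α): the first n units are split among existing tables in proportion to their sizes, and the last α units open a new table.

```python
import random

def sample_crp(num_customers, alpha, rng):
    table_sizes = []       # table_sizes[k] = number of customers at table k
    assignments = []       # assignments[i] = table index of customer i
    for n in range(num_customers):          # n customers are already seated
        r = rng.random() * (n + alpha)
        chosen = len(table_sizes)           # default: open a new table
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:              # probability size / (n + alpha)
                chosen = k
                break
        if chosen == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[chosen] += 1
        assignments.append(chosen)
    return assignments, table_sizes

rng = random.Random(0)
assignments, table_sizes = sample_crp(500, alpha=1.5, rng=rng)
```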

8.3 Pitman-Yor process PY(α, d, H)

Description: generalizes DP with discount parameter d ∈ [0, 1). Heavier tail in cluster sizes.
Use: language modeling; preferential attachment; better fit for power-law cluster sizes than DP.

8.4 Indian Buffet Process IBP(α)

Description: prior over infinite binary feature matrices. Customer n selects each previously-tried dish with prob m_k/n and tries Poisson(α/n) new dishes.
Use: latent feature models; factor analysis with unknown number of factors.

8.5 Pólya tree

Description: prior over continuous distributions via recursive Beta-distributed splits of probability mass over a dyadic partition. Generalizes DP (which is discrete) to continuous distributions.
Use: density estimation; nonparametric Bayes for continuous data.

9. Distribution-family relationships diagram

            ┌─────────────┐
            │  Bernoulli  │
            └──────┬──────┘
                   │ sum of n iid
                   ▼
            ┌─────────────┐  mixture over     ┌──────────────┐
            │  Binomial   │────Beta────────►  │ Beta-Binomial│
            └──────┬──────┘                    └──────────────┘
        n→∞,np→λ  │
                   ▼
            ┌─────────────┐  mixture over     ┌──────────────┐
            │   Poisson   │────Gamma────────► │NegBinomial   │
            └──────┬──────┘                    └──────────────┘
              difference
                   ▼
              Skellam

            ┌─────────────┐  square + sum     ┌──────────────┐
            │   Normal    │──────────────────►│  Chi-square  │
            └──┬───┬───┬──┘                    └───┬──────────┘
               │   │   │ exp                       │ ratio
               │   │   ▼                           ▼
               │   │  Lognormal              F-distribution
               │   │ ratio Z/√(V/ν)
               │   └──────────────────────►  Student's t  ──ν=1──► Cauchy
               │ truncated
               ▼
            Half-normal,  TruncNormal,  Skew-normal

            ┌─────────────┐  sum of n iid     ┌──────────────┐
            │ Exponential │──────────────────►│    Gamma     │
            └──┬──┬──┬────┘                    └──┬───────────┘
               │  │  │ x^(1/k)                   │ 1/X
               │  │  ▼                           ▼
               │  │  Weibull              Inverse-Gamma
               │  │ -log
               │  ▼
               │  Gumbel ◄────── max of n iid Exp/Normal
               │
               │ x_m exp(X/α), X ~ Exp(1)
               ▼
              Pareto ◄────── log(X/x_m) ~ Exp

            ┌─────────────┐
            │ Multinomial │ ◄──── Dirichlet (conjugate prior)
            └─────────────┘

            ┌─────────────────────┐
            │ Multivariate Normal │ ◄──── Wishart (prior on Σ⁻¹)
            └─────────────────────┘ ◄──── Inverse-Wishart (prior on Σ)

            Stable family (α-stable)
              ├── α=2     ──► Normal
              ├── α=1,β=0 ──► Cauchy
              └── α=1/2,β=1► Lévy

            Generalized Extreme Value (GEV)
              ├── ξ>0  ──► Fréchet
              ├── ξ=0  ──► Gumbel
              └── ξ<0  ──► Reverse-Weibull

            Generalized Pareto (POT)
              ├── ξ=0  ──► Exponential
              ├── ξ>0  ──► Pareto-tail
              └── ξ<0  ──► Bounded uniform-like

            Dirichlet Process
              ├── stick-breaking ──► discrete mixture weights
              ├── predictive    ──► Chinese Restaurant Process
              └── two-parameter ──► Pitman-Yor

10. Conjugate-prior reference

Likelihood	Parameter	Conjugate prior	Posterior update
Bernoulli(p)	p	Beta(α, β)	Beta(α + ∑x, β + n - ∑x)
Binomial(n, p)	p	Beta(α, β)	Beta(α + ∑x, β + ∑(nᵢ - xᵢ))
Geometric(p)	p	Beta(α, β)	Beta(α + n, β + ∑x - n) for the trials variant of §1.3; Beta(α + n, β + ∑x) for the failures variant
NegBin(r, p), r fixed	p	Beta(α, β)	Beta(α + nr, β + ∑x)
Poisson(λ)	λ	Gamma(α, β)	Gamma(α + ∑x, β + n)
Exp(λ)	λ	Gamma(α, β)	Gamma(α + n, β + ∑x)
Normal(μ, σ²), σ² known	μ	Normal(μ₀, τ₀²)	Normal(μₙ, τₙ²) with 1/τₙ² = 1/τ₀² + n/σ², μₙ = τₙ²(μ₀/τ₀² + ∑x/σ²)
Normal(μ, σ²), μ known	σ²	Inverse-Gamma(α, β)	Inverse-Gamma(α + n/2, β + ∑(xᵢ - μ)²/2)
Normal(μ, σ²), both	(μ, σ²)	Normal-Inverse-Gamma	NIG updated
Gamma(α, β), α known	β	Gamma(a, b)	Gamma(a + nα, b + ∑x)
Categorical(p)	p	Dirichlet(α)	Dirichlet(α + counts)
Multinomial(n, p)	p	Dirichlet(α)	Dirichlet(α + counts)
MVN(μ, Σ), Σ known	μ	MVN(μ₀, Σ₀)	MVN closed form
MVN(μ, Σ), μ known	Σ	Inverse-Wishart(ν, Ψ)	IW(ν+n, Ψ + S), S = ∑(xᵢ - μ)(xᵢ - μ)ᵀ
MVN(μ, Σ), both	(μ, Σ)	Normal-Inverse-Wishart	NIW updated
Uniform(0, θ)	θ	Pareto(x_m, k)	Pareto(max(x_m, max x), k+n)
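
A few rows of the table written as update functions, as a minimal sketch. The function names are hypothetical; Gamma uses the shape/rate convention of §3.2, and the Normal row assumes known σ² with the precision-weighted update.

```python
def beta_bernoulli_update(alpha, beta, data):
    # data: iterable of 0/1 outcomes
    successes = sum(data)
    return alpha + successes, beta + len(data) - successes

def gamma_poisson_update(alpha, beta, counts):
    # Gamma(shape alpha, rate beta) prior on the Poisson rate
    return alpha + sum(counts), beta + len(counts)

def normal_known_variance_update(mu0, tau0_sq, sigma_sq, data):
    # Posterior precision is prior precision plus n / sigma^2
    n = len(data)
    post_var = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
    post_mean = post_var * (mu0 / tau0_sq + sum(data) / sigma_sq)
    return post_mean, post_var

beta_post = beta_bernoulli_update(1, 1, [1, 0, 1, 1])
gamma_post = gamma_poisson_update(2.0, 1.0, [3, 0, 2])
normal_post = normal_known_variance_update(0.0, 1.0, 1.0, [2.0, 2.0])
```

In each case the posterior hyperparameters are the prior hyperparameters plus sufficient statistics of the data, which is what makes these families convenient for sequential updating.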

11. Use-case decision tree

Q1: Discrete or continuous outcome?

→ Discrete:

Q2: Binary? → Bernoulli (single) / Binomial (count of successes in n).
Q2: Count of events in a fixed window?
- Equidispersed (var ≈ mean) → Poisson.
- Overdispersed (var > mean) → Negative Binomial or CMP(ν<1).
- Underdispersed (var < mean) → Conway-Maxwell-Poisson (ν>1).
- Excess zeros → Zero-inflated Poisson / NB.
- No zeros (truncated) → Zero-truncated Poisson.
Q2: Sampling without replacement → Hypergeometric.
Q2: K categories, single trial → Categorical; n trials → Multinomial.
Q2: Rank-frequency / heavy-tailed counts → Zipf / Yule-Simon / discrete Pareto.

→ Continuous:

Q2: Support ℝ?
- Symmetric light-tailed → Normal.
- Symmetric heavy-tailed → Student’s t (df controls heaviness) / Cauchy (df=1) / Laplace.
- Asymmetric → Skew-normal / Skew-t / Asymmetric Laplace.
- Stable sums → α-stable.
Q2: Support (0, ∞) (positive)?
- Memoryless lifetimes → Exponential.
- Sum-of-exponentials waiting → Gamma / Erlang.
- Multiplicative noise → Lognormal.
- Heavy-tailed → Pareto / Lognormal / Fréchet.
- Lifetime with monotone hazard → Weibull.
- Sample variance / sum-of-squares → Chi-square.
- First-passage time of Brownian motion → Inverse-Gaussian.
Q2: Support [0, 1] (proportion / probability)?
- Most cases → Beta (conjugate to Bernoulli).
- Need closed-form CDF → Kumaraswamy.
- Logit-transformed normal data → Logit-normal.
Q2: Support a sphere / circle?
- Circle [0, 2π) → von Mises (light tail) / Wrapped Cauchy (heavy).
- High-dim sphere → von Mises-Fisher.
- Antipodally symmetric → Bingham.

→ Multivariate:

Linear Gaussian → Multivariate Normal.
Probabilities on simplex → Dirichlet (extends Beta).
Covariance matrices → Wishart (scatter matrices; prior on precision Σ⁻¹) / Inverse-Wishart (prior on Σ).
Heavy-tailed multivariate → MV-t / Multivariate Laplace.
Dependence with arbitrary marginals → Copula (choose tail-dependence structure).

→ Extreme values / tails:

Block maxima → Generalized Extreme Value (GEV).
Threshold exceedances → Generalized Pareto.

→ Bayesian non-parametric:

Unknown number of mixture components → Dirichlet Process / Chinese Restaurant Process.
Power-law cluster sizes → Pitman-Yor.
Latent binary features (unknown count) → Indian Buffet Process.
Density estimation → Pólya tree.

12. Cross-links

[[Math/probability-fundamentals]] — measure-theoretic foundations, expectation, convergence.
[[Math/probability-distributions]] — Tier 1 conceptual overview (this note’s cousin).
[[Math/bayesian-inference]] — conjugate priors, posterior updates, MAP/MLE.
[[Math/mcmc-sampling]] — sampling from non-tractable distributions.
[[Math/information-theory]] — entropy + KL of named distributions.
[[Math/stochastic-calculus]] — Brownian motion, Itô, Lévy processes (parent of stable distributions).
[[Math/_index]] — Tier 1 math map.
[[Math/Tier3/_index]] — Tier 3 family-index catalog.

— Last updated 2026-05-17 by claude-code.
