Probability Distribution Zoo
Comprehensive catalog of 60+ probability distributions: their PMFs/PDFs, moments, MGFs, parameter relationships, applications, conjugate priors, and family relationships (transformations, limits, compoundings, mixtures).
Tier 3 family index. Cousin of [[Math/probability-distributions]] (Tier 1 conceptual). Read Tier 1 for “what is a distribution”; read this for “which one do I use and how does it relate to the others”.
Notation: support S, parameters listed, density f (continuous) or PMF p (discrete), mean μ = E[X], variance σ² = Var[X], MGF M(t) = E[exp(tX)], characteristic function φ(t) = E[exp(itX)].
1. Discrete distributions
1.1 Bernoulli(p)
- Support: {0, 1}
- PMF: p(x) = pˣ (1-p)^(1-x), p ∈ [0, 1]
- Mean: p · Var: p(1-p) · MGF: 1 - p + p·eᵗ
- Use: single binary trial; coin flip; success/failure indicator.
- Conjugate prior: Beta.
- Relationship: Binomial(1, p) = Bernoulli(p). Sum of n iid Bernoulli(p) is Binomial(n, p).
- Originator: Jacob Bernoulli (1713, Ars Conjectandi).
1.2 Binomial(n, p)
- Support: {0, 1, …, n}
- PMF: p(k) = C(n,k) pᵏ (1-p)^(n-k)
- Mean: np · Var: np(1-p) · MGF: (1 - p + p·eᵗ)ⁿ
- Use: number of successes in n independent Bernoulli trials.
- Conjugate prior (for p, n fixed): Beta.
- Relationship: limit Binomial(n, λ/n) → Poisson(λ) as n → ∞ (Poisson limit theorem). Mixture over Beta gives Beta-binomial. Standardized → Normal as n → ∞ (de Moivre-Laplace).
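The Poisson limit can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, λ = 3 is an arbitrary choice, and both PMFs are evaluated in log space so that large n does not overflow. It measures the total variation distance between Binomial(n, λ/n) and Poisson(λ) for growing n.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Binomial(n, p) PMF, computed in log space to avoid overflow for large n."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson(lam) PMF, also computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def total_variation(n: int, lam: float) -> float:
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam)."""
    p = lam / n
    overlap = sum(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(n + 1)
    )
    # Poisson mass above n, where the binomial PMF is zero.
    poisson_tail = max(0.0, 1.0 - sum(poisson_pmf(k, lam) for k in range(n + 1)))
    return 0.5 * (overlap + poisson_tail)


lam = 3.0  # arbitrary illustrative rate
distances = {n: total_variation(n, lam) for n in (10, 100, 1000)}
for n, d in distances.items():
    print(f"n = {n:5d}  TV distance = {d:.5f}")
```

The distance should shrink roughly in proportion to 1/n, consistent with Le Cam's bound of λ²/n on the total variation distance.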
1.3 Geometric(p) — number of trials until first success
- Support: {1, 2, 3, …} (or {0, 1, 2, …} for the “failures before success” variant)
- PMF: p(k) = (1-p)^(k-1) p (1-indexed)
- Mean: 1/p · Var: (1-p)/p² · MGF: p eᵗ / (1 - (1-p)eᵗ) for t < -log(1-p)
- Use: waiting time (discrete) until first success.
- Memoryless property (unique among discrete distributions): P(X > m+n | X > m) = P(X > n).
- Relationship: Geometric(p) = NegBin(1, p). Continuous analog: Exponential.
1.4 Negative binomial NB(r, p)
- Support: {r, r+1, …} (trials variant) or {0, 1, 2, …} (failures variant)
- PMF (failures): p(k) = C(k + r - 1, k) (1-p)ᵏ pʳ
- Mean (failures): r(1-p)/p · Var: r(1-p)/p² · MGF: (p / (1 - (1-p)eᵗ))ʳ
- Use: number of failures before the r-th success; overdispersed counts.
- Relationship: Poisson-Gamma mixture: if X | λ ~ Poisson(λ) and λ ~ Gamma(r, p/(1-p)) (shape r, rate p/(1-p)), then X ~ NB(r, p). Generalizes geometric (r = 1). Allows non-integer r (Pólya distribution).
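The Poisson-Gamma mixture can be simulated with the standard library alone. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the Poisson sampler is Knuth's multiplication method (adequate only for moderate λ), and `random.gammavariate` takes a scale parameter, so the rate p/(1-p) is passed as scale (1-p)/p. The sample mean and variance should land near the failures-variant moments r(1-p)/p and r(1-p)/p².

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; intended for moderate lam only."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_nb_via_mixture(r: float, p: float, rng: random.Random) -> int:
    """Draw lam ~ Gamma(shape r, rate p/(1-p)), then X | lam ~ Poisson(lam)."""
    lam = rng.gammavariate(r, (1.0 - p) / p)  # second argument is SCALE = 1/rate
    return sample_poisson(lam, rng)


rng = random.Random(0)
r, p = 3.0, 0.4  # arbitrary illustrative parameters
draws = [sample_nb_via_mixture(r, p, rng) for _ in range(20_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
nb_mean = r * (1 - p) / p       # 4.5
nb_var = r * (1 - p) / p**2     # 11.25
print(f"mean {sample_mean:.3f} vs {nb_mean},  var {sample_var:.3f} vs {nb_var}")
```

The variance exceeding the mean is the overdispersion that motivates NB over Poisson.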
1.5 Hypergeometric(N, K, n)
- Support: {max(0, n-N+K), …, min(n, K)}
- PMF: p(k) = C(K,k) C(N-K, n-k) / C(N, n)
- Mean: nK/N · Var: n(K/N)(1-K/N)(N-n)/(N-1)
- Use: sampling without replacement from a finite population (K “successes” among N).
- Relationship: limit N, K → ∞ with K/N → p: Hyper → Binomial(n, p). Multivariate version: multivariate hypergeometric.
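The PMF, support bounds, and moment formulas above can be cross-checked by direct enumeration. A small sketch, with a hypothetical function name and arbitrary example values for N, K, n:

```python
import math


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(k successes) when drawing n items without replacement from N, K of them successes."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)


N, K, n = 50, 12, 10  # arbitrary illustrative population
lo, hi = max(0, n - N + K), min(n, K)
pmf = {k: hypergeom_pmf(k, N, K, n) for k in range(lo, hi + 1)}

total = sum(pmf.values())
mean = sum(k * q for k, q in pmf.items())
var = sum((k - mean) ** 2 * q for k, q in pmf.items())

# Closed forms from the entry above.
mean_formula = n * K / N
var_formula = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
print(total, mean, mean_formula, var, var_formula)
```

The factor (N-n)/(N-1) is the finite-population correction; it is what separates this variance from the binomial np(1-p).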
1.6 Poisson(λ)
- Support: {0, 1, 2, …}
- PMF: p(k) = e^(-λ) λᵏ / k!, λ > 0
- Mean: λ · Var: λ · MGF: exp(λ(eᵗ - 1)) · Char. fn: exp(λ(e^(it) - 1))
- Use: counts of rare events in fixed interval; arrivals; defects.
- Conjugate prior: Gamma.
- Equidispersion: mean = variance. Departure signals overdispersion (use NB) or underdispersion (use Conway-Maxwell-Poisson).
- Relationship: sum of independent Poissons is Poisson. Difference is Skellam. Limit of Binomial(n, λ/n).
- Originator: Siméon Denis Poisson (1837).
1.7 Categorical(p₁, …, pₖ)
- Support: {1, …, K} (or one-hot vectors)
- PMF: p(x = i) = pᵢ, ∑pᵢ = 1
- Use: single multi-class outcome (generalization of Bernoulli).
- Conjugate prior: Dirichlet.
1.8 Multinomial(n, p₁, …, pₖ)
- Support: {(n₁, …, nₖ) : ∑nᵢ = n, nᵢ ≥ 0}
- PMF: p(n₁,…,nₖ) = n! / (n₁! … nₖ!) ∏ pᵢ^(nᵢ)
- Mean: E[nᵢ] = n pᵢ · Cov: Cov(nᵢ, nⱼ) = -n pᵢ pⱼ (i ≠ j); Var(nᵢ) = n pᵢ(1-pᵢ)
- Use: counts across K categories in n independent trials.
- Conjugate prior: Dirichlet.
1.9 Multivariate hypergeometric
- Support: (k₁, …, kₘ) with ∑kᵢ = n, kᵢ ≤ Kᵢ
- PMF: p(k) = ∏ C(Kᵢ, kᵢ) / C(N, n) where N = ∑Kᵢ
- Use: draw n from an urn with categories of sizes K₁, …, Kₘ without replacement.
1.10 Discrete uniform({a, …, b})
- Support: {a, a+1, …, b}, N = b - a + 1
- PMF: 1/N
- Mean: (a+b)/2 · Var: (N²-1)/12
- Use: maximum-entropy distribution on a finite set; randomization.
1.11 Zipf(s, N) — Zipfian / discrete Pareto
- Support: {1, 2, …, N} (or N → ∞ if s > 1)
- PMF: p(k) = k^(-s) / H_{N,s} where H_{N,s} = ∑_{i=1}^N i^(-s) (generalized harmonic number)
- Mean (finite N): H_{N,s-1} / H_{N,s}; diverges for s ≤ 2 as N → ∞.
- Use: rank-frequency in natural language (Zipf’s law), city sizes, citation counts, file-size distributions.
- Originator: George Kingsley Zipf (1949).
1.12 Yule-Simon(ρ)
- Support: {1, 2, 3, …}
- PMF: p(k) = ρ · B(k, ρ + 1) where B is the Beta function
- Mean: ρ/(ρ-1) for ρ > 1 · Var: ρ²/((ρ-1)²(ρ-2)) for ρ > 2
- Use: preferential attachment models; gene families; word frequencies.
- Originator: G. Udny Yule (1925), formalized by Herbert Simon (1955).
1.13 Discrete Pareto (Pareto-II discrete / zeta-shift)
- Support: {0, 1, 2, …}
- PMF: p(k) ∝ (k + α)^(-s-1) (various parameterizations)
- Use: heavy-tailed discrete data with a shift; alternative to Zipf.
1.14 Beta-binomial(n, α, β)
- Support: {0, 1, …, n}
- PMF: p(k) = C(n,k) B(k+α, n-k+β) / B(α, β)
- Mean: nα/(α+β) · Var: nαβ(α+β+n) / ((α+β)²(α+β+1))
- Use: overdispersed binomial; clustered Bernoulli trials.
- Relationship: marginal of Binomial(n, p) when p ~ Beta(α, β). As α, β → ∞ with fixed α/(α+β): → Binomial.
1.15 Pólya-Eggenberger urn
- Support: depends on urn dynamics.
- Description: draw a ball, replace it, and add c more of the same color. Generalizes hypergeometric (c = -1), binomial (c = 0), and beta-binomial (c = 1; more generally any c > 0, where the limiting color fraction is Beta-distributed).
- Use: contagion models; reinforced random processes.
1.16 Gibbs / Boltzmann
- Support: state space S (often combinatorial).
- PMF: p(x) = exp(-E(x)/T) / Z, Z = ∑ exp(-E(x)/T)
- Use: equilibrium statistical mechanics; energy-based models; simulated annealing target.
1.17 Logarithmic series Log(p)
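- Support: {1, 2, 3, …}
- PMF: p(k) = -pᵏ / (k log(1-p)), 0 < p < 1
- Mean: -p / ((1-p) log(1-p))
- Use: species abundance (Fisher, 1943); summand distribution in the compound-Poisson representation of NB (a Poisson-distributed number of iid logarithmic variables sums to a negative binomial).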
1.18 Skellam(μ₁, μ₂) — difference of two Poissons
- Support: ℤ
- PMF: p(k) = e^(-(μ₁+μ₂)) (μ₁/μ₂)^(k/2) I_|k|(2√(μ₁μ₂)) (I = modified Bessel function of the first kind)
- Mean: μ₁ - μ₂ · Var: μ₁ + μ₂
- Use: difference of two independent Poisson counts (e.g., soccer score differences).
1.19 Conway-Maxwell-Poisson CMP(λ, ν)
- Support: {0, 1, 2, …}
- PMF: p(k) = λᵏ / ((k!)^ν Z(λ, ν)) where Z(λ, ν) = ∑ λʲ / (j!)^ν
- Use: counts with flexible dispersion: ν > 1 → underdispersed (subsumes Bernoulli at ν → ∞); ν = 1 → Poisson; ν < 1 → overdispersed (subsumes geometric at ν = 0).
- Originator: Conway & Maxwell (1962).
1.20 Zero-inflated Poisson ZIP(π, λ)
- Support: {0, 1, 2, …}
- PMF: p(0) = π + (1-π) e^(-λ); p(k) = (1-π) e^(-λ) λᵏ / k! for k ≥ 1
- Use: count data with excess zeros (insurance claims, defect counts, ecological surveys).
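A short sketch of the ZIP PMF, with hypothetical names and arbitrary parameter values. It checks that the PMF sums to one and that the mean equals (1-π)λ, which follows from mixing a point mass at zero with a Poisson(λ) component; the variance (1-π)λ(1+πλ) follows the same way and always exceeds the mean when π > 0.

```python
import math


def zip_pmf(k: int, pi: float, lam: float) -> float:
    """Zero-inflated Poisson: extra mass pi at zero, Poisson(lam) with weight 1 - pi."""
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    return (pi if k == 0 else 0.0) + (1.0 - pi) * poisson


pi, lam = 0.3, 2.5       # arbitrary illustrative parameters
support = range(80)       # truncation point; the tail beyond 80 is negligible here

total = sum(zip_pmf(k, pi, lam) for k in support)
zip_mean = sum(k * zip_pmf(k, pi, lam) for k in support)
zip_var = sum((k - zip_mean) ** 2 * zip_pmf(k, pi, lam) for k in support)
print(total, zip_mean, zip_var)
```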
1.21 Zero-truncated Poisson ZTP(λ)
- Support: {1, 2, 3, …}
- PMF: p(k) = e^(-λ) λᵏ / (k! (1 - e^(-λ)))
- Use: count data conditioned on at least one event (also called the positive or conditional Poisson).
2. Continuous distributions on R
2.1 Normal / Gaussian N(μ, σ²)
- Support: ℝ
- PDF: f(x) = (2πσ²)^(-1/2) exp(-(x-μ)²/(2σ²))
- Mean: μ · Var: σ² · MGF: exp(μt + σ²t²/2) · Char. fn: exp(iμt - σ²t²/2)
- Use: CLT limit; measurement noise; linear regression errors; everything.
- Conjugate prior (for μ, σ² known): Normal. For (μ, σ²): Normal-Inverse-Gamma.
- Maximum-entropy: distribution on ℝ with given mean and variance.
- Relationship: standardized (X - μ)/σ ~ N(0, 1). Sum of independent normals is normal. Square of a standard normal ~ Chi-square(1). exp(X) ~ Lognormal. Ratio of two iid centered normals ~ Cauchy.
- Originator: de Moivre (1733), Laplace, Gauss (1809).
2.2 Cauchy(x₀, γ) / Lorentz
- Support: ℝ
- PDF: f(x) = 1 / (πγ (1 + ((x-x₀)/γ)²))
- Mean: undefined · Var: undefined · MGF: does not exist · Char. fn: exp(ix₀t - γ|t|)
- Use: heavy-tailed noise; resonance line shapes (physics); ratio of two iid centered normals; t-distribution with 1 df.
- Property: stable with index 1; sum of n iid Cauchy is also Cauchy (scaled).
2.3 Laplace(μ, b) / double exponential
- Support: ℝ
- PDF: f(x) = (1/(2b)) exp(-|x-μ|/b)
- Mean: μ · Var: 2b² · Char. fn: exp(iμt) / (1 + b²t²)
- Use: robust regression (L1 loss = MAP under Laplace); LASSO prior; differential privacy noise.
- Relationship: difference of two iid Exp(1/b) variables (rate 1/b, mean b) is Laplace(0, b).
2.4 Logistic(μ, s)
- Support: ℝ
- PDF: f(x) = e^(-(x-μ)/s) / (s (1 + e^(-(x-μ)/s))²)
- Mean: μ · Var: s²π²/3 · MGF: e^(μt) B(1 - st, 1 + st) for |st| < 1
- Use: logistic regression latent variable; growth curves; Bradley-Terry model.
- Relationship: difference of two iid Gumbel.
2.5 Student’s t — t_ν(μ, σ)
- Support: ℝ
- PDF: f(x) = Γ((ν+1)/2) / (Γ(ν/2)√(νπ)σ) · (1 + ((x-μ)/σ)²/ν)^(-(ν+1)/2)
- Mean: μ (for ν > 1) · Var: σ² ν/(ν-2) (for ν > 2)
- Use: robust regression; heavy-tailed errors; t-test; Bayesian posterior of mean with unknown variance.
- Relationship: ν → ∞: → Normal. ν = 1: Cauchy. If Z ~ N(0,1) and V ~ Chi²(ν) are independent, then Z/√(V/ν) ~ t_ν.
2.6 Hyperbolic secant
- Support: ℝ
- PDF: f(x) = (1/2) sech(πx/2)
- Mean: 0 · Var: 1
- Use: alternative bell-shape; arises in some neural network analyses.
- Char. fn: sech(t).
2.7 Generalized normal / exponential power GN(μ, α, β)
- Support: ℝ
- PDF: f(x) = β / (2α Γ(1/β)) · exp(-(|x-μ|/α)^β)
- Use: family containing Laplace (β=1), Normal (β=2), uniform (β→∞).
- Relationship: Lₚ-norm penalties: β = p.
2.8 Skew-normal SN(μ, σ, α)
- Support: ℝ
- PDF: f(x) = (2/σ) φ((x-μ)/σ) Φ(α(x-μ)/σ)
- Use: asymmetric continuous data; α = 0 recovers Normal.
- Originator: Adelchi Azzalini (1985).
2.9 Skew-t
- Support: ℝ
- PDF: Azzalini-Capitanio form: f(x) = 2 t_ν(x) T_{ν+1}(α x √((ν+1)/(ν+x²)))
- Use: heavy-tailed asymmetric data; finance, environmetrics.
2.10 Variance-Gamma VG(μ, σ, ν, θ)
- Support: ℝ
- Description: normal whose variance is Gamma-distributed (a normal variance-mean mixture). Subclass of generalized hyperbolic.
- Use: option pricing (Madan-Seneta); semi-heavy tails.
2.11 Stable / Lévy stable S(α, β, c, μ)
- Support: ℝ (a half-line when α < 1 and β = ±1)
- Char. fn: exp(iμt - |ct|^α [1 - iβ sign(t) tan(πα/2)]) for α ≠ 1; for α = 1 the tan(πα/2) term is replaced by -(2/π) log|t|.
- Properties: closed under sum (with rescaling). α = 2: Normal. α = 1, β = 0: Cauchy. α = 1/2, β = 1: Lévy distribution.
- Use: heavy-tailed sums; α-stable noise; financial returns.
- No closed-form PDF in general (except the 3 special cases above).
2.12 Asymmetric Laplace AL(μ, λ, κ)
- Support: ℝ
- PDF: piecewise exponential with different rates on either side of μ.
- Use: quantile regression (κ encodes quantile); financial returns.
2.13 Generalized hyperbolic GH
- Support: ℝ
- PDF: involves the modified Bessel function K_λ.
- Use: parent family of NIG, hyperbolic, Variance-Gamma, t, normal; finance applications.
- Originator: Ole Barndorff-Nielsen (1977).
2.14 Normal Inverse Gaussian NIG(α, β, μ, δ)
- Support: ℝ
- PDF: closed-form with Bessel K₁.
- Use: semi-heavy tails with explicit MGF; finance.
3. Continuous distributions on R+ (positive reals)
3.1 Exponential(λ) — rate parameterization
- Support: [0, ∞)
- PDF: f(x) = λ e^(-λx)
- Mean: 1/λ · Var: 1/λ² · MGF: λ/(λ-t) for t < λ
- Use: waiting time between Poisson events; lifetime with constant hazard.
- Memoryless (unique continuous): P(X > s+t | X > s) = P(X > t).
- Conjugate prior (for λ): Gamma.
- Relationship: sum of n iid Exp(λ) ~ Gamma(n, λ) = Erlang(n, λ). Min of n independent Exp(λᵢ) ~ Exp(∑λᵢ). -log(U)/λ ~ Exp(λ) where U ~ Uniform(0,1).
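The inverse-transform relationship and the memoryless property can both be checked by simulation. A minimal sketch with hypothetical names and arbitrary values for the rate, s, and t; it uses 1 - U in place of U (same distribution) so the logarithm never sees zero.

```python
import math
import random


def sample_exponential(rate: float, rng: random.Random) -> float:
    """Inverse-transform sampling: -log(U)/rate with U uniform on (0, 1]."""
    u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1]
    return -math.log(u) / rate


rng = random.Random(0)
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(50_000)]
sample_mean = sum(samples) / len(samples)  # should be near 1/rate = 0.5

# Memoryless check: P(X > s + t | X > s) should match P(X > t).
s, t = 0.3, 0.4
survivors = [x for x in samples if x > s]
conditional_tail = sum(x > s + t for x in survivors) / len(survivors)
unconditional_tail = sum(x > t for x in samples) / len(samples)
print(sample_mean, conditional_tail, unconditional_tail, math.exp(-rate * t))
```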
3.2 Gamma(α, β) — shape α, rate β (or scale θ = 1/β)
- Support: (0, ∞)
- PDF: f(x) = β^α x^(α-1) e^(-βx) / Γ(α)
- Mean: α/β · Var: α/β² · MGF: (β/(β-t))^α for t < β
- Use: waiting time for α-th Poisson event (integer α = Erlang); prior on rates/precisions; insurance claim sizes.
- Conjugate prior for Poisson rate, Exponential rate, Normal precision.
- Relationship: Gamma(1, β) = Exp(β). Sum of independent Gammas with same rate: shape sums. Gamma(ν/2, 1/2) = Chi²(ν). 1/X ~ Inverse-Gamma.
3.3 Erlang(k, λ)
- Support: (0, ∞), integer k ≥ 1
- PDF: f(x) = λᵏ xᵏ⁻¹ e^(-λx) / (k-1)!
- Use: sum of k iid Exp(λ); waiting time until the k-th event of a Poisson process; telephony queueing.
- Relationship: Erlang(k, λ) = Gamma(k, λ) with integer shape.
- Originator: A.K. Erlang (1909, queueing theory at Copenhagen Telephone).
3.4 Chi-square χ²(ν)
- Support: (0, ∞)
- PDF: f(x) = x^(ν/2-1) e^(-x/2) / (2^(ν/2) Γ(ν/2))
- Mean: ν · Var: 2ν · MGF: (1 - 2t)^(-ν/2) for t < 1/2
- Use: sum of squares of iid standard normals; goodness-of-fit; variance estimator; t and F denominators.
- Relationship: Chi²(ν) = Gamma(ν/2, 1/2). Sum of squares of ν iid N(0,1) variables.
3.5 Inverse-Gamma IG(α, β)
- Support: (0, ∞)
- PDF: f(x) = β^α x^(-α-1) e^(-β/x) / Γ(α)
- Mean: β/(α-1) for α > 1 · Var: β²/((α-1)²(α-2)) for α > 2
- Use: conjugate prior on Normal variance σ².
- Relationship: 1/X ~ Gamma(α, β) ⇔ X ~ InvGamma(α, β).
3.6 Inverse-Gaussian / Wald IG(μ, λ)
- Support: (0, ∞)
- PDF: f(x) = √(λ/(2πx³)) exp(-λ(x-μ)²/(2μ²x))
- Mean: μ · Var: μ³/λ · MGF: exp((λ/μ)(1 - √(1 - 2μ²t/λ)))
- Use: first-passage time of Brownian motion with drift; reaction-time models.
3.7 Weibull(k, λ)
- Support: [0, ∞), shape k > 0, scale λ > 0
- PDF: f(x) = (k/λ) (x/λ)^(k-1) exp(-(x/λ)^k)
- CDF: F(x) = 1 - exp(-(x/λ)^k)
- Mean: λ Γ(1 + 1/k) · Var: λ²[Γ(1 + 2/k) - Γ(1 + 1/k)²]
- Use: lifetime modeling with monotone hazard (k > 1 increasing, k < 1 decreasing, k = 1 constant = Exp); extreme value (min); wind speed.
- Relationship: k = 1 is Exp. k = 2 is Rayleigh. Limit of min of iid samples (Fisher-Tippett-Gnedenko).
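The monotone-hazard claim follows from the hazard h(x) = f(x) / (1 - F(x)) = (k/λ)(x/λ)^(k-1). The sketch below, with hypothetical function names and an arbitrary evaluation grid, computes the hazard from the PDF and CDF given above and prints it for the three regimes.

```python
import math


def weibull_pdf(x: float, k: float, lam: float) -> float:
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))


def weibull_survival(x: float, k: float, lam: float) -> float:
    """1 - F(x) for the Weibull CDF F(x) = 1 - exp(-(x/lam)^k)."""
    return math.exp(-((x / lam) ** k))


def weibull_hazard(x: float, k: float, lam: float) -> float:
    """Hazard rate h(x) = f(x) / (1 - F(x))."""
    return weibull_pdf(x, k, lam) / weibull_survival(x, k, lam)


lam = 2.0
grid = [0.5, 1.0, 2.0, 4.0]  # arbitrary evaluation points
hazards = {k: [weibull_hazard(x, k, lam) for x in grid] for k in (0.5, 1.0, 2.0)}
for k, values in hazards.items():
    print(f"k = {k}: " + ", ".join(f"{h:.4f}" for h in values))
```

With k = 1 the hazard is the constant 1/λ, which is the exponential case.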
3.8 Lognormal LN(μ, σ²)
- Support: (0, ∞)
- PDF: f(x) = (xσ√(2π))^(-1) exp(-(log x - μ)²/(2σ²))
- Mean: exp(μ + σ²/2) · Var: (e^(σ²) - 1) e^(2μ+σ²) · MGF: does not exist (heavy tail)
- Use: multiplicative noise; income; particle sizes; biological growth.
- Relationship: X ~ LN(μ, σ²) ⇔ log X ~ N(μ, σ²). Product of independent lognormals is lognormal.
3.9 Pareto(x_m, α) — Pareto Type I
- Support: [x_m, ∞)
- PDF: f(x) = α x_m^α / x^(α+1)
- Mean: α x_m / (α-1) for α > 1 · Var: x_m² α / ((α-1)²(α-2)) for α > 2
- Use: 80/20 rule; wealth distribution; city sizes (above truncation); file sizes.
- Relationship: log(X/x_m) ~ Exp(α). Special case of the Generalized Pareto (§6.2) with μ = x_m, σ = x_m/α, ξ = 1/α.
- Originator: Vilfredo Pareto (1895).
3.10 Gumbel(μ, β) — Type I extreme value
- Support: ℝ
- PDF: f(x) = (1/β) exp(-((x-μ)/β + e^(-(x-μ)/β)))
- CDF: F(x) = exp(-e^(-(x-μ)/β))
- Mean: μ + βγ (γ = Euler-Mascheroni) · Var: β²π²/6
- Use: extreme value (maximum) of light-tailed samples; choice modeling (Gumbel max trick).
- Relationship: limit of max of iid Exp/Normal. The difference of two independent Gumbel(0,1) variables ~ Logistic(0,1).
3.11 Fréchet(α, s, m) — Type II extreme value
- Support: (m, ∞)
- PDF: f(x) = (α/s) ((x-m)/s)^(-1-α) exp(-((x-m)/s)^(-α))
- Use: extreme value of heavy-tailed samples (Pareto-like parents).
- Relationship: 1/X ~ Weibull if X ~ Fréchet with m = 0.
3.12 Rayleigh(σ)
- Support: [0, ∞)
- PDF: f(x) = (x/σ²) exp(-x²/(2σ²))
- Mean: σ√(π/2) · Var: (4-π)σ²/2
- Use: magnitude of 2D Gaussian; wind speed; MRI noise.
- Relationship: Rayleigh(σ) = Weibull(2, σ√2) = distribution of √(X²+Y²) for X, Y ~ N(0, σ²) iid. Equals σ times a chi distribution with 2 degrees of freedom.
3.13 Rice(ν, σ)
- Support: [0, ∞)
- PDF: f(x) = (x/σ²) exp(-(x²+ν²)/(2σ²)) I₀(xν/σ²) (I₀ = modified Bessel function of the first kind, order 0)
- Use: magnitude of 2D Gaussian with nonzero mean; MRI signal in noise; communications fading.
- Relationship: ν = 0 recovers Rayleigh.
3.14 F-distribution F(d₁, d₂)
- Support: [0, ∞)
- PDF: f(x) = √((d₁x)^d₁ d₂^d₂ / (d₁x+d₂)^(d₁+d₂)) / (x B(d₁/2, d₂/2))
- Mean: d₂/(d₂-2) for d₂ > 2 · Var: 2d₂²(d₁+d₂-2) / (d₁(d₂-2)²(d₂-4)) for d₂ > 4
- Use: ratio of two independent chi-squared variables, each divided by its df; ANOVA; regression overall F-test.
- Relationship: F(d₁, d₂) = (Chi²(d₁)/d₁) / (Chi²(d₂)/d₂). 1/F(d₁, d₂) ~ F(d₂, d₁). t_ν² ~ F(1, ν).
3.15 Generalized Gamma GG(a, d, p)
- Support: (0, ∞)
- PDF: f(x) = (p/a^d) x^(d-1) exp(-(x/a)^p) / Γ(d/p)
- Use: nests Gamma (p=1), Weibull (d=p), Exp (d=p=1), Half-normal, lognormal limit. Flexible survival modeling.
4. Continuous distributions on [0, 1] (bounded)
4.1 Beta(α, β)
- Support: [0, 1]
- PDF: f(x) = x^(α-1) (1-x)^(β-1) / B(α, β), B(α,β) = Γ(α)Γ(β)/Γ(α+β)
- Mean: α/(α+β) · Var: αβ/((α+β)²(α+β+1)) · Mode: (α-1)/(α+β-2) if α, β > 1
- Use: probabilities; proportions; conjugate prior to Bernoulli, Binomial, Geometric, NB.
- Relationship: Beta(1,1) = Uniform(0,1). α = β: symmetric around 1/2; both → ∞: concentrates at 1/2. If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent, then X/(X+Y) ~ Beta(α, β).
4.2 Kumaraswamy(a, b)
- Support: [0, 1]
- PDF: f(x) = a b x^(a-1) (1 - x^a)^(b-1)
- CDF: F(x) = 1 - (1 - x^a)^b (closed form, unlike Beta)
- Use: alternative to Beta with tractable CDF; hydrology; classification calibration.
- Originator: Ponnambalam Kumaraswamy (1980).
4.3 Logit-normal
- Support: (0, 1)
- Description: logit(X) ~ N(μ, σ²)
- PDF: f(x) = (1/(σ√(2π))) (1/(x(1-x))) exp(-(logit(x) - μ)²/(2σ²))
- Use: Normal-on-the-logit; arises in random-effects probability modeling; not conjugate to anything.
4.4 Beta-binomial (already in §1.14)
Marginal of Binomial(n, p) with p ~ Beta.
4.5 Power-function distribution
- Support: [0, 1] (or [0, b])
- PDF: f(x) = α x^(α-1)
- Use: simple monotone density on bounded interval. On [0, 1] it coincides with Beta(α, 1).
4.6 Truncated normal on [0, 1] — TN(μ, σ²; 0, 1)
- Support: [0, 1]
- PDF: f(x) = φ((x-μ)/σ) / (σ (Φ((1-μ)/σ) - Φ(-μ/σ)))
- Use: bounded continuous quantities; censored data; HMC for constrained variables.
5. Multivariate distributions
5.1 Multivariate Normal N_d(μ, Σ)
- Support: ℝ^d
- PDF: f(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-(x-μ)ᵀ Σ⁻¹ (x-μ)/2)
- Mean: μ · Cov: Σ · Char. fn: exp(iμᵀt - tᵀΣt/2)
- Use: linear-Gaussian models; Kalman filter; Gaussian processes (finite-dim slice); copulas.
- Conjugate prior (for μ, Σ known): Normal. For (μ, Σ): Normal-Inverse-Wishart.
- Properties: marginals normal, conditionals normal, linear combinations normal.
- Relationship: limit of MV-t as df → ∞.
5.2 Dirichlet(α₁, …, αₖ)
- Support: simplex {(p₁,…,pₖ) : pᵢ ≥ 0, ∑pᵢ = 1}
- PDF: f(p) = (∏ pᵢ^(αᵢ-1)) Γ(∑αᵢ) / ∏ Γ(αᵢ)
- Mean: αᵢ / α₀ where α₀ = ∑αᵢ · Var(pᵢ): αᵢ(α₀-αᵢ)/(α₀²(α₀+1))
- Use: prior over categorical/multinomial probabilities; topic models (LDA); Bayesian smoothing.
- Relationship: marginals are Beta. Limit α₀ → 0: sparse (concentrates on vertices). α₀ → ∞: concentrates on mean. Dirichlet Process is the infinite-dimensional analog.
5.3 Multinomial — already in §1.8
5.4 Wishart W_d(ν, V)
- Support: d×d positive-definite matrices
- PDF: f(X) = |X|^((ν-d-1)/2) exp(-tr(V⁻¹X)/2) / (2^(νd/2) |V|^(ν/2) Γ_d(ν/2))
- Mean: νV
- Use: distribution of the scatter matrix X = ZᵀZ (the unnormalized sample covariance) for Z with ν iid rows ~ N(0, V); conjugate prior on precision (inverse covariance) matrices.
- Originator: John Wishart (1928).
5.5 Inverse-Wishart IW(ν, Ψ)
- Support: d×d PD matrices
- Description: X ~ Wishart ⇒ X⁻¹ ~ Inverse-Wishart.
- Use: conjugate prior on covariance matrix Σ.
5.6 Multivariate-t MVT_ν(μ, Σ)
- Support: ℝ^d
- PDF: f(x) ∝ (1 + (x-μ)ᵀ Σ⁻¹ (x-μ)/ν)^(-(ν+d)/2)
- Use: robust analog of MVN; Bayesian posterior of MVN mean with unknown covariance.
5.7 Matrix-normal MN_{n,p}(M, U, V)
- Support: n×p matrices
- PDF: f(X) ∝ exp(-tr(V⁻¹(X-M)ᵀU⁻¹(X-M))/2)
- Use: prior over coefficient matrices in multivariate regression.
5.8 Multivariate Laplace
- Support: ℝ^d
- Description: scale-mixture of MVN over Gamma scale.
- Use: robust multivariate modeling; sparse priors.
5.9 Copulas — coupling of margins
A copula C: [0,1]^d → [0,1] is the joint CDF of (F₁(X₁), …, F_d(X_d)) (Sklar’s theorem). Separates dependence from marginals.
- Gaussian copula: C(u) = Φ_d(Φ⁻¹(u₁), …, Φ⁻¹(u_d); Σ). No tail dependence (notoriously misused for CDOs).
- Clayton copula: C(u₁, u₂; θ) = (u₁^(-θ) + u₂^(-θ) - 1)^(-1/θ), θ > 0. Lower-tail dependence.
- Frank copula: C(u; θ) = -(1/θ) log(1 + (e^(-θu₁) - 1)(e^(-θu₂) - 1)/(e^(-θ) - 1)). Symmetric, no tail dependence.
- Gumbel copula: C(u; θ) = exp(-((-log u₁)^θ + (-log u₂)^θ)^(1/θ)), θ ≥ 1. Upper-tail dependence.
- Marginally-normal copulas: any copula with normal margins (= MVN if Gaussian copula, else non-MVN with normal marginals).
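The Clayton formula is simple enough to evaluate directly. The sketch below uses a hypothetical function name and arbitrary θ values. It checks the uniform-margin property C(u, 1) = u and approximates the lower-tail dependence coefficient, the limit of C(t, t)/t as t → 0, which for Clayton equals 2^(-1/θ).

```python
def clayton_copula(u1: float, u2: float, theta: float) -> float:
    """Bivariate Clayton copula C(u1, u2; theta) for theta > 0."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    if u1 == 0.0 or u2 == 0.0:
        return 0.0  # a copula is grounded: C(u, 0) = C(0, u) = 0
    return (u1 ** -theta + u2 ** -theta - 1.0) ** (-1.0 / theta)


theta = 2.0

# Uniform margins: C(u, 1) = u.
margin_check = [clayton_copula(u, 1.0, theta) for u in (0.1, 0.5, 0.9)]

# Lower-tail dependence: C(t, t) / t approaches 2^(-1/theta) as t -> 0.
t = 1e-6
tail_ratio = clayton_copula(t, t, theta) / t
tail_limit = 2.0 ** (-1.0 / theta)
print(margin_check, tail_ratio, tail_limit)
```

A larger θ pushes the limit toward 1, meaning joint extreme lows become more likely; as θ → 0 the limit goes to 0, approaching independence.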
5.10 Multivariate hypergeometric — already in §1.9
6. Heavy-tailed / extreme-value distributions
6.1 Generalized Extreme Value GEV(μ, σ, ξ)
- Support: depends on ξ.
- CDF: F(x) = exp(-(1 + ξ(x-μ)/σ)^(-1/ξ)) for ξ ≠ 0; exp(-e^(-(x-μ)/σ)) for ξ = 0.
- Use: block maxima (Fisher-Tippett-Gnedenko theorem). ξ > 0: Fréchet (heavy tail). ξ = 0: Gumbel (light tail). ξ < 0: Reverse Weibull (bounded tail).
6.2 Generalized Pareto GPD(μ, σ, ξ)
- Support: [μ, ∞) (ξ ≥ 0) or [μ, μ - σ/ξ] (ξ < 0)
- CDF: F(x) = 1 - (1 + ξ(x-μ)/σ)^(-1/ξ) for ξ ≠ 0; 1 - e^(-(x-μ)/σ) for ξ = 0.
- Use: peaks-over-threshold (Pickands-Balkema-de Haan theorem). ξ = 0: Exp. ξ > 0: Pareto-tail. ξ < 0: bounded.
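Inverting the CDF gives the quantile function Q(p) = μ + σ((1-p)^(-ξ) - 1)/ξ for ξ ≠ 0 and μ - σ log(1-p) for ξ = 0, which is what peaks-over-threshold analyses use for return levels. The sketch below uses hypothetical function names and arbitrary parameters. It implements both directions and checks the round trip across the three ξ regimes.

```python
import math


def gpd_cdf(x: float, mu: float, sigma: float, xi: float) -> float:
    """Generalized Pareto CDF with location mu, scale sigma > 0, shape xi."""
    z = (x - mu) / sigma
    if z < 0:
        return 0.0
    if xi == 0.0:
        return 1.0 - math.exp(-z)
    base = 1.0 + xi * z
    if base <= 0.0:  # beyond the upper endpoint mu - sigma/xi when xi < 0
        return 1.0
    return 1.0 - base ** (-1.0 / xi)


def gpd_quantile(p: float, mu: float, sigma: float, xi: float) -> float:
    """Inverse of gpd_cdf for 0 <= p < 1."""
    if xi == 0.0:
        return mu - sigma * math.log1p(-p)
    return mu + sigma * ((1.0 - p) ** (-xi) - 1.0) / xi


mu, sigma = 1.0, 2.0
round_trip_errors = [
    abs(gpd_cdf(gpd_quantile(p, mu, sigma, xi), mu, sigma, xi) - p)
    for xi in (-0.3, 0.0, 0.5)
    for p in (0.1, 0.5, 0.9, 0.99)
]
print(max(round_trip_errors))
```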
6.3 Fréchet — already in §3.11
6.4 Gumbel — already in §3.10
6.5 Weibull (reverse): the distribution of -X for X ~ Weibull (see §3.7)
6.6 Tukey lambda
- Support: ℝ (or bounded depending on λ)
- Description: defined via quantile function Q(p) = (p^λ - (1-p)^λ)/λ. Family contains uniform (λ=1), logistic (λ=0), approx Normal (λ ≈ 0.14).
- Use: distributional shape exploration; goodness-of-fit benchmarking.
7. Circular / directional distributions
7.1 von Mises VM(μ, κ)
- Support: [0, 2π)
- PDF: f(θ) = exp(κ cos(θ - μ)) / (2π I₀(κ))
- Mean direction: μ · Concentration: κ
- Use: circular analog of normal; wind directions; phase angles.
- Originator: Richard von Mises (1918).
7.2 von Mises-Fisher VMF(μ, κ) on S^(d-1)
- Support: unit sphere in ℝ^d
- PDF: f(x) = C_d(κ) exp(κ μᵀ x) where C_d(κ) = κ^(d/2-1) / ((2π)^(d/2) I_{d/2-1}(κ))
- Use: directional data; clustering of L2-normalized embeddings.
7.3 Bingham(M, Z)
- Support: unit sphere
- PDF: f(x) ∝ exp(xᵀ M Z Mᵀ x)
- Use: antipodally-symmetric directional data; orientations (where +x and -x are indistinguishable).
7.4 Wrapped Cauchy(μ, ρ)
- Support: [0, 2π)
- PDF: f(θ) = (1/(2π)) (1 - ρ²)/(1 + ρ² - 2ρ cos(θ - μ))
- Use: heavy-tailed circular data.
7.5 Wrapped Normal WN(μ, σ²)
- Support: [0, 2π)
- PDF: f(θ) = ∑_{k=-∞}^∞ (1/√(2πσ²)) exp(-(θ - μ + 2πk)²/(2σ²))
- Use: small-σ approximation of von Mises.
8. Bayesian non-parametric distributions
8.1 Dirichlet Process DP(α, H)
- Description: prior over discrete probability measures. Draws G ~ DP(α, H) are almost surely discrete: G = ∑_{k=1}^∞ π_k δ_{θ_k}, with (π_k) from stick-breaking, θ_k ~ H.
- Use: infinite mixture models; non-parametric Bayes; clustering with unknown number of clusters.
- Stick-breaking (Sethuraman 1994): β_k ~ Beta(1, α), π_k = β_k ∏_{j<k} (1 - β_j).
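The stick-breaking construction translates directly into a loop over a truncated number of sticks. This sketch uses a hypothetical function name, and the truncation level of 200 sticks is an illustrative assumption; an exact draw has infinitely many atoms. It returns the weights π_k together with the unassigned remainder ∏(1 - β_j).

```python
import random


def stick_breaking_weights(alpha: float, num_sticks: int, rng: random.Random):
    """Truncated stick-breaking: beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k}(1 - beta_j)."""
    weights = []
    remaining = 1.0  # length of stick not yet assigned
    for _ in range(num_sticks):
        beta_k = rng.betavariate(1.0, alpha)
        weights.append(beta_k * remaining)
        remaining *= 1.0 - beta_k
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, num_sticks=200, rng=rng)
print(f"largest weights: {sorted(weights, reverse=True)[:5]}")
print(f"unassigned mass after truncation: {leftover:.3e}")
```

Smaller α tends to put most of the mass on the first few sticks; larger α spreads it over many atoms.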
8.2 Chinese Restaurant Process CRP(α)
- Description: predictive distribution of a DP. Customer n+1 joins table k with probability n_k / (n + α) or starts a new table with probability α/(n + α).
- Use: exchangeable partitions; cluster assignment in non-parametric mixtures.
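The seating rule can be simulated in a few lines. A minimal sketch with a hypothetical function name and arbitrary settings: with n customers already seated, a single uniform draw on [0, n + α) selects an existing table in proportion to its size, or a new table with the remaining mass α.

```python
import random


def sample_crp(num_customers: int, alpha: float, rng: random.Random) -> list:
    """Simulate Chinese Restaurant Process seating; returns the list of table sizes."""
    table_sizes = []
    for n in range(num_customers):  # n customers are already seated
        u = rng.random() * (n + alpha)
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if u < cumulative:       # probability n_k / (n + alpha)
                table_sizes[k] += 1
                break
        else:                        # probability alpha / (n + alpha)
            table_sizes.append(1)
    return table_sizes


rng = random.Random(0)
tables = sample_crp(num_customers=1000, alpha=1.0, rng=rng)
print(f"{len(tables)} tables, largest sizes: {sorted(tables, reverse=True)[:5]}")
```

The expected number of tables is ∑_{i=0}^{n-1} α/(α + i), which grows like α log n, so clusters accumulate slowly as data arrive.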
8.3 Pitman-Yor process PY(α, d, H)
- Description: generalizes DP with discount parameter d ∈ [0, 1). Heavier tail in cluster sizes.
- Use: language modeling; preferential attachment; better fit for power-law cluster sizes than DP.
8.4 Indian Buffet Process IBP(α)
- Description: prior over infinite binary feature matrices. Customer n selects each previously-tried dish with probability m_k/n and tries Poisson(α/n) new dishes.
- Use: latent feature models; factor analysis with unknown number of factors.
8.5 Polya tree
- Description: prior over continuous distributions via recursive Beta-distributed splits of probability mass over a dyadic partition. Generalizes DP (which is discrete) to continuous distributions.
- Use: density estimation; nonparametric Bayes for continuous data.
9. Distribution-family relationships diagram
Bernoulli
└── sum of n iid ──► Binomial
Binomial
├── mixture over Beta ──► Beta-Binomial
└── n→∞, np→λ ──► Poisson
Poisson
├── mixture over Gamma ──► Negative Binomial
└── difference of two independent ──► Skellam
Normal
├── square + sum ──► Chi-square
├── exp ──► Lognormal
├── ratio Z/√(V/ν) ──► Student's t ──ν=1──► Cauchy
└── truncated ──► Half-normal, TruncNormal, Skew-normal
Chi-square
└── ratio ──► F-distribution
Exponential
├── sum of n iid ──► Gamma
├── x^(1/k) ──► Weibull
├── -log (of Exp(1)) ──► Gumbel (also the limit of max of n iid Exp/Normal)
└── x_m exp(X/α) for X ~ Exp(1) ──► Pareto (equivalently log(X/x_m) ~ Exp(α))
Gamma
└── 1/X ──► Inverse-Gamma
Multinomial
└── conjugate prior ◄── Dirichlet
Multivariate Normal
├── prior on Σ⁻¹ ◄── Wishart
└── prior on Σ ◄── Inverse-Wishart
Stable family (α-stable)
├── α=2 ──► Normal
├── α=1,β=0 ──► Cauchy
└── α=1/2,β=1 ──► Lévy
Generalized Extreme Value (GEV)
├── ξ>0 ──► Fréchet
├── ξ=0 ──► Gumbel
└── ξ<0 ──► Reverse-Weibull
Generalized Pareto (POT)
├── ξ=0 ──► Exponential
├── ξ>0 ──► Pareto-tail
└── ξ<0 ──► Bounded uniform-like
Dirichlet Process
├── stick-breaking ──► discrete mixture weights
├── predictive ──► Chinese Restaurant Process
└── two-parameter ──► Pitman-Yor
10. Conjugate-prior reference
| Likelihood | Parameter | Conjugate prior | Posterior update |
|---|---|---|---|
| Bernoulli(p) | p | Beta(α, β) | Beta(α + ∑x, β + n - ∑x) |
| Binomial(n, p) | p | Beta(α, β) | Beta(α + ∑x, β + ∑(nᵢ - xᵢ)) |
| Geometric(p) | p | Beta(α, β) | Beta(α + n, β + ∑x - n) for trial counts x ≥ 1 (β + ∑x for the failures variant) |
| NegBin(r, p), r fixed | p | Beta(α, β) | Beta(α + nr, β + ∑x) |
| Poisson(λ) | λ | Gamma(α, β) | Gamma(α + ∑x, β + n) |
| Exp(λ) | λ | Gamma(α, β) | Gamma(α + n, β + ∑x) |
| Normal(μ, σ²), σ² known | μ | Normal(μ₀, τ₀²) | Normal(μₙ, τₙ²) with 1/τₙ² = 1/τ₀² + n/σ², μₙ = τₙ²(μ₀/τ₀² + ∑x/σ²) |
| Normal(μ, σ²), μ known | σ² | Inverse-Gamma(α, β) | Inverse-Gamma(α + n/2, β + ∑(x - μ)²/2) |
| Normal(μ, σ²), both | (μ, σ²) | Normal-Inverse-Gamma | NIG updated |
| Gamma(α, β), α known | β | Gamma(a, b) | Gamma(a + nα, b + ∑x) |
| Categorical(p) | p | Dirichlet(α) | Dirichlet(α + counts) |
| Multinomial(n, p) | p | Dirichlet(α) | Dirichlet(α + counts) |
| MVN(μ, Σ), Σ known | μ | MVN(μ₀, Σ₀) | MVN closed form |
| MVN(μ, Σ), μ known | Σ | Inverse-Wishart(ν, Ψ) | IW(ν + n, Ψ + S), S = ∑(xᵢ - μ)(xᵢ - μ)ᵀ |
| MVN(μ, Σ), both | (μ, Σ) | Normal-Inverse-Wishart | NIW updated |
| Uniform(0, θ) | θ | Pareto(x_m, k) | Pareto(max(x_m, max x), k+n) |
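Two rows of the table, written out as code to make the bookkeeping concrete. The function names are hypothetical, and the Gamma prior uses the shape/rate parameterization of §3.2.

```python
def beta_bernoulli_update(alpha: float, beta: float, data: list) -> tuple:
    """Beta(alpha, beta) prior + Bernoulli data (0/1 values) -> Beta posterior parameters."""
    successes = sum(data)
    return alpha + successes, beta + len(data) - successes


def gamma_poisson_update(shape: float, rate: float, counts: list) -> tuple:
    """Gamma(shape, rate) prior + Poisson counts -> Gamma posterior parameters."""
    return shape + sum(counts), rate + len(counts)


# Uniform Beta(1, 1) prior, then 3 successes in 4 trials.
post_alpha, post_beta = beta_bernoulli_update(1.0, 1.0, [1, 0, 1, 1])
posterior_mean_p = post_alpha / (post_alpha + post_beta)

# Gamma(2, 1) prior on a Poisson rate, then counts 3, 0, 2.
post_shape, post_rate = gamma_poisson_update(2.0, 1.0, [3, 0, 2])
posterior_mean_rate = post_shape / post_rate

print(post_alpha, post_beta, posterior_mean_p)
print(post_shape, post_rate, posterior_mean_rate)
```

In both cases the prior parameters act as pseudo-observations: α and β as prior successes and failures, shape and rate as prior total count and prior number of intervals.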
11. Use-case decision tree
Q1: Discrete or continuous outcome?
→ Discrete:
- Q2: Binary? → Bernoulli (single) / Binomial (count of successes in n).
- Q2: Count of events in a fixed window?
  - Equidispersed (var ≈ mean) → Poisson.
  - Overdispersed (var > mean) → Negative Binomial or CMP (ν < 1).
  - Underdispersed (var < mean) → Conway-Maxwell-Poisson (ν > 1).
  - Excess zeros → Zero-inflated Poisson / NB.
  - No zeros (truncated) → Zero-truncated Poisson.
- Q2: Sampling without replacement → Hypergeometric.
- Q2: K categories, single trial → Categorical; n trials → Multinomial.
- Q2: Rank-frequency / heavy-tailed counts → Zipf / Yule-Simon / discrete Pareto.
→ Continuous:
- Q2: Support ℝ?
  - Symmetric light-tailed → Normal.
  - Symmetric heavy-tailed → Student’s t (df controls heaviness) / Cauchy (df=1) / Laplace.
  - Asymmetric → Skew-normal / Skew-t / Asymmetric Laplace.
  - Stable sums → α-stable.
- Q2: Support (0, ∞) (positive)?
  - Memoryless lifetimes → Exponential.
  - Sum-of-exponentials waiting → Gamma / Erlang.
  - Multiplicative noise → Lognormal.
  - Heavy-tailed → Pareto / Lognormal / Fréchet.
  - Lifetime with monotone hazard → Weibull.
  - Sample variance / sum-of-squares → Chi-square.
  - First-passage time of Brownian motion → Inverse-Gaussian.
- Q2: Support [0, 1] (proportion / probability)?
  - Most cases → Beta (conjugate to Bernoulli).
  - Need closed-form CDF → Kumaraswamy.
  - Logit-transformed normal data → Logit-normal.
- Q2: Support a sphere / circle?
  - Circle [0, 2π) → von Mises (light tail) / Wrapped Cauchy (heavy).
  - High-dim sphere → von Mises-Fisher.
  - Antipodally symmetric → Bingham.
→ Multivariate:
- Linear Gaussian → Multivariate Normal.
- Probabilities on simplex → Dirichlet (extends Beta).
- Covariance matrices → Wishart (scatter matrices; prior on Σ⁻¹) / Inverse-Wishart (prior on Σ).
- Heavy-tailed multivariate → MV-t / Multivariate Laplace.
- Dependence with arbitrary marginals → Copula (choose tail-dependence structure).
→ Extreme values / tails:
- Block maxima → Generalized Extreme Value (GEV).
- Threshold exceedances → Generalized Pareto.
→ Bayesian non-parametric:
- Unknown number of mixture components → Dirichlet Process / Chinese Restaurant Process.
- Power-law cluster sizes → Pitman-Yor.
- Latent binary features (unknown count) → Indian Buffet Process.
- Density estimation → Polya tree.
12. Cross-links
- [[Math/probability-fundamentals]]: measure-theoretic foundations, expectation, convergence.
- [[Math/probability-distributions]]: Tier 1 conceptual overview (this note’s cousin).
- [[Math/bayesian-inference]]: conjugate priors, posterior updates, MAP/MLE.
- [[Math/mcmc-sampling]]: sampling from non-tractable distributions.
- [[Math/information-theory]]: entropy + KL of named distributions.
- [[Math/stochastic-calculus]]: Brownian motion, Itô, Lévy processes (parent of stable distributions).
- [[Math/_index]]: Tier 1 math map.
- [[Math/Tier3/_index]]: Tier 3 family-index catalog.
— Last updated 2026-05-17 by claude-code.