Probability Distributions Reference
At a glance
This note is the companion catalog to probability-fundamentals. Where that note develops the machinery (sigma-algebras, expectation, conditional probability, CLT, convergence), this one is the bestiary: a reference of the common parametric families, with PMF or PDF, moments, moment-generating / characteristic functions, conjugate priors, and the canonical real-world use cases.
Use this document as a lookup. When you need to model a real phenomenon (count of events, waiting time, proportion, angle, extreme), scan section 42 (Distribution selection by use case) for the right family, then jump to the dedicated section for formulas and properties. Section 43 collects the conjugate-prior pairs in one table for Bayesian work; section 44 collects MGF / characteristic functions.
A few conventions used throughout:
- PMF denotes p(k) = P(X = k) for discrete X; PDF denotes f(x) for continuous X with P(X in A) = integral_A f(x) dx.
- E[X] is the mean, Var(X) = E[(X - E[X])^2] the variance, sigma = sqrt(Var(X)) the standard deviation.
- The moment-generating function is M_X(t) = E[exp(tX)] when it exists in a neighborhood of 0; the characteristic function is phi_X(t) = E[exp(i t X)] and always exists.
- Gamma(z) denotes the gamma function (continuous analog of factorial: Gamma(n) = (n-1)! for positive integer n); B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) is the beta function.
- “Conjugate prior” means: if the likelihood is L(theta | x) from family F, and the prior p(theta) belongs to family G, then the posterior p(theta | x) also belongs to G (closed-form Bayesian update).
The depth of coverage follows the empirical frequency of use: Normal, Bernoulli/Binomial, Poisson, Exponential, Gamma, Beta, and Dirichlet get longer treatment. Specialty distributions (Rice, von Mises, GEV, stable) get the essential formulas and a pointer to the literature.
DISCRETE DISTRIBUTIONS
A discrete distribution assigns positive probability to a countable support set (often {0, 1, 2, ...} or {0, 1, ..., n}). The PMF satisfies sum_k p(k) = 1 with p(k) >= 0. The CDF F(x) = P(X <= x) is a right-continuous step function.
The discrete families below subdivide into:
- Bernoulli family (single trial, count of successes): Bernoulli, Binomial, Geometric, Negative Binomial, Hypergeometric.
- Count-of-events family: Poisson, Negative Binomial (overdispersed Poisson).
- Categorical family (K outcomes): Categorical, Multinomial, Dirichlet-Multinomial (compound).
- Heavy-tail discrete: Zipf, discrete Pareto, Yule-Simon.
Bernoulli(p)
The atom of discrete probability: a single trial with binary outcome.
- Support: {0, 1}.
- Parameter: p in [0, 1] (success probability).
- PMF: P(X = 1) = p, P(X = 0) = 1 - p. Compactly, p(k) = p^k (1 - p)^(1 - k) for k in {0, 1}.
- Mean: E[X] = p.
- Variance: Var(X) = p(1 - p).
- MGF: M(t) = (1 - p) + p e^t.
- Entropy: H(X) = -p log p - (1 - p) log(1 - p) (the binary entropy function, maximized at p = 1/2).
- Conjugate prior: Beta(alpha, beta) on p; posterior after observing k successes in n trials is Beta(alpha + k, beta + n - k).
Use cases: a single coin flip; a binary indicator (clicked / did not click, defective / not defective, alive / dead at end of trial); the building block for logistic regression (one Bernoulli per row, with p linked to features by the logistic function).
Binomial(n, p)
The sum of n independent identically distributed (iid) Bernoulli(p) trials.
- Support: {0, 1, ..., n}.
- Parameters: n in {1, 2, ...} (trial count), p in [0, 1].
- PMF: p(k) = C(n, k) p^k (1 - p)^(n - k) where C(n, k) = n! / (k! (n - k)!).
- Mean: E[X] = np.
- Variance: Var(X) = np(1 - p).
- MGF: M(t) = ((1 - p) + p e^t)^n.
- Convolution rule: Binomial(n1, p) + Binomial(n2, p) (independent) = Binomial(n1 + n2, p).
- Normal approximation: for np >= 5 and n(1 - p) >= 5, X is approximately N(np, np(1 - p)) (de Moivre-Laplace, a special case of the CLT).
- Poisson limit: as n -> infinity, p -> 0, with np = lambda held fixed, Binomial(n, p) -> Poisson(lambda).
- Conjugate prior: Beta(alpha, beta).
Use cases: number of heads in n flips; number of defective items in a batch; number of clicks among n impressions; A/B test success counts.
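The Binomial formulas above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names binomial_pmf and poisson_pmf are illustrative, not taken from any library. It sums the PMF, recovers the mean and variance, and compares Binomial(n, lambda/n) with Poisson(lambda) for a large n.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Moments of Binomial(20, 0.3) computed directly from the PMF.
n, p = 20, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)                                    # should be 1
mean = sum(k * q for k, q in enumerate(pmf))        # should be n p = 6
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))  # n p (1 - p) = 4.2
print(total, mean, var)

# Poisson limit: n large, p small, n p = lambda fixed.
lam, big_n = 3.0, 10_000
max_gap = max(
    abs(binomial_pmf(k, big_n, lam / big_n) - poisson_pmf(k, lam))
    for k in range(15)
)
print(max_gap)  # small pointwise difference between the two PMFs
```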
Geometric(p)
The number of trials needed until the first success in repeated Bernoulli(p) trials. Two common parameterizations:
- “Trials” form (1-indexed, support {1, 2, ...}):
  - PMF: p(k) = (1 - p)^(k - 1) p.
  - Mean: E[X] = 1/p.
  - Variance: Var(X) = (1 - p)/p^2.
- “Failures-before-success” form (0-indexed, support {0, 1, 2, ...}):
  - PMF: p(k) = (1 - p)^k p.
  - Mean: E[X] = (1 - p)/p.
  - Variance: same: (1 - p)/p^2.
- MGF (trials form): M(t) = p e^t / (1 - (1 - p) e^t) for t < -log(1 - p).
- Memoryless property (the discrete analog of the exponential’s memorylessness): P(X > m + n | X > m) = P(X > n). Geometric is the only memoryless discrete distribution on the positive integers.
Use cases: number of attempts until first success; time to first failure of a system polled at discrete intervals; classical models of waiting in queueing theory.
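The memoryless property follows from the tail formula P(X > n) = (1 - p)^n for the trials form (the first n trials all fail). A small sketch, with illustrative names and arbitrarily chosen values of p, m, and n:

```python
def geometric_tail(n: int, p: float) -> float:
    """P(X > n) for the trials form: the first n trials all fail."""
    return (1 - p) ** n

p, m, n = 0.2, 4, 7

# P(X > m + n | X > m) = P(X > m + n) / P(X > m)
conditional = geometric_tail(m + n, p) / geometric_tail(m, p)
unconditional = geometric_tail(n, p)
print(conditional, unconditional)  # the two values agree

# Mean of the trials form, summed from the PMF (truncated far into the tail).
mean = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 2001))
print(mean)  # close to 1 / p
```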
Negative Binomial(r, p)
Generalizes Geometric: the number of trials (or failures) until the r-th success. Equivalent (and often more useful) characterization: a Gamma-mixture of Poissons, which makes it the workhorse for overdispersed count data.
- “Failures-before-r-successes” form (support {0, 1, 2, ...}):
  - PMF: p(k) = C(k + r - 1, k) p^r (1 - p)^k.
  - Mean: E[X] = r(1 - p)/p.
  - Variance: Var(X) = r(1 - p)/p^2 = E[X]/p. Note Var(X) > E[X]: overdispersion.
- Extension to non-integer r: replace C(k + r - 1, k) with Gamma(k + r) / (Gamma(r) k!). Then r is a real-valued dispersion parameter.
- MGF: M(t) = (p / (1 - (1 - p) e^t))^r for t < -log(1 - p).
- Poisson-Gamma mixture: if lambda ~ Gamma(shape = r, scale = (1 - p)/p) and X | lambda ~ Poisson(lambda), then marginally X ~ NegBin(r, p). The Negative Binomial is the marginal of a Gamma-distributed Poisson rate.
Use cases: count data where variance exceeds the mean (claim counts, ecological abundance, sequencing read counts where Poisson would underestimate variance). Standard GLM family for overdispersed counts; widely used in RNA-seq differential expression (DESeq2, edgeR).
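For non-integer r the PMF is most safely evaluated on the log scale with the log-gamma function. The sketch below (the function name negbin_pmf and the parameter values are illustrative) uses the failures-before-r-successes form and checks the mean and variance formulas by direct summation.

```python
import math

def negbin_pmf(k: int, r: float, p: float) -> float:
    """P(X = k): k failures before the r-th success; r may be non-integer."""
    log_coeff = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coeff + r * math.log(p) + k * math.log1p(-p))

r, p = 2.5, 0.4
pmf = [negbin_pmf(k, r, p) for k in range(400)]  # tail beyond 400 is negligible

total = sum(pmf)
mean = sum(k * q for k, q in enumerate(pmf))       # r (1 - p) / p
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))  # mean / p
print(total, mean, var)
```

The variance exceeds the mean by the factor 1/p, which is the overdispersion the section describes.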
Poisson(lambda)
The canonical model for the count of rare events in a fixed interval of time or space, under the assumption that events occur independently at a constant average rate lambda.
- Support: {0, 1, 2, ...}.
- Parameter: lambda > 0 (mean / rate).
- PMF: p(k) = lambda^k e^(-lambda) / k!.
- Mean: E[X] = lambda.
- Variance: Var(X) = lambda. (Mean equals variance: the distinguishing equidispersion property of the Poisson.)
- MGF: M(t) = exp(lambda (e^t - 1)).
- Convolution: Poisson(lambda1) + Poisson(lambda2) (independent) = Poisson(lambda1 + lambda2).
- Connection to Exponential: if event inter-arrival times are iid Exponential(lambda), then the count of events in [0, t] is Poisson(lambda t). This is the defining property of the Poisson process.
- Conjugate prior: Gamma(alpha, beta) on lambda (shape-rate form); posterior after observing total count K over n intervals is Gamma(alpha + K, beta + n).
Use cases: photons arriving at a detector, customer arrivals at a queue, web requests per second, mutations per genome, typos per page, earthquakes per year. Standard GLM family for count data when overdispersion is absent or modest.
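The Gamma-Poisson conjugate update stated above amounts to adding the total count to the shape and the number of intervals to the rate. A minimal sketch, where the function name, the prior Gamma(2, 1), and the counts are all hypothetical illustration values:

```python
def gamma_poisson_update(alpha: float, beta: float, counts: list[int]) -> tuple[float, float]:
    """Posterior (shape, rate) for a Gamma(alpha, beta) prior on a Poisson rate."""
    return alpha + sum(counts), beta + len(counts)

counts = [3, 5, 4, 6, 2]  # hypothetical event counts in five unit intervals
post_alpha, post_beta = gamma_poisson_update(2.0, 1.0, counts)

# Posterior mean of the rate is shape / rate in the shape-rate form.
posterior_mean = post_alpha / post_beta
print(post_alpha, post_beta, posterior_mean)
```

The posterior mean sits between the prior mean (alpha / beta) and the sample mean of the counts, moving toward the latter as more intervals are observed.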
Categorical / Multinomial
Generalize Bernoulli / Binomial from 2 outcomes to K outcomes.
Categorical(p_1, …, p_K) is one draw from K outcomes with probabilities p_i >= 0, sum_i p_i = 1. PMF: P(X = i) = p_i. Sometimes encoded as a one-hot vector e_i in {0, 1}^K.
Multinomial(n, p_1, …, p_K) is the joint distribution of (X_1, ..., X_K) where X_i is the count of category i after n iid Categorical draws.
- Support: vectors (k_1, ..., k_K) with k_i >= 0 integers and sum_i k_i = n.
- PMF: p(k_1, ..., k_K) = (n! / (k_1! ... k_K!)) p_1^k_1 ... p_K^k_K.
- Marginals: each X_i ~ Binomial(n, p_i).
- Means: E[X_i] = n p_i.
- Variances: Var(X_i) = n p_i (1 - p_i).
- Covariances: Cov(X_i, X_j) = -n p_i p_j for i != j (negative because total is fixed).
- Conjugate prior: Dirichlet(alpha_1, …, alpha_K) on (p_1, ..., p_K); posterior after counts (k_1, ..., k_K) is Dirichlet(alpha_1 + k_1, …, alpha_K + k_K).
Use cases: dice rolls; observed counts across categories of a survey; topic models (LDA: each word is Categorical over topics, each topic Categorical over vocabulary); softmax classifiers in deep learning (output is Categorical).
Hypergeometric(N, K, n)
Sampling without replacement from a finite population of size N containing K “successes”. Draw n items; count the successes drawn.
- Support: max(0, n - (N - K)) <= k <= min(n, K).
- PMF: p(k) = C(K, k) C(N - K, n - k) / C(N, n).
- Mean: E[X] = n K / N.
- Variance: Var(X) = n (K/N) ((N - K)/N) ((N - n)/(N - 1)). The factor (N - n)/(N - 1) is the finite population correction; as N -> infinity with K/N -> p fixed, the hypergeometric converges to Binomial(n, p).
Use cases: card draws (poker hand probabilities); quality assurance sampling (inspect n items from a lot of N); Fisher’s exact test for 2x2 contingency tables; capture-recapture wildlife population estimation.
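As an illustration of the card-draw use case, the sketch below evaluates the Hypergeometric PMF for the number of aces in a 5-card hand (N = 52, K = 4, n = 5). The helper name hypergeom_pmf is illustrative.

```python
import math

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) successes when drawing n items without replacement."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0  # outside the support
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

N, K, n = 52, 4, 5  # deck size, aces in the deck, hand size
pmf = [hypergeom_pmf(k, N, K, n) for k in range(n + 1)]

total = sum(pmf)
mean = sum(k * q for k, q in enumerate(pmf))  # n K / N
p_two_aces = pmf[2]
print(total, mean, p_two_aces)
```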
Discrete uniform(a, b)
Equal probability on each integer in {a, a+1, ..., b} (n = b - a + 1 outcomes).
- PMF: p(k) = 1/n for a <= k <= b.
- Mean: E[X] = (a + b)/2.
- Variance: Var(X) = (n^2 - 1)/12.
Use cases: fair dice (a = 1, b = 6); pseudorandom integer generation; lottery; uninformative prior over a finite set of hypotheses.
Beta-binomial(n, alpha, beta)
The Binomial with a Beta-distributed success probability — a compound distribution that captures overdispersion in proportions-of-successes data.
- Hierarchical: p ~ Beta(alpha, beta), X | p ~ Binomial(n, p).
- Marginal PMF: p(k) = C(n, k) B(k + alpha, n - k + beta) / B(alpha, beta).
- Mean: E[X] = n alpha / (alpha + beta).
- Variance: Var(X) = n alpha beta (alpha + beta + n) / ((alpha + beta)^2 (alpha + beta + 1)), which for n > 1 is strictly greater than the Binomial variance for the same mean: overdispersion.
Use cases: pooled multi-site binary outcomes where each site has its own latent success rate; meta-analytic combination of binomial proportions; sequencing variant-calling priors.
Zipf / power-law / discrete Pareto
Heavy-tailed counts: a few categories dominate, most are rare. Standard parameterizations:
- Zipf(s, N) on {1, ..., N} with PMF p(k) = (1/k^s) / H_{N, s} where H_{N, s} = sum_{j=1}^N 1/j^s is the generalized harmonic number.
- Zeta distribution (infinite support, s > 1): p(k) = (1/k^s) / zeta(s) where zeta is the Riemann zeta function.
- Yule-Simon for preferential-attachment processes.
For continuous Pareto see section 26.
Use cases: word-frequency distributions (Zipf’s law: rank r word frequency proportional to 1/r); city-size distributions; node-degree distributions in scale-free networks; library citation counts; income distributions (heavy upper tail).
CONTINUOUS DISTRIBUTIONS
A continuous distribution has a probability density function f(x) >= 0 with integral f(x) dx = 1 (over its support). For any interval A, P(X in A) = integral_A f(x) dx. Note P(X = x) = 0 for every single x; only intervals have positive probability.
The continuous families below subdivide into:
- Location-scale: Uniform, Normal, Logistic, Laplace, Cauchy.
- Positive support: Exponential, Gamma, Log-normal, Weibull, Pareto, Chi-square.
- Bounded support: Uniform, Beta.
- Sampling distributions (derived from Normal): Chi-square, t, F.
- Multivariate: Multivariate Normal, Multivariate t, Dirichlet (on the simplex), Wishart (on PSD matrices).
- Circular / directional: von Mises.
- Extreme-value: Gumbel, Frechet, Weibull (all subsumed by GEV); Generalized Pareto.
- Mixtures: GMM and beyond.
Uniform(a, b)
Constant density on the interval [a, b].
- Support: [a, b].
- PDF: f(x) = 1/(b - a) for a <= x <= b, zero otherwise.
- CDF: F(x) = (x - a)/(b - a) on [a, b].
- Mean: E[X] = (a + b)/2.
- Variance: Var(X) = (b - a)^2 / 12.
- MGF: M(t) = (e^(tb) - e^(ta)) / (t(b - a)) for t != 0.
- Universality: if U ~ Uniform(0, 1) and F is any CDF, then F^(-1)(U) has CDF F (the inverse-CDF / Smirnov transform, the basis of pseudorandom sampling).
Use cases: pseudorandom-number generation (every other sampler builds on Uniform(0, 1)); rejection sampling envelope; an uninformative prior on a bounded parameter; quantization noise model.
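The universality property can be sketched with the Exponential, whose CDF F(x) = 1 - exp(-lambda x) inverts in closed form to F^(-1)(u) = -log(1 - u) / lambda. The function name, seed, and sample size below are illustrative choices.

```python
import math
import random

def sample_exponential(lam: float, rng: random.Random) -> float:
    """Inverse-CDF draw from Exponential(rate = lam)."""
    u = rng.random()                 # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / lam  # 1 - u lies in (0, 1], so log is defined

rng = random.Random(0)  # fixed seed for reproducibility
lam = 2.0
draws = [sample_exponential(lam, rng) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)
print(sample_mean)  # should be near 1 / lam
```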
Normal / Gaussian N(mu, sigma^2)
The most important continuous distribution. Justified theoretically by the Central Limit Theorem (sums of many small independent contributions tend to Normal) and practically by analytic tractability (conjugate to itself in many settings, linear transforms preserve normality, marginals/conditionals of joint Normals are Normal).
- Support: all of R.
- Parameters: mu in R (mean), sigma^2 > 0 (variance).
- PDF: f(x) = (1 / sqrt(2 pi sigma^2)) exp(-(x - mu)^2 / (2 sigma^2)).
- CDF: no closed form; written Phi((x - mu)/sigma) where Phi is the standard Normal CDF (with mu = 0, sigma = 1). Tables and erf give numerical values.
- Mean: E[X] = mu.
- Variance: Var(X) = sigma^2.
- Skewness: 0. Excess kurtosis: 0 (by convention, since the Normal is the reference).
- MGF: M(t) = exp(mu t + sigma^2 t^2 / 2).
- Characteristic function: phi(t) = exp(i mu t - sigma^2 t^2 / 2).
- Linear closure: if X ~ N(mu, sigma^2) then aX + b ~ N(a mu + b, a^2 sigma^2).
- Convolution: independent N(mu1, s1^2) + N(mu2, s2^2) = N(mu1 + mu2, s1^2 + s2^2).
- Conjugate priors:
  - For mu (variance known): N(mu_0, tau_0^2) is conjugate.
  - For sigma^2 (mean known): InverseGamma(alpha, beta) is conjugate. Equivalently, Inv-chi^2 is sometimes used.
  - For both jointly: Normal-Inverse-Gamma (NIG) prior, with closed-form posterior.
Standardization: Z = (X - mu)/sigma ~ N(0, 1). The 68-95-99.7 rule: about 68% of mass within 1 sigma, 95% within 2 sigma, 99.7% within 3 sigma.
Use cases: measurement errors in physics; residuals in linear regression (assumed Normal for inference); approximating sums and averages (CLT); maximum-likelihood estimation pipelines; Kalman filters; Gaussian processes; baseline distribution for almost every ML loss function (MSE loss = Normal log-likelihood up to constants).
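The standard Normal CDF can be written in terms of the error function as Phi(z) = (1 + erf(z / sqrt(2))) / 2, which makes the 68-95-99.7 rule easy to verify. A short sketch (the helper names are illustrative):

```python
import math

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mass_within(k: float) -> float:
    """Probability that a Normal variable falls within k standard deviations of its mean."""
    return normal_cdf(k) - normal_cdf(-k)

for k in (1, 2, 3):
    print(k, round(mass_within(k), 4))
```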
Multivariate Normal N(mu, Sigma)
The joint distribution of a random vector X in R^d such that any linear combination a^T X is univariate Normal.
- Support: R^d.
- Parameters: mean vector mu in R^d, covariance matrix Sigma in R^(d x d) symmetric PSD.
- PDF (when Sigma invertible): f(x) = (1 / sqrt((2 pi)^d det Sigma)) exp(-(1/2) (x - mu)^T Sigma^(-1) (x - mu)).
- Mean: E[X] = mu.
- Covariance: Cov(X) = Sigma.
- Marginals: every sub-vector X_S is also multivariate Normal with mean mu_S and covariance Sigma_SS.
- Conditionals: X_A | X_B = x_B is Normal with mean mu_A + Sigma_AB Sigma_BB^(-1) (x_B - mu_B) and covariance Sigma_AA - Sigma_AB Sigma_BB^(-1) Sigma_BA. (This is the Gaussian conditioning formula, the engine inside Kalman filters and Gaussian processes.)
- Linear transform: if X ~ N(mu, Sigma), then AX + b ~ N(A mu + b, A Sigma A^T).
- Independence: components are independent iff Sigma is diagonal (Gaussian-specific: uncorrelatedness implies independence only when the variables are jointly Normal).
Use cases: Kalman filtering and smoothing (state vector + Gaussian noise); Gaussian processes for regression; factor models in finance; multivariate hypothesis testing (Hotelling T^2); generative models (deep latent-variable models with a Gaussian latent prior).
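In the bivariate case the conditioning formula reduces to scalar arithmetic, since Sigma_AB Sigma_BB^(-1) is a single ratio. The following sketch uses a hypothetical function name and made-up parameter values to show the mean shift and the variance reduction.

```python
def condition_bivariate_normal(mu_a, mu_b, s_aa, s_ab, s_bb, x_b):
    """Mean and variance of X_A | X_B = x_b for a 2D Normal (scalar blocks)."""
    gain = s_ab / s_bb                      # Sigma_AB Sigma_BB^(-1)
    cond_mean = mu_a + gain * (x_b - mu_b)  # shift by the observed deviation
    cond_var = s_aa - gain * s_ab           # Sigma_AA - Sigma_AB Sigma_BB^(-1) Sigma_BA
    return cond_mean, cond_var

# Illustrative values: mu = (0, 1), Sigma = [[4, 1.5], [1.5, 1]], observe X_B = 2.
cond_mean, cond_var = condition_bivariate_normal(0.0, 1.0, 4.0, 1.5, 1.0, 2.0)
print(cond_mean, cond_var)
```

The conditional variance never exceeds the marginal variance Sigma_AA and does not depend on the observed value, a feature specific to the Gaussian case.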
Exponential(lambda)
The continuous waiting time between events of a Poisson process. The continuous analog of the geometric distribution.
- Support: [0, infinity).
- Parameter: lambda > 0 (rate). Alternative parameterization uses scale theta = 1/lambda.
- PDF: f(x) = lambda exp(-lambda x).
- CDF: F(x) = 1 - exp(-lambda x).
- Mean: E[X] = 1/lambda.
- Variance: Var(X) = 1/lambda^2. (Standard deviation equals mean.)
- MGF: M(t) = lambda / (lambda - t) for t < lambda.
- Memoryless property: P(X > s + t | X > s) = P(X > t). The exponential is the only continuous distribution on [0, infinity) with this property.
- Convolution: sum_{i=1}^k X_i ~ Gamma(shape = k, scale = 1/lambda) for iid Exponentials.
Use cases: time between Poisson arrivals (calls, web requests, particle decays); radioactive decay (half-life is the median); reliability for components with a constant hazard rate; survival analysis baseline; thinning / inverse-CDF sampling primitive.
Gamma(k, theta)
Generalizes Exponential. Two common parameterizations:
- Shape-scale: k > 0 (shape), theta > 0 (scale).
  - PDF: f(x) = (1 / (Gamma(k) theta^k)) x^(k - 1) exp(-x/theta) for x > 0.
  - Mean: E[X] = k theta.
  - Variance: Var(X) = k theta^2.
- Shape-rate: alpha > 0 (shape), beta > 0 (rate), with theta = 1/beta.
  - PDF: f(x) = (beta^alpha / Gamma(alpha)) x^(alpha - 1) exp(-beta x).
  - Mean: alpha/beta. Variance: alpha/beta^2.
- MGF (shape-scale form): M(t) = (1 - theta t)^(-k) for t < 1/theta.
- Integer shape: Gamma(k, theta) with integer k is Erlang(k, theta), the sum of k iid Exponential(1/theta) random variables.
- Convolution: Gamma(k1, theta) + Gamma(k2, theta) = Gamma(k1 + k2, theta) (independent, same scale).
- Conjugate prior: Gamma is the conjugate prior for the rate of a Poisson likelihood; also for the precision (inverse variance) of a Normal with known mean.
Use cases: aggregated waiting times (sum of exponentials); rainfall amounts; insurance claim sizes; the conjugate prior to the Poisson rate; the precision parameter in Normal-Gamma hierarchical models.
Beta(alpha, beta)
The canonical distribution on [0, 1] — used for probabilities, proportions, and fractions.
- Support: [0, 1].
- Parameters: alpha > 0, beta > 0 (both “shape” parameters).
- PDF: f(x) = (1 / B(alpha, beta)) x^(alpha - 1) (1 - x)^(beta - 1), where B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha + beta).
- Mean: E[X] = alpha / (alpha + beta).
- Variance: Var(X) = alpha beta / ((alpha + beta)^2 (alpha + beta + 1)).
- Mode (for alpha, beta > 1): (alpha - 1) / (alpha + beta - 2).
- Special cases:
  - Beta(1, 1) = Uniform(0, 1).
  - Beta(1/2, 1/2) is the arcsine distribution (U-shaped, common in random-walk first-passage problems).
  - As alpha + beta -> infinity with mean held fixed, mass concentrates at the mean.
- Conjugate prior to Bernoulli and Binomial. Updating: prior Beta(alpha, beta), observe k successes in n trials, posterior Beta(alpha + k, beta + n - k). The hyperparameters (alpha, beta) act like pseudo-counts of successes and failures.
Use cases: prior over a probability or proportion (click-through rate, conversion rate); Thompson sampling for Bernoulli bandits (sample a probability from each arm’s Beta posterior, pick the arm with the largest sample); modeling proportions in ecology, voting shares, election outcomes; a flexible distribution shape on a bounded interval after rescaling.
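The Beta-Bernoulli update and Thompson sampling described above fit in a few lines. This is a minimal sketch rather than a tuned implementation: the arm success rates, the Beta(1, 1) priors, the seed, and the function names are all assumptions made for illustration.

```python
import random

def thompson_step(posteriors, rng):
    """Sample one probability per arm from its Beta posterior; pick the largest."""
    samples = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=samples.__getitem__)

def run_bandit(true_rates, steps, seed=0):
    rng = random.Random(seed)
    posteriors = [[1.0, 1.0] for _ in true_rates]  # Beta(1, 1) = uniform prior per arm
    pulls = [0] * len(true_rates)
    for _ in range(steps):
        arm = thompson_step(posteriors, rng)
        reward = rng.random() < true_rates[arm]      # simulated Bernoulli outcome
        posteriors[arm][0 if reward else 1] += 1.0   # pseudo-count update
        pulls[arm] += 1
    return posteriors, pulls

# Hypothetical arms with success probabilities 0.2, 0.5, 0.8.
posteriors, pulls = run_bandit([0.2, 0.5, 0.8], steps=2000)
print(pulls)
```

Each pull adds exactly one pseudo-count to the chosen arm, so the posterior parameters always sum to the prior counts plus the number of pulls.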
Chi-square(k)
The distribution of the sum of squares of k independent standard Normal random variables.
- Support: [0, infinity).
- Parameter: k > 0 (degrees of freedom, usually integer but generalizable).
- PDF: f(x) = (1 / (2^(k/2) Gamma(k/2))) x^(k/2 - 1) exp(-x/2).
- Mean: E[X] = k.
- Variance: Var(X) = 2k.
- MGF: M(t) = (1 - 2t)^(-k/2) for t < 1/2.
- Special case of Gamma: chi^2(k) = Gamma(shape = k/2, scale = 2).
- Convolution: independent chi^2(k1) + chi^2(k2) = chi^2(k1 + k2).
Use cases: distribution of the sample variance (scaled) under a Normal model: (n - 1) S^2 / sigma^2 ~ chi^2(n - 1); Pearson’s chi-square goodness-of-fit and independence tests; likelihood-ratio test statistics (asymptotically chi^2 under regularity, Wilks’ theorem); confidence intervals for variance.
t-distribution(nu)
The “Student’s t” distribution: a heavier-tailed substitute for the Normal when estimating the mean of a Normal sample with unknown variance.
- Support: all of R.
- Parameter: nu > 0 (degrees of freedom).
- PDF: f(t) = (Gamma((nu + 1)/2) / (sqrt(nu pi) Gamma(nu/2))) (1 + t^2/nu)^(-(nu + 1)/2).
- Mean: 0 for nu > 1, undefined for nu <= 1 (Cauchy at nu = 1).
- Variance: nu / (nu - 2) for nu > 2, infinite for 1 < nu <= 2, undefined for nu <= 1.
- Tail behavior: power-law tails f(t) ~ |t|^(-(nu + 1)). As nu -> infinity, t(nu) -> N(0, 1).
- Construction: if Z ~ N(0, 1) and V ~ chi^2(nu) independent, then T = Z / sqrt(V/nu) ~ t(nu).
Use cases: classical “one-sample t-test” and “two-sample t-test” for the mean when the population variance is unknown and estimated from the sample; confidence intervals for the mean of a Normal sample; robust regression priors (a t-prior on errors tolerates outliers far better than Normal); Bayesian inference where t marginal arises from integrating out an unknown variance.
F-distribution(d1, d2)
The ratio (suitably scaled) of two independent chi-squared random variables.
- Support: [0, infinity).
- Parameters: d1, d2 > 0 (numerator and denominator degrees of freedom).
- PDF: f(x) = (1/B(d1/2, d2/2)) (d1/d2)^(d1/2) x^(d1/2 - 1) (1 + (d1/d2) x)^(-(d1 + d2)/2).
- Construction: if U ~ chi^2(d1) and V ~ chi^2(d2) independent, then (U/d1) / (V/d2) ~ F(d1, d2).
- Mean: d2 / (d2 - 2) for d2 > 2.
- Reciprocal: if X ~ F(d1, d2), then 1/X ~ F(d2, d1).
Use cases: ANOVA F-test (ratio of between-group to within-group variance); F-test for nested linear models (full vs reduced); test for equality of two variances; the basis of significance testing in ordinary least-squares regression.
Log-normal(mu, sigma^2)
The distribution whose logarithm is Normal: if Y = log(X) ~ N(mu, sigma^2) then X is log-normal.
- Support: (0, infinity).
- PDF: f(x) = (1 / (x sigma sqrt(2 pi))) exp(-(log(x) - mu)^2 / (2 sigma^2)).
- Mean: E[X] = exp(mu + sigma^2 / 2). (Note: not exp(mu).)
- Variance: Var(X) = (exp(sigma^2) - 1) exp(2 mu + sigma^2).
- Median: exp(mu). Mode: exp(mu - sigma^2).
- Multiplicative CLT: products of many positive independent random variables are approximately log-normal (their logs are approximately Normal by the additive CLT).
Use cases: income and wealth distributions (right-skewed, positive); stock prices (geometric Brownian motion has log-normal marginals, the Black-Scholes assumption; the corresponding log-returns are Normal); particle size distributions; biological growth quantities; multiplicative noise models.
Cauchy(x_0, gamma)
The Lorentz distribution: a symmetric, heavy-tailed distribution with no defined mean or variance.
- Support: all of R.
- Parameters: x_0 in R (location, the median), gamma > 0 (scale, half-width at half-maximum).
- PDF: f(x) = (1 / (pi gamma)) (gamma^2 / ((x - x_0)^2 + gamma^2)).
- CDF: F(x) = (1/pi) arctan((x - x_0)/gamma) + 1/2.
- Mean and variance: undefined. The integral defining E[X] does not converge absolutely.
- Characteristic function (exists, even though MGF does not): phi(t) = exp(i x_0 t - gamma |t|).
- Construction: ratio of two independent standard Normals: N(0, 1) / N(0, 1) ~ Cauchy(0, 1).
- Stability: Cauchy is a stable distribution: sums of iid Cauchies are Cauchy (with rescaled parameters). The sample mean does not concentrate.
Use cases: resonance line shapes in spectroscopy and laser physics (Lorentzian profile); a stress-test for “robust” estimators (the median, not the mean, recovers the location); pathological-case prior to demonstrate the importance of finite moments.
Weibull(k, lambda)
A flexible distribution for positive quantities, generalizing the Exponential.
- Support: [0, infinity).
- Parameters: k > 0 (shape), lambda > 0 (scale).
- PDF: f(x) = (k/lambda) (x/lambda)^(k - 1) exp(-(x/lambda)^k).
- CDF: F(x) = 1 - exp(-(x/lambda)^k).
- Mean: E[X] = lambda Gamma(1 + 1/k).
- Variance: Var(X) = lambda^2 (Gamma(1 + 2/k) - (Gamma(1 + 1/k))^2).
- Hazard rate: h(x) = (k/lambda) (x/lambda)^(k - 1).
  - k < 1: decreasing hazard (infant mortality).
  - k = 1: constant hazard (Exponential).
  - k > 1: increasing hazard (wear-out failure).
  - Combining all three over a product lifetime gives the classic “bathtub curve”.
Use cases: reliability and survival analysis (lifetime of components); wind-speed distributions in renewable-energy modeling; particle-size distributions; material fatigue and breaking strength; extreme-value modeling for minima.
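The three hazard regimes can be seen by evaluating h(x) = (k/lambda) (x/lambda)^(k - 1) at a few points. A small sketch with lambda = 1 and illustrative shape values:

```python
def weibull_hazard(x: float, k: float, lam: float) -> float:
    """Hazard rate h(x) of Weibull(shape = k, scale = lam) for x > 0."""
    return (k / lam) * (x / lam) ** (k - 1)

xs = [0.5, 1.0, 2.0]
decreasing = [weibull_hazard(x, 0.5, 1.0) for x in xs]  # k < 1
constant = [weibull_hazard(x, 1.0, 1.0) for x in xs]    # k = 1 (Exponential)
increasing = [weibull_hazard(x, 2.0, 1.0) for x in xs]  # k > 1

print(decreasing)
print(constant)
print(increasing)
```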
Pareto(x_m, alpha)
The continuous power-law distribution.
- Support: [x_m, infinity) with x_m > 0.
- Parameters: x_m > 0 (scale / minimum), alpha > 0 (shape, “Pareto index”).
- PDF: f(x) = alpha x_m^alpha / x^(alpha + 1) for x >= x_m.
- CDF: F(x) = 1 - (x_m / x)^alpha.
- Mean: alpha x_m / (alpha - 1) for alpha > 1; infinite otherwise.
- Variance: x_m^2 alpha / ((alpha - 1)^2 (alpha - 2)) for alpha > 2; infinite for 1 < alpha <= 2.
- The 80/20 rule (Pareto principle): for alpha approximately 1.16 (exactly log 5 / log 4), the top 20% of the population holds about 80% of the total.
Use cases: wealth and income distributions (Vilfredo Pareto’s original observation); file-size distributions on the web; severity of insurance claims; city sizes (also Zipf); the continuous companion to Zipf for ranked-frequency data.
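The 80/20 figure can be reproduced from the Pareto Lorenz curve: for alpha > 1, integrating x f(x) above the (1 - q) quantile and dividing by the mean shows that the top fraction q of the population holds the share q^(1 - 1/alpha) of the total. The sketch below (function name illustrative) solves for the alpha that gives exactly 80/20.

```python
import math

def top_share(q: float, alpha: float) -> float:
    """Share of the total held by the top fraction q under Pareto(x_m, alpha), alpha > 1."""
    return q ** (1.0 - 1.0 / alpha)

# Solving 0.2 ** (1 - 1/alpha) = 0.8 gives alpha = log 5 / log 4.
alpha_8020 = math.log(5) / math.log(4)
share = top_share(0.2, alpha_8020)
print(alpha_8020, share)

# A thinner tail (larger alpha) is less concentrated.
print(top_share(0.2, 3.0))
```

The share does not depend on x_m, since x_m only rescales the distribution.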
Logistic(mu, s)
Symmetric and bell-shaped like the Normal, with heavier tails. CDF is the famous logistic / sigmoid function.
- Support: all of R.
- Parameters: mu in R (location), s > 0 (scale).
- PDF: f(x) = exp(-(x - mu)/s) / (s (1 + exp(-(x - mu)/s))^2).
- CDF: F(x) = 1 / (1 + exp(-(x - mu)/s)), the sigmoid.
- Mean: E[X] = mu.
- Variance: Var(X) = (s^2 pi^2) / 3.
- Excess kurtosis: 1.2 (heavier-tailed than Normal’s 0).
Use cases: foundation of logistic regression (the link function is the logit, the inverse of the logistic CDF; the latent-variable derivation has Logistic errors); softmax classifiers (multinomial generalization); growth curves in biology and adoption modeling.
Laplace (double-exponential)(mu, b)
Two mirrored Exponentials joined at the mean: a symmetric distribution with a sharp peak and exponentially decaying tails (heavier than Normal).
- Support: all of R.
- Parameters: mu in R (location), b > 0 (scale).
- PDF: f(x) = (1 / (2b)) exp(-|x - mu| / b).
- Mean: E[X] = mu.
- Variance: Var(X) = 2 b^2.
- MGF: M(t) = exp(mu t) / (1 - b^2 t^2) for |t| < 1/b.
Use cases: as a prior over regression coefficients, the Laplace yields the Lasso penalty (L1 regularization is the MAP estimate under a Laplace prior); the Laplace mechanism in differential privacy adds Laplace noise calibrated to query sensitivity; a heavier-tailed alternative to Normal for residuals that yields median regression in the MLE limit.
Rayleigh(sigma)
The distribution of the magnitude of a 2D Gaussian vector with iid components.
- Support: [0, infinity).
- PDF: f(x) = (x / sigma^2) exp(-x^2 / (2 sigma^2)).
- Mean: E[X] = sigma sqrt(pi / 2).
- Variance: Var(X) = ((4 - pi)/2) sigma^2.
- Construction: if X, Y ~ N(0, sigma^2) independent then sqrt(X^2 + Y^2) ~ Rayleigh(sigma).
- Special case of Weibull: Rayleigh(sigma) = Weibull(k = 2, lambda = sigma sqrt(2)).
Use cases: magnitude of complex Gaussian noise in communications (envelope of narrow-band noise); MRI image noise in magnitude images; wind speed (one of several candidate models, alongside Weibull).
Rice(nu, sigma)
A non-central generalization of Rayleigh: magnitude of a 2D Gaussian vector with non-zero mean.
- PDF: f(x) = (x / sigma^2) exp(-(x^2 + nu^2)/(2 sigma^2)) I_0(x nu / sigma^2) for x >= 0, where I_0 is the modified Bessel function of the first kind of order 0.
- Reduces to Rayleigh when nu = 0.
Use cases: signal-plus-noise envelope in communications (Rician fading channels); MRI magnitude images at low SNR (Rayleigh is the high-noise limit, Rice corrects for signal).
Inverse-Gamma(alpha, beta)
If X ~ Gamma(alpha, beta) (shape-rate form), then 1/X ~ InverseGamma(alpha, beta).
- PDF: f(x) = (beta^alpha / Gamma(alpha)) x^(-alpha - 1) exp(-beta / x) for x > 0.
- Mean: beta / (alpha - 1) for alpha > 1.
- Variance: beta^2 / ((alpha - 1)^2 (alpha - 2)) for alpha > 2.
Use cases: the conjugate prior for the variance sigma^2 of a Normal likelihood with known mean. Together with a Normal prior on the mean, this gives the Normal-Inverse-Gamma (NIG) joint prior with closed-form Bayesian updating for the full Normal model.
Dirichlet(alpha_1, …, alpha_K)
The multivariate generalization of Beta: a distribution over the (K-1)-simplex (vectors of non-negative components summing to 1).
- Support: { p = (p_1, ..., p_K) : p_i >= 0, sum_i p_i = 1 }.
- Parameters: alpha_i > 0.
- PDF: f(p) = (1/B(alpha)) prod_i p_i^(alpha_i - 1), where B(alpha) = (prod_i Gamma(alpha_i)) / Gamma(sum_i alpha_i) is the multivariate beta function.
- Mean: E[p_i] = alpha_i / (sum_j alpha_j).
- Variance: Var(p_i) = (alpha_i (alpha_0 - alpha_i)) / (alpha_0^2 (alpha_0 + 1)) where alpha_0 = sum_j alpha_j.
- Marginals: each p_i is marginally Beta(alpha_i, alpha_0 - alpha_i).
- Aggregation: merging two components i, j into i + j yields a Dirichlet with parameters (alpha_1, ..., alpha_i + alpha_j, ..., alpha_K).
- Conjugate prior to the Multinomial likelihood.
Use cases: topic models (Latent Dirichlet Allocation: per-document topic mixture is Dirichlet, per-topic word mixture is Dirichlet); priors over discrete probability vectors; population-genetics allele-frequency models; the basis of Dirichlet processes (nonparametric Bayesian).
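A standard way to draw from a Dirichlet is to normalize independent Gamma(alpha_i, 1) variables. The sketch below combines that with the Dirichlet-Multinomial conjugate update and compares a Monte Carlo mean against alpha_i / alpha_0. The prior, the counts, the seed, and the function names are hypothetical illustration values.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def dirichlet_posterior(alphas, counts):
    """Conjugate update: add the observed category counts to the prior parameters."""
    return [a + k for a, k in zip(alphas, counts)]

rng = random.Random(0)
prior = [1.0, 1.0, 1.0]  # flat prior on the 2-simplex
counts = [12, 5, 3]      # hypothetical observed category counts
posterior = dirichlet_posterior(prior, counts)

draws = [sample_dirichlet(posterior, rng) for _ in range(20_000)]
mc_mean = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
exact_mean = [a / sum(posterior) for a in posterior]
print(posterior, mc_mean, exact_mean)
```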
Wishart + Inverse-Wishart
Distributions over symmetric positive-definite matrices. The matrix-valued generalization of (Inverse-)Gamma.
- Wishart(V, nu): if Y_i in R^d are iid N(0, V) for i = 1, ..., nu, then S = sum_i Y_i Y_i^T ~ Wishart(V, nu). Used to model sample covariance matrices.
- Inverse-Wishart(Psi, nu): if S ~ Wishart(V, nu), then S^(-1) ~ Wishart^(-1)(V^(-1), nu), that is, Psi = V^(-1). The conjugate prior for the covariance matrix Sigma of a multivariate Normal likelihood with known mean.
PDFs and full parameterizations are in Johnson-Kotz-Balakrishnan and Bishop Appendix B.
Use cases: Bayesian inference over covariance matrices (multivariate Normal with unknown covariance — use Inverse-Wishart prior); factor analysis priors; multivariate hierarchical models. Be aware: the Inverse-Wishart prior is notoriously inflexible (single concentration parameter nu) — the LKJ prior on the correlation matrix combined with a separate scale prior is often preferred in modern Bayesian software (Stan, PyMC).
von Mises (circular Normal)(mu, kappa)
The Normal-analog on the circle: a unimodal distribution for angular data.
- Support: [-pi, pi) (or any interval of length 2 pi).
- Parameters: mu in [-pi, pi) (mean direction), kappa >= 0 (concentration, the analog of inverse variance; kappa = 0 is uniform on the circle).
- PDF: f(theta) = exp(kappa cos(theta - mu)) / (2 pi I_0(kappa)), where I_0 is the modified Bessel function of the first kind of order 0.
- Mean direction: mu.
- Circular variance: 1 - I_1(kappa)/I_0(kappa).
Use cases: any angular measurement (wind direction, compass bearings, time-of-day as an angle modulo 24h, phase of an oscillator, robot orientation); the foundational distribution for directional statistics. The Bingham, Kent, and matrix-Fisher distributions generalize to spheres and rotation groups (used in robotics for SO(3) — see bayesian-estimation).
Beta-prime / inverted-Beta(alpha, beta)
If X ~ Beta(alpha, beta), then Y = X / (1 - X) ~ Beta'(alpha, beta) on (0, infinity).
- PDF: f(y) = y^(alpha - 1) (1 + y)^(-(alpha + beta)) / B(alpha, beta).
- Mean: alpha / (beta - 1) for beta > 1.
Use cases: odds-ratio distributions (since Y = X/(1-X) is the odds); priors over ratios; specialty distribution in Bayesian variance modeling.
Skew-normal(xi, omega, alpha)
Adds a skewness parameter alpha to the Normal, allowing asymmetric bell shapes.
- PDF: f(x) = (2 / omega) phi((x - xi)/omega) Phi(alpha (x - xi)/omega), where phi, Phi are the standard Normal PDF and CDF.
- Recovers Normal when alpha = 0.
- Right-skewed when alpha > 0, left-skewed when alpha < 0.
Use cases: modeling moderately skewed continuous data where the log-transform is too aggressive (and would over-correct); regression residuals that exhibit asymmetry; financial returns.
Mixture distributions
A finite (or infinite) weighted sum of component densities:
f(x) = sum_i w_i f_i(x), with w_i >= 0 and sum_i w_i = 1.
Each f_i can be any distribution. The most-used special case is the Gaussian Mixture Model (GMM): f(x) = sum_i w_i N(x; mu_i, Sigma_i). Pearson (1894) introduced GMMs to model crab measurements; the modern Expectation-Maximization (EM) algorithm fits them by iterating between soft-assignment of points to components (E-step) and maximum-likelihood re-estimation of (w_i, mu_i, Sigma_i) (M-step).
Properties:
- Mixtures can approximate any continuous density arbitrarily well (universal approximators, given enough components).
- Identifiability is a concern: many parameter settings produce the same mixture (label-switching).
- Bayesian variants use Dirichlet priors on `w_i`; nonparametric Bayes (Dirichlet processes) allows the number of components to be inferred from data.
Use cases: density estimation; clustering (GMM is “soft k-means” with full covariances); speaker recognition (each speaker = a GMM); background subtraction in video; segmentation.
Heavy-tailed family — Cauchy, Pareto, Lévy stable
A distribution is heavy-tailed if its right tail decays more slowly than any exponential, i.e. lim_{x -> infinity} exp(lambda x) (1 - F(x)) = infinity for all lambda > 0. Equivalently, the MGF E[exp(tX)] is infinite for every t > 0, so it does not exist in any neighborhood of 0. Important heavy-tailed families:
- Power-law tails: `1 - F(x) ~ c x^(-alpha)` as `x -> infinity`. Examples: Pareto, Cauchy (`alpha = 1`), Student’s t (`alpha = nu`).
- Lévy stable distributions: a 4-parameter family `S(alpha, beta, gamma, delta)` with stability parameter `alpha in (0, 2]`. The Normal is `alpha = 2`, the Cauchy is `alpha = 1` with `beta = 0`, and the Lévy distribution is `alpha = 1/2` with `beta = 1`. Stable distributions are the only possible limits in distribution of centered and rescaled sums of iid random variables (the generalized CLT, which drops the finite-variance assumption).
Use cases: financial returns (well-documented heavier-than-Gaussian tails: market crashes happen far more often than Normal predicts); insurance loss distributions; failure-time data; network traffic volumes; word and city size distributions.
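To make the definition concrete, the following sketch evaluates `exp(lambda x) (1 - F(x))` for a Cauchy and a Normal tail and compares a few tail probabilities using the survival functions in `scipy.stats`. The evaluation points and `lam = 1.0` are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

x = np.array([5.0, 10.0, 20.0])
lam = 1.0  # any lambda > 0 appears in the definition; 1.0 is illustrative

# exp(lambda x) * (1 - F(x)): keeps growing for a heavy tail,
# shrinks toward 0 for a light (Gaussian) tail.
cauchy_scaled = np.exp(lam * x) * stats.cauchy.sf(x)
normal_scaled = np.exp(lam * x) * stats.norm.sf(x)

# Tail probabilities P(X > 6) for increasingly heavy tails.
tail_at_6 = {
    "normal": stats.norm.sf(6.0),
    "student_t_nu3": stats.t.sf(6.0, df=3),
    "cauchy": stats.cauchy.sf(6.0),
}

# Pareto with x_m = 1: the survival function is exactly x^(-alpha).
pareto_sf = stats.pareto.sf(10.0, 2.0)  # alpha = 2

print(cauchy_scaled, normal_scaled, tail_at_6, pareto_sf)
```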
Generalized Extreme Value (GEV) + Generalized Pareto
The two pillars of extreme-value theory (EVT): asymptotic distributions of maxima and of excesses over a threshold.
Fisher-Tippett-Gnedenko theorem: if M_n = max(X_1, ..., X_n) for iid X_i and there exist sequences a_n, b_n > 0 such that (M_n - a_n)/b_n converges in distribution to a non-degenerate G, then G is one of three types: Gumbel, Frechet, or Weibull (the “extreme-value families”). The Generalized Extreme Value (GEV) distribution unifies them via a shape parameter xi:
- PDF: `f(x) = (1/sigma) [1 + xi(x - mu)/sigma]^(-1/xi - 1) exp(-[1 + xi(x - mu)/sigma]^(-1/xi))` for `1 + xi(x - mu)/sigma > 0`.
- `xi > 0`: Frechet (heavy tail).
- `xi = 0`: Gumbel (light tail; take limit).
- `xi < 0`: reverse Weibull (bounded tail).
Pickands-Balkema-de Haan theorem: the conditional distribution of excesses over a high threshold u (i.e. X - u | X > u) is approximately Generalized Pareto: f(x) = (1/sigma)(1 + xi x/sigma)^(-1/xi - 1).
Use cases: flood and rainfall return-period modeling; wind speed extremes for structural engineering; hurricane intensity tails; financial Value-at-Risk and Expected Shortfall; insurance catastrophe modeling; reliability of safety-critical systems.
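The GEV and Generalized Pareto densities above can be compared against `scipy.stats.genextreme` and `scipy.stats.genpareto`. One detail worth knowing when doing so: SciPy's `genextreme` shape parameter `c` has the opposite sign to `xi` as used here, while `genpareto` uses the same sign. The helper names `gev_pdf` and `gpd_pdf` and the parameter values are assumptions for this sketch.

```python
import numpy as np
from scipy import stats

def gev_pdf(x, xi, mu=0.0, sigma=1.0):
    """GEV density from the formula above (xi != 0, x inside the support)."""
    s = 1.0 + xi * (x - mu) / sigma
    return (1.0 / sigma) * s ** (-1.0 / xi - 1.0) * np.exp(-s ** (-1.0 / xi))

def gpd_pdf(x, xi, sigma=1.0):
    """Generalized Pareto density for excesses x >= 0 (xi != 0)."""
    return (1.0 / sigma) * (1.0 + xi * x / sigma) ** (-1.0 / xi - 1.0)

xi, mu, sigma = 0.2, 1.0, 2.0       # illustrative Frechet-type case
x = np.linspace(-5.0, 30.0, 141)    # every point has 1 + xi (x - mu)/sigma > 0

# Sign flip: scipy.stats.genextreme takes shape c = -xi.
gev_scipy = stats.genextreme.pdf(x, -xi, loc=mu, scale=sigma)

# scipy.stats.genpareto takes c = xi (same sign as the text).
excess = np.linspace(0.0, 30.0, 121)
gpd_scipy = stats.genpareto.pdf(excess, xi, scale=sigma)

gev_match = np.allclose(gev_pdf(x, xi, mu, sigma), gev_scipy)
gpd_match = np.allclose(gpd_pdf(excess, xi, sigma), gpd_scipy)
print(gev_match, gpd_match)
```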
Half-normal / half-t / truncated-normal
When the underlying quantity is nonnegative (a magnitude, a variance, a scale parameter), truncated and folded Normals are natural.
- Half-normal(sigma): PDF `f(x) = sqrt(2/(pi sigma^2)) exp(-x^2 / (2 sigma^2))` for `x >= 0`. Mean `sigma sqrt(2/pi)`, variance `sigma^2 (1 - 2/pi)`.
- Half-t(nu, sigma): same idea but t-tailed; Gelman recommends it as a weakly informative prior on variance / standard-deviation parameters in Bayesian hierarchical models.
- Truncated-Normal: restrict `N(mu, sigma^2)` to `[a, b]`. PDF: `phi((x - mu)/sigma) / (sigma (Phi((b - mu)/sigma) - Phi((a - mu)/sigma)))` for `x in [a, b]`.
Use cases: priors on standard deviations and scale parameters; censored regression and Tobit models; rejection-sampling helpers.
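A short check of the half-normal moments and the truncated-Normal density against SciPy. Note that `scipy.stats.truncnorm` expects its bounds in standardized units, `(a - mu)/sigma` and `(b - mu)/sigma`, rather than on the original scale. The helper name `truncnorm_pdf` and all numeric values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

sigma = 1.7  # illustrative scale
half = stats.halfnorm(scale=sigma)
half_mean_closed = sigma * np.sqrt(2.0 / np.pi)
half_var_closed = sigma**2 * (1.0 - 2.0 / np.pi)

def truncnorm_pdf(x, mu, sigma, a, b):
    """Truncated-Normal density on [a, b] from the formula above."""
    z = (x - mu) / sigma
    mass = stats.norm.cdf((b - mu) / sigma) - stats.norm.cdf((a - mu) / sigma)
    return stats.norm.pdf(z) / (sigma * mass)

mu, a, b = 0.5, 0.0, 3.0
x = np.linspace(a, b, 61)

# SciPy wants the truncation bounds standardized by loc and scale.
a_std, b_std = (a - mu) / sigma, (b - mu) / sigma
trunc_scipy = stats.truncnorm.pdf(x, a_std, b_std, loc=mu, scale=sigma)
trunc_manual = truncnorm_pdf(x, mu, sigma, a, b)

print(half.mean(), half_mean_closed, half.var(), half_var_closed)
print(np.allclose(trunc_manual, trunc_scipy))
```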
Stable distribution
The family of distributions closed under convolution (up to rescaling). The Normal, Cauchy, and Lévy distributions are members. Characterized via characteristic function:
phi(t) = exp(i delta t - |gamma t|^alpha (1 - i beta sign(t) Phi(alpha, t)))
with stability index alpha in (0, 2], skewness beta in [-1, 1], scale gamma > 0, location delta in R, and Phi(alpha, t) = tan(pi alpha / 2) for alpha != 1, Phi(alpha, t) = -(2/pi) log|t| for alpha = 1. Only alpha = 2 (Normal), alpha = 1, beta = 0 (Cauchy), and alpha = 1/2, beta = 1 (Lévy) have densities expressible in elementary closed form.
Key property: by the generalized CLT, sums of iid random variables whose tails follow a power law with index alpha < 2 (and which therefore have infinite variance) converge, after centering and rescaling, to a stable law with that same alpha.
Use cases: financial-return modeling (Mandelbrot’s original application); modeling aggregate insurance losses; pure-mathematics scaling-limit constructions.
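Closure under convolution can be illustrated with the three closed-form members: for iid `X1, X2` with stability index `alpha` and location 0, `(X1 + X2) / 2^(1/alpha)` has the same law as `X1`. The sketch below checks this by simulation with `scipy.stats.norm`, `cauchy`, and `levy`; the helper name and sample size are assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200_000

def ks_after_rescaled_sum(dist, alpha):
    """KS distance between (X1 + X2) / 2^(1/alpha) and the law of X1."""
    x1 = dist.rvs(size=n, random_state=rng)
    x2 = dist.rvs(size=n, random_state=rng)
    rescaled = (x1 + x2) / 2.0 ** (1.0 / alpha)
    return stats.kstest(rescaled, dist.cdf).statistic

ks = {
    "normal (alpha=2)": ks_after_rescaled_sum(stats.norm, 2.0),
    "cauchy (alpha=1)": ks_after_rescaled_sum(stats.cauchy, 1.0),
    "levy (alpha=1/2)": ks_after_rescaled_sum(stats.levy, 0.5),
}
print(ks)
```

All three KS distances should be close to zero; using the wrong exponent (for example dividing Cauchy sums by `sqrt(2)`) produces a visibly larger distance.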
Distribution selection by use case
A practical lookup. When in doubt, start with the canonical choice and only deviate if the data forces it.
- Coin flip / single yes-no → Bernoulli.
- Count of successes in `n` trials → Binomial; → Beta-Binomial if overdispersed.
- Count of events per fixed time/area → Poisson; → Negative Binomial if overdispersed (var > mean).
- Time to first event → Exponential (constant hazard); → Weibull (changing hazard); → Log-normal (multiplicative growth processes).
- Time to `r`-th event → Gamma / Erlang.
- Probability or proportion → Beta.
- Bounded continuous quantity on `[a, b]` → Beta after rescaling to `[0, 1]`.
- Normal residuals (CLT-justified, sum of many small noises) → Normal.
- Small-sample mean inference with unknown variance → t.
- Sample variance inference → chi-square.
- ANOVA / equal-variance hypothesis testing → F.
- Direction / angle on the circle → von Mises (or Bingham, Kent for higher dimensions).
- Heavy-tailed financial returns → Cauchy / Student’s t with small `nu` / Pareto / stable.
- Extreme value (annual max, threshold exceedance) → GEV (for maxima) / Generalized Pareto (for exceedances).
- Multinomial probabilities → Dirichlet prior.
- Covariance matrix → Inverse-Wishart prior (or LKJ + half-Cauchy for modern Bayes).
- Positive-only continuous quantity, no special structure → Gamma (flexible scale family) or Log-normal (multiplicative).
- Multimodal continuous data → Gaussian mixture.
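The "var > mean" rule for count data in the list above can be turned into a quick screening check. The helper `suggest_count_model` and its `tolerance` threshold are hypothetical, chosen only for illustration; a formal overdispersion test would be more appropriate for borderline cases.

```python
import numpy as np

def suggest_count_model(counts, tolerance=0.2):
    """Hypothetical helper: Poisson vs Negative Binomial from the dispersion index."""
    counts = np.asarray(counts, dtype=float)
    dispersion = counts.var(ddof=1) / counts.mean()  # about 1 for Poisson data
    model = "Negative Binomial" if dispersion > 1.0 + tolerance else "Poisson"
    return model, dispersion

rng = np.random.default_rng(4)
poisson_counts = rng.poisson(4.0, size=50_000)
# Negative Binomial with r = 2, p = 0.3: var/mean = 1/p, about 3.33.
nb_counts = rng.negative_binomial(2, 0.3, size=50_000)

poisson_model, poisson_disp = suggest_count_model(poisson_counts)
nb_model, nb_disp = suggest_count_model(nb_counts)
print(poisson_model, poisson_disp)
print(nb_model, nb_disp)
```

The sketch does not handle underdispersion (var < mean); it simply reports "Poisson" in that case.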
Conjugate prior summary table
Conjugate prior-likelihood pairs underlie almost all closed-form Bayesian inference. The pattern: the prior and posterior live in the same family; the likelihood’s sufficient statistics update the prior’s hyperparameters in a simple closed form.
| Likelihood | Parameter | Conjugate prior | Posterior update |
|---|---|---|---|
| Bernoulli(p) | p | Beta(alpha, beta) | Beta(alpha + sum x_i, beta + n - sum x_i) |
| Binomial(n, p) | p | Beta(alpha, beta) | Beta(alpha + k, beta + n - k) |
| Geometric(p) | p | Beta(alpha, beta) | Beta(alpha + n, beta + sum x_i - n) (trials form, x_i >= 1; in the failures form the second parameter is beta + sum x_i) |
| Negative Binomial(r, p), r fixed | p | Beta(alpha, beta) | Beta(alpha + nr, beta + sum x_i) |
| Poisson(lambda) | lambda | Gamma(alpha, beta) | Gamma(alpha + sum x_i, beta + n) |
| Multinomial(n, p) | p | Dirichlet(alpha) | Dirichlet(alpha + k) |
| Exponential(lambda) | lambda | Gamma(alpha, beta) | Gamma(alpha + n, beta + sum x_i) |
| Gamma(k, theta), k known | rate = 1/theta | Gamma(alpha, beta) | Gamma(alpha + nk, beta + sum x_i) |
| Normal(mu, sigma^2), sigma^2 known | mu | Normal(mu_0, tau_0^2) | Normal(precision-weighted mean, updated precision) |
| Normal(mu, sigma^2), mu known | sigma^2 | Inverse-Gamma(alpha, beta) | Inv-Gamma(alpha + n/2, beta + (1/2) sum (x_i - mu)^2) |
| Normal(mu, sigma^2), both unknown | (mu, sigma^2) | Normal-Inverse-Gamma | Closed-form NIG update |
| Multivariate Normal, Sigma known | mu | Multivariate Normal | Multivariate Normal |
| Multivariate Normal, mu known | Sigma | Inverse-Wishart | Inverse-Wishart |
| Uniform(0, theta) | theta | Pareto(x_m, alpha) | Pareto(max(x_m, max x_i), alpha + n) |
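For derivations, see bayesian-inference and Bishop PRML chapter 2. Gamma(alpha, beta) priors in the table use the shape-rate parameterization, which is why beta is updated by adding n or sum x_i.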
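Three rows of the table written out as code may help make the update pattern concrete. The function names are hypothetical, and the Normal row is expanded using the standard precision-weighted formulas that the table only names.

```python
import numpy as np

def beta_bernoulli_update(alpha, beta, x):
    """Beta(alpha, beta) prior + Bernoulli data -> Beta posterior parameters."""
    x = np.asarray(x)
    k = int(x.sum())                      # number of successes
    return alpha + k, beta + x.size - k

def gamma_poisson_update(alpha, beta, x):
    """Gamma(shape=alpha, rate=beta) prior + Poisson counts -> Gamma posterior."""
    x = np.asarray(x)
    return alpha + int(x.sum()), beta + x.size

def normal_known_variance_update(mu0, tau0_sq, sigma_sq, x):
    """Normal prior on mu with known sigma^2 -> (posterior mean, posterior variance)."""
    x = np.asarray(x, dtype=float)
    post_precision = 1.0 / tau0_sq + x.size / sigma_sq
    post_mean = (mu0 / tau0_sq + x.sum() / sigma_sq) / post_precision
    return post_mean, 1.0 / post_precision

# Beta(2, 2) prior, 6 successes in 8 trials -> Beta(8, 4).
print(beta_bernoulli_update(2, 2, [1, 0, 1, 1, 0, 1, 1, 1]))
# Gamma(3, 1) prior, counts summing to 14 over 4 observations -> Gamma(17, 5).
print(gamma_poisson_update(3, 1, [2, 4, 3, 5]))
# N(0, 4) prior, sigma^2 = 1, three observations.
print(normal_known_variance_update(0.0, 4.0, 1.0, [1.0, 2.0, 3.0]))
```

In each case the data enter only through sufficient statistics (`sum x_i` and `n`), which is what makes the update closed-form.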
MGF / characteristic function summary
| Distribution | MGF M(t) | Characteristic function phi(t) |
|---|---|---|
| Bernoulli(p) | (1 - p) + p e^t | (1 - p) + p e^(it) |
| Binomial(n, p) | ((1 - p) + p e^t)^n | ((1 - p) + p e^(it))^n |
| Geometric(p) (trials form) | p e^t / (1 - (1 - p) e^t) | p e^(it) / (1 - (1 - p) e^(it)) |
| Negative Binomial(r, p) | (p / (1 - (1 - p) e^t))^r | analog |
| Poisson(lambda) | exp(lambda (e^t - 1)) | exp(lambda (e^(it) - 1)) |
| Multinomial(n, p) | (sum_i p_i e^(t_i))^n | analog |
| Uniform(a, b) | (e^(bt) - e^(at)) / (t (b - a)) | (e^(ibt) - e^(iat)) / (it (b - a)) |
| Normal(mu, sigma^2) | exp(mu t + sigma^2 t^2 / 2) | exp(i mu t - sigma^2 t^2 / 2) |
| Exponential(lambda) | lambda / (lambda - t) for t < lambda | lambda / (lambda - it) |
| Gamma(k, theta) | (1 - theta t)^(-k) for t < 1/theta | (1 - i theta t)^(-k) |
| Chi-square(k) | (1 - 2t)^(-k/2) for t < 1/2 | (1 - 2it)^(-k/2) |
| Laplace(mu, b) | exp(mu t) / (1 - b^2 t^2) for \|t\| < 1/b | exp(i mu t) / (1 + b^2 t^2) |
| Cauchy(x_0, gamma) | does not exist | exp(i x_0 t - gamma \|t\|) |
| Pareto(x_m, alpha) | does not exist for t > 0 | given via incomplete gamma |
| Logistic(mu, s) | exp(mu t) B(1 - s t, 1 + s t) for \|s t\| < 1 | exp(i mu t) pi s t / sinh(pi s t) |
MGFs uniquely determine distributions when they exist on an open neighborhood of 0. Characteristic functions always exist and always uniquely determine the distribution; the Lévy continuity theorem links pointwise convergence of characteristic functions to convergence in distribution.
Software
- Python: `scipy.stats` provides over 90 distributions with consistent `pdf`/`pmf`, `cdf`, `ppf` (inverse CDF), `rvs` (random sampling), `fit` (MLE fitting), and `entropy`. `numpy.random` provides high-throughput samplers. Probabilistic-programming libraries: PyMC and NumPyro wrap distributions for Bayesian inference; TensorFlow Probability and Pyro (PyTorch) for differentiable distributions in deep-learning pipelines.
- R: base `stats` package has `d`/`p`/`q`/`r` family functions (`dnorm`, `pnorm`, `qnorm`, `rnorm`, etc.) for the standard distributions. `MASS` adds multivariate and specialty families. Stan (via `rstan` or `cmdstanr`) provides Bayesian inference.
- Julia: `Distributions.jl` is a comprehensive, fast, well-typed library covering 50+ distributions with consistent API. `Turing.jl` for Bayesian inference.
- MATLAB: Statistics and Machine Learning Toolbox.
- C++: `Boost.Math` and `<random>` in the standard library.
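Practical fitting tip: `scipy.stats.<distribution>.fit(data)` returns MLE parameter estimates; combine with `scipy.stats.kstest` or `scipy.stats.anderson` for goodness-of-fit testing. The standard KS p-value assumes a fully specified null distribution, so it is too optimistic when the parameters were fitted on the same data.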
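A minimal version of the fit-then-test workflow, assuming synthetic Gamma data and a location pinned at zero via `floc=0`; the true parameters and sample size are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.gamma(shape=3.0, scale=2.0, size=20_000)  # synthetic sample

# MLE fit; floc=0 pins the location so only shape and scale are estimated.
shape_hat, loc_hat, scale_hat = stats.gamma.fit(data, floc=0)

# KS test against the fitted distribution; treat the p-value as a rough
# guide only, since the parameters came from the same data.
ks = stats.kstest(data, "gamma", args=(shape_hat, loc_hat, scale_hat))
print(shape_hat, scale_hat, ks.statistic, ks.pvalue)
```

Fixing the location is a design choice for positive-only data: leaving `loc` free lets the optimizer shift the support and often produces unstable estimates.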
Pitfalls
- Confusing PMF and PDF. `P(X = x) = 0` for any continuous distribution; the PDF is a density, not a probability. `f(x) > 1` is fine (the integral is what must equal 1).
- Assuming mean equals variance. True for Poisson, false in general. Check empirically before applying the Poisson; if `var > mean`, switch to Negative Binomial.
- Fitting Normal to right-skewed data. Income, prices, sizes, durations are almost never Normal. Log-transform first (yielding Log-normal) or use Gamma / Weibull directly.
- Using Exponential where Weibull is needed. Exponential assumes constant hazard; many physical components follow a bathtub curve (decreasing → constant → increasing hazard). Weibull captures each regime through its shape parameter `k` (`k < 1` decreasing, `k = 1` constant, `k > 1` increasing), although a single Weibull cannot reproduce the whole bathtub.
- Forgetting Bessel’s correction (`n - 1` divisor in sample variance). The sample variance with `n` in the denominator is the MLE under Normal but is biased downward; dividing by `n - 1` gives the unbiased estimator. Most software uses `n - 1` by default, but NumPy’s `np.var` defaults to `n`; pass `ddof=1` for the unbiased version.
- MLE of Normal variance is biased. Same point as above: `sigma_MLE^2 = (1/n) sum (x_i - x_bar)^2` has bias `-sigma^2/n`. Bayesian inference sidesteps the issue by carrying parameter uncertainty in the posterior; REML and Bessel-corrected estimators are frequentist fixes.
- Naive use of the Cauchy “mean”. The arithmetic mean of iid Cauchy samples is itself Cauchy, so averaging does not reduce the spread. Use the sample median (which does concentrate, at rate `1/sqrt(n)`).
- Conflating “no correlation” with “independence”. Equivalent under joint normality but not in general. Many ML papers slip on this; check the multivariate-Normal section for the precise statement.
- Improper priors silently giving improper posteriors. Especially with flat or `1/sigma`-type priors on scale parameters in hierarchical models; always verify the posterior is proper. Proper priors such as half-Cauchy or half-t avoid the problem.
- Negative Binomial parameterization confusion. SciPy, R, NumPy, and textbooks all use slightly different conventions (success-count `r` vs failure-count, probability `p` vs odds, mean-dispersion form vs `(r, p)` form). Always read the docs before plugging in values.
- Beta-Binomial vs Binomial confusion in A/B testing. Pure Binomial assumes a single fixed `p`; Beta-Binomial allows `p` to vary per session/user. The two give very different uncertainty intervals.
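Two of the pitfalls above are easy to see numerically: the `np.var` default divisor, and the failure of the Cauchy sample mean to concentrate. The replication counts and the `iqr` helper are illustrative assumptions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
var_mle = np.var(x)               # divides by n:     5 / 4
var_unbiased = np.var(x, ddof=1)  # divides by n - 1: 5 / 3

rng = np.random.default_rng(6)
# 2000 replications of n = 1000 standard Cauchy draws.
samples = rng.standard_cauchy(size=(2000, 1000))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

def iqr(v):
    """Interquartile range, a spread measure that exists even for Cauchy data."""
    q25, q75 = np.percentile(v, [25, 75])
    return q75 - q25

# A single standard Cauchy draw has IQR 2. The sample means keep roughly
# that spread, while the sample medians concentrate near 0.
print(var_mle, var_unbiased)
print(iqr(means), iqr(medians))
```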
Cross-references
- probability-fundamentals — sigma-algebras, expectation, CLT, conditional probability (the machinery this catalog builds on).
- bayesian-inference — conjugate updates, posterior derivations, MCMC, variational methods (consumer of this catalog’s prior pairs).
- hypothesis-testing-mle — sampling distributions (t, F, chi-square) used in classical inference; MLE properties.
- mcmc-sampling — for posteriors that are not closed-form (no conjugate prior).
- information-theory — entropy and divergences for these distributions.
- reliability-engineering — Weibull bathtub-curve modeling, MTBF.
- six-sigma — Normal-based capability indices, process control.
- bayesian-estimation — multivariate Normal in Kalman filters, von Mises / Bingham on SO(3).
Citations
- Johnson, N. L., Kotz, S., Balakrishnan, N. (1994, 1995). Continuous Univariate Distributions, Volumes 1 and 2, 2nd edition. Wiley. The canonical encyclopedic reference for parametric families on the real line.
- Johnson, N. L., Kemp, A. W., Kotz, S. (2005). Univariate Discrete Distributions, 3rd edition. Wiley. Companion for discrete families.
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. Compact, modern survey of distributions and the inference machinery that uses them.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Appendix B gives one-page summaries of all distributions used in the main text, with PDFs, moments, and conjugacy.
- Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D. (2013). Bayesian Data Analysis, 3rd edition. CRC. Practical guide to prior choices and conjugate vs non-conjugate inference.
- SciPy contributors. `scipy.stats` reference documentation. https://docs.scipy.org/doc/scipy/reference/stats.html (authoritative, code-checkable distribution catalog).
- Embrechts, P., Klueppelberg, C., Mikosch, T. (1997). Modelling Extremal Events for Insurance and Finance. Springer. The reference for GEV, Generalized Pareto, and heavy-tailed practice.
- Mardia, K. V., Jupp, P. E. (2000). Directional Statistics. Wiley. The reference for von Mises, Bingham, and related circular/spherical distributions.