Probability Distributions Reference

At a glance

This note is the companion catalog to probability-fundamentals. Where that note develops the machinery (sigma-algebras, expectation, conditional probability, CLT, convergence), this one is the bestiary: a reference of the common parametric families, with PMF or PDF, moments, moment-generating / characteristic functions, conjugate priors, and the canonical real-world use cases.

Use this document as a lookup. When you need to model a real phenomenon (count of events, waiting time, proportion, angle, extreme), scan section 42 (Distribution selection by use case) for the right family, then jump to the dedicated section for formulas and properties. Section 43 collects the conjugate-prior pairs in one table for Bayesian work; section 44 collects MGF / characteristic functions.

A few conventions used throughout:

PMF denotes p(k) = P(X = k) for discrete X; PDF denotes f(x) for continuous X with P(X in A) = integral_A f(x) dx.
E[X] is the mean, Var(X) = E[(X - E[X])^2] the variance, sigma = sqrt(Var(X)) the standard deviation.
The moment-generating function is M_X(t) = E[exp(tX)] when it exists in a neighborhood of 0; the characteristic function is phi_X(t) = E[exp(i t X)] and always exists.
Gamma(z) denotes the gamma function (continuous analog of factorial: Gamma(n) = (n-1)! for positive integer n); B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) is the beta function.
“Conjugate prior” means: if the likelihood is L(theta | x) from family F, and the prior p(theta) belongs to family G, then the posterior p(theta | x) also belongs to G (closed-form Bayesian update).

The depth of coverage follows the empirical frequency of use: Normal, Bernoulli/Binomial, Poisson, Exponential, Gamma, Beta, and Dirichlet get longer treatment. Specialty distributions (Rice, von Mises, GEV, stable) get the essential formulas and a pointer to the literature.

DISCRETE DISTRIBUTIONS

A discrete distribution assigns positive probability to a countable support set (often {0, 1, 2, ...} or {0, 1, ..., n}). The PMF satisfies sum_k p(k) = 1 with p(k) >= 0. The CDF F(x) = P(X <= x) is a right-continuous step function.

The discrete families below subdivide into:

Bernoulli family (single trial, count of successes): Bernoulli, Binomial, Geometric, Negative Binomial, Hypergeometric.
Count-of-events family: Poisson, Negative Binomial (overdispersed Poisson).
Categorical family (K outcomes): Categorical, Multinomial, Dirichlet-Multinomial (compound).
Heavy-tail discrete: Zipf, discrete Pareto, Yule-Simon.

Bernoulli(p)

The atom of discrete probability: a single trial with binary outcome.

Support: {0, 1}.
Parameter: p in [0, 1] (success probability).
PMF: P(X = 1) = p, P(X = 0) = 1 - p. Compactly, p(k) = p^k (1-p)^(1-k) for k in {0, 1}.
Mean: E[X] = p.
Variance: Var(X) = p(1 - p).
MGF: M(t) = (1 - p) + p e^t.
Entropy: H(X) = -p log p - (1 - p) log(1 - p) (the binary entropy function, maximized at p = 1/2).
Conjugate prior: Beta(alpha, beta) on p; posterior after observing k successes in n trials is Beta(alpha + k, beta + n - k).

Use cases: a single coin flip; a binary indicator (clicked / did not click, defective / not defective, alive / dead at end of trial); the building block for logistic regression (one Bernoulli per row, with p linked to features by the logistic function).

Binomial(n, p)

The sum of n independent identically distributed (iid) Bernoulli(p) trials.

Support: {0, 1, ..., n}.
Parameters: n in {1, 2, ...} (trial count), p in [0, 1].
PMF: p(k) = C(n, k) p^k (1 - p)^(n - k) where C(n, k) = n! / (k! (n - k)!).
Mean: E[X] = np.
Variance: Var(X) = np(1 - p).
MGF: M(t) = ((1 - p) + p e^t)^n.
Convolution rule: Binomial(n1, p) + Binomial(n2, p) (independent) = Binomial(n1 + n2, p).
Normal approximation: for np >= 5 and n(1 - p) >= 5, X is approximately N(np, np(1 - p)) (de Moivre-Laplace, a special case of the CLT).
Poisson limit: as n -> infinity, p -> 0, with np = lambda held fixed, Binomial(n, p) -> Poisson(lambda).
Conjugate prior: Beta(alpha, beta).

Use cases: number of heads in n flips; number of defective items in a batch; number of clicks among n impressions; A/B test success counts.
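The Binomial formulas above can be checked numerically. The following is a minimal Python sketch (standard library only) that evaluates the PMF, recovers the mean and variance by direct summation, and illustrates the Poisson limit. The helper names binomial_pmf and poisson_pmf and the values n = 20, p = 0.3, lambda = 2 are illustrative assumptions, not part of any library.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Mean and variance by direct summation over the support {0, ..., n}.
n, p = 20, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
print(sum(pmf), mean, var)  # compare with 1, n*p, n*p*(1-p)

# Poisson limit: hold n*p = lam fixed and let n grow.
lam = 2.0
gaps = []
for n_big in (10, 100, 10_000):
    gap = max(
        abs(binomial_pmf(k, n_big, lam / n_big) - poisson_pmf(k, lam))
        for k in range(10)
    )
    gaps.append(gap)
    print(n_big, gap)  # largest pointwise PMF difference on k = 0..9
```

The printed gaps should shrink as n grows, which is the Poisson limit in action.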

Geometric(p)

The number of trials needed until the first success in repeated Bernoulli(p) trials. Two common parameterizations:

“Trials” form (1-indexed, support {1, 2, ...}):
- PMF: p(k) = (1 - p)^(k - 1) p.
- Mean: E[X] = 1/p.
- Variance: Var(X) = (1 - p)/p^2.
“Failures-before-success” form (0-indexed, support {0, 1, 2, ...}):
- PMF: p(k) = (1 - p)^k p.
- Mean: E[X] = (1 - p)/p.
- Variance: same: (1 - p)/p^2.
MGF (trials form): M(t) = p e^t / (1 - (1 - p) e^t) for t < -log(1 - p).
Memoryless property (the discrete analog of the exponential’s memorylessness): P(X > m + n | X > m) = P(X > n). Geometric is the only memoryless discrete distribution on the positive integers.

Use cases: number of attempts until first success; time to first failure of a system polled at discrete intervals; classical models of waiting in queueing theory.
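The memoryless property follows directly from the survival function P(X > n) = (1 - p)^n of the trials form. A small Python sketch of that check, with the function name geometric_sf and the values p = 0.2, m = 3, n = 5 chosen purely for illustration:

```python
def geometric_sf(n: int, p: float) -> float:
    """P(X > n) for the trials form: the first n trials all fail."""
    return (1 - p) ** n

p, m, n = 0.2, 3, 5

# P(X > m + n | X > m) = P(X > m + n) / P(X > m)
conditional = geometric_sf(m + n, p) / geometric_sf(m, p)
unconditional = geometric_sf(n, p)
print(conditional, unconditional)  # the two values agree up to rounding

# Mean of the trials form by truncated summation; compare with 1/p.
mean = sum(k * (1 - p) ** (k - 1) * p for k in range(1, 2001))
print(mean)
```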

Negative Binomial(r, p)

Generalizes Geometric: the number of trials (or failures) until the r-th success. Equivalent (and often more useful) characterization: a Gamma-mixture of Poissons, which makes it the workhorse for overdispersed count data.

“Failures-before-r-successes” form (support {0, 1, 2, ...}):
- PMF: p(k) = C(k + r - 1, k) p^r (1 - p)^k.
- Mean: E[X] = r(1 - p)/p.
- Variance: Var(X) = r(1 - p)/p^2 = E[X]/p. Note Var > E: overdispersion.
Extension to non-integer r: replace C(k + r - 1, k) with Gamma(k + r) / (Gamma(r) k!). Then r is a real-valued dispersion parameter.
MGF: M(t) = (p / (1 - (1 - p) e^t))^r for t < -log(1 - p).
Poisson-Gamma mixture: if lambda ~ Gamma(shape = r, scale = (1 - p)/p) and X | lambda ~ Poisson(lambda), then marginally X ~ NegBin(r, p). In other words, the Negative Binomial is the marginal count distribution when the Poisson rate is itself Gamma-distributed.

Use cases: count data where variance exceeds the mean (claim counts, ecological abundance, sequencing read counts where Poisson would underestimate variance). Standard GLM family for overdispersed counts; widely used in RNA-seq differential expression (DESeq2, edgeR).
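For non-integer r the PMF is most safely evaluated on the log scale using the log-gamma function. The sketch below (Python standard library; the name negbin_pmf, the parameter values, and the truncation point are assumptions made for illustration) evaluates the failures-before-r-successes form and compares the summed mean and variance with the closed forms above.

```python
import math

def negbin_pmf(k: int, r: float, p: float) -> float:
    """P(X = k): k failures before the r-th success; r may be non-integer."""
    # log of Gamma(k + r) / (Gamma(r) k!)
    log_coef = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coef + r * math.log(p) + k * math.log(1 - p))

r, p = 2.5, 0.4
ks = range(400)  # truncation point; the tail mass beyond it is negligible here
pmf = [negbin_pmf(k, r, p) for k in ks]

mean = sum(k * q for k, q in zip(ks, pmf))
var = sum((k - mean) ** 2 * q for k, q in zip(ks, pmf))
print(mean, r * (1 - p) / p)       # summed mean vs closed form
print(var, r * (1 - p) / p**2)     # summed variance vs closed form
```

The variance exceeds the mean by the factor 1/p, which is the overdispersion the section describes.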

Poisson(lambda)

The canonical model for the count of rare events in a fixed interval of time or space, under the assumption that events occur independently at a constant average rate lambda.

Support: {0, 1, 2, ...}.
Parameter: lambda > 0 (mean / rate).
PMF: p(k) = lambda^k e^(-lambda) / k!.
Mean: E[X] = lambda.
Variance: Var(X) = lambda. (Mean equals variance — the distinguishing equidispersion property of the Poisson.)
MGF: M(t) = exp(lambda (e^t - 1)).
Convolution: Poisson(lambda1) + Poisson(lambda2) (independent) = Poisson(lambda1 + lambda2).
Connection to Exponential: if event inter-arrival times are iid Exponential(lambda), then the count of events in [0, t] is Poisson(lambda t). This is the defining property of the Poisson process.
Conjugate prior: Gamma(alpha, beta) on lambda in the shape-rate form; the posterior after observing total count K over n intervals is Gamma(alpha + K, beta + n).

Use cases: photons arriving at a detector, customer arrivals at a queue, web requests per second, mutations per genome, typos per page, earthquakes per year. Standard GLM family for count data when overdispersion is absent or modest.
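The Gamma-Poisson update is a two-line computation. A minimal sketch in Python, assuming the shape-rate form for the Gamma prior; the function name gamma_poisson_update and the example counts are hypothetical:

```python
def gamma_poisson_update(alpha: float, beta: float, counts: list[int]) -> tuple[float, float]:
    """Posterior (shape, rate) for a Poisson rate under a Gamma(alpha, beta) prior.

    `counts` holds one observed count per interval, so K = sum(counts)
    and n = len(counts).
    """
    return alpha + sum(counts), beta + len(counts)

# Prior Gamma(2, 1); four intervals with 3, 5, 4 and 6 events.
alpha_post, beta_post = gamma_poisson_update(2.0, 1.0, [3, 5, 4, 6])
posterior_mean = alpha_post / beta_post  # mean of a shape-rate Gamma
print(alpha_post, beta_post, posterior_mean)
```

The posterior mean sits between the prior mean alpha/beta and the sample mean of the counts, with the data dominating as n grows.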

Categorical / Multinomial

Generalize Bernoulli / Binomial from 2 outcomes to K outcomes.

Categorical(p_1, …, p_K) is one draw from K outcomes with probabilities p_i >= 0, sum_i p_i = 1. PMF: P(X = i) = p_i. Sometimes encoded as a one-hot vector e_i in {0, 1}^K.

Multinomial(n, p_1, …, p_K) is the joint distribution of (X_1, ..., X_K) where X_i is the count of category i after n iid Categorical draws.

Support: vectors (k_1, ..., k_K) with k_i >= 0 integers and sum_i k_i = n.
PMF: p(k_1, ..., k_K) = (n! / (k_1! ... k_K!)) p_1^k_1 ... p_K^k_K.
Marginals: each X_i ~ Binomial(n, p_i).
Means: E[X_i] = n p_i.
Variances: Var(X_i) = n p_i (1 - p_i).
Covariances: Cov(X_i, X_j) = -n p_i p_j for i != j (negative because total is fixed).
Conjugate prior: Dirichlet(alpha_1, …, alpha_K) on (p_1, ..., p_K); posterior after counts (k_1, ..., k_K) is Dirichlet(alpha_1 + k_1, …, alpha_K + k_K).

Use cases: dice rolls; observed counts across categories of a survey; topic models (LDA: each word's topic assignment is Categorical over topics, and the word itself is Categorical over the vocabulary given its topic); softmax classifiers in deep learning (output is Categorical).

Hypergeometric(N, K, n)

Sampling without replacement from a finite population of size N containing K “successes”. Draw n items; count the successes drawn.

Support: max(0, n - (N - K)) <= k <= min(n, K).
PMF: p(k) = C(K, k) C(N - K, n - k) / C(N, n).
Mean: E[X] = n K / N.
Variance: Var(X) = n (K/N) ((N - K)/N) ((N - n)/(N - 1)). The factor (N - n)/(N - 1) is the finite population correction; as N -> infinity with K/N -> p fixed, the hypergeometric converges to Binomial(n, p).

Use cases: card draws (poker hand probabilities); quality assurance sampling (inspect n items from a lot of N); Fisher’s exact test for 2x2 contingency tables; capture-recapture wildlife population estimation.
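As a worked instance of the PMF, the sketch below computes the distribution of the number of aces in a 5-card hand (N = 52, K = 4, n = 5) and compares the summed mean and variance with the formulas above. It uses only the Python standard library; hypergeom_pmf is an illustrative helper name.

```python
import math

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) successes in n draws without replacement (N items, K successes)."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0  # outside the support
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

# Number of aces in a 5-card hand from a standard 52-card deck.
N, K, n = 52, 4, 5
pmf = [hypergeom_pmf(k, N, K, n) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
var_formula = n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1))
print(pmf)
print(mean, n * K / N)
print(var, var_formula)
```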

Discrete uniform(a, b)

Equal probability on each integer in {a, a+1, ..., b} (n = b - a + 1 outcomes).

PMF: p(k) = 1/n for a <= k <= b.
Mean: E[X] = (a + b)/2.
Variance: Var(X) = (n^2 - 1)/12.

Use cases: fair dice (a = 1, b = 6); pseudorandom integer generation; lottery; uninformative prior over a finite set of hypotheses.

Beta-binomial(n, alpha, beta)

The Binomial with a Beta-distributed success probability — a compound distribution that captures overdispersion in proportions-of-successes data.

Hierarchical: p ~ Beta(alpha, beta), X | p ~ Binomial(n, p).
Marginal PMF: p(k) = C(n, k) B(k + alpha, n - k + beta) / B(alpha, beta).
Mean: E[X] = n alpha / (alpha + beta).
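Variance: Var(X) = n alpha beta (alpha + beta + n) / ((alpha + beta)^2 (alpha + beta + 1)). This exceeds the Binomial variance n p (1 - p) at p = alpha / (alpha + beta) by the factor (alpha + beta + n) / (alpha + beta + 1), which is strictly greater than 1 for n > 1: overdispersion.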

Use cases: pooled multi-site binary outcomes where each site has its own latent success rate; meta-analytic combination of binomial proportions; sequencing variant-calling priors.

Zipf / power-law / discrete Pareto

Heavy-tailed counts: a few categories dominate, most are rare. Standard parameterizations:

Zipf(s, N) on {1, ..., N} with PMF p(k) = (1/k^s) / H_{N, s} where H_{N, s} = sum_{j=1}^N 1/j^s is the generalized harmonic number.
Zeta distribution (infinite support, s > 1): p(k) = (1/k^s) / zeta(s) where zeta is the Riemann zeta function.
Yule-Simon for preferential-attachment processes.

For continuous Pareto see section 26.

Use cases: word-frequency distributions (Zipf’s law: rank r word frequency proportional to 1/r); city-size distributions; node-degree distributions in scale-free networks; library citation counts; income distributions (heavy upper tail).

CONTINUOUS DISTRIBUTIONS

A continuous distribution has a probability density function f(x) >= 0 with integral f(x) dx = 1 (over its support). For any interval A, P(X in A) = integral_A f(x) dx. Note P(X = x) = 0 for every single point x; only sets of positive length, such as intervals, can carry positive probability.

The continuous families below subdivide into:

Location-scale: Uniform, Normal, Logistic, Laplace, Cauchy.
Positive support: Exponential, Gamma, Log-normal, Weibull, Pareto, Chi-square.
Bounded support: Uniform, Beta.
Sampling distributions (derived from Normal): Chi-square, t, F.
Multivariate: Multivariate Normal, Multivariate t, Dirichlet (on the simplex), Wishart (on PSD matrices).
Circular / directional: von Mises.
Extreme-value: Gumbel, Frechet, Weibull (all subsumed by GEV); Generalized Pareto.
Mixtures: GMM and beyond.

Uniform(a, b)

Constant density on the interval [a, b].

Support: [a, b].
PDF: f(x) = 1/(b - a) for a <= x <= b, zero otherwise.
CDF: F(x) = (x - a)/(b - a) on [a, b].
Mean: E[X] = (a + b)/2.
Variance: Var(X) = (b - a)^2 / 12.
MGF: M(t) = (e^(tb) - e^(ta)) / (t(b - a)) for t != 0.
Universality: if U ~ Uniform(0, 1) and F is any CDF, then F^(-1)(U) has CDF F (the inverse-CDF / Smirnov transform — basis of pseudorandom sampling).

Use cases: pseudorandom-number generation (every other sampler builds on Uniform(0, 1)); rejection sampling envelope; an uninformative prior on a bounded parameter; quantization noise model.
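The inverse-CDF transform can be sketched for the Exponential, whose CDF F(x) = 1 - exp(-rate x) inverts in closed form. This is an illustrative Python example using only the standard library; the function name sample_exponential, the seed, and the sample size are assumptions.

```python
import math
import random

def sample_exponential(rate: float, rng: random.Random) -> float:
    """Draw from Exponential(rate) by applying F^(-1) to a Uniform(0, 1) draw."""
    u = rng.random()                    # u in [0, 1)
    return -math.log(1.0 - u) / rate    # F^(-1)(u) for F(x) = 1 - exp(-rate * x)

rng = random.Random(0)  # fixed seed so the run is reproducible
rate = 2.0
draws = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean, 1.0 / rate)  # sample mean vs theoretical mean
```

Using 1 - u rather than u keeps the argument of the logarithm strictly positive, since random() can return 0 but never 1.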

Normal / Gaussian N(mu, sigma^2)

The most important continuous distribution. Justified theoretically by the Central Limit Theorem (sums of many small independent contributions tend to Normal) and practically by analytic tractability (conjugate to itself in many settings, linear transforms preserve normality, marginals/conditionals of joint Normals are Normal).

Support: all of R.
Parameters: mu in R (mean), sigma^2 > 0 (variance).
PDF: f(x) = (1 / sqrt(2 pi sigma^2)) exp(-(x - mu)^2 / (2 sigma^2)).
CDF: no closed form; written Phi((x - mu)/sigma) where Phi is the standard Normal CDF (with mu = 0, sigma = 1). Tables and erf give numerical values.
Mean: E[X] = mu.
Variance: Var(X) = sigma^2.
Skewness: 0. Excess kurtosis: 0 (by convention — Normal is the reference).
MGF: M(t) = exp(mu t + sigma^2 t^2 / 2).
Characteristic function: phi(t) = exp(i mu t - sigma^2 t^2 / 2).
Linear closure: if X ~ N(mu, sigma^2) then aX + b ~ N(a mu + b, a^2 sigma^2).
Convolution: independent N(mu1, s1^2) + N(mu2, s2^2) = N(mu1 + mu2, s1^2 + s2^2).
Conjugate priors:
- For mu (variance known): N(mu_0, tau_0^2) is conjugate.
- For sigma^2 (mean known): InverseGamma(alpha, beta) is conjugate. Equivalently, Inv-chi^2 is sometimes used.
- For both jointly: Normal-Inverse-Gamma (NIG) prior, with closed-form posterior.

Standardization: Z = (X - mu)/sigma ~ N(0, 1). The 68-95-99.7 rule: about 68% of mass within 1 sigma, 95% within 2 sigma, 99.7% within 3 sigma.

Use cases: measurement errors in physics; residuals in linear regression (assumed Normal for inference); approximating sums and averages (CLT); maximum-likelihood estimation pipelines; Kalman filters; Gaussian processes; the implicit noise model behind squared-error loss (minimizing MSE is equivalent to maximizing a Normal log-likelihood with fixed variance).
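Although the Normal CDF has no closed form, it can be written in terms of the error function as Phi(z) = (1 + erf(z / sqrt(2))) / 2. The sketch below uses that identity to reproduce the 68-95-99.7 rule; normal_cdf is an illustrative helper built on math.erf from the Python standard library.

```python
import math

def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """CDF of N(mu, sigma^2) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Probability mass within k standard deviations of the mean, k = 1, 2, 3.
within = [normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)]
for k, mass in zip((1, 2, 3), within):
    print(f"within {k} sigma: {mass:.4f}")
```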

Multivariate Normal N(mu, Sigma)

The joint distribution of a random vector X in R^d such that any linear combination a^T X is univariate Normal.

Support: R^d.
Parameters: mean vector mu in R^d, covariance matrix Sigma in R^(d x d) symmetric PSD.
PDF (when Sigma invertible): f(x) = (1 / sqrt((2 pi)^d det Sigma)) exp(-(1/2) (x - mu)^T Sigma^(-1) (x - mu)).
Mean: E[X] = mu.
Covariance: Cov(X) = Sigma.
Marginals: every sub-vector X_S is also multivariate Normal with mean mu_S and covariance Sigma_SS.
Conditionals: X_A | X_B = x_B is Normal with mean mu_A + Sigma_AB Sigma_BB^(-1) (x_B - mu_B) and covariance Sigma_AA - Sigma_AB Sigma_BB^(-1) Sigma_BA. (This is the Gaussian conditioning formula — the engine inside Kalman filters and Gaussian processes.)
Linear transform: if X ~ N(mu, Sigma), then AX + b ~ N(A mu + b, A Sigma A^T).
Independence: components are independent iff Sigma is diagonal. This is Gaussian-specific: zero correlation implies independence only when the variables are jointly Normal.

Use cases: Kalman filtering and smoothing (state vector + Gaussian noise); Gaussian processes for regression; factor models in finance; multivariate hypothesis testing (Hotelling T^2); generative models (deep latent-variable models with a Gaussian latent prior).
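In the bivariate case the conditioning formula reduces to scalar arithmetic, which makes it easy to see how observing one coordinate shifts the mean and shrinks the variance of the other. A minimal Python sketch, with the function name and the numeric values chosen only for illustration:

```python
def condition_bivariate_normal(
    mu_a: float, mu_b: float,
    var_a: float, var_b: float, cov_ab: float,
    x_b: float,
) -> tuple[float, float]:
    """Mean and variance of X_A | X_B = x_b for a jointly Normal pair.

    Scalar version of mu_A + Sigma_AB Sigma_BB^(-1) (x_B - mu_B) and
    Sigma_AA - Sigma_AB Sigma_BB^(-1) Sigma_BA.
    """
    cond_mean = mu_a + cov_ab / var_b * (x_b - mu_b)
    cond_var = var_a - cov_ab**2 / var_b
    return cond_mean, cond_var

cond_mean, cond_var = condition_bivariate_normal(
    mu_a=1.0, mu_b=2.0, var_a=4.0, var_b=1.0, cov_ab=1.5, x_b=3.0
)
print(cond_mean, cond_var)
```

The conditional variance does not depend on the observed value x_b, only on the covariance structure.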

Exponential(lambda)

The continuous waiting time between events of a Poisson process. The continuous analog of the geometric distribution.

Support: [0, infinity).
Parameter: lambda > 0 (rate). Alternative parameterization uses scale theta = 1/lambda.
PDF: f(x) = lambda exp(-lambda x).
CDF: F(x) = 1 - exp(-lambda x).
Mean: E[X] = 1/lambda.
Variance: Var(X) = 1/lambda^2. (Standard deviation equals mean.)
MGF: M(t) = lambda / (lambda - t) for t < lambda.
Memoryless property: P(X > s + t | X > s) = P(X > t). The exponential is the only continuous distribution on [0, infinity) with this property.
Convolution: sum_{i=1}^k X_i ~ Gamma(k, 1/lambda) for iid Exponentials.

Use cases: time between Poisson arrivals (calls, web requests, particle decays); radioactive decay (half-life is the median); reliability for components with a constant hazard rate; survival analysis baseline; thinning / inverse-CDF sampling primitive.

Gamma(k, theta)

Generalizes Exponential. Two common parameterizations:

Shape-scale: k > 0 (shape), theta > 0 (scale).
- PDF: f(x) = (1 / (Gamma(k) theta^k)) x^(k - 1) exp(-x/theta) for x > 0.
- Mean: E[X] = k theta.
- Variance: Var(X) = k theta^2.
Shape-rate: alpha > 0 (shape), beta > 0 (rate), with theta = 1/beta.
- PDF: f(x) = (beta^alpha / Gamma(alpha)) x^(alpha - 1) exp(-beta x).
- Mean: alpha/beta. Variance: alpha/beta^2.
MGF (shape-scale form): M(t) = (1 - theta t)^(-k) for t < 1/theta.
Integer shape: Gamma(k, theta) with integer k is Erlang(k, theta) — the sum of k iid Exponential(1/theta) random variables.
Convolution: Gamma(k1, theta) + Gamma(k2, theta) = Gamma(k1 + k2, theta) (same scale).
Conjugate prior: Gamma is the conjugate prior for the rate of a Poisson likelihood; also for the precision (inverse variance) of a Normal with known mean.

Use cases: aggregated waiting times (sum of exponentials); rainfall amounts; insurance claim sizes; the conjugate prior to the Poisson rate; the precision parameter in Normal-Gamma hierarchical models.
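Mixing up scale and rate is a common source of errors, so it helps to convert explicitly before sampling. The sketch below assumes Python's random.gammavariate, whose two arguments are shape and scale; the helper rate_to_scale, the seed, and the sample size are illustrative.

```python
import random

def rate_to_scale(alpha: float, beta_rate: float) -> tuple[float, float]:
    """Convert shape-rate (alpha, beta) to shape-scale (k, theta) with theta = 1/beta."""
    return alpha, 1.0 / beta_rate

rng = random.Random(1)  # fixed seed for reproducibility
k, theta = rate_to_scale(3.0, 2.0)  # shape 3, rate 2 -> shape 3, scale 0.5

# random.gammavariate takes (shape, scale), so pass theta, not the rate.
draws = [rng.gammavariate(k, theta) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
print(sample_mean, k * theta)      # compare with k * theta
print(sample_var, k * theta**2)    # compare with k * theta^2
```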

Beta(alpha, beta)

The canonical distribution on [0, 1] — used for probabilities, proportions, and fractions.

Support: [0, 1].
Parameters: alpha > 0, beta > 0 (both “shape” parameters).
PDF: f(x) = (1 / B(alpha, beta)) x^(alpha - 1) (1 - x)^(beta - 1), where B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha + beta).
Mean: E[X] = alpha / (alpha + beta).
Variance: Var(X) = alpha beta / ((alpha + beta)^2 (alpha + beta + 1)).
Mode (for alpha, beta > 1): (alpha - 1) / (alpha + beta - 2).
Special cases:
- Beta(1, 1) = Uniform(0, 1).
- Beta(1/2, 1/2) is the arcsine distribution (U-shaped; the limit in the arcsine laws for random walks and Brownian motion, e.g. the fraction of time spent positive).
- As alpha + beta -> infinity with mean held fixed, mass concentrates at the mean.
Conjugate prior to Bernoulli and Binomial. Updating: prior Beta(alpha, beta), observe k successes in n trials, posterior Beta(alpha + k, beta + n - k). The hyperparameters (alpha, beta) act like pseudo-counts of successes and failures.

Use cases: prior over a probability or proportion (click-through rate, conversion rate); Thompson sampling for Bernoulli bandits (sample a probability from each arm’s Beta posterior, pick the arm with the largest sample); modeling proportions in ecology, voting shares, election outcomes; a flexible distribution shape on a bounded interval after rescaling.
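The pseudo-count update and a single Thompson-sampling decision can be sketched in a few lines. This is an illustrative Python example using random.betavariate from the standard library; the function names, the uniform Beta(1, 1) priors, and the click counts are assumptions made for the example.

```python
import random

def beta_update(alpha: float, beta: float, successes: int, trials: int) -> tuple[float, float]:
    """Posterior Beta parameters after observing `successes` in `trials`."""
    return alpha + successes, beta + trials - successes

def thompson_pick(posteriors: list[tuple[float, float]], rng: random.Random) -> int:
    """Sample one probability per arm from its Beta posterior; return the argmax arm."""
    samples = [rng.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=samples.__getitem__)

# Two arms with Beta(1, 1) priors: 12/100 and 30/100 observed successes.
arms = [beta_update(1, 1, 12, 100), beta_update(1, 1, 30, 100)]
print(arms)

rng = random.Random(0)
picks = [thompson_pick(arms, rng) for _ in range(1000)]
print(picks.count(0), picks.count(1))  # the better arm is chosen far more often
```

Because each decision samples from the posterior rather than using its mean, the weaker arm is still tried occasionally, which is the exploration mechanism of Thompson sampling.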

Chi-square(k)

The distribution of the sum of squares of k independent standard Normal random variables.

Support: [0, infinity).
Parameter: k > 0 (degrees of freedom, usually integer but generalizable).
PDF: f(x) = (1 / (2^(k/2) Gamma(k/2))) x^(k/2 - 1) exp(-x/2).
Mean: E[X] = k.
Variance: Var(X) = 2k.
MGF: M(t) = (1 - 2t)^(-k/2) for t < 1/2.
Special case of Gamma: chi^2(k) = Gamma(shape = k/2, scale = 2).
Convolution: independent chi^2(k1) + chi^2(k2) = chi^2(k1 + k2).

Use cases: distribution of the sample variance (scaled) under a Normal model: (n - 1) S^2 / sigma^2 ~ chi^2(n - 1); Pearson’s chi-square goodness-of-fit and independence tests; likelihood-ratio test statistics (asymptotically chi^2 under regularity, Wilks’ theorem); confidence intervals for variance.

t-distribution(nu)

The “Student’s t” distribution: a heavier-tailed substitute for the Normal when estimating the mean of a Normal sample with unknown variance.

Support: all of R.
Parameter: nu > 0 (degrees of freedom).
PDF: f(t) = (Gamma((nu + 1)/2) / (sqrt(nu pi) Gamma(nu/2))) (1 + t^2/nu)^(-(nu + 1)/2).
Mean: 0 for nu > 1, undefined for nu <= 1 (Cauchy at nu = 1).
Variance: nu / (nu - 2) for nu > 2, infinite for 1 < nu <= 2, undefined for nu <= 1.
Tail behavior: power-law tails f(t) ~ |t|^(-(nu + 1)). As nu -> infinity, t(nu) -> N(0, 1).
Construction: if Z ~ N(0, 1) and V ~ chi^2(nu) independent, then T = Z / sqrt(V/nu) ~ t(nu).

Use cases: classical “one-sample t-test” and “two-sample t-test” for the mean when the population variance is unknown and estimated from the sample; confidence intervals for the mean of a Normal sample; robust regression (a t-distributed error model tolerates outliers far better than a Normal one); Bayesian inference where a t marginal arises from integrating out an unknown variance.

F-distribution(d1, d2)

The ratio (suitably scaled) of two independent chi-squared random variables.

Support: [0, infinity).
Parameters: d1, d2 > 0 (numerator and denominator degrees of freedom).
PDF: f(x) = (1/B(d1/2, d2/2)) (d1/d2)^(d1/2) x^(d1/2 - 1) (1 + (d1/d2) x)^(-(d1 + d2)/2).
Construction: if U ~ chi^2(d1) and V ~ chi^2(d2) independent, then (U/d1) / (V/d2) ~ F(d1, d2).
Mean: d2 / (d2 - 2) for d2 > 2.
Reciprocal: 1 / F(d1, d2) ~ F(d2, d1).

Use cases: ANOVA F-test (ratio of between-group to within-group variance); F-test for nested linear models (full vs reduced); test for equality of two variances; the basis of significance testing in ordinary least-squares regression.

Log-normal(mu, sigma^2)

The distribution whose logarithm is Normal: if Y = log(X) ~ N(mu, sigma^2) then X is log-normal.

Support: (0, infinity).
PDF: f(x) = (1 / (x sigma sqrt(2 pi))) exp(-(log(x) - mu)^2 / (2 sigma^2)).
Mean: E[X] = exp(mu + sigma^2 / 2). (Note: not exp(mu).)
Variance: Var(X) = (exp(sigma^2) - 1) exp(2 mu + sigma^2).
Median: exp(mu). Mode: exp(mu - sigma^2).
Multiplicative CLT: products of many positive independent random variables are approximately log-normal (their logs are approximately Normal by the additive CLT).

Use cases: income and wealth distributions (right-skewed, positive); stock prices and asset returns (geometric Brownian motion has log-normal marginals — the Black-Scholes assumption); particle size distributions; biological growth quantities; multiplicative noise models.

Cauchy(x_0, gamma)

The Lorentz distribution: a symmetric, heavy-tailed distribution with no defined mean or variance.

Support: all of R.
Parameters: x_0 in R (location, the median), gamma > 0 (scale, half-width at half-maximum).
PDF: f(x) = (1 / (pi gamma)) (gamma^2 / ((x - x_0)^2 + gamma^2)).
CDF: F(x) = (1/pi) arctan((x - x_0)/gamma) + 1/2.
Mean and variance: undefined. The integral defining E[X] does not converge absolutely.
Characteristic function (exists, even though MGF does not): phi(t) = exp(i x_0 t - gamma |t|).
Construction: ratio of two independent standard Normals: N(0, 1) / N(0, 1) ~ Cauchy(0, 1).
Stability: Cauchy is a stable distribution: the sum of n iid Cauchy(x_0, gamma) variables is Cauchy(n x_0, n gamma), so the sample mean is again Cauchy(x_0, gamma) and does not concentrate as n grows.

Use cases: resonance line shapes in spectroscopy and laser physics (Lorentzian profile); a stress-test for “robust” estimators (the median, not the mean, recovers the location); pathological-case prior to demonstrate the importance of finite moments.
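The contrast between median and mean can be seen by simulation. The sketch below inverts the CDF given above, x = x_0 + gamma tan(pi (u - 1/2)), to draw Cauchy samples, then reports both estimators. It is an illustrative Python example; sample_cauchy, the seed, and the sample size are assumptions.

```python
import math
import random
import statistics

def sample_cauchy(x0: float, gamma: float, rng: random.Random) -> float:
    """Inverse-CDF draw: solve F(x) = u for x, with F the Cauchy CDF."""
    u = rng.random()
    return x0 + gamma * math.tan(math.pi * (u - 0.5))

rng = random.Random(42)  # fixed seed for reproducibility
draws = [sample_cauchy(3.0, 1.0, rng) for _ in range(100_001)]

sample_median = statistics.median(draws)
sample_mean = statistics.fmean(draws)
print(sample_median)  # a stable estimate of the location x0 = 3
print(sample_mean)    # dominated by a few extreme draws; changes wildly with the seed
```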

Weibull(k, lambda)

A flexible distribution for positive quantities, generalizing the Exponential.

Support: [0, infinity).
Parameters: k > 0 (shape), lambda > 0 (scale).
PDF: f(x) = (k/lambda) (x/lambda)^(k - 1) exp(-(x/lambda)^k).
CDF: F(x) = 1 - exp(-(x/lambda)^k).
Mean: E[X] = lambda Gamma(1 + 1/k).
Variance: Var(X) = lambda^2 (Gamma(1 + 2/k) - (Gamma(1 + 1/k))^2).
Hazard rate: h(x) = (k/lambda) (x/lambda)^(k - 1).
- k < 1: decreasing hazard (infant mortality).
- k = 1: constant hazard (Exponential).
- k > 1: increasing hazard (wear-out failure).
- Combining all three over a product lifetime gives the classic “bathtub curve”.

Use cases: reliability and survival analysis (lifetime of components); wind-speed distributions in renewable-energy modeling; particle-size distributions; material fatigue and breaking strength; extreme-value modeling for minima.
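The three hazard regimes are easy to confirm numerically. A short Python sketch of the hazard and mean formulas above, using math.gamma from the standard library; the function names and evaluation points are illustrative.

```python
import math

def weibull_hazard(x: float, k: float, lam: float) -> float:
    """Hazard rate h(x) = (k/lam) (x/lam)^(k-1) for x > 0."""
    return (k / lam) * (x / lam) ** (k - 1)

def weibull_mean(k: float, lam: float) -> float:
    """E[X] = lam * Gamma(1 + 1/k)."""
    return lam * math.gamma(1.0 + 1.0 / k)

lam = 2.0
for k in (0.5, 1.0, 2.0):
    early, late = weibull_hazard(1.0, k, lam), weibull_hazard(2.0, k, lam)
    print(f"k={k}: h(1)={early:.4f}, h(2)={late:.4f}, mean={weibull_mean(k, lam):.4f}")
```

With k = 1 the hazard is the constant 1/lam and the mean reduces to lam, matching the Exponential with rate 1/lam.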

Pareto(x_m, alpha)

The continuous power-law distribution.

Support: [x_m, infinity) with x_m > 0.
Parameters: x_m > 0 (scale / minimum), alpha > 0 (shape, “Pareto index”).
PDF: f(x) = alpha x_m^alpha / x^(alpha + 1) for x >= x_m.
CDF: F(x) = 1 - (x_m / x)^alpha.
Mean: alpha x_m / (alpha - 1) for alpha > 1; infinite otherwise.
Variance: x_m^2 alpha / ((alpha - 1)^2 (alpha - 2)) for alpha > 2; infinite otherwise.
The 80/20 rule (Pareto principle): for alpha approximately 1.16, about 80% of the mass sits in the top 20% of values.

Use cases: wealth and income distributions (Vilfredo Pareto’s original observation); file-size distributions on the web; severity of insurance claims; city sizes (also Zipf); the continuous companion to Zipf for ranked-frequency data.
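The 80/20 figure can be derived from the formulas above. For alpha > 1, the top fraction q of the population starts at the threshold x_q = x_m q^(-1/alpha), and its share of the total is q^(1 - 1/alpha), independent of x_m. Setting q = 0.2 and the share to 0.8 gives alpha = log(5)/log(4), approximately 1.16. A minimal Python check, with top_share as a hypothetical helper name:

```python
import math

def top_share(q: float, alpha: float) -> float:
    """Share of the total held by the top fraction q under Pareto(x_m, alpha)."""
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1, so shares are undefined")
    return q ** (1.0 - 1.0 / alpha)

alpha_8020 = math.log(5) / math.log(4)
share = top_share(0.2, alpha_8020)
print(alpha_8020, share)

# A lighter tail (larger alpha) concentrates less of the total at the top.
print(top_share(0.2, 2.0), top_share(0.2, 3.0))
```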

Logistic(mu, s)

Symmetric and bell-shaped like the Normal, with heavier tails. CDF is the famous logistic / sigmoid function.

Support: all of R.
Parameters: mu in R (location), s > 0 (scale).
PDF: f(x) = exp(-(x - mu)/s) / (s (1 + exp(-(x - mu)/s))^2).
CDF: F(x) = 1 / (1 + exp(-(x - mu)/s)) — the sigmoid.
Mean: E[X] = mu.
Variance: Var(X) = (s^2 pi^2) / 3.
Excess kurtosis: 1.2 (heavier-tailed than Normal’s 0).

Use cases: foundation of logistic regression (the link function is the logit, the inverse of the logistic CDF; the latent-variable derivation has Logistic errors); softmax classifiers (multinomial generalization); growth curves in biology and adoption modeling.

Laplace (double-exponential)(mu, b)

Two mirrored Exponentials joined at the mean: a symmetric distribution with a sharp peak and exponentially decaying tails (heavier than Normal).

Support: all of R.
Parameters: mu in R (location), b > 0 (scale).
PDF: f(x) = (1 / (2b)) exp(-|x - mu| / b).
Mean: E[X] = mu.
Variance: Var(X) = 2 b^2.
MGF: M(t) = exp(mu t) / (1 - b^2 t^2) for |t| < 1/b.

Use cases: as a prior over regression coefficients, the Laplace yields the Lasso penalty (L1 regularization is the MAP estimate under a Laplace prior); the Laplace mechanism in differential privacy adds Laplace noise calibrated to query sensitivity; a heavier-tailed alternative to the Normal for residuals, whose maximum-likelihood fit is least-absolute-deviations (median) regression.

Rayleigh(sigma)

The distribution of the magnitude of a 2D Gaussian vector with iid components.

Support: [0, infinity).
PDF: f(x) = (x / sigma^2) exp(-x^2 / (2 sigma^2)).
Mean: E[X] = sigma sqrt(pi / 2).
Variance: Var(X) = ((4 - pi)/2) sigma^2.
Construction: if X, Y ~ N(0, sigma^2) independent then sqrt(X^2 + Y^2) ~ Rayleigh(sigma).
Special case of Weibull: Rayleigh(sigma) = Weibull(k = 2, lambda = sigma sqrt(2)).

Use cases: magnitude of complex Gaussian noise in communications (envelope of narrow-band noise); MRI image noise in magnitude images; wind speed (one of several candidate models, alongside Weibull).

Rice(nu, sigma)

A non-central generalization of Rayleigh: magnitude of a 2D Gaussian vector with non-zero mean.

PDF: f(x) = (x / sigma^2) exp(-(x^2 + nu^2)/(2 sigma^2)) I_0(x nu / sigma^2) for x >= 0, where I_0 is the modified Bessel function of the first kind of order 0.
Reduces to Rayleigh when nu = 0.

Use cases: signal-plus-noise envelope in communications (Rician fading channels); MRI magnitude images at low SNR (Rayleigh is the high-noise limit, Rice corrects for signal).

Inverse-Gamma(alpha, beta)

If X ~ Gamma(alpha, beta) (shape-rate form), then 1/X ~ InverseGamma(alpha, beta).

PDF: f(x) = (beta^alpha / Gamma(alpha)) x^(-alpha - 1) exp(-beta / x) for x > 0.
Mean: beta / (alpha - 1) for alpha > 1.
Variance: beta^2 / ((alpha - 1)^2 (alpha - 2)) for alpha > 2.

Use cases: the conjugate prior for the variance sigma^2 of a Normal likelihood with known mean. Together with a Normal prior on the mean, this gives the Normal-Inverse-Gamma (NIG) joint prior with closed-form Bayesian updating for the full Normal model.

Dirichlet(alpha_1, …, alpha_K)

The multivariate generalization of Beta: a distribution over the (K-1)-simplex (vectors of non-negative components summing to 1).

Support: { p = (p_1, ..., p_K) : p_i >= 0, sum_i p_i = 1 }.
Parameters: alpha_i > 0.
PDF: f(p) = (1/B(alpha)) prod_i p_i^(alpha_i - 1), where B(alpha) = (prod_i Gamma(alpha_i)) / Gamma(sum_i alpha_i) is the multivariate beta function.
Mean: E[p_i] = alpha_i / (sum_j alpha_j).
Variance: Var(p_i) = (alpha_i (alpha_0 - alpha_i)) / (alpha_0^2 (alpha_0 + 1)) where alpha_0 = sum_j alpha_j.
Marginals: each p_i marginally Beta(alpha_i, alpha_0 - alpha_i).
Aggregation: merging two components i and j into a single component with probability p_i + p_j yields a Dirichlet with parameters (alpha_1, ..., alpha_i + alpha_j, ..., alpha_K).
Conjugate prior to the Multinomial likelihood.

Use cases: topic models (Latent Dirichlet Allocation: per-document topic mixture is Dirichlet, per-topic word mixture is Dirichlet); priors over discrete probability vectors; population-genetics allele-frequency models; the basis of Dirichlet processes (nonparametric Bayesian).
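The conjugate update adds observed counts to the concentration parameters. Sampling can be done with a standard construction not stated above: draw independent Gamma(alpha_i, 1) variables and normalize them to sum to 1. A Python sketch under those assumptions, using random.gammavariate from the standard library; the function names, prior, and counts are illustrative.

```python
import random

def dirichlet_posterior(alpha: list[float], counts: list[int]) -> list[float]:
    """Posterior concentration parameters after Multinomial counts."""
    return [a + k for a, k in zip(alpha, counts)]

def sample_dirichlet(alpha: list[float], rng: random.Random) -> list[float]:
    """One draw on the simplex via normalized independent Gamma(alpha_i, 1) variables."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

posterior = dirichlet_posterior([1.0, 1.0, 1.0], [2, 5, 10])
print(posterior)

rng = random.Random(0)  # fixed seed for reproducibility
draws = [sample_dirichlet(posterior, rng) for _ in range(20_000)]
alpha_0 = sum(posterior)
sample_means = [sum(d[i] for d in draws) / len(draws) for i in range(3)]
print(sample_means, [a / alpha_0 for a in posterior])  # compare with alpha_i / alpha_0
```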

Wishart + Inverse-Wishart

Distributions over symmetric positive-definite matrices. The matrix-valued generalization of (Inverse-)Gamma.

Wishart(V, nu): if Y_i in R^d are iid N(0, V) for i = 1, ..., nu, then S = sum_i Y_i Y_i^T ~ Wishart(V, nu). Used to model sample covariance matrices.
Inverse-Wishart(Psi, nu): if S ~ Wishart(V, nu), then S^(-1) ~ Wishart^(-1)(V^(-1), nu). The conjugate prior for the covariance matrix Sigma of a multivariate Normal likelihood with known mean.

PDFs and full parameterizations are in Johnson-Kotz-Balakrishnan and Bishop Appendix B.

Use cases: Bayesian inference over covariance matrices (multivariate Normal with unknown covariance — use Inverse-Wishart prior); factor analysis priors; multivariate hierarchical models. Be aware: the Inverse-Wishart prior is notoriously inflexible (single concentration parameter nu) — the LKJ prior on the correlation matrix combined with a separate scale prior is often preferred in modern Bayesian software (Stan, PyMC).

von Mises (circular Normal)(mu, kappa)

The Normal-analog on the circle: a unimodal distribution for angular data.

Support: [-pi, pi) (or any interval of length 2 pi).
Parameters: mu in [-pi, pi) (mean direction), kappa >= 0 (concentration — analog of inverse variance; kappa = 0 is uniform on the circle).
PDF: f(theta) = exp(kappa cos(theta - mu)) / (2 pi I_0(kappa)), where I_0 is the modified Bessel function of the first kind of order 0.
Mean direction: mu.
Circular variance: 1 - I_1(kappa)/I_0(kappa).

Use cases: any angular measurement (wind direction, compass bearings, time-of-day as an angle modulo 24h, phase of an oscillator, robot orientation); the foundational distribution for directional statistics. The Bingham, Kent, and matrix-Fisher distributions generalize to spheres and rotation groups (used in robotics for SO(3) — see bayesian-estimation).
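Angular data cannot be averaged arithmetically: the bearings 350 and 10 degrees average to 180 degrees on the line but to 0 degrees on the circle. The sketch below computes the mean direction and circular variance from the mean resultant vector, then applies it to draws from random.vonmisesvariate in the Python standard library. The helper name circular_summary, the seed, and the parameter values are assumptions.

```python
import math
import random

def circular_summary(angles: list[float]) -> tuple[float, float]:
    """Mean direction (radians, in (-pi, pi]) and circular variance 1 - R."""
    n = len(angles)
    c = sum(math.cos(a) for a in angles) / n
    s = sum(math.sin(a) for a in angles) / n
    return math.atan2(s, c), 1.0 - math.hypot(c, s)

# Two bearings straddling north.
wrapped = [math.radians(350), math.radians(10)]
naive_mean = sum(wrapped) / len(wrapped)      # pi radians: points due south
circ_mean, _ = circular_summary(wrapped)      # approximately 0: points due north
print(math.degrees(naive_mean), math.degrees(circ_mean))

# Draws from von Mises(mu = 3.0, kappa = 4.0).
rng = random.Random(7)
draws = [rng.vonmisesvariate(3.0, 4.0) for _ in range(20_000)]
est_mu, circ_var = circular_summary(draws)
print(est_mu, circ_var)  # circ_var estimates 1 - I_1(kappa)/I_0(kappa)
```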

Beta-prime / inverted-Beta(alpha, beta)

If X ~ Beta(alpha, beta), then Y = X / (1 - X) ~ Beta'(alpha, beta) on (0, infinity).

PDF: f(y) = y^(alpha - 1) (1 + y)^(-(alpha + beta)) / B(alpha, beta).
Mean: alpha / (beta - 1) for beta > 1.

Use cases: odds-ratio distributions (since Y = X/(1-X) is the odds); priors over ratios; specialty distribution in Bayesian variance modeling.

Skew-normal(xi, omega, alpha)

Adds a skewness parameter alpha to the Normal, allowing asymmetric bell shapes.

PDF: f(x) = (2 / omega) phi((x - xi)/omega) Phi(alpha (x - xi)/omega), where phi, Phi are standard Normal PDF and CDF.
Recovers Normal when alpha = 0.
Right-skewed when alpha > 0, left-skewed when alpha < 0.

Use cases: modeling moderately skewed continuous data where the log-transform is too aggressive (and would over-correct); regression residuals that exhibit asymmetry; financial returns.

Mixture distributions

A finite (or infinite) weighted sum of component densities:

f(x) = sum_i w_i f_i(x), with w_i >= 0 and sum_i w_i = 1.

Each f_i can be any distribution. The most-used special case is the Gaussian Mixture Model (GMM): f(x) = sum_i w_i N(x; mu_i, Sigma_i). Pearson (1894) introduced GMMs to model crab measurements; the modern Expectation-Maximization (EM) algorithm fits them by iterating between soft-assignment of points to components (E-step) and maximum-likelihood re-estimation of (w_i, mu_i, Sigma_i) (M-step).

Properties:

Gaussian mixtures can approximate any continuous density arbitrarily well (universal approximators, given enough components).
Identifiability is a concern: permuting the component labels leaves the mixture unchanged (label-switching), so the parameters are identified at best up to relabeling.
Bayesian variants use Dirichlet priors on w_i; nonparametric Bayes (Dirichlet processes) allows the number of components to be inferred from data.

Use cases: density estimation; clustering (GMM is “soft k-means” with full covariances); speaker recognition (each speaker = a GMM); background subtraction in video; segmentation.
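The E-step / M-step loop described above can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration, not a production fitter: the function name fit_gmm_1d, the quantile-based initialization, and the fixed iteration count are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm

def fit_gmm_1d(x, n_components=2, n_iter=200):
    """Minimal EM for a 1-D Gaussian mixture (hypothetical helper)."""
    n = x.size
    w = np.full(n_components, 1.0 / n_components)
    # Deterministic initialization: spread the means over the data quantiles
    mu = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    sigma = np.full(n_components, x.std())
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = P(component k | x_i)
        dens = w * norm.pdf(x[:, None], loc=mu, scale=sigma)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood re-estimation
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    # Sort by mean so the output does not depend on component labeling
    order = np.argsort(mu)
    return w[order], mu[order], sigma[order]

# Synthetic data: 60% from N(-2, 0.5^2), 40% from N(3, 1^2)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 600), rng.normal(3.0, 1.0, 400)])
w, mu, sigma = fit_gmm_1d(x)
print(w, mu, sigma)
```

Sorting the components by mean is one simple way to sidestep label-switching when comparing fits. For real work a library implementation with convergence checks and covariance regularization is the safer choice.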

Heavy-tailed family — Cauchy, Pareto, Lévy stable

A distribution is heavy-tailed if its right tail decays more slowly than any exponential, i.e. lim_{x -> infinity} exp(lambda x) (1 - F(x)) = infinity for all lambda > 0. Equivalently, E[exp(tX)] is infinite for every t > 0, so the MGF does not exist in any neighborhood of 0. Important heavy-tailed families:

Power-law tails: 1 - F(x) ~ c x^(-alpha) as x -> infinity. Pareto, Cauchy (alpha = 1), Student’s t (alpha = nu).
Lévy stable distributions: a 4-parameter family S(alpha, beta, gamma, delta) with stability parameter alpha in (0, 2]. The Normal is alpha = 2, the Cauchy is alpha = 1 with beta = 0, and the Lévy distribution is alpha = 1/2 with beta = 1. Stable distributions are the only possible non-degenerate limits in distribution of centered and rescaled sums of iid random variables (the generalized CLT, which applies once the finite-variance assumption is dropped).

Use cases: financial returns (well-documented heavier-than-Gaussian tails: market crashes happen far more often than Normal predicts); insurance loss distributions; failure-time data; network traffic volumes; word and city size distributions.
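To make "heavier than Gaussian" concrete, this sketch compares upper-tail probabilities P(X > k) for the standard Normal, Student's t with 3 degrees of freedom, and the standard Cauchy. It also checks the power-law form for the Cauchy, where x (1 - F(x)) approaches 1/pi. The thresholds are illustrative choices.

```python
import numpy as np
from scipy import stats

# Survival function 1 - F(k) at a few thresholds
tails = {
    k: {
        "normal": stats.norm.sf(k),
        "t3": stats.t.sf(k, df=3),
        "cauchy": stats.cauchy.sf(k),
    }
    for k in (3.0, 5.0, 10.0)
}
for k, row in tails.items():
    print(k, row)

# Power-law check for the Cauchy (alpha = 1): x * (1 - F(x)) -> 1 / pi
x_large = 1000.0
cauchy_tail_constant = x_large * stats.cauchy.sf(x_large)
print(cauchy_tail_constant, 1 / np.pi)
```

At k = 5 the Normal tail probability is orders of magnitude below the t and Cauchy tails, which is the practical meaning of the heavy-tail warnings above.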

Generalized Extreme Value (GEV) + Generalized Pareto

The two pillars of extreme-value theory (EVT): asymptotic distributions of maxima and of excesses over a threshold.

Fisher-Tippett-Gnedenko theorem: if M_n = max(X_1, ..., X_n) for iid X_i and there exist sequences a_n, b_n > 0 such that (M_n - a_n)/b_n converges in distribution to a non-degenerate G, then G is one of three types: Gumbel, Frechet, or Weibull (the “extreme-value families”). The Generalized Extreme Value (GEV) distribution unifies them via a shape parameter xi:

PDF: f(x) = (1/sigma) [1 + xi(x - mu)/sigma]^(-1/xi - 1) exp(-[1 + xi(x - mu)/sigma]^(-1/xi)) for 1 + xi(x - mu)/sigma > 0.
xi > 0: Frechet (heavy tail). xi = 0: Gumbel (light tail; take limit). xi < 0: reverse Weibull (bounded tail).

Pickands-Balkema-de Haan theorem: the conditional distribution of excesses over a high threshold u (i.e. X - u | X > u) is approximately Generalized Pareto: f(x) = (1/sigma)(1 + xi x/sigma)^(-1/xi - 1).

Use cases: flood and rainfall return-period modeling; wind speed extremes for structural engineering; hurricane intensity tails; financial Value-at-Risk and Expected Shortfall; insurance catastrophe modeling; reliability of safety-critical systems.
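The GEV and Generalized Pareto densities above can be checked against scipy. One detail to watch: scipy.stats.genextreme uses a shape parameter c with the opposite sign, c = -xi, while scipy.stats.genpareto uses c = xi. The parameter values below are illustrative, and the return-level formula is obtained by inverting the GEV CDF exp(-[1 + xi(x - mu)/sigma]^(-1/xi)).

```python
import numpy as np
from scipy import stats

mu, sigma, xi = 10.0, 2.0, 0.2  # illustrative location, scale, shape (Frechet case)

# GEV density from the formula above; requires 1 + xi (x - mu) / sigma > 0
x = np.array([8.0, 10.0, 13.0, 20.0])
s = 1 + xi * (x - mu) / sigma
gev_formula = (1 / sigma) * s ** (-1 / xi - 1) * np.exp(-s ** (-1 / xi))
gev_scipy = stats.genextreme.pdf(x, -xi, loc=mu, scale=sigma)  # note c = -xi

# Generalized Pareto density for excesses y = X - u over a threshold
y = np.array([0.5, 1.0, 5.0])
gpd_formula = (1 / sigma) * (1 + xi * y / sigma) ** (-1 / xi - 1)
gpd_scipy = stats.genpareto.pdf(y, xi, loc=0.0, scale=sigma)

# T-period return level: the quantile exceeded with probability 1/T per period
T = 100
return_level_formula = mu + (sigma / xi) * ((-np.log(1 - 1 / T)) ** (-xi) - 1)
return_level_scipy = stats.genextreme.ppf(1 - 1 / T, -xi, loc=mu, scale=sigma)

print(np.allclose(gev_formula, gev_scipy), np.allclose(gpd_formula, gpd_scipy))
print(return_level_formula, return_level_scipy)
```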

Half-normal / half-t / truncated-normal

When the underlying quantity is nonnegative (a magnitude, a variance, a scale parameter), truncated and folded Normals are natural.

Half-normal(sigma): PDF f(x) = sqrt(2/(pi sigma^2)) exp(-x^2 / (2 sigma^2)) for x >= 0. Mean sigma sqrt(2/pi), variance sigma^2 (1 - 2/pi).
Half-t(nu, sigma): same idea but t-tailed; Gelman recommends it as a weakly informative prior on standard-deviation parameters in Bayesian hierarchical models (the half-Cauchy is the nu = 1 case).
Truncated-Normal: restrict N(mu, sigma^2) to [a, b]. PDF: phi((x - mu)/sigma) / (sigma (Phi((b - mu)/sigma) - Phi((a - mu)/sigma))) for x in [a, b].

Use cases: priors on standard deviations and scale parameters; censored regression and Tobit models; rejection-sampling helpers.
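A sketch of the truncated-Normal and half-normal formulas in scipy. Note that scipy.stats.truncnorm takes its bounds in standardized units, (a - mu)/sigma and (b - mu)/sigma, not on the original scale. All numeric values are illustrative.

```python
import numpy as np
from scipy import stats

mu, sigma = 1.0, 2.0
lo, hi = 0.0, 5.0  # truncation interval [a, b] on the original scale

# scipy expects standardized bounds
a_std, b_std = (lo - mu) / sigma, (hi - mu) / sigma
tn = stats.truncnorm(a_std, b_std, loc=mu, scale=sigma)

# PDF from the formula above, at interior points of [lo, hi]
x = np.array([0.5, 1.0, 2.5, 4.5])
z = (x - mu) / sigma
pdf_formula = stats.norm.pdf(z) / (
    sigma * (stats.norm.cdf(b_std) - stats.norm.cdf(a_std))
)
pdf_scipy = tn.pdf(x)

# Half-normal moments: mean sigma sqrt(2/pi), variance sigma^2 (1 - 2/pi)
hn = stats.halfnorm(scale=sigma)
hn_mean_formula = sigma * np.sqrt(2 / np.pi)
hn_var_formula = sigma ** 2 * (1 - 2 / np.pi)

print(np.allclose(pdf_formula, pdf_scipy))
print(hn.mean(), hn_mean_formula, hn.var(), hn_var_formula)
```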

Stable distribution

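The family of distributions closed under convolution up to location and scale: if X_1 and X_2 are independent copies of X and a, b > 0, then a X_1 + b X_2 has the same distribution as c X + d for some c > 0 and d. The Normal, Cauchy, and Lévy distributions are members. Characterized via characteristic function: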

phi(t) = exp(i delta t - |gamma t|^alpha (1 - i beta sign(t) Phi(alpha, t)))

with stability index alpha in (0, 2], skewness beta in [-1, 1], scale gamma > 0, location delta in R, and Phi(alpha, t) = tan(pi alpha / 2) for alpha != 1, Phi(alpha, t) = -(2/pi) log|t| for alpha = 1 (in this parameterization; others are in use). Only alpha = 2 (Normal), alpha = 1, beta = 0 (Cauchy), and alpha = 1/2, beta = 1 (Lévy) have closed-form densities.

Key property: by the generalized CLT, sums of iid random variables with power-law tails of index alpha < 2 (hence infinite variance) converge, after centering and rescaling, to a stable law with that alpha.

Use cases: financial-return modeling (Mandelbrot’s original application); modeling aggregate insurance losses; pure-mathematics scaling-limit constructions.
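The characteristic function above can be evaluated directly and checked against the two best-known special cases. This sketch assumes the common choice Phi(alpha, t) = tan(pi alpha / 2) for alpha != 1 and -(2/pi) log|t| for alpha = 1; stable_cf is a hypothetical helper written for this example, not a library function. In this parameterization alpha = 2 gives a Normal with variance 2 gamma^2.

```python
import numpy as np

def stable_cf(t, alpha, beta, gamma, delta):
    """Stable characteristic function in the parameterization used above."""
    t = np.asarray(t, dtype=float)
    if alpha == 1:
        # Replace |t| = 0 by 1 so that log gives 0 there (the term vanishes anyway)
        safe_abs_t = np.where(t == 0, 1.0, np.abs(t))
        big_phi = -(2 / np.pi) * np.log(safe_abs_t)
    else:
        big_phi = np.tan(np.pi * alpha / 2)
    exponent = 1j * delta * t - np.abs(gamma * t) ** alpha * (
        1 - 1j * beta * np.sign(t) * big_phi
    )
    return np.exp(exponent)

t = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
gamma, delta = 1.5, 0.3  # illustrative scale and location

# alpha = 2: Normal with mean delta and variance 2 * gamma^2
cf_alpha2 = stable_cf(t, 2.0, 0.0, gamma, delta)
cf_normal = np.exp(1j * delta * t - (2 * gamma ** 2) * t ** 2 / 2)

# alpha = 1, beta = 0: Cauchy with location delta and scale gamma
cf_alpha1 = stable_cf(t, 1.0, 0.0, gamma, delta)
cf_cauchy = np.exp(1j * delta * t - gamma * np.abs(t))

print(np.allclose(cf_alpha2, cf_normal), np.allclose(cf_alpha1, cf_cauchy))
```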

Distribution selection by use case

A practical lookup. When in doubt, start with the canonical choice and only deviate if the data forces it.

Coin flip / single yes-no → Bernoulli.
Count of successes in n trials → Binomial; → Beta-Binomial if overdispersed.
Count of events per fixed time/area → Poisson; → Negative Binomial if overdispersed (var > mean).
Time to first event → Exponential (constant hazard); → Weibull (changing hazard); → Log-normal (multiplicative growth processes).
Time to r-th event → Gamma / Erlang.
Probability or proportion → Beta.
Bounded continuous quantity on [a, b] → Beta after rescaling to [0, 1].
Measurement error or regression residuals (CLT-justified: sum of many small independent noises) → Normal.
Small-sample mean inference with unknown variance → t.
Sample variance inference → chi-square.
ANOVA / equal-variance hypothesis testing → F.
Direction / angle on the circle → von Mises (or Bingham, Kent for higher dimensions).
Heavy-tailed financial returns → Cauchy / Student’s t with small nu / Pareto / stable.
Extreme value (annual max, threshold exceedance) → GEV (for maxima) / Generalized Pareto (for exceedances).
Multinomial probabilities → Dirichlet prior.
Covariance matrix → Inverse-Wishart prior (or LKJ + half-Cauchy for modern Bayes).
Positive-only continuous quantity, no special structure → Gamma (flexible scale family) or Log-normal (multiplicative).
Multimodal continuous data → Gaussian mixture.
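For the count-data rows in the lookup above, the variance-to-mean ratio is a simple first diagnostic. The sketch below uses a hypothetical helper, dispersion_index, on simulated data; a ratio near 1 is consistent with Poisson, while a ratio well above 1 points toward Negative Binomial. The simulation parameters are illustrative.

```python
import numpy as np

def dispersion_index(counts):
    """Sample variance divided by sample mean (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(0)

# Poisson with mean 4: variance equals the mean, so the index is near 1
poisson_counts = rng.poisson(4.0, size=20_000)

# Negative Binomial (failures before r successes) with r = 2, p = 1/3:
# mean r (1 - p) / p = 4 and variance r (1 - p) / p^2 = 12, so the index is near 3
nb_counts = rng.negative_binomial(2.0, 1 / 3, size=20_000)

d_poisson = dispersion_index(poisson_counts)
d_nb = dispersion_index(nb_counts)
print(d_poisson, d_nb)
```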

Conjugate prior summary table

Conjugate prior-likelihood pairs underlie almost all closed-form Bayesian inference. The pattern: the prior and posterior live in the same family; the likelihood’s sufficient statistics update the prior’s hyperparameters in a simple closed form.

Likelihood	Parameter	Conjugate prior	Posterior update
Bernoulli(p)	p	Beta(alpha, beta)	Beta(alpha + sum x_i, beta + n - sum x_i)
Binomial(n, p)	p	Beta(alpha, beta)	Beta(alpha + k, beta + n - k)
Geometric(p)	p	Beta(alpha, beta)	Beta(alpha + n, beta + sum x_i - n) (trials form, x_i >= 1; the failures form, x_i >= 0, gives beta + sum x_i)
Negative Binomial(r, p), r fixed	p	Beta(alpha, beta)	Beta(alpha + nr, beta + sum x_i)
Poisson(lambda)	lambda	Gamma(alpha, beta)	Gamma(alpha + sum x_i, beta + n)
Multinomial(n, p)	p	Dirichlet(alpha)	Dirichlet(alpha + k)
Exponential(lambda)	lambda	Gamma(alpha, beta)	Gamma(alpha + n, beta + sum x_i)
Gamma(k, theta), k known	rate = 1/theta	Gamma(alpha, beta)	Gamma(alpha + nk, beta + sum x_i)
Normal(mu, sigma^2), sigma^2 known	mu	Normal(mu_0, tau_0^2)	Normal(precision-weighted mean, updated precision)
Normal(mu, sigma^2), mu known	sigma^2	Inverse-Gamma(alpha, beta)	Inv-Gamma(alpha + n/2, beta + (1/2) sum (x_i - mu)^2)
Normal(mu, sigma^2), both unknown	(mu, sigma^2)	Normal-Inverse-Gamma	Closed-form NIG update
Multivariate Normal, Sigma known	mu	Multivariate Normal	Multivariate Normal
Multivariate Normal, mu known	Sigma	Inverse-Wishart	Inverse-Wishart
Uniform(0, theta)	theta	Pareto(x_m, alpha)	Pareto(max(x_m, max x_i), alpha + n)

For derivations, see bayesian-inference and Bishop PRML chapter 2.
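Two rows of the table, worked numerically. This sketch applies the Beta-Bernoulli and Gamma-Poisson updates to small made-up data sets (the data and prior hyperparameters are illustrative) and checks that prior times likelihood is proportional to the stated posterior. The Gamma prior is in the rate parameterization used in the table, so scipy's scale is 1 / beta.

```python
import numpy as np
from scipy import stats

# Beta-Bernoulli: Beta(alpha, beta) -> Beta(alpha + sum x_i, beta + n - sum x_i)
alpha0, beta0 = 2.0, 2.0
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
k, n = x.sum(), x.size
alpha_n, beta_n = alpha0 + k, beta0 + n - k
posterior_p = stats.beta(alpha_n, beta_n)

# Check: prior * likelihood is a constant multiple of the posterior density
p_grid = np.linspace(0.05, 0.95, 19)
unnormalized = stats.beta.pdf(p_grid, alpha0, beta0) * p_grid ** k * (1 - p_grid) ** (n - k)
ratio = unnormalized / posterior_p.pdf(p_grid)

# Gamma-Poisson: Gamma(alpha, beta) -> Gamma(alpha + sum x_i, beta + n), beta a rate
a0, b0 = 3.0, 1.0
counts = np.array([4, 2, 5, 3])
a_n, b_n = a0 + counts.sum(), b0 + counts.size
posterior_lambda = stats.gamma(a_n, scale=1 / b_n)

print(alpha_n, beta_n, posterior_p.mean())
print(a_n, b_n, posterior_lambda.mean())
```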

MGF / characteristic function summary

Distribution	MGF `M(t)`	Characteristic function `phi(t)`
Bernoulli(p)	`(1 - p) + p e^t`	`(1 - p) + p e^(it)`
Binomial(n, p)	`((1 - p) + p e^t)^n`	`((1 - p) + p e^(it))^n`
Geometric(p) (trials form)	`p e^t / (1 - (1 - p) e^t)`	`p e^(it) / (1 - (1 - p) e^(it))`
Negative Binomial(r, p)	`(p / (1 - (1 - p) e^t))^r`	analog
Poisson(lambda)	`exp(lambda (e^t - 1))`	`exp(lambda (e^(it) - 1))`
Multinomial(n, p)	`(sum_i p_i e^(t_i))^n`	analog
Uniform(a, b)	`(e^(bt) - e^(at)) / (t (b - a))`	`(e^(ibt) - e^(iat)) / (it (b - a))`
Normal(mu, sigma^2)	`exp(mu t + sigma^2 t^2 / 2)`	`exp(i mu t - sigma^2 t^2 / 2)`
Exponential(lambda)	`lambda / (lambda - t)` for `t < lambda`	`lambda / (lambda - it)`
Gamma(k, theta)	`(1 - theta t)^(-k)` for `t < 1/theta`	`(1 - i theta t)^(-k)`
Chi-square(k)	`(1 - 2t)^(-k/2)` for `t < 1/2`	`(1 - 2it)^(-k/2)`
Laplace(mu, b)	`exp(mu t) / (1 - b^2 t^2)` for `|t| < 1/b`	`exp(i mu t) / (1 + b^2 t^2)`
Cauchy(x_0, gamma)	does not exist	`exp(i x_0 t - gamma |t|)`
Pareto(x_m, alpha)	does not exist for `t > 0`	given via incomplete gamma
Logistic(mu, s)	`exp(mu t) B(1 - s t, 1 + s t)` for `|s t| < 1`	`exp(i mu t) pi s t / sinh(pi s t)`

MGFs uniquely determine distributions when they exist on an open neighborhood of 0. Characteristic functions always exist and always uniquely determine the distribution (the Lévy continuity theorem governs convergence).

Software

Python: scipy.stats provides over 90 distributions with consistent pdf/pmf, cdf, ppf (inverse CDF), rvs (random sampling), fit (MLE fitting), and entropy. numpy.random provides high-throughput samplers. Probabilistic-programming libraries: PyMC and NumPyro wrap distributions for Bayesian inference; TensorFlow Probability and Pyro (PyTorch) for differentiable distributions in deep-learning pipelines.
R: base stats package has d/p/q/r family functions (dnorm, pnorm, qnorm, rnorm, etc.) for the standard distributions. MASS adds multivariate and specialty families. Stan (via rstan or cmdstanr) provides Bayesian inference.
Julia: Distributions.jl is a comprehensive, fast, well-typed library covering 50+ distributions with consistent API. Turing.jl for Bayesian inference.
MATLAB: Statistics and Machine Learning Toolbox.
C++: Boost.Math and <random> in the standard library.

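Practical fitting tip: scipy’s scipy.stats.<distribution>.fit(data) returns MLE parameter estimates for continuous distributions; combine with scipy.stats.kstest or scipy.stats.anderson for goodness-of-fit testing. The standard kstest p-value assumes the parameters were fixed in advance; when they are estimated from the same data, the test is conservative (p-values too large).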
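The fitting tip can be sketched on simulated Gamma data. The true parameters, the sample size, and the decision to fix the location at 0 with floc=0 are assumptions made for this example. A deliberately wrong Exponential fit is included for contrast.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.gamma(shape=3.0, scale=2.0, size=5_000)

# MLE fit with the location pinned at 0 (two free parameters: shape, scale)
shape_hat, loc_hat, scale_hat = stats.gamma.fit(data, floc=0)

# Kolmogorov-Smirnov test against the fitted Gamma
ks_gamma = stats.kstest(data, "gamma", args=(shape_hat, loc_hat, scale_hat))

# Same procedure with a misspecified family; expon.fit returns (loc, scale)
expon_params = stats.expon.fit(data, floc=0)
ks_expon = stats.kstest(data, "expon", args=expon_params)

print(shape_hat, scale_hat)
print(ks_gamma.statistic, ks_gamma.pvalue)
print(ks_expon.statistic, ks_expon.pvalue)
```

Comparing the KS statistics across candidate families is often more informative than reading a single p-value in isolation.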

Pitfalls

Confusing PMF and PDF. P(X = x) = 0 for any continuous distribution; the PDF is a density, not a probability. f(x) > 1 is fine (the integral is what must equal 1).
Assuming mean equals variance. True for Poisson, false in general. Empirically check before applying the Poisson; if var > mean, switch to Negative Binomial.
Fitting Normal to right-skewed data. Income, prices, sizes, durations are almost never Normal. Log-transform first (yielding Log-normal) or use Gamma / Weibull directly.
Using Exponential where Weibull is needed. Exponential assumes constant hazard; most physical components have a bathtub curve (decreasing → constant → increasing hazard). Weibull covers each regime through its shape parameter k (k < 1 decreasing, k = 1 constant, k > 1 increasing hazard), but any single Weibull hazard is monotone, so a full bathtub curve needs a piecewise or mixture model.
Forgetting Bessel’s correction (n - 1 divisor in sample variance). The sample variance with n in the denominator is the MLE under Normal but is biased downward; dividing by n - 1 gives the unbiased estimator. Many tools default to n - 1 (R’s var, pandas’ .var), but numpy’s np.var defaults to n; pass ddof=1 for the unbiased version.
MLE of Normal variance is biased. Same point as above: sigma_MLE^2 = (1/n) sum (x_i - x_bar)^2 has bias -sigma^2/n. Bayesian inference avoids this issue (posterior accounts for parameter uncertainty); REML and Bessel-corrected estimators are frequentist fixes.
Naive use of the Cauchy “mean”. The arithmetic mean of iid Cauchy samples is itself Cauchy — averaging does not reduce the spread. Use the sample median (which does concentrate, at rate 1/sqrt(n)).
Conflating “no correlation” with “independence”. Equivalent under joint normality but not in general. Many ML papers slip on this; check the multivariate-Normal section for the precise statement.
Improper priors silently giving improper posteriors. Especially with flat or 1/sigma-type priors on group-level scale parameters in hierarchical models; proper heavy-tailed priors such as half-Cauchy or half-t avoid the problem, but always verify that the posterior is proper.
Negative Binomial parameterization confusion. SciPy, R, NumPy, and textbooks all use slightly different conventions (success-count r vs failure-count, probability p vs odds, mean-dispersion form vs (r, p) form). Always read the docs before plugging in values.
Beta-Binomial vs Binomial confusion in A/B testing. Pure Binomial assumes a single fixed p; Beta-Binomial allows p to vary per session/user. The two give very different uncertainty intervals.
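Two of the pitfalls above are easy to demonstrate in a few lines: the np.var default divisor, and the failure of the sample mean to concentrate for Cauchy data. The replication counts and sample sizes are illustrative choices.

```python
import numpy as np

# Bessel's correction: np.var divides by n unless ddof=1 is passed
x = np.array([1.0, 2.0, 3.0, 4.0])
var_mle = np.var(x)               # sum of squared deviations 5.0 divided by n = 4
var_unbiased = np.var(x, ddof=1)  # same sum divided by n - 1 = 3

# Cauchy: compare the spread of the sample mean and sample median
rng = np.random.default_rng(0)
draws = rng.standard_cauchy(size=(2_000, 1_000))  # 2000 replications of n = 1000
means = draws.mean(axis=1)
medians = np.median(draws, axis=1)

def iqr(v):
    """Interquartile range, a spread measure that exists for the Cauchy."""
    q75, q25 = np.percentile(v, [75, 25])
    return q75 - q25

iqr_means = iqr(means)      # stays near 2, the IQR of a single standard Cauchy draw
iqr_medians = iqr(medians)  # shrinks at rate 1 / sqrt(n)

print(var_mle, var_unbiased)
print(iqr_means, iqr_medians)
```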

Cross-references

probability-fundamentals — sigma-algebras, expectation, CLT, conditional probability (the machinery this catalog builds on).
bayesian-inference — conjugate updates, posterior derivations, MCMC, variational methods (consumer of this catalog’s prior pairs).
hypothesis-testing-mle — sampling distributions (t, F, chi-square) used in classical inference; MLE properties.
mcmc-sampling — for posteriors that are not closed-form (no conjugate prior).
information-theory — entropy and divergences for these distributions.
reliability-engineering — Weibull bathtub-curve modeling, MTBF.
six-sigma — Normal-based capability indices, process control.
bayesian-estimation — multivariate Normal in Kalman filters, von Mises / Bingham on SO(3).

Citations

Johnson, N. L., Kotz, S., Balakrishnan, N. (1994, 1995). Continuous Univariate Distributions, Volumes 1 and 2, 2nd edition. Wiley. The canonical encyclopedic reference for parametric families on the real line.
Johnson, N. L., Kemp, A. W., Kotz, S. (2005). Univariate Discrete Distributions, 3rd edition. Wiley. Companion for discrete families.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. Compact, modern survey of distributions and the inference machinery that uses them.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Appendix B gives one-page summaries of all distributions used in the main text, with PDFs, moments, and conjugacy.
Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., Rubin, D. (2013). Bayesian Data Analysis, 3rd edition. CRC. Practical guide to prior choices and conjugate vs non-conjugate inference.
SciPy contributors. scipy.stats reference documentation. https://docs.scipy.org/doc/scipy/reference/stats.html (authoritative, code-checkable distribution catalog).
Embrechts, P., Klueppelberg, C., Mikosch, T. (1997). Modelling Extremal Events for Insurance and Finance. Springer. The reference for GEV, Generalized Pareto, and heavy-tailed practice.
Mardia, K. V., Jupp, P. E. (2000). Directional Statistics. Wiley. The reference for von Mises, Bingham, and related circular/spherical distributions.
