Probability Fundamentals — Math Reference

1. At a glance

Probability is the mathematical language of uncertainty. It quantifies how likely outcomes are when we cannot predict them deterministically, and it provides the substrate for statistics, machine learning, simulation, reliability engineering, finance, signal processing, communication theory, and quantum mechanics. Anywhere randomness — true or epistemic — enters a system, probability is the toolbox.

Why it matters:

  • Uncertainty quantification — every measurement has noise; every prediction has error bars.
  • Machine learning — loss functions are negative log-likelihoods, regularizers are log-priors, and generalization bounds rest on concentration inequalities.
  • Simulation + Monte Carlo — when closed-form integration fails, sample.
  • Reliability — failure-time distributions (Weibull, exponential) drive MTBF and maintenance schedules.
  • Finance — option pricing (Black-Scholes), VaR, expected shortfall — all distributional.
  • Signal processing — noise is modeled probabilistically; Kalman + particle filters update beliefs.

Two interpretations (both consistent with the same axioms):

  • Frequentist — probability is the long-run frequency of an event under repeated trials. P(heads) = 1/2 because flipping a fair coin many times converges to half heads. This view treats probabilities as objective properties of the world.
  • Bayesian — probability is a degree of belief, updated by evidence via Bayes’ theorem. P(heads) = 1/2 because that is your prior credence; new data shifts it. This view treats probabilities as epistemic — they live in the observer.

The interpretive split affects how you do inference (confidence intervals vs credible intervals, p-values vs posteriors), but the math underneath — the Kolmogorov axiomatization — is identical.

Kolmogorov axioms (1933) — Andrey Kolmogorov in Grundbegriffe der Wahrscheinlichkeitsrechnung gave probability theory its modern measure-theoretic foundation. A probability space is a triple (Ω, F, P) where:

  1. Non-negativity — P(A) ≥ 0 for every event A ∈ F.
  2. Normalization — P(Ω) = 1.
  3. σ-additivity (countable additivity) — for any countable collection of pairwise disjoint events {A_i}: P(∪_i A_i) = Σ_i P(A_i).

Everything in probability theory — distributions, expectation, limit theorems, stochastic processes — is derived from these three rules plus measure theory. F is a σ-algebra of measurable subsets; this technical detail prevents pathological “non-measurable” sets from breaking countable additivity.

2. Events and sample space

  • Sample space Ω — the set of all possible outcomes of an experiment. For a coin flip, Ω = {H, T}. For a die roll, Ω = {1, 2, 3, 4, 5, 6}. For a continuous measurement, Ω might be ℝ or [0, ∞).
  • Event — a subset A ⊆ Ω. “Roll an even number” = {2, 4, 6}.
  • σ-algebra F — the collection of events we can assign probability to. For finite/countable Ω, F = 2^Ω (the power set). For Ω = ℝ, F is the Borel σ-algebra.

Basic set operations and probability rules:

  • Union — P(A ∪ B) = P(A) + P(B) − P(A ∩ B). The subtraction prevents double-counting the overlap. For disjoint events (A ∩ B = ∅), it reduces to P(A) + P(B).
  • Inclusion-exclusion (general) — P(∪_i A_i) = Σ P(A_i) − Σ P(A_i ∩ A_j) + Σ P(A_i ∩ A_j ∩ A_k) − …
  • Complement — P(A^c) = 1 − P(A). Useful for “at least one” problems: P(at least one head in n flips) = 1 − P(no heads) = 1 − (1/2)^n.
  • Monotonicity — A ⊆ B ⇒ P(A) ≤ P(B).
  • Union bound (Boole’s inequality) — P(∪_i A_i) ≤ Σ_i P(A_i). Loose but always true; useful for upper-bounding tail events.

Independence — two events A and B are independent iff P(A ∩ B) = P(A) · P(B). Equivalently, P(A | B) = P(A) (conditioning on B doesn’t change the probability of A). Mutually independent ⇒ pairwise independent, but the converse fails (Bernstein’s example: three pairwise independent events that fail joint independence).

3. Conditional probability and Bayes

Definition — for events A, B with P(B) > 0: P(A | B) = P(A ∩ B) / P(B)

Read as “probability of A given B.” Conditioning restricts the sample space to B and re-normalizes.

Chain rule — P(A ∩ B) = P(A | B) · P(B) = P(B | A) · P(A). Generalizes to: P(A_1 ∩ … ∩ A_n) = P(A_1) · P(A_2 | A_1) · P(A_3 | A_1 ∩ A_2) · … · P(A_n | A_1 ∩ … ∩ A_{n-1}).

Law of total probability — if {B_i} partitions Ω: P(A) = Σ_i P(A | B_i) · P(B_i)

Bayes’ theorem (Thomas Bayes, posthumous 1763; rediscovered + generalized by Laplace 1774): P(A | B) = P(B | A) · P(A) / P(B)

Or, with a partition {B_i}: P(B_i | A) = P(A | B_i) · P(B_i) / Σ_j P(A | B_j) · P(B_j)

In Bayesian-inference vocabulary: posterior = (likelihood × prior) / evidence

Use cases:

  • Medical testing (base rate fallacy) — a test for a rare disease has 99% sensitivity (P(positive | disease) = 0.99) and 99% specificity (P(negative | no disease) = 0.99). The disease prevalence is 1/10000. If you test positive, what’s P(disease | positive)?

    P(disease | positive) = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999) ≈ 0.0098

    Less than 1%. The huge negative class — 99.99% of the population — generates more false positives in absolute terms than the tiny positive class generates true positives. This is the base rate fallacy: people conflate P(positive | disease) (the test’s quality) with P(disease | positive) (what you actually want).

  • Spam filtering (naive Bayes) — model P(spam | words) ∝ P(words | spam) · P(spam), assuming words are conditionally independent given the class. Crude but effective; the assumption is wildly wrong but the decision boundary is often robust.

  • Bayesian inference (broadly) — Bayes is the engine. Start with a prior over parameters θ, observe data D, update to posterior P(θ | D) ∝ P(D | θ) · P(θ). The evidence P(D) = ∫ P(D | θ) · P(θ) dθ is often the hard part (motivation for MCMC + variational inference).

4. Random variables

A random variable X is a measurable function X: Ω → ℝ that assigns a real number to each outcome. It is not random in itself — Ω is random; X is the mapping. The “value of X” is X(ω) where ω ∈ Ω is the actual outcome that occurred.

Discrete random variable — takes values in a countable set (e.g., ℕ, {0,1}). Described by:

  • PMF (probability mass function) p(x) = P(X = x), with Σ_x p(x) = 1 and p(x) ≥ 0.

Continuous random variable — takes values in an uncountable set (typically an interval of ℝ). Described by:

  • PDF (probability density function) f(x) ≥ 0 with ∫ f(x) dx = 1. For continuous X, P(X = x) = 0 for any single x; probabilities come from integrating: P(a ≤ X ≤ b) = ∫_a^b f(x) dx.

CDF (cumulative distribution function) — F(x) = P(X ≤ x), defined for all random variables (discrete, continuous, mixed). Properties:

  • Non-decreasing.
  • F(−∞) = 0, F(+∞) = 1.
  • Right-continuous.
  • For continuous X: f(x) = F’(x) where the derivative exists.
  • For discrete X: F has jumps at the support points; the jump size at x equals p(x).

Quantile function — Q(p) = F⁻¹(p) = inf{x : F(x) ≥ p}. Inverts the CDF; useful for sampling (inverse-CDF method, §13).

5. Expectation, variance, and moments

Expectation (mean, first moment) — the probability-weighted average of X:

  • Discrete: E[X] = Σ_x x · p(x)
  • Continuous: E[X] = ∫ x · f(x) dx

E[X] need not equal any value X actually takes. For a fair die, E[X] = 3.5.

Law of the unconscious statistician (LOTUS) — for any function g: E[g(X)] = Σ_x g(x) · p(x) or ∫ g(x) · f(x) dx

You can compute E[g(X)] without first deriving the distribution of Y = g(X).

Linearity of expectation — for any random variables X, Y and constants a, b: E[aX + bY] = a · E[X] + b · E[Y]

This holds regardless of independence. Linearity is one of the most useful tools in probability — it lets you decompose complicated expectations into sums of simple ones.

Variance (second central moment) — spread around the mean: Var(X) = E[(X − µ)²] = E[X²] − (E[X])² where µ = E[X]

The right-hand identity is the computational formula — usually easier than the definition.

Standard deviation — σ = √Var(X). Has the same units as X.

Scaling and shift: Var(aX + b) = a² · Var(X) The shift b drops out (location doesn’t affect spread); a is squared (variance has squared units).

Variance of sums — for any X, Y: Var(X + Y) = Var(X) + Var(Y) + 2 · Cov(X, Y)

If X and Y are independent (or merely uncorrelated), Cov = 0 and variances add.

Covariance: Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − E[X] · E[Y]

Measures linear association. Cov(X, X) = Var(X). Symmetric: Cov(X, Y) = Cov(Y, X).

Correlation (Pearson): ρ(X, Y) = Cov(X, Y) / (σ_X · σ_Y) ∈ [−1, 1]

Dimensionless; ρ = ±1 iff X and Y are linear functions of each other. ρ = 0 means uncorrelated (no linear relationship), not necessarily independent.

Higher moments:

  • Skewness — γ_1 = E[((X − µ)/σ)³]. Asymmetry: positive = right-tailed (income, log-normal), negative = left-tailed, zero = symmetric (normal).
  • Kurtosis — γ_2 = E[((X − µ)/σ)⁴]. Tail heaviness. Normal has kurtosis 3; “excess kurtosis” subtracts 3. Heavy tails (Cauchy, t, Pareto) have high kurtosis; light tails (uniform) have low.

Conditional expectation — E[X | Y] is itself a random variable (a function of Y). Key identity — tower property / law of iterated expectation: E[X] = E[E[X | Y]]

And law of total variance: Var(X) = E[Var(X | Y)] + Var(E[X | Y])

6. Common distributions — discrete

Bernoulli(p) — single trial with success probability p.

  • Support: {0, 1}
  • PMF: p(1) = p, p(0) = 1 − p
  • E = p, Var = p(1 − p)
  • Uses: indicator of a single event (coin flip, click/no-click).

Binomial(n, p) — number of successes in n independent Bernoulli(p) trials.

  • Support: {0, 1, …, n}
  • PMF: p(k) = C(n, k) · p^k · (1 − p)^{n−k}
  • E = np, Var = np(1 − p)
  • Uses: defect counts in a batch, conversions in a campaign.

Geometric(p) — number of trials until the first success (parameterization varies — “trials including the success” or “failures before the success”).

  • Support: {1, 2, 3, …}
  • PMF: p(k) = (1 − p)^{k−1} · p
  • E = 1/p, Var = (1 − p)/p²
  • Memoryless — P(X > m + n | X > m) = P(X > n).
  • Uses: time-to-first-event in discrete time.

Negative binomial(r, p) — number of trials until the r-th success.

  • Support: {r, r+1, …}
  • PMF: p(k) = C(k − 1, r − 1) · p^r · (1 − p)^{k−r}
  • E = r/p, Var = r(1 − p)/p²
  • Reduces to Geometric when r = 1.
  • Uses: over-dispersed count data (when Poisson under-fits because variance > mean).

Poisson(λ) — count of rare events in a fixed interval, when individual events are independent and the rate is constant.

  • Support: {0, 1, 2, …}
  • PMF: p(k) = e^{−λ} · λ^k / k!
  • E = Var = λ (the equidispersion property is diagnostic).
  • Poisson limit theorem (Poisson 1837, formalized later) — Bin(n, p) → Poisson(λ) as n → ∞, p → 0 with np = λ held constant.
  • Uses: arrivals at a queue, photons hitting a sensor, mutations per genome, typos per page.

Categorical / Multinomial — generalize Bernoulli / Binomial to k > 2 outcomes.

  • Categorical(p_1, …, p_k): one draw with k possible outcomes, Σ p_i = 1.
  • Multinomial(n, p_1, …, p_k): n independent categorical draws; joint distribution of counts (n_1, …, n_k).
  • E[n_i] = n · p_i, Var(n_i) = n · p_i · (1 − p_i), Cov(n_i, n_j) = −n · p_i · p_j.

7. Common distributions — continuous

Uniform(a, b) — equal density on [a, b].

  • PDF: f(x) = 1/(b − a) for a ≤ x ≤ b
  • E = (a + b)/2, Var = (b − a)²/12
  • Uses: pseudorandom-number primitive; “no information” prior on a bounded interval.

Normal / Gaussian N(µ, σ²) — the bell curve, ubiquitous because of the CLT.

  • PDF: f(x) = (1 / √(2πσ²)) · exp(−(x − µ)² / (2σ²))
  • E = µ, Var = σ²
  • 68/95/99.7 rule — P(|X − µ| < kσ) ≈ 0.68, 0.95, 0.997 for k = 1, 2, 3.
  • Standard normal: Z = (X − µ)/σ ~ N(0, 1); CDF denoted Φ(z), no closed form, uses error function.
  • Stable under linear combinations: aX + bY (independent normals) is normal.
  • Uses: measurement error, CLT-justified models, Gaussian processes, Kalman filter.

Exponential(λ) — waiting time until the next event in a Poisson process with rate λ.

  • PDF: f(x) = λ · e^{−λx} for x ≥ 0
  • E = 1/λ, Var = 1/λ²
  • CDF: F(x) = 1 − e^{−λx}
  • Memoryless — P(X > s + t | X > s) = P(X > t). Only continuous memoryless distribution.
  • Uses: inter-arrival times, radioactive decay, hazard models with constant failure rate.

Gamma(k, θ) — sum of k independent Exponential(1/θ) variables (k = “shape,” θ = “scale”). Two parameterizations are common — scale θ or rate β = 1/θ.

  • PDF: f(x) = (1 / (Γ(k) · θ^k)) · x^{k−1} · e^{−x/θ}, x ≥ 0
  • E = kθ, Var = kθ²
  • Reduces to Exponential when k = 1, Chi-square when k = ν/2, θ = 2.
  • Uses: waiting time for k-th event, conjugate prior for Poisson rate, Bayesian variance prior.

Beta(α, β) — distribution on [0, 1] with shape parameters α, β.

  • PDF: f(x) = (1 / B(α, β)) · x^{α−1} · (1 − x)^{β−1}, where B is the beta function.
  • E = α / (α + β), Var = αβ / ((α + β)² (α + β + 1))
  • Conjugate prior to Bernoulli/Binomial — if prior is Beta(α, β) and you observe s successes / f failures, posterior is Beta(α + s, β + f).
  • Uses: modeling probabilities, A/B test rates, Bayesian inference.

Chi-square(k) — sum of k squared independent standard normals: X = Σ Z_i², Z_i ~ N(0, 1).

  • E = k, Var = 2k
  • Special case of Gamma (shape k/2, scale 2).
  • Uses: sample variance distribution (for normal data), goodness-of-fit tests.

Student’s t(ν) — ratio Z / √(W/ν) where Z ~ N(0,1), W ~ χ²(ν) independent.

  • Heavier tails than normal; tail heaviness decreases with ν.
  • As ν → ∞, t(ν) → N(0, 1).
  • Uses: hypothesis testing on small samples with unknown variance (W. S. Gosset, 1908).

F(d_1, d_2) — ratio of two scaled chi-squares: F = (X_1/d_1) / (X_2/d_2).

  • Uses: ANOVA, comparing variances of two normal samples.

Log-normal — X is log-normal iff log X is normal. PDF derived via change of variables.

  • E = exp(µ + σ²/2), Var = (exp(σ²) − 1) · exp(2µ + σ²)
  • Right-skewed, support on (0, ∞).
  • Uses: multiplicative growth processes — income, stock prices, biological measurements, file sizes.

Cauchy — PDF f(x) = (1/π) · (1 / (1 + x²)).

  • No mean, no variance — integrals diverge. Heavy tails (1/x² decay).
  • Arises as ratio of two independent N(0, 1).
  • Pathological example — illustrates that CLT requires finite variance.

Weibull(k, λ) — generalizes exponential with shape parameter k.

  • PDF: f(x) = (k/λ) · (x/λ)^{k−1} · exp(−(x/λ)^k), x ≥ 0
  • k < 1: decreasing hazard (infant mortality).
  • k = 1: constant hazard (exponential).
  • k > 1: increasing hazard (wear-out).
  • Uses: reliability, time-to-failure, wind-speed extremes.

Pareto — power-law distribution.

  • PDF: f(x) = (αx_m^α) / x^{α+1} for x ≥ x_m
  • Heavy tails; mean exists only for α > 1; variance only for α > 2.
  • Uses: income/wealth distribution, city sizes, file sizes — anywhere “80/20” or scale-free behavior shows up.

8. Joint distributions

For two random variables X, Y:

  • Joint PMF/PDF — p(x, y) (discrete) or f(x, y) (continuous), with Σ Σ p(x, y) = 1 or ∫∫ f(x, y) dx dy = 1.
  • Marginal — p(x) = Σ_y p(x, y) (discrete) or f(x) = ∫ f(x, y) dy (continuous). The distribution of one variable, ignoring the other.
  • Conditional — p(x | y) = p(x, y) / p(y) (where p(y) > 0). Distribution of X given Y = y.
  • Independence — p(x, y) = p(x) · p(y) for all x, y. Equivalently, conditional equals marginal.

Multivariate normal N(µ, Σ) — k-dimensional generalization of the normal.

  • PDF: f(x) = (1 / √((2π)^k |Σ|)) · exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))
  • µ ∈ ℝ^k is the mean vector; Σ is the k×k covariance matrix (symmetric, positive semi-definite).
  • Marginals and conditionals are also normal.
  • Σ diagonal ⇔ components uncorrelated ⇔ components independent (the rare case where the implication runs both ways).
  • Uses: Gaussian processes, Kalman filter state estimates, linear regression error model.

9. Functions of random variables

Discrete — for Y = g(X), p_Y(y) = Σ_{x: g(x) = y} p_X(x).

Continuous, monotone g — Y = g(X) with g differentiable and invertible: f_Y(y) = f_X(g⁻¹(y)) · |d g⁻¹(y) / dy|

The absolute value of the derivative is the Jacobian of the inverse transform — it accounts for how the transform stretches or compresses density.

Higher-dimensional — for Y = g(X) with X, Y ∈ ℝ^k and g a diffeomorphism: f_Y(y) = f_X(g⁻¹(y)) · |det J_{g⁻¹}(y)|

where J is the Jacobian matrix.

Sum of independent random variables — distribution is the convolution:

  • Continuous: f_{X+Y}(z) = ∫ f_X(x) · f_Y(z − x) dx
  • Discrete: p_{X+Y}(z) = Σ_x p_X(x) · p_Y(z − x)

Moment-generating function (MGF): M_X(t) = E[e^{tX}]

When it exists in a neighborhood of t = 0, the MGF uniquely determines the distribution. Useful because:

  • E[X^n] = M_X^{(n)}(0) (n-th derivative at 0).
  • M_{X+Y}(t) = M_X(t) · M_Y(t) for independent X, Y.

The MGF doesn’t always exist (e.g., Cauchy, log-normal — heavy-tailed). In those cases use:

Characteristic function: φ_X(t) = E[e^{itX}]

Always exists (|e^{itX}| = 1, bounded). The characteristic function uniquely determines the distribution (Lévy’s continuity theorem). It is the Fourier transform of the PDF.

10. Limit theorems

The two foundational asymptotic results in probability.

Law of Large Numbers (LLN) — for iid X_1, X_2, … with E[X_i] = µ finite, the sample mean X̄_n = (1/n) Σ X_i converges to µ.

  • Weak LLN (Khintchine 1929) — convergence in probability: for every ε > 0, P(|X̄_n − µ| > ε) → 0 as n → ∞.
  • Strong LLN (Kolmogorov 1933) — convergence almost surely: P(lim X̄_n = µ) = 1.

LLN justifies the frequentist interpretation: the empirical frequency of an event in n trials converges to its probability.

Central Limit Theorem (CLT) — for iid X_1, X_2, … with E[X_i] = µ and Var(X_i) = σ² (both finite), the standardized sample mean converges in distribution to standard normal:

(X̄_n − µ) · √n / σ → N(0, 1)

Equivalently, the sum Σ X_i is approximately N(nµ, nσ²) for large n. Strikingly, this holds regardless of the original distribution of X_i, as long as the variance is finite. The CLT is the reason normal distributions are everywhere — many real-world quantities are sums of many small independent effects.

Variants:

  • Lindeberg-Lévy CLT (1922) — the classical iid version.
  • Lyapunov CLT — drops the iid assumption; allows the X_i to be independent but not identically distributed, provided a moment-condition (Lyapunov’s condition on the (2+δ)-th absolute central moments) holds.
  • Lindeberg-Feller CLT — necessary and sufficient condition (Lindeberg’s condition) for the CLT to hold for triangular arrays of independent variables.
  • Martingale CLT — drops independence; works for martingale-difference sequences.
  • Multivariate CLT — sample mean vector → multivariate normal.

Caveats:

  • Finite variance is essential. Cauchy variables have infinite variance and their sample mean is also Cauchy (not normal) — the CLT fails.
  • Convergence rate is O(1/√n) (Berry-Esseen theorem, 1941/1942 — quantifies the CLT error in terms of the third absolute moment).
  • For heavy-tailed but finite-variance distributions, n may need to be very large.

Glivenko-Cantelli theorem (1933) — the empirical CDF converges uniformly to the true CDF. “Fundamental theorem of statistics.”

11. Inequalities

Tools for bounding tail probabilities and proving convergence. All are exact upper bounds, often loose, but indispensable when distributional details are unknown.

Markov’s inequality — for X ≥ 0 and a > 0: P(X ≥ a) ≤ E[X] / a

Trivial proof, surprisingly useful. No assumption beyond non-negativity and finite mean.

Chebyshev’s inequality (Chebyshev 1867) — for any X with finite variance: P(|X − µ| ≥ kσ) ≤ 1/k²

Says no more than 25% of mass can be more than 2 standard deviations from the mean, no more than 11% beyond 3σ. Loose for nice distributions (normal: 4.6% beyond 2σ, 0.27% beyond 3σ) but distribution-free.

One-sided Chebyshev (Cantelli’s inequality) — P(X − µ ≥ kσ) ≤ 1/(1 + k²). Tighter than two-sided Chebyshev for one tail.

Chernoff bounds — for any t > 0 and threshold a: P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}] / e^{ta} = M_X(t) · e^{−ta}

Optimize over t for the tightest bound. For sums of independent bounded random variables, the bound is exponentially small — the foundation of concentration-of-measure.

Hoeffding’s inequality (Hoeffding 1963) — for iid X_1, …, X_n with X_i ∈ [a, b] and X̄ = (1/n) Σ X_i: P(|X̄ − µ| ≥ t) ≤ 2 · exp(−2nt² / (b − a)²)

Distribution-free concentration bound; the workhorse of generalization theory in machine learning. Says: with bounded data, the sample mean is exponentially close to the true mean.

Bernstein’s inequality — refines Hoeffding when variance is small relative to the range.

Jensen’s inequality (Jensen 1906) — for f convex and X integrable: f(E[X]) ≤ E[f(X)]

Reversed for concave f. Used to derive log-likelihood lower bounds (EM, variational inference) and AM-GM-type identities. Equality iff f is linear on the support of X, or X is constant.

Cauchy-Schwarz inequality — for random variables X, Y with E[X²], E[Y²] < ∞: (E[XY])² ≤ E[X²] · E[Y²]

Equivalently, |Cov(X, Y)| ≤ σ_X · σ_Y, so correlation lies in [−1, 1].

Minkowski + Hölder inequalities — generalize triangle inequality + Cauchy-Schwarz to L^p norms; basis of integral inequalities in probability and analysis.

McDiarmid’s inequality (1989) — for functions of independent random variables with bounded difference property; generalizes Hoeffding beyond sample means.

12. Stochastic processes (brief)

A stochastic process is a family {X_t : t ∈ T} of random variables indexed by time (or space). The index set T may be discrete or continuous; the state space may be discrete or continuous. Four canonical processes:

Markov chains — discrete-time, discrete-state stochastic process with the Markov property: P(X_{n+1} = j | X_n = i, X_{n−1}, …, X_0) = P(X_{n+1} = j | X_n = i)

“The future depends on the past only through the present.” Characterized by a transition matrix P with P_{ij} = P(X_{n+1} = j | X_n = i); rows sum to 1.

  • n-step transitions — P^n (matrix power).
  • Stationary distribution π — a row vector with π = π · P and Σ π_i = 1; describes the long-run fraction of time spent in each state.
  • Ergodic chain (irreducible + aperiodic) — converges to unique stationary distribution regardless of start state.
  • Uses: PageRank, MCMC, queueing, language models (n-gram).

Poisson process — continuous-time count process with rate λ.

  • N(t) = number of events in [0, t].
  • N(t) ~ Poisson(λt).
  • Inter-arrival times are iid Exponential(λ).
  • Independent increments — events in disjoint intervals are independent.
  • Uses: arrivals at a server, photon counts, gene mutations, customer support tickets.

Random walks — sequence S_n = X_1 + X_2 + … + X_n where X_i are iid steps.

  • Simple random walk on ℤ: X_i = ±1 with equal probability.
  • E[S_n] = 0, Var(S_n) = n.
  • Recurrent in 1D and 2D (returns to origin almost surely), transient in 3D+ (Pólya 1921).
  • Expected first return time to 0 is infinite in 1D (null recurrent).
  • Uses: gambler’s ruin, diffusion, MCMC mixing analysis.

Brownian motion (Wiener process) — continuous-time, continuous-state limit of random walks. Named for Robert Brown (1827, pollen observation), formalized by Wiener (1923).

  • B(0) = 0; B(t) − B(s) ~ N(0, t − s) for s < t; independent increments.
  • Sample paths are continuous everywhere, differentiable nowhere.
  • Foundation of Itô calculus, stochastic differential equations, Black-Scholes pricing.

13. Sampling and Monte Carlo

When you can’t compute an integral or expectation in closed form, simulate. Monte Carlo methods replace integration with sampling, leveraging LLN to converge to the true value at rate O(1/√n) (the price of dimensional freedom — the rate doesn’t depend on dimension, unlike numerical quadrature).

Inverse-CDF sampling — given U ~ Uniform(0, 1), let X = F⁻¹(U). Then X has CDF F. Works whenever the quantile function is computable. Standard for: exponential (X = −log(U)/λ), Cauchy, geometric, log-normal.

Box-Muller transform (1958) — generates two independent N(0, 1) samples from two independent U(0, 1): Z_1 = √(−2 ln U_1) · cos(2π U_2) Z_2 = √(−2 ln U_1) · sin(2π U_2)

Used internally by many normal-RNG implementations (or the related Marsaglia polar method).

Rejection sampling — to sample from f, find a proposal density g and constant M such that f(x) ≤ M · g(x) everywhere. Sample X ~ g, U ~ Uniform(0, 1); accept X if U ≤ f(X) / (M · g(X)), else reject. Acceptance rate is 1/M. Bad in high dimensions (M grows exponentially).

Importance sampling — instead of sampling from f directly, sample from an easier g and reweight: E_f[h(X)] = E_g[h(X) · f(X) / g(X)]

Estimator: (1/n) Σ h(X_i) · f(X_i)/g(X_i) where X_i ~ g. Variance depends critically on the choice of g — bad choices have infinite variance.

Markov Chain Monte Carlo (MCMC) — construct a Markov chain whose stationary distribution is the target distribution; run long enough and the samples converge.

  • Metropolis-Hastings (Metropolis et al. 1953; Hastings 1970) — propose X’ from q(X’ | X), accept with probability α = min(1, [π(X’) q(X | X’)] / [π(X) q(X’ | X)]). Stationary distribution is π. Works for any unnormalized π — only ratios are needed, so the partition function cancels.
  • Gibbs sampling (Geman + Geman 1984) — special case of MH where you cycle through components, sampling each from its full conditional. Especially powerful when full conditionals are tractable (conjugate models).
  • Hamiltonian Monte Carlo (HMC; Duane et al. 1987; Neal 2010 for ML audience) — uses gradient information and a physics analogy (Hamiltonian dynamics) to propose distant moves with high acceptance. Backbone of modern Bayesian software (Stan, PyMC, NumPyro).

Bootstrap (Efron 1979) — resample with replacement from the observed sample to approximate the sampling distribution of an estimator. Powerful nonparametric tool for standard errors and confidence intervals when analytic forms are unknown.

Other Monte Carlo techniques — quasi-Monte Carlo (low-discrepancy sequences, faster convergence in low dimensions), sequential Monte Carlo / particle filters (for dynamic models), variational inference (approximate posterior with a tractable family, optimization-based, faster but biased).

14. Probabilistic programming and tools

Practical libraries for working with probability:

  • NumPy + SciPynumpy.random for sampling, scipy.stats for 90+ named distributions with consistent .pdf(), .cdf(), .ppf(), .rvs(), .fit() methods. The catalog includes everything in this note plus exotica (gennorm, levy_stable, geninvgauss, …).
  • statsmodels — frequentist statistics, regression, time series.
  • PyMC — Python probabilistic programming, NUTS (HMC variant) sampler, Bayesian modeling.
  • Stan — domain-specific language for Bayesian models, interface available in R (rstan), Python (pystan, cmdstanpy), Julia, etc. HMC + NUTS sampler.
  • TensorFlow Probability / Edward — distributions and probabilistic layers integrated with TensorFlow.
  • Pyro / NumPyro — Uber’s PyTorch / JAX-based probabilistic programming, focuses on stochastic variational inference.
  • Turing.jl — Julia-based PPL with multiple inference backends.
  • R stats — base R has comprehensive distribution support; dnorm, pnorm, qnorm, rnorm (and the same suffix pattern for ~30 distributions).
  • JAGS / BUGS — older Gibbs-based BUGS-family samplers.

Distribution catalog (SciPy)scipy.stats includes ~90 continuous and ~10 discrete distributions; spot-checks like:

from scipy import stats
stats.norm.cdf(1.96)           # ~0.975
stats.binom.pmf(5, n=10, p=0.5) # ~0.246
stats.expon.rvs(scale=2, size=1000)

15. Common pitfalls

A short rogues’ gallery of conceptual errors that cost real money in real systems.

Conflating independence and uncorrelation. Independence ⇒ Cov = 0; the converse is false in general. Classic counterexample: let X ~ N(0, 1) and Y = X². Then Cov(X, Y) = E[X³] − E[X]·E[X²] = 0 − 0 = 0, but Y is a deterministic function of X — wildly dependent. The implication only runs both ways for jointly Gaussian variables.

Base-rate neglect. In Bayesian reasoning, people anchor on the conditional likelihood (P(test+ | disease)) and ignore the prior (P(disease)). See §3 for the medical-testing example.

Selection bias and survivorship bias. If your sample is conditioned on an outcome (success, survival, visibility), inferences about the population are biased. Backtested trading strategies that survived to be tested are unrepresentative; WWII engineers who studied returning bombers’ damage patterns (Abraham Wald) realized the unreturned planes were the relevant signal.

Simpson’s paradox (Yule 1903 / Simpson 1951). A trend in aggregated data can reverse in subgroups (or vice versa). Famous case: UC Berkeley admissions in 1973 — appeared to discriminate against women in aggregate, but department-by-department actually favored women slightly; women applied disproportionately to more selective departments. Always check stratification.

P-hacking and multiple comparisons. Running 20 tests at α = 0.05 gives ~64% chance of at least one false positive. Correct with Bonferroni (conservative), Holm-Bonferroni (better), or false-discovery-rate control (Benjamini-Hochberg 1995, less conservative, better power).

Confusing P(A | B) and P(B | A) (prosecutor’s fallacy). “The probability of this DNA match by chance is 1 in a million” (P(match | innocent)) is not “the probability he’s innocent is 1 in a million” (P(innocent | match)). Apply Bayes.

Treating uncorrelated variables as uninformative. Nonlinear dependencies (mutual information) can be large even when Pearson correlation is zero (e.g., Y = X² with X ~ N(0,1)).

Ignoring heavy tails. Many real systems (financial returns, insurance claims, network traffic) have tails far heavier than normal. Variance-based reasoning underestimates tail risk; sample means converge slowly; CLT-based confidence intervals can be misleading.

Mistaking absence of evidence for evidence of absence. Sample size matters; a non-significant result with low power says little.

Stationary-data assumption. Most theory assumes iid or stationary data. Real-world data drifts (concept drift, covariate shift) — models trained on yesterday’s distribution silently degrade.

16. Engineering use cases

Reliability engineering. Weibull(k, λ) is the workhorse for time-to-failure: k < 1 captures infant mortality, k = 1 collapses to constant-hazard exponential, k > 1 captures wear-out. The bathtub curve (composite hazard rate over a product’s lifetime) is the superposition of these three regimes. MTBF = ∫₀^∞ t · f(t) dt = E[T]. See reliability-engineering.

Queueing theory. M/M/1: Poisson arrivals at rate λ, exponential service at rate µ, single server. Utilization ρ = λ/µ. Average number in system L = ρ/(1 − ρ) — blows up as ρ → 1 (Erlang 1909, foundational for telecom). M/M/c: c parallel servers. M/G/1: general service distribution (Pollaczek-Khinchine formula).

Signal processing. Additive white Gaussian noise (AWGN) is the default channel model; least-squares estimators are optimal under Gaussian errors (Gauss-Markov theorem). Kalman filter (1960) optimally fuses prior + Gaussian likelihood into Gaussian posterior — recursively, in linear-Gaussian systems. See bayesian-estimation.

Communications — Shannon-Hartley theorem (Shannon 1948). Channel capacity C = B · log_2(1 + SNR) bits/second for a band-limited Gaussian noise channel of bandwidth B and signal-to-noise ratio SNR. Sets the fundamental limit for reliable data rate; everything in modern wireless (5G, Wi-Fi, satellite) operates relative to this bound.

Risk and finance. Value at Risk (VaR) at level α is the α-quantile of the loss distribution — “with probability 1 − α, losses don’t exceed VaR.” Criticized because it ignores the magnitude of losses beyond the quantile; Expected Shortfall (ES) = E[loss | loss > VaR] is coherent (subadditive). Both depend critically on accurate tail modeling — Gaussian-VaR badly underestimates risk in markets with heavy tails (the 2008 crisis lesson).

Machine learning.

  • Maximum likelihood estimation (MLE) — θ̂_MLE = argmax_θ Π_i p(x_i | θ). The classical training objective; equivalent to minimizing negative log-likelihood (NLL).
  • Cross-entropy loss — H(p, q) = −Σ p(x) log q(x); classification loss in neural networks. See transformer-architecture.
  • Maximum a posteriori (MAP) — adds a log-prior: θ̂_MAP = argmax_θ [log p(D | θ) + log p(θ)]. Regularization (L2, L1) corresponds to Gaussian and Laplace priors.
  • Full Bayesian — keep the posterior P(θ | D), integrate over it for predictions: p(y*| x*, D) = ∫ p(y*| x*, θ) · p(θ | D) dθ. Captures uncertainty in θ.
  • Generalization bounds — concentration inequalities (Hoeffding, McDiarmid) bound the gap between empirical risk and true risk; foundation of PAC learning.

Robotics — Kalman filter. State estimation in linear-Gaussian dynamical systems. Prior is Gaussian (initial belief); motion model adds Gaussian noise (still Gaussian); measurement is Gaussian-corrupted (likelihood is Gaussian); Bayes’ update yields a Gaussian posterior — closed-form. Extended (EKF) and unscented (UKF) variants handle nonlinearity; particle filters drop the Gaussian assumption entirely.

Quality control. Six Sigma targets a defect rate of 3.4 per million opportunities — corresponds to a Φ(−4.5)-tail event in a normal-distribution model. Statistical process control charts (Shewhart 1924) use ±3σ bounds; any out-of-bound observation is investigated.

A/B testing. Two-proportion z-test or Bayesian Beta-Binomial. Sample-size calculations use Var(p̂) = p(1 − p)/n and target the desired margin of error.

Cryptography. Random oracle model treats hash functions as truly random; security proofs reduce attacks to events with negligible probability (asymptotically, faster than any inverse polynomial).

Epidemiology. Compartmental models (SIR, SEIR) are stochastic processes; reproduction number R_0 = E[secondary infections per primary]; outbreak size has heavy-tailed distribution under branching-process approximation.

17. Cross-references

18. Citations and further reading

  • Andrey Kolmogorov, 1933. Grundbegriffe der Wahrscheinlichkeitsrechnung (Foundations of the Theory of Probability). The measure-theoretic axiomatization that turned probability from collection-of-tricks into a branch of mathematics.
  • Sheldon Ross, 2018. A First Course in Probability, 10th edition. The standard undergraduate text; thorough on discrete + continuous distributions, joint distributions, limit theorems.
  • Christopher Bishop, 2006. Pattern Recognition and Machine Learning, ch. 2 (“Probability Distributions”). The ML-oriented treatment — exponential family, conjugate priors, Bayesian linear models.
  • Larry Wasserman, 2004. All of Statistics: A Concise Course in Statistical Inference. Compact, modern, covers both frequentist and Bayesian foundations; excellent for a quick deep dive.
  • Wassily Hoeffding, 1963. “Probability inequalities for sums of bounded random variables.” Journal of the American Statistical Association, 58(301): 13-30. The inequality bearing his name.
  • Bradley Efron, 1979. “Bootstrap methods: another look at the jackknife.” Annals of Statistics, 7(1): 1-26. Founded the bootstrap.
  • N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, E. Teller, 1953. “Equation of state calculations by fast computing machines.” Journal of Chemical Physics, 21(6): 1087-1092. Original Metropolis algorithm.
  • W. K. Hastings, 1970. “Monte Carlo sampling methods using Markov chains and their applications.” Biometrika, 57(1): 97-109. Generalized Metropolis to Metropolis-Hastings.
  • Claude Shannon, 1948. “A mathematical theory of communication.” Bell System Technical Journal, 27: 379-423, 623-656. Founded information theory; Shannon-Hartley capacity theorem.
  • Rudolf Kalman, 1960. “A new approach to linear filtering and prediction problems.” Journal of Basic Engineering, 82(1): 35-45. The Kalman filter.
  • Geman + Geman, 1984. “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images.” IEEE PAMI, 6(6): 721-741. Introduced Gibbs sampling.
  • Joseph Blitzstein + Jessica Hwang, 2019. Introduction to Probability, 2nd edition. Friendly, intuition-first; pairs well with Ross.
  • Persi Diaconis + Brad Efron, 1983. “Computer-intensive methods in statistics.” Scientific American, 248(5): 116-130. Pop-sci on bootstrap + Monte Carlo.
  • Edwin Jaynes, 2003. Probability Theory: The Logic of Science. The Bayesian-as-extended-logic manifesto; opinionated, deep.
  • William Feller, 1968 / 1971. An Introduction to Probability Theory and Its Applications, Vols. I and II. The classic two-volume reference; combinatorial proofs throughout.