Probability Frameworks — Cross-Cutting Comparison

This note compares the philosophical and operational frameworks for reasoning under uncertainty — frequentist, Bayesian (subjective + objective), likelihoodist, information-theoretic, causal, measure-theoretic, max-entropy — across every probability/statistics note in the Math library. Read the dimension tables first; the closing decision tree picks a framework by data size, prior availability, decision need, and reporting requirement.

1. The eight frameworks

foundational          inferential                    decision / pragmatic
  |                       |                                |
measure-theoretic       frequentist                     Bayesian decision theory
(Kolmogorov 1933)       (Neyman-Pearson)                (Lindley, Berger)
                        (Fisher likelihood + MLE)
Cox theorem             Bayesian (subjective)           causal (Pearl do-calc, Rubin PO)
(Cox 1946 →             (Jeffreys, de Finetti, Savage)
Jaynes 1957 max-ent)
                        likelihoodist
                        (Edwards 1972, Royall 1997)
                        information-theoretic
                        (AIC, BIC, WAIC, LOO, DIC)

2. The frameworks defined

Frequentist

  • Probability = long-run frequency of an event under repeated trials.
  • Parameters = fixed, unknown constants.
  • Inference = construct estimators (MLE, MoM) and pivot statistics; report confidence intervals and p-values.
  • Founders: Fisher (likelihood, MLE, sufficiency, ANOVA), Neyman + Pearson (hypothesis testing 1933), Neyman (confidence intervals 1937), Wald (decision theory 1939).
  • Strengths: well-developed asymptotic theory, doesn’t require priors, dominant in regulatory science (FDA, EMA), reproducible — anyone with the data + model gets the same p-value.
  • Weaknesses: p-values misinterpreted at scale (ASA 2016 statement on p-values), confidence intervals are not the probability statements about θ they look like, sequential analysis needs explicit corrections (alpha-spending), probability statements average over hypothetical repeated samples rather than conditioning on the data actually observed.
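
The long-run reading of a confidence interval can be checked by simulation. The sketch below is illustrative only: it assumes normal data with known σ, and the function name `ci_coverage` and all parameter values are made up for this example. The 95% describes the procedure over repeated experiments, not any single computed interval.

```python
import math
import random

def ci_coverage(true_mean=0.0, sigma=1.0, n=25, n_experiments=4000, seed=0):
    """Fraction of 95% z-intervals (known sigma) that contain the fixed true mean."""
    rng = random.Random(seed)
    z = 1.959964  # 97.5% quantile of the standard normal
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(n_experiments):
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        # The parameter is fixed; the interval is what varies between experiments.
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / n_experiments

coverage = ci_coverage()
print(f"empirical coverage over repeated experiments: {coverage:.3f}")
```

The printed fraction should sit near 0.95, which is a statement about the interval-generating procedure rather than about θ given one dataset.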

Bayesian (subjective)

  • Probability = degree of belief; subjective.
  • Parameters = random variables with prior distributions.
  • Inference = posterior ∝ likelihood × prior; report credible intervals, posterior summaries.
  • Founders: Bayes 1763 (posthumous), Laplace 1812, Jeffreys 1939, de Finetti 1937 (exchangeability), Savage 1954 (foundations of statistics, axioms of rational choice).
  • Strengths: handles small data + prior knowledge naturally, gives full posterior distributions (not point estimates), nests sequential analysis trivially (today’s posterior = tomorrow’s prior), coherent under Dutch-book / de Finetti.
  • Weaknesses: prior choice can be controversial, computation often expensive (MCMC), sensitivity to misspecified priors, harder to communicate to non-statisticians.
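
The "today's posterior = tomorrow's prior" property is easiest to see with a conjugate pair. The following is a minimal sketch assuming a Beta prior on a Bernoulli success rate; the prior Beta(2, 2) and the two data batches are invented for illustration.

```python
from fractions import Fraction

def update_beta(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior + binomial data -> Beta posterior."""
    return alpha + successes, beta + failures

prior = (2, 2)      # hypothetical weakly informative prior
batch_1 = (7, 3)    # (successes, failures) observed today
batch_2 = (4, 6)    # (successes, failures) observed tomorrow

# Sequential: the posterior after batch 1 becomes the prior for batch 2.
after_1 = update_beta(*prior, *batch_1)
sequential = update_beta(*after_1, *batch_2)

# Pooled: analyse all the data at once.
pooled = update_beta(*prior, batch_1[0] + batch_2[0], batch_1[1] + batch_2[1])

posterior_mean = Fraction(sequential[0], sequential[0] + sequential[1])
print(sequential, pooled, float(posterior_mean))
```

Both routes give the same Beta posterior, which is why no alpha-spending style correction appears in the Bayesian update itself.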

Bayesian (objective / reference)

  • Prior = chosen by rule, not subjective belief. Jeffreys prior (1946, ∝ √det I(θ)), reference prior (Bernardo 1979 / Berger-Bernardo 1992), maximum-entropy prior (Jaynes 1957).
  • Aims to be “non-informative” or “least-informative” subject to constraints.
  • Often improper (∫ prior = ∞); only valid if posterior is proper.
  • Used when subjective prior is unavailable or controversial (regulatory science).
  • Empirical Bayes (Robbins 1956; Efron 2010): estimate the prior's hyperparameters from the data themselves; pragmatic, but it uses the data twice and so breaks strict Bayesian coherence.
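
As a worked instance of the Jeffreys rule, here is the one-parameter Bernoulli case, where √det I(θ) reduces to √I(θ). This is the standard textbook derivation, written out as a sketch; in this particular case the resulting prior happens to be proper.

```latex
\begin{aligned}
\log f(x \mid \theta) &= x \log\theta + (1-x)\log(1-\theta), \qquad x \in \{0,1\}, \\
I(\theta) &= -\,\mathbb{E}\!\left[\frac{\partial^2}{\partial\theta^2}\log f(x\mid\theta)\right]
= \frac{\mathbb{E}[x]}{\theta^2} + \frac{\mathbb{E}[1-x]}{(1-\theta)^2}
= \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}, \\
\pi_J(\theta) &\propto \sqrt{I(\theta)} = \theta^{-1/2}(1-\theta)^{-1/2}
\quad\Longrightarrow\quad \theta \sim \mathrm{Beta}\!\left(\tfrac12, \tfrac12\right).
\end{aligned}
```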

Likelihoodist

  • Likelihood (not probability) = degree of support the data provide for one parameter value relative to another.
  • Inference = report likelihood functions / likelihood ratios; no priors, no error rates.
  • Founders: Edwards (Likelihood 1972), Royall (Statistical Evidence 1997), Hacking (Logic of Statistical Inference 1965).
  • Strengths: free of prior choice, free of Type-I/Type-II error framework, clean philosophical position.
  • Weaknesses: doesn’t give a probability of hypothesis, niche adoption — mostly philosophy of statistics.
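
A likelihood-ratio report can be sketched in a few lines. The example assumes binomial data and two point hypotheses chosen purely for illustration (θ = 0.7 versus θ = 0.5, with 14 successes in 20 trials); the function names are hypothetical.

```python
import math

def binomial_log_likelihood(theta, successes, trials):
    """Log-likelihood up to a constant; the binomial coefficient cancels in any ratio."""
    return successes * math.log(theta) + (trials - successes) * math.log(1.0 - theta)

def likelihood_ratio(theta_a, theta_b, successes, trials):
    """Support the data give to theta_a relative to theta_b."""
    log_ratio = (binomial_log_likelihood(theta_a, successes, trials)
                 - binomial_log_likelihood(theta_b, successes, trials))
    return math.exp(log_ratio)

lr = likelihood_ratio(0.7, 0.5, successes=14, trials=20)
print(f"L(0.7) / L(0.5) = {lr:.2f}")
```

The output is a relative measure of evidence only: no prior enters, and no error rate is attached to it.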

Information-theoretic / predictive

  • Probability = framework-neutral; the question is model selection by predictive accuracy.
  • Inference = compute information criteria: AIC (Akaike 1974), BIC (Schwarz 1978), DIC (Spiegelhalter et al 2002), WAIC (Watanabe 2010), LOO-CV (leave-one-out).
  • Founders: Akaike, Schwarz, Burnham + Anderson (Model Selection and Multimodel Inference 2002), Watanabe.
  • Strengths: model comparison without nested-hypothesis machinery, handles non-nested models, predictive focus (rather than truth-of-hypothesis), AIC ≈ predictive cross-validation, BIC ≈ marginal likelihood (asymptotic).
  • Weaknesses: BIC’s assumption of “true model in candidate set” rarely holds, AIC penalty for complexity is asymptotic.

Bayesian decision theory

  • Probability + utility function → choose action that maximizes expected utility.
  • Inference + action combined.
  • Founders: Wald 1950 (Statistical Decision Functions), Lindley 1972, Berger 1985 (Statistical Decision Theory).
  • Strengths: directly addresses “what should I do”, coherent under Savage axioms, integrates uncertainty + losses, nests all of frequentist’s “tests + decisions”.
  • Weaknesses: need to specify utility function (often more controversial than prior).
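
The "probability + utility → action" step can be sketched as follows. The states, actions, posterior probabilities, and utilities are all invented for illustration; in practice, specifying the utility table is the hard part.

```python
def expected_utilities(posterior, utility):
    """Expected utility of each action under the posterior over states."""
    return {
        action: sum(posterior[state] * u for state, u in by_state.items())
        for action, by_state in utility.items()
    }

# Hypothetical posterior over two states of the world.
posterior = {"effective": 0.7, "ineffective": 0.3}

# Hypothetical utilities: utility[action][state].
utility = {
    "treat":    {"effective": 10.0, "ineffective": -5.0},
    "withhold": {"effective": -2.0, "ineffective": 0.0},
}

scores = expected_utilities(posterior, utility)
best_action = max(scores, key=scores.get)
print(scores, best_action)
```

Changing either the posterior or the utility table can flip the chosen action, which is why both inputs need to be reported alongside the decision.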

Causal inference (Pearl / Rubin)

  • Probability of an intervention ≠ probability of an observation. Layer separation: observation, intervention, counterfactual.
  • Two main schools:
    • Pearl’s structural causal models (SCM) — DAGs + do-calculus + counterfactuals (Pearl 1995, Causality 2000, ACM Turing 2011).
    • Rubin’s potential outcomes (Neyman-Rubin) — Y(0), Y(1) potential outcomes; SUTVA; propensity-score methods (Rubin 1974, 1978).
  • The two schools are mostly equivalent (Pearl 2009 has the mapping).
  • Founders: Sewall Wright (path analysis 1921), Neyman (1923 thesis), Rubin 1974, Pearl 1988+, Spirtes-Glymour-Scheines 1993 (PC algorithm).
  • Strengths: only framework that handles “what if?” without RCT data, integrates with ML (double-ML, causal forests, X-learner), supports identifiability theorems.
  • Weaknesses: requires assumptions about confounding (ignorability, exchangeability), often not testable from data alone, requires graphical model expertise.

Measure-theoretic (Kolmogorov axioms)

  • Probability = measure on a σ-algebra. Foundational, not inferential.
  • Founders: Borel, Lebesgue, Kolmogorov (Grundbegriffe 1933).
  • Used as the substrate for every framework above. Required for stochastic-process work (Brownian motion, Markov chains).
  • Does not by itself prescribe inference — that comes from the framework layered on top.

Cox/Jaynes (subjective derivation)

  • Cox 1946 — derives Bayesian probability from desiderata of rational belief (consistency, completeness).
  • Jaynes — extends with maximum entropy principle: among priors consistent with constraints, pick the one with maximum entropy (“least committed”).
  • Supplies the philosophical argument that Bayesian probability is the unique consistent calculus of plausible reasoning, not one choice among many.
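
A short worked example of the maximum-entropy principle, assuming a density on [0, ∞) whose only constraint is a known mean μ (the constraint set is chosen here for illustration). The Lagrangian calculation recovers the exponential distribution.

```latex
\begin{aligned}
&\max_{p}\; H[p] = -\int_0^\infty p(x)\,\log p(x)\,dx
\quad \text{s.t.} \quad \int_0^\infty p(x)\,dx = 1,\;\; \int_0^\infty x\,p(x)\,dx = \mu, \\
&\mathcal{L}[p] = -\int_0^\infty p \log p \,dx
 - \lambda_0\Big(\int_0^\infty p\,dx - 1\Big)
 - \lambda_1\Big(\int_0^\infty x\,p\,dx - \mu\Big), \\
&\frac{\delta \mathcal{L}}{\delta p(x)} = -\log p(x) - 1 - \lambda_0 - \lambda_1 x = 0
\;\;\Longrightarrow\;\; p(x) = e^{-1-\lambda_0}\, e^{-\lambda_1 x}, \\
&\text{constraints} \;\Longrightarrow\; \lambda_1 = \tfrac{1}{\mu},\;\; e^{-1-\lambda_0} = \tfrac{1}{\mu}
\;\;\Longrightarrow\;\; p(x) = \tfrac{1}{\mu}\, e^{-x/\mu}.
\end{aligned}
```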

3. Frequentist vs Bayesian — the deciding axes

| Axis | Frequentist | Bayesian |
|---|---|---|
| What’s random? | data (under fixed θ) | θ (given fixed data) |
| Prior | none (or improper as null hypothesis) | required |
| Output | point estimate + confidence interval + p-value | full posterior distribution |
| Sequential analysis | requires alpha-spending (O’Brien-Fleming, Pocock) | trivial: posterior updates |
| Small-data behavior | breaks (n=1 → no SE) | works (prior provides regularization) |
| Big-data behavior | works | works (posterior concentrates) |
| Computation | closed-form often available | MCMC / VI / Laplace often needed |
| Communication | “p < 0.05”, familiar to regulators | “P(θ > 0 … |
| Hypothesis testing | Neyman-Pearson lemma + UMP tests | Bayes factor (Kass-Raftery 1995), posterior odds |
| Multi-comparisons | Bonferroni, FDR (Benjamini-Hochberg 1995), Holm | hierarchical shrinkage, posterior pooling |
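
The frequentist multi-comparison row can be made concrete with a sketch of the Benjamini-Hochberg step-up procedure next to Bonferroni. The p-values below are invented for illustration, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:  # reject every hypothesis up to that rank
        rejected[idx] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Bonferroni: reject only p-values at or below alpha / m (controls FWER)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
bh = benjamini_hochberg(p_values)
bonf = bonferroni(p_values)
print("BH rejections:", sum(bh), "Bonferroni rejections:", sum(bonf))
```

Bonferroni controls the family-wise error rate and is therefore stricter; Benjamini-Hochberg controls the expected share of false discoveries, so it typically rejects at least as many hypotheses.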

4. Map every Math note to a framework

| Math note | Primary framework | Notes |
|---|---|---|
| probability-fundamentals | measure-theoretic substrate | both views presented |
| probability-distributions | measure-theoretic | distribution-by-distribution |
| hypothesis-testing-mle | frequentist | Neyman-Pearson, MLE, Wald/LRT/score tests |
| bayesian-inference | Bayesian (subjective + objective) | priors, posteriors, posterior predictive, hierarchical models |
| causal-inference | causal (Pearl + Rubin) | do-calculus, PSM, IV, double-ML |
| mcmc-sampling | Bayesian | Metropolis-Hastings, Gibbs, HMC, NUTS, parallel tempering |
| variational-inference | Bayesian | ELBO + amortized inference (VAE, BBVI) |
| measure-theory-and-integration | measure-theoretic | foundational |
| information-theory | info-theoretic | KL divergence, mutual information, AIC/BIC underpinning |
| markov-chains-and-hmm | both: frequentist EM-on-HMM, Bayesian HMM with priors | discrete-time, ergodicity, Baum-Welch |
| time-series-and-hmm | both: Box-Jenkins frequentist, Bayesian state-space | ARIMA, Kalman, particle filter |
| copulas-and-dependence | both: Sklar’s theorem is measure-theoretic, fitting is frequentist or Bayesian | Gaussian/Student-t/Archimedean |
| gaussian-processes | Bayesian (non-parametric) | infinite-dim prior over functions |
| stochastic-calculus | measure-theoretic + frequentist | Brownian, Itô, martingales, change of measure |
| probability-distribution-zoo | measure-theoretic catalog | for reference |
| sampling-algorithms-catalog | both: frequentist MC, Bayesian MCMC | sampling primitives |
| statistical-distributions-catalog-extended | catalog | distribution properties |

5. p-values, the ASA 2016 statement, and the reproducibility crisis

The American Statistical Association in 2016 (Wasserstein-Lazar; expanded 2019 Wasserstein-Schirm-Lazar) issued formal warnings about p-value misuse. Key points:

  1. A p-value is not the probability the null is true.
  2. A non-significant p does not mean no effect — power matters.
  3. A “statistically significant” finding is not necessarily large or important.
  4. p < 0.05 should not be a bright line — Benjamin et al 2018 proposed p < 0.005.
  5. The 2019 follow-up urged abandoning “statistical significance” as a dichotomy entirely.

The reproducibility crisis (Ioannidis 2005 “Why Most Published Research Findings Are False”; Open Science Collaboration 2015 reproducibility project on psychology — only 36% of 100 studies replicated) catalyzed:

  • Preregistration — commit to analysis plan before data collection (OSF, AsPredicted).
  • Registered reports — peer review of design before data collection.
  • Multiverse analysis (Steegen et al 2016) — present results across all reasonable analytic choices.
  • Specification curve / sensitivity analyses (Simonsohn-Simmons-Nelson 2020).
  • Open data + open code — Open Science Framework, code repositories.
  • TOP guidelines — Transparency and Openness Promotion (Nosek et al 2015).

6. Bayesian computation — the practical stack

| Method | When | Library |
|---|---|---|
| Conjugate analysis | small prior-likelihood pairs | textbook |
| Grid approximation | low-dim (≤ 3 params) | by hand |
| Laplace approximation | unimodal posterior | INLA (Rue-Martino-Chopin 2009) |
| Variational inference (VI / ELBO) | large dataset, tractable family | Stan (ADVI), PyMC, NumPyro, Pyro, TensorFlow Probability |
| Black-box VI (BBVI) | flexible posterior | NumPyro, Pyro, BlackJAX |
| Normalizing flows for VI | non-Gaussian posterior | Pyro, BlackJAX |
| MCMC: Metropolis-Hastings | low-dim, any posterior | rare in practice now |
| MCMC: Gibbs sampling | conditionally conjugate | JAGS, BUGS |
| MCMC: HMC (Hamiltonian Monte Carlo, Duane et al 1987; Neal 2010) | continuous, smooth | Stan, PyMC, NumPyro |
| MCMC: NUTS (Hoffman-Gelman 2014) | continuous, smooth, no manual tuning | Stan, PyMC, NumPyro (default) |
| MCMC: parallel tempering | multimodal | BlackJAX |
| Sequential Monte Carlo (particle filter / SMC sampler) | sequential / dynamic / multimodal | particles, BlackJAX, ParticleSMC.jl |
| Approximate Bayesian Computation (ABC) | likelihood intractable | ABCpy, ELFI |
| Simulation-based inference (SBI, Cranmer-Brehmer-Louppe 2020) | likelihood intractable but simulable | sbi (Mackelab Tübingen) |

In 2025 the canonical Bayesian stack is Stan (Carpenter et al 2017, ~400 citations/month) for “I want a probabilistic programming language”, NumPyro / Pyro for “I want JAX-/PyTorch-integrated VI”, PyMC for “I want pythonic Bayesian modeling”, and INLA for spatio-temporal GLMs. BlackJAX for JAX-native MCMC. brms / rstanarm for R users wanting lme4-style syntax with Bayesian backend.
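
To show what the MCMC rows do under the hood, here is a minimal random-walk Metropolis-Hastings sketch in plain Python. It is not how Stan or PyMC sample (they default to NUTS); the target, data (14 successes in 20 trials under a flat prior), step size, and burn-in length are all assumptions picked for illustration.

```python
import math
import random

def log_posterior(theta, successes=14, trials=20):
    """Unnormalised log posterior: binomial likelihood times a flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + (trials - successes) * math.log(1.0 - theta)

def metropolis_hastings(log_target, start, n_samples, step=0.2, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Symmetric proposal, so the acceptance probability is min(1, target ratio).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

draws = metropolis_hastings(log_posterior, start=0.5, n_samples=20000)
kept = draws[2000:]  # discard burn-in
mcmc_mean = sum(kept) / len(kept)
exact_mean = 15 / 22  # mean of the exact Beta(15, 7) posterior
print(f"MCMC mean {mcmc_mean:.3f} vs exact {exact_mean:.3f}")
```

Because this model is conjugate, the exact posterior is available as a check; in real use the point of MCMC is that no such closed form exists.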

7. The information criteria

| Criterion | Formula | When | Penalty for complexity |
|---|---|---|---|
| AIC | -2 log L̂ + 2k | model selection, predictive focus | 2 per parameter |
| BIC | -2 log L̂ + k log n | asymptotic marginal likelihood | log n per parameter (heavier than AIC for n > 7) |
| DIC | -2 log L(θ̄) + 2 p_D | Bayesian, deviance + complexity | effective number of parameters p_D |
| WAIC (Watanabe-Akaike) | -2 (lppd - p_WAIC) | Bayesian, fully Bayesian | p_WAIC measured from posterior |
| PSIS-LOO | -2 lppd_LOO | Bayesian, leave-one-out approximation | implicit: each point is scored out of sample, so there is no explicit penalty term |
| BPIC (Ando 2007) | variant of DIC | n/a | n/a |
| Hannan-Quinn | -2 log L̂ + 2k log log n | between AIC + BIC | 2 log log n per parameter |

Vehtari-Gelman-Gabry 2017 recommend PSIS-LOO as the default for Bayesian model comparison, with WAIC (reviewed in Gelman-Hwang-Vehtari 2014) as the second choice; AIC remains the usual choice for ML approaches where the MLE is the norm.
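
The closed-form criteria are simple enough to compute by hand. The sketch below implements AIC, BIC, Hannan-Quinn, and Akaike weights from a maximized log-likelihood; the two models and their log-likelihood values are hypothetical numbers, not results from any dataset.

```python
import math

def aic(log_lik, k):
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    return -2.0 * log_lik + k * math.log(n)

def hannan_quinn(log_lik, k, n):
    return -2.0 * log_lik + 2.0 * k * math.log(math.log(n))

def akaike_weights(aic_values):
    """Relative support for each model, normalised to sum to 1."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(relative)
    return [r / total for r in relative]

n = 100
# Hypothetical fits: (maximised log-likelihood, number of parameters).
models = {"A": (-120.0, 3), "B": (-118.5, 5)}

aics = {name: aic(ll, k) for name, (ll, k) in models.items()}
bics = {name: bic(ll, k, n) for name, (ll, k) in models.items()}
weights = akaike_weights([aics["A"], aics["B"]])
print(aics, bics, weights)
```

Model B fits slightly better but spends two extra parameters; both criteria prefer A here, and BIC does so by a wider margin because log n exceeds 2 at n = 100.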

8. Causal frameworks — Pearl vs Rubin in practice

| Pearl SCM | Rubin Potential Outcomes |
|---|---|
| DAG over variables | Y(0), Y(1) for each unit |
| do(X = x) operator | “treatment received w” |
| Causal effect P(Y \| do(X)) | E[Y(1) - Y(0)] |
| Identifiability via do-calculus rules | Identifiability via ignorability (Y(0), Y(1) ⊥ T \| X) |
| Counterfactuals as third tier | Implicit in Y(t) notation |
| Front-door, back-door criteria | Propensity score, IV, IPTW |
| dagitty (Textor et al), software | MatchIt, ipw, twang, software |

Both schools handle the same problems; Pearl’s DAG-language is better for causal discovery and explanatory mechanism work; Rubin’s potential-outcomes language is better for experimental design and estimation.
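
The gap between observing and intervening can be shown with a small simulation. The sketch assumes a toy data-generating process with one binary confounder Z that satisfies the back-door criterion; every probability below is invented, and the true effect of X on Y is 0.2 by construction.

```python
import random

def simulate(n=50_000, seed=0):
    """Toy SCM: Z -> X, Z -> Y, X -> Y, all binary. True effect of X on Y is 0.2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5
        x = rng.random() < (0.8 if z else 0.2)
        y = rng.random() < 0.1 + 0.2 * x + 0.5 * z
        rows.append((int(z), int(x), int(y)))
    return rows

def mean_y(rows, x, z=None):
    """Observed mean of Y among units with X = x (and Z = z if given)."""
    selected = [y for (zi, xi, y) in rows if xi == x and (z is None or zi == z)]
    return sum(selected) / len(selected)

def naive_difference(rows):
    """Observational contrast E[Y | X=1] - E[Y | X=0], confounded by Z."""
    return mean_y(rows, 1) - mean_y(rows, 0)

def backdoor_ate(rows):
    """Back-door adjustment: sum over z of P(z) * (E[Y | X=1, z] - E[Y | X=0, z])."""
    n = len(rows)
    ate = 0.0
    for z in (0, 1):
        p_z = sum(1 for (zi, _, _) in rows if zi == z) / n
        ate += p_z * (mean_y(rows, 1, z) - mean_y(rows, 0, z))
    return ate

rows = simulate()
naive = naive_difference(rows)
adjusted = backdoor_ate(rows)
print(f"naive: {naive:.3f}, back-door adjusted: {adjusted:.3f}")
```

The naive contrast overstates the effect because treated units are more likely to have Z = 1; the adjusted estimate matches the Rubin-style stratified estimator under ignorability given Z, which is the sense in which the two languages agree.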

Modern ML-causal: Double Machine Learning (Chernozhukov et al 2018), Causal Forests (Wager-Athey 2018), X-Learner (Künzel et al 2019), CausalML library (Uber), EconML library (Microsoft Research), DoWhy (Microsoft, Sharma-Kiciman 2020). These integrate causal identification with off-the-shelf ML.

9. The Wasserstein-Schirm-Lazar 2019 framing

The ASA’s 2019 follow-up identified the “p < 0.05” problem as a bright-line fallacy. They recommended:

  1. Stop dichotomizing — present effect sizes + uncertainty intervals.
  2. “Statistical significance” is not the same as “scientific significance”.
  3. Embrace uncertainty — multiple analyses, sensitivity analyses, robustness checks.
  4. Use Bayesian, info-theoretic, or other frameworks as appropriate to the question.
  5. Move toward “Accept Uncertainty, be Thoughtful, Open, and Modest” (ATOM principles).

10. Reporting frameworks

| Framework | Standard report |
|---|---|
| Frequentist | Point estimate ± SE; 95% CI; p-value; sample size; effect size (Cohen’s d, OR, RR); power |
| Bayesian | Posterior mean / median / mode; 95% credible interval (CI); posterior probability of direction; Bayes factor |
| Information-theoretic | ΔAIC or ΔBIC across models; Akaike weights; LOO-CV |
| Causal (Rubin) | ATE / ATT estimate; SE under sample-size assumptions; sensitivity to unmeasured confounding (Rosenbaum bounds, E-values, VanderWeele-Ding 2017) |
| Causal (Pearl) | DAG; do-calculus derivation; identification result; estimator; standard error |

Modern practice in epidemiology (Hernán-Robins 2020 Causal Inference: What If) blends Rubin + Pearl; preregistered DAG + IPTW + double-ML.
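
For the sensitivity-analysis entry, the E-value of VanderWeele-Ding 2017 has a closed form for a risk ratio: RR + √(RR·(RR − 1)), applied to 1/RR when RR < 1. A minimal sketch follows; the function name is hypothetical and only the point-estimate version is shown, not the variant for confidence limits.

```python
import math

def e_value(risk_ratio):
    """E-value for an observed risk ratio (point estimate only)."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1.0 else 1.0 / risk_ratio  # protective effects are inverted
    return rr + math.sqrt(rr * (rr - 1.0))

observed_rr = 3.9  # illustrative value
print(f"E-value for RR = {observed_rr}: {e_value(observed_rr):.2f}")
```

The result is read as the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both treatment and outcome to fully explain away the observed estimate.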

11. Modern (2020–2026) developments

  • Probabilistic programming — Stan, PyMC, NumPyro, Turing.jl, Pyro, Edward2 ubiquitous in research labs.
  • Differentiable / GPU-accelerated MCMC — BlackJAX (JAX), NumPyro (JAX) deliver 10–100× speedups.
  • Simulation-based inference (SBI) — sbi library (Tübingen), Bayesian inference when likelihood is intractable. Cosmology, neuroscience, particle physics.
  • Conformal prediction (Vovk-Gammerman-Shafer 2005; resurgence 2020+; Angelopoulos-Bates 2021 tutorial) — distribution-free prediction sets with finite-sample coverage. Major shift in uncertainty quantification.
  • Conformal inference + Bayesian — combine well.
  • Posterior predictive checks standard in any Bayesian analysis (Gelman et al Bayesian Data Analysis 3rd ed).
  • Causal ML at scale — EconML, CausalML, DoWhy in production at Uber, Microsoft, Meta.
  • Bayesian deep learning: MC dropout (Gal-Ghahramani 2016), variational dropout (Kingma et al 2015), deep ensembles, SWAG (Maddox et al 2019), Laplace approximation for NNs (Daxberger et al 2021).
  • Diffusion models as SDE-based generative inference — Song-Sohl-Dickstein-Kingma-Kumar-Ermon-Poole 2020+; bridges Bayesian inference + generative modeling.
  • Foundation models for inference — TabPFN (Hollmann et al 2023), prior-fitted networks; transformer-based Bayesian inference.
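
Split conformal prediction, mentioned in the list above, fits in a few lines. The sketch assumes exchangeable data and a point predictor fitted on separate data; here `predict` is a hypothetical stand-in (y ≈ 2x), and the skewed noise is chosen to emphasize that no Gaussian assumption is needed.

```python
import math
import random

def conformal_halfwidth(calibration_residuals, alpha=0.1):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest residual."""
    n = len(calibration_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_residuals)[rank - 1]

def predict(x):
    # Hypothetical pre-fitted point predictor (assumed trained on other data).
    return 2.0 * x

def draw(rng):
    """One (x, y) pair with skewed, non-Gaussian noise."""
    x = rng.uniform(0.0, 10.0)
    y = 2.0 * x + rng.expovariate(1.0) - 1.0
    return x, y

rng = random.Random(0)
calibration = [draw(rng) for _ in range(1000)]
residuals = [abs(y - predict(x)) for x, y in calibration]
q = conformal_halfwidth(residuals, alpha=0.1)

# Prediction set for a new x is [predict(x) - q, predict(x) + q].
test_points = [draw(rng) for _ in range(4000)]
coverage = sum(abs(y - predict(x)) <= q for x, y in test_points) / len(test_points)
print(f"half-width {q:.2f}, empirical coverage {coverage:.3f}")
```

The guarantee is marginal coverage of at least 1 − α on average over calibration sets; it says nothing about coverage conditional on a particular x.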

12. Decision tree — pick a framework

What's your question?
├─ "Is the effect zero?" (NHST, regulatory)
│    → Frequentist; report p-value + 95% CI + effect size.
│    → If interim analyses are planned → prespecify an alpha-spending rule.
│    → If multiple comparisons → FDR (BH) or Bonferroni.
├─ "What's my best estimate + uncertainty?"
│    ├─ Have prior info? → Bayesian; report posterior mean + 95% CI.
│    ├─ No prior info, want frequentist? → MLE + SE + CI.
│    └─ Want distribution-free finite-sample coverage? → Conformal prediction.
├─ "Which of several models is best?"
│    ├─ Predictive focus → AIC or LOO-CV.
│    ├─ Truth focus / penalize complexity → BIC.
│    ├─ Bayesian → posterior model probability / Bayes factor / WAIC.
│    └─ Non-nested → cross-validation.
├─ "What's the causal effect of X on Y?"
│    ├─ Have RCT? → Frequentist or Bayesian on Y ~ T.
│    ├─ Have observational data? → Causal inference (PSM, IPTW, IV, RDD, DiD, double-ML).
│    └─ Need to identify? → DAG + do-calculus + sensitivity analysis (E-value).
├─ "What's the probability of an event given a model?"
│    → Direct probability calculation; measure-theoretic.
├─ "What's the maximum-likelihood estimate?"
│    → MLE (Fisher, frequentist) or MAP (Bayesian point estimate).
├─ "I have a sequential / streaming experiment"
│    → Bayesian (natural) or frequentist w/ alpha-spending (O'Brien-Fleming) / sequential probability ratio test.
├─ "I want to forecast"
│    ├─ Time series → ARIMA / state-space / Bayesian structural / Prophet / NeuralForecast.
│    ├─ Probabilistic forecast → Bayesian or conformal.
│    └─ ML forecast → quantile regression / conformal calibration.
├─ "My data is messy / noisy / has outliers"
│    → Robust methods (M-estimators, MCD, S-estimators); Bayesian w/ heavy-tailed likelihood (Student-t).
├─ "I can simulate but not compute likelihood"
│    → Simulation-based inference (sbi library), ABC, or amortized inference.
└─ "I want to discover causal structure from data"
     → Causal discovery (PC, FCI, GES, NOTEARS); see [[Math/causal-inference]].

13. Anti-patterns

  1. “P < 0.05 = effect exists” — see ASA 2016 / 2019 statements. Effect size matters; reproducibility matters.
  2. Using Bayes factors without prior sensitivity analysis — BFs are heavily prior-dependent.
  3. Using flat priors as “non-informative” — flat is not always non-informative; depends on parameterization.
  4. MAP as Bayesian point estimate without considering posterior shape — MAP can be far from posterior mean for skewed posteriors.
  5. Sequential frequentist tests without alpha-spending — alpha inflates beyond control.
  6. Causal claims from observational data without DAG / identification argument — unidentified.
  7. “Significant” with n=1,000,000 — every effect is significant; report effect size.
  8. Confusing Bayesian credible interval with frequentist CI — different probability statements.
  9. Hierarchical Bayes without convergence diagnostics: R̂ < 1.01 and a total effective sample size above 400 (about 100 per chain with four chains) are the usual minimum bars.
  10. Reporting only mean without uncertainty — always include CI / SE / posterior summary.
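
Anti-pattern 5 is easy to demonstrate by simulation. The sketch below runs a two-sided z-test under a true null, once with a single final analysis and once with a test after every batch, stopping at the first rejection. The batch size, number of looks, and trial count are arbitrary choices for illustration.

```python
import math
import random

def false_positive_rate(n_looks, batch_size=20, n_trials=4000,
                        z_crit=1.959964, seed=0):
    """Share of null experiments that reject at any of n_looks uncorrected looks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        total, n = 0.0, 0
        for _ in range(n_looks):
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch_size))
            n += batch_size
            z = total / math.sqrt(n)  # z-statistic: true mean 0, known unit variance
            if abs(z) > z_crit:       # naive 5% threshold reused at every look
                rejections += 1
                break
    return rejections / n_trials

single_look = false_positive_rate(n_looks=1)
five_looks = false_positive_rate(n_looks=5)
print(f"one look: {single_look:.3f}, five uncorrected looks: {five_looks:.3f}")
```

With one look the rate should stay near the nominal 0.05; with five uncorrected looks it rises well above that, which is the inflation that alpha-spending boundaries such as O'Brien-Fleming or Pocock are designed to remove.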

14. The reproducibility crisis frame

The crisis is not a single discipline’s failure but a foundational issue with how inference is taught and used:

  1. Publication bias — significant results published, null results filed away.
  2. HARKing (Hypothesizing After Results are Known) — post-hoc hypotheses dressed up as a priori.
  3. p-hacking — analytic flexibility until p < 0.05.
  4. Low power: Cohen 1962 estimated mean power of about 0.18 to detect small effects (about 0.48 for medium effects); little has changed.
  5. Multiple testing without correction.
  6. Forking paths (Gelman-Loken 2014) — the analyst’s degrees of freedom.

The institutional responses (preregistration, registered reports, multiverse analysis, open data, registered direct replications, p < 0.005 advocacy, abandonment of p-value dichotomy) are all reactions to this. By 2026 most major psych / med / econ journals require preregistration or trial registration.

When to pick what

The fastest narrowing: regulatory / NHST → frequentist with multiplicity correction; small data + prior → Bayesian; prediction focus → information-theoretic / cross-validation; causal question → Pearl/Rubin causal; decision under uncertainty → Bayesian decision theory; distribution-free finite-sample coverage → conformal prediction; likelihood intractable but simulable → SBI; sequential / streaming → Bayesian or sequential frequentist (alpha-spending). The single biggest practical lesson of the 2010s reproducibility crisis is preregister your analysis — commit to your framework, model, and inference procedure before seeing data. Without that, every framework above can be gamed.