Probability Frameworks — Cross-Cutting Comparison
This note compares the philosophical and operational frameworks for reasoning under uncertainty — frequentist, Bayesian (subjective + objective), likelihoodist, information-theoretic, causal, measure-theoretic, max-entropy — across every probability/statistics note in the Math library. Read the dimension tables first; the closing decision tree picks a framework by data size, prior availability, decision need, and reporting requirement.
See also
- probability-fundamentals
- probability-distributions
- hypothesis-testing-mle
- bayesian-inference
- causal-inference
- measure-theory-and-integration
- information-theory
- mcmc-sampling
- variational-inference
- markov-chains-and-hmm
- copulas-and-dependence
- gaussian-processes
- probability-distribution-zoo
- sampling-algorithms-catalog
- statistical-distributions-catalog-extended
1. The eight frameworks
| Foundational | Inferential | Decision / pragmatic |
|---|---|---|
| measure-theoretic (Kolmogorov 1933) | frequentist (Neyman-Pearson; Fisher likelihood + MLE) | Bayesian decision theory (Lindley, Berger) |
| Cox theorem (Cox 1946 → Jaynes 1957 max-ent) | Bayesian (subjective) (Jeffreys, de Finetti, Savage) | causal (Pearl do-calc, Rubin PO) |
| | likelihoodist (Edwards 1972, Royall 1997) | |
| | information-theoretic (AIC, BIC, WAIC, LOO, DIC) | |
2. The frameworks defined
Frequentist
- Probability = long-run frequency of an event under repeated trials.
- Parameters = fixed, unknown constants.
- Inference = construct estimators (MLE, MoM) and pivot statistics; report confidence intervals and p-values.
- Founders: Fisher (likelihood, MLE, sufficiency, ANOVA), Neyman + Pearson (hypothesis testing 1933), Neyman (confidence intervals 1937), Wald (decision theory 1939).
- Strengths: well-developed asymptotic theory, doesn’t require priors, dominant in regulatory science (FDA, EMA), reproducible — anyone with the data + model gets the same p-value.
- Weaknesses: p-values are widely misinterpreted (ASA 2016 statement on p-values), confidence intervals are not the probability statements about θ they appear to be, sequential analysis needs extra machinery (alpha-spending), and guarantees are pre-data averages over repeated samples rather than statements conditional on the observed data.
Bayesian (subjective)
- Probability = degree of belief; subjective.
- Parameters = random variables with prior distributions.
- Inference = posterior ∝ likelihood × prior; report credible intervals, posterior summaries.
- Founders: Bayes 1763 (posthumous), Laplace 1812, Jeffreys 1939, de Finetti 1937 (exchangeability), Savage 1954 (foundations of statistics, axioms of rational choice).
- Strengths: handles small data + prior knowledge naturally, gives full posterior distributions (not point estimates), nests sequential analysis trivially (today’s posterior = tomorrow’s prior), coherent under Dutch-book / de Finetti.
- Weaknesses: prior choice can be controversial, computation often expensive (MCMC), sensitivity to misspecified priors, harder to communicate to non-statisticians.
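The "today's posterior = tomorrow's prior" property can be checked with a conjugate Beta-Binomial model. The sketch below is illustrative only: the Beta(2, 2) prior and the success/failure counts are made-up values, and `update_beta` is a hypothetical helper, not a library function.

```python
from fractions import Fraction

def update_beta(prior, successes, failures):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + s, b + f)."""
    a, b = prior
    return (a + successes, b + failures)

def beta_mean(params):
    """Mean of a Beta(a, b) distribution, kept exact with Fraction."""
    a, b = params
    return Fraction(a, a + b)

prior = (2, 2)  # hypothetical weakly informative prior

# Two batches analysed one after the other ...
after_day1 = update_beta(prior, successes=3, failures=1)
after_day2 = update_beta(after_day1, successes=4, failures=2)

# ... give the same posterior as one analysis of the pooled data.
pooled = update_beta(prior, successes=7, failures=3)

print(after_day2, pooled, float(beta_mean(pooled)))
```

Because the update only adds counts, the order and batching of the data do not matter, which is why no alpha-spending style correction appears in the Bayesian column.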
Bayesian (objective / reference)
- Prior = chosen by rule, not subjective belief. Jeffreys prior (1946, ∝ √det I(θ)), reference prior (Bernardo 1979 / Berger-Bernardo 1992), maximum-entropy prior (Jaynes 1957).
- Aims to be “non-informative” or “least-informative” subject to constraints.
- Often improper (∫ prior = ∞); only valid if posterior is proper.
- Used when subjective prior is unavailable or controversial (regulatory science).
- Empirical Bayes (Robbins 1956; Efron 2010): estimate the prior's hyperparameters from the data; pragmatic, but it uses the data twice and so breaks strict Bayesian coherence.
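As a worked instance of the Jeffreys rule ∝ √det I(θ), here is the one-parameter Bernoulli case. It is a standard textbook derivation, included only to make the rule concrete; the symbol π_J for the Jeffreys prior is notation introduced for this example.

```latex
\begin{align*}
\log p(x \mid \theta) &= x \log\theta + (1 - x)\log(1 - \theta), \qquad x \in \{0, 1\} \\
I(\theta) &= -\mathbb{E}\!\left[\frac{\partial^2}{\partial\theta^2}\log p(x \mid \theta)\right]
  = \frac{\mathbb{E}[x]}{\theta^2} + \frac{1 - \mathbb{E}[x]}{(1 - \theta)^2}
  = \frac{1}{\theta(1 - \theta)} \\
\pi_J(\theta) &\propto \sqrt{I(\theta)} = \theta^{-1/2}(1 - \theta)^{-1/2}
  \quad\Longrightarrow\quad \theta \sim \mathrm{Beta}\!\left(\tfrac12, \tfrac12\right)
\end{align*}
```

In this particular case the Jeffreys prior happens to be proper; for location or scale parameters the same rule gives improper priors, which is where the posterior-propriety check above matters.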
Likelihoodist
- Probability = degree of support that data provides for parameter values.
- Inference = report likelihood functions / likelihood ratios; no priors, no error rates.
- Founders: Edwards (Likelihood 1972), Royall (Statistical Evidence 1997), Hacking (Logic of Statistical Inference 1965).
- Strengths: free of prior choice, free of Type-I/Type-II error framework, clean philosophical position.
- Weaknesses: doesn’t give a probability of hypothesis, niche adoption — mostly philosophy of statistics.
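A likelihoodist report boils down to a likelihood ratio between two parameter values. The following minimal sketch uses a binomial model with invented data (7 successes in 10 trials) and two hypothetical candidate values, θ = 0.7 and θ = 0.5; the function names are illustrative.

```python
from math import comb

def binom_likelihood(theta, successes, n):
    """Binomial likelihood of theta for the observed number of successes."""
    return comb(n, successes) * theta**successes * (1 - theta) ** (n - successes)

def likelihood_ratio(theta_a, theta_b, successes, n):
    """How much better theta_a explains the data than theta_b (no prior, no error rate)."""
    return binom_likelihood(theta_a, successes, n) / binom_likelihood(theta_b, successes, n)

lr = likelihood_ratio(0.7, 0.5, successes=7, n=10)
print(f"LR(0.7 vs 0.5) = {lr:.2f}")
```

The ratio here is a little above 2, which on Royall's commonly quoted benchmarks (8 for fairly strong, 32 for strong evidence) would count as weak support for θ = 0.7 over θ = 0.5. Note that the result is a statement of relative support only, not a probability that either value is true.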
Information-theoretic / predictive
- Probability = framework-neutral; the question is model selection by predictive accuracy.
- Inference = compute information criteria: AIC (Akaike 1974), BIC (Schwarz 1978), DIC (Spiegelhalter et al 2002), WAIC (Watanabe 2010), LOO-CV (leave-one-out).
- Founders: Akaike, Schwarz, Burnham + Anderson (Model Selection and Multimodel Inference 2002), Watanabe.
- Strengths: model comparison without nested-hypothesis machinery, handles non-nested models, predictive focus (rather than truth-of-hypothesis), AIC ≈ predictive cross-validation, BIC ≈ marginal likelihood (asymptotic).
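- Weaknesses: BIC’s assumption of “true model in candidate set” rarely holds; AIC’s 2k penalty is an asymptotic approximation and under-penalizes complexity when n is small relative to k (the motivation for the AICc correction).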
Bayesian decision theory
- Probability + utility function → choose action that maximizes expected utility.
- Inference + action combined.
- Founders: Wald 1950 (Statistical Decision Functions), Lindley 1972, Berger 1985 (Statistical Decision Theory).
- Strengths: directly addresses “what should I do”, coherent under Savage axioms, integrates uncertainty + losses, nests all of frequentist’s “tests + decisions”.
- Weaknesses: need to specify utility function (often more controversial than prior).
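The "probability + utility → action" step can be sketched with a discrete posterior and a loss table (minimizing expected loss is the same as maximizing expected utility with utility = -loss). Every number and label below is hypothetical, chosen only to show the mechanics.

```python
# Posterior over an unknown demand level (hypothetical values that sum to 1).
posterior = {"low": 0.2, "medium": 0.5, "high": 0.3}

# loss[action][state]: cost of taking the action when the state is true (hypothetical).
loss = {
    "stock_little": {"low": 0, "medium": 4, "high": 10},
    "stock_some": {"low": 2, "medium": 1, "high": 5},
    "stock_lots": {"low": 6, "medium": 3, "high": 0},
}

def expected_loss(action):
    """Posterior expected loss of one action."""
    return sum(posterior[state] * loss[action][state] for state in posterior)

# Bayes action: the action with the smallest posterior expected loss.
best_action = min(loss, key=expected_loss)

for action in loss:
    print(action, round(expected_loss(action), 2))
print("Bayes action:", best_action)
```

Changing the loss table can change the chosen action even when the posterior stays fixed, which is the sense in which the utility function is often the more contested input.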
Causal inference (Pearl / Rubin)
- Probability of an intervention ≠ probability of an observation. Layer separation: observation, intervention, counterfactual.
- Two main schools:
- Pearl’s structural causal models (SCM) — DAGs + do-calculus + counterfactuals (Pearl 1995, Causality 2000, ACM Turing 2011).
- Rubin’s potential outcomes (Neyman-Rubin) — Y(0), Y(1) potential outcomes; SUTVA; propensity-score methods (Rubin 1974, 1978).
- The two schools are mostly equivalent (Pearl 2009 has the mapping).
- Founders: Sewall Wright (path analysis 1921), Neyman (1923 thesis), Rubin 1974, Pearl 1988+, Spirtes-Glymour-Scheines 1993 (PC algorithm).
- Strengths: only framework that handles “what if?” without RCT data, integrates with ML (double-ML, causal forests, X-learner), supports identifiability theorems.
- Weaknesses: requires assumptions about confounding (ignorability, exchangeability), often not testable from data alone, requires graphical model expertise.
Measure-theoretic (Kolmogorov axioms)
- Probability = measure on a σ-algebra. Foundational, not inferential.
- Founders: Borel, Lebesgue, Kolmogorov (Grundbegriffe 1933).
- Used as the substrate for every framework above. Required for stochastic-process work (Brownian motion, Markov chains).
- Does not by itself prescribe inference — that comes from the framework layered on top.
Cox/Jaynes (subjective derivation)
- Cox 1946 — derives Bayesian probability from desiderata of rational belief (consistency, completeness).
- Jaynes — extends with maximum entropy principle: among priors consistent with constraints, pick the one with maximum entropy (“least committed”).
- The philosophical basis for subjective Bayesian probability that argues it’s the unique consistent framework, not a choice.
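To make the maximum-entropy principle concrete, here is a sketch of the classic die example: among all distributions on the faces 1 to 6 with a prescribed mean, the maximum-entropy one has the exponential form p_k ∝ exp(λk). The target mean of 4.5 and the bisection bounds are assumptions for illustration.

```python
from math import exp

FACES = range(1, 7)

def maxent_distribution(lam):
    """Max-entropy distribution under a mean constraint: p_k proportional to exp(lam * k)."""
    weights = [exp(lam * k) for k in FACES]
    total = sum(weights)
    return [w / total for w in weights]

def mean_of(dist):
    return sum(k * p for k, p in zip(FACES, dist))

def solve_lambda(target_mean, lo=-5.0, hi=5.0, iterations=100):
    """Bisection on lam; the mean is monotonically increasing in lam."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean_of(maxent_distribution(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_lambda(4.5)
dist = maxent_distribution(lam)
print(round(lam, 4), [round(p, 4) for p in dist])
```

With no constraint beyond normalization (λ = 0) the same construction returns the uniform distribution, which is the "least committed" baseline the principle generalizes.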
3. Frequentist vs Bayesian — the deciding axes
| Axis | Frequentist | Bayesian |
|---|---|---|
| What’s random? | data (under fixed θ) | θ (given fixed data) |
| Prior | none (or improper as null hypothesis) | required |
| Output | point estimate + confidence interval + p-value | full posterior distribution |
| Sequential analysis | requires alpha-spending (O’Brien-Fleming, Pocock) | trivial — posterior updates |
| Small-data behavior | breaks (n=1 → no SE) | works (prior provides regularization) |
| Big-data behavior | works | works (posterior concentrates) |
| Computation | closed-form often available | MCMC / VI / Laplace often needed |
| Communication | “p < 0.05”, familiar to regulators | “P(θ > 0 \| data)” |
| Hypothesis testing | Neyman-Pearson lemma + UMP tests | Bayes factor (Kass-Raftery 1995), posterior odds |
| Multi-comparisons | Bonferroni, FDR (Benjamini-Hochberg 1995), Holm | hierarchical shrinkage, posterior pooling |
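For the frequentist side of the multi-comparisons row, the Benjamini-Hochberg step-up procedure and Bonferroni can be sketched in a few lines. The p-values below are invented for illustration, and the function names are not from any particular library.

```python
def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: reject the k smallest p-values, where k is the largest rank
    with p_(k) <= k * q / m. Controls the false discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Bonferroni: reject when p <= alpha / m. Controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]  # hypothetical
bh = benjamini_hochberg(p_values)
bf = bonferroni(p_values)
print("BH rejections:", sum(bh), "Bonferroni rejections:", sum(bf))
```

On these inputs BH rejects two hypotheses and Bonferroni one, reflecting that FDR control is a weaker (and therefore more powerful) guarantee than family-wise error control.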
4. Map every Math note to a framework
| Math note | Primary framework | Notes |
|---|---|---|
| probability-fundamentals | measure-theoretic substrate | both views presented |
| probability-distributions | measure-theoretic | distribution-by-distribution |
| hypothesis-testing-mle | frequentist | Neyman-Pearson, MLE, Wald/LRT/score tests |
| bayesian-inference | Bayesian (subjective + objective) | priors, posteriors, posterior predictive, hierarchical models |
| causal-inference | causal (Pearl + Rubin) | do-calculus, PSM, IV, double-ML |
| mcmc-sampling | Bayesian | Metropolis-Hastings, Gibbs, HMC, NUTS, parallel tempering |
| variational-inference | Bayesian | ELBO + amortized inference (VAE, BBVI) |
| measure-theory-and-integration | measure-theoretic | foundational |
| information-theory | info-theoretic | KL divergence, mutual information, AIC/BIC underpinning |
| markov-chains-and-hmm | both — frequentist EM-on-HMM, Bayesian HMM with priors | discrete-time, ergodicity, Baum-Welch |
| time-series-and-hmm | both — Box-Jenkins frequentist, Bayesian state-space | ARIMA, Kalman, particle filter |
| copulas-and-dependence | both — Sklar’s theorem is measure-theoretic, fitting is frequentist or Bayesian | Gaussian/Student-t/Archimedean |
| gaussian-processes | Bayesian (non-parametric) | infinite-dim prior over functions |
| stochastic-calculus | measure-theoretic + frequentist | Brownian, Itô, martingales, change of measure |
| probability-distribution-zoo | measure-theoretic catalog | for reference |
| sampling-algorithms-catalog | both — frequentist MC, Bayesian MCMC | sampling primitives |
| statistical-distributions-catalog-extended | catalog | distribution properties |
5. p-values, the ASA 2016 statement, and the reproducibility crisis
The American Statistical Association in 2016 (Wasserstein-Lazar; expanded 2019 Wasserstein-Schirm-Lazar) issued formal warnings about p-value misuse. Key points:
- A p-value is not the probability the null is true.
- A non-significant p does not mean no effect — power matters.
- A “statistically significant” finding is not necessarily large or important.
- p < 0.05 should not be a bright line — Benjamin et al 2018 proposed p < 0.005.
- The 2019 follow-up urged abandoning “statistical significance” as a dichotomy entirely.
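One way to see why a p-value is not the probability that the null is true is a back-of-envelope calculation in the spirit of Ioannidis 2005: the share of true nulls among "significant" results depends on the base rate of true nulls and on power, not only on α. The inputs below (90% true nulls, power 0.5) are hypothetical.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | test is significant) by Bayes' rule.

    prior_null: fraction of tested hypotheses for which the null is true
    alpha:      significance threshold (false-positive rate under the null)
    power:      probability of significance when the null is false
    """
    false_positives = alpha * prior_null
    true_positives = power * (1 - prior_null)
    return false_positives / (false_positives + true_positives)

low_power = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.5)
high_power = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(round(low_power, 3), round(high_power, 3))
```

Under these assumed inputs nearly half of the significant findings come from true nulls, even though every one of them cleared p < 0.05.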
The reproducibility crisis (Ioannidis 2005 “Why Most Published Research Findings Are False”; Open Science Collaboration 2015 reproducibility project on psychology, in which only 36% of 100 replications reached statistical significance) catalyzed:
- Preregistration — commit to analysis plan before data collection (OSF, AsPredicted).
- Registered reports — peer review of design before data collection.
- Multiverse analysis (Steegen et al 2016) — present results across all reasonable analytic choices.
- Specification curve / sensitivity analyses (Simonsohn-Simmons-Nelson 2020).
- Open data + open code — Open Science Framework, code repositories.
- TOP guidelines — Transparency and Openness Promotion (Nosek et al 2015).
6. Bayesian computation — the practical stack
| Method | When | Library |
|---|---|---|
| Conjugate analysis | small prior-likelihood pairs | textbook |
| Grid approximation | low-dim (≤ 3 params) | by hand |
| Laplace approximation | unimodal posterior | INLA (Rue-Martino-Chopin 2009) |
| Variational inference (VI / ELBO) | large dataset, tractable family | Stan (ADVI), PyMC, NumPyro, Pyro, TensorFlow Probability |
| Black-box VI (BBVI) | flexible posterior | NumPyro, Pyro, BlackJAX |
| Normalizing flows for VI | non-Gaussian posterior | Pyro, BlackJAX |
| MCMC: Metropolis-Hastings | low-dim, any posterior | rare in practice now |
| MCMC: Gibbs sampling | conditionally conjugate | JAGS, BUGS |
| MCMC: HMC (Hamiltonian Monte Carlo, Duane et al 1987; Neal 2010) | continuous, smooth | Stan, PyMC, NumPyro |
| MCMC: NUTS (Hoffman-Gelman 2014) | continuous, smooth, no manual tuning | Stan, PyMC, NumPyro (default) |
| MCMC: parallel tempering | multimodal | BlackJAX |
| Sequential Monte Carlo (particle filter / SMC sampler) | sequential / dynamic / multimodal | particles, BlackJAX, ParticleSMC.jl |
| Approximate Bayesian Computation (ABC) | likelihood intractable | ABCpy, ELFI |
| Simulation-based inference (SBI, Cranmer-Brehmer-Louppe 2020) | likelihood intractable but simulable | sbi (Mackelab Tübingen) |
In 2025 the canonical Bayesian stack is: Stan (Carpenter et al 2017, ~400 citations/month) for “I want a probabilistic programming language”; NumPyro / Pyro for “I want JAX-/PyTorch-integrated VI”; PyMC for “I want pythonic Bayesian modeling”; INLA for spatio-temporal GLMs; BlackJAX for JAX-native MCMC; and brms / rstanarm for R users who want lme4-style syntax with a Bayesian backend.
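For intuition about what the MCMC rows in the table do, here is a minimal random-walk Metropolis sampler in plain Python. It is a teaching sketch, not how Stan or NumPyro sample (they default to NUTS); the target (a flat Beta(1, 1) prior with 7 successes and 3 failures), the step size, and the warm-up length are all assumptions.

```python
import math
import random

def log_posterior(theta, successes=7, failures=3):
    """Unnormalised log posterior: Beta(1, 1) prior times binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + failures * math.log(1.0 - theta)

def metropolis(log_target, start, n_samples, step=0.2, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

draws = metropolis(log_posterior, start=0.5, n_samples=20000)[2000:]  # drop warm-up
posterior_mean = sum(draws) / len(draws)
print(round(posterior_mean, 3), "exact:", round(8 / 12, 3))
```

Because this target is conjugate, the exact posterior is Beta(8, 4) with mean 8/12, which gives a simple check on the sampler before trusting it on a model without a closed form.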
7. The information criteria
| Criterion | Formula | When | Penalty for complexity |
|---|---|---|---|
| AIC | -2 log L̂ + 2k | model selection, predictive focus | 2 per parameter |
| BIC | -2 log L̂ + k log n | asymptotic marginal likelihood | log n per parameter (heavier than AIC for n > 7) |
| DIC | -2 log L̂_θ̄ + 2 p_D | Bayesian, deviance + complexity | effective number of parameters p_D |
| WAIC (Watanabe-Akaike) | -2 (lppd - p_WAIC) | Bayesian, fully Bayesian | p_WAIC measured from posterior |
| PSIS-LOO | -2 lppd_LOO | Bayesian, leave-one-out approximation | accounts for posterior naturally |
| BPIC (Ando 2007) | variant of DIC | n/a | n/a |
| Hannan-Quinn | -2 log L̂ + 2k log log n | between AIC + BIC | 2 log log n per parameter |
Gelman-Hwang-Vehtari 2014 review these criteria; Vehtari-Gelman-Gabry 2017 recommend PSIS-LOO as the default for Bayesian model comparison (with WAIC second). AIC remains the usual choice in ML approaches where MLE is the norm.
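The AIC and BIC formulas from the table, plus Akaike weights, fit in a few lines. The maximized log-likelihoods, parameter counts, and sample size below are invented so that the two criteria disagree; they do not come from a real fit.

```python
from math import exp, log

def aic(log_lik, k):
    """AIC = -2 log L + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """BIC = -2 log L + k log n."""
    return -2.0 * log_lik + k * log(n)

def akaike_weights(scores):
    """Relative support for each model: exp(-delta / 2), normalised to sum to 1."""
    best = min(scores)
    relative = [exp(-0.5 * (s - best)) for s in scores]
    total = sum(relative)
    return [r / total for r in relative]

n = 200  # hypothetical sample size
# model name -> (maximised log-likelihood, number of parameters); hypothetical values
fits = {"linear": (-310.0, 3), "quadratic": (-305.5, 4), "cubic": (-303.0, 5)}

aic_scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
bic_scores = {name: bic(ll, k, n) for name, (ll, k) in fits.items()}
weights = dict(zip(fits, akaike_weights(list(aic_scores.values()))))

print("AIC picks:", min(aic_scores, key=aic_scores.get))
print("BIC picks:", min(bic_scores, key=bic_scores.get))
print({name: round(w, 3) for name, w in weights.items()})
```

With these numbers AIC prefers the cubic model while BIC prefers the quadratic one, because at n = 200 the log n penalty (about 5.3 per parameter) outweighs the cubic model's gain in log-likelihood.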
8. Causal frameworks — Pearl vs Rubin in practice
| Pearl SCM | Rubin Potential Outcomes |
|---|---|
| DAG over variables | Y(0), Y(1) for each unit |
| do(X = x) operator | “treatment received w” |
| Causal effect P(Y \| do(X)) | E[Y(1) - Y(0)] |
| Identifiability via do-calculus rules | Identifiability via ignorability (Y(0), Y(1) ⊥ T \| X) |
| Counterfactuals as third tier | Implicit in Y(t) notation |
| Front-door, back-door criteria | Propensity score, IV, IPTW |
| dagitty (Textor et al) — software | MatchIt, ipw, twang — software |
Both schools handle the same problems; Pearl’s DAG-language is better for causal discovery and explanatory mechanism work; Rubin’s potential-outcomes language is better for experimental design and estimation.
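The claim that both schools handle the same problems can be illustrated on a small table of counts: back-door adjustment (Pearl) and inverse-probability-of-treatment weighting (Rubin-style) return the same average treatment effect when a single binary covariate Z is assumed to be the only confounder. The counts are hypothetical, and the propensity score is estimated by simple cell proportions.

```python
# (z, x) -> (number of units, number with Y = 1); hypothetical counts
cells = {
    (0, 0): (80, 20), (0, 1): (20, 8),
    (1, 0): (20, 12), (1, 1): (80, 60),
}
N = sum(n for n, _ in cells.values())
strata = sorted({z for z, _ in cells})

def stratum_size(z):
    return cells[(z, 0)][0] + cells[(z, 1)][0]

def outcome_rate(z, x):
    n, y = cells[(z, x)]
    return y / n

def propensity(z):
    """Estimated P(X = 1 | Z = z)."""
    return cells[(z, 1)][0] / stratum_size(z)

def ate_backdoor():
    """Back-door adjustment: sum_z P(z) * (E[Y | X=1, z] - E[Y | X=0, z])."""
    return sum(
        stratum_size(z) / N * (outcome_rate(z, 1) - outcome_rate(z, 0))
        for z in strata
    )

def ate_iptw():
    """IPTW: weight treated units by 1/e(z) and controls by 1/(1 - e(z))."""
    total = 0.0
    for (z, x), (_, y_sum) in cells.items():
        e = propensity(z)
        total += y_sum / e if x == 1 else -y_sum / (1 - e)
    return total / N

def naive_difference():
    """Unadjusted E[Y | X=1] - E[Y | X=0], which ignores confounding by Z."""
    treated_n = sum(cells[(z, 1)][0] for z in strata)
    treated_y = sum(cells[(z, 1)][1] for z in strata)
    control_n = sum(cells[(z, 0)][0] for z in strata)
    control_y = sum(cells[(z, 0)][1] for z in strata)
    return treated_y / treated_n - control_y / control_n

print(round(ate_backdoor(), 3), round(ate_iptw(), 3), round(naive_difference(), 3))
```

The two adjusted estimators agree exactly here because both are saturated in Z; the naive contrast is much larger since Z raises both treatment uptake and the outcome. None of this tests the no-unmeasured-confounding assumption itself, which has to be argued from the DAG or the design.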
Modern ML-causal: Double Machine Learning (Chernozhukov et al 2018), Causal Forests (Wager-Athey 2018), X-Learner (Künzel et al 2019), CausalML library (Uber), EconML library (Microsoft Research), DoWhy (Microsoft, Sharma-Kiciman 2020). These integrate causal identification with off-the-shelf ML.
9. The Wasserstein-Schirm-Lazar 2019 framing
The ASA’s 2019 follow-up identified the “p < 0.05” problem as a bright-line fallacy. They recommended:
- Stop dichotomizing — present effect sizes + uncertainty intervals.
- “Statistical significance” is not the same as “scientific significance”.
- Embrace uncertainty — multiple analyses, sensitivity analyses, robustness checks.
- Use Bayesian, info-theoretic, or other frameworks as appropriate to the question.
- Move toward “Accept Uncertainty, be Thoughtful, Open, and Modest” (ATOM principles).
10. Reporting frameworks
| Framework | Standard report |
|---|---|
| Frequentist | Point estimate ± SE; 95% CI; p-value; sample size; effect size (Cohen’s d, OR, RR); power |
| Bayesian | Posterior mean / median / mode; 95% credible interval (CrI); posterior probability of direction; Bayes factor |
| Information-theoretic | ΔAIC or ΔBIC across models; Akaike weights; LOO-CV |
| Causal (Rubin) | ATE / ATT estimate; SE under sample-size assumptions; sensitivity to unmeasured confounding (Rosenbaum bounds, E-values, VanderWeele-Ding 2017) |
| Causal (Pearl) | DAG; do-calculus derivation; identification result; estimator; standard error |
Modern practice in epidemiology (Hernán-Robins 2020 Causal Inference: What If) blends Rubin + Pearl; preregistered DAG + IPTW + double-ML.
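The E-value mentioned in the Rubin row has a closed form for a risk ratio: E = RR + sqrt(RR × (RR - 1)), applied to 1/RR when the estimate is below 1 (VanderWeele-Ding 2017). A minimal sketch, with a made-up observed risk ratio of 2.0:

```python
from math import sqrt

def e_value(risk_ratio):
    """E-value for a risk-ratio point estimate.

    The minimum strength of association (on the risk-ratio scale) that an
    unmeasured confounder would need with both treatment and outcome to
    fully explain away the observed estimate.
    """
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + sqrt(rr * (rr - 1))

print(round(e_value(2.0), 2))  # hypothetical observed RR
```

The same function applied to the confidence limit closest to 1 gives the E-value for the interval, which is usually reported alongside the one for the point estimate.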
11. Modern (2020–2026) developments
- Probabilistic programming — Stan, PyMC, NumPyro, Turing.jl, Pyro, Edward2 ubiquitous in research labs.
- Differentiable / GPU-accelerated MCMC: BlackJAX (JAX) and NumPyro (JAX) can deliver 10–100× speedups over CPU samplers on suitable models.
- Simulation-based inference (SBI) — sbi library (Tübingen), Bayesian inference when likelihood is intractable. Cosmology, neuroscience, particle physics.
- Conformal prediction (Vovk-Gammerman-Shafer 2005; resurgence 2020+; Angelopoulos-Bates 2021 tutorial) — distribution-free prediction sets with finite-sample coverage. Major shift in uncertainty quantification.
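- Conformal inference + Bayesian: the two combine well, since conformalizing a Bayesian posterior predictive keeps finite-sample marginal coverage even when the model is misspecified.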
- Posterior predictive checks standard in any Bayesian analysis (Gelman et al Bayesian Data Analysis 3rd ed).
- Causal ML at scale — EconML, CausalML, DoWhy in production at Uber, Microsoft, Meta.
- Bayesian deep learning: variational dropout (Kingma et al 2015), MC dropout (Gal-Ghahramani 2016), deep ensembles, SWAG (Maddox et al 2019), Laplace approximation for NNs (Daxberger et al 2021).
- Diffusion models as SDE-based generative inference — Song-Sohl-Dickstein-Kingma-Kumar-Ermon-Poole 2020+; bridges Bayesian inference + generative modeling.
- Foundation models for inference — TabPFN (Hollmann et al 2023), prior-fitted networks; transformer-based Bayesian inference.
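The conformal prediction item above can be made concrete with split conformal prediction using absolute residuals as the score. This is a sketch on simulated data: the deliberately imperfect `predict` function stands in for any pre-fitted model, and the data-generating process, sample sizes, and α = 0.1 are assumptions.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """The ceil((n + 1)(1 - alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def predict(x):
    """Hypothetical pre-fitted model; deliberately misspecified (true slope is 2.2)."""
    return 2.0 * x

rng = random.Random(0)

def draw(n):
    """Simulated exchangeable (x, y) pairs."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2.2 * x + rng.gauss(0, 1)) for x in xs]

calibration, test = draw(500), draw(2000)
alpha = 0.1

# Calibrate: interval half-width from held-out absolute residuals.
q = conformal_quantile([abs(y - predict(x)) for x, y in calibration], alpha)

# Prediction interval for a new x is [predict(x) - q, predict(x) + q].
coverage = sum(abs(y - predict(x)) <= q for x, y in test) / len(test)
print(f"half-width = {q:.2f}, empirical coverage = {coverage:.3f}")
```

The coverage guarantee is marginal and relies only on exchangeability of calibration and test points, which is why it survives the misspecified model; the price of a poor model is wider intervals, not lost coverage.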
12. Decision tree — pick a framework
What's your question?
├─ "Is the effect zero?" (NHST, regulatory)
│ → Frequentist; report p-value + 95% CI + effect size.
│ → If interim analyses are planned → prespecify alpha-spending.
│ → If multiple comparisons → FDR (BH) or Bonferroni.
├─ "What's my best estimate + uncertainty?"
│ ├─ Have prior info? → Bayesian; report posterior mean + 95% credible interval.
│ ├─ No prior info, want frequentist? → MLE + SE + CI.
│ └─ Want distribution-free finite-sample coverage? → Conformal prediction.
├─ "Which of several models is best?"
│ ├─ Predictive focus → AIC or LOO-CV.
│ ├─ Truth focus / penalize complexity → BIC.
│ ├─ Bayesian → posterior model probability / Bayes factor / WAIC.
│ └─ Non-nested → cross-validation.
├─ "What's the causal effect of X on Y?"
│ ├─ Have RCT? → Frequentist or Bayesian on Y ~ T.
│ ├─ Have observational data? → Causal inference (PSM, IPTW, IV, RDD, DiD, double-ML).
│ └─ Need to identify? → DAG + do-calculus + sensitivity analysis (E-value).
├─ "What's the probability of an event given a model?"
│ → Direct probability calculation; measure-theoretic.
├─ "What's the maximum-likelihood estimate?"
│ → MLE (Fisher, frequentist) or MAP (Bayesian point estimate).
├─ "I have a sequential / streaming experiment"
│ → Bayesian (natural) or frequentist w/ alpha-spending (O'Brien-Fleming) / sequential probability ratio test.
├─ "I want to forecast"
│ ├─ Time series → ARIMA / state-space / Bayesian structural / Prophet / NeuralForecast.
│ ├─ Probabilistic forecast → Bayesian or conformal.
│ └─ ML forecast → quantile regression / conformal calibration.
├─ "My data is messy / noisy / has outliers"
│ → Robust methods (M-estimators, MCD, S-estimators); Bayesian w/ heavy-tailed likelihood (Student-t).
├─ "I can simulate but not compute likelihood"
│ → Simulation-based inference (sbi library), ABC, or amortized inference.
└─ "I want to discover causal structure from data"
  → Causal discovery (PC, FCI, GES, NOTEARS); see [[Math/causal-inference]].
13. Anti-patterns
- “P < 0.05 = effect exists” — see ASA 2016 / 2019 statements. Effect size matters; reproducibility matters.
- Using Bayes factors without prior sensitivity analysis — BFs are heavily prior-dependent.
- Using flat priors as “non-informative” — flat is not always non-informative; depends on parameterization.
- MAP as Bayesian point estimate without considering posterior shape — MAP can be far from posterior mean for skewed posteriors.
- Sequential frequentist tests without alpha-spending — alpha inflates beyond control.
- Causal claims from observational data without DAG / identification argument — unidentified.
- “Significant” with n=1,000,000: at that sample size even negligible effects reach significance, so report the effect size.
- Confusing Bayesian credible interval with frequentist CI — different probability statements.
- Hierarchical Bayes without convergence diagnostics: R̂ < 1.01 and ESS > 400 in total (roughly 100 per chain with four chains) are minimum bars.
- Reporting only mean without uncertainty — always include CI / SE / posterior summary.
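The anti-pattern of sequential frequentist testing without alpha-spending is easy to demonstrate by simulation. The sketch below generates data under a true null, tests after every batch with an unadjusted two-sided z-test at α = 0.05, and stops at the first "significant" look; the batch sizes, number of looks, and number of simulated experiments are arbitrary choices.

```python
import math
import random

def z_test_p_value(values):
    """Two-sided z-test of mean 0, assuming known unit variance."""
    n = len(values)
    z = sum(values) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(n_experiments, looks, batch, alpha=0.05, seed=0):
    """Share of null experiments declared significant at any interim look."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        data = []
        for _ in range(looks):
            data.extend(rng.gauss(0.0, 1.0) for _ in range(batch))
            if z_test_p_value(data) < alpha:  # stop at the first "significant" look
                false_positives += 1
                break
    return false_positives / n_experiments

single_look = false_positive_rate(4000, looks=1, batch=100)
five_looks = false_positive_rate(4000, looks=5, batch=20)
print(f"one look: {single_look:.3f}, five unadjusted looks: {five_looks:.3f}")
```

Both designs use the same final sample size of 100, yet peeking five times pushes the type-I error rate well above the nominal 0.05 (theory gives roughly 0.14 for five equally spaced looks). Alpha-spending boundaries such as O'Brien-Fleming or Pocock exist to pull it back to the nominal level.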
14. The reproducibility crisis frame
The crisis is not a single discipline’s failure but a foundational issue with how inference is taught and used:
- Publication bias — significant results published, null results filed away.
- HARKing (Hypothesizing After Results are Known) — post-hoc hypotheses dressed up as a priori.
- p-hacking — analytic flexibility until p < 0.05.
- Low power: Cohen 1962 found mean power of about 0.18 to detect small effects and 0.48 for medium effects; little has changed.
- Multiple testing without correction.
- Forking paths (Gelman-Loken 2014) — the analyst’s degrees of freedom.
The institutional responses (preregistration, registered reports, multiverse analysis, open data, registered direct replications, p < 0.005 advocacy, abandonment of p-value dichotomy) are all reactions to this. By 2026 most major psych / med / econ journals require preregistration or trial registration.
Adjacent
- Bayesian inference depth — bayesian-inference for priors, hierarchical models, posterior predictive checks.
- Frequentist hypothesis testing — hypothesis-testing-mle for NP lemma, UMP tests, multiple comparisons.
- Causal inference depth — causal-inference for DAGs, do-calculus, PSM, IV, RDD, DiD, double-ML.
- MCMC algorithms — mcmc-sampling for HMC, NUTS, Gibbs, MH, parallel tempering.
- Variational inference — variational-inference for ELBO, VAEs, BBVI, amortized inference.
- Measure theory — measure-theory-and-integration for the foundational substrate.
- Information theory — information-theory for KL, mutual information, AIC/BIC.
- Markov chains — markov-chains-and-hmm for chain ergodicity (underlying MCMC).
- Time series — time-series-and-hmm for state-space + Kalman + particle filters.
- Copulas — copulas-and-dependence for dependence beyond linear correlation.
- Gaussian processes — gaussian-processes for Bayesian non-parametrics.
- Stochastic calculus — stochastic-calculus for change-of-measure (Girsanov) and martingale-based inference.
- Optimization — _compare_optimization-methods for the MLE / MAP / VI / MCMC optimization-side.
- Probability distribution catalog — probability-distribution-zoo.
- Statistical distribution catalog — statistical-distributions-catalog-extended.
- Sampling algorithms — sampling-algorithms-catalog.
- Finance application — risk measures and probability frameworks are central in _compare_risk-measures.
When to pick what
The fastest narrowing: regulatory / NHST → frequentist with multiplicity correction; small data + prior → Bayesian; prediction focus → information-theoretic / cross-validation; causal question → Pearl/Rubin causal; decision under uncertainty → Bayesian decision theory; distribution-free finite-sample coverage → conformal prediction; likelihood intractable but simulable → SBI; sequential / streaming → Bayesian or sequential frequentist (alpha-spending). The single biggest practical lesson of the 2010s reproducibility crisis is preregister your analysis — commit to your framework, model, and inference procedure before seeing data. Without that, every framework above can be gamed.