Causal Inference — Math Reference

1. At a glance

“Correlation is not causation” — but causal inference is the formal apparatus that tells us when, and under what assumptions, an associational quantity can be read as causal. The central question is counterfactual: what would happen to outcome Y if we intervened to set treatment T? This is distinct from prediction. A predictive model asks P(Y | T = t, X = x) (what we observe when T happens to take value t); a causal model asks P(Y | do(T = t), X = x) (what we would see if we forced T to t, overriding its natural mechanism).

Two main frameworks dominate:

Potential outcomes (Neyman 1923; Rubin 1974) — every unit has a vector of counterfactual outcomes Y_i(t) for each treatment level. The fundamental problem is that only one is ever observed.
Graphical / structural causal models (Pearl 1995, 2009) — DAGs encode causal structure; do-calculus formally manipulates interventional distributions; backdoor + frontdoor criteria identify estimable effects.

Modern unification is achieved by Imbens + Rubin 2015, Pearl 2009, and Hernán + Robins “What If” 2020 — the three texts that together define the contemporary canon. The frameworks are equivalent in expressive power for the questions both can pose; choice is largely a matter of taste and problem geometry.

Practical use cases:

Policy evaluation (job training, minimum wage, welfare programs)
A/B testing + online experimentation
Medical efficacy (RCTs + observational drug-effect studies)
Ad attribution + marketing incrementality
Pricing + recommendation lift
Climate attribution (was this hurricane “caused by” warming?)
Algorithmic fairness + counterfactual fairness

2. Potential outcomes framework

For each unit i = 1, ..., n define:

Treatment assignment T_i ∈ {0, 1} (binary case; extends to multi-valued + continuous).
Potential outcomes Y_i(0) (under control) and Y_i(1) (under treatment). These are both defined for every unit, but only one is realized.
Individual causal effect τ_i = Y_i(1) − Y_i(0).

Fundamental Problem of Causal Inference (Holland 1986): we observe Y_i = Y_i(T_i), never both Y_i(0) and Y_i(1) for the same unit. Individual effects are therefore never directly estimable; we estimate population averages across units.

Key estimands:

ATE (Average Treatment Effect) τ = E[Y(1) − Y(0)] — population mean effect.
ATT (Average Treatment effect on the Treated) τ_ATT = E[Y(1) − Y(0) | T = 1] — effect on those who actually received treatment. Often the policy-relevant quantity (e.g. “did the training program work for those who attended?”).
ATU (Average Treatment effect on the Untreated) E[Y(1) − Y(0) | T = 0].
CATE (Conditional ATE) τ(x) = E[Y(1) − Y(0) | X = x] — covariate-conditional, basis for personalization (see §13).
ATO (Overlap-weighted ATE) — re-weights to where propensity is near 0.5; Li + Morgan + Zaslavsky 2018.

The naive estimator E[Y | T = 1] − E[Y | T = 0] decomposes as

ATE_naive = ATE + selection_bias + treatment_heterogeneity_bias

where selection bias is E[Y(0) | T=1] − E[Y(0) | T=0] (baseline differences between groups).
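The decomposition can be derived in two steps. The sketch below writes p = P(T = 1), a symbol introduced here for convenience, and uses consistency for the first equality and ATE = p · ATT + (1 − p) · ATU for the last.

```latex
\begin{aligned}
E[Y \mid T=1] - E[Y \mid T=0]
  &= E[Y(1) \mid T=1] - E[Y(0) \mid T=0] \\
  &= \underbrace{E[Y(1) - Y(0) \mid T=1]}_{\text{ATT}}
   + \underbrace{E[Y(0) \mid T=1] - E[Y(0) \mid T=0]}_{\text{selection bias}} \\
  &= \text{ATE}
   + \underbrace{E[Y(0) \mid T=1] - E[Y(0) \mid T=0]}_{\text{selection bias}}
   + \underbrace{(1-p)\,(\text{ATT} - \text{ATU})}_{\text{heterogeneity bias}}
\end{aligned}
```

Both bias terms vanish under randomization, because T is then independent of (Y(0), Y(1)).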

3. Identification assumptions

To estimate E[Y(t)] from observed (T, Y, X) data, three core assumptions are required (Rubin’s “stable, ignorable, overlapping”), plus consistency, which ties observed outcomes to potential outcomes:

SUTVA (Stable Unit Treatment Value Assumption) — Rubin 1980.
- No interference: unit i’s outcome depends only on unit i’s treatment, not on others’. Violated by network effects, herd immunity, market equilibrium spillovers.
- Consistent treatment: a single, well-defined t=1 (no “hidden versions” of treatment).
Ignorability / Unconfoundedness / Strong Ignorability (Y(0), Y(1)) ⫫ T | X — given covariates X, treatment is as good as random. The unobserved counterfactual distributions are the same in the treated and control groups conditional on X. Untestable from data alone; rests on substantive domain knowledge.
Positivity / Overlap / Common Support 0 < P(T = 1 | X = x) < 1 for all x in the support. If no treated units exist at some X = x, there is no information about E[Y(1) | X = x].
Consistency Y_i = T_i · Y_i(1) + (1 − T_i) · Y_i(0) — the observed outcome equals the potential outcome under the realized treatment (often folded into SUTVA).

Under (ignorability + positivity + consistency + SUTVA), the g-formula identifies:

E[Y(t)] = E_X[E[Y | T = t, X]]

— a standardization across the covariate distribution.
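A minimal simulation of this standardization for a single binary covariate, assuming NumPy. The data-generating process, the effect size of 1.0, and the helper name standardized_mean are illustrative choices, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical setup: binary confounder X raises both treatment uptake and the outcome.
x = rng.binomial(1, 0.5, size=n)
t = rng.binomial(1, np.where(x == 1, 0.8, 0.2))
true_ate = 1.0
y = true_ate * t + 2.0 * x + rng.normal(size=n)

def standardized_mean(y, t, x, level):
    """E_X[ E[Y | T=level, X] ] for a discrete covariate x."""
    total = 0.0
    for value in np.unique(x):
        stratum = x == value
        cell = stratum & (t == level)
        total += y[cell].mean() * stratum.mean()  # within-stratum mean times P(X = value)
    return total

naive = y[t == 1].mean() - y[t == 0].mean()
g_formula = standardized_mean(y, t, x, 1) - standardized_mean(y, t, x, 0)
print(f"naive={naive:.3f}  g-formula={g_formula:.3f}  truth={true_ate}")
```

The naive contrast absorbs the imbalance in X between arms, while the standardized contrast should land near the simulated effect.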

4. Randomized Controlled Trial (RCT)

The gold standard. By design T ⫫ (Y(0), Y(1), X), so unconfoundedness holds unconditionally and ATE = E[Y | T=1] − E[Y | T=0] is identified by a simple difference in means. Standard error from a two-sample t-test or Neyman variance bound.
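The difference in means and its conservative Neyman standard error can be sketched as follows, assuming NumPy and a simulated two-arm trial. The function name difference_in_means and the effect size of 0.5 are illustrative.

```python
import numpy as np

def difference_in_means(y, t):
    """ATE estimate and conservative Neyman standard error for a two-arm RCT."""
    y1, y0 = y[t == 1], y[t == 0]
    ate = y1.mean() - y0.mean()
    # Neyman variance estimate: per-arm sample variance divided by arm size, summed.
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return ate, se

rng = np.random.default_rng(1)
n = 10_000
t = rng.permutation(np.repeat([0, 1], n // 2))  # exactly n/2 units treated
y = 0.5 * t + rng.normal(size=n)                # hypothetical true effect 0.5

ate_hat, se_hat = difference_in_means(y, t)
ci = (ate_hat - 1.96 * se_hat, ate_hat + 1.96 * se_hat)
print(f"ATE={ate_hat:.3f}  SE={se_hat:.3f}  95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```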

Variants:

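Completely randomized — a fixed number n_1 of the n units is drawn at random for treatment. (Assigning each unit independently with probability p is Bernoulli randomization, which leaves the treated count random.)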
Stratified randomization — randomize within strata of pre-treatment covariates; reduces variance, ensures balance on key features.
Block / cluster randomized — randomize whole groups (schools, villages) — needed when interference within cluster is unavoidable.
Factorial designs — multiple treatments simultaneously, estimate main effects + interactions; common in industrial DOE (see [[Engineering/six-sigma]]).
Adaptive + multi-arm bandit RCTs — Thompson sampling, response-adaptive randomization; ethical efficiency at cost of statistical complexity (Berry + Berry).
Crossover designs — each unit receives both treatments at different times; requires no carryover.

Limitations: ethical constraints (you cannot randomize cancer onset); expense; ecological validity (does the lab effect generalize?); slow; hard for rare outcomes; political resistance; non-compliance (units assigned to treatment that did not take the pill), which introduces the ITT vs LATE distinction (see §8).

5. Observational study tools

When RCTs are infeasible, observational designs attempt to recover unconfoundedness by conditioning on enough X to make treatment quasi-random.

5.1 Matching

Find for each treated unit a control unit “similar” on X:

Exact matching — exact match on all covariates; impractical with more than a handful of discrete X.
Nearest-neighbor matching — minimize a distance metric (Mahalanobis, Euclidean) on standardized X; can be 1:1, k:1, with or without replacement.
Coarsened exact matching (CEM) (Iacus + King + Porro 2012) — discretize covariates into bins, exact-match within bins.
Optimal full matching (Hansen 2004) — assign each treated unit to a (potentially unbalanced) cluster of controls minimizing global distance.
Caliper matching — reject matches beyond a propensity-score distance threshold (reduces bias, increases variance from dropped units).

After matching, estimate ATT by mean difference in matched pairs. Standard errors via Abadie + Imbens 2006 robust variance.
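A minimal sketch of 1:1 nearest-neighbor matching with replacement, assuming NumPy, Euclidean distance on standardized covariates, and simulated data with a constant effect of 2.0. The function name nn_match_att is illustrative, and the Abadie + Imbens variance is not computed here.

```python
import numpy as np

def nn_match_att(y, t, X):
    """ATT from 1:1 nearest-neighbor matching with replacement on standardized X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    treated, control = Z[t == 1], Z[t == 0]
    # Pairwise squared Euclidean distances, shape (n_treated, n_control).
    d2 = ((treated[:, None, :] - control[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)              # index of the closest control per treated unit
    return (y[t == 1] - y[t == 0][nearest]).mean()

rng = np.random.default_rng(2)
n = 2_000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))   # treatment more likely at high X
t = rng.binomial(1, p)
y = 2.0 * t + X[:, 0] + X[:, 1] + rng.normal(size=n)

att_hat = nn_match_att(y, t, X)
naive = y[t == 1].mean() - y[t == 0].mean()
print(f"matched ATT={att_hat:.2f}  naive={naive:.2f}  truth=2.0")
```

Matching with replacement keeps bias low at the cost of reusing some controls; a caliper would drop the treated units whose nearest control is still far away.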

5.2 Propensity score (Rosenbaum + Rubin 1983)

Define π(x) = P(T = 1 | X = x). Key theorem: if unconfoundedness holds given X, it also holds given the scalar π(X). This dimensionality reduction is the engine of all propensity-based methods.

Estimation: logistic regression, gradient boosting (XGBoost), calibrated random forests, neural-net classifiers.

Uses:

Propensity score matching (PSM) — match on π(X) instead of X.
Propensity stratification — bin into 5–10 strata of π(X), compute within-stratum effect, weight by stratum size.
Propensity weighting (IPW) — see 5.3.
Doubly robust — combine outcome + propensity model (5.4).

Diagnostics: check covariate balance (standardized mean difference < 0.1 post-matching), overlap (histogram of π(X) by treatment group), sensitivity (Rosenbaum bounds).

5.3 Inverse propensity weighting (IPW)

Horvitz + Thompson 1952 estimator adapted to causal inference:

τ_IPW = (1/n) Σ [ T_i · Y_i / π(X_i) − (1 − T_i) · Y_i / (1 − π(X_i)) ]

Intuition: re-weight observed data so that the treated and control groups each have the covariate distribution of the full population. Variance explodes when π(X) is near 0 or 1. Stabilized IPW (Robins + Hernán + Brumback 2000) puts the marginal treatment probability in the numerator of each weight; truncated IPW clips extreme weights at chosen percentiles (e.g. the 1st and 99th).
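The IPW estimator can be sketched as below, assuming NumPy and known propensities. Two simplifications are assumptions of this sketch rather than the methods named above: the clip bounds apply to the propensity itself, not to weight percentiles, and the normalize option is the self-normalized (Hájek) form, a close relative of stabilized weights. The function name ipw_ate is illustrative.

```python
import numpy as np

def ipw_ate(y, t, pi, clip=(0.01, 0.99), normalize=True):
    """IPW estimate of the ATE from outcomes y, treatment t and propensities pi."""
    pi = np.clip(pi, *clip)                  # guard against weights exploding
    w1, w0 = t / pi, (1 - t) / (1 - pi)
    if normalize:
        # Self-normalized form: weights sum to one within each arm.
        return (w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum()
    return (w1 * y - w0 * y).mean()          # Horvitz-Thompson form

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)
pi_true = 1 / (1 + np.exp(-x))
t = rng.binomial(1, pi_true)
y = 1.0 * t + 2.0 * x + rng.normal(size=n)   # hypothetical true ATE 1.0

ht = ipw_ate(y, t, pi_true, normalize=False)
hajek = ipw_ate(y, t, pi_true, normalize=True)
naive = y[t == 1].mean() - y[t == 0].mean()
print(f"HT={ht:.3f}  Hajek={hajek:.3f}  naive={naive:.3f}")
```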

5.4 Doubly robust estimators

Robins + Rotnitzky 1995. Combine an outcome regression μ_t(X) = E[Y | T=t, X] with the propensity score π(X):

AIPW (Augmented IPW):

τ_AIPW = (1/n) Σ [ μ_1(X_i) − μ_0(X_i)
                  + T_i (Y_i − μ_1(X_i)) / π(X_i)
                  − (1 − T_i)(Y_i − μ_0(X_i)) / (1 − π(X_i)) ]

The “double” robustness: τ_AIPW is consistent if either μ_t or π is correctly specified; both wrong → bias. Variance is minimized when both are correct (semiparametric efficiency bound; Hahn 1998).
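Double robustness can be checked numerically. The sketch below assumes NumPy and feeds the AIPW formula deliberately wrong nuisance inputs one at a time; the function name aipw_ate, the simulated data, and the "wrong" models (a constant 0.5 propensity, an all-zero outcome model) are illustrative.

```python
import numpy as np

def aipw_ate(y, t, mu0, mu1, pi):
    """AIPW estimate from outcome predictions mu0, mu1 and propensities pi (arrays)."""
    return np.mean(
        mu1 - mu0
        + t * (y - mu1) / pi
        - (1 - t) * (y - mu0) / (1 - pi)
    )

rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(size=n)
pi_true = 1 / (1 + np.exp(-x))
t = rng.binomial(1, pi_true)
y = 1.0 * t + 2.0 * x + rng.normal(size=n)   # hypothetical true ATE 1.0

mu0_true, mu1_true = 2.0 * x, 1.0 + 2.0 * x
zeros, half = np.zeros(n), np.full(n, 0.5)

outcome_ok = aipw_ate(y, t, mu0_true, mu1_true, half)   # wrong propensity only
propensity_ok = aipw_ate(y, t, zeros, zeros, pi_true)   # wrong outcome model only
both_wrong = aipw_ate(y, t, zeros, zeros, half)
print(f"{outcome_ok:.3f}  {propensity_ok:.3f}  {both_wrong:.3f}")
```

In practice μ_t and π are estimated, ideally with cross-fitting; plugging in the true functions here isolates the robustness property.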

TMLE (Targeted Maximum Likelihood Estimation) — van der Laan + Rubin 2006; iteratively updates the initial outcome fit μ so that it is “targeted” toward the estimand; combines well with SuperLearner ensembles.

6. Difference-in-Differences (DiD)

Card + Krueger 1994 (NJ vs PA fast-food employment after NJ minimum wage hike) is the canonical study. Two-period two-group setup:

Pre-period t = 0 and post-period t = 1.
Groups g ∈ {0, 1} (control, treated). Only group 1 receives treatment, and only in period 1.

ATT_DiD = (E[Y_{t=1} | g=1] − E[Y_{t=0} | g=1])
        − (E[Y_{t=1} | g=0] − E[Y_{t=0} | g=0])

Identifying assumption: parallel trends — in absence of treatment, mean outcome would have evolved identically in both groups. Untestable, since it concerns the treated group’s unobserved untreated path; its plausibility is usually checked by comparing pre-period trends.

Equivalent regression form:

Y_it = α_i + λ_t + β · (g_i · post_t) + ε_it

— unit and time fixed effects, coefficient on the interaction is ATT.
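A small check that the four-means formula and the interaction coefficient agree, assuming NumPy and a simulated balanced two-period panel. For brevity the regression uses a group dummy in place of unit fixed effects, which gives the same interaction coefficient in this balanced 2x2 case; the effect size 0.7 and common trend 0.4 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000                                    # units per group
g = np.repeat([0, 1], n)                     # group indicator
unit_effect = rng.normal(size=2 * n) + 1.5 * g   # groups differ in level
att = 0.7
y_pre = unit_effect + rng.normal(size=2 * n)
y_post = unit_effect + 0.4 + att * g + rng.normal(size=2 * n)  # common trend 0.4

# Four-means form.
did_means = ((y_post[g == 1].mean() - y_pre[g == 1].mean())
             - (y_post[g == 0].mean() - y_pre[g == 0].mean()))

# Regression form on stacked data: Y = a + b*g + c*post + beta*(g*post).
y = np.concatenate([y_pre, y_post])
gg = np.concatenate([g, g])
post = np.concatenate([np.zeros(2 * n), np.ones(2 * n)])
design = np.column_stack([np.ones(4 * n), gg, post, gg * post])
beta = np.linalg.lstsq(design, y, rcond=None)[0][3]
print(f"DiD from means={did_means:.3f}  interaction coefficient={beta:.3f}")
```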

6.1 Two-way fixed effects (TWFE) under heterogeneous timing

When treatment is staggered (different units treated at different times), TWFE is not a clean DiD. Goodman-Bacon 2021 showed TWFE is a weighted average of all possible 2x2 DiDs, including “forbidden” comparisons where already-treated units serve as controls — yielding negative weights and biased estimates under heterogeneous effects.

Modern fixes:

Callaway + Sant’Anna 2021 — group-time ATTs aggregated with positive weights; did in R, csdid in Stata.
Sun + Abraham 2021 — interaction-weighted estimator using never-treated (or last-treated) cohorts as controls.
de Chaisemartin + D’Haultfœuille 2020 — multiple-period DiD with heterogeneous adoption.
Borusyak + Jaravel + Spiess 2024 — imputation estimator.

6.2 Synthetic Control

Abadie + Gardeazabal 2003 (Basque terrorism) + Abadie + Diamond + Hainmueller 2010 (California Prop 99 anti-tobacco). When N_treated = 1 and standard DiD has no comparable control, construct counterfactual as weighted combination of donor units:

Y_{treated,t}(0) ≈ Σ_j w_j · Y_{j,t}

with w_j ≥ 0, Σ w_j = 1, weights chosen to match pre-treatment trajectory + covariates of the treated unit.
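One way to obtain such weights is projected gradient descent on the simplex. The sketch below assumes NumPy, matches only pre-treatment outcomes (no covariates), and uses noise-free simulated donors so the true weights are recoverable; the function names and the effect of −3.0 are illustrative, and published implementations use different solvers.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def synthetic_control_weights(donors_pre, treated_pre, n_iter=5_000):
    """Minimize ||treated_pre - donors_pre @ w||^2 over the simplex."""
    n_donors = donors_pre.shape[1]
    w = np.full(n_donors, 1.0 / n_donors)
    step = 1.0 / (2.0 * np.linalg.norm(donors_pre, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = 2.0 * donors_pre.T @ (donors_pre @ w - treated_pre)
        w = project_to_simplex(w - step * grad)
    return w

rng = np.random.default_rng(5)
n_pre, n_post, n_donors = 40, 10, 4
donors = rng.normal(size=(n_pre + n_post, n_donors))
true_w = np.array([0.6, 0.4, 0.0, 0.0])      # hypothetical true donor mix
effect = -3.0
treated = donors @ true_w
treated[n_pre:] += effect                    # intervention hits only post-periods

w = synthetic_control_weights(donors[:n_pre], treated[:n_pre])
gap = treated[n_pre:] - donors[n_pre:] @ w   # estimated effect per post-period
print(np.round(w, 3), round(gap.mean(), 3))
```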

Famous applications: California Prop 99 (smoking declined ~25% relative to synthetic California), German reunification (Abadie + Diamond + Hainmueller 2015: a synthetic West Germany built from other OECD countries).

Extensions: augmented synthetic control (Ben-Michael + Feller + Rothstein 2021) corrects for imperfect pre-period fit; generalized synthetic control (Xu 2017) for multiple treated units; matrix completion (Athey + Bayati + Doudchenko + Imbens + Khosravi 2021).

7. Regression Discontinuity Design (RDD)

Thistlethwaite + Campbell 1960 (scholarship effect via test-score threshold). Treatment is assigned by whether a running variable X crosses a cutoff c:

Sharp RDD: T = 𝟙(X ≥ c) (deterministic).
Fuzzy RDD: P(T = 1 | X) jumps at c but is not 0/1.

Identification: units just above + below the cutoff are quasi-randomly similar; in the limit, this approximates a local RCT.

Estimator: local linear regression on a bandwidth around c:

Y_i = α + τ · 𝟙(X_i ≥ c) + β_1 (X_i − c) + β_2 (X_i − c) · 𝟙(X_i ≥ c) + ε_i

— τ is the local ATE at the cutoff.
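The regression above can be sketched with ordinary least squares inside a fixed window. This assumes NumPy, a uniform kernel, and a hand-picked bandwidth of 0.2; data-driven bandwidths and bias-corrected inference are not attempted. The function name sharp_rdd and the simulated jump of 2.0 are illustrative.

```python
import numpy as np

def sharp_rdd(y, x, cutoff, bandwidth):
    """Local linear sharp-RDD estimate with a uniform kernel and fixed bandwidth."""
    keep = np.abs(x - cutoff) <= bandwidth
    xc = x[keep] - cutoff
    above = (xc >= 0).astype(float)
    # Columns: intercept, jump at cutoff, slope below, change in slope above.
    design = np.column_stack([np.ones(xc.size), above, xc, xc * above])
    coef = np.linalg.lstsq(design, y[keep], rcond=None)[0]
    return coef[1]

rng = np.random.default_rng(6)
n = 40_000
x = rng.uniform(-1, 1, size=n)               # running variable, cutoff at 0
y = 0.5 * x + 0.3 * x ** 2 + 2.0 * (x >= 0) + rng.normal(scale=0.5, size=n)

tau_hat = sharp_rdd(y, x, cutoff=0.0, bandwidth=0.2)
print(f"estimated jump at cutoff: {tau_hat:.3f}")
```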

Bandwidth selection: Imbens + Kalyanaraman 2012; Calonico + Cattaneo + Titiunik 2014 robust bias-corrected inference + the rdrobust package. Modern guide: Cattaneo + Idrobo + Titiunik 2020 (2-volume Cambridge monograph).

Applications:

Scholarship eligibility cutoffs (test-score thresholds).
Election RDD — Lee 2008, incumbency advantage from close-vote wins.
Medicare eligibility at age 65 (Card + Dobkin + Maestas 2008).
Class-size effects from Maimonides’ rule (Angrist + Lavy 1999).

Validity checks: no manipulation of X around c (McCrary 2008 density test); covariate balance at cutoff; placebo cutoffs.

8. Instrumental Variables (IV)

Used when treatment is endogenous (correlated with unobserved confounders). Find an instrument Z satisfying:

Relevance: Cov(Z, T) ≠ 0 — Z affects T.
Exclusion: Z affects Y only through T — no direct path.
Independence: Z is independent of unobserved confounders.

8.1 Wald estimator

Binary Z, binary T:

τ_Wald = (E[Y | Z=1] − E[Y | Z=0]) / (E[T | Z=1] − E[T | Z=0])
       = (intent-to-treat effect on Y) / (first-stage effect on T)

8.2 Two-stage least squares (2SLS)

Continuous case. Stage 1: regress T on Z (and controls X) to get T̂. Stage 2: regress Y on T̂ (and X). Coefficient on T̂ is IV estimate.
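With a binary instrument and no covariates, the Wald ratio and 2SLS give the same number. The sketch below assumes NumPy and a simulated homogeneous effect of 1.5 with an unobserved confounder U; all names and coefficients are illustrative. The second-stage standard errors from this manual two-step are not valid and are not computed.

```python
import numpy as np

def ols(design, target):
    return np.linalg.lstsq(design, target, rcond=None)[0]

rng = np.random.default_rng(6)
n = 400_000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.binomial(1, 0.5, size=n)             # binary instrument
t = (0.8 * z + u + rng.normal(size=n) > 0.5).astype(float)  # endogenous treatment
y = 1.5 * t + 2.0 * u + rng.normal(size=n)   # hypothetical true effect 1.5

ones = np.ones(n)
ols_slope = ols(np.column_stack([ones, t]), y)[1]   # confounded by U

wald = ((y[z == 1].mean() - y[z == 0].mean())
        / (t[z == 1].mean() - t[z == 0].mean()))

first = ols(np.column_stack([ones, z]), t)          # stage 1: T on Z
t_hat = first[0] + first[1] * z
tsls = ols(np.column_stack([ones, t_hat]), y)[1]    # stage 2: Y on fitted T
print(f"OLS={ols_slope:.2f}  Wald={wald:.2f}  2SLS={tsls:.2f}")
```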

8.3 LATE (Local Average Treatment Effect)

Imbens + Angrist 1994 LATE theorem: under monotonicity (no “defiers”), IV identifies the average treatment effect on compliers — units who would take T = 1 if Z = 1 and T = 0 if Z = 0. Not ATE unless effects are homogeneous.

Compliance types:

Always-takers: T = 1 regardless of Z.
Never-takers: T = 0 regardless of Z.
Compliers: T = Z.
Defiers: T = 1 − Z (assumed away).

8.4 Weak instruments + diagnostics

If Cov(Z, T) is small, 2SLS is biased toward OLS and confidence intervals are unreliable. The first-stage F-statistic > 10 rule of thumb comes from Staiger + Stock 1997, and Stock + Yogo 2005 supply formal critical values; Lee + McCrary + Moreira + Porter 2022 tighten to F > 104.7 for reliable 5% inference.

8.5 Famous instruments

Vietnam draft lottery (Angrist 1990) — lottery number as instrument for military service effect on earnings.
Quarter of birth (Angrist + Krueger 1991) — instruments for schooling via compulsory-attendance laws; later criticized as weak.
Mendelian randomization — genetic variants as instruments in epidemiology (Davey Smith + Ebrahim 2003); SNPs are randomized at conception.
Distance to college (Card 1993) — proximity instruments for educational attainment.
Judge fixed effects (Kling 2006) — random assignment of judges with different sentencing tendencies.

9. Pearl’s structural framework

9.1 DAGs (Directed Acyclic Graphs)

Nodes are variables, directed edges represent direct causal effects. The graph implies a factorization:

P(V_1, ..., V_n) = Π_i P(V_i | parents(V_i))

A Structural Causal Model (SCM) specifies for each variable a function V_i = f_i(parents(V_i), U_i) with exogenous noise U_i.

9.2 d-separation

A graphical criterion for conditional independence. A path between X and Y is blocked by a set Z if it contains:

A chain A → B → C with B ∈ Z.
A fork A ← B → C with B ∈ Z.
A collider A → B ← C with B ∉ Z and no descendant of B in Z.

If all paths between X and Y are blocked by Z, then X ⫫ Y | Z in any distribution compatible with the DAG.
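d-separation can also be tested without enumerating paths, using the equivalent moralized ancestral graph criterion. The sketch below is plain Python and represents the DAG as a dict mapping each node to its set of parents; the function name d_separated and this representation are choices made for illustration.

```python
from itertools import combinations

def d_separated(parents, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG `parents`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    # 1. Keep only xs, ys, zs and their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralize: link each node to its parents, marry co-parents, drop directions.
    adj = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(ps, 2):
            adj[a].add(b)
            adj[b].add(a)
    # 3. Remove zs and check whether any x still reaches any y.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in ys:
            return False
        if node not in seen:
            seen.add(node)
            stack.extend(nb for nb in adj[node] if nb not in zs)
    return True

chain = {"B": {"A"}, "C": {"B"}}                 # A -> B -> C
fork = {"A": {"B"}, "C": {"B"}}                  # A <- B -> C
collider = {"B": {"A", "C"}, "D": {"B"}}         # A -> B <- C, B -> D
print(d_separated(chain, {"A"}, {"C"}, {"B"}),
      d_separated(collider, {"A"}, {"C"}, {"D"}))
```

The three example graphs mirror the chain, fork, and collider cases listed above, including the descendant-of-a-collider case.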

9.3 do-operator + interventional distribution

do(T = t) represents an intervention: replace the structural equation for T with T := t, severing all incoming edges. Then P(Y | do(T = t)) is the post-intervention distribution.

Generally P(Y | T = t) ≠ P(Y | do(T = t)) — they coincide only under specific structural conditions (e.g. no backdoor paths).

9.4 Do-calculus (Pearl 1995, 2009)

Three rules that transform P(Y | do(T), Z, W) expressions:

Rule 1 (Insertion / deletion of observations): ignore observations not affecting the outcome.
Rule 2 (Action / observation exchange): replace do(X) with observing X when no backdoor remains.
Rule 3 (Insertion / deletion of actions): drop do(X) when X doesn’t affect Y in mutilated graph.
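Stated formally, in Pearl's usual notation rather than the T/Y notation used elsewhere in this note: X, Y, Z, W are disjoint node sets, G with an overline on X is the graph with all edges into X removed, G with an underline on Z is the graph with all edges out of Z removed, and Z(W) is the set of Z-nodes that are not ancestors of any W-node in the graph with edges into X removed.

```latex
\begin{aligned}
\textbf{Rule 1:}\quad & P(y \mid do(x), z, w) = P(y \mid do(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
\textbf{Rule 2:}\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
\textbf{Rule 3:}\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{aligned}
```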

Tian + Pearl 2003 + Shpitser + Pearl 2006 give the ID algorithm: a complete procedure deciding whether P(Y | do(T)) is identifiable from the observational distribution + DAG, and if so produces the estimating formula.

9.5 Backdoor criterion

A set Z satisfies the backdoor criterion for (T, Y) if:

No node in Z is a descendant of T.
Z blocks every path between T and Y that has an arrow into T (backdoor path).

Then P(Y | do(T)) = Σ_z P(Y | T, Z=z) P(Z=z) — the standardization formula.

9.6 Frontdoor criterion

When backdoor is infeasible (key confounder unobserved), seek a set M on the directed path T → M → Y such that:

M intercepts all directed paths from T to Y.
No backdoor path from T to M.
All backdoor paths from M to Y are blocked by T.

Then P(Y | do(T)) = Σ_m P(M=m | T) Σ_t' P(Y | M=m, T=t') P(T=t').
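The frontdoor formula can be verified exactly on a small discrete model. The sketch below assumes NumPy and a hypothetical binary SCM with hidden U; every probability table is an invented number. It computes the observational joint P(T, M, Y), applies the formula, and compares against the interventional truth computed with access to U.

```python
import numpy as np

# Hypothetical binary SCM: U -> T, U -> Y (U unobserved), T -> M -> Y.
p_u = np.array([0.5, 0.5])                       # P(U=u)
p_t1_u = np.array([0.2, 0.8])                    # P(T=1 | U=u)
p_m1_t = np.array([0.1, 0.9])                    # P(M=1 | T=t)
p_y1_mu = np.array([[0.1, 0.5], [0.4, 0.9]])     # P(Y=1 | M=m, U=u), indexed [m, u]

def bern(p1, value):
    return p1 if value == 1 else 1.0 - p1

# Observational joint P(T, M, Y), marginalizing over the hidden U.
joint = np.zeros((2, 2, 2))
for u in (0, 1):
    for t in (0, 1):
        for m in (0, 1):
            for y in (0, 1):
                joint[t, m, y] += (p_u[u] * bern(p_t1_u[u], t)
                                   * bern(p_m1_t[t], m) * bern(p_y1_mu[m, u], y))

def frontdoor(joint, t):
    """P(Y=1 | do(T=t)) computed from the observed joint only."""
    p_t = joint.sum(axis=(1, 2))
    total = 0.0
    for m in (0, 1):
        p_m_given_t = joint[t, m].sum() / p_t[t]
        inner = sum(joint[tp, m, 1] / joint[tp, m].sum() * p_t[tp] for tp in (0, 1))
        total += p_m_given_t * inner
    return total

def truth(t):
    """P(Y=1 | do(T=t)) computed with access to U."""
    return sum(bern(p_m1_t[t], m) * p_y1_mu[m, u] * p_u[u]
               for m in (0, 1) for u in (0, 1))

naive = joint[1, :, 1].sum() / joint[1].sum()    # P(Y=1 | T=1), confounded by U
print(frontdoor(joint, 1), truth(1), naive)
```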

Pearl’s smoking → tar → cancer example illustrates frontdoor when genetic confounding makes backdoor adjustment impossible.

10. Common DAG-mistake patterns

Adjusting for a collider opens a spurious path. Classic example: Berkson’s paradox — selecting on hospital admission induces a correlation between two independent diseases.
Adjusting for a mediator blocks the very effect you want to estimate. If T → M → Y is the causal path, conditioning on M removes the indirect effect.
M-bias — conditioning on a pre-treatment variable that is a collider between a cause of T and a cause of Y opens a new bias path, even though the variable is associated with both T and Y and looks like a confounder.
Bad controls — Cinelli + Forney + Pearl 2024: include only pre-treatment, non-collider, non-mediator covariates that block backdoors.
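Collider bias is easy to reproduce by simulation. The sketch below assumes NumPy and two independent diseases that each raise the chance of hospital admission; the prevalences and admission probabilities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
disease_a = rng.binomial(1, 0.1, size=n)
disease_b = rng.binomial(1, 0.1, size=n)         # independent of disease_a
# Admission is a collider: probability 0.5 with either disease, 0.05 with neither.
admitted = rng.binomial(1, 0.05 + 0.45 * np.maximum(disease_a, disease_b))

corr_all = np.corrcoef(disease_a, disease_b)[0, 1]
sel = admitted == 1
corr_admitted = np.corrcoef(disease_a[sel], disease_b[sel])[0, 1]
print(f"whole population: {corr_all:+.3f}   admitted only: {corr_admitted:+.3f}")
```

Restricting to admitted patients conditions on the collider, so the two diseases appear negatively associated even though they were generated independently.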

Domain knowledge is irreducible: data alone cannot distinguish a confounder from a mediator from a collider. Tools like dagitty.net and the R package dagitty (Textor + van der Zander 2016) help draw DAGs and read off adjustment sets.

11. Mediation analysis

How much of the effect of T on Y goes through a mediator M?

Baron + Kenny 1986 classic three-regression decomposition:

T → Y (total)
T → M (a-path)
T → Y controlling for M (direct effect; the coefficient on M in this regression is the b-path)
Product a · b = indirect effect.

Limitations: assumes no interaction, no confounding of M-Y, linear.
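Under those linear, no-interaction assumptions the three regressions satisfy total = direct + a · b exactly in-sample. The sketch below assumes NumPy and a simulated randomized T with a = 0.8, b = 1.2 and a direct effect of 0.5, all illustrative values.

```python
import numpy as np

def ols(design, target):
    return np.linalg.lstsq(design, target, rcond=None)[0]

rng = np.random.default_rng(8)
n = 50_000
t = rng.binomial(1, 0.5, size=n).astype(float)
m = 0.8 * t + rng.normal(size=n)                 # a-path = 0.8
y = 0.5 * t + 1.2 * m + rng.normal(size=n)       # direct = 0.5, b-path = 1.2

ones = np.ones(n)
total = ols(np.column_stack([ones, t]), y)[1]            # regression 1: Y on T
a = ols(np.column_stack([ones, t]), m)[1]                # regression 2: M on T
_, direct, b = ols(np.column_stack([ones, t, m]), y)     # regression 3: Y on T and M
indirect = a * b
print(f"total={total:.3f}  direct={direct:.3f}  indirect={indirect:.3f}")
```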

Counterfactual mediation — Robins + Greenland 1992 + Pearl 2001:

Controlled direct effect (CDE) Y(t=1, m) − Y(t=0, m) — effect setting M to a fixed value.
Natural direct effect (NDE) Y(1, M(0)) − Y(0, M(0)) — effect of T fixing M at its natural value under control.
Natural indirect effect (NIE) Y(1, M(1)) − Y(1, M(0)) — change in Y from changing M to its T=1 distribution, holding T fixed at 1.
Total effect = NDE + NIE.

Identification requires “no unmeasured M-Y confounder affected by T” — Pearl’s cross-world independence (often controversial).

Reference: VanderWeele 2015 “Explanation in Causal Inference”.

12. Sensitivity analysis

If unconfoundedness might fail, how much unmeasured confounding would overturn the conclusion?

Rosenbaum bounds (Rosenbaum 2002) — for matched pairs, bound how much the odds of treatment could differ between matched units due to unobserved U before the result loses significance.
E-value (VanderWeele + Ding 2017) — minimum strength of unmeasured confounder (associated with both T and Y) required to explain away the observed RR. Publishable as a single number; widely used in epi.
Cinelli + Hazlett 2020 omitted-variable bias bounds for OLS.
Tipping-point analysis — vary assumptions to find the value at which conclusion flips.
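The E-value for a point estimate has a closed form, RR + sqrt(RR · (RR − 1)), applied to the reciprocal when RR < 1. A minimal sketch in plain Python; the function name e_value is illustrative, and the confidence-limit version is omitted.

```python
import math

def e_value(rr):
    """E-value for a point-estimate risk ratio (VanderWeele + Ding 2017)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    rr = max(rr, 1.0 / rr)          # protective effects: use the reciprocal
    return rr + math.sqrt(rr * (rr - 1.0))

print(round(e_value(2.0), 3))
```

An observed RR of 2 would need an unmeasured confounder associated with both treatment and outcome by a risk ratio of about 3.41 each to be fully explained away.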

13. Heterogeneous treatment effects (HTE) + ML

Beyond a single ATE, we often want CATE τ(x) = E[Y(1) − Y(0) | X = x] — who benefits, who doesn’t.

13.1 Meta-learners

Pre-existing ML regressors wrapped into causal estimators:

S-learner (Single) — fit μ(t, x) = E[Y | T=t, X=x] with T as a feature; τ̂(x) = μ(1, x) − μ(0, x). Biased toward zero when T has low importance.
T-learner (Two) — separate models per arm; high variance in small-treatment-arm regime.
X-learner (Künzel + Sekhon + Bickel + Yu 2019) — use T-learner predictions to impute counterfactuals, then regress imputed individual effects on X, weight by propensity. Strong with imbalanced treatment.
R-learner (Nie + Wager 2021) — Robinson residual-on-residual regression after fitting nuisance models.
DR-learner (Kennedy 2023) — regress AIPW pseudo-outcomes on X.
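The S- and T-learner recipes can be sketched with any regressor. The version below assumes NumPy and uses plain linear least squares as the base learner, purely to keep the example dependency-free; the helper names and the simulated CATE τ(x) = 1 + 2x are illustrative. With this additive base learner the S-learner cannot express any heterogeneity, which is a different failure from the shrinkage noted above.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a prediction function."""
    design = np.column_stack([np.ones(len(X)), X])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def s_learner(X, t, y):
    model = fit_linear(np.column_stack([X, t]), y)      # T enters as one more feature
    return lambda Xn: (model(np.column_stack([Xn, np.ones(len(Xn))]))
                       - model(np.column_stack([Xn, np.zeros(len(Xn))])))

def t_learner(X, t, y):
    m1 = fit_linear(X[t == 1], y[t == 1])               # separate model per arm
    m0 = fit_linear(X[t == 0], y[t == 0])
    return lambda Xn: m1(Xn) - m0(Xn)

rng = np.random.default_rng(9)
n = 20_000
X = rng.normal(size=(n, 1))
t = rng.binomial(1, 0.5, size=n)
tau = 1.0 + 2.0 * X[:, 0]                               # hypothetical true CATE
y = X[:, 0] + tau * t + rng.normal(size=n)

grid = np.array([[-1.0], [0.0], [1.0]])
t_hat = t_learner(X, t, y)(grid)
s_hat = s_learner(X, t, y)(grid)
print(np.round(t_hat, 2), np.round(s_hat, 2))
```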

13.2 Causal Forests

Wager + Athey 2018 — random forest variant for HTE. Each tree splits to maximize heterogeneity of treatment effect rather than outcome variance; honest splitting uses one subsample for splits and another for leaf estimates → valid asymptotic inference.

Generalized Random Forests (GRF) (Athey + Tibshirani + Wager 2019) extend to instrumental forests, quantile forests, local moment-condition forests.

13.3 Double / debiased machine learning

Chernozhukov + Chetverikov + Demirer + Duflo + Hansen + Newey + Robins 2018 — Neyman-orthogonal moment conditions plus cross-fitting allow ML nuisance estimators (boosting, deep nets) to plug into a final causal estimator with √n inference. The DoubleML package (R + Python) implements it.
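The partialling-out idea can be sketched for the partially linear model Y = θ·T + g(X) + ε. The example assumes NumPy and swaps the ML nuisance learners for cubic polynomial least squares to stay dependency-free; it does not use the DoubleML package. Two-fold cross-fitting, the helper names, and the simulated θ = 0.5 are illustrative choices.

```python
import numpy as np

def poly_features(x, degree=3):
    return np.column_stack([x ** d for d in range(degree + 1)])

def cross_fit_residuals(x, target, folds):
    """Out-of-fold residuals target - E_hat[target | x] via polynomial least squares."""
    resid = np.empty_like(target)
    for k in np.unique(folds):
        train, test = folds != k, folds == k
        coef = np.linalg.lstsq(poly_features(x[train]), target[train], rcond=None)[0]
        resid[test] = target[test] - poly_features(x[test]) @ coef
    return resid

rng = np.random.default_rng(10)
n = 20_000
x = rng.uniform(-2, 2, size=n)
t = np.sin(x) + rng.normal(size=n)               # continuous treatment, confounded by x
theta = 0.5                                      # hypothetical true effect
y = theta * t + x + np.cos(x) + rng.normal(size=n)

folds = rng.integers(0, 2, size=n)               # two-fold cross-fitting
t_res = cross_fit_residuals(x, t, folds)
y_res = cross_fit_residuals(x, y, folds)
theta_hat = (t_res @ y_res) / (t_res @ t_res)    # residual-on-residual slope
psi = (y_res - theta_hat * t_res) * t_res        # orthogonal score
se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(t_res ** 2)
naive = np.polyfit(t, y, 1)[0]                   # regression of Y on T ignoring X
print(f"theta_hat={theta_hat:.3f}  se={se:.4f}  naive={naive:.3f}")
```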

13.4 Deep representation learning for CATE

TARNet + CFRNet (Shalit + Johansson + Sontag 2017) — shared representation with two outcome heads + IPM balancing penalty.
Dragonnet (Shi + Blei + Veitch 2019) — three-headed net (μ_0, μ_1, π) + targeted regularization.
CEVAE (Louizos + Shalit + Mooij + Sontag + Zemel + Welling 2017) — latent-variable model for hidden confounders.
Causal Transformer / CT (Melnychuk + Frauen + Feuerriegel 2022) — attention over treatment + covariate + outcome sequences for time-varying treatments.
TransTEE + GCN-based HTE (2023–24) — graph + transformer hybrids.

13.5 Software ecosystem

EconML (Microsoft) — DML, DR-learner, causal forests, deep IV, meta-learners; production-grade Python.
CausalML (Uber) — meta-learners, uplift trees, A/B segmentation.
DoWhy (Microsoft, Pearl-based) — graphical identification + multi-method estimation + refutation suite.
PyMC + pymc-experimental — Bayesian causal modeling.
EconML’s CausalForestDML — production causal forest with DML.

14. Software inventory

Python

DoWhy (Microsoft) — graphical identification + estimation + refutation tests; supports backdoor, IV, frontdoor, mediation.
CausalML (Uber) — uplift trees, meta-learners, sensitivity.
EconML (Microsoft) — DML, causal forests, deep IV, dynamic treatments.
DoubleML — Chernozhukov et al. orthogonal estimators.
pymc + pymc-experimental — Bayesian causal models, ABCs.
statsmodels — basic OLS / IV / 2SLS.
linearmodels — robust IV, panel data, GMM.
rdrobust (Python port) — Calonico-Cattaneo-Titiunik RDD.
CausalNex (QuantumBlack) — structure learning + Bayesian networks.
CausalImpact (Python ports of Google’s R package) — Bayesian structural time series (Brodersen 2015).

R

MatchIt — matching (Ho + Imai + King + Stuart 2011).
twang — generalized boosted modeling for propensity.
WeightIt — IPW + entropy + CBPS.
did (Callaway + Sant’Anna), DRDID, bacondecomp, fixest (high-dim FE).
rdrobust, rddtools — RDD.
AER, ivreg, ivmodel — IV / 2SLS.
dagitty, ggdag — DAG construction + adjustment-set identification.
mediation (Tingley + Yamamoto + Hirose + Keele + Imai 2014) — causal mediation.
EpiABC, tmle — TMLE.
grf — generalized random forests.

Stata

teffects — propensity, IPW, AIPW, matching.
ivregress, ivreg2 — IV.
xtdidregress, didregress, csdid — DiD variants.
rdrobust, rdmulti, rddensity — RDD.

Julia

CausalInference.jl — DAG identification + PC algorithm.
StochasticAD.jl — automatic differentiation for stochastic computation graphs.
TuringGLM.jl, Turing.jl — Bayesian causal models.

15. Applications

15.1 A/B testing + online experimentation

Industry uses RCTs at scale — Microsoft, Google, Netflix, Booking each run thousands of experiments per year. Platforms: Optimizely, GrowthBook, Statsig, Eppo, VWO. Statistical techniques specific to online settings:

Sequential testing — peek without inflating Type I error: mSPRT (Johari + Pekelis + Walsh 2017) and always-valid p-values.
CUPED (Deng + Xu + Kohavi + Walker 2013) — variance reduction by regressing on pre-experiment covariate.
Switchback experiments — for marketplaces with interference.
Synthetic control + ITS for marketing campaigns where unit-level randomization is infeasible.
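The CUPED adjustment from the list above can be sketched in a few lines. This assumes NumPy, a single pre-experiment covariate, and a simulated experiment with a true lift of 0.1; the function name cuped_adjust and all numbers are illustrative.

```python
import numpy as np

def cuped_adjust(y, x_pre):
    """Return y - theta * (x_pre - mean(x_pre)) with theta = cov(y, x_pre) / var(x_pre)."""
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(11)
n = 20_000
x_pre = rng.normal(10, 2, size=n)                # pre-experiment metric
t = rng.permutation(np.repeat([0, 1], n // 2))   # randomized assignment
y = 0.1 * t + 0.9 * x_pre + rng.normal(size=n)   # hypothetical true lift 0.1

y_adj = cuped_adjust(y, x_pre)
raw = y[t == 1].mean() - y[t == 0].mean()
adj = y_adj[t == 1].mean() - y_adj[t == 0].mean()
var_ratio = y_adj.var() / y.var()
print(f"raw={raw:.3f}  cuped={adj:.3f}  variance ratio={var_ratio:.2f}")
```

Because x_pre is measured before assignment, subtracting it leaves the estimate unbiased while removing the outcome variance it explains.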

15.2 Marketing attribution + incrementality

Last-click attribution over-credits ads shown to users who would have converted anyway. Modern alternatives:

Geo experiments (Chen + Au + Au + Cohen 2018, Google) — randomize cities to measure ad lift.
Ghost ads + holdout — randomized exposure suppression.
Media mix modeling (MMM) with Bayesian priors — PyMC-Marketing, Robyn (Meta), LightweightMMM (Google).

15.3 Pricing experiments

Customer-level price experiments are often ethically + legally constrained. Used: A/B-tested email coupons (legal), price-elasticity discontinuities (RDD around list-price breaks), conjoint analysis (stated preference).

15.4 Medical trials

RCT remains the regulatory standard (FDA, EMA). But observational real-world-evidence is rising — platforms include TriNetX, Aetion, Flatiron. Common methods: PSM, IPW, TMLE on EHR data. COVID-19 vaccine effectiveness studies illustrate both designs: the BNT162b2 RCT (Polack + Thomas 2020) and the observational study of the Pfizer vaccine in Israel (Dagan 2021).

15.5 Policy evaluation

J-PAL (Abdul Latif Jameel Poverty Action Lab) runs RCT-based development economics. Banerjee + Duflo + Kremer received the 2019 Nobel for the experimental approach. Examples: deworming (Miguel + Kremer 2004 — contested by 2015 reanalysis), microcredit (Banerjee + Duflo 2015), conditional cash transfers (PROGRESA Mexico — Schultz 2004).

15.6 Education

Hanushek + Rivkin 2010 value-added teacher effects (causal interpretation contested).
Chetty + Friedman + Rockoff 2014 long-run teacher quality effects.
STAR experiment (Tennessee class-size RCT, Krueger 1999).
Project Follow Through, Head Start RDD around eligibility.

15.7 Climate attribution

CMIP6-based detection + attribution: did a specific extreme event become more likely under anthropogenic forcing? Methods: probabilistic event attribution (Stott + Stone + Allen 2004 European 2003 heatwave); pseudo-global warming experiments; ML-based attribution (Diffenbaugh + Burke 2019).

15.8 Cybersecurity + fraud

Counterfactual reasoning: what would the loss have been absent this detection rule? Used in lift modeling for fraud detection, evaluation of phishing-training programs, intrusion-detection ROI.

16. Recent topics (2024–26)

Targeted Learning + TMLE — van der Laan school continues; the tlverse R ecosystem matures.
Causal Transformer + LLM-assisted causal discovery — Kıcıman + Ness + Sharma + Tan 2024 “Causal Reasoning and LLMs” — GPT-style models scoring causal-discovery benchmarks; combined with PC + GES algorithms.
Causal abstraction + intervention abstraction (Geiger + Lu + Icard + Potts 2021–25) — when can a “high-level” causal model be realized by a neural network?
Soft interventions + dynamic treatment regimes (DTR) — Murphy 2003, Chakraborty + Moodie 2013; reinforcement-learning style adaptive treatments (Q-learning, A-learning).
Recurrent + temporal causal — Robins g-methods: G-computation, marginal structural models with IPW, G-formula for time-varying confounders that are also intermediates.
Dynamic causal graphs (DCG) + graphical learning with neural nets — NOTEARS (Zheng + Aragam + Ravikumar + Xing 2018) makes DAG structure learning differentiable; subsequent DAGMA, GOLEM.
Causal reinforcement learning — Bareinboim + Pearl 2016 “data fusion” theory; off-policy evaluation with do-calculus identification.
Fairness as counterfactual — Kusner + Loftus + Russell + Silva 2017 counterfactual fairness; path-specific fairness via mediation.
Causal representation learning (Schölkopf + Locatello + Bauer + Ke + Kalchbrenner + Goyal + Bengio 2021 “Toward Causal Representation Learning”) — disentangle factors of variation as causal variables.

17. Pitfalls

Treating a prediction model as causal — P(Y | T) ≠ P(Y | do(T)) in general. Reading XGBoost feature importance as “what to intervene on” is a category error.
SUTVA violation under network interference — vaccination, fashion, peer effects. Need network-aware estimators (Aronow + Samii 2017; Hudgens + Halloran 2008).
Forgetting overlap — fitting outcome models in regions with no treated (or no control) data extrapolates the model, not the treatment.
Selection bias / missing-not-at-random — survivors, responders, click-through.
Internal vs external validity — a clean RCT in one population may not transport. Transportability + generalizability literature (Pearl + Bareinboim 2014, Westreich + Edwards + Lesko + Cole + Stuart 2017).
p-hacking + multiple testing / “garden of forking paths” (Gelman + Loken 2014) — pre-registration helps; Bonferroni / Benjamini-Hochberg adjust.
Statistical vs practical significance — a tiny but statistically significant ATE may not justify the policy.
Reverse causation + simultaneity — Y influences T as well as T influences Y. Lagged designs, IV, dynamic models.
Mistaking auxiliary balance for unconfoundedness — covariate balance on X does not imply balance on unmeasured U.

18. Cross-references

[[Math/probability-fundamentals]] — joint distributions, conditional independence, expectations underpin every causal estimand.
[[Math/hypothesis-testing-mle]] — frequentist test machinery for ATE confidence intervals + sequential testing.
[[Math/bayesian-inference]] — Bayesian causal inference, MCMC posterior over potential outcomes, Bayesian DAGs.
[[Math/_index]] — full Math reference index.
[[Compute/transformer-architecture]] — RLHF reward modeling has causal interpretation (reward = counterfactual outcome under intervention on action).
[[Engineering/six-sigma]] — Design of Experiments (DOE) is industrial causal inference; factorial + fractional factorial designs.

19. Citations

Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
Pearl, J. + Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
Hernán, M.A. + Robins, J.M. (2020). Causal Inference: What If. Chapman & Hall / CRC. Free online.
Imbens, G.W. + Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.
Rubin, D.B. (1974). “Estimating causal effects of treatments in randomized and nonrandomized studies.” Journal of Educational Psychology 66(5): 688-701.
Neyman, J. (1923). “On the Application of Probability Theory to Agricultural Experiments.” Translated 1990 Statistical Science.
Holland, P.W. (1986). “Statistics and Causal Inference.” JASA 81(396): 945-960.
Rosenbaum, P.R. + Rubin, D.B. (1983). “The central role of the propensity score in observational studies for causal effects.” Biometrika 70(1): 41-55.
Robins, J.M. + Hernán, M.A. + Brumback, B. (2000). “Marginal Structural Models and Causal Inference in Epidemiology.” Epidemiology 11(5).
Robins, J.M. + Rotnitzky, A. (1995). “Semiparametric Efficiency in Multivariate Regression Models with Missing Data.” JASA 90(429).
Pearl, J. (1995). “Causal diagrams for empirical research.” Biometrika 82(4): 669-688.
Tian, J. + Pearl, J. (2003). “A general identification condition for causal effects.” AAAI.
Shpitser, I. + Pearl, J. (2006). “Identification of conditional interventional distributions.” UAI.
Wager, S. + Athey, S. (2018). “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.” JASA 113(523): 1228-1242.
Athey, S. + Tibshirani, J. + Wager, S. (2019). “Generalized Random Forests.” Annals of Statistics 47(2).
Chernozhukov, V. et al. (2018). “Double/debiased machine learning for treatment and structural parameters.” Econometrics Journal 21(1).
Künzel, S.R. + Sekhon, J.S. + Bickel, P.J. + Yu, B. (2019). “Metalearners for estimating heterogeneous treatment effects using machine learning.” PNAS 116(10).
Nie, X. + Wager, S. (2021). “Quasi-Oracle Estimation of Heterogeneous Treatment Effects.” Biometrika 108(2).
Shalit, U. + Johansson, F.D. + Sontag, D. (2017). “Estimating Individual Treatment Effect: generalization bounds and algorithms.” ICML.
VanderWeele, T.J. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press.
VanderWeele, T.J. + Ding, P. (2017). “Sensitivity Analysis in Observational Research: Introducing the E-Value.” Annals of Internal Medicine 167(4).
Card, D. + Krueger, A.B. (1994). “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” AER 84(4).
Callaway, B. + Sant’Anna, P.H.C. (2021). “Difference-in-Differences with multiple time periods.” J. Econometrics 225(2).
Goodman-Bacon, A. (2021). “Difference-in-Differences with variation in treatment timing.” J. Econometrics 225(2).
Sun, L. + Abraham, S. (2021). “Estimating dynamic treatment effects in event studies with heterogeneous treatment effects.” J. Econometrics 225(2).
de Chaisemartin, C. + D’Haultfœuille, X. (2020). “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” AER 110(9).
Abadie, A. + Diamond, A. + Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” JASA 105(490).
Abadie, A. + Gardeazabal, J. (2003). “The Economic Costs of Conflict: A Case Study of the Basque Country.” AER 93(1).
Thistlethwaite, D.L. + Campbell, D.T. (1960). “Regression-discontinuity analysis: an alternative to the ex post facto experiment.” J. Educ. Psychology 51(6).
Lee, D.S. (2008). “Randomized experiments from non-random selection in U.S. House elections.” J. Econometrics 142(2).
Cattaneo, M.D. + Idrobo, N. + Titiunik, R. (2020). A Practical Introduction to Regression Discontinuity Designs. Cambridge.
Calonico, S. + Cattaneo, M.D. + Titiunik, R. (2014). “Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs.” Econometrica 82(6).
Imbens, G.W. + Angrist, J.D. (1994). “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62(2).
Angrist, J.D. (1990). “Lifetime Earnings and the Vietnam Era Draft Lottery.” AER 80(3).
Stock, J.H. + Yogo, M. (2005). “Testing for weak instruments in linear IV regression.” In Identification and Inference for Econometric Models.
Robins, J.M. + Greenland, S. (1992). “Identifiability and Exchangeability for Direct and Indirect Effects.” Epidemiology 3(2).
Pearl, J. (2001). “Direct and Indirect Effects.” UAI.
Iacus, S.M. + King, G. + Porro, G. (2012). “Causal Inference Without Balance Checking: Coarsened Exact Matching.” Political Analysis 20(1).
Abadie, A. + Imbens, G.W. (2006). “Large Sample Properties of Matching Estimators for Average Treatment Effects.” Econometrica 74(1).
van der Laan, M.J. + Rubin, D. (2006). “Targeted Maximum Likelihood Learning.” IJB 2(1).
Brodersen, K.H. et al. (2015). “Inferring causal impact using Bayesian structural time-series models.” Annals of Applied Statistics 9(1).
Bareinboim, E. + Pearl, J. (2016). “Causal inference and the data-fusion problem.” PNAS 113(27).
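Banerjee, A. + Duflo, E. (2011). Poor Economics. PublicAffairs. The authors shared the 2019 Nobel Memorial Prize in Economic Sciences (with Michael Kremer) for their experimental approach to alleviating global poverty.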
