Causal Inference — Math Reference
1. At a glance
“Correlation is not causation” — but causal inference is the formal apparatus
that tells us when, and under what assumptions, an associational quantity
can be read as causal. The central question is counterfactual: what would
happen to outcome Y if we intervened to set treatment T? This is distinct
from prediction. A predictive model asks P(Y | T = t, X = x) (what we
observe when T happens to take value t); a causal model asks
P(Y | do(T = t), X = x) (what we would see if we forced T to t,
overriding its natural mechanism).
Two main frameworks dominate:
- Potential outcomes (Neyman 1923; Rubin 1974): every unit has a vector of counterfactual outcomes Y_i(t) for each treatment level. The fundamental problem is that only one is ever observed.
- Graphical / structural causal models (Pearl 1995, 2009): DAGs encode causal structure; do-calculus formally manipulates interventional distributions; backdoor + frontdoor criteria identify estimable effects.
Modern unification is achieved by Imbens + Rubin 2015, Pearl 2009, and Hernán + Robins “What If” 2020 — the three texts that together define the contemporary canon. The frameworks are equivalent in expressive power for the questions both can pose; choice is largely a matter of taste and problem geometry.
Practical use cases:
- Policy evaluation (job training, minimum wage, welfare programs)
- A/B testing + online experimentation
- Medical efficacy (RCTs + observational drug-effect studies)
- Ad attribution + marketing incrementality
- Pricing + recommendation lift
- Climate attribution (was this hurricane “caused by” warming?)
- Algorithmic fairness + counterfactual fairness
2. Potential outcomes framework
For each unit i = 1, ..., n define:
- Treatment assignment T_i ∈ {0, 1} (binary case; extends to multi-valued + continuous).
- Potential outcomes Y_i(0) (under control) and Y_i(1) (under treatment). These are both defined for every unit, but only one is realized.
- Individual causal effect τ_i = Y_i(1) − Y_i(0).
Fundamental Problem of Causal Inference (Holland 1986): we observe
Y_i = Y_i(T_i), never both Y_i(0) and Y_i(1) for the same unit.
Individual effects are therefore never directly estimable; we estimate
population averages across units.
Key estimands:
- ATE (Average Treatment Effect) τ = E[Y(1) − Y(0)]: population mean effect.
- ATT (Average Treatment effect on the Treated) τ_ATT = E[Y(1) − Y(0) | T = 1]: effect on those who actually received treatment. Often the policy-relevant quantity (e.g. “did the training program work for those who attended?”).
- ATU (Average Treatment effect on the Untreated) E[Y(1) − Y(0) | T = 0].
- CATE (Conditional ATE) τ(x) = E[Y(1) − Y(0) | X = x]: covariate-conditional, basis for personalization (see §13).
- ATO (Overlap-weighted ATE): re-weights toward units whose propensity is near 0.5; Li + Morgan + Zaslavsky 2018.
The naive estimator E[Y | T = 1] − E[Y | T = 0] decomposes as
ATE_naive = ATE + selection_bias + treatment_heterogeneity_bias
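where selection bias is E[Y(0) | T=1] − E[Y(0) | T=0] (baseline
differences between groups) and the heterogeneity term is
(1 − P(T=1)) · (ATT − ATU) (the treated respond differently from the untreated).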
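A small numeric check of the decomposition, as a sketch only. It assumes a hypothetical table of six units in which both potential outcomes are visible (never the case in real data) and takes the heterogeneity term to be (1 − p)(ATT − ATU) with p = P(T = 1).

```python
from statistics import mean

# Hypothetical units: (T, Y0, Y1). Both potential outcomes are known here
# only because this is a toy table.
units = [
    (1, 5.0, 9.0),
    (1, 6.0, 9.0),
    (1, 7.0, 10.0),
    (0, 2.0, 3.0),
    (0, 3.0, 4.0),
    (0, 4.0, 6.0),
]

treated = [u for u in units if u[0] == 1]
control = [u for u in units if u[0] == 0]
p = len(treated) / len(units)  # share treated

# What the analyst can compute: observed treated mean minus observed control mean.
naive = mean(y1 for _, _, y1 in treated) - mean(y0 for _, y0, _ in control)

# What only the toy table reveals.
ate = mean(y1 - y0 for _, y0, y1 in units)
att = mean(y1 - y0 for _, y0, y1 in treated)
atu = mean(y1 - y0 for _, y0, y1 in control)
selection_bias = mean(y0 for _, y0, _ in treated) - mean(y0 for _, y0, _ in control)
heterogeneity_bias = (1 - p) * (att - atu)

print("naive:", naive)
print("ATE + selection + heterogeneity:", ate + selection_bias + heterogeneity_bias)
```

In this table the treated start from a higher baseline and also respond more strongly, so both bias terms are positive and the naive contrast overstates the ATE.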
3. Identification assumptions
To estimate E[Y(t)] from observed (T, Y, X) data, three core
assumptions are required (Rubin’s “stable, ignorable, overlapping”):
- SUTVA (Stable Unit Treatment Value Assumption), Rubin 1980.
  - No interference: unit i’s outcome depends only on unit i’s treatment, not on others’. Violated by network effects, herd immunity, market equilibrium spillovers.
  - Consistent treatment: a single, well-defined t=1 (no “hidden versions” of treatment).
- Ignorability / Unconfoundedness / Strong Ignorability (Y(0), Y(1)) ⫫ T | X: given covariates X, treatment is as good as random. The unobserved counterfactual distributions are the same in the treated and control groups conditional on X. Untestable from data alone; rests on substantive domain knowledge.
- Positivity / Overlap / Common Support 0 < P(T = 1 | X = x) < 1 for all x in the support. If no treated units exist at some X = x, there is no information about E[Y(1) | X = x].
- Consistency Y_i = T_i · Y_i(1) + (1 − T_i) · Y_i(0): the observed outcome equals the potential outcome under the realized treatment (often folded into SUTVA).
Under (ignorability + positivity + consistency + SUTVA), the g-formula identifies:
E[Y(t)] = E_X[E[Y | T = t, X]]
— a standardization across the covariate distribution.
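The standardization can be sketched for a single discrete covariate as follows. The function name g_formula and the eight-row dataset are hypothetical; stratum means stand in for whatever outcome model would be used in practice.

```python
from collections import defaultdict
from statistics import mean

def g_formula(records, t):
    """Standardized mean E_X[E[Y | T=t, X]] for a discrete covariate.

    records: iterable of (x, t, y) tuples.
    Raises ValueError if some stratum has no unit with treatment t
    (a positivity violation in the sample).
    """
    records = list(records)
    n = len(records)
    stratum_size = defaultdict(int)
    outcomes_at_t = defaultdict(list)
    for x, ti, y in records:
        stratum_size[x] += 1
        if ti == t:
            outcomes_at_t[x].append(y)
    total = 0.0
    for x, size in stratum_size.items():
        if not outcomes_at_t[x]:
            raise ValueError(f"no units with T={t} in stratum X={x!r}")
        # Weight the stratum-specific mean by the marginal share of the stratum.
        total += (size / n) * mean(outcomes_at_t[x])
    return total

# Hypothetical data: X=1 units are both more often treated and have higher outcomes.
records = [
    (0, 1, 3.0), (0, 0, 1.0), (0, 0, 1.0), (0, 0, 1.0),
    (1, 1, 6.0), (1, 1, 6.0), (1, 1, 6.0), (1, 0, 4.0),
]
ate = g_formula(records, 1) - g_formula(records, 0)
naive = (mean(y for _, t, y in records if t == 1)
         - mean(y for _, t, y in records if t == 0))
print("standardized ATE:", ate, "naive difference:", naive)
```

The within-stratum effect is 2 in both strata, which the standardized contrast recovers, while the unadjusted difference is inflated by the confounding through X.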
4. Randomized Controlled Trial (RCT)
The gold standard. By design T ⫫ (Y(0), Y(1), X), so unconfoundedness
holds unconditionally and ATE = E[Y | T=1] − E[Y | T=0] is identified
by a simple difference in means. Standard error from a two-sample
t-test or Neyman variance bound.
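A minimal sketch of the difference-in-means estimator with the conservative Neyman standard error, sqrt(s1²/n1 + s0²/n0). The outcome values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def diff_in_means(y_treated, y_control):
    """Return (ATE estimate, conservative Neyman standard error)."""
    est = mean(y_treated) - mean(y_control)
    # statistics.variance is the sample variance (divides by n - 1).
    se = sqrt(variance(y_treated) / len(y_treated)
              + variance(y_control) / len(y_control))
    return est, se

# Hypothetical outcomes from a small randomized experiment.
y1 = [12.0, 14.0, 13.0, 15.0]
y0 = [10.0, 11.0, 9.0, 10.0]
est, se = diff_in_means(y1, y0)
print(f"ATE estimate {est:.2f}, SE {se:.3f}")
```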
Variants:
- Completely randomized: a fixed number of units is drawn at random for treatment (assigning each unit independently with probability p is the closely related Bernoulli design).
- Stratified randomization: randomize within strata of pre-treatment covariates; reduces variance, ensures balance on key features.
- Block / cluster randomized: randomize whole groups (schools, villages); needed when interference within cluster is unavoidable.
- Factorial designs: multiple treatments simultaneously, estimate main effects + interactions; common in industrial DOE (see [[Engineering/six-sigma]]).
- Adaptive + multi-arm bandit RCTs: Thompson sampling, response-adaptive randomization; ethical efficiency at cost of statistical complexity (Berry + Berry).
- Crossover designs: each unit receives both treatments at different times; requires no carryover.
Limitations: ethical constraints (you cannot randomize cancer onset); expense; ecological validity (does the lab effect generalize?); slow; hard for rare outcomes; political resistance; non-compliance (units assigned to treatment that didn’t take the pill) introduces the ITT vs LATE distinction (see §8.3).
5. Observational study tools
When RCTs are infeasible, observational designs attempt to recover unconfoundedness by conditioning on enough X to make treatment quasi-random.
5.1 Matching
Find for each treated unit a control unit “similar” on X:
- Exact matching — exact match on all covariates; impractical with more than a handful of discrete X.
- Nearest-neighbor matching — minimize a distance metric (Mahalanobis, Euclidean) on standardized X; can be 1:1, k:1, with or without replacement.
- Coarsened exact matching (CEM) (Iacus + King + Porro 2012) — discretize covariates into bins, exact-match within bins.
- Optimal full matching (Hansen 2004) — assign each treated unit to a (potentially unbalanced) cluster of controls minimizing global distance.
- Caliper matching — reject matches beyond a propensity-score distance threshold (reduces bias, increases variance from dropped units).
After matching, estimate ATT by mean difference in matched pairs. Standard errors via Abadie + Imbens 2006 robust variance.
5.2 Propensity score (Rosenbaum + Rubin 1983)
Define π(x) = P(T = 1 | X = x). Key theorem: if unconfoundedness
holds given X, it also holds given the scalar π(X). This dimensionality
reduction is the engine of all propensity-based methods.
Estimation: logistic regression, gradient boosting (XGBoost), calibrated random forests, neural-net classifiers.
Uses:
- Propensity score matching (PSM) — match on π(X) instead of X.
- Propensity stratification — bin into 5–10 strata of π(X), compute within-stratum effect, weight by stratum size.
- Propensity weighting (IPW) — see 5.3.
- Doubly robust — combine outcome + propensity model (5.4).
Diagnostics: check covariate balance (standardized mean difference < 0.1 post-matching), overlap (histogram of π(X) by treatment group), sensitivity (Rosenbaum bounds).
5.3 Inverse propensity weighting (IPW)
Horvitz + Thompson 1952 estimator adapted to causal inference:
τ_IPW = (1/n) Σ [ T_i · Y_i / π(X_i) − (1 − T_i) · Y_i / (1 − π(X_i)) ]
Intuition: re-weight observed data so that treated and control groups share the same covariate distribution (that of the full sample). Variance explodes when π(X) is near 0 or 1. Stabilized IPW (Robins + Hernán + Brumback 2000) puts the marginal treatment probability in the numerator of each weight; truncated IPW clips extreme weights at chosen percentiles (e.g. the 1st and 99th).
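A sketch of the estimator, assuming the propensities are already known or estimated. The clip argument is a hypothetical simplification that bounds the propensities directly rather than truncating weights at percentiles.

```python
def ipw_ate(t, y, pi, clip=None):
    """Horvitz-Thompson style IPW estimate of the ATE.

    t, y, pi: equal-length sequences of treatment (0/1), outcome, and
    propensity score. clip=(lo, hi) optionally bounds the propensities,
    a simple stand-in for percentile-based weight truncation.
    """
    if clip is not None:
        lo, hi = clip
        pi = [min(max(p, lo), hi) for p in pi]
    n = len(y)
    return sum(ti * yi / p - (1 - ti) * yi / (1 - p)
               for ti, yi, p in zip(t, y, pi)) / n

# Hypothetical data: first four units have propensity 0.25, last four 0.75.
t = [1, 0, 0, 0, 1, 1, 1, 0]
y = [3.0, 1.0, 1.0, 1.0, 6.0, 6.0, 6.0, 4.0]
pi = [0.25] * 4 + [0.75] * 4

print("IPW with true propensities:", ipw_ate(t, y, pi))
print("IPW with clipped propensities:", ipw_ate(t, y, pi, clip=(0.4, 0.6)))
```

With the true propensities the estimate matches the within-stratum effect of 2; clipping lowers variance in general but, as the second call shows, moves the estimate away from that value.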
5.4 Doubly robust estimators
Robins + Rotnitzky 1995. Combine an outcome regression μ_t(X) = E[Y | T=t, X] with the propensity score π(X):
AIPW (Augmented IPW):
τ_AIPW = (1/n) Σ [ μ_1(X_i) − μ_0(X_i)
+ T_i (Y_i − μ_1(X_i)) / π(X_i)
− (1 − T_i)(Y_i − μ_0(X_i)) / (1 − π(X_i)) ]
The “double” robustness: τ_AIPW is consistent if either μ_t or π is correctly specified; both wrong → bias. Variance is minimized when both are correct (semiparametric efficiency bound; Hahn 1998).
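The double-robustness property can be checked on a toy dataset. This is a sketch: the nuisance values are supplied by hand (correct or deliberately wrong) instead of being fitted, and the data are hypothetical.

```python
def aipw_ate(t, y, pi, mu0, mu1):
    """AIPW estimate from per-unit propensities and outcome predictions."""
    n = len(y)
    total = 0.0
    for ti, yi, p, m0, m1 in zip(t, y, pi, mu0, mu1):
        total += (m1 - m0
                  + ti * (yi - m1) / p
                  - (1 - ti) * (yi - m0) / (1 - p))
    return total / n

# Hypothetical data: two strata of four units; the true effect is 2 in each.
t = [1, 0, 0, 0, 1, 1, 1, 0]
y = [3.0, 1.0, 1.0, 1.0, 6.0, 6.0, 6.0, 4.0]

pi_right = [0.25] * 4 + [0.75] * 4   # true propensities
pi_wrong = [0.5] * 8                 # ignores the covariate
mu0_right = [1.0] * 4 + [4.0] * 4    # true E[Y | T=0, X]
mu1_right = [3.0] * 4 + [6.0] * 4    # true E[Y | T=1, X]
mu_wrong = [0.0] * 8                 # useless outcome model

only_pi_right = aipw_ate(t, y, pi_right, mu_wrong, mu_wrong)
only_mu_right = aipw_ate(t, y, pi_wrong, mu0_right, mu1_right)
both_wrong = aipw_ate(t, y, pi_wrong, mu_wrong, mu_wrong)
print(only_pi_right, only_mu_right, both_wrong)
```

Either correct nuisance alone recovers the effect of 2; with both wrong the estimate falls back to the confounded difference in means.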
TMLE (Targeted Maximum Likelihood Estimation) — Mark van der Laan + Rubin 2006; iteratively updates μ to be “targeted” toward the estimand; combines well with SuperLearner ensembles.
6. Difference-in-Differences (DiD)
Card + Krueger 1994 (NJ vs PA fast-food employment after NJ minimum wage hike) is the canonical study. Two-period two-group setup:
- Pre-period t = 0 and post-period t = 1.
- Groups g ∈ {0, 1} (control, treated). Only group 1 receives treatment, and only in period 1.
ATT_DiD = (E[Y_{t=1} | g=1] − E[Y_{t=0} | g=1])
− (E[Y_{t=1} | g=0] − E[Y_{t=0} | g=0])
Identifying assumption: parallel trends — in absence of treatment, mean outcome would have evolved identically in both groups. Untestable; visually inspected with pre-period trends.
Equivalent regression form:
Y_it = α_i + λ_t + β · (g_i · post_t) + ε_it
— unit and time fixed effects, coefficient on the interaction is ATT.
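The 2x2 estimator reduces to four cell means. A minimal sketch on hypothetical data, where did_2x2 is an illustrative helper name:

```python
from statistics import mean

def did_2x2(rows):
    """Difference-in-differences from rows of (g, post, y), g and post in {0, 1}."""
    cell = {(g, p): [] for g in (0, 1) for p in (0, 1)}
    for g, post, y in rows:
        cell[(g, post)].append(y)
    m = {k: mean(v) for k, v in cell.items()}
    treated_change = m[(1, 1)] - m[(1, 0)]
    control_change = m[(0, 1)] - m[(0, 0)]
    return treated_change - control_change

# Hypothetical data: control trends up by 3, treated changes by 8.
rows = [
    (0, 0, 10.0), (0, 0, 12.0), (0, 1, 13.0), (0, 1, 15.0),
    (1, 0, 20.0), (1, 0, 22.0), (1, 1, 28.0), (1, 1, 30.0),
]
att = did_2x2(rows)
print("DiD estimate of ATT:", att)
```

The level gap between groups (about 10 units before treatment) drops out; only the difference in changes is attributed to treatment, which is exactly what parallel trends licenses.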
6.1 Two-way fixed effects (TWFE) under heterogeneous timing
When treatment is staggered (different units treated at different times), TWFE is not a clean DiD. Goodman-Bacon 2021 showed TWFE is a weighted average of all possible 2x2 DiDs, including “forbidden” comparisons where already-treated units serve as controls — yielding negative weights and biased estimates under heterogeneous effects.
Modern fixes:
- Callaway + Sant’Anna 2021: group-time ATTs aggregated with positive weights; the did package in R, csdid in Stata.
- Sun + Abraham 2021: interaction-weighted estimator using never-treated (or last-treated) cohorts as controls.
- de Chaisemartin + D’Haultfœuille 2020 — multiple-period DiD with heterogeneous adoption.
- Borusyak + Jaravel + Spiess 2024 — imputation estimator.
6.2 Synthetic Control
Abadie + Gardeazabal 2003 (Basque terrorism) + Abadie + Diamond +
Hainmueller 2010 (California Prop 99 anti-tobacco). When N_treated = 1
and standard DiD has no comparable control, construct counterfactual as
weighted combination of donor units:
Y_t(0)_treated ≈ Σ_j w_j · Y_t_j
with w_j ≥ 0, Σ w_j = 1, weights chosen to match pre-treatment
trajectory + covariates of the treated unit.
Famous applications: California Prop 99 (smoking declined ~25% relative to synthetic California), German reunification (a synthetic West Germany built from other OECD countries serves as the no-reunification counterfactual).
Extensions: augmented synthetic control (Ben-Michael + Feller + Rothstein 2021) corrects for imperfect pre-period fit; generalized synthetic control (Xu 2017) for multiple treated units; matrix completion (Athey + Bayati + Doudchenko + Imbens + Khosravi 2021).
7. Regression Discontinuity Design (RDD)
Thistlethwaite + Campbell 1960 (scholarship effect via test-score
threshold). Treatment is assigned by whether a running variable X
crosses a cutoff c:
- Sharp RDD: T = 𝟙(X ≥ c) (deterministic).
- Fuzzy RDD: P(T = 1 | X) jumps at c but is not 0/1.
Identification: units just above + below the cutoff are quasi-randomly similar; in the limit, this approximates a local RCT.
Estimator: local linear regression on a bandwidth around c:
Y_i = α + τ · 𝟙(X_i ≥ c) + β_1 (X_i − c) + β_2 (X_i − c) · 𝟙(X_i ≥ c) + ε_i
— τ is the local ATE at the cutoff.
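A sketch of the sharp-RDD estimate with a uniform kernel. Fitting a separate line on each side of the cutoff and differencing the intercepts gives the same τ as the interacted regression above. The bandwidth is chosen by hand and the data are noise-free and hypothetical; data-driven bandwidths and bias correction of the rdrobust kind are not attempted.

```python
def _ols_line(xs, ys):
    """Intercept and slope of a simple least-squares line.

    Needs at least two distinct x values, otherwise the slope is undefined.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def sharp_rdd(x, y, cutoff, bandwidth):
    """Local linear sharp-RDD estimate of the jump at the cutoff."""
    # Center the running variable so each intercept is the fitted value at c.
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y)
            if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y)
             if cutoff <= xi <= cutoff + bandwidth]
    a_left, _ = _ols_line(*zip(*left))
    a_right, _ = _ols_line(*zip(*right))
    return a_right - a_left

# Hypothetical data: slope changes and level jumps by 3 at the cutoff 0.
x = [i / 10 for i in range(-10, 11)]
y = [1 + 2 * xi if xi < 0 else 4 + 2.5 * xi for xi in x]
tau = sharp_rdd(x, y, cutoff=0.0, bandwidth=0.5)
print("estimated jump at cutoff:", tau)
```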
Bandwidth selection: Imbens + Kalyanaraman 2012; Calonico +
Cattaneo + Titiunik 2014 robust bias-corrected inference + the
rdrobust package. Modern guide: Cattaneo + Idrobo + Titiunik 2020
(2-volume Cambridge monograph).
Applications:
- Scholarship eligibility cutoffs (test-score thresholds).
- Election RDD — Lee 2008, incumbency advantage from close-vote wins.
- Medicare eligibility at age 65 (Card + Dobkin + Maestas 2008).
- Class-size effects from Maimonides’ rule (Angrist + Lavy 1999).
Validity checks: no manipulation of X around c (McCrary 2008 density
test); covariate balance at cutoff; placebo cutoffs.
8. Instrumental Variables (IV)
Used when treatment is endogenous (correlated with unobserved
confounders). Find an instrument Z satisfying:
- Relevance: Cov(Z, T) ≠ 0, i.e. Z affects T.
- Exclusion: Z affects Y only through T (no direct path).
- Independence: Z is independent of unobserved confounders.
8.1 Wald estimator
Binary Z, binary T:
τ_Wald = (E[Y | Z=1] − E[Y | Z=0]) / (E[T | Z=1] − E[T | Z=0])
= (intent-to-treat effect on Y) / (first-stage effect on T)
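A minimal sketch of the ratio on hypothetical data with one-sided non-compliance: three of four units are compliers with an effect of 4, and one is a never-taker.

```python
from statistics import mean

def wald_estimate(z, t, y):
    """Wald / IV estimate for a binary instrument z and binary treatment t."""
    def mean_at(values, level):
        return mean(v for v, zi in zip(values, z) if zi == level)

    itt = mean_at(y, 1) - mean_at(y, 0)          # effect of Z on Y
    first_stage = mean_at(t, 1) - mean_at(t, 0)  # effect of Z on T
    if first_stage == 0:
        raise ZeroDivisionError("instrument has no first stage")
    return itt / first_stage

# Hypothetical data: the last unit in each arm never takes treatment.
z = [0, 0, 0, 0, 1, 1, 1, 1]
t = [0, 0, 0, 0, 1, 1, 1, 0]
y = [3.0, 3.0, 3.0, 2.0, 7.0, 7.0, 7.0, 2.0]
late = wald_estimate(z, t, y)
print("Wald estimate:", late)
```

The intent-to-treat effect is 3 and the first stage is 0.75, so the ratio returns the complier effect of 4 rather than the diluted ITT.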
8.2 Two-stage least squares (2SLS)
Continuous case. Stage 1: regress T on Z (and controls X) to get
T̂. Stage 2: regress Y on T̂ (and X). Coefficient on T̂ is IV
estimate.
8.3 LATE (Local Average Treatment Effect)
Imbens + Angrist 1994 LATE theorem: under monotonicity (no “defiers”), IV identifies the average treatment effect on compliers — units who would take T = 1 if Z = 1 and T = 0 if Z = 0. Not ATE unless effects are homogeneous.
Compliance types:
- Always-takers: T = 1 regardless of Z.
- Never-takers: T = 0 regardless of Z.
- Compliers: T = Z.
- Defiers: T = 1 − Z (assumed away).
8.4 Weak instruments + diagnostics
If Cov(Z, T) is small, 2SLS is biased toward OLS and confidence
intervals are unreliable. The first-stage F > 10 rule of thumb goes
back to Staiger + Stock 1997, with Stock + Yogo 2005 supplying formal
critical values; Lee + McCrary + Moreira + Porter 2022 tighten to
F > 104.7 for reliable 5% inference.
8.5 Famous instruments
- Vietnam draft lottery (Angrist 1990) — lottery number as instrument for military service effect on earnings.
- Quarter of birth (Angrist + Krueger 1991) — instruments for schooling via compulsory-attendance laws; later criticized as weak.
- Mendelian randomization — genetic variants as instruments in epidemiology (Smith + Ebrahim 2003); SNPs are randomized at conception.
- Distance to college (Card 1993) — proximity instruments for educational attainment.
- Judge fixed effects (Kling 2006) — random assignment of judges with different sentencing tendencies.
9. Pearl’s structural framework
9.1 DAGs (Directed Acyclic Graphs)
Nodes are variables, directed edges represent direct causal effects. The graph implies a factorization:
P(V_1, ..., V_n) = Π_i P(V_i | parents(V_i))
A Structural Causal Model (SCM) specifies for each variable a
function V_i = f_i(parents(V_i), U_i) with exogenous noise U_i.
9.2 d-separation
A graphical criterion for conditional independence. A path between X and Y is blocked by a set Z if it contains:
- A chain A → B → C with B ∈ Z.
- A fork A ← B → C with B ∈ Z.
- A collider A → B ← C with B ∉ Z and no descendant of B in Z.
If all paths between X and Y are blocked by Z, then X ⫫ Y | Z in any
distribution compatible with the DAG.
9.3 do-operator + interventional distribution
do(T = t) represents an intervention: replace the structural
equation for T with T := t, severing all incoming edges. Then
P(Y | do(T = t)) is the post-intervention distribution.
Generally P(Y | T = t) ≠ P(Y | do(T = t)) — they coincide only under
specific structural conditions (e.g. no backdoor paths).
9.4 Do-calculus (Pearl 1995, 2009)
Three rules that transform P(Y | do(T), Z, W) expressions:
- Rule 1 (Insertion / deletion of observations): ignore observations not affecting the outcome.
- Rule 2 (Action / observation exchange): replace do(X) with observing X when no backdoor remains.
- Rule 3 (Insertion / deletion of actions): drop do(X) when X doesn’t affect Y in the mutilated graph.
Tian + Pearl 2003 + Shpitser + Pearl 2006 give the ID algorithm: a
complete procedure deciding whether P(Y | do(T)) is identifiable from
the observational distribution + DAG, and if so produces the
estimating formula.
9.5 Backdoor criterion
A set Z satisfies the backdoor criterion for (T, Y) if:
- No node in Z is a descendant of T.
- Z blocks every path between T and Y that has an arrow into T (backdoor path).
Then P(Y | do(T)) = Σ_z P(Y | T, Z=z) P(Z=z) — the standardization
formula.
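The gap between conditioning and intervening, and its repair by backdoor adjustment, can be computed exactly on a small binary model. This is a sketch: the SCM (Z → T, Z → Y, T → Y) and all probabilities are hypothetical, and the adjustment uses only the observational joint distribution.

```python
from itertools import product

# Hypothetical SCM with one observed confounder Z.
p_z = {0: 0.5, 1: 0.5}
p_t1_given_z = {0: 0.2, 1: 0.8}

def p_y1_given_tz(t, z):
    return 0.1 + 0.3 * t + 0.5 * z

# Observational joint P(Z, T, Y) implied by the SCM.
joint = {}
for z, t, y in product((0, 1), repeat=3):
    pt = p_t1_given_z[z] if t == 1 else 1 - p_t1_given_z[z]
    py = p_y1_given_tz(t, z) if y == 1 else 1 - p_y1_given_tz(t, z)
    joint[(z, t, y)] = p_z[z] * pt * py

def prob(pred):
    """Probability of the event described by pred(z, t, y)."""
    return sum(p for key, p in joint.items() if pred(*key))

def p_y1_given_t(t):
    """Plain conditioning: P(Y=1 | T=t)."""
    return (prob(lambda z, ti, y: ti == t and y == 1)
            / prob(lambda z, ti, y: ti == t))

def p_y1_do_t(t):
    """Backdoor adjustment: sum_z P(Y=1 | T=t, Z=z) P(Z=z)."""
    total = 0.0
    for z0 in (0, 1):
        p_y_tz = (prob(lambda z, ti, y: z == z0 and ti == t and y == 1)
                  / prob(lambda z, ti, y: z == z0 and ti == t))
        total += p_y_tz * prob(lambda z, ti, y: z == z0)
    return total

print("observed contrast:", p_y1_given_t(1) - p_y1_given_t(0))
print("adjusted contrast:", p_y1_do_t(1) - p_y1_do_t(0))
```

The adjusted contrast recovers the structural coefficient 0.3 on T, while the raw conditional contrast is twice as large because Z raises both T and Y.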
9.6 Frontdoor criterion
When backdoor is infeasible (key confounder unobserved), seek a set M on the directed path T → M → Y such that:
- M intercepts all directed paths from T to Y.
- No backdoor path from T to M.
- All backdoor paths from M to Y are blocked by T.
Then P(Y | do(T)) = Σ_m P(M=m | T) Σ_t' P(Y | M=m, T=t') P(T=t').
Pearl’s smoking → tar → cancer example illustrates frontdoor when genetic confounding makes backdoor adjustment impossible.
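A sketch verifying the frontdoor formula by exact enumeration. The binary SCM (hidden U → T, U → Y, and T → M → Y) and its probabilities are hypothetical; the frontdoor computation sees only the joint of (T, M, Y), and the ground truth is computed separately from the full model.

```python
from itertools import product

P_U1 = 0.5

def p_t1(u):
    return 0.2 + 0.6 * u

def p_m1(t):
    return 0.1 + 0.7 * t

def p_y1(m, u):
    return 0.1 + 0.4 * m + 0.4 * u

def bern(p1, value):
    """Probability that a Bernoulli(p1) variable equals value."""
    return p1 if value == 1 else 1 - p1

# Observed joint P(T, M, Y): U is summed out and not used by the formula below.
obs = {}
for t, m, y in product((0, 1), repeat=3):
    obs[(t, m, y)] = sum(
        bern(P_U1, u) * bern(p_t1(u), t) * bern(p_m1(t), m) * bern(p_y1(m, u), y)
        for u in (0, 1)
    )

def P(**fixed):
    """Probability of the event {name = value, ...} under the observed joint."""
    names = ("t", "m", "y")
    return sum(p for key, p in obs.items()
               if all(key[names.index(n)] == v for n, v in fixed.items()))

def frontdoor_p_y1_do(t):
    total = 0.0
    for m in (0, 1):
        p_m_given_t = P(t=t, m=m) / P(t=t)
        inner = sum(P(t=tp, m=m, y=1) / P(t=tp, m=m) * P(t=tp) for tp in (0, 1))
        total += p_m_given_t * inner
    return total

def true_p_y1_do(t):
    """Ground truth from the full SCM, which uses the hidden U."""
    return sum(bern(p_m1(t), m) * bern(P_U1, u) * p_y1(m, u)
               for m in (0, 1) for u in (0, 1))

for t in (0, 1):
    print(t, frontdoor_p_y1_do(t), true_p_y1_do(t), P(t=t, y=1) / P(t=t))
```

The first two printed columns agree for each t, while the third (plain conditioning on T) does not, since U confounds T and Y.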
10. Common DAG-mistake patterns
- Adjusting for a collider opens a spurious path. Classic example: Berkson’s paradox — selecting on hospital admission induces a correlation between two independent diseases.
- Adjusting for a mediator blocks the very effect you want to estimate. If T → M → Y is the causal path, conditioning on M removes the indirect effect.
- M-bias — conditioning on a variable that is a collider with respect to two upstream causes can open new bias paths even though it is associated with both T and Y.
- Bad controls — Cinelli + Forney + Pearl 2024: include only pre-treatment, non-collider, non-mediator covariates that block backdoors.
Domain knowledge is irreducible: data alone cannot distinguish a
confounder from a mediator from a collider. Tools like dagitty.net
and the R package dagitty (Textor + van der Zander 2016) help draw
DAGs and read off adjustment sets.
11. Mediation analysis
How much of the effect of T on Y goes through a mediator M?
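Baron + Kenny 1986 classic three-regression decomposition:
- T → Y (total effect)
- T → M (a-path)
- Y on T and M together: the coefficient on T is the direct effect, the coefficient on M is the b-path
- Product a · b = indirect effect.
Limitations: assumes no T-M interaction, no unmeasured confounding of M-Y, and linear models.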
Counterfactual mediation — Robins + Greenland 1992 + Pearl 2001:
- Controlled direct effect (CDE) Y(t=1, m) − Y(t=0, m): effect setting M to a fixed value.
- Natural direct effect (NDE) Y(1, M(0)) − Y(0, M(0)): effect of T fixing M at its natural value under control.
- Natural indirect effect (NIE) Y(1, M(1)) − Y(1, M(0)): change in Y from changing M to its T=1 distribution, holding T fixed at 1.
- Total effect = NDE + NIE.
Identification requires “no unmeasured M-Y confounder affected by T” — Pearl’s cross-world independence (often controversial).
Reference: VanderWeele 2015 “Explanation in Causal Inference”.
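The counterfactual definitions can be evaluated directly when the structural equations are known. A sketch with a hypothetical linear SCM that includes a T x M interaction (coefficients a, b, c, d are made up), which is the case where the product a · b stops matching the natural indirect effect:

```python
A, B, C, D = 2.0, 3.0, 1.0, 0.5  # hypothetical structural coefficients

def m_of(t, u_m):
    """Mediator equation M(t) = a*t + u_m."""
    return A * t + u_m

def y_of(t, m, u_y):
    """Outcome equation Y(t, m) = c*t + b*m + d*t*m + u_y."""
    return C * t + B * m + D * t * m + u_y

def effects(u_m, u_y):
    """Unit-level (NDE, NIE, total) for one draw of the exogenous noise."""
    m0, m1 = m_of(0, u_m), m_of(1, u_m)
    nde = y_of(1, m0, u_y) - y_of(0, m0, u_y)    # change T, hold M at M(0)
    nie = y_of(1, m1, u_y) - y_of(1, m0, u_y)    # hold T=1, move M(0) to M(1)
    total = y_of(1, m1, u_y) - y_of(0, m0, u_y)
    return nde, nie, total

nde, nie, total = effects(u_m=1.0, u_y=0.0)
print("NDE", nde, "NIE", nie, "total", total, "a*b", A * B)
```

The total effect still splits as NDE + NIE, but the NIE (7.0 for this unit) differs from a · b = 6.0 because of the interaction term.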
12. Sensitivity analysis
If unconfoundedness might fail, how much unmeasured confounding would overturn the conclusion?
- Rosenbaum bounds (Rosenbaum 2002) — for matched pairs, bound how much the odds of treatment could differ between matched units due to unobserved U before the result loses significance.
- E-value (VanderWeele + Ding 2017) — minimum strength of unmeasured confounder (associated with both T and Y) required to explain away the observed RR. Publishable as a single number; widely used in epi.
- Cinelli + Hazlett 2020 omitted-variable bias bounds for OLS.
- Tipping-point analysis — vary assumptions to find the value at which conclusion flips.
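For the E-value, the point-estimate formula from VanderWeele + Ding 2017 is E = RR + sqrt(RR · (RR − 1)) for RR ≥ 1, applied to 1/RR when the estimate is protective. The formula is taken from that paper, not from the text above; the helper name is illustrative.

```python
from math import sqrt

def e_value(rr):
    """E-value for a risk-ratio point estimate.

    Minimum risk ratio that an unmeasured confounder would need with both
    treatment and outcome to fully explain away the observed association.
    """
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1 / rr  # protective effects: work with the inverse
    return rr + sqrt(rr * (rr - 1))

for rr in (1.0, 1.5, 2.0, 0.5):
    print(rr, round(e_value(rr), 3))
```

A null estimate (RR = 1) gives an E-value of 1, meaning no confounding is needed to explain it; the same function can be applied to the confidence limit closest to 1 to assess the interval.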
13. Heterogeneous treatment effects (HTE) + ML
Beyond a single ATE, we often want CATE τ(x) = E[Y(1) − Y(0) | X = x] —
who benefits, who doesn’t.
13.1 Meta-learners
Pre-existing ML regressors wrapped into causal estimators:
- S-learner (Single): fit μ(t, x) = E[Y | T=t, X=x] with T as a feature; τ̂(x) = μ(1, x) − μ(0, x). Biased toward zero when T has low importance.
- T-learner (Two): separate models per arm; high variance in small-treatment-arm regime.
- X-learner (Künzel + Sekhon + Bickel + Yu 2019) — use T-learner predictions to impute counterfactuals, then regress imputed individual effects on X, weight by propensity. Strong with imbalanced treatment.
- R-learner (Nie + Wager 2021) — Robinson residual-on-residual regression after fitting nuisance models.
- DR-learner (Kennedy 2023) — regress AIPW pseudo-outcomes on X.
13.2 Causal Forests
Wager + Athey 2018 — random forest variant for HTE. Each tree splits to maximize heterogeneity of treatment effect rather than outcome variance; honest splitting uses one subsample for splits and another for leaf estimates → valid asymptotic inference.
Generalized Random Forests (GRF) (Athey + Tibshirani + Wager 2019) extend to instrumental forests, quantile forests, local moment-condition forests.
13.3 Double / debiased machine learning
Chernozhukov + Chetverikov + Demirer + Duflo + Hansen + Newey + Robins 2018 — Neyman-orthogonal moment conditions plus cross-fitting allow ML nuisance estimators (boosting, deep nets) to plug into a final causal estimator with √n inference. The DoubleML package (R + Python) implements it.
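The cross-fitting idea can be sketched for the partially linear model Y = θ·T + g(X) + e. This is an illustration under strong simplifications, not the DoubleML implementation: a one-dimensional least-squares line stands in for the ML nuisance learners, the data are simulated with hypothetical coefficients, and no standard error is computed.

```python
import random

def fit_line(xs, ys):
    """Least-squares (intercept, slope); stand-in for an ML nuisance model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def dml_plr(x, t, y, n_folds=2):
    """Cross-fitted residual-on-residual estimate of theta."""
    n = len(y)
    folds = [list(range(k, n, n_folds)) for k in range(n_folds)]
    num = den = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        # Nuisance models are fitted on the other fold(s) only.
        a_t, b_t = fit_line([x[i] for i in train], [t[i] for i in train])
        a_y, b_y = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            t_res = t[i] - (a_t + b_t * x[i])  # T minus predicted E[T | X]
            y_res = y[i] - (a_y + b_y * x[i])  # Y minus predicted E[Y | X]
            num += t_res * y_res
            den += t_res * t_res
    return num / den

# Simulated data with a known effect theta = 2 and confounding through X.
rng = random.Random(0)
n, theta = 2000, 2.0
x = [rng.gauss(0, 1) for _ in range(n)]
t = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [theta * ti + 1.5 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]

theta_hat = dml_plr(x, t, y)
naive_slope = fit_line(t, y)[1]  # regression of Y on T that ignores X
print("cross-fitted estimate:", theta_hat, "naive slope:", naive_slope)
```

The residual-on-residual estimate should land close to 2, while the regression that ignores X is pulled upward by the confounder.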
13.4 Deep representation learning for CATE
- TARNet + CFRNet (Shalit + Johansson + Sontag 2017) — shared representation with two outcome heads + IPM balancing penalty.
- Dragonnet (Shi + Blei + Veitch 2019) — three-headed net (μ_0, μ_1, π) + targeted regularization.
- CEVAE (Louizos + Shalit + Mooij + Sontag + Zemel + Welling 2017) — latent-variable model for hidden confounders.
- Causal Transformer / CT (Melnychuk + Frauen + Feuerriegel 2022) — attention over treatment + covariate + outcome sequences for time-varying treatments.
- TransTEE + GCN-based HTE (2023–24) — graph + transformer hybrids.
13.5 Software ecosystem
- EconML (Microsoft) — DML, DR-learner, causal forests, deep IV, meta-learners; production-grade Python.
- CausalML (Uber) — meta-learners, uplift trees, A/B segmentation.
- DoWhy (Microsoft, Pearl-based) — graphical identification + multi-method estimation + refutation suite.
- PyMC + pymc-experimental — Bayesian causal modeling.
- EconML’s CausalForestDML: production causal forest with DML.
14. Software inventory
Python
- DoWhy (Microsoft) — graphical identification + estimation + refutation tests; supports backdoor, IV, frontdoor, mediation.
- CausalML (Uber) — uplift trees, meta-learners, sensitivity.
- EconML (Microsoft) — DML, causal forests, deep IV, dynamic treatments.
- DoubleML — Chernozhukov et al. orthogonal estimators.
- pymc + pymc-experimental — Bayesian causal models, ABCs.
- statsmodels — basic OLS / IV / 2SLS.
- linearmodels — robust IV, panel data, GMM.
- rdrobust (Python port) — Calonico-Cattaneo-Titiunik RDD.
- CausalNex (QuantumBlack) — structure learning + Bayesian networks.
- CausalImpact — Bayesian structural time series (Brodersen 2015; Google).
R
- MatchIt — matching (Ho + Imai + King + Stuart 2011).
- twang — generalized boosted modeling for propensity.
- WeightIt — IPW + entropy + CBPS.
- did (Callaway + Sant’Anna), DRDID, bacondecomp, fixest (high-dim FE).
- rdrobust, rddtools — RDD.
- AER, ivreg, ivmodel — IV / 2SLS.
- dagitty, ggdag — DAG construction + adjustment-set identification.
- mediation (Tingley + Yamamoto + Hirose + Keele + Imai 2014) — causal mediation.
- EpiABC, tmle — TMLE.
- grf — generalized random forests.
Stata
- teffects — propensity, IPW, AIPW, matching.
- ivregress, ivreg2 — IV.
- xtdidregress, didregress, csdid — DiD variants.
- rdrobust, rdmulti, rddensity — RDD.
Julia
- CausalInference.jl — DAG identification + PC algorithm.
- StochasticAD.jl — automatic differentiation for stochastic computation graphs.
- TuringGLM.jl, Turing.jl — Bayesian causal models.
15. Applications
15.1 A/B testing + online experimentation
Industry uses RCTs at scale — Microsoft, Google, Netflix, Booking each run thousands of experiments per year. Platforms: Optimizely, GrowthBook, Statsig, Eppo, VWO. Statistical techniques specific to online settings:
- Sequential testing — peek without inflating Type I (mSPRT (Johari + Pekelis + Walsh 2017), always-valid p-values).
- CUPED (Deng + Xu + Kohavi + Walker 2013) — variance reduction by regressing on pre-experiment covariate.
- Switchback experiments — for marketplaces with interference.
- Synthetic control + ITS for marketing campaigns where unit-level randomization is infeasible.
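The CUPED adjustment mentioned above replaces each outcome with Y − θ·(X_pre − mean(X_pre)), where θ = Cov(Y, X_pre) / Var(X_pre). A minimal sketch on hypothetical numbers; in an experiment θ would be computed on the pooled data and the same adjustment applied to both arms.

```python
from statistics import mean, variance

def cuped_adjust(y, x_pre):
    """Return CUPED-adjusted outcomes using a pre-experiment covariate."""
    mx, my = mean(x_pre), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x_pre, y)) / (len(y) - 1)
    theta = cov / variance(x_pre)
    # Centering x_pre keeps the mean of the adjusted metric unchanged.
    return [yi - theta * (xi - mx) for xi, yi in zip(x_pre, y)]

# Hypothetical metric values and their pre-experiment counterparts.
x_pre = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
y_adj = cuped_adjust(y, x_pre)

print("mean before/after:", mean(y), mean(y_adj))
print("variance before/after:", variance(y), variance(y_adj))
```

The mean is preserved and the variance shrinks by the factor 1 − ρ², where ρ is the correlation between the metric and its pre-experiment value.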
15.2 Marketing attribution + incrementality
Last-click attribution overstates already-converted users. Modern:
- Geo experiments (Chen + Au + Au + Cohen 2018, Google) — randomize cities to measure ad lift.
- Ghost ads + holdout — randomized exposure suppression.
- Media mix modeling (MMM) with Bayesian priors — PyMC-Marketing, Robyn (Meta), LightweightMMM (Google).
15.3 Pricing experiments
Customer-level price experiments are often ethically + legally constrained. Used: A/B-tested email coupons (legal), price-elasticity discontinuities (RDD around list-price breaks), conjoint analysis (stated preference).
15.4 Medical trials
RCT remains the regulatory standard (FDA, EMA). But observational real-world-evidence is rising — platforms include TriNetX, Aetion, Flatiron. Common methods: PSM, IPW, TMLE on EHR data. COVID-19 vaccine effectiveness studies (Polack + Thomas 2020 BNT162b2 RCT; Dagan 2021 observational Pfizer in Israel).
15.5 Policy evaluation
J-PAL (Abdul Latif Jameel Poverty Action Lab) RCT-based development economics. Banerjee + Duflo + Kremer 2019 Nobel for experimental approach. Examples: deworming (Kremer + Miguel 2004 — contested by 2015 reanalysis), microcredit (Banerjee + Duflo 2015), conditional cash transfers (PROGRESA Mexico — Schultz 2004).
15.6 Education
- Hanushek + Rivkin 2010 value-added teacher effects (causal interpretation contested).
- Chetty + Friedman + Rockoff 2014 long-run teacher quality effects.
- STAR experiment (Tennessee class-size RCT, Krueger 1999).
- Project Follow Through, Head Start RDD around eligibility.
15.7 Climate attribution
CMIP6-based detection + attribution: did a specific extreme event become more likely under anthropogenic forcing? Methods: probabilistic event attribution (Stott + Stone + Allen 2004 European 2003 heatwave); pseudo-global warming experiments; ML-based attribution (Diffenbaugh + Burke 2019).
15.8 Cybersecurity + fraud
Counterfactual reasoning: what would the loss have been absent this detection rule? Used in lift modeling for fraud detection, evaluation of phishing-training programs, intrusion-detection ROI.
16. Recent topics (2024–26)
- Targeted Learning + TMLE: van der Laan school continues; the tlverse R ecosystem matures.
- Causal Transformer + LLM-assisted causal discovery: Kıcıman + Ness + Sharma + Tan 2024 “Causal Reasoning and LLMs”; GPT-style models scored on causal-discovery benchmarks; combined with PC + GES algorithms.
- Causal abstraction + intervention abstraction (Geiger + Lu + Icard + Potts 2021–25) — when can a “high-level” causal model be realized by a neural network?
- Soft interventions + dynamic treatment regimes (DTR) — Murphy 2003, Chakraborty + Moodie 2013; reinforcement-learning style adaptive treatments (Q-learning, A-learning).
- Recurrent + temporal causal — Robins g-methods: G-computation, marginal structural models with IPW, G-formula for time-varying confounders that are also intermediates.
- Dynamic causal graphs (DCG) + graphical learning with neural nets — NOTEARS (Zheng + Aragam + Ravikumar + Xing 2018) makes DAG structure learning differentiable; subsequent DAGMA, GOLEM.
- Causal reinforcement learning: Bareinboim + Pearl 2016 “data fusion” theory; off-policy evaluation with do-calculus identification.
- Fairness as counterfactual — Kusner + Loftus + Russell + Silva 2017 counterfactual fairness; path-specific fairness via mediation.
- Causal representation learning (Schölkopf + Locatello + Bauer + Ke + Kalchbrenner + Goyal + Bengio 2021 “Toward Causal Representation Learning”) — disentangle factors of variation as causal variables.
17. Pitfalls
- Treating a prediction model as causal: P(Y | T) ≠ P(Y | do(T)) in general. Reading XGBoost feature importance as “what to intervene on” is a category error.
- SUTVA violation under network interference: vaccination, fashion, peer effects. Need network-aware estimators (Aronow + Samii 2017; Hudgens + Halloran 2008).
- Forgetting overlap — fitting outcome models in regions with no treated (or no control) data extrapolates the model, not the treatment.
- Selection bias / missing-not-at-random — survivors, responders, click-through.
- Internal vs external validity — a clean RCT in one population may not transport. Transportability + generalizability literature (Pearl + Bareinboim 2014, Westreich + Edwards + Lesko + Cole + Stuart 2017).
- p-hacking + multiple testing / “garden of forking paths” (Gelman + Loken 2014): pre-registration helps; Bonferroni / Benjamini-Hochberg adjust.
- Statistical vs practical significance: a tiny but statistically significant ATE may not justify the policy.
- Reverse causation + simultaneity — Y influences T as well as T influences Y. Lagged designs, IV, dynamic models.
- Mistaking auxiliary balance for unconfoundedness — covariate balance on X does not imply balance on unmeasured U.
18. Cross-references
- [[Math/probability-fundamentals]]: joint distributions, conditional independence, expectations underpin every causal estimand.
- [[Math/hypothesis-testing-mle]]: frequentist test machinery for ATE confidence intervals + sequential testing.
- [[Math/bayesian-inference]]: Bayesian causal inference, MCMC posterior over potential outcomes, Bayesian DAGs.
- [[Math/_index]]: full Math reference index.
- [[Compute/transformer-architecture]]: RLHF reward modeling has causal interpretation (reward = counterfactual outcome under intervention on action).
- [[Engineering/six-sigma]]: Design of Experiments (DOE) is industrial causal inference; factorial + fractional factorial designs.
19. Citations
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
- Pearl, J. + Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
- Hernán, M.A. + Robins, J.M. (2020). Causal Inference: What If. Chapman & Hall / CRC. Free online.
- Imbens, G.W. + Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.
- Rubin, D.B. (1974). “Estimating causal effects of treatments in randomized and nonrandomized studies.” Journal of Educational Psychology 66(5): 688-701.
- Neyman, J. (1923). “On the Application of Probability Theory to Agricultural Experiments.” Translated 1990 Statistical Science.
- Holland, P.W. (1986). “Statistics and Causal Inference.” JASA 81(396): 945-960.
- Rosenbaum, P.R. + Rubin, D.B. (1983). “The central role of the propensity score in observational studies for causal effects.” Biometrika 70(1): 41-55.
- Robins, J.M. + Hernán, M.A. + Brumback, B. (2000). “Marginal Structural Models and Causal Inference in Epidemiology.” Epidemiology 11(5).
- Robins, J.M. + Rotnitzky, A. (1995). “Semiparametric Efficiency in Multivariate Regression Models with Missing Data.” JASA 90(429).
- Pearl, J. (1995). “Causal diagrams for empirical research.” Biometrika 82(4): 669-688.
- Tian, J. + Pearl, J. (2003). “A general identification condition for causal effects.” AAAI.
- Shpitser, I. + Pearl, J. (2006). “Identification of conditional interventional distributions.” UAI.
- Wager, S. + Athey, S. (2018). “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests.” JASA 113(523): 1228-1242.
- Athey, S. + Tibshirani, J. + Wager, S. (2019). “Generalized Random Forests.” Annals of Statistics 47(2).
- Chernozhukov, V. et al. (2018). “Double/debiased machine learning for treatment and structural parameters.” Econometrics Journal 21(1).
- Künzel, S.R. + Sekhon, J.S. + Bickel, P.J. + Yu, B. (2019). “Metalearners for estimating heterogeneous treatment effects using machine learning.” PNAS 116(10).
- Nie, X. + Wager, S. (2021). “Quasi-Oracle Estimation of Heterogeneous Treatment Effects.” Biometrika 108(2).
- Shalit, U. + Johansson, F.D. + Sontag, D. (2017). “Estimating Individual Treatment Effect: generalization bounds and algorithms.” ICML.
- VanderWeele, T.J. (2015). Explanation in Causal Inference: Methods for Mediation and Interaction. Oxford University Press.
- VanderWeele, T.J. + Ding, P. (2017). “Sensitivity Analysis in Observational Research: Introducing the E-Value.” Annals of Internal Medicine 167(4).
- Card, D. + Krueger, A.B. (1994). “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” AER 84(4).
- Callaway, B. + Sant’Anna, P.H.C. (2021). “Difference-in-Differences with multiple time periods.” J. Econometrics 225(2).
- Goodman-Bacon, A. (2021). “Difference-in-Differences with variation in treatment timing.” J. Econometrics 225(2).
- Sun, L. + Abraham, S. (2021). “Estimating dynamic treatment effects in event studies with heterogeneous treatment effects.” J. Econometrics 225(2).
- de Chaisemartin, C. + D’Haultfœuille, X. (2020). “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” AER 110(9).
- Abadie, A. + Diamond, A. + Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” JASA 105(490).
- Abadie, A. + Gardeazabal, J. (2003). “The Economic Costs of Conflict: A Case Study of the Basque Country.” AER 93(1).
- Thistlethwaite, D.L. + Campbell, D.T. (1960). “Regression-discontinuity analysis: an alternative to the ex post facto experiment.” J. Educ. Psychology 51(6).
- Lee, D.S. (2008). “Randomized experiments from non-random selection in U.S. House elections.” J. Econometrics 142(2).
- Cattaneo, M.D. + Idrobo, N. + Titiunik, R. (2020). A Practical Introduction to Regression Discontinuity Designs. Cambridge.
- Calonico, S. + Cattaneo, M.D. + Titiunik, R. (2014). “Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs.” Econometrica 82(6).
- Imbens, G.W. + Angrist, J.D. (1994). “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62(2).
- Angrist, J.D. (1990). “Lifetime Earnings and the Vietnam Era Draft Lottery.” AER 80(3).
- Stock, J.H. + Yogo, M. (2005). “Testing for weak instruments in linear IV regression.” In Identification and Inference for Econometric Models.
- Robins, J.M. + Greenland, S. (1992). “Identifiability and Exchangeability for Direct and Indirect Effects.” Epidemiology 3(2).
- Pearl, J. (2001). “Direct and Indirect Effects.” UAI.
- Iacus, S.M. + King, G. + Porro, G. (2012). “Causal Inference Without Balance Checking: Coarsened Exact Matching.” Political Analysis 20(1).
- Abadie, A. + Imbens, G.W. (2006). “Large Sample Properties of Matching Estimators for Average Treatment Effects.” Econometrica 74(1).
- van der Laan, M.J. + Rubin, D. (2006). “Targeted Maximum Likelihood Learning.” IJB 2(1).
- Brodersen, K.H. et al. (2015). “Inferring causal impact using Bayesian structural time-series models.” Annals of Applied Statistics 9(1).
- Bareinboim, E. + Pearl, J. (2016). “Causal inference and the data-fusion problem.” PNAS 113(27).
- Banerjee, A. + Duflo, E. (2015). Poor Economics. PublicAffairs; 2019 Nobel for experimental approach.