Hypothesis Testing & Maximum Likelihood — Math Reference
1. At a glance
Frequentist statistical inference is the dominant paradigm for drawing conclusions from data in classical science, A/B testing, clinical trials, quality control, and most empirical reporting. It treats parameters as fixed-but-unknown quantities and data as random, in contrast to the Bayesian view where parameters carry probability distributions (see [[Math/bayesian-inference]]).
Three pillars define the frequentist toolkit:
- Hypothesis tests — decision procedures against a null hypothesis with controlled Type-I error rate.
- Confidence intervals — random intervals constructed to cover the true parameter with prescribed long-run frequency.
- Point estimation — single-value estimators (MLE, method of moments, M-estimators) with asymptotic optimality theory.
Despite ongoing critique — Ioannidis (2005), the ASA statement on p-values (2016), and the replication crisis in psychology and biomedicine — p-values and confidence intervals remain the lingua franca of empirical science. The pragmatic agent must know how to compute, interpret, and critique them while also knowing when to reach for Bayesian, likelihood-based, or non-parametric alternatives.
This note complements [[Math/probability-fundamentals]] (which builds the probabilistic substrate) and [[Math/bayesian-inference]] (the posterior-based counterpart).
2. Hypothesis testing fundamentals
A statistical test is a decision rule between two hypotheses:
- Null hypothesis H₀ — the default / no-effect / status-quo claim.
- Alternative hypothesis H₁ (or H_a) — what you want to demonstrate.
The asymmetry is essential: H₀ is never accepted — it is either rejected or you fail to reject it. The test controls only the error of wrongly rejecting H₀ (the Type-I error).
2.1 Test statistics and critical regions
A test statistic T(X) is a function of the data whose distribution under H₀ is known (at least asymptotically). The critical region C is the subset of outcomes for which we reject H₀: reject iff T(X) ∈ C, with C chosen so that P(T(X) ∈ C | H₀) ≤ α.
2.2 Errors
| | H₀ true | H₁ true |
|---|---|---|
| Fail to reject | Correct (1−α) | Type-II error (β) |
| Reject | Type-I error (α) | Correct (1−β = power) |
- Significance level α = P(Type-I) — typically 0.05, 0.01, or 0.001. Set in advance.
- Power = 1 − β = P(reject H₀ | H₁ true). Depends on effect size, n, σ, α.
2.3 The p-value
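The p-value is p = P(T(X) ≥ t_obs | H₀), where t_obs is the observed value of the statistic (or the two-tailed analog, P(|T(X)| ≥ |t_obs| | H₀)). It is the probability, under the null, of observing data at least as extreme as the actual data. We reject H₀ iff p < α.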
Critical misinterpretation: p is NOT P(H₀ | data). That quantity exists only under a Bayesian framing with a prior. The p-value is a property of the test, not of the hypothesis.
2.4 Effect size, sample size, power
Power increases with:
- Larger true effect size (e.g., |µ − µ₀| in a z-test).
- Larger n (variance of x̄ shrinks as σ²/n).
- Larger α (more lenient rejection).
- Smaller σ (less noise).
Power analysis (Cohen 1988) — given a target effect size and desired power (commonly 0.8), compute required n before running the study. Tools: G*Power, statsmodels.stats.power, R pwr package.
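As a rough illustration of a power calculation, the sketch below handles the simplest case: a two-sided one-sample z-test with known σ. It uses the standard normal approximation n ≈ ((z_{1−α/2} + z_{power})·σ/δ)², where δ is the assumed true shift |µ − µ₀|. The function names `required_n` and `achieved_power` are hypothetical, and dedicated tools should be preferred for t-tests and other designs.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)

def required_n(delta, sigma, alpha=0.05, power=0.8):
    """Approximate n for a two-sided one-sample z-test (known sigma)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sigma / delta) ** 2)

def achieved_power(n, delta, sigma, alpha=0.05):
    """Exact power of the same test when the true shift is delta."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # mean of Z under the alternative
    upper = 1 - STD_NORMAL.cdf(z_alpha - shift)  # P(Z > z_alpha)
    lower = STD_NORMAL.cdf(-z_alpha - shift)     # P(Z < -z_alpha)
    return upper + lower

# Hypothetical target: detect a shift of half a standard deviation.
n = required_n(delta=0.5, sigma=1.0)
print(n, round(achieved_power(n, delta=0.5, sigma=1.0), 3))
```

Rounding n up guarantees the target power is met; checking `achieved_power` for n − 1 shows how sharply power drops just below the threshold.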
3. Z-test — known σ, normal data
When X₁,…,Xₙ ~ N(µ, σ²) with σ known, test H₀: µ = µ₀ using Z = (x̄ − µ₀)/(σ/√n).
Under H₀, Z ~ N(0,1). Reject at α = 0.05 two-sided iff |Z| > 1.96. One-sided variant: reject iff Z > 1.645 (upper) or Z < −1.645 (lower).
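A minimal sketch of the one-sample z-test, using only the Python standard library. The statistic is the standardized distance of the sample mean from µ₀, and the two-sided p-value is the N(0,1) tail mass beyond |Z|. The data and the value σ = 2.0 are made up for illustration, and `z_test` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, fmean

def z_test(sample, mu0, sigma):
    """Two-sided one-sample z-test with known population sigma."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    # Two-sided p-value: probability of |Z| at least as large as observed.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical measurements; sigma = 2.0 is assumed known.
data = [10.2, 9.8, 11.5, 10.9, 12.1, 10.4, 11.2, 9.9, 10.8]
z, p = z_test(data, mu0=10.0, sigma=2.0)
print(f"z = {z:.3f}, p = {p:.3f}, reject at 0.05: {abs(z) > 1.96}")
```

With these numbers |Z| is below 1.96, so the test fails to reject H₀ at α = 0.05.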
Two-sample version with known σ₁, σ₂: Z = (x̄₁ − x̄₂)/√(σ₁²/n₁ + σ₂²/n₂), testing H₀: µ₁ = µ₂.
Rarely seen in practice because σ is rarely known — but central to large-n approximations (e.g., proportions via CLT).
4. t-test — unknown σ, normal data
Gosset (1908, under pseudonym “Student”) solved the unknown-σ case. With sample standard deviation s = √(Σ(xᵢ − x̄)²/(n−1)), the statistic is t = (x̄ − µ₀)/(s/√n), which follows t_{n−1} under H₀.
The Student-t distribution has heavier tails than N(0,1); as n → ∞, t_{n−1} → N(0,1). Critical value at α=0.05, df=20, two-sided: ≈ 2.086 (vs 1.96 for z).
4.1 Two-sample t-test (equal variances)
Pooled variance s_p² = ((n₁−1)s₁² + (n₂−1)s₂²)/(n₁+n₂−2), with statistic t = (x̄₁ − x̄₂)/(s_p·√(1/n₁ + 1/n₂)) ~ t_{n₁+n₂−2} under H₀.
4.2 Welch’s t-test (unequal variances)
When σ₁ ≠ σ₂ (the common case), use Welch (1947): t = (x̄₁ − x̄₂)/√(s₁²/n₁ + s₂²/n₂),
with Welch-Satterthwaite degrees of freedom (non-integer) ν ≈ (s₁²/n₁ + s₂²/n₂)² / [(s₁²/n₁)²/(n₁−1) + (s₂²/n₂)²/(n₂−1)]. Default in R’s t.test().
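The sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom from two samples, using only the standard library. It stops short of a p-value because the standard library has no Student-t CDF; in practice a statistics package would supply it. The samples and the name `welch_t` are hypothetical.

```python
from math import sqrt
from statistics import fmean, variance

def welch_t(x, y):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1 = variance(x) / n1  # s1^2 / n1 (sample variance uses n - 1)
    v2 = variance(y) / n2  # s2^2 / n2
    t = (fmean(x) - fmean(y)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples with visibly different spreads.
x = [5.1, 4.9, 5.6, 5.8, 5.3]
y = [4.2, 6.9, 3.1, 7.4, 5.0, 2.8]
t, df = welch_t(x, y)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The degrees of freedom always fall between min(n₁, n₂) − 1 and n₁ + n₂ − 2. Here they sit close to the lower end because the noisier sample dominates the standard error.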
4.3 Paired t-test
For matched samples (before/after, twins), compute differences dᵢ = xᵢ − yᵢ and run a one-sample t-test on d. Dramatically more powerful than independent-samples t when pairing removes variance.
5. Chi-square tests
The chi-square distribution χ²_k with k degrees of freedom is the distribution of ΣZᵢ² for Zᵢ ~ N(0,1) iid. Three classical uses:
5.1 Goodness-of-fit
Test whether observed counts O₁,…,O_k match expected counts E₁,…,E_k under H₀: χ² = Σᵢ (Oᵢ − Eᵢ)²/Eᵢ, approximately χ²_{k−1} under H₀.
(Subtract 1 more df per fitted parameter.) Pearson (1900). Used for dice fairness, Mendel genetics, distributional fit.
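As a small worked example of the goodness-of-fit statistic, the sketch below sums (O − E)²/E over the faces of a die and compares the result with the tabulated 5% critical value for 5 degrees of freedom (11.070). The roll counts are invented, and `chi_square_gof` is a hypothetical helper name.

```python
def chi_square_gof(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical tally of 120 rolls of a six-sided die.
observed = [16, 18, 16, 14, 12, 44]
expected = [sum(observed) / 6] * 6  # fair die: 20 per face

CRITICAL_5DF_05 = 11.070  # tabulated chi-square critical value, df = 5, alpha = 0.05

stat = chi_square_gof(observed, expected)
print(f"chi2 = {stat:.2f}, reject fairness: {stat > CRITICAL_5DF_05}")
```

The degrees of freedom are k − 1 = 5 because no parameters were estimated from the data; the excess of sixes drives the statistic far above the critical value.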
5.2 Test of independence (contingency table)
Two categorical variables, r × c table. With Eᵢⱼ = (row_i total × col_j total)/n, the statistic χ² = ΣᵢΣⱼ (Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ is approximately χ²_{(r−1)(c−1)} under independence.
Fisher’s exact test preferred when expected counts < 5.
5.3 Variance test
For normal data: (n−1)s²/σ₀² ~ χ²_{n−1}. Sensitive to non-normality.
6. ANOVA — analysis of variance
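Compare means of k > 2 groups via the F-statistic (Fisher 1925). Decompose total sum of squares as SS_total = SS_between + SS_within, then F = [SS_between/(k−1)] / [SS_within/(N−k)] ~ F_{k−1, N−k} under H₀, where N is the total sample size.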
Variants:
- One-way ANOVA — single factor with k levels.
- Two-way ANOVA — two factors + interaction.
- Repeated-measures ANOVA — within-subjects factor; correlated errors handled by sphericity assumption (Greenhouse-Geisser correction if violated).
- MANOVA — multiple dependent variables; Wilks’ Λ, Pillai’s trace.
- ANCOVA — covariate adjustment.
ANOVA omnibus reject does not say which groups differ — follow with post-hoc tests (Tukey HSD, Scheffé, Dunnett vs control).
7. Non-parametric tests
When normality fails or data is ordinal, use rank-based tests. Less power if normality holds; robust when it doesn’t.
| Parametric test | Non-parametric analog | Notes |
|---|---|---|
| One-sample t | Wilcoxon signed-rank | Tests median = µ₀ |
| Paired t | Wilcoxon signed-rank on diffs | Wilcoxon (1945) |
| Independent t | Mann-Whitney U / Wilcoxon rank-sum | Mann-Whitney (1947) |
| One-way ANOVA | Kruskal-Wallis | Kruskal-Wallis (1952) |
| Repeated-measures ANOVA | Friedman | Friedman (1937) |
| Pearson r | Spearman ρ, Kendall τ | Rank-based correlation |
| GOF / 2-sample dist | Kolmogorov-Smirnov | Compare empirical CDFs |
| 2-sample dist | Anderson-Darling, Cramér-von Mises | More tail-sensitive than KS |
KS statistic: D = sup_x |F̂(x) − F₀(x)|; distribution-free under H₀.
8. Permutation tests and bootstrap
Permutation (randomization) tests — Fisher’s original (1935). Under H₀ of exchangeability, the labels are arbitrary; recompute the test statistic across all (or randomly sampled) relabelings to get an empirical null. Exact in finite samples, with no distributional assumption beyond exchangeability.
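A minimal sketch of a two-sample permutation test on the absolute difference in means, sampling relabelings at random rather than enumerating all of them. The function name, the data, and the choice of 5,000 shuffles are assumptions for illustration. The +1 in numerator and denominator counts the observed labeling itself, which keeps the Monte Carlo p-value strictly positive.

```python
import random
from statistics import fmean

def permutation_test(x, y, n_perm=5000, seed=0):
    """Monte Carlo permutation p-value for |mean(x) - mean(y)|."""
    rng = random.Random(seed)
    observed = abs(fmean(x) - fmean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling under exchangeability
        diff = abs(fmean(pooled[:len(x)]) - fmean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical measurements from two groups.
a = [12.1, 11.8, 13.4, 12.9, 13.7, 12.5]
b = [10.2, 10.9, 11.1, 9.8, 10.5, 11.4]
p = permutation_test(a, b)
print(f"permutation p-value = {p:.4f}")
```

Because every value in `a` exceeds every value in `b`, only the observed split and its mirror image are as extreme, so the p-value lands near the exact value 2/924.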
Bootstrap — Efron (1979). Resample n observations with replacement from the empirical distribution B times (e.g., B=10000), compute the statistic on each bootstrap sample, derive:
- Bootstrap standard error = SD of bootstrap statistics.
- Percentile CI = [θ̂_{(α/2)}, θ̂_{(1−α/2)}].
- BCa (bias-corrected accelerated) — better coverage; Efron (1987).
Modern, computational, assumption-light. Standard in econometrics, machine learning evaluation, and any setting where the sampling distribution is intractable.
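The bootstrap standard error and percentile interval described above can be sketched with the standard library alone. The example bootstraps the median of a small, skewed sample; the data, B = 2000, and the helper name `bootstrap` are illustrative assumptions, and BCa is not implemented here.

```python
import random
from statistics import median, stdev

def bootstrap(data, stat, B=2000, alpha=0.05, seed=0):
    """Bootstrap standard error and percentile CI for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n observations with replacement, B times.
    reps = sorted(stat(rng.choices(data, k=n)) for _ in range(B))
    se = stdev(reps)
    lo = reps[int((alpha / 2) * B)]
    hi = reps[int((1 - alpha / 2) * B) - 1]
    return se, (lo, hi)

# Hypothetical latency measurements with two large outliers.
latencies = [12.1, 9.8, 14.3, 11.0, 10.5, 35.2, 9.9, 12.7,
             11.8, 10.1, 13.4, 12.2, 10.9, 11.5, 28.6]
se, (lo, hi) = bootstrap(latencies, median)
print(f"median = {median(latencies)}, SE = {se:.2f}, 95% CI = [{lo}, {hi}]")
```

Passing the statistic as a function keeps the resampling loop generic: swapping `median` for a trimmed mean or a correlation requires no other change.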
9. Multiple comparison correction
Running m tests at level α each gives a family-wise Type-I error rate of up to mα (for independent tests, 1 − (1 − α)^m, about 0.40 for m = 10 at α = 0.05), so at least one “significant” result becomes likely even when every null is true. Corrections:
9.1 Bonferroni
Use α/m per test. Controls family-wise error rate (FWER): P(any false reject) ≤ α. Simple, conservative, loses power as m grows.
9.2 Holm-Bonferroni (1979)
Step-down: sort p-values p₍₁₎ ≤ … ≤ p₍ₘ₎. Find the smallest i with p₍ᵢ₎ > α/(m − i + 1); reject the hypotheses for p₍₁₎,…,p₍ᵢ₋₁₎ and retain the rest (reject all if no such i exists). Always at least as powerful as Bonferroni, and controls FWER.
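A short sketch of the Holm step-down procedure: walk through the sorted p-values, compare the i-th smallest with α/(m − i + 1), and stop at the first one that fails. The function name `holm` and the p-values are hypothetical.

```python
def holm(pvals, alpha=0.05):
    """Holm step-down procedure; returns a reject flag per hypothesis."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank = i - 1 for the i-th smallest
        if pvals[idx] > alpha / (m - rank):
            break  # first failure: this and all larger p-values are retained
        reject[idx] = True
    return reject

# Hypothetical p-values from four tests.
pvals = [0.03, 0.005, 0.04, 0.015]
print(holm(pvals))
print([p <= 0.05 / len(pvals) for p in pvals])  # plain Bonferroni for comparison
```

On this input Holm rejects the hypotheses with p = 0.005 and p = 0.015, while plain Bonferroni (threshold 0.0125) rejects only the first.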
9.3 Benjamini-Hochberg FDR (1995)
Controls the False Discovery Rate — expected proportion of false rejections among rejections — at level q. Sort p-values; find largest i with p₍ᵢ₎ ≤ (i/m)·q; reject p₍₁₎,…,p₍ᵢ₎. Standard in genomics (m can be 20,000+ genes), neuroimaging, large-scale A/B testing.
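The step-up rule above can be sketched in a few lines. Note that the procedure rejects everything up to the largest rank that passes, even if some smaller p-value missed its own threshold. The function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns a reject flag per test."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank i with p_(i) <= (i / m) * q
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * q:
            k = rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject

# Hypothetical p-values from six tests.
pvals = [0.03, 0.5, 0.001, 0.022, 0.2, 0.02]
print(benjamini_hochberg(pvals))
```

Here p = 0.02 exceeds its own threshold (2/6 · 0.05 ≈ 0.0167) but is still rejected, because the third and fourth smallest p-values pass theirs.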
9.4 Storey q-value (2002)
Adaptive FDR that estimates the proportion π₀ of true nulls from the p-value distribution; the q-value of a test is the minimum FDR at which it is significant. More powerful than BH when π₀ < 1.
9.5 FWER vs FDR — when to use which
- FWER — small m, any false positive is costly (regulatory, drug approval).
- FDR — large m, screening / discovery, willing to tolerate some false positives in exchange for power (omics, exploratory A/B).
10. Confidence intervals
A (1−α)·100% confidence interval for θ is a random interval [L(X), U(X)] such that P(L(X) ≤ θ ≤ U(X)) = 1 − α
over repeated sampling. For a normal mean, σ unknown: x̄ ± t_{n−1, α/2} · s/√n.
For proportions (Wald, large n): p̂ ± z_{α/2} √(p̂(1−p̂)/n). Wilson and Clopper-Pearson are preferred for small n or extreme p̂.
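To illustrate why Wilson is preferred for small n or extreme p̂, the sketch below computes both intervals for 1 success in 20 trials. The Wilson formula used is the standard score interval; the function names and the example counts are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x, n, alpha=0.05):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = x / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(x, n, alpha=0.05):
    """Wilson score interval, obtained by inverting the score test."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical data: 1 success in 20 trials.
wald = wald_ci(1, 20)
wilson = wilson_ci(1, 20)
print("Wald:  ", wald)
print("Wilson:", wilson)
```

The Wald lower limit falls below zero, which is impossible for a proportion, while the Wilson interval stays inside (0, 1) and is pulled toward 1/2.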
10.1 Interpretation — critical
A 95% CI does NOT mean “there is a 95% probability the parameter lies in this interval.” Frequentist θ is fixed; the interval is random. The correct statement: if you repeated the experiment many times, 95% of the resulting CIs would contain the true parameter.
For a probability-of-θ statement, you need a Bayesian credible interval computed from a posterior (see [[Math/bayesian-inference]]).
10.2 Duality with hypothesis tests
The (1−α) CI consists of all θ₀ values that would not be rejected by a two-sided test at level α. If µ₀ ∉ CI, then the test rejects H₀: µ = µ₀ at level α.
11. Maximum Likelihood Estimation (MLE)
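Fisher (1922) — given iid data X₁,…,Xₙ with density p(x; θ), the likelihood is L(θ) = ∏ᵢ p(xᵢ; θ) and the log-likelihood is ℓ(θ) = Σᵢ log p(xᵢ; θ).
The maximum likelihood estimator is θ̂_MLE = argmax_θ ℓ(θ).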
Typically solve the score equation ∇ℓ(θ) = 0, checking the Hessian is negative-definite. Numerical optimization (Newton-Raphson, BFGS) when no closed form exists.
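As a sketch of the numerical route, the code below runs Newton-Raphson on the score equation of an exponential model, where ℓ(λ) = n·log λ − λ·Σxᵢ. This model has a closed-form answer (n/Σxᵢ), which makes it easy to check the iteration. The starting value 1/max(xᵢ) is an assumption chosen because it never exceeds the true maximizer, which keeps the iteration inside the region where it converges. The function name and data are hypothetical.

```python
def exp_mle_newton(xs, tol=1e-12, max_iter=100):
    """Newton-Raphson on the exponential log-likelihood n*log(lam) - lam*sum(xs)."""
    n, s = len(xs), sum(xs)
    lam = 1 / max(xs)  # assumed start: positive and no larger than n / s
    for _ in range(max_iter):
        score = n / lam - s        # first derivative of the log-likelihood
        hessian = -n / lam ** 2    # second derivative, negative for lam > 0
        step = score / hessian
        lam -= step                # Newton update
        if abs(step) < tol:
            break
    return lam

# Hypothetical waiting times.
waits = [0.8, 2.1, 0.3, 1.7, 0.9, 3.2, 0.5, 1.1]
lam_hat = exp_mle_newton(waits)
closed_form = len(waits) / sum(waits)
print(lam_hat, closed_form)
```

The Hessian is negative for every λ > 0, so the stationary point is a maximum, and the numerical result matches the closed form to within floating-point tolerance.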
12. MLE properties
Under standard regularity conditions (smooth density, identifiability, support not depending on θ):
12.1 Consistency
The MLE converges in probability to the true parameter.
12.2 Asymptotic normality
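√n(θ̂ₙ − θ₀) converges in distribution to N(0, I(θ₀)⁻¹), where the Fisher information (per observation) is I(θ) = E[(∂ log p(X; θ)/∂θ)²] = −E[∂² log p(X; θ)/∂θ²].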
For vector θ, I(θ) is the Fisher information matrix; asymptotic covariance = I(θ)⁻¹/n.
12.3 Efficiency — Cramér-Rao lower bound
The Cramér-Rao lower bound (CRLB) states that for any unbiased estimator θ̃ of θ: Var(θ̃) ≥ 1/(n·I(θ)).
The MLE asymptotically achieves this bound: it is asymptotically efficient, in the sense that no regular, asymptotically unbiased estimator has lower asymptotic variance.
12.4 Invariance
If g is a one-to-one transformation, then the MLE of g(θ) is g(θ̂_MLE). No re-derivation needed when re-parameterizing.
12.5 Caveats
- Finite-sample MLE can be biased (the variance MLE divides by n, not n−1).
- MLE can fail under non-regular models (e.g., uniform on [0, θ] — boundary parameter, MLE = max Xᵢ has non-standard asymptotics).
- MLE can have unbounded likelihood (Gaussian mixture with σₖ → 0 around a single point) — needs regularization or restricted optimization.
13. MLE examples
13.1 Normal — N(µ, σ²)
µ̂ = x̄ and σ̂² = (1/n)Σ(xᵢ − x̄)². The MLE of σ² is biased (divides by n); the unbiased estimator divides by n−1. Bias → 0 as n → ∞ (consistent).
13.2 Bernoulli — Ber(p)
p̂ = (1/n)Σxᵢ, the sample proportion of 1s.
13.3 Exponential — Exp(λ)
λ̂ = n/Σxᵢ = 1/x̄, the reciprocal of the sample mean.
13.4 Poisson — Pois(λ)
λ̂ = x̄, the sample mean.
13.5 Uniform — U(0, θ)
θ̂ = max xᵢ. Non-regular: not asymptotically normal, biased downward, bias correctable to (n+1)/n · max xᵢ.
14. EM algorithm
Dempster, Laird, Rubin (1977) — when the likelihood depends on latent / unobserved variables Z and direct ML is intractable. The EM algorithm iterates:
E-step — compute the expected complete-data log-likelihood under the current parameter estimate θ_t: Q(θ | θ_t) = E_{Z | X, θ_t}[log p(X, Z; θ)].
M-step — maximize: θ_{t+1} = argmax_θ Q(θ | θ_t).
Key property — monotone convergence: the observed-data log-likelihood ℓ(θ_t; X) is non-decreasing across iterations. Converges to a local maximum (or saddle) — sensitive to initialization, often run with random restarts.
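A compact sketch of EM for a two-component, one-dimensional Gaussian mixture, recording the observed-data log-likelihood at each iteration so the monotone property can be checked. The initialization (means at the sample minimum and maximum, shared variance, equal weights), the variance floor of 1e-6, and the simulated data are all assumptions for illustration, not a recommended default.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture; returns weights, means, variances, log-lik history."""
    n = len(xs)
    mean = sum(xs) / n
    var0 = sum((x - mean) ** 2 for x in xs) / n
    w, mu, var = [0.5, 0.5], [min(xs), max(xs)], [var0, var0]  # crude init (assumption)
    history = []
    for _ in range(n_iter):
        # E-step: responsibilities under the current parameters.
        resp, loglik = [], 0.0
        for x in xs:
            dens = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        history.append(loglik)  # observed-data log-likelihood at theta_t
        # M-step: weighted updates of weights, means, variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            v = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(v, 1e-6)  # floor guards against variance collapse
    return w, mu, var, history

# Simulated data: two well-separated clusters (hypothetical).
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(150)] + [rng.gauss(5, 1) for _ in range(150)]
w, mu, var, history = em_gmm_1d(xs)
print([round(m, 2) for m in mu], [round(p, 2) for p in w])
```

The variance floor is one simple form of the restricted optimization mentioned for unbounded mixture likelihoods; with well-separated clusters it is never triggered.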
14.1 Examples
- Gaussian mixture models (GMM) — Z = component assignment; E-step gives soft responsibilities; M-step updates means, covariances, weights. Foundation of clustering with sklearn.mixture.GaussianMixture.
- Hidden Markov Models — Baum-Welch algorithm (1970) is EM where E-step uses the forward-backward algorithm.
- Missing-data MLE — Z = missing values; alternative to multiple imputation.
- Factor analysis — Z = latent factors.
- Mixture of experts — Z = expert assignment.
- Probabilistic PCA — closed-form EM (Tipping-Bishop 1999).
14.2 Connections
EM is a special case of the MM (minorize-maximize) framework and generalizes to variational EM (the exact posterior expectation is replaced by a variational approximation), which is used heavily in variational Bayes and modern deep generative models.
15. Method of moments
Pearson (1894) — equate the first k sample moments to theoretical moments and solve for the k parameters: (1/n)Σᵢ xᵢʲ = E_θ[Xʲ] for j = 1,…,k.
Example: Gamma(α, β) — E[X] = α/β, Var(X) = α/β². Equate x̄ = α/β and s² = α/β² → β̂ = x̄/s², α̂ = x̄²/s².
Older, often simpler than MLE. Generally less efficient asymptotically (higher variance) but useful as MLE starting points or when likelihood is intractable. Generalized method of moments (GMM) — Hansen (1982) — major in econometrics.
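The Gamma example above translates directly into code. The sketch simulates data from a Gamma distribution with shape 3 and rate 2 (hypothetical values) and recovers the parameters from the sample mean and variance. Note that Python's `random.gammavariate` takes a scale parameter, so the rate is passed as its reciprocal.

```python
import random
from statistics import fmean, variance

def gamma_mom(xs):
    """Method-of-moments estimates (shape alpha, rate beta) for a Gamma sample."""
    m, s2 = fmean(xs), variance(xs)
    return m * m / s2, m / s2

# Simulated sample: shape 3, rate 2, i.e. scale 1/2 (hypothetical values).
rng = random.Random(1)
xs = [rng.gammavariate(3.0, 1 / 2.0) for _ in range(20000)]
alpha_hat, beta_hat = gamma_mom(xs)
print(f"alpha_hat = {alpha_hat:.2f}, beta_hat = {beta_hat:.2f}")
```

These estimates need no iteration, which is why they are a convenient starting point for a numerical MLE fit.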
16. Likelihood ratio test (LRT)
For nested models with restricted parameter space Θ₀ ⊂ Θ: Λ = sup_{θ∈Θ₀} L(θ) / sup_{θ∈Θ} L(θ).
Under H₀ and regularity conditions (Wilks 1938): −2 log Λ converges in distribution to χ²_r,
where r = dim(Θ) − dim(Θ₀). Foundation of:
- Nested regression model comparison (e.g., does adding predictor improve fit?).
- ANOVA F-tests (asymptotically equivalent under normality).
- Deviance tests in GLM.
By the Neyman-Pearson lemma (1933), the likelihood ratio test is the most powerful level-α test for simple-vs-simple hypotheses.
17. Wald and score tests
Three asymptotically equivalent tests of H₀: θ = θ₀:
- Likelihood ratio: −2 log [L(θ₀)/L(θ̂)] ~ χ²
- Wald: n(θ̂ − θ₀)² · I(θ̂) ~ χ²₁ — fit unrestricted model, check distance. Here I is the per-observation Fisher information of section 12.2.
- Score (Lagrange multiplier): [∇ℓ(θ₀)]² / (n·I(θ₀)) ~ χ²₁ — fit restricted model only.
Choice depends on convenience: Wald needs unrestricted fit, score needs only restricted fit, LRT needs both. Asymptotically equivalent but can disagree in finite samples, especially with non-quadratic log-likelihood.
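A small worked comparison of the three statistics for a Bernoulli proportion, testing H₀: p = p₀ with x successes in n trials. With per-observation information I(p) = 1/(p(1−p)), the Wald statistic uses the variance at p̂, the score statistic uses the variance at p₀, and the likelihood ratio compares the two log-likelihoods. The counts and the function name are hypothetical.

```python
from math import log

def bernoulli_tests(x, n, p0):
    """Wald, score and likelihood-ratio statistics for H0: p = p0 (needs 0 < x < n)."""
    p_hat = x / n
    wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))   # information at p_hat
    score = n * (p_hat - p0) ** 2 / (p0 * (1 - p0))        # information at p0
    lrt = 2 * (x * log(p_hat / p0) + (n - x) * log((1 - p_hat) / (1 - p0)))
    return wald, score, lrt

CHI2_1DF_05 = 3.841  # tabulated chi-square critical value, df = 1, alpha = 0.05

# Hypothetical data: 60 successes in 100 trials, H0: p = 0.5.
wald, score, lrt = bernoulli_tests(60, 100, 0.5)
print(f"Wald = {wald:.3f}, score = {score:.3f}, LRT = {lrt:.3f}")
```

The three values are close but not identical (roughly 4.17, 4.00 and 4.03), which is the finite-sample disagreement described above; all three exceed 3.841 here.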
18. Generalized Linear Models (GLM)
Nelder-Wedderburn (1972) — unifies linear regression, logistic regression, Poisson regression, and more under a single MLE framework: g(E[Y | x]) = xᵀβ,
with response distribution in the exponential family and link function g.
| Model | Distribution | Link | Use |
|---|---|---|---|
| Linear regression (OLS) | Normal | Identity | Continuous Y |
| Logistic regression | Bernoulli | logit log(p/(1−p)) | Binary Y |
| Probit regression | Bernoulli | Φ⁻¹ | Binary Y, normal latent |
| Poisson regression | Poisson | log | Count Y |
| Negative binomial | NegBin | log | Overdispersed counts |
| Gamma regression | Gamma | inverse / log | Positive continuous |
Fitted by iteratively reweighted least squares (IRLS), which is Fisher scoring on the log-likelihood and coincides with Newton-Raphson for canonical links. Deviance D = −2(ℓ_fitted − ℓ_saturated) is the analog of RSS for model comparison.
Software: R glm(), statsmodels.GLM, scikit-learn LogisticRegression (regularized by default).
19. Survival analysis
Time-to-event data with censoring — some subjects haven’t experienced the event by study end. Standard tools:
19.1 Kaplan-Meier estimator (1958)
Non-parametric estimator of the survival function S(t) = P(T > t): Ŝ(t) = ∏_{tᵢ ≤ t} (1 − dᵢ/nᵢ),
where dᵢ = deaths at tᵢ, nᵢ = at-risk at tᵢ. Stepwise function; standard plot in clinical trials.
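A minimal sketch of the Kaplan-Meier product: at each distinct event time, multiply the running survival estimate by (1 − deaths / at-risk), then remove everyone who died or was censored at that time from the risk set. The data are invented, `events` uses 1 for an observed event and 0 for censoring, and subjects censored at an event time are counted as at risk at that time (the usual convention).

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        # Gather everyone whose recorded time equals t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving  # deaths and censored both leave the risk set
    return curve

# Hypothetical follow-up times (months); event = 1 observed, 0 censored.
times = [2, 3, 3, 5, 6, 8, 9, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

The estimate steps down only at the five event times (2, 3, 5, 8, 9), to approximately 0.875, 0.75, 0.6, 0.4 and 0.2; censored subjects shrink the risk set without producing a step.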
19.2 Log-rank test
Compare survival curves between groups; non-parametric, sensitive to proportional hazards alternatives.
19.3 Cox proportional hazards (Cox 1972)
Semi-parametric — leaves baseline hazard λ₀(t) unspecified: λ(t | x) = λ₀(t)·exp(βᵀx).
Fitted by partial likelihood. The hazard ratio exp(βⱼ) is the multiplicative change in the hazard per unit increase in covariate j, holding the others fixed. Assumes proportional hazards across time (testable via Schoenfeld residuals).
Software: R survival (Therneau), Python lifelines.
20. A/B testing
Modern industrial application of frequentist inference. Compare conversion or metric between control A and treatment B.
20.1 Two-proportion test
For binary metrics (clicks, signups): Z = (p̂_B − p̂_A)/√(p̂(1−p̂)(1/n_A + 1/n_B)), with pooled proportion p̂ = (x_A + x_B)/(n_A + n_B).
Equivalent to χ² test on the 2×2 table.
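A minimal sketch of the pooled two-proportion z-test for an A/B comparison, using only the standard library. The conversion counts are invented and `two_proportion_z` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)  # common rate under H0: p_A = p_B
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical experiment: 200/2000 conversions in A, 250/2000 in B.
z, p = two_proportion_z(200, 2000, 250, 2000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Squaring z reproduces the Pearson χ² statistic of the corresponding 2×2 table (without continuity correction), which is the equivalence noted above.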
20.2 Sequential testing
Naive “peek and decide” inflates Type-I rate dramatically. Solutions:
- Sequential probability ratio test (SPRT) — Wald (1945).
- mSPRT (mixture SPRT) — Robbins (1970); Johari-Pekelis-Walsh at Optimizely.
- Group-sequential tests — Pocock, O’Brien-Fleming spending functions; standard in clinical trials.
- Always-valid p-values + e-values — Vovk-Wang (2021) revival.
20.3 Bandit alternatives
Multi-armed bandits (Thompson sampling, which is Bayesian; UCB, which is frequentist) allocate traffic adaptively to better-performing arms while learning, trading off exploration against exploitation. They yield higher cumulative reward, but clean post-hoc effect estimates are harder to obtain.
Software: Optimizely, GrowthBook, statsmodels, scipy.stats, Eppo, Statsig.
21. Reproducibility crisis and the p-value debate
A wave of evidence in the 2010s revealed widespread non-replication in psychology (Open Science Collaboration 2015), medicine (Ioannidis 2005 “Why Most Published Research Findings Are False”), and biology. Core diagnoses:
- p-hacking — try many tests, report the significant ones.
- Garden of forking paths — analysis choices contingent on the data (Gelman-Loken).
- Publication bias — null results unpublished; meta-analyses biased.
- Underpowered studies — n too small; significant findings overestimate effect size (Type-M error, Gelman).
- Misinterpreted p-values — treated as posterior probabilities of H₀.
21.1 ASA statement (2016)
The American Statistical Association published six principles, including: p-values do not measure the probability of the hypothesis; p < 0.05 does not establish importance; reporting only p ignores effect size and uncertainty.
21.2 Reforms
- Pre-registration — lock the analysis plan before seeing the data.
- Registered reports — peer review the design, not the result.
- Effect-size + CI reporting alongside p.
- Multiverse analysis — report results across reasonable analysis choices.
- Bayes factors and likelihood-based inference — see [[Math/bayesian-inference]].
- Open data + open code — independent reproducibility.
22. Software
22.1 R
- stats (built-in): t.test, chisq.test, wilcox.test, kruskal.test, aov, lm, glm, cor.test, ks.test.
- lme4 — mixed-effects models (Bates et al.).
- survival — Cox PH, Kaplan-Meier (Therneau).
- pwr — power analysis.
- multcomp — multiple comparisons (Tukey, Dunnett).
- boot — bootstrap.
22.2 Python
- scipy.stats — distributions, tests (t, F, χ², Mann-Whitney, KS, etc.).
- statsmodels — GLM, mixed models, ANOVA, time series, survival.
- pingouin — friendly stats wrapper.
- lifelines — survival analysis.
- scikit-posthocs — post-hoc multiple comparisons.
- scikit-learn — LogisticRegression, GaussianMixture (EM under the hood).
22.3 Bayesian alternatives
- PyMC (Python) — NUTS / HMC sampler.
- Stan (R / Python / Julia) — Hamiltonian Monte Carlo.
- NumPyro — JAX-backed Bayesian inference.
See [[Math/bayesian-inference]] for the Bayesian counterpart and conversion between p-values and Bayes factors.
23. Common pitfalls
- p-hacking — running many tests until one gives p < 0.05; pre-register or correct.
- Statistical vs practical significance — large n can make trivial effects significant; report effect size + CI, not just p.
- No multiple-comparisons correction — m tests at α each can inflate the FWER to as much as mα (1 − (1 − α)^m for independent tests).
- Assumption violations — normality (use bootstrap / non-parametric), equal variance (use Welch’s t), independence (use mixed models or GEE).
- Underpowered studies — n too small; observed significant effects are inflated (Type-M / Type-S error per Gelman).
- Optional stopping — peeking at p during data collection inflates Type-I; use sequential designs.
- Confusing CI with credible interval — the 95% CI is not P(θ ∈ interval) = 0.95. Frequentist coverage statement only.
- Treating “fail to reject” as “accept H₀” — absence of evidence is not evidence of absence; report power and CIs.
- Cherry-picking the test — Welch vs Student, parametric vs non-parametric, decided after seeing the data.
- Ignoring dependence structure — clustered, repeated-measures, or time-series data violate iid; use hierarchical / GEE / time-series methods.
- Reporting p without direction — “p = 0.04” without saying which group is higher.
- MLE of variance — using σ̂² = (1/n)Σ(xᵢ − x̄)² when the unbiased estimator is needed.
24. Cross-references
- [[Math/probability-fundamentals]] — distributions, CLT, LLN, joint/conditional probability that underpin all tests.
- [[Math/bayesian-inference]] — posterior-based counterpart; Bayes factors, credible intervals, MCMC.
- [[Math/_index]] — math reference index.
- [[Engineering/reliability-engineering]] — Weibull fitting, accelerated life testing, MTBF inference.
- [[Engineering/six-sigma]] — DMAIC Analyze phase uses ANOVA, hypothesis tests, capability indices.
25. Citations and further reading
- Wasserman L. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004 — chapters 6–13 cover hypothesis testing, CIs, MLE, GLM in compact form.
- Casella G, Berger RL. Statistical Inference, 2nd ed. Duxbury, 2001 — graduate-level reference, exhaustive on MLE properties + decision theory.
- Lehmann EL, Romano JP. Testing Statistical Hypotheses, 4th ed. Springer, 2022 — canonical hypothesis-testing monograph.
- Fisher RA. “On the Mathematical Foundations of Theoretical Statistics.” Phil Trans R Soc A, 1922 — introduces MLE, consistency, efficiency, information.
- Neyman J, Pearson ES. “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Phil Trans R Soc A, 1933 — Neyman-Pearson lemma.
- Student. “The Probable Error of a Mean.” Biometrika, 1908 — Gosset’s t-distribution.
- Welch BL. “The Generalization of Student’s Problem When Several Different Population Variances Are Involved.” Biometrika, 1947.
- Wilks SS. “The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses.” Ann Math Stat, 1938.
- Dempster AP, Laird NM, Rubin DB. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” J R Stat Soc B, 1977.
- Cox DR. “Regression Models and Life-Tables.” J R Stat Soc B, 1972 — proportional hazards.
- Nelder JA, Wedderburn RWM. “Generalized Linear Models.” J R Stat Soc A, 1972.
- Kaplan EL, Meier P. “Nonparametric Estimation from Incomplete Observations.” JASA, 1958.
- Benjamini Y, Hochberg Y. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” J R Stat Soc B, 1995.
- Holm S. “A Simple Sequentially Rejective Multiple Test Procedure.” Scand J Stat, 1979.
- Storey JD. “A Direct Approach to False Discovery Rates.” J R Stat Soc B, 2002.
- Efron B. “Bootstrap Methods: Another Look at the Jackknife.” Ann Stat, 1979.
- Hansen LP. “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica, 1982.
- Ioannidis JPA. “Why Most Published Research Findings Are False.” PLOS Medicine, 2005.
- Wasserstein RL, Lazar NA. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician, 2016.
- Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Erlbaum, 1988.
- Open Science Collaboration. “Estimating the Reproducibility of Psychological Science.” Science, 2015.
- Gelman A, Loken E. “The Statistical Crisis in Science.” American Scientist, 2014 — the garden of forking paths.