Hypothesis Testing & Maximum Likelihood — Math Reference

1. At a glance

Frequentist statistical inference is the dominant paradigm for drawing conclusions from data in classical science, A/B testing, clinical trials, quality control, and most empirical reporting. It treats parameters as fixed-but-unknown quantities and data as random, in contrast to the Bayesian view where parameters carry probability distributions (see [[Math/bayesian-inference]]).

Three pillars define the frequentist toolkit:

Hypothesis tests — decision procedures against a null hypothesis with controlled Type-I error rate.
Confidence intervals — random intervals constructed to cover the true parameter with prescribed long-run frequency.
Point estimation — single-value estimators (MLE, method of moments, M-estimators) with asymptotic optimality theory.

Despite ongoing critique — Ioannidis (2005), the ASA statement on p-values (2016), and the replication crisis in psychology and biomedicine — p-values and confidence intervals remain the lingua franca of empirical science. The pragmatic agent must know how to compute, interpret, and critique them while also knowing when to reach for Bayesian, likelihood-based, or non-parametric alternatives.

This note complements [[Math/probability-fundamentals]] (which builds the probabilistic substrate) and [[Math/bayesian-inference]] (the posterior-based counterpart).

2. Hypothesis testing fundamentals

A statistical test is a decision rule between two hypotheses:

Null hypothesis H₀ — the default / no-effect / status-quo claim.
Alternative hypothesis H₁ (or H_a) — what you want to demonstrate.

The asymmetry is essential: H₀ is never accepted — it is either rejected or you fail to reject it. The test controls only the error of wrongly rejecting H₀ (the Type-I error).

2.1 Test statistics and critical regions

A test statistic T(X) is a function of the data whose distribution under H₀ is known (at least asymptotically). The critical region C is the subset of outcomes for which we reject H₀:

$\text{Reject } H_{0} \iff T(X) \in C$

2.2 Errors

	H₀ true	H₁ true
Fail to reject	Correct (1−α)	Type-II error (β)
Reject	Type-I error (α)	Correct (1−β = power)

Significance level α = P(Type-I) — typically 0.05, 0.01, or 0.001. Set in advance.
Power = 1 − β = P(reject H₀ | H₁ true). Depends on effect size, n, σ, α.

2.3 The p-value

The p-value is:

$p = P(T(X) \geq T_{\text{obs}} \mid H_{0})$

(or the two-tailed analog). It is the probability of observing data at least as extreme as the actual data under the null. We reject H₀ iff p < α.

Critical misinterpretation: p is NOT P(H₀ | data). That quantity exists only under a Bayesian framing with a prior. The p-value is computed assuming H₀ is true, so it describes how surprising the data are under H₀, not how probable H₀ is.

2.4 Effect size, sample size, power

Power increases with:

Larger true effect size (e.g., |μ − μ₀| in a z-test).
Larger n (variance of x̄ shrinks as σ²/n).
Larger α (more lenient rejection).
Smaller σ (less noise).

Power analysis (Cohen 1988) — given a target effect size and desired power (commonly 0.8), compute required n before running the study. Tools: G*Power, statsmodels.stats.power, R pwr package.
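A minimal sketch of a power calculation for the two-sided one-sample z-test, using only the Python standard library. The function names and the example values (effect δ = 0.5, σ = 1) are illustrative assumptions, not the API of any tool listed above.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test when the true mean is mu0 + delta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # mean of Z under the alternative
    # Under the alternative Z ~ N(shift, 1); reject when Z > z_crit or Z < -z_crit.
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)

def z_test_sample_size(delta, sigma, power=0.8, alpha=0.05):
    """Approximate n for a target power (ignores the negligible far-tail term)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

n_needed = z_test_sample_size(delta=0.5, sigma=1.0)
achieved = z_test_power(delta=0.5, sigma=1.0, n=n_needed)
print(n_needed, round(achieved, 3))
```

Halving the effect size roughly quadruples the required n, because n scales with (σ/δ)².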

3. Z-test — known σ, normal data

When X₁,…,Xₙ ~ N(μ, σ²) with σ known, test H₀: μ = μ₀:

$Z = \frac{\bar{x} - μ_{0}}{σ / \sqrt{n}}$

Under H₀, Z ~ N(0,1). Reject at α = 0.05 two-sided iff |Z| > 1.96. One-sided variant: reject iff Z > 1.645 (upper) or Z < −1.645 (lower).
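As a sketch of the one-sample z-test, the following computes Z and its two-sided p-value with the standard library. The sample values, μ₀ = 5.0 and σ = 0.5 are made-up illustration data.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(xs, mu0, sigma):
    """Return (Z, two-sided p-value) for H0: mu = mu0 with known sigma."""
    n = len(xs)
    z = (mean(xs) - mu0) / (sigma / sqrt(n))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

# Hypothetical measurements; sigma is assumed known from prior calibration.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.3, 5.7, 5.4]
z, p = one_sample_z_test(sample, mu0=5.0, sigma=0.5)
reject = abs(z) > 1.96  # two-sided test at alpha = 0.05
print(round(z, 3), round(p, 4), reject)
```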

Two-sample version with known σ₁, σ₂:

$Z = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{σ_{1}^{2} / n_{1} + σ_{2}^{2} / n_{2}}}$

Rarely seen in practice because σ is rarely known — but central to large-n approximations (e.g., proportions via CLT).

4. t-test — unknown σ, normal data

Gosset (1908, under pseudonym “Student”) solved the unknown-σ case. With sample standard deviation s = √(Σ(xᵢ − x̄)²/(n−1)):

$t = \frac{\bar{x} - μ_{0}}{s / \sqrt{n}} \sim t_{n - 1}$

The Student-t distribution has heavier tails than N(0,1); as n → ∞, t_{n−1} → N(0,1). Critical value at α=0.05, df=20, two-sided: ≈ 2.086 (vs 1.96 for z).

4.1 Two-sample t-test (equal variances)

Pooled variance s_p² = ((n₁−1)s₁² + (n₂−1)s₂²)/(n₁+n₂−2):

$t = \frac{\bar{x}_{1} - \bar{x}_{2}}{s_{p} \sqrt{1/n_{1} + 1/n_{2}}} \sim t_{n_{1} + n_{2} - 2}$

4.2 Welch’s t-test (unequal variances)

When σ₁ ≠ σ₂ (the common case), use Welch (1947):

$t = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{s_{1}^{2} / n_{1} + s_{2}^{2} / n_{2}}}$

with Welch-Satterthwaite degrees of freedom (non-integer). Default in R’s t.test().
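A short sketch of the Welch statistic together with the Welch-Satterthwaite degrees of freedom, df = (v₁ + v₂)² / (v₁²/(n₁−1) + v₂²/(n₂−1)) with vᵢ = sᵢ²/nᵢ. The function name and the two samples are illustrative assumptions; the p-value would come from a t distribution with this non-integer df, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(xs, ys):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n1, n2 = len(xs), len(ys)
    v1, v2 = variance(xs) / n1, variance(ys) / n2  # s_i^2 / n_i (n-1 denominator in s_i^2)
    t = (mean(xs) - mean(ys)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

group_a = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical, low variance
group_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]   # hypothetical, high variance
t_stat, df = welch_t(group_a, group_b)
print(round(t_stat, 3), round(df, 2))
```

The df lands between min(n₁, n₂) − 1 and n₁ + n₂ − 2, shrinking toward the smaller value when one group dominates the variance.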

4.3 Paired t-test

For matched samples (before/after, twins), compute differences dᵢ = xᵢ − yᵢ and run a one-sample t-test on d. Dramatically more powerful than independent-samples t when pairing removes variance.

5. Chi-square tests

The chi-square distribution χ²_k with k degrees of freedom is the distribution of ΣZᵢ² for Zᵢ ~ N(0,1) iid. Three classical uses:

5.1 Goodness-of-fit

Test whether observed counts O₁,…,O_k match expected counts E₁,…,E_k under H₀:

$χ^{2} = \sum_{i = 1}^{k} \frac{( O _{i} - E _{i} ) ^{2}}{E _{i}} \sim χ_{k - 1}^{2}$

(Subtract 1 more df per fitted parameter.) Pearson (1900). Used for dice fairness, Mendel genetics, distributional fit.
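For instance, the die-fairness check can be sketched as below. The counts are invented, and the critical value 11.070 is the tabulated upper 5% point of χ² with 5 degrees of freedom, hard-coded here because the standard library has no chi-square quantile function.

```python
def chi_square_gof(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 16, 25, 12, 27]          # hypothetical counts from 120 rolls
expected = [sum(observed) / 6] * 6           # fair die: 20 per face
stat = chi_square_gof(observed, expected)

CRITICAL_5PCT_DF5 = 11.070                   # tabulated chi-square quantile, df = k - 1 = 5
reject = stat > CRITICAL_5PCT_DF5
print(stat, reject)
```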

5.2 Test of independence (contingency table)

Two categorical variables, r × c table. Eᵢⱼ = (row_i total × col_j total)/n:

$χ^{2} = \sum_{i, j} \frac{( O _{ij} - E _{ij} ) ^{2}}{E _{ij}} \sim χ_{(r - 1) (c - 1)}^{2}$

Fisher’s exact test preferred when expected counts < 5.

5.3 Variance test

For normal data: (n−1)s²/σ₀² ~ χ²_{n−1}. Sensitive to non-normality.

6. ANOVA — analysis of variance

Compare means of k > 2 groups via the F-statistic (Fisher 1925). Decompose total sum of squares:

$SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}$

$F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (n - k)} \sim F_{k - 1, n - k}$
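The sum-of-squares decomposition can be sketched directly. The function name and the three small groups are illustrative assumptions; the p-value would need an F distribution, which is left to a statistics package.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, SS_between, SS_within) for a one-way layout."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f_stat, ss_between, ss_within

groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # hypothetical data
f_stat, ss_b, ss_w = one_way_anova_f(groups)
print(round(f_stat, 3), round(ss_b, 3), round(ss_w, 3))
```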

Variants:

One-way ANOVA — single factor with k levels.
Two-way ANOVA — two factors + interaction.
Repeated-measures ANOVA — within-subjects factor; correlated errors handled by sphericity assumption (Greenhouse-Geisser correction if violated).
MANOVA — multiple dependent variables; Wilks’ Λ, Pillai’s trace.
ANCOVA — covariate adjustment.

ANOVA omnibus reject does not say which groups differ — follow with post-hoc tests (Tukey HSD, Scheffé, Dunnett vs control).

7. Non-parametric tests

When normality fails or data is ordinal, use rank-based tests. Less power if normality holds; robust when it doesn’t.

Parametric test	Non-parametric analog	Notes
One-sample t	Wilcoxon signed-rank	Tests median = μ₀
Paired t	Wilcoxon signed-rank on diffs	Wilcoxon (1945)
Independent t	Mann-Whitney U / Wilcoxon rank-sum	Mann-Whitney (1947)
One-way ANOVA	Kruskal-Wallis	Kruskal-Wallis (1952)
Repeated-measures ANOVA	Friedman	Friedman (1937)
Pearson r	Spearman ρ, Kendall τ	Rank-based correlation
GOF / 2-sample dist	Kolmogorov-Smirnov	Compare empirical CDFs
2-sample dist	Anderson-Darling, Cramér-von Mises	More tail-sensitive than KS

KS statistic: D = sup_x |F̂(x) − F₀(x)|; distribution-free under H₀.

8. Permutation tests and bootstrap

Permutation (randomization) tests: Fisher’s original (1935). Under H₀ of exchangeability, the labels are arbitrary; recompute the test statistic across all (or a random sample of) relabelings to get an empirical null distribution. Exact in finite samples when all relabelings are enumerated, and requires no distributional assumption beyond exchangeability.
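A minimal Monte Carlo sketch of a two-sample permutation test on the difference in means. The function name, the number of resamples, the seed, and the data are all assumptions chosen for illustration; the add-one correction is a common convention that keeps the estimated p-value strictly positive.

```python
import random

def permutation_test(xs, ys, n_resamples=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    n_x = len(xs)
    pooled = list(xs) + list(ys)

    def abs_mean_diff(values):
        first, second = values[:n_x], values[n_x:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = abs_mean_diff(pooled)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                      # random relabeling under H0
        if abs_mean_diff(pooled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

treated = [12.1, 11.8, 12.6, 13.0, 12.4, 12.9]   # hypothetical
control = [11.2, 11.5, 10.9, 11.7, 11.4, 11.1]   # hypothetical
p_value = permutation_test(treated, control)
p_same = permutation_test(treated, treated)      # identical groups: no evidence
print(p_value, p_same)
```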

Bootstrap — Efron (1979). Resample n observations with replacement from the empirical distribution B times (e.g., B=10000), compute the statistic on each bootstrap sample, derive:

Bootstrap standard error = SD of bootstrap statistics.
Percentile CI = [θ̂_{(α/2)}, θ̂_{(1−α/2)}].
BCa (bias-corrected accelerated) — better coverage; Efron (1987).

Modern, computational, assumption-light. Standard in econometrics, machine learning evaluation, and any setting where the sampling distribution is intractable.
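The percentile interval and bootstrap standard error can be sketched as follows. The function name, B = 2000, the seed, and the data are illustrative assumptions; BCa is not implemented here.

```python
import random
from statistics import mean, stdev

def bootstrap_percentile_ci(data, stat=mean, n_boot=2000, alpha=0.05, seed=0):
    """Return (lower, upper, bootstrap SE) using the percentile method."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n observations with replacement, n_boot times.
    boot_stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    # Approximate empirical alpha/2 and 1 - alpha/2 quantiles.
    lower = boot_stats[int((alpha / 2) * n_boot)]
    upper = boot_stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper, stdev(boot_stats)

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6, 2.4, 4.4]  # hypothetical
lo, hi, se = bootstrap_percentile_ci(data)
print(round(lo, 2), round(hi, 2), round(se, 3))
```

Passing a different `stat` (for example `statistics.median`) reuses the same resampling loop for statistics whose sampling distribution has no convenient closed form.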

9. Multiple comparison correction

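Running m tests at level α each gives a family-wise Type-I error rate of up to min(1, mα), and exactly 1 − (1 − α)^m when the tests are independent (about 0.64 for m = 20, α = 0.05). With many tests, at least one “significant” result is likely to be a false positive. Corrections: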

9.1 Bonferroni

Use α/m per test. Controls family-wise error rate (FWER): P(any false reject) ≤ α. Simple, conservative, loses power as m grows.

9.2 Holm-Bonferroni (1979)

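Step-down procedure: sort p-values p₍₁₎ ≤ … ≤ p₍ₘ₎. Starting from i = 1, reject the hypothesis for p₍ᵢ₎ while p₍ᵢ₎ ≤ α/(m − i + 1); stop at the first i where the inequality fails and retain that hypothesis and all later ones. Always at least as powerful as Bonferroni, and still controls FWER.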
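A sketch of Holm's step-down rule next to plain Bonferroni: walk the sorted p-values, compare the i-th smallest with α/(m − i + 1), and stop at the first failure. Function names and the p-values are illustrative assumptions.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: returns rejection flags in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):           # rank 0 is the smallest p-value
        if p_values[idx] > alpha / (m - rank):   # threshold alpha / (m - i + 1), i = rank + 1
            break                                # first failure: retain this and all larger p
        reject[idx] = True
    return reject

p_values = [0.01, 0.04, 0.015, 0.005]            # hypothetical
bonf = bonferroni_reject(p_values)
holm = holm_reject(p_values)
print(bonf, holm)
```

With these values Bonferroni rejects two hypotheses while Holm rejects all four, because each Holm threshold loosens as earlier hypotheses are rejected.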

9.3 Benjamini-Hochberg FDR (1995)

Controls the False Discovery Rate — expected proportion of false rejections among rejections — at level q. Sort p-values; find largest i with p₍ᵢ₎ ≤ (i/m)·q; reject p₍₁₎,…,p₍ᵢ₎. Standard in genomics (m can be 20,000+ genes), neuroimaging, large-scale A/B testing.
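The step-up rule above can be sketched as below. The function name and p-values are illustrative assumptions. Note that a hypothesis can be rejected even if its own p-value exceeds its own threshold, as long as some larger-ranked p-value passes.

```python
def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: returns rejection flags in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank                   # keep the LARGEST passing rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:              # reject everything up to that rank
        reject[idx] = True
    return reject

p_values = [0.039, 0.5, 0.012, 0.028, 0.015]     # hypothetical
rejected = benjamini_hochberg(p_values)
print(rejected)
```

Here 0.012 fails its own threshold (1/5 · 0.05 = 0.01) but is still rejected because the fourth-smallest p-value, 0.039, sits below 4/5 · 0.05 = 0.04.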

9.4 Storey q-value (2002)

Adaptive FDR that estimates the proportion π₀ of true nulls from the p-value distribution; the q-value of a test is the minimum FDR at which it is significant. More powerful than BH when π₀ < 1.

9.5 FWER vs FDR — when to use which

FWER — small m, any false positive is costly (regulatory, drug approval).
FDR — large m, screening / discovery, willing to tolerate some false positives in exchange for power (omics, exploratory A/B).

10. Confidence intervals

A (1−α)·100% confidence interval for θ is a random interval [L(X), U(X)] such that:

$P (L (X) \leq θ \leq U (X)) = 1 - α$

over repeated sampling. For a normal mean, σ unknown:

$CI = \bar{x} \pm t_{n - 1, α/2} \cdot \frac{s}{\sqrt{n}}$

For proportions (Wald, large n): p̂ ± z_{α/2} √(p̂(1−p̂)/n). Wilson and Clopper-Pearson are preferred for small n or extreme p̂.
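To illustrate why Wilson is preferred for extreme p̂, here is a sketch comparing the two intervals with the standard library. Function names and the 1-in-20 example are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, alpha=0.05):
    """Wald interval: p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = successes / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(successes, n, alpha=0.05):
    """Wilson score interval, obtained by inverting the score test."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

wald = wald_interval(1, 20)      # 1 success in 20 trials (hypothetical)
wilson = wilson_interval(1, 20)
print(wald, wilson)
```

The Wald lower limit falls below zero for this sample, while the Wilson interval stays inside [0, 1].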

10.1 Interpretation — critical

A 95% CI does NOT mean “there is a 95% probability the parameter lies in this interval.” Frequentist θ is fixed; the interval is random. The correct statement: if you repeated the experiment many times, 95% of the resulting CIs would contain the true parameter.

For a probability-of-θ statement, you need a Bayesian credible interval computed from a posterior (see [[Math/bayesian-inference]]).

10.2 Duality with hypothesis tests

The (1−α) CI consists of all θ₀ values that would not be rejected by a two-sided test at level α. If μ₀ ∉ CI, then the test rejects H₀: μ = μ₀ at level α.

11. Maximum Likelihood Estimation (MLE)

Fisher (1922) — given iid data X₁,…,Xₙ with density p(x; θ):

$L(θ; X) = \prod_{i = 1}^{n} p(x_{i}; θ)$

$ℓ(θ) = \log L(θ; X) = \sum_{i = 1}^{n} \log p(x_{i}; θ)$

The maximum likelihood estimator:

$\hat{θ}_{\text{MLE}} = \arg\max_{θ} L(θ; X) = \arg\max_{θ} ℓ(θ)$

Typically solve the score equation ∇ℓ(θ) = 0, checking the Hessian is negative-definite. Numerical optimization (Newton-Raphson, BFGS) when no closed form exists.
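As a sketch of solving the score equation numerically, the following runs Newton-Raphson on the exponential log-likelihood ℓ(λ) = n log λ − λ Σxᵢ, where the closed form 1/x̄ is available as a check. The function name, starting value, and data are assumptions of this example.

```python
def exponential_mle_newton(xs, tol=1e-12, max_iter=100):
    """Newton-Raphson for the rate of an exponential sample."""
    n, total = len(xs), sum(xs)
    lam = 1.0 / max(xs)   # start at or below 1 / x-bar so every iterate stays positive
    for _ in range(max_iter):
        score = n / lam - total       # first derivative of the log-likelihood
        hessian = -n / lam ** 2       # second derivative, negative for all lam > 0
        step = score / hessian
        lam -= step                   # Newton update: lam - score / hessian
        if abs(step) < tol:
            break
    return lam

waits = [0.5, 1.2, 0.3, 2.0, 1.0]     # hypothetical waiting times, mean 1.0
lam_hat = exponential_mle_newton(waits)
print(lam_hat)
```

The negative Hessian confirms the stationary point is a maximum; for models without a closed form the same loop applies with the model's own score and Hessian.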

12. MLE properties

Under standard regularity conditions (smooth density, identifiability, support not depending on θ):

12.1 Consistency

$\hat{θ}_{\text{MLE}} \xrightarrow{P} θ_{\text{true}} \quad \text{as } n \to \infty$

The MLE converges in probability to the true parameter.

12.2 Asymptotic normality

$\sqrt{n}\,(\hat{θ}_{\text{MLE}} - θ_{\text{true}}) \xrightarrow{d} N(0, I(θ)^{-1})$

where the Fisher information is:

$I(θ) = -E\left[\frac{\partial^{2} \log p(X; θ)}{\partial θ^{2}}\right] = E\left[\left(\frac{\partial \log p(X; θ)}{\partial θ}\right)^{2}\right]$

For vector θ, I(θ) is the Fisher information matrix; asymptotic covariance = I(θ)⁻¹/n.

12.3 Efficiency — Cramér-Rao lower bound

The Cramér-Rao lower bound (CRLB) states that for any unbiased estimator θ̃ of θ:

$Var (\tilde{θ}) \geq \frac{1}{n I ( θ )}$

The MLE asymptotically achieves this bound: it is asymptotically efficient, so no regular asymptotically unbiased estimator has lower asymptotic variance.

12.4 Invariance

If g is a one-to-one transformation, then the MLE of g(θ) is g(θ̂_MLE). No re-derivation needed when re-parameterizing.

12.5 Caveats

Finite-sample MLE can be biased (the variance MLE divides by n, not n−1).
MLE can fail under non-regular models (e.g., uniform on [0, θ] — boundary parameter, MLE = max Xᵢ has non-standard asymptotics).
MLE can have unbounded likelihood (Gaussian mixture with σₖ → 0 around a single point) — needs regularization or restricted optimization.

13. MLE examples

13.1 Normal — N(μ, σ²)

$\hat{μ}_{\text{MLE}} = \bar{x}$

$\hat{σ}^{2}_{\text{MLE}} = \frac{1}{n} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}$

The MLE of σ² is biased (divides by n); the unbiased estimator divides by n−1. Bias → 0 as n → ∞ (consistent).

13.2 Bernoulli — Ber(p)

$\hat{p}_{\text{MLE}} = \frac{1}{n} \sum_{i = 1}^{n} x_{i} = \bar{x}$

Sample proportion of 1s.

13.3 Exponential — Exp(λ)

$\hat{λ}_{\text{MLE}} = \frac{1}{\bar{x}}$

Reciprocal of sample mean.

13.4 Poisson — Pois(λ)

$\hat{λ}_{\text{MLE}} = \bar{x}$

13.5 Uniform — U(0, θ)

$\hat{θ}_{\text{MLE}} = \max_{i} x_{i}$

Non-regular: not asymptotically normal, biased downward, bias correctable to (n+1)/n · max xᵢ.

14. EM algorithm

Dempster, Laird, Rubin (1977) — when the likelihood depends on latent / unobserved variables Z and direct ML is intractable. The EM algorithm iterates:

E-step — compute the expected complete-data log-likelihood under the current parameter estimate θ_t:

$Q(θ \mid θ_{t}) = E_{Z \mid X, θ_{t}}[\log p(X, Z; θ)]$

M-step — maximize:

$θ_{t + 1} = \arg\max_{θ} Q(θ \mid θ_{t})$

Key property — monotone convergence: the observed-data log-likelihood ℓ(θ_t; X) is non-decreasing across iterations. Converges to a local maximum (or saddle) — sensitive to initialization, often run with random restarts.
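A compact sketch of EM for a two-component one-dimensional Gaussian mixture, recording the observed-data log-likelihood at each iteration so the monotone property can be inspected. The initialisation (sample extremes as starting means), the variance floor, the seed, and the simulated data are all assumptions of this example, not a recommended recipe.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, n_iter=100):
    """EM for a 2-component 1-D Gaussian mixture; returns params and log-lik trace."""
    mu = [min(xs), max(xs)]          # crude initialisation (assumption of this sketch)
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities P(Z_i = k | x_i) under the current parameters.
        resp, log_lik = [], 0.0
        for x in xs:
            joint = [weight[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        log_liks.append(log_lik)     # log-likelihood at the current parameters
        # M-step: responsibility-weighted means, variances, and mixing weights.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
            weight[k] = nk / len(xs)
    return mu, var, weight, log_liks

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(5.0, 1.0) for _ in range(150)]
mu, var, weight, log_liks = em_two_gaussians(data)
print([round(m, 2) for m in mu], [round(w, 2) for w in weight])
```

The variance floor guards against the unbounded-likelihood degeneracy where a component collapses onto a single point.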

14.1 Examples

Gaussian mixture models (GMM) — Z = component assignment; E-step gives soft responsibilities; M-step updates means, covariances, weights. Foundation of clustering with sklearn.mixture.GaussianMixture.
Hidden Markov Models — Baum-Welch algorithm (1970) is EM where E-step uses the forward-backward algorithm.
Missing-data MLE — Z = missing values; alternative to multiple imputation.
Factor analysis — Z = latent factors.
Mixture of experts — Z = expert assignment.
Probabilistic PCA — closed-form EM (Tipping-Bishop 1999).

14.2 Connections

EM is a special case of MM (majorize-minimize); generalizes to variational EM (replace expectation with variational approximation) — used heavily in variational Bayes and modern deep generative models.

15. Method of moments

Pearson (1894) — equate the first k sample moments to theoretical moments and solve for the k parameters:

$\hat{μ}_{j} = \frac{1}{n} \sum_{i} x_{i}^{j} \overset{!}{=} E[X^{j}; θ], \quad j = 1, \dots, k$

Example: Gamma(α, β) — E[X] = α/β, Var(X) = α/β². Equate x̄ = α/β and s² = α/β² → β̂ = x̄/s², α̂ = x̄²/s².
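The Gamma example translates directly into code. The function name and the data are illustrative assumptions; `statistics.variance` uses the n − 1 denominator, matching s² above.

```python
from statistics import mean, variance

def gamma_method_of_moments(xs):
    """Method-of-moments estimates (shape alpha, rate beta) for Gamma data."""
    x_bar, s2 = mean(xs), variance(xs)
    beta_hat = x_bar / s2          # from x_bar = alpha / beta and s2 = alpha / beta^2
    alpha_hat = x_bar ** 2 / s2
    return alpha_hat, beta_hat

sample = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical: mean 3, sample variance 2.5
alpha_hat, beta_hat = gamma_method_of_moments(sample)
print(alpha_hat, beta_hat)
```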

Older, often simpler than MLE. Generally less efficient asymptotically (higher variance) but useful as MLE starting points or when likelihood is intractable. Generalized method of moments (GMM) — Hansen (1982) — major in econometrics.

16. Likelihood ratio test (LRT)

For nested models with restricted parameter space Θ₀ ⊂ Θ:

$Λ = \frac{\sup_{θ \in Θ_{0}} L(θ; X)}{\sup_{θ \in Θ} L(θ; X)}$

Under H₀ and regularity conditions (Wilks 1938):

$-2 \log Λ \xrightarrow{d} χ_{r}^{2}$

where r = dim(Θ) − dim(Θ₀). Foundation of:

Nested regression model comparison (e.g., does adding predictor improve fit?).
ANOVA F-tests (asymptotically equivalent under normality).
Deviance tests in GLM.

LRT is uniformly most powerful for simple-vs-simple hypotheses (Neyman-Pearson lemma, 1933).

17. Wald and score tests

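Three asymptotically equivalent tests of H₀: θ = θ₀ (scalar θ, with I(θ) the per-observation Fisher information of section 12.2; each statistic is asymptotically χ²₁ under H₀):

Likelihood ratio: −2 log [L(θ₀)/L(θ̂)] ~ χ²₁
Wald: n (θ̂ − θ₀)² · I(θ̂) ~ χ²₁ (fit the unrestricted model, check the distance from θ₀).
Score (Lagrange multiplier): [∇ℓ(θ₀)]² / (n I(θ₀)) ~ χ²₁ (fit the restricted model only).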

Choice depends on convenience: Wald needs unrestricted fit, score needs only restricted fit, LRT needs both. Asymptotically equivalent but can disagree in finite samples, especially with non-quadratic log-likelihood.
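To see the three statistics disagree in a finite sample, here is a sketch for a Poisson rate, where the per-observation Fisher information is I(λ) = 1/λ and the unrestricted MLE is λ̂ = x̄. The function names and the counts are illustrative assumptions; the χ²₁ tail probability uses the identity χ²₁ = Z².

```python
from math import log, sqrt
from statistics import NormalDist, mean

def chi2_1_sf(x):
    """Upper-tail probability of chi-square with 1 df, via chi2_1 = Z^2."""
    return 2 * (1 - NormalDist().cdf(sqrt(x)))

def poisson_trinity(xs, lam0):
    """LRT, Wald, and score statistics for H0: lambda = lam0 (requires x-bar > 0)."""
    n, lam_hat = len(xs), mean(xs)
    lrt = 2 * n * (lam0 - lam_hat + lam_hat * log(lam_hat / lam0))
    wald = n * (lam_hat - lam0) ** 2 / lam_hat    # n (diff)^2 I(lam_hat)
    score = n * (lam_hat - lam0) ** 2 / lam0      # U(lam0)^2 / (n I(lam0))
    return {"lrt": lrt, "wald": wald, "score": score}

counts = [3, 5, 4, 6, 2, 7, 5, 4, 6, 8]           # hypothetical, mean 5
stats = poisson_trinity(counts, lam0=4.0)
p_values = {name: chi2_1_sf(value) for name, value in stats.items()}
print(stats, p_values)
```

The only difference between Wald and score here is whether the information is evaluated at λ̂ or at λ₀, which is why they bracket the likelihood ratio statistic in this example.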

18. Generalized Linear Models (GLM)

Nelder-Wedderburn (1972) — unifies linear regression, logistic regression, Poisson regression, and more under a single MLE framework:

$g (E [Y ∣ X]) = Xβ$

with response distribution in the exponential family and link function g.

Model	Distribution	Link	Use
Linear regression (OLS)	Normal	Identity	Continuous Y
Logistic regression	Bernoulli	logit log(p/(1−p))	Binary Y
Probit regression	Bernoulli	Φ⁻¹	Binary Y, normal latent
Poisson regression	Poisson	log	Count Y
Negative binomial	NegBin	log	Overdispersed counts
Gamma regression	Gamma	inverse / log	Positive continuous

Fitted by iteratively reweighted least squares (IRLS) = Newton-Raphson on the log-likelihood. Deviance D = −2(ℓ_fitted − ℓ_saturated); analog of RSS for model comparison.
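A bare-bones sketch of the Newton-Raphson (IRLS) iteration for logistic regression with an intercept and one covariate, solving the 2×2 system by hand. The function names and the toy data are assumptions; the data are deliberately not linearly separable, since the MLE does not exist under perfect separation.

```python
from math import exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def logistic_irls(xs, ys, n_iter=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson (equivalently IRLS)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        p = [sigmoid(b0 + b1 * x) for x in xs]
        w = [pi * (1 - pi) for pi in p]                  # IRLS weights
        # Score vector X^T (y - p).
        g0 = sum(y - pi for y, pi in zip(ys, p))
        g1 = sum(x * (y - pi) for x, y, pi in zip(xs, ys, p))
        # Information matrix X^T W X = [[a, b], [b, c]].
        a = sum(w)
        b = sum(wi * x for wi, x in zip(w, xs))
        c = sum(wi * x * x for wi, x in zip(w, xs))
        det = a * c - b * b
        # Newton step: beta += (X^T W X)^{-1} X^T (y - p).
        b0, b1 = b0 + (c * g0 - b * g1) / det, b1 + (a * g1 - b * g0) / det
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # hypothetical covariate
ys = [0, 0, 1, 0, 1, 0, 1, 1]                   # hypothetical binary outcomes
b0, b1 = logistic_irls(xs, ys)
print(round(b0, 3), round(b1, 3))
```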

Software: R glm(), statsmodels.GLM, scikit-learn LogisticRegression (regularized by default).

19. Survival analysis

Time-to-event data with censoring — some subjects haven’t experienced the event by study end. Standard tools:

19.1 Kaplan-Meier estimator (1958)

Non-parametric estimator of the survival function S(t) = P(T > t):

$\hat{S} (t) = \prod_{t_{i} \leq t} (1 - \frac{d _{i}}{n _{i}})$

where dᵢ = deaths at tᵢ, nᵢ = at-risk at tᵢ. Stepwise function; standard plot in clinical trials.
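The product-limit formula can be sketched in a few lines. The function name and the toy data are illustrative assumptions; subjects censored at time tᵢ are counted as still at risk at tᵢ, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] at each distinct observed event time.

    events[j] is 1 if the event was observed at times[j], 0 if censored.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    curve, surv = [], 1.0
    for t_i in event_times:
        at_risk = sum(1 for u in times if u >= t_i)                        # n_i
        deaths = sum(1 for u, e in zip(times, events) if u == t_i and e)   # d_i
        surv *= 1 - deaths / at_risk
        curve.append((t_i, surv))
    return curve

times = [1, 2, 2, 3, 4]      # hypothetical follow-up times
events = [1, 1, 0, 1, 0]     # 1 = event observed, 0 = censored
km_curve = kaplan_meier(times, events)
print(km_curve)
```

Censored subjects never trigger a drop in the curve, but they do shrink the at-risk set for later event times.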

19.2 Log-rank test

Compare survival curves between groups; non-parametric, sensitive to proportional hazards alternatives.

19.3 Cox proportional hazards (Cox 1972)

Semi-parametric — leaves baseline hazard λ₀(t) unspecified:

$λ(t \mid X) = λ_{0}(t) \exp(Xβ)$

Fitted by partial likelihood. Each hazard ratio exp(βⱼ) is the multiplicative change in hazard per unit increase in covariate j, holding the other covariates fixed. Assumes proportional hazards across time (testable via Schoenfeld residuals).

Software: R survival (Therneau), Python lifelines.

20. A/B testing

Modern industrial application of frequentist inference. Compare conversion or metric between control A and treatment B.

20.1 Two-proportion test

For binary metrics (clicks, signups):

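$Z = \frac{\hat{p}_{B} - \hat{p}_{A}}{\sqrt{\hat{p}(1 - \hat{p})(1/n_{A} + 1/n_{B})}}, \qquad \hat{p} = \text{pooled proportion}$

Z² equals the Pearson χ² statistic (without continuity correction) on the 2×2 table, so the two-sided tests are equivalent.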
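A sketch of the pooled two-proportion z-test with the standard library. The function name and the conversion counts are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (Z, two-sided p-value) for H0: p_A = p_B using the pooled SE."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)      # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 10.0% vs 12.5% conversion, 2000 users per arm.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(round(z, 3), round(p_value, 4))
```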

20.2 Sequential testing

Naive “peek and decide” inflates Type-I rate dramatically. Solutions:

Sequential probability ratio test (SPRT) — Wald (1945).
mSPRT (mixture SPRT) — Robbins (1970); Johari-Pekelis-Walsh at Optimizely.
Group-sequential tests — Pocock, O’Brien-Fleming spending functions; standard in clinical trials.
Always-valid p-values + e-values — Vovk-Wang (2021) revival.
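The inflation from naive peeking is easy to see by simulation. The sketch below draws data under a true null, applies the fixed |Z| > 1.96 rule at every interim look, and compares the resulting false-positive rate with a single test at the end. The function name, look schedule, simulation sizes, and seed are assumptions; this shows the problem, not any of the corrected procedures listed above.

```python
import random
from math import sqrt

def peeking_simulation(n_sims=1000, n_max=500, look_every=50, z_crit=1.96, seed=0):
    """Return (fixed-n Type-I rate, peek-at-every-look Type-I rate) under H0."""
    rng = random.Random(seed)
    fixed_rejects = peeking_rejects = 0
    for _ in range(n_sims):
        running_sum = 0.0
        rejected_at_any_look = False
        for n in range(1, n_max + 1):
            running_sum += rng.gauss(0.0, 1.0)    # H0 true: mean 0, known sigma = 1
            if n % look_every == 0 and abs(running_sum / sqrt(n)) > z_crit:
                rejected_at_any_look = True       # a peeker would have stopped here
        if abs(running_sum / sqrt(n_max)) > z_crit:
            fixed_rejects += 1                    # single test at the planned n
        if rejected_at_any_look:
            peeking_rejects += 1
    return fixed_rejects / n_sims, peeking_rejects / n_sims

fixed_rate, peeking_rate = peeking_simulation()
print(fixed_rate, peeking_rate)
```

With ten looks the peeking rate comes out several times the nominal 5%, which is the gap that group-sequential boundaries and always-valid methods are designed to close.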

20.3 Bandit alternatives

Multi-armed bandits (Thompson sampling, which is Bayesian, and UCB, which is frequentist) allocate traffic adaptively to better-performing arms while learning, trading off exploration against exploitation. They yield higher cumulative reward, but clean post-hoc effect estimates are harder to obtain.

Software: Optimizely, GrowthBook, statsmodels, scipy.stats, Eppo, Statsig.

21. Reproducibility crisis and the p-value debate

A wave of evidence from the mid-2000s through the 2010s revealed widespread non-replication in medicine (Ioannidis 2005, “Why Most Published Research Findings Are False”), psychology (Open Science Collaboration 2015), and biology. Core diagnoses:

p-hacking — try many tests, report the significant ones.
Garden of forking paths — analysis choices contingent on the data (Gelman-Loken).
Publication bias — null results unpublished; meta-analyses biased.
Underpowered studies — n too small; significant findings overestimate effect size (Type-M error, Gelman).
Misinterpreted p-values — treated as posterior probabilities of H₀.

21.1 ASA statement (2016)

The American Statistical Association published six principles, including: p-values do not measure the probability of the hypothesis; p < 0.05 does not establish importance; reporting only p ignores effect size and uncertainty.

21.2 Reforms

Pre-registration — lock the analysis plan before seeing the data.
Registered reports — peer review the design, not the result.
Effect-size + CI reporting alongside p.
Multiverse analysis — report results across reasonable analysis choices.
Bayes factors and likelihood-based inference — see [[Math/bayesian-inference]].
Open data + open code — independent reproducibility.

22. Software

22.1 R

stats (built-in): t.test, chisq.test, wilcox.test, kruskal.test, aov, lm, glm, cor.test, ks.test.
lme4 — mixed-effects models (Bates et al.).
survival — Cox PH, Kaplan-Meier (Therneau).
pwr — power analysis.
multcomp — multiple comparisons (Tukey, Dunnett).
boot — bootstrap.

22.2 Python

scipy.stats — distributions, tests (t, F, χ², Mann-Whitney, KS, etc.).
statsmodels — GLM, mixed models, ANOVA, time series, survival.
pingouin — friendly stats wrapper.
lifelines — survival analysis.
scikit-posthocs — post-hoc multiple comparisons.
scikit-learn — LogisticRegression, GaussianMixture (EM under the hood).

22.3 Bayesian alternatives

PyMC (Python) — NUTS / HMC sampler.
Stan (R / Python / Julia) — Hamiltonian Monte Carlo.
NumPyro — JAX-backed Bayesian inference.

See [[Math/bayesian-inference]] for the Bayesian counterpart and conversion between p-values and Bayes factors.

23. Common pitfalls

p-hacking — running many tests until one gives p < 0.05; pre-register or correct.
Statistical vs practical significance — large n can make trivial effects significant; report effect size + CI, not just p.
No multiple-comparisons correction — m tests at α each inflates FWER to ~mα.
Assumption violations — normality (use bootstrap / non-parametric), equal variance (use Welch’s t), independence (use mixed models or GEE).
Underpowered studies — n too small; observed significant effects are inflated (Type-M / Type-S error per Gelman).
Optional stopping — peeking at p during data collection inflates Type-I; use sequential designs.
Confusing CI with credible interval — the 95% CI is not P(θ ∈ interval) = 0.95. Frequentist coverage statement only.
Treating “fail to reject” as “accept H₀” — absence of evidence is not evidence of absence; report power and CIs.
Cherry-picking the test — Welch vs Student, parametric vs non-parametric, decided after seeing the data.
Ignoring dependence structure — clustered, repeated-measures, or time-series data violate iid; use hierarchical / GEE / time-series methods.
Reporting p without direction — “p = 0.04” without saying which group is higher.
MLE of variance — using σ̂² = (1/n)Σ(xᵢ − x̄)² when the unbiased estimator is needed.

24. Cross-references

[[Math/probability-fundamentals]] — distributions, CLT, LLN, joint/conditional probability that underpin all tests.
[[Math/bayesian-inference]] — posterior-based counterpart; Bayes factors, credible intervals, MCMC.
[[Math/_index]] — math reference index.
[[Engineering/reliability-engineering]] — Weibull fitting, accelerated life testing, MTBF inference.
[[Engineering/six-sigma]] — DMAIC Analyze phase uses ANOVA, hypothesis tests, capability indices.

25. Citations and further reading

Wasserman L. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004 — chapters 6–13 cover hypothesis testing, CIs, MLE, GLM in compact form.
Casella G, Berger RL. Statistical Inference, 2nd ed. Duxbury, 2001 — graduate-level reference, exhaustive on MLE properties + decision theory.
Lehmann EL, Romano JP. Testing Statistical Hypotheses, 4th ed. Springer, 2022 — canonical hypothesis-testing monograph.
Fisher RA. “On the Mathematical Foundations of Theoretical Statistics.” Phil Trans R Soc A, 1922 — introduces MLE, consistency, efficiency, information.
Neyman J, Pearson ES. “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Phil Trans R Soc A, 1933 — Neyman-Pearson lemma.
Student. “The Probable Error of a Mean.” Biometrika, 1908 — Gosset’s t-distribution.
Welch BL. “The Generalization of Student’s Problem When Several Different Population Variances Are Involved.” Biometrika, 1947.
Wilks SS. “The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses.” Ann Math Stat, 1938.
Dempster AP, Laird NM, Rubin DB. “Maximum Likelihood from Incomplete Data via the EM Algorithm.” J R Stat Soc B, 1977.
Cox DR. “Regression Models and Life-Tables.” J R Stat Soc B, 1972 — proportional hazards.
Nelder JA, Wedderburn RWM. “Generalized Linear Models.” J R Stat Soc A, 1972.
Kaplan EL, Meier P. “Nonparametric Estimation from Incomplete Observations.” JASA, 1958.
Benjamini Y, Hochberg Y. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” J R Stat Soc B, 1995.
Holm S. “A Simple Sequentially Rejective Multiple Test Procedure.” Scand J Stat, 1979.
Storey JD. “A Direct Approach to False Discovery Rates.” J R Stat Soc B, 2002.
Efron B. “Bootstrap Methods: Another Look at the Jackknife.” Ann Stat, 1979.
Hansen LP. “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica, 1982.
Ioannidis JPA. “Why Most Published Research Findings Are False.” PLOS Medicine, 2005.
Wasserstein RL, Lazar NA. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician, 2016.
Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Erlbaum, 1988.
Open Science Collaboration. “Estimating the Reproducibility of Psychological Science.” Science, 2015.
Gelman A, Loken E. “The Statistical Crisis in Science.” American Scientist, 2014 — the garden of forking paths.
