Information Theory

1. At a glance

Claude E. Shannon founded the field with the 1948 Bell System Technical Journal paper “A Mathematical Theory of Communication”. The framework quantifies information, randomness, and the fundamental limits of compression and communication over noisy channels. Three central quantities anchor the theory:

Entropy H(X) — the average uncertainty (or surprise) in a random variable.
Mutual information I(X;Y) — the information one variable carries about another.
Channel capacity C — the maximum reliable bit-rate over a noisy channel.

Applications:

Data compression — zip / gzip / deflate, ZSTD, Brotli, LZ4, JPEG / JPEG XL, H.264 / H.265 / H.266 / AV1 video, MP3 / AAC / Opus audio. Theoretical lossless limit = source entropy.
Error-correcting codes — Hamming, Reed-Solomon (CDs, QR, RAID-6), convolutional + Viterbi, turbo, LDPC (WiFi 6, 5G data), Polar (5G control).
Statistical inference — MLE as cross-entropy minimization; minimum description length; Bayesian Occam’s razor.
Machine learning — cross-entropy classification loss, KL in VAE / RL / distillation, mutual-information objectives in contrastive learning (InfoNCE, CLIP), Information Bottleneck for representation learning.
Communication system design — channel capacity sets the bit-rate ceiling (Shannon-Hartley); modern codes operate within fractions of a dB of it.

The unifying claim: information is a measurable resource, and Shannon’s theorems give tight bounds on how much of it can be stored, sent, or recovered.

2. Entropy (Shannon entropy)

For a discrete random variable X with pmf p(x), the Shannon entropy is

H(X) = − Σ_x p(x) · log p(x).

Units depend on the logarithm base:

log₂ → bits (binary digits, “shannons”).
log_e → nats (natural units).
log₁₀ → hartleys (or “bans”).

Conversion: 1 nat ≈ 1.4427 bits.

Properties:

H(X) ≥ 0, with equality iff X is deterministic.
For X taking values in a finite alphabet of size |X|, H(X) ≤ log|X|, with equality iff p is uniform — maximum entropy is achieved by the uniform distribution.
Continuous and concave in p.
Expansion property: adding a zero-probability outcome does not change H.
Grouping axiom: H of a composite outcome decomposes into H of the macro-choice plus a weighted average of H of the sub-choices.

Examples:

Fair coin: H = 1 bit.
Biased coin with p = 0.1, 0.9: H ≈ 0.469 bits.
Fair 6-sided die: H = log₂ 6 ≈ 2.585 bits.
English text (per character, after redundancy): ≈ 1.0 — 1.5 bits, Shannon’s 1951 estimate; the ASCII representation uses 7 bits, so the compression ratio achievable is ~5–7×.

The binary entropy function H_b(p) = −p·log p − (1−p)·log(1−p) appears everywhere; max 1 bit at p = 0.5; symmetric about 0.5.

3. Joint, conditional, mutual information

Joint entropy:

H(X,Y) = − Σ_x Σ_y p(x,y) · log p(x,y).

Conditional entropy — uncertainty in Y given knowledge of X:

H(Y|X) = − Σ_x Σ_y p(x,y) · log p(y|x) = Σ_x p(x) · H(Y|X = x) = H(X,Y) − H(X).

Chain rule:

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).

Generalizes: H(X₁,…,X_n) = Σᵢ H(Xᵢ | X₁,…,X_{i−1}).

Mutual information — the reduction in uncertainty about Y given X (and vice versa, symmetric):

I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X,Y) = Σ_x Σ_y p(x,y) · log [p(x,y) / (p(x) · p(y))] = D( p(x,y) ‖ p(x) · p(y) ).

The last line — MI as the KL divergence between the joint and the product of marginals — is the most useful identity.

Properties:

I(X;Y) ≥ 0, with equality iff X and Y are independent.
I(X;X) = H(X).
Symmetric: I(X;Y) = I(Y;X).
Data-processing inequality: if X → Y → Z is a Markov chain, then I(X;Z) ≤ I(X;Y). You cannot increase information about X by post-processing Y.

Conditional mutual information:

I(X;Y|Z) = H(X|Z) − H(X|Y,Z).

Chain rule for MI:

I(X₁,…,X_n; Y) = Σᵢ I(Xᵢ ; Y | X₁,…,X_{i−1}).

4. KL divergence (relative entropy)

For distributions p and q over the same support,

D(p ‖ q) = Σ_x p(x) · log [p(x) / q(x)].

(Convention: 0·log(0/q) = 0; p·log(p/0) = ∞ when p > 0.)

Properties:

Gibbs’ inequality: D(p ‖ q) ≥ 0, with equality iff p = q (almost everywhere).
Asymmetric: D(p ‖ q) ≠ D(q ‖ p) in general — not a metric. No triangle inequality.
Convex in the pair (p, q) jointly.
Pinsker’s inequality: total-variation distance ≤ √(½ D(p ‖ q)) — links KL to TV.

Forward vs. reverse KL:

D(p ‖ q) — “forward” (or “M-projection”). Penalizes q for missing mass where p has it (“mode-covering”). Used in supervised cross-entropy.
D(q ‖ p) — “reverse” (or “I-projection”). Penalizes q for putting mass where p has none (“mode-seeking”). Used in variational inference (the standard VAE / ELBO objective).

Uses:

Supervised classification loss = cross-entropy = H(p) + D(p ‖ q); since H(p) is constant in q, minimizing cross-entropy is equivalent to minimizing forward KL.
Variational inference: minimize D(q ‖ p) where p is the (intractable) posterior and q is a tractable family — yields ELBO maximization.
RLHF / preference modeling: KL penalty against a reference policy prevents drift.
Knowledge distillation (Hinton 2015): student matches teacher softened logits via cross-entropy / KL.

5. Cross-entropy

H(p, q) = − Σ_x p(x) · log q(x) = H(p) + D(p ‖ q).

Since H(p) is independent of the model q, minimizing cross-entropy over q is equivalent to minimizing KL — and in the iid case it is equivalent to maximum-likelihood estimation:

argmin_q H(p_data, q) = argmax_q E_{x ~ p_data}[ log q(x) ] = MLE.

Why this matters in ML: nearly every supervised classifier, every softmax head, every LLM token predictor, is trained against a cross-entropy loss. The “log-loss” name in classical statistics is the same object. The per-token cross-entropy of a language model, in bits, is its perplexity PPL = 2^H — directly comparable across models.

6. Differential entropy

For a continuous random variable X with density f(x),

h(X) = − ∫ f(x) · log f(x) dx.

Caveats:

Not always non-negative — h(X) can be negative for tightly concentrated densities (e.g. a Gaussian with σ² < 1/(2πe)).
Not invariant under coordinate change: h(aX) = h(X) + log|a|, so units matter.
The right object for a “true” generalization of H is relative entropy D(f ‖ g) — that quantity is non-negative and reparameterization-invariant.

Useful closed forms (in nats):

Uniform on [a, b]: h = ln(b − a).
Univariate Gaussian N(μ, σ²): h = ½ ln(2πe σ²).
Multivariate Gaussian N(μ, Σ) in ℝᵈ: h = ½ ln((2πe)ᵈ |Σ|).
Exponential rate λ: h = 1 − ln λ.

7. Maximum entropy distributions

Subject to constraints, the distribution that maximizes H (or h) is the least committed — it assumes no structure beyond the constraints. This is the MaxEnt principle (Jaynes 1957). Standard solutions:

Domain	Constraints	MaxEnt distribution
Finite alphabet	none (just Σ p = 1)	Uniform
Continuous on [a, b]	none	Uniform on [a, b]
Continuous on (−∞, ∞)	mean μ, variance σ²	Gaussian N(μ, σ²)
Continuous on [0, ∞)	mean μ	Exponential rate 1/μ
Discrete on ℕ	mean μ	Geometric
Continuous on ℝᵈ	mean μ, covariance Σ	Multivariate Gaussian
Continuous on ℝᵈ, fixed Σᵢ E[fᵢ(x)]	exponential family with sufficient stats fᵢ

This last row is why exponential-family distributions are the natural model class — they are the MaxEnt distributions for a fixed set of moment constraints, and they include Gaussian, Bernoulli, Poisson, gamma, Dirichlet, categorical, etc. GLMs are precisely the linear-in-natural-parameter members of this family (Berger 1996).

8. Source coding theorem (Shannon 1948)

For an iid source X over a finite alphabet, the expected codeword length L of any uniquely decodable code satisfies

L ≥ H(X).

Conversely, codes exist with L < H(X) + 1. So entropy is the precise lower bound for lossless compression. For block coding (n-symbol blocks), the per-symbol overhead vanishes:

H(X) ≤ L_n / n < H(X) + 1/n.

Kraft’s inequality characterizes prefix codes: a code with lengths {ℓ₁, …, ℓ_n} over an alphabet of size D exists iff Σᵢ D^(−ℓᵢ) ≤ 1.

For non-iid (stationary ergodic) sources, the bound generalizes to the entropy rate H̄(X) = lim_{n→∞} (1/n) H(X₁,…,X_n).

9. Practical compression

Lossless

Huffman coding (Huffman 1952) — optimal prefix code given a known symbol distribution. Greedy bottom-up tree construction. Per-symbol overhead ≤ 1 bit above entropy; worse with skewed distributions. Used in JPEG (default), MP3, deflate / gzip, PNG.
Arithmetic coding (Rissanen 1976, Witten/Neal/Cleary 1987) — encodes the entire message as a single fractional number in [0,1) refined by successive sub-intervals. Asymptotically matches entropy with no per-symbol slack. Used in JPEG (alternate / JBIG / JPEG 2000), H.264 (CABAC), H.265, AV1. Patent encumbrance kept it out of early standards.
ANS / range coding (Duda 2014, “Asymmetric Numeral Systems”) — combines the compression efficiency of arithmetic coding with the speed of Huffman. Used in ZSTD (Facebook 2016), Brotli, JPEG XL, LZFSE.
LZ77 / LZ78 / LZW (Lempel-Ziv 1977 / 1978; Welch 1984) — dictionary-based; replace repeated substrings with (offset, length, next-symbol) tuples. Foundation of:
- gzip / deflate (RFC 1951) = LZ77 + Huffman; ZIP, PNG, HTTP gzip.
- LZ4 (Collet 2011) — ultra-fast LZ77 variant; ~500 MB/s decode, used in real-time pipelines, ClickHouse, Kafka.
- ZSTD (Collet/Facebook 2016) — LZ77 + ANS/FSE + dictionary; 3–5× faster than gzip at the same ratio; default in Linux kernel modules, Btrfs, ZFS.
- Brotli (Google 2013) — LZ77 + Huffman + 2nd-order context modeling + a 120 KB static dictionary; ~20 % smaller than gzip for web text; used in HTTP Content-Encoding: br, WOFF2 fonts.
Burrows-Wheeler Transform (Burrows & Wheeler 1994) — block-sorting reversible permutation that clusters similar contexts; followed by move-to-front + Huffman. Used in bzip2. Slower than LZ but better ratio on text.
PPM (Cleary & Witten 1984, “Prediction by Partial Matching”) — context-mixing predictor + arithmetic coding; best-in-class ratios but slow. PAQ family (Mahoney 2002+) pushed it further with neural-network mixers.

Lossy

Lossy codecs exploit perceptual / statistical redundancy beyond the entropy floor:

JPEG (1992) — block-DCT + quantization + Huffman / arithmetic.
JPEG 2000 (2000) — wavelet (CDF 9/7) + EBCOT.
JPEG XL (2022) — modular + variable-block DCT + ANS; supports lossless and lossy; meant as a JPEG replacement.
WebP (2010) — VP8 intra coding + Huffman.
MP3 (ISO/IEC 11172-3, 1993) — perceptual model (masking), MDCT, Huffman.
AAC (1997, MPEG-2) — MDCT, perceptual model, Huffman; replaces MP3.
Opus (RFC 6716, 2012) — SILK + CELT hybrid; codecs of choice for VoIP / WebRTC.
H.264 / AVC (2003), H.265 / HEVC (2013), H.266 / VVC (2020) — block-based motion-compensated DCT + CABAC arithmetic coding; ~50 % bit-rate reduction per generation at constant quality.
AV1 (AOMedia 2018) — open-license successor to HEVC; multi-symbol arithmetic coder, advanced intra prediction; royalty-free.

The theoretical lossy floor is given by rate-distortion theory (Section 13).

10. Channel capacity

A discrete memoryless channel is described by transition probabilities p(y|x). The capacity is

C = max_{p(x)} I(X; Y) bits per channel use.

C is the supremum of rates R = (log M)/n at which codes of size M and block-length n exist with vanishing error probability.

Important channels:

Binary Symmetric Channel (BSC) with crossover (flip) probability p:
- C = 1 − H_b(p) bits per use; achieved by uniform input.
- At p = 0: C = 1 bit (perfect channel); at p = 0.5: C = 0 (completely random output).
Binary Erasure Channel (BEC) with erasure probability ε:
- C = 1 − ε; achieved by uniform input.
AWGN (Additive White Gaussian Noise) channel with bandwidth B Hz, signal power S, noise spectral density N₀:
- Shannon-Hartley theorem: C = B · log₂(1 + S/N) bits/s, where SNR = S/N = S/(N₀ B).
- This is the foundational limit of analog and digital communications. Doubling bandwidth doubles C linearly; doubling SNR adds 1 bit/s/Hz only at high SNR.
MIMO (multiple-input multiple-output) channels — capacity scales linearly with min(N_t, N_r) for the rich-scattering ideal case (Telatar 1995, Foschini 1996); the basis of WiFi, LTE, 5G NR throughput.
Multi-user channels — capacity regions rather than scalar capacities: MAC (multiple-access), broadcast, interference channels are still partially open problems.

11. Channel coding theorem (Shannon 1948)

Statement: For any rate R < C and any ε > 0, there exist (n, M = ⌈2^{nR}⌉) codes with average error probability < ε for sufficiently large n. Conversely, no such codes exist with R > C.

This is the deepest result of the field: reliable communication is possible at any rate below capacity, despite arbitrary channel noise. Random coding suffices for the existence proof — but explicit, computable codes that approach C were the work of the next 60 years.

The converse: for R > C, error probability is bounded away from 0 — a “strong converse” gives exponential decay.

12. Error-correcting codes

Hamming codes (Hamming 1950) — (7,4) single-error-correcting; generalizes to (2ᵐ − 1, 2ᵐ − 1 − m). Linear block code with minimum distance 3.
Reed-Solomon (Reed & Solomon 1960) — non-binary block code over GF(2ᵐ); corrects up to ⌊(n−k)/2⌋ symbol errors. Used in CDs, DVDs, Blu-ray, QR codes, RAID-6, deep-space communications, DSL.
BCH codes (Bose-Chaudhuri-Hocquenghem 1959) — superclass including Reed-Solomon; cyclic codes correcting multiple errors.
Convolutional codes + Viterbi decoder (Viterbi 1967) — state-machine encoding, maximum-likelihood sequence decoding via dynamic programming over the trellis. Used in GSM, satellite links, deep-space.
Turbo codes (Berrou, Glavieux & Thitimajshima 1993) — parallel concatenation of convolutional codes + iterative (BCJR) decoding; the first to operate within ~1 dB of Shannon capacity. Adopted in 3G/4G mobile.
LDPC codes (Gallager 1962, rediscovered by MacKay 1996) — sparse parity-check matrices; iterative belief-propagation decoding. Near-capacity performance. Used in WiFi 802.11n/ac/ax, 5G NR data channels, DVB-S2 satellite TV, 10GBASE-T Ethernet, NAND flash ECC.
Polar codes (Arıkan 2009) — provably capacity-achieving as block length → ∞ for any symmetric BMS channel; uses channel polarization. 5G NR control channel standard.
Fountain / rateless codes (LT codes Luby 2002; Raptor codes Shokrollahi 2006) — encoder produces an effectively unlimited stream; decoder recovers from any (1+ε)·k symbols. Used in DVB-H, 3GPP MBMS multicast.

13. Rate-distortion theory

For a source X and a fidelity criterion d(x, x̂) (e.g. squared error), the rate-distortion function is

R(D) = min_{p(x̂ | x) : E[d(X, X̂)] ≤ D} I(X; X̂).

R(D) is the minimum bit-rate required to represent X with average distortion at most D. The dual D(R) is the achievable distortion at rate R.

Examples:

Gaussian source N(0, σ²), squared-error distortion: R(D) = ½ log(σ²/D) for D ≤ σ²; 0 otherwise.
Binary source p, Hamming distortion: R(D) = H_b(p) − H_b(D) for D ≤ p; 0 otherwise.

R(D) is the theoretical foundation of all lossy codecs: JPEG, MP3, H.265, AV1. Practical codecs operate well above R(D) — there is still headroom, which is why each codec generation chips away at the gap.

14. Information-theoretic machine learning

Cross-entropy loss = MLE for softmax classifiers and LLM token prediction. Per-token NLL in nats is the model’s loss; per-token bits = NLL / ln 2 = log₂ PPL.
KL divergence appears in:
- VAE ELBO (Kingma & Welling 2013): L = E_q[log p(x|z)] − D(q(z|x) ‖ p(z)) — reconstruction minus KL to prior. The KL is a regularizer that keeps the latent posterior near the prior.
- RL policy gradient with KL penalty: TRPO (Schulman 2015), PPO (Schulman 2017), RLHF — bound on how far the new policy can drift.
- Knowledge distillation (Hinton et al. 2015): student matches teacher’s softened-logit distribution via KL.
- Label smoothing (Szegedy 2016) — replace one-hot targets with (1−ε)·δ + ε·uniform; equivalent to adding KL to uniform.
Mutual information maximization:
- InfoNCE (van den Oord et al. 2018) — lower bound on I(X;Y) used in contrastive representation learning (SimCLR, MoCo, CLIP). Maximize log [f(x,y_+) / Σ f(x,y)].
- CLIP (Radford et al. 2021) — symmetric InfoNCE between image and text encoders.
- DeepInfoMax (Hjelm et al. 2018) — maximize MI between input and latent.
Information Bottleneck (Tishby, Pereira & Bialek 1999; Tishby & Zaslavsky 2015): find a representation Z that compresses X while preserving information about Y — minimize I(X; Z) − β · I(Z; Y). Lagrangian principle for representation learning; proposed as a theoretical lens on deep-net training dynamics.
Variational free energy F = E_q[log q − log p] = D(q ‖ p) − log Z. ELBO = −F. Foundation of variational Bayes, VAEs, normalizing flows, EM.
MDL — Minimum Description Length (Rissanen 1978) — model selection via the shortest total description of (model + data given model). A code-length formalization of Occam’s razor; equivalent to Bayesian marginal likelihood at large n. Penalizes complexity automatically.
Maximum-entropy RL (Ziebart 2010, Soft Actor-Critic / SAC: Haarnoja et al. 2018) — augment reward with α · H(π(·|s)); produces stochastic, exploratory policies; the policy gradient picks up KL-style terms.
Bits-back coding (Wallace 1990; Frey & Hinton 1996) — connects latent-variable models to lossless compression: VAEs can asymptotically achieve the cross-entropy bound; basis of “compression as a benchmark for AGI” arguments (Hutter Prize).

15. Algorithmic / Kolmogorov complexity

Kolmogorov complexity K(x) (Kolmogorov 1965; independently Solomonoff 1964 and Chaitin 1966) is the length of the shortest program on a universal Turing machine that outputs x.

Uncomputable — there is no algorithm that computes K(x) in general (related to the halting problem).
Invariant up to a constant across universal machines.
Conditional complexity K(x | y) — shortest program producing x given y as input.
Algorithmic mutual information I_K(x; y) = K(x) − K(x | y).

K(x) is the true information content of an individual object; Shannon entropy is its probabilistic counterpart. The MDL principle and Solomonoff induction are practical approximations.

Bayesian Occam’s razor: model classes with shorter descriptions have higher prior probability; corresponds to the BIC and similar penalties.

16. Fisher information and the Cramér-Rao bound

For a parametric family p(x; θ), the Fisher information is

I(θ) = E_θ [ (∂ log p(X; θ) / ∂θ)² ] = − E_θ [ ∂² log p(X; θ) / ∂θ² ].

Cramér-Rao Lower Bound (CRLB): for any unbiased estimator θ̂ of θ,

Var(θ̂) ≥ 1 / I(θ).

For multivariate θ, the bound becomes Cov(θ̂) ⪰ I(θ)⁻¹ (Loewner order on PSD matrices).

MLE is asymptotically efficient: under regularity, √n (θ̂_MLE − θ) → N(0, I(θ)⁻¹), so MLE achieves the CRLB asymptotically.

Information geometry treats I(θ) as a Riemannian metric on the manifold of distributions — leading to natural gradient methods (Amari 1998) which use the Fisher metric to define geometry-aware updates; basis of TRPO and K-FAC optimizers.

Connection to KL: D(p_θ ‖ p_{θ+dθ}) ≈ ½ dθᵀ I(θ) dθ — Fisher info is the local quadratic approximation of KL.

17. Entropy estimation from samples

Estimating H, MI, or D from a finite sample is non-trivial, especially in high dimensions.

Plug-in (naive) estimator H_n = − Σ p̂(x) log p̂(x) — biased downward by ~ (|X| − 1) / (2n) for discrete X. Underestimates entropy.
Miller-Madow (Miller 1955) — bias-corrected plug-in: H_MM = H_n + (m̂ − 1) / (2n), where m̂ is the number of observed bins.
NSB estimator (Nemenman, Shafee & Bialek 2002) — Bayesian estimator with Dirichlet-mixture prior; near-optimal at small n.
Continuous entropy — KDE-based or nearest-neighbor.
Kozachenko-Leonenko (1987) — nearest-neighbor differential entropy estimator: h̄ ≈ d·E[log r_k(X)] + log(c_d) + ψ(n) − ψ(k), where r_k is the k-NN distance, c_d is the unit-ball volume in d-dim, ψ is digamma.
Kraskov-Stögbauer-Grassberger / KSG (Kraskov, Stögbauer & Grassberger 2004) — k-NN-based MI estimator using max-norm; popular for continuous MI in physics and neuroscience.
MINE — Mutual Information Neural Estimation (Belghazi et al. 2018) — neural-network lower bound on MI via Donsker-Varadhan representation: I(X;Y) ≥ sup_T E_p[T] − log E_q[e^T].
InfoNCE (van den Oord 2018) — sample-efficient lower bound: I(X;Y) ≥ log K − L_NCE.

Caveat: MI estimation in high dimensions is fundamentally hard; estimators with neural networks can be biased and high-variance. McAllester & Stratos (2018) showed that any distribution-free lower bound on MI requires exponentially many samples for high MI values.

18. Quantum information theory (brief)

Von Neumann entropy: S(ρ) = − Tr(ρ log ρ) for density matrix ρ; classical Shannon entropy for diagonal ρ.
Quantum mutual information I(A:B) = S(A) + S(B) − S(AB); can exceed the classical bound 2·min(S(A), S(B)).
Holevo bound — accessible classical information from a quantum source is at most χ.
No-cloning theorem (Wootters & Zurek 1982) — no unitary can clone an arbitrary unknown quantum state; consequence of linearity.
Quantum channel capacity — multiple distinct capacities (classical C, quantum Q, entanglement-assisted C_E); regularization required.
Quantum key distribution (BB84: Bennett & Brassard 1984; E91: Ekert 1991) — protocols whose security rests on quantum mechanics rather than computational hardness; deployed in commercial systems (ID Quantique, Toshiba).

19. Applications

Compression — every lossless and lossy codec is grounded in entropy / rate-distortion. ZIP, ZSTD, JPEG, H.265, MP3, Opus. Hutter Prize uses text compression as an AGI benchmark.
Communication — modulation choice + coding rate + power are jointly optimized against Shannon-Hartley. 5G NR uses LDPC (data) + Polar (control), MIMO + OFDM, adaptive coding & modulation that tracks SNR.
Cryptography — perfect secrecy (Shannon 1949) requires the key entropy ≥ message entropy → one-time pad is the only perfectly-secret cipher. Modern ciphers settle for computational secrecy. Entropy is also the foundation of cryptographic RNGs (NIST SP 800-90) and password-strength metrics.
ML loss functions — cross-entropy classification, KL regularization in VAE / RL, MI bounds in contrastive learning, distillation.
Variational inference + VAE — minimize D(q ‖ p) → ELBO maximization; foundation of Bayesian deep learning.
Reinforcement learning — entropy regularization (SAC, A3C) for exploration; KL-constrained policy updates (TRPO, PPO); reward = log-likelihood improvement (intrinsic motivation, Pathak 2017).
Decision trees — split criterion = information gain = H(parent) − weighted-sum H(children); ID3 (Quinlan 1986), C4.5, CART (Gini is an alternative).
Feature selection — mutual information between feature and target as a model-agnostic relevance score; mRMR (Peng 2005) selects features maximizing relevance while minimizing redundancy.
Neuroscience — efficient-coding hypothesis (Barlow 1961, Berkes et al. 2011) posits that cortex maximizes MI between stimulus and neural response under metabolic constraints; predicts sparse coding, decorrelation, divisive normalization.
Genomics — DNA sequence compression (CRAM, MFCompress), motif discovery via relative entropy of position-weight matrices, alignment scoring via log-odds (information-content).
Networks & PageRank — entropy rate of random walks; spectral / informational centrality measures.

20. Pitfalls

Units confusion: nats vs bits vs hartleys. ML loss functions usually report nats (natural log); compression literature uses bits. PPL = exp(NLL) only if NLL is in nats.
Sign errors in cross-entropy: H(p, q) = − Σ p log q (with the minus sign). A frequent off-by-sign bug.
KL is not a metric — no triangle inequality, asymmetric. Don’t use it as a distance for clustering without care; symmetrized variants (Jensen-Shannon) give a true (square-root) metric.
Differential entropy can be negative — e.g. N(0, σ²) with σ² < 1/(2πe) has h < 0. Don’t interpret h(X) < 0 as “less than no uncertainty”; the right invariant is relative entropy.
MI estimation from few samples — high bias and high variance; in high dimensions, naive plug-in is unusable. Use KSG, MINE, InfoNCE, and be aware of their failure modes.
Confusing entropy with diversity — H is one diversity measure; Simpson, Gini, and Renyi-α entropies give others. Don’t equate without justification.
Asymptotic source/channel theorems require long blocks — in finite-block regimes (Polyanskiy, Poor & Verdú 2010 normal approximation), capacity is replaced by C − √(V/n)·Q⁻¹(ε) — non-trivially below C at short n.
Compression vs. encryption — encrypt before compressing is a mistake (encrypted data is incompressible); always compress, then encrypt.
Conditioning on data does not always reduce entropy for differential entropy in pathological cases — beware when generalizing intuition.

21. Cross-references

probability-fundamentals — pmf/pdf, expectation, independence; prerequisites.
_index — math hub.
hypothesis-testing-mle — MLE = cross-entropy minimization; CRLB.
bayesian-inference — KL minimization → variational Bayes; MDL ↔ marginal likelihood.
gradient-descent-variants — natural gradient uses Fisher metric.
cryptography-fundamentals — entropy in KDFs, RNGs, perfect secrecy.
transformer-architecture — cross-entropy LM loss, perplexity, KL in RLHF.
op-amp-variants — Shannon-Hartley B·log(1 + SNR).
rf-design — channel capacity in modulation / coding choice.

22. Citations

Founding paper:

Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3): 379-423; 27(4): 623-656. — The single paper that founded the field; entropy, source coding, channel coding, capacity all introduced in one work.
Shannon, C. E. (1949). Communication Theory of Secrecy Systems. BSTJ 28(4): 656-715. — Perfect-secrecy / one-time-pad result.

Textbooks:

Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley. — The standard graduate text.
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge UP. — Free PDF; integrates information theory with Bayesian inference and ML. The pedagogical classic.
Csiszár, I. & Körner, J. (2011). Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge UP. — Rigorous treatment of source/channel theorems.
Polyanskiy, Y. & Wu, Y. (2024). Information Theory: From Coding to Learning. Cambridge UP. — Modern, ML-aware.

Compression:

Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE 40(9): 1098-1101.
Lempel, A. & Ziv, J. (1977). A Universal Algorithm for Sequential Data Compression. IEEE Trans. Inf. Theory 23(3): 337-343.
Ziv, J. & Lempel, A. (1978). Compression of Individual Sequences via Variable-Rate Coding. IEEE Trans. Inf. Theory 24(5): 530-536.
Welch, T. A. (1984). A Technique for High-Performance Data Compression. IEEE Computer 17(6): 8-19.
Burrows, M. & Wheeler, D. J. (1994). A Block-Sorting Lossless Data Compression Algorithm. DEC SRC Tech Report 124.
Rissanen, J. (1976). Generalized Kraft Inequality and Arithmetic Coding. IBM J. Res. Dev. 20(3): 198-203.
Duda, J. (2014). Asymmetric Numeral Systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding. arXiv:1311.2540.

Channel coding:

Hamming, R. W. (1950). Error Detecting and Error Correcting Codes. BSTJ 29(2): 147-160.
Reed, I. S. & Solomon, G. (1960). Polynomial Codes over Certain Finite Fields. J. SIAM 8(2): 300-304.
Viterbi, A. J. (1967). Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Trans. Inf. Theory 13(2): 260-269.
Gallager, R. G. (1962). Low-Density Parity-Check Codes. IRE Trans. Inf. Theory 8(1): 21-28.
Berrou, C., Glavieux, A. & Thitimajshima, P. (1993). Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes. Proc. IEEE ICC.
Arıkan, E. (2009). Channel Polarization: A Method for Constructing Capacity-Achieving Codes for Symmetric Binary-Input Memoryless Channels. IEEE Trans. Inf. Theory 55(7): 3051-3073.

ML / inference:

Jaynes, E. T. (1957). Information Theory and Statistical Mechanics. Phys. Rev. 106(4): 620-630. — MaxEnt principle.
Rissanen, J. (1978). Modeling by Shortest Data Description. Automatica 14(5): 465-471. — MDL.
Tishby, N., Pereira, F. C. & Bialek, W. (1999). The Information Bottleneck Method. Proc. Allerton.
Tishby, N. & Zaslavsky, N. (2015). Deep Learning and the Information Bottleneck Principle. IEEE ITW.
Kingma, D. P. & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv:1312.6114.
Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
van den Oord, A., Li, Y. & Vinyals, O. (2018). Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748. — InfoNCE.
Belghazi, M. I. et al. (2018). Mutual Information Neural Estimation. ICML. — MINE.
Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML. — CLIP.

Estimation:

Kozachenko, L. F. & Leonenko, N. N. (1987). Sample Estimate of the Entropy of a Random Vector. Problems Inf. Trans. 23: 95-101.
Kraskov, A., Stögbauer, H. & Grassberger, P. (2004). Estimating Mutual Information. Phys. Rev. E 69: 066138.

Algorithmic complexity:

Solomonoff, R. J. (1964). A Formal Theory of Inductive Inference. Inf. Control 7(1): 1-22; 7(2): 224-254.
Kolmogorov, A. N. (1965). Three Approaches to the Quantitative Definition of Information. Problems Inf. Trans. 1(1): 1-7.
Chaitin, G. J. (1966). On the Length of Programs for Computing Finite Binary Sequences. J. ACM 13(4): 547-569.

Quantum:

Bennett, C. H. & Brassard, G. (1984). Quantum Cryptography: Public Key Distribution and Coin Tossing. Proc. IEEE Conf. Computers, Systems and Signal Processing, 175-179.
Nielsen, M. A. & Chuang, I. L. (2010). Quantum Computation and Quantum Information, 10th anniversary ed. Cambridge UP.

Compendium

Explorer

Information Theory — Math Reference

Information Theory

1. At a glance

2. Entropy (Shannon entropy)

3. Joint, conditional, mutual information

4. KL divergence (relative entropy)

5. Cross-entropy

6. Differential entropy

7. Maximum entropy distributions

8. Source coding theorem (Shannon 1948)

9. Practical compression

Lossless

Lossy

10. Channel capacity

11. Channel coding theorem (Shannon 1948)

12. Error-correcting codes

13. Rate-distortion theory

14. Information-theoretic machine learning

15. Algorithmic / Kolmogorov complexity

16. Fisher information and the Cramér-Rao bound

17. Entropy estimation from samples

18. Quantum information theory (brief)

19. Applications

20. Pitfalls

21. Cross-references

22. Citations

Graph View

Table of Contents