SVD, PCA & Spectral Methods — Math Reference

1. At a glance

The Singular Value Decomposition (SVD) is the most useful single tool in numerical linear algebra. It generalizes the eigen-decomposition to arbitrary rectangular matrices and is numerically stable for every matrix that fits in memory. A startling fraction of modern data analysis reduces, eventually, to an SVD or one of its near-relatives:

  • Principal Component Analysis (PCA) — directions of maximum variance.
  • Low-rank approximation — image/audio compression, denoising.
  • Pseudoinverse + least-squares solving when the system is rank-deficient or non-square.
  • Latent Semantic Analysis (LSA) and the early word embeddings.
  • Recommendation systems and the Netflix-prize matrix-completion machinery.
  • Procrustes alignment of two point clouds (shapes, embeddings, MRI scans).
  • Spectral clustering and PageRank — eigenvectors of graph operators.
  • Schmidt decomposition of bipartite quantum states.

Closely related supporting machinery covered below: kernel PCA, ICA, NMF, robust PCA, randomized SVD, t-SNE, and UMAP. The recurring theme: extract a small number of orthogonal (or near-orthogonal) directions that explain most of the structure, throw the rest away, and operate in the small space.

Cross-link: linear-algebra-essentials for the eigenvalue/orthogonal-matrix prerequisites and numerical-linear-algebra for the stability story behind the algorithms cited here.

2. SVD definition

For any real matrix A ∈ R^{m×n} of rank r, there exist orthogonal matrices U ∈ R^{m×m}, V ∈ R^{n×n} and a diagonal-ish matrix Σ ∈ R^{m×n} such that:

A = U · Σ · Vᵀ

where Σ has σ_1 ≥ σ_2 ≥ … ≥ σ_r > 0 on the diagonal and zeros elsewhere. The σ_i are the singular values; the columns of U are left singular vectors and the columns of V are right singular vectors.

Properties worth memorising:

  • σ_i ≥ 0 always, and they are real even when A is complex (in the complex case U, V are unitary rather than orthogonal).
  • σ_i² = λ_i(AᵀA) = λ_i(AAᵀ) — singular values are square roots of the nonzero eigenvalues of the Gram matrices.
  • Aᵀ · u_i = σ_i · v_i and A · v_i = σ_i · u_i — the SVD pairs the left and right singular vectors through the corresponding σ_i.
  • Rank of A equals the number of nonzero singular values.
  • Frobenius norm ‖A‖_F = √(Σ σ_i²); spectral norm ‖A‖_2 = σ_1.
  • Determinant of a square A: |det(A)| = Π σ_i.
  • Condition number κ_2(A) = σ_1 / σ_r (ratio of largest to smallest nonzero singular value).

Existence is guaranteed by the Spectral Theorem applied to the symmetric positive-semidefinite matrix AᵀA: orthonormal eigenvectors give V, square roots of eigenvalues give σ_i, and u_i = (1/σ_i) · A · v_i for the nonzero σ_i (the rest of U is any completion to an orthonormal basis).

The decomposition is essentially unique: singular values are uniquely determined (with multiplicities), the singular vectors are determined up to a sign for each simple singular value and up to an orthogonal mixing within each repeated-singular-value subspace. In practice software always returns some valid choice; downstream code that compares SVDs across runs must align signs explicitly.

The “economy SVD” or “thin SVD” drops the zero columns of Σ and the corresponding columns of U and V:

A = U_r · Σ_r · V_rᵀ,    U_r ∈ R^{m×r},  Σ_r ∈ R^{r×r},  V_r ∈ R^{n×r}

For m ≫ n (tall matrices like data tables) the economy form requires O(m·n) storage rather than O(m²). This is the form that should be requested when U is never multiplied by anything in R^m.

A pleasant identity that follows: A = Σ_{i=1..r} σ_i · u_i · v_iᵀ — the SVD is literally a sum of r rank-one outer products, ordered by descending importance. This dyadic form is the basis of every “truncated SVD” downstream method.

3. Geometric interpretation

A maps the unit sphere in R^n to an ellipsoid in R^m. The semi-axes of the ellipsoid have lengths σ_1, σ_2, …, σ_r and point in the directions U · e_1, U · e_2, …, U · e_r. Equivalently:

  • V · e_i (the i-th right singular vector) is a preimage direction in input space.
  • σ_i · U · e_i is its image — the corresponding ellipsoid semi-axis in output space.

Decomposition by stages: Vᵀ rotates the input frame to align with the principal input directions; Σ stretches along each axis by σ_i; U rotates the result to its final orientation in output space. Every linear map is therefore “rotate, stretch axis-by-axis, rotate” — the polar decomposition makes this even more explicit: A = U · Σ · Vᵀ = (U · Vᵀ) · (V · Σ · Vᵀ) = Q · P where Q is orthogonal and P is symmetric PSD.

Worked example — 2 × 2:

A = [ 3  0 ]
    [ 4  5 ]

AᵀA = [ 25  20 ]
      [ 20  25 ]

eigenvalues of AᵀA: 45, 5
σ_1 = √45 ≈ 6.708,  σ_2 = √5 ≈ 2.236

The right singular vectors come from the eigenvectors of AᵀA:

v_1 = (1/√2) · ( 1, 1 )ᵀ
v_2 = (1/√2) · ( 1,-1 )ᵀ

and the left singular vectors from u_i = (1/σ_i) · A · v_i. This tiny calculation captures the entire mechanism: form the Gram matrix, diagonalise it, take square roots, propagate. The professional routines never do this naively (squaring loses half the precision in the small singular values), but the conceptual chain is exactly that.

4. Computing the SVD

The classical algorithm is Golub-Kahan-Reinsch (Golub + Kahan 1965; Golub + Reinsch 1970): first reduce A to upper bidiagonal form using a sequence of Householder reflections from the left and right; then run an implicitly shifted QR iteration on the bidiagonal matrix to drive its off-diagonal entries to zero.

Modern LAPACK implementations:

  • dgesvd — original Golub-Reinsch driver.
  • dgesdd — divide-and-conquer variant; substantially faster for large matrices, slightly more memory.
  • dgejsv and dgesvj — Jacobi-based; the most accurate but the slowest. Use when σ_r / σ_1 is tiny and you really need every digit.

For symmetric eigenvalue subproblems (which appear inside the bidiagonal SVD), MRRR (Dhillon + Parlett 2004) computes eigenvalues and eigenvectors in O(n²) time with guaranteed orthogonality.

Asymptotic cost: O(min(m·n², m²·n)) for a full SVD of m × n. Constant factors for dgesdd are roughly 4·m·n·min(m,n) + 8·min(m,n)³ / 3 flops. Memory is O(m² + n²) if both U and V are formed; “economy SVD” (matlab svd(A,'econ'), numpy svd(A, full_matrices=False)) skips the trailing all-zero columns and costs O(m·n·min(m,n)).

Iterative alternatives when only a few singular triplets are needed:

  • Lanczos bidiagonalization (Lanczos 1950; Golub + Kahan 1965) — Krylov method, suitable for sparse A.
  • PROPACK (Larsen 1998) — high-quality reorthogonalised Lanczos SVD.
  • IRLBA (Baglama + Reichel 2005) — implicitly restarted Lanczos, the default in many R/Python packages for large sparse SVD.
  • LOBPCG (Knyazev 2001) — block preconditioned, well suited to top-k eigenpairs of AᵀA.

5. Truncated SVD

Keep only the top-k singular values and the corresponding singular vectors:

A_k = Σ_{i=1..k} σ_i · u_i · v_iᵀ

This is the rank-k truncated SVD. It costs O(m·n·k) to evaluate after the decomposition is in hand.

Eckart-Young-Mirsky theorem (Eckart + Young 1936; Mirsky 1960):

‖A − A_k‖_F = min { ‖A − B‖_F : rank(B) ≤ k } = √(Σ_{i>k} σ_i²)
‖A − A_k‖_2 = min { ‖A − B‖_2 : rank(B) ≤ k } = σ_{k+1}

i.e., A_k is the best rank-k approximation of A in both the Frobenius and spectral norms simultaneously. The result extends to any unitarily-invariant norm (Mirsky’s generalization). This is why “drop the smallest singular values” is the canonical low-rank approximation strategy.

Typical applications: image compression (an m × n image stored as (m + n + 1)·k floats rather than m · n), noise reduction (small σ_i carry mostly noise), and the low-rank component of robust PCA.

6. Randomized SVD

Halko, Martinsson, and Tropp’s 2011 SIAM Review paper formalised what had been folklore for a few years: when only the top-k singular triplets are required, generate a random matrix Ω ∈ R^{n×(k+p)} (Gaussian entries, p is a small oversampling parameter, e.g., p = 5), form Y = A · Ω, orthonormalise Q = qr(Y), then compute the small SVD of B = Qᵀ · A. The top-k left singular vectors of A are Q · U_B, the singular values and right singular vectors come straight from the small decomposition.

Cost: O(m·n·k) rather than O(min(m·n², m²·n)) — for typical “tall-and-skinny” matrices this is one to two orders of magnitude faster than the deterministic SVD.

Error bound (Halko-Martinsson-Tropp 2011, Theorem 10.7):

E ‖A − Q · Qᵀ · A‖_F ≤ (1 + k/(p − 1))^{1/2} · (Σ_{i>k} σ_i²)^{1/2}

i.e., with p = 5 oversampling the expected error is within a small constant of the Eckart-Young optimum. Power iteration (Y = (A · Aᵀ)^q · A · Ω for q = 1 or q = 2) further damps the tail and is the standard recipe in scikit-learn’s TruncatedSVD(algorithm='randomized') and Facebook/FBPCA. The probabilistic failure probability decays super-exponentially in p.

Default these days for “big data PCA”: one pass to form Y, one pass to form B, then a tiny deterministic SVD. Trivially streams. Trivially parallelises.

Practical recipe (numpy):

def randomized_svd(A, k, p=10, q=2):
    m, n = A.shape
    Omega = np.random.randn(n, k + p)
    Y = A @ Omega
    for _ in range(q):                # subspace iteration to damp tail
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ A
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_tilde[:, :k], s[:k], Vt[:k]

Two passes over A plus a small dense SVD of an (k+p) × n matrix. For m = n = 10⁵ and k = 50, this is roughly four orders of magnitude faster than np.linalg.svd(A). Memory is O(m·k) rather than O(m·n).

7. Pseudoinverse

The Moore-Penrose pseudoinverse (Moore 1920; Penrose 1955) of A = U · Σ · Vᵀ is:

A⁺ = V · Σ⁺ · Uᵀ

where Σ⁺ is formed from Σ by taking the reciprocal of each nonzero σ_i (and transposing the shape). A⁺ is the unique matrix satisfying Penrose’s four conditions: A · A⁺ · A = A, A⁺ · A · A⁺ = A⁺, (A · A⁺)ᵀ = A · A⁺, (A⁺ · A)ᵀ = A⁺ · A.

Solves the least-squares problem min_x ‖A · x − b‖²:

x̂ = A⁺ · b

When the problem is underdetermined (infinite solutions), is the unique minimum-norm solution. When overdetermined, is the residual-minimising solution. When A has full column rank, A⁺ = (Aᵀ · A)⁻¹ · Aᵀ (the normal-equations form) but the SVD form is far better numerically because forming Aᵀ · A squares the condition number.

Truncated pseudoinverse: invert only the singular values above some tolerance τ (typical: τ = max(m, n) · σ_1 · ε_machine). This is Tikhonov-style regularisation in disguise and is what numpy.linalg.pinv does by default.

8. PCA via SVD

Given a data matrix X ∈ R^{n×d} (n samples, d features):

  1. Centre: μ = (1/n) · Σ x_i, X_c = X − 1·μᵀ.
  2. Compute the SVD: X_c = U · Σ · Vᵀ.
  3. Principal components (loadings) = columns of V. Each v_j is a unit-norm direction in feature space.
  4. PC scores (projection coordinates) = U · Σ (alternatively X_c · V).
  5. Variance explained by component j is σ_j² / (n − 1); the fraction explained is σ_j² / Σ_i σ_i².

Equivalence with the eigendecomposition of the sample covariance: C = (1/(n−1)) · X_cᵀ · X_c = (1/(n−1)) · V · Σ² · Vᵀ, so the eigenvalues of C are λ_j = σ_j² / (n − 1) and the eigenvectors of C are the columns of V. Computing PCA via SVD of X_c is numerically preferable to forming C explicitly — same condition-number argument as the pseudoinverse.

Reconstruction in the reduced space:

X̂ = U_k · Σ_k · V_kᵀ + 1·μᵀ        (rank-k reconstruction)

Using only the first k PCs gives the best rank-k reconstruction of the centred data by Eckart-Young. PCA is therefore literally “truncated SVD of centred data”.

9. Choosing k for PCA

There is no single correct answer; the standard heuristics in order of frequency:

  • Cumulative variance threshold — pick the smallest k with Σ_{i≤k} σ_i² / Σ_i σ_i² ≥ τ for some τ (commonly 0.90 or 0.95). Simple, easy to defend.
  • Scree plot — plot σ_i² against i and look for the “elbow” where the curve flattens. Subjective but informative.
  • Kaiser criterion (Kaiser 1960) — keep components with eigenvalue greater than 1 (the average variance of standardised features). Works for correlation-PCA only and tends to over-retain.
  • Parallel analysis (Horn 1965) — compare σ_i² against the distribution under random data; keep components above the random baseline. Statistically principled and widely used in psychometrics.
  • Cross-validation — leave one feature out, reconstruct it from PCs of the rest, choose k minimising reconstruction error. Computationally expensive but objective.
  • Minka’s Bayesian model-selection criterion (Minka 2000) — closed-form posterior probability over k under a probabilistic PCA model; default in scikit-learn when n_components='mle'.

If PCA is a preprocessing step for a downstream model, cross-validate k on downstream performance directly.

Probabilistic PCA (Tipping + Bishop 1999) reframes PCA as the maximum-likelihood estimate of a latent-variable Gaussian model x = W · z + μ + ε with z ~ N(0, I_k) and ε ~ N(0, σ²·I_d). The MLE solution for W is exactly the top k PCs scaled by √(λ_i − σ²). Benefits: principled handling of missing data (EM algorithm), automatic noise-variance estimate, and the Bayesian extension that motivates Minka’s MLE for k. Used as the default reasoning framework when “PCA with uncertainty” is needed.

10. Standardization vs centering

Two flavours of PCA in widespread use:

  • Covariance PCA — centre only. Variance of each feature is its own scale; high-variance features dominate. Appropriate when features share units (pixel intensities, log returns).
  • Correlation PCA — centre and divide each feature by its sample standard deviation. Equivalent to PCA on the correlation matrix. Appropriate when features have different units or wildly different scales (height in metres vs income in dollars).

Forgetting to standardise when features have different scales is one of the most common mistakes — a single feature with units and variance 10⁶ swamps every other feature and the “first PC” is just that feature. As a default for heterogeneous tabular data, use correlation PCA. For images or other homogeneous data, centring alone is typical.

11. Kernel PCA

PCA discovers linear structure. Kernel PCA (Schölkopf, Smola, Müller 1998) lifts the data into a (possibly infinite-dimensional) feature space via a positive-definite kernel K(x, y) = ⟨φ(x), φ(y)⟩ and performs PCA in that space without ever forming φ.

Algorithm:

  1. Compute the kernel Gram matrix K_{ij} = K(x_i, x_j).
  2. Centre it: K_c = K − 1_n · K − K · 1_n + 1_n · K · 1_n where 1_n = (1/n) · 1_{n×n}.
  3. Eigendecompose K_c = U · Λ · Uᵀ.
  4. The kernel-PC scores of the training data along component j are √λ_j · u_j.
  5. For a new point x*, the projection onto component j is (1/√λ_j) · Σ_i u_{ji} · K(x_i, x*) after centring.

Common kernels: RBF (exp(−γ ‖x − y‖²)), polynomial ((⟨x,y⟩ + c)^d), cosine. Kernel PCA captures nonlinear manifold structure (concentric circles, swiss roll) at the cost of O(n²) memory for the Gram matrix — fine up to a few tens of thousands of points.

12. ICA — Independent Component Analysis

PCA decorrelates components but does not make them independent — decorrelation only matches second-order statistics. ICA (Comon 1994; Hyvärinen + Oja 2000) seeks a linear demixing S = W · X such that the components of S are as statistically independent as possible. Concretely it maximises non-Gaussianity (by central limit theorem, mixtures of independent sources are more Gaussian than the sources themselves).

FastICA (Hyvärinen 1999) — fixed-point iteration that maximises an approximation to negentropy. Whiten first (so the search is over orthogonal W), then iterate per component:

w ← E[X · g(wᵀ · X)] − E[g'(wᵀ · X)] · w
w ← w / ‖w‖

with g(u) = tanh(u) or g(u) = u · exp(−u²/2) as standard non-quadratic contrast functions.

Applications: blind source separation of mixed audio (“cocktail-party problem”), EEG and fMRI denoising (separating brain signal from eye-blink, heartbeat, and electrode artefacts), and financial time-series factor extraction.

Identifiability caveats: sources are recovered up to permutation, sign, and scale; if more than one source is Gaussian, ICA cannot separate them (Gaussianity is the unique distribution where independence and decorrelation coincide).

Other approaches: InfoMax ICA (Bell + Sejnowski 1995) maximises mutual information between source and observed; JADE (Cardoso + Souloumiac 1993) jointly diagonalises fourth-order cumulant matrices; SOBI (Belouchrani et al. 1997) jointly diagonalises a set of time-lagged covariance matrices and exploits temporal structure rather than non-Gaussianity. For neuroscience, SOBI and InfoMax are popular; for general blind source separation, FastICA dominates.

13. NMF — Non-negative Matrix Factorization

For data with non-negative entries (counts, pixel intensities, audio magnitudes), NMF (Paatero + Tapper 1994; Lee + Seung 1999, Nature) factorises

A ≈ W · H,   W ∈ R_{≥0}^{m×k},  H ∈ R_{≥0}^{k×n}

The non-negativity constraint forces a parts-based representation: each column of A is reconstructed as a strictly additive combination of k non-negative “parts” (columns of W). Faces decompose into nose, eyes, mouth; documents into topics; audio spectrograms into note templates.

Lee + Seung’s multiplicative update rule (squared-error objective):

H ← H * (Wᵀ · A) / (Wᵀ · W · H)
W ← W * (A · Hᵀ) / (W · H · Hᵀ)

where * and / are element-wise. Provably non-increasing on the Frobenius loss; converges to a stationary point (not necessarily a global minimum — the problem is non-convex).

Variants: KL-divergence loss for count data, sparse NMF (add ‖H‖_1 penalty), orthogonal NMF (clustering interpretation), and projective NMF. Foundational for topic modelling pre-LDA and remains a standard baseline for audio source separation and hyperspectral unmixing.

Initialisation matters: pure random initialisation gives noisy local optima. NNDSVD (Boutsidis + Gallopoulos 2008) initialises W and H from a truncated SVD of A projected onto the non-negative orthant — both faster and more reproducible. Modern libraries (scikit-learn, librosa) default to NNDSVD.

14. Low-rank matrix completion

Given A with most entries missing, can we recover the missing entries? Candès + Recht 2009 (Foundations of Computational Mathematics) proved that if the underlying matrix is rank-r and the observation pattern is sufficiently incoherent, the minimum-nuclear-norm completion is exact with high probability from O(n · r · log² n) observations:

minimise   ‖X‖_*       (nuclear norm = Σ σ_i(X))
subject to X_ij = A_ij  for (i,j) ∈ Ω

The nuclear norm is the convex envelope of rank on the unit spectral ball; minimising it is the tightest tractable surrogate for min rank(X). Solved in practice by Singular Value Thresholding (Cai, Candès, Shen 2010), accelerated proximal-gradient, or alternating least-squares on X = U · Vᵀ with explicit factors (the latter scaled the Netflix Prize era).

Closely related: Robust PCA (next section) and matrix-factorisation recommender systems where Ω is the user-item observation graph and r is a small latent-factor dimension.

15. Robust PCA

Standard PCA breaks under sparse gross errors — a single anomalous pixel can rotate the entire principal subspace. Robust PCA (Candès, Li, Ma, Wright 2011) decomposes

A = L + S

with L low-rank and S sparse, by solving the convex relaxation

minimise   ‖L‖_* + λ · ‖S‖_1
subject to L + S = A

with λ = 1 / √max(m, n) as the universal choice in their paper. Solved by ADMM (Boyd 2011) — alternate proximal updates for L (singular-value soft-thresholding) and S (entry-wise soft-thresholding), then update the dual variable. Cost dominated by one SVD per iteration; randomised SVD makes it practical for video.

Applications: surveillance-video background subtraction (background is low-rank across frames, moving objects are sparse), face recognition under shadow/occlusion, anomaly detection in network traffic matrices.

16. Spectral clustering

k-means assumes convex, roughly equal-sized clusters. For data with non-convex clusters (concentric rings, moons, manifolds) it fails badly. Spectral clustering sidesteps the geometry by working in the eigenspace of the data’s similarity graph.

Algorithm (Shi + Malik 2000; Ng, Jordan, Weiss 2002):

  1. Build a similarity graph with weight matrix W (k-nearest-neighbour, ε-ball, or fully connected with W_{ij} = exp(−‖x_i − x_j‖² / 2σ²)).
  2. Form the degree matrix D (diagonal, D_{ii} = Σ_j W_{ij}).
  3. Form the graph Laplacian. Three common choices:
    • Unnormalised: L = D − W.
    • Symmetric normalised: L_sym = I − D^{−1/2} · W · D^{−1/2} (Ng et al.).
    • Random-walk: L_rw = I − D^{−1} · W (Shi + Malik).
  4. Compute the k smallest eigenvectors (excluding the trivial constant one for the unnormalised case) and stack them as columns of U ∈ R^{n×k}.
  5. Optionally normalise each row of U to unit length (Ng et al.).
  6. Run k-means on the rows of U.

Intuition (Cheeger inequality, Fiedler 1973): the second-smallest eigenvector of L provides the best continuous relaxation of the minimum normalised-cut partition of the graph. Higher eigenvectors continue the partition.

Costs the SVD of an n × n Laplacian (or rather the k smallest eigenpairs — Lanczos is the workhorse). Excellent for non-convex clusters; sensitive to the choice of similarity scale σ.

Choosing k in spectral clustering: look for a large gap between consecutive eigenvalues — λ_k ≪ λ_{k+1} strongly suggests k natural clusters (the eigengap heuristic). For very large graphs, the Nyström extension (Williams + Seeger 2001; Fowlkes et al. 2004) samples s ≪ n columns of W, computes their small Laplacian eigenvectors, and extrapolates to the full graph in O(n · s²) rather than O(n³). This is what makes spectral clustering tractable for image segmentation and single-cell genomics datasets in the millions.

17. PageRank

The original 1998 PageRank (Page + Brin, Stanford technical report) defines the importance of a web page as the stationary distribution of a random surfer:

M = α · A_norm + (1 − α) · (1/n) · 1·1ᵀ
π = M · π,   π ≥ 0,   1ᵀ · π = 1

where A_norm is the column-stochastic version of the link matrix (each column sums to 1) and α is the damping factor (typically 0.85). The teleport term (1 − α) · (1/n) · 1·1ᵀ ensures irreducibility and aperiodicity — M then has a unique principal eigenvector by Perron-Frobenius.

π is computed by power iteration: π ← M · π, normalise, repeat. Each iteration is one sparse matrix-vector product against the web link matrix — billions of edges, but extremely sparse, so the iteration is cheap. Converges geometrically at rate α (so 50-100 iterations for tolerance 10⁻⁸).

The same eigenvector view underlies eigenvector centrality in network science (Bonacich 1972), HITS (Kleinberg 1999, two coupled eigenvector problems for hubs and authorities), and modern variants like Personalised PageRank used in recommendation and graph-neural-network influence propagation.

Topic-sensitive PageRank (Haveliwala 2002) replaces the uniform teleport vector with a topic-specific distribution, yielding personalised rankings; Random Walk with Restart generalises further by allowing arbitrary restart distributions. These power “people you may know” and “similar items” features in modern recommender systems and are the backbone of graph-based semi-supervised learning (Zhou et al. 2003).

18. Procrustes analysis

Given two corresponding point clouds X, Y ∈ R^{n×d} (rows are points, columns are coordinates), the orthogonal Procrustes problem is:

minimise   ‖Y − X · R‖_F
subject to RᵀR = I

Solution (Schönemann 1966):

SVD(Yᵀ · X) = U · Σ · Vᵀ
R = V · Uᵀ

(Some references swap U ↔ V depending on whether the rotation is applied on the left or right and whether the data are row- or column-stacked; check carefully.) If a reflection is undesirable, force det(R) = +1 by flipping the sign of the last column of V when needed.

Generalised Procrustes extends to similarity transforms (rotation + scale + translation) and to averaging more than two shapes. Standard in: morphometrics, alignment of MRI scans, multi-view 3D reconstruction, and — recently — aligning two word-embedding spaces or two model-version embedding spaces so that “old man” and “new man” land near each other. Embedding alignment across model versions is the only practical way to compare features across a model retrain without re-embedding everything.

19. Latent Semantic Analysis (LSA)

LSA (Deerwester et al. 1990, JASIS) builds the term-document matrix A ∈ R^{m×n} (m terms, n documents, entries are TF-IDF weights), takes a truncated SVD A_k = U_k · Σ_k · V_kᵀ, and interprets:

  • Rows of U_k · Σ_k as k-dimensional embeddings of terms.
  • Rows of V_k · Σ_k as k-dimensional embeddings of documents.
  • Cosine similarity in this k-dim space as semantic similarity.

Typical k is 100-300. Captures synonymy (“car” and “automobile” co-occur with similar contexts → similar vectors) and polysemy partially (a term gets a single vector, so meanings are blended). Directly precedes the probabilistic-topic-model family (pLSI 1999, LDA 2003) and the predictive word-embedding family (word2vec 2013, GloVe 2014). Modern dense retrieval (rag-embeddings-vector-search) is the same idea with a learned encoder and inner-product index in place of SVD and cosine similarity over a fixed term-document matrix.

20. Linear Discriminant Analysis (LDA, Fisher 1936)

A supervised cousin of PCA. Given class labels y ∈ {1, …, C}, define:

  • Between-class scatter: S_B = Σ_c n_c · (μ_c − μ) · (μ_c − μ)ᵀ.
  • Within-class scatter: S_W = Σ_c Σ_{i ∈ c} (x_i − μ_c) · (x_i − μ_c)ᵀ.

Fisher’s LDA finds the projection that maximises the Rayleigh quotient

J(w) = (wᵀ · S_B · w) / (wᵀ · S_W · w)

solved by the generalised eigenproblem S_B · w = λ · S_W · w. Top C − 1 eigenvectors form the discriminant subspace (rank of S_B is at most C − 1).

LDA assumes equal class covariances; QDA (quadratic discriminant analysis) drops the assumption at the cost of C separate covariance estimates. Acronym collision warning: in NLP “LDA” almost always means Latent Dirichlet Allocation (Blei, Ng, Jordan 2003), an unrelated topic model.

21. t-SNE and UMAP

Non-linear dimensionality reduction for visualisation:

  • t-SNE (van der Maaten + Hinton 2008, JMLR) — convert pairwise distances in high-dim into conditional probabilities under a Gaussian; convert pairwise distances in 2D into a Student-t distribution with one degree of freedom; minimise KL-divergence between the two via gradient descent. Excellent at preserving local neighbourhood structure. Famous for the “tSNE swirl” — gorgeous but global distances are unreliable. O(n²) naive, O(n log n) with Barnes-Hut tree approximation.
  • UMAP (McInnes, Healy, Melville 2018) — based on Riemannian manifold theory and algebraic topology. Builds a fuzzy simplicial complex from nearest-neighbour graphs in both spaces and minimises cross-entropy between them. Preserves more global structure than t-SNE, scales to millions of points, much faster. Default for single-cell genomics, NLP-embedding visualisation, and exploratory dataset triage.

Both should be used for visualisation only. The output coordinates are not a faithful linear embedding — distances are not meaningful, density is distorted, and feeding 2D coordinates into a downstream model destroys most of the signal. If you need a low-dim feature representation for downstream models, use PCA or autoencoder embeddings, not t-SNE/UMAP coordinates.

Hyperparameter pitfalls: t-SNE perplexity (5-50 typical; results change qualitatively across that range) and UMAP n_neighbors + min_dist (run multiple configurations). The 2019 Distill article “How to Use t-SNE Effectively” (Wattenberg, Viégas, Johnson) shows beautifully how varied the plots look on the same data with different perplexities.

TriMap (Amid + Warmuth 2019) and PaCMAP (Wang et al. 2021) are more recent embeddings designed to preserve global structure better than t-SNE while staying fast. PHATE (Moon et al. 2019) uses heat-kernel diffusion distances and is increasingly popular in single-cell genomics where lineage trajectories matter and Euclidean embedding distortions are unacceptable. None of these dethrone UMAP for general use, but each has its niche.

A practical heuristic: always pair the non-linear embedding with the first two PCs of the same data side-by-side. If the structure shows up in both, trust it. If it only shows up in t-SNE/UMAP, suspect an artefact of the projection.

22. Applications

Where the methods of this note show up:

  • Recommendation systems — matrix factorisation A ≈ U · V with stochastic-gradient ALS or implicit-feedback objectives; the post-2009 Netflix-prize architecture and still the baseline behind many production recommenders.
  • Image compression — store an image as the top-k SVD; useful when transmitting low-bandwidth thumbnails or as a denoising preprocessing step.
  • Latent semantic analysis + word embeddings — Section 19 above; the lineage from LSA to GloVe is a continuous one-parameter family of low-rank factorisations of co-occurrence statistics.
  • Noise reduction — drop small σ_i to remove high-frequency or low-energy noise components (denoising medical images, time-series, audio).
  • Visualisation — PCA for a quick linear sanity check; t-SNE or UMAP for a non-linear scatter plot for exploration.
  • Embedding alignment — Procrustes (Section 18) to bring two embedding spaces into a common coordinate frame.
  • Graph clustering — spectral clustering (Section 16); community detection in social and biological networks.
  • Anomaly detection — fit a low-rank model, flag observations with large residuals; robust PCA formalises this.
  • SLAM linearisation — robotics SLAM solvers (GTSAM, g2o) repeatedly factor sparse Jacobians; SVD of small dense blocks appears inside marginalisation and in computing analytic uncertainty ellipsoids. See computer-vision-robotics.
  • Quantum mechanics — Schmidt decomposition — any pure bipartite state |ψ⟩ ∈ H_A ⊗ H_B admits a Schmidt decomposition |ψ⟩ = Σ √p_i · |a_i⟩ ⊗ |b_i⟩ which is exactly the SVD of the coefficient matrix in the product basis. The Schmidt rank measures entanglement.
  • LoRA fine-tuning of large language models (Hu et al. 2021) — freeze the pretrained weights W_0 and learn a low-rank update ΔW = B · A with A ∈ R^{r×d} and B ∈ R^{d×r}. Each weight matrix is updated by an r-rank SVD-style perturbation, reducing trainable parameters by 10⁴×. The implicit assumption — that task adaptation lives in a low-rank subspace of weight space — is empirically validated and theoretically supported by intrinsic-dimensionality measurements (Aghajanyan et al. 2020).
  • Tensor decompositions (CP / Tucker / Tensor-Train) — multi-dimensional generalisations of the SVD. Tucker = SVD per mode + small core tensor; Tensor-Train (Oseledets 2011) gives near-linear storage for high-dimensional structured tensors; foundational in quantum many-body physics (DMRG, MPS) and emerging in deep-learning weight compression.

23. Pitfalls

  • Mean-centring forgotten — without centring, the first PC captures the mean direction rather than the largest-variance direction. Always centre before SVD-PCA.
  • Different feature scales — see Section 10. Standardise unless features share units.
  • Sign ambiguity of singular vectors(u_i, v_i) and (−u_i, −v_i) give the same outer product. Random initialisation in randomised SVD or different LAPACK versions can return either sign. Do not compare signed entries across PCA runs without an alignment step (Procrustes, or simply fix the sign of the largest-magnitude entry).
  • Full SVD when truncated suffices — computing dgesdd on a 10⁵ × 10⁵ matrix to keep k = 50 components is wasteful by three orders of magnitude. Use randomised SVD or Lanczos.
  • PCA loading magnitudes as feature importance — only valid when features are on the same scale. After standardisation, loading magnitudes do indicate relative importance; without it they reflect feature variance.
  • Reporting cumulative variance without n — for small n the singular values are noisy and the cumulative-variance curve underestimates k. Use cross-validation or parallel analysis for small datasets.
  • Spectral clustering with the wrong Laplacian — random-walk L_rw and symmetric L_sym are usually preferable to unnormalised L when cluster sizes differ; unnormalised L minimises RatioCut and tends to produce equal-sized clusters.
  • Treating t-SNE/UMAP coordinates as features — see Section 21. Visualisation only.
  • Numerical PCA on Xᵀ · X — squaring the condition number is unnecessary; SVD of X directly.
  • NMF non-uniqueness — multiple (W, H) pairs reach similar loss; results depend on initialisation. Use multiple random restarts.
  • Kernel PCA forgetting to centre the Gram matrixK_c is what you eigendecompose, not raw K.
  • Treating PCA as variable selection — PCA gives orthogonal combinations of original features, not a subset of them. If interpretability requires retaining original feature names, use sparse PCA (Zou, Hastie, Tibshirani 2006) or feature-importance methods instead.
  • Spectral-clustering bandwidth choice — the Gaussian similarity scale σ makes or breaks the result. Use self-tuning spectral clustering (Zelnik-Manor + Perona 2004) which sets per-point bandwidths from the local k-th nearest-neighbour distance.
  • Procrustes without removing scale or translation — make sure both clouds are mean-centred before the SVD; if scales differ, normalise (or use the similarity-Procrustes variant).
  • Running NMF on real-valued data — NMF requires A ≥ 0. For mixed-sign data, shift or use a signed-friendly factorisation (semi-NMF, Ding et al. 2010).

24. Cross-references

25. Citations

  • Eckart, C. + Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika 1, 211-218.
  • Mirsky, L. (1960). Symmetric gauge functions and unitarily invariant norms. Quarterly Journal of Mathematics 11, 50-59.
  • Moore, E. H. (1920). On the reciprocal of the general algebraic matrix. Bull. AMS 26.
  • Penrose, R. (1955). A generalized inverse for matrices. Proc. Cambridge Philos. Soc. 51, 406-413.
  • Golub, G. H. + Kahan, W. (1965). Calculating the singular values and pseudo-inverse of a matrix. SIAM J. Numer. Anal. 2, 205-224.
  • Golub, G. H. + Reinsch, C. (1970). Singular value decomposition and least squares solutions. Numer. Math. 14, 403-420.
  • Halko, N., Martinsson, P.-G., Tropp, J. A. (2011). Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53, 217-288.
  • Candès, E. J. + Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9, 717-772.
  • Candès, E. J., Li, X., Ma, Y., Wright, J. (2011). Robust principal component analysis? Journal of the ACM 58, 11.
  • Cai, J.-F., Candès, E. J., Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20, 1956-1982.
  • Lee, D. D. + Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791.
  • Paatero, P. + Tapper, U. (1994). Positive matrix factorization. Environmetrics 5, 111-126.
  • Page, L., Brin, S., Motwani, R., Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab Technical Report.
  • Shi, J. + Malik, J. (2000). Normalized cuts and image segmentation. IEEE TPAMI 22, 888-905.
  • Ng, A. Y., Jordan, M. I., Weiss, Y. (2002). On spectral clustering: analysis and an algorithm. NeurIPS 14.
  • Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Mathematical Journal 23, 298-305.
  • Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks 10, 626-634.
  • Hyvärinen, A. + Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks 13, 411-430.
  • Comon, P. (1994). Independent component analysis, a new concept? Signal Processing 36, 287-314.
  • McInnes, L., Healy, J., Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
  • van der Maaten, L. + Hinton, G. (2008). Visualizing data using t-SNE. JMLR 9, 2579-2605.
  • Schölkopf, B., Smola, A., Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299-1319.
  • Schönemann, P. H. (1966). A generalized solution of the orthogonal Procrustes problem. Psychometrika 31, 1-10.
  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., Harshman, R. (1990). Indexing by latent semantic analysis. JASIS 41, 391-407.
  • Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179-188.
  • Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement 20, 141-151.
  • Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika 30, 179-185.
  • Minka, T. P. (2000). Automatic choice of dimensionality for PCA. NeurIPS 13.
  • Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3, 1-122.
  • Dhillon, I. S. + Parlett, B. N. (2004). Multiple representations to compute orthogonal eigenvectors of symmetric tridiagonal matrices. Linear Algebra Appl. 387, 1-28.
  • Lanczos, C. (1950). An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. J. Res. Natl. Bur. Stand. 45, 255-282.
  • Baglama, J. + Reichel, L. (2005). Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J. Sci. Comput. 27, 19-42.
  • Knyazev, A. V. (2001). Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput. 23, 517-541.
  • Larsen, R. M. (1998). Lanczos bidiagonalization with partial reorthogonalization. DAIMI PB-357, University of Aarhus.
  • Bonacich, P. (1972). Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology 2, 113-120.
  • Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. J. ACM 46, 604-632.
  • Blei, D. M., Ng, A. Y., Jordan, M. I. (2003). Latent Dirichlet allocation. JMLR 3, 993-1022.
  • Wattenberg, M., Viégas, F., Johnson, I. (2016). How to use t-SNE effectively. Distill.
  • Tipping, M. E. + Bishop, C. M. (1999). Probabilistic principal component analysis. J. Royal Statistical Society B 61, 611-622.
  • Bell, A. J. + Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129-1159.
  • Cardoso, J.-F. + Souloumiac, A. (1993). Blind beamforming for non-Gaussian signals. IEE Proceedings F 140, 362-370.
  • Boutsidis, C. + Gallopoulos, E. (2008). SVD based initialization: a head start for non-negative matrix factorization. Pattern Recognition 41, 1350-1362.
  • Williams, C. K. I. + Seeger, M. (2001). Using the Nyström method to speed up kernel machines. NeurIPS 13.
  • Fowlkes, C., Belongie, S., Chung, F., Malik, J. (2004). Spectral grouping using the Nyström method. IEEE TPAMI 26, 214-225.
  • Haveliwala, T. H. (2002). Topic-sensitive PageRank. WWW 2002.
  • Hu, E. J. et al. (2021). LoRA: Low-rank adaptation of large language models. arXiv:2106.09685.
  • Aghajanyan, A., Zettlemoyer, L., Gupta, S. (2020). Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv:2012.13255.
  • Oseledets, I. V. (2011). Tensor-Train decomposition. SIAM J. Sci. Comput. 33, 2295-2317.
  • Zou, H., Hastie, T., Tibshirani, R. (2006). Sparse principal component analysis. J. Computational and Graphical Statistics 15, 265-286.
  • Zelnik-Manor, L. + Perona, P. (2004). Self-tuning spectral clustering. NeurIPS 17.
  • Moon, K. R. et al. (2019). Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology 37, 1482-1492.
  • Amid, E. + Warmuth, M. K. (2019). TriMap: large-scale dimensionality reduction using triplets. arXiv:1910.00204.
  • Wang, Y., Huang, H., Rudin, C., Shaposhnik, Y. (2021). Understanding how dimension reduction tools work. JMLR 22, 1-73.