Linear Algebra Essentials — Math Reference
1. At a glance
Linear algebra is the operational language of nearly every quantitative engineering and ML discipline. It underwrites:
- Mechanics + structures — stiffness matrices, modal analysis, FEM/FEA assemblies are sparse linear systems Ax = b with millions of degrees of freedom.
- Control + signal processing — state-space models x(k+1) = A·x(k) + B·u(k) live and die by eigenvalues (stability), observability matrices, and Kalman gain inversions.
- Computer graphics + robotics — every rigid transform is a 4×4 homogeneous matrix; quaternion-to-rotation conversions, Jacobians, inverse kinematics, camera projections are all matrix algebra.
- Machine learning — every dense layer is y = W·x + b; every attention block is QKᵀ; every PCA / autoencoder / collaborative filter is a low-rank factorization; gradient updates are tensor contractions.
- Statistics — covariance matrices, least squares, generalized linear models, multivariate Gaussians, Mahalanobis distance.
- Quantum + chemistry — operators are matrices, observables are eigenvalues, wavefunctions evolve via unitary matrices.
- Graphs + networks — adjacency/Laplacian matrices, PageRank, spectral clustering.
Two complementary views of a matrix
Master both, switch fluidly:
-
Matrix as a list of column vectors. A = [a₁ | a₂ | … | aₙ] where each aᵢ ∈ ℝᵐ is a column. The product A·x = Σᵢ xᵢ·aᵢ is a linear combination of columns weighted by the entries of x. This view dominates when reasoning about the column space (which vectors can be reached) and when interpreting solutions of Ax = b (“which combination of the columns equals b?”).
-
Matrix as a linear transformation. A maps a vector x ∈ ℝⁿ to A·x ∈ ℝᵐ, preserving addition and scalar multiplication: A(αx + βy) = α·A·x + β·A·y. This view dominates when reasoning about geometry — rotations, reflections, projections, change of basis — and when composing transformations (matrix product = function composition).
A third view, matrix as a table of bilinear coefficients (xᵀ·A·y), shows up in quadratic forms, energy methods, and probabilistic models (precision matrices), but the column-vector and transformation views cover ~95% of working intuition.
Core mental models
- Column space C(A) — the set of all vectors A·x can produce; lives in ℝᵐ; dimension equals the rank.
- Row space C(Aᵀ) — span of the rows; lives in ℝⁿ; same dimension as the column space (a deep theorem: row rank = column rank).
- Null space N(A) — all x with A·x = 0; lives in ℝⁿ; dimension n − rank. These are the “invisible” inputs.
- Left null space N(Aᵀ) — all y with yᵀ·A = 0; lives in ℝᵐ; dimension m − rank. These are the “unreachable” output directions.
- Rank — the universal “effective dimension” of A; equals number of linearly-independent columns = rows = number of nonzero singular values.
If you internalize these five concepts, almost every algorithm below becomes “which subspace are we manipulating, and how?“
2. Vectors
A vector v ∈ ℝⁿ has both an algebraic representation (an ordered tuple of n real numbers) and a geometric one (an arrow from the origin, or a point in n-dimensional space). The two views must agree: the algebra is the bookkeeping; the geometry is what we’re computing about.
Dot product (inner product)
For a, b ∈ ℝⁿ:
a · b = Σᵢ aᵢ·bᵢ = ‖a‖·‖b‖·cos θ
- Algebraic form: scalar from coordinate-wise multiplication.
- Geometric form: product of magnitudes times cosine of the angle between them.
- a · b = 0 iff a ⊥ b (orthogonal).
- a · a = ‖a‖² (norm-squared).
Worked example. a = (3, 4, 0), b = (1, 0, 2). Then a · b = 3·1 + 4·0 + 0·2 = 3. ‖a‖ = 5, ‖b‖ = √5 ≈ 2.236. cos θ = 3 / (5·√5) = 3/(5√5) ≈ 0.268, so θ ≈ 74.5°.
Cross product (3D only)
For a, b ∈ ℝ³:
a × b = (a₂b₃ − a₃b₂, a₃b₁ − a₁b₃, a₁b₂ − a₂b₁)
‖a × b‖ = ‖a‖·‖b‖·sin θ
- The result is perpendicular to both a and b (right-hand rule).
- Magnitude equals the area of the parallelogram spanned by a and b.
- a × b = −b × a (anti-commutative).
- Generalization to higher dimensions is the wedge product (a ∧ b) from exterior algebra; in 3D it happens to be expressible as a vector via the Hodge star.
Worked example. a = (1, 0, 0), b = (0, 1, 0). a × b = (0·0 − 0·1, 0·0 − 1·0, 1·1 − 0·0) = (0, 0, 1). Standard right-hand frame: x × y = z.
Norms
Norms measure “size.” Different norms emphasize different aspects:
- L⁰ (count of nonzeros) — not technically a norm (fails homogeneity); used in sparse-recovery problems.
- L¹ (taxicab / Manhattan) ‖v‖₁ = Σᵢ |vᵢ|. Encourages sparsity in regularization (Lasso).
- L² (Euclidean) ‖v‖₂ = √(Σᵢ vᵢ²). The default geometric length.
- Lᵖ ‖v‖ₚ = (Σᵢ |vᵢ|ᵖ)^(1/p) for p ≥ 1.
- L∞ (max / Chebyshev) ‖v‖∞ = maxᵢ |vᵢ|.
- Frobenius norm (for matrices) ‖A‖_F = √(Σᵢⱼ Aᵢⱼ²) = √tr(AᵀA) = √(Σ σᵢ²).
All Lᵖ-norms on ℝⁿ are equivalent (each bounded by constants times the others), but constants depend on dimension; this matters for high-dimensional analysis.
Unit vectors
A unit vector û = v / ‖v‖₂ has ‖û‖₂ = 1. Used as direction without magnitude. Computing a unit vector requires ‖v‖ ≠ 0; near-zero v causes catastrophic cancellation.
Orthogonality + orthonormality
- Orthogonal set {v₁, …, vₖ}: vᵢ · vⱼ = 0 for i ≠ j.
- Orthonormal set: orthogonal plus ‖vᵢ‖ = 1.
- An orthonormal basis Q = [q₁ | … | qₙ] of ℝⁿ satisfies QᵀQ = I, hence Q⁻¹ = Qᵀ. This is the killer property of orthogonal matrices: they preserve dot products, lengths, and angles (isometries).
Projection onto a subspace
To project v onto the line spanned by unit vector û: proj_û v = (v · û)·û. For a general subspace S with orthonormal basis Q (columns of an m × k matrix):
proj_S v = Q·(Qᵀ·v)
For a non-orthonormal basis with columns of A (m × k, k ≤ m, full column rank):
proj_C(A) v = A·(AᵀA)⁻¹·Aᵀ·v
The matrix P = A·(AᵀA)⁻¹·Aᵀ is the projection matrix onto C(A); it satisfies P² = P (idempotent) and Pᵀ = P (symmetric). The residual v − P·v lies in N(Aᵀ).
3. Matrices
Operations
- Addition / scalar multiplication. Entry-wise; A + B and α·A are defined when shapes match. Costs O(m·n).
- Matrix-vector product. y = A·x, yᵢ = Σⱼ Aᵢⱼ·xⱼ. Either “linear combination of A’s columns weighted by x” or “each yᵢ is a row of A dotted with x.” O(m·n).
- Matrix-matrix product. (A·B)ᵢⱼ = Σₖ Aᵢₖ·Bₖⱼ. For A (m × n) and B (n × p), result is (m × p). Inner dimensions must match. O(m·n·p) naive; Strassen-class algorithms achieve O(n^2.373…) asymptotically but are rarely competitive for sizes below ~1000.
AB ≠ BA in general. Multiplication is not commutative. Counterexample: rotations in 3D about different axes don’t commute (try R_x(90°) then R_y(90°) vs the reverse).
Multiplication is:
- Associative: (AB)C = A(BC).
- Distributive over addition.
- Has identity: I·A = A·I = A.
Special matrix types
| Type | Definition | Why it matters |
|---|---|---|
| Identity I | I·v = v; diagonal of 1s, zeros elsewhere. | Multiplicative identity; A·A⁻¹ = I. |
| Diagonal D | Nonzero only on the diagonal. | Stretches axes; cheap to invert (1/dᵢ). |
| Upper triangular U | Zeros below diagonal. | Back substitution solves U·x = b in O(n²). |
| Lower triangular L | Zeros above diagonal. | Forward substitution. |
| Symmetric | A = Aᵀ. | Real eigenvalues, orthogonal eigenvectors (spectral theorem). |
| Skew-symmetric | A = −Aᵀ (diagonal is zero). | Eigenvalues purely imaginary; cross-product matrix. |
| Orthogonal Q | QᵀQ = QQᵀ = I (real). | Isometries; det = ±1; condition number 1. |
| Unitary U | UU = I (complex; is conjugate transpose). | Complex analog of orthogonal. |
| Positive-definite | Symmetric and xᵀAx > 0 ∀ x ≠ 0. | All eigenvalues > 0; Cholesky applies; covariance matrices. |
| Positive-semidefinite | xᵀAx ≥ 0. | Eigenvalues ≥ 0; allows rank deficiency. |
| Permutation P | One 1 per row and column. | P·x permutes entries; det = ±1. |
| Stochastic | Rows (or columns) sum to 1, entries ≥ 0. | Markov chains; row-stochastic right-multiplies probability vectors. |
| Hessenberg | Upper triangular plus first sub-diagonal. | Intermediate form for QR iteration. |
| Tridiagonal | Nonzeros only on diagonal + adjacent sub/super-diagonals. | O(n) solvers (Thomas algorithm). |
Transpose
(Aᵀ)ᵢⱼ = Aⱼᵢ. Properties:
- (A + B)ᵀ = Aᵀ + Bᵀ.
- (αA)ᵀ = α·Aᵀ.
- (A·B)ᵀ = Bᵀ·Aᵀ (order flips).
- (Aᵀ)ᵀ = A.
- (A⁻¹)ᵀ = (Aᵀ)⁻¹.
Inverse
A⁻¹ exists iff A is square and det(A) ≠ 0, equivalently A has full rank, equivalently all singular values > 0, equivalently the only solution of Ax = 0 is x = 0.
- (A·B)⁻¹ = B⁻¹·A⁻¹.
- Computing A⁻¹ explicitly is almost always wrong in numerical work. To solve A·x = b, factor A (LU, Cholesky, QR) and back-solve. Explicit inversion costs O(n³) and amplifies error.
Trace
tr(A) = Σᵢ Aᵢᵢ
- Linear: tr(αA + βB) = α·tr(A) + β·tr(B).
- Cyclic: tr(ABC) = tr(BCA) = tr(CAB).
- Equals the sum of eigenvalues (counted with algebraic multiplicity).
- tr(AᵀA) = ‖A‖_F² (Frobenius norm squared).
Determinant
A scalar measure of how a linear transformation scales (signed) volume. For a 2×2 matrix [[a, b], [c, d]], det = a·d − b·c.
- det(A·B) = det(A)·det(B).
- det(Aᵀ) = det(A).
- det(A⁻¹) = 1 / det(A).
- det(αA) = αⁿ·det(A) for n × n.
- det(triangular) = product of diagonal entries.
- det(orthogonal Q) = ±1.
- Equals the product of eigenvalues (with multiplicity).
- |det(A)| = product of singular values (regardless of sign).
Geometrically, |det(A)| is the volume of the parallelepiped formed by the columns of A. Sign indicates orientation (positive = preserves handedness, negative = mirror).
Rank
Number of linearly-independent columns = number of linearly-independent rows = number of nonzero singular values = dim(C(A)).
Worked example. A = [[1, 2, 3], [2, 4, 6], [1, 1, 1]]. Row 2 = 2 · Row 1, so only two rows are independent. Rank = 2. Null space contains all multiples of (1, −2, 1) — check: A·(1, −2, 1)ᵀ = (1·1 + 2·(−2) + 3·1, …)ᵀ = (0, 0, 0)ᵀ. ✓
4. Linear transformations
A function T: ℝⁿ → ℝᵐ is linear iff T(αx + βy) = α·T(x) + β·T(y) for all scalars α, β and vectors x, y. Every linear transformation between finite-dimensional spaces is exactly a matrix multiplication (relative to a chosen basis).
Common transformations (2D / 3D)
| Transformation | Matrix form |
|---|---|
| Identity | I |
| Scaling by factor k along each axis | k·I |
| Non-uniform scaling | diag(s₁, s₂, …) |
| Rotation by θ in 2D | [[cos θ, −sin θ], [sin θ, cos θ]] |
| Reflection across the x-axis | [[1, 0], [0, −1]] |
| Reflection across line y = x | [[0, 1], [1, 0]] |
| Shear (horizontal) | [[1, k], [0, 1]] |
| Projection onto x-axis | [[1, 0], [0, 0]] |
| Rotation about z-axis by θ in 3D | block diag(R_z(θ), 1) |
For rigid-body motion in 3D (rotation + translation), we use 4×4 homogeneous matrices:
T = [ R t ]
[ 0 1 ]
where R ∈ SO(3) and t ∈ ℝ³. Composition is matrix multiplication; this is the backbone of computer graphics, robotics, and AR.
Change of basis
If B = [b₁ | … | bₙ] is a new basis (columns of B form a basis of ℝⁿ), then a vector v with coordinates [v]_B in the new basis has standard coordinates B·[v]_B. The same linear map A expressed in the new basis is:
A' = B⁻¹ · A · B
This is a similarity transformation. A and A’ have the same eigenvalues, trace, determinant, rank, and characteristic polynomial — they’re “the same transformation in different coordinates.” Diagonalization (Section 7) is the special case where the new basis consists of eigenvectors and A’ becomes diagonal.
5. Four fundamental subspaces
Gilbert Strang’s organizing framework for any m × n matrix A of rank r:
| Subspace | Lives in | Dimension | Description |
|---|---|---|---|
| Column space C(A) | ℝᵐ | r | All possible A·x; the range. |
| Row space C(Aᵀ) | ℝⁿ | r | Span of rows; perpendicular to N(A). |
| Null space N(A) | ℝⁿ | n − r | All x with A·x = 0; perpendicular to C(Aᵀ). |
| Left null space N(Aᵀ) | ℝᵐ | m − r | All y with yᵀA = 0; perpendicular to C(A). |
Rank-nullity theorem
rank(A) + nullity(A) = n (number of columns)
In words: every dimension of the input space ℝⁿ is either reached as a row-space direction (counted in rank) or annihilated as a null-space direction.
Orthogonal decompositions
ℝⁿ = C(Aᵀ) ⊕ N(A), with C(Aᵀ) ⊥ N(A). ℝᵐ = C(A) ⊕ N(Aᵀ), with C(A) ⊥ N(Aᵀ).
This pairs nicely with SVD: the right singular vectors split ℝⁿ into row-space + null-space; left singular vectors split ℝᵐ into column-space + left null-space.
6. Solving linear systems Ax = b
The Swiss Army knife of numerical analysis. Pick the right tool:
Direct methods
Gaussian elimination + LU decomposition. Factor A = P·L·U where P is a permutation, L is unit lower triangular, U is upper triangular. Then solve L·y = P·b by forward substitution, then U·x = y by back substitution. Cost: O(n³) to factor, O(n²) per right-hand-side. Default for dense square systems.
Cholesky decomposition (SPD only). If A is symmetric positive-definite, A = L·Lᵀ with L lower triangular and positive diagonal. About 2× faster than LU because we exploit symmetry, and numerically stable without pivoting. Cost: O(n³/3). Used heavily in statistics (Gaussian likelihood, covariance manipulation), optimization (Newton steps for convex problems), and Kalman filters.
QR decomposition. A = Q·R where Q has orthonormal columns and R is upper triangular. For overdetermined systems (m > n), Ax = b has no exact solution; the least squares problem minimizes ‖A·x − b‖₂. Solution: x = R⁻¹·Qᵀ·b. Cost: O(m·n²). Compared to the normal equations AᵀA·x = Aᵀ·b solved by Cholesky, QR is more numerically stable because the condition number of A appears once, not squared.
SVD. A = U·Σ·Vᵀ; pseudoinverse A⁺ = V·Σ⁺·Uᵀ where Σ⁺ inverts only the nonzero singular values. Solves every linear system: square, over-, or underdetermined; full rank or rank-deficient. For underdetermined systems, returns the minimum-norm solution. Cost: O(min(m, n)²·max(m, n)). Slowest dense factorization but most robust.
Iterative methods (for large sparse A)
Conjugate Gradient (CG). For SPD A. Iterates over conjugate (A-orthogonal) search directions; converges in at most n steps in exact arithmetic, but in practice converges to acceptable accuracy in O(√κ) iterations where κ is the condition number. Cost per iteration: one matrix-vector product + O(n) work. Used in FEM, image reconstruction, large-scale ML.
GMRES (Generalized Minimum Residual). For general (non-symmetric) A. Builds a Krylov subspace and minimizes the residual at each step. Memory grows with iteration count; usually restarted.
BiCGSTAB, MINRES, IDR(s). Other Krylov methods for specific structures.
Preconditioning transforms Ax = b into M⁻¹A·x = M⁻¹b where M ≈ A is cheap to invert. Standard preconditioners: incomplete LU (ILU), incomplete Cholesky (IC), algebraic multigrid (AMG), domain decomposition.
Condition number
κ(A) = σ_max(A) / σ_min(A) (2-norm condition number)
If κ(A) ≈ 10ᵏ, expect to lose roughly k digits of accuracy in floating-point solves. Double precision provides ~16 decimal digits, so κ above 10¹⁶ means no digits of accuracy remain. Practical thresholds:
- κ < 10⁴: well-conditioned, any method works.
- 10⁴ < κ < 10⁸: moderate; use stable factorizations (QR, SVD).
- κ > 10⁸: ill-conditioned; consider SVD with truncation, regularization (Tikhonov), or reformulation.
7. Eigenvalues + eigenvectors
A scalar λ is an eigenvalue of square A if there exists nonzero v with:
A·v = λ·v
The vector v is an eigenvector for λ. Geometrically, the linear transformation A merely stretches v by factor λ — it doesn’t rotate it.
Characteristic polynomial
Rearrange: (A − λI)·v = 0. Nontrivial v exists iff A − λI is singular, iff:
det(A − λI) = 0
This is a polynomial of degree n in λ with n complex roots (counted with multiplicity), the eigenvalues.
Worked example. A = [[2, 1], [1, 2]].
det(A − λI) = (2 − λ)² − 1 = λ² − 4λ + 3 = (λ − 1)(λ − 3) = 0.
Eigenvalues: λ₁ = 1, λ₂ = 3. Eigenvector for λ = 3: solve (A − 3I)v = 0 ⇒ [[−1, 1], [1, −1]]·v = 0 ⇒ v = (1, 1) (up to scaling). For λ = 1: v = (1, −1). The two eigenvectors are orthogonal because A is symmetric.
Multiplicities
- Algebraic multiplicity of λ: the multiplicity of λ as a root of det(A − λI).
- Geometric multiplicity of λ: dim N(A − λI), the dimension of its eigenspace.
- Geometric ≤ algebraic, always.
- When they’re equal for every eigenvalue, A is diagonalizable.
Diagonalization
If A has n linearly-independent eigenvectors, stack them as columns of P and the eigenvalues on the diagonal of D:
A = P · D · P⁻¹
A·P = P·D (each column of P is an eigenvector)
This makes powers trivial: Aᵏ = P·Dᵏ·P⁻¹, where Dᵏ just raises each diagonal entry to the k-th power. Used to solve linear recurrences (Fibonacci closed form), linear ODEs ẋ = A·x with solution x(t) = e^(A·t)·x(0) = P·diag(e^(λᵢ·t))·P⁻¹·x(0), and Markov chain limiting distributions.
Spectral theorem (symmetric matrices)
If A is real and symmetric (A = Aᵀ), then:
- All eigenvalues are real.
- Eigenvectors for distinct eigenvalues are orthogonal.
- A can always be diagonalized by an orthogonal matrix Q: A = Q·Λ·Qᵀ, with Q⁻¹ = Qᵀ.
This is the spectral decomposition and one of the most-used results in applied math.
Complex generalization. If A is Hermitian (A = A*), eigenvalues are real and A = U·Λ·U* with U unitary.
Applications
- PCA. Eigendecompose the sample covariance matrix C = (1/(n−1))·Xᵀ·X (after centering X). Eigenvectors = principal axes; eigenvalues = variance along each axis.
- Vibration modes. For mass M and stiffness K (both SPD), the generalized eigenvalue problem K·v = ω²·M·v gives natural frequencies ωᵢ and mode shapes vᵢ. Foundational for structural and mechanical engineering.
- PageRank. Stationary distribution of a Markov chain is the eigenvector for λ = 1 of the transition matrix Pᵀ. Power iteration solves this on web-scale graphs.
- Quantum mechanics. Observable operators are Hermitian; their eigenvalues are the measurable values; eigenvectors are the post-measurement states.
- Markov chains. The mixing time depends on the second-largest eigenvalue of the transition matrix (spectral gap).
- Graph Laplacian / spectral clustering. Eigenvectors of the Laplacian reveal community structure.
Numerical algorithms
- Power iteration. v ← A·v / ‖A·v‖; converges to the eigenvector for the largest |λ|. Rate depends on |λ₂/λ₁|.
- Inverse iteration. Power iteration on (A − μI)⁻¹; converges to the eigenvalue nearest the shift μ.
- QR algorithm. Iteratively factor A = Q·R then form A’ = R·Q; eigenvalues appear on the diagonal in the limit. The workhorse for dense eigenproblems.
- Lanczos (symmetric) / Arnoldi (general). Build a Krylov subspace and solve a small projected eigenproblem; the method of choice for large sparse matrices when only a few extreme eigenvalues are needed.
- Jacobi-Davidson, LOBPCG. Advanced solvers for specific cases.
8. Singular Value Decomposition (SVD)
For any m × n matrix A (real or complex; full rank or not; square or not):
A = U · Σ · Vᵀ
- U is m × m with orthonormal columns (left singular vectors).
- Σ is m × n with non-negative diagonal entries σ₁ ≥ σ₂ ≥ … ≥ σ_min(m,n) ≥ 0 (singular values), zeros off the diagonal.
- V is n × n with orthonormal columns (right singular vectors). Vᵀ in the factorization, V in the formula for the pseudoinverse.
The “economy” or “thin” SVD truncates U and V to min(m, n) columns and Σ to that size.
Geometric interpretation
Every linear map decomposes as rotation → axis-aligned stretch → rotation:
- Vᵀ rotates the input frame so its axes line up with the right singular vectors.
- Σ stretches each axis by σᵢ.
- U rotates the result into the output frame.
The unit sphere in ℝⁿ is mapped to an ellipsoid in ℝᵐ; the σᵢ are the semi-axis lengths; the columns of U are their directions.
Worked example
A = [[3, 1], [1, 3]]. Symmetric, so SVD coincides with eigendecomposition (up to signs). Eigenvalues: 4 and 2 (both positive, so singular values too). Eigenvectors: (1,1)/√2 and (1,−1)/√2.
A = (1/√2)·[[1, 1], [1, −1]] · diag(4, 2) · (1/√2)·[[1, 1], [1, −1]]ᵀ
Connections to other concepts
- σᵢ² are the nonzero eigenvalues of both AᵀA (n × n) and AAᵀ (m × m).
- Columns of V are eigenvectors of AᵀA.
- Columns of U are eigenvectors of AAᵀ.
- Rank(A) = number of nonzero σᵢ.
- ‖A‖₂ (spectral norm) = σ₁.
- ‖A‖_F (Frobenius) = √(Σ σᵢ²).
- ‖A‖_∗ (nuclear) = Σ σᵢ.
- |det(A)| = ∏ σᵢ (for square A).
- κ₂(A) = σ₁ / σ_r (smallest nonzero).
Eckart-Young-Mirsky theorem
The best rank-k approximation of A in both Frobenius and spectral norm is:
Aₖ = Σᵢ₌₁ᵏ σᵢ · uᵢ · vᵢᵀ
obtained by zeroing out all but the top k singular values. The approximation error:
‖A − Aₖ‖₂ = σₖ₊₁
‖A − Aₖ‖_F = √(σₖ₊₁² + σₖ₊₂² + … )
This single result powers:
- Image compression. Store only the top k singular vectors and values.
- Recommender systems. Low-rank factorization of user-item rating matrices (Netflix Prize).
- Latent Semantic Analysis (LSA). Term-document matrix → low-rank approximation reveals topics.
- Denoising. Truncate small σᵢ (assumed noise) and reconstruct.
- Model order reduction. Reduce dynamical-system dimensionality.
Pseudoinverse (Moore-Penrose)
A⁺ = V · Σ⁺ · Uᵀ
where Σ⁺ replaces each nonzero σᵢ with 1/σᵢ (and keeps zeros as zeros). Properties:
- A·A⁺·A = A.
- A⁺·A·A⁺ = A⁺.
- (A·A⁺)ᵀ = A·A⁺.
- (A⁺·A)ᵀ = A⁺·A.
- For full-rank tall A: A⁺ = (AᵀA)⁻¹·Aᵀ (left inverse).
- For full-rank wide A: A⁺ = Aᵀ·(AAᵀ)⁻¹ (right inverse).
The solution x = A⁺·b is the least-squares minimum-norm solution: among all x minimizing ‖A·x − b‖, it has the smallest ‖x‖.
9. PCA (Principal Component Analysis)
A workhorse algorithm for dimensionality reduction, visualization, decorrelation, and denoising.
Algorithm
Given data matrix X (n × d) of n samples, each with d features:
- Center. X̃ = X − 1·μᵀ, where μ is the column-wise mean. (Sometimes also scale to unit variance — “standardization” — when feature units differ.)
- Covariance. C = (1/(n − 1))·X̃ᵀ·X̃ (d × d, symmetric, positive-semidefinite).
- Eigendecompose. C = V·Λ·Vᵀ, eigenvalues sorted λ₁ ≥ λ₂ ≥ … ≥ λ_d ≥ 0.
- Principal components. Columns of V are the principal directions; each λᵢ = variance along principal direction i.
- Project. Y = X̃·V[:, :k] gives the k-dimensional reduced representation.
Equivalently and more numerically stable: compute the SVD X̃ = U·Σ·Vᵀ directly. Then C = V·(Σᵀ Σ/(n−1))·Vᵀ, so V is the matrix of principal directions and λᵢ = σᵢ²/(n−1).
Explained variance
explained_variance_ratio_i = σᵢ² / Σⱼ σⱼ²
Choose k such that the cumulative ratio exceeds a threshold (e.g., 0.95).
Worked example
Three 2D points: (0, 0), (1, 2), (2, 4). Already exactly on the line y = 2x.
- Mean: (1, 2). Centered: (−1, −2), (0, 0), (1, 2).
- Covariance: (1/2)·[[1+0+1, 2+0+2], [2+0+2, 4+0+4]] = [[1, 2], [2, 4]].
- Eigenvalues: λ² − 5λ + 0 = 0 ⇒ λ = 5, 0. Eigenvector for 5: (1, 2)/√5.
- One PC captures 100% of variance; the data is intrinsically 1D.
Caveats
- Linear method. Captures only linear structure. Nonlinear analogs: kernel PCA, Isomap, t-SNE, UMAP, autoencoders.
- Scale-sensitive. Features with larger magnitudes dominate. Standardize when units differ.
- Maximum-variance ≠ most-informative. For classification, LDA may serve better.
- Centering matters. Failing to center gives “PCA of raw data” which mixes mean and shape.
10. Matrix factorizations summary
| Factorization | Form | Requires | Cost (dense) | Primary use |
|---|---|---|---|---|
| LU | A = P·L·U | Square A | (2/3)·n³ | General Ax = b |
| Cholesky | A = L·Lᵀ | SPD | (1/3)·n³ | SPD Ax = b; sampling Gaussian |
| QR | A = Q·R | None | 2·m·n² (m ≥ n) | Least squares; Gram-Schmidt; basis |
| SVD | A = U·Σ·Vᵀ | None | ~14·m·n² (m ≥ n) | Robust solve; PCA; low-rank |
| Eigendecomp | A = P·D·P⁻¹ | Diagonalizable square | ~10·n³ | Powers, exponential, modes |
| Schur | A = Q·T·Qᵀ | Square (always exists) | ~25·n³ | Numerically stable eigen-info |
| Polar | A = U·P | Square | via SVD | Best rotation; animation |
| Jordan | A = P·J·P⁻¹ | Square (always exists) | unstable numerically | Theoretical structure |
| Bidiagonal / Hessenberg | Reductions | Square | intermediate step | Inside QR / SVD algorithms |
Schur decomposition
Every square matrix A (over ℂ) is unitarily similar to an upper-triangular matrix: A = Q·T·Q*, with Q unitary and T upper triangular with eigenvalues on the diagonal. Always exists; numerically stable. For real A with possibly complex eigenvalues, the real Schur form uses 1×1 and 2×2 diagonal blocks.
Polar decomposition
A = U·P where U is orthogonal (rotation/reflection) and P is symmetric positive-semidefinite (stretch). Computed via SVD: if A = UΣVᵀ then U_polar = U·Vᵀ and P = V·Σ·Vᵀ. Used in:
- Continuum mechanics — deformation gradient F = R·U (rotation · stretch).
- Computer animation — smoothest interpolation of rotations.
- Procrustes problem — finding the best rotation aligning two point clouds.
11. Tensors + multi-linear algebra
A tensor of rank k is a multi-dimensional array with k indices. A scalar is rank 0, a vector rank 1, a matrix rank 2, etc. (Note: “rank” here means “order” or “number of indices” — not the same as matrix rank.) In physics and differential geometry, “tensor” also implies coordinate-transformation rules; in ML/data science it’s usually just “multi-dim array.”
Einstein summation convention
Repeated indices are implicitly summed:
yᵢ = Aᵢⱼ · xⱼ means yᵢ = Σⱼ Aᵢⱼ·xⱼ
Cᵢⱼ = Aᵢₖ · Bₖⱼ means matrix product
Concise and a perfect match for einsum in NumPy / PyTorch / JAX.
Tensor product (outer product)
For vectors a ∈ ℝᵐ and b ∈ ℝⁿ, the outer product a ⊗ b is the m × n matrix with entries (a ⊗ b)ᵢⱼ = aᵢ·bⱼ. Rank 1. Generalizes to higher-order tensors.
Tensor contraction
Summing over a pair of indices, reducing the rank by 2. Matrix product is a contraction. Trace is a contraction.
Kronecker product
For A (m × n) and B (p × q):
A ⊗ B = block matrix with (i, j) block aᵢⱼ·B, total size mp × nq.
Properties:
- (A ⊗ B)·(C ⊗ D) = (A·C) ⊗ (B·D).
- (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹.
- vec(A·X·B) = (Bᵀ ⊗ A)·vec(X), where vec stacks columns.
Used in: signal processing (separable filters), Lyapunov equations, quantum state spaces (composite systems).
Higher-order decompositions
- CP (CANDECOMP / PARAFAC). T = Σᵣ aᵣ ⊗ bᵣ ⊗ cᵣ; sum of rank-1 tensors. Generalization of “rank” to tensors (NP-hard in general).
- Tucker decomposition. T = G ×₁ U₁ ×₂ U₂ ×₃ U₃ where G is a “core” tensor and Uₖ are factor matrices; higher-order SVD (HOSVD).
- Tensor train (TT). Sequence of small 3D cores; the workhorse for very high-dimensional tensors in scientific computing.
Matrix derivatives (common identities)
For x ∈ ℝⁿ and A, B compatible:
| Expression | Derivative w.r.t. x |
|---|---|
| aᵀ·x | a |
| xᵀ·a | a |
| xᵀ·A·x | (A + Aᵀ)·x (= 2A·x if A symmetric) |
| ‖x‖² = xᵀ·x | 2·x |
| log det(X) (w.r.t. X) | (X⁻¹)ᵀ |
| tr(A·X) (w.r.t. X) | Aᵀ |
| tr(Xᵀ·A·X) (w.r.t. X) | (A + Aᵀ)·X |
These are the bread and butter of analytic gradients in optimization and ML; modern autograd computes them by chain rule but knowing the identities catches autograd bugs and informs custom backward passes.
12. Norms + inequalities
Vector norms
- L⁰ — count of nonzero entries (not a true norm).
- L¹ — sum of absolute values.
- L² — Euclidean.
- Lᵖ — (Σ |vᵢ|ᵖ)^(1/p).
- L∞ — max absolute value.
Matrix norms
- Induced (operator) p-norm. ‖A‖ₚ = sup_{x ≠ 0} ‖A·x‖ₚ / ‖x‖ₚ.
- ‖A‖₁ = max column sum.
- ‖A‖∞ = max row sum.
- ‖A‖₂ = σ_max(A) (spectral norm).
- Frobenius. ‖A‖_F = √(Σᵢⱼ Aᵢⱼ²) = √tr(AᵀA) = √(Σ σᵢ²). Not an induced norm but submultiplicative.
- Nuclear (trace) norm. ‖A‖_∗ = Σ σᵢ. Convex surrogate for rank; used in low-rank recovery (matrix completion).
- Max norm. maxᵢⱼ |Aᵢⱼ|. Useful but not submultiplicative.
All induced norms and the Frobenius norm satisfy:
- ‖A·B‖ ≤ ‖A‖·‖B‖ (submultiplicativity).
- ‖A·x‖ ≤ ‖A‖·‖x‖.
Cauchy-Schwarz inequality
|⟨a, b⟩| ≤ ‖a‖·‖b‖
with equality iff a and b are linearly dependent. Universal across inner-product spaces.
Triangle inequality
‖a + b‖ ≤ ‖a‖ + ‖b‖
Together with positive-definiteness and absolute homogeneity, this defines a norm.
Hölder inequality
For 1/p + 1/q = 1:
|⟨a, b⟩| ≤ ‖a‖ₚ·‖b‖_q
Generalizes Cauchy-Schwarz (p = q = 2). Used in functional analysis and probability.
Matrix Hölder
‖A·B‖_F ≤ ‖A‖₂·‖B‖_F; many similar variants relating Schatten norms (Schatten-p = (Σ σᵢᵖ)^(1/p), recovering nuclear for p = 1, Frobenius for p = 2, spectral for p = ∞).
13. Numerical considerations
Floating-point precision
| Format | Bits | ~Decimal digits | Notes |
|---|---|---|---|
| float64 (double) | 64 | 15.95 | Scientific computing default |
| float32 (single) | 32 | 7.22 | GPU compute, ML training |
| float16 (half) | 16 | 3.31 | ML inference; narrow exponent range |
| bf16 (bfloat16) | 16 | 2.36 | Same exponent range as float32; ML training on TPUs/GPUs |
| fp8 (E4M3, E5M2) | 8 | 1–2 | Cutting-edge ML training; needs scaling |
Choosing precision is a trade-off between speed, memory, and numerical stability. In ML, mixed-precision training keeps weights and accumulators in higher precision while doing matmuls in lower precision.
Catastrophic cancellation
Subtracting two nearly-equal floating-point numbers loses precision proportional to their relative closeness. E.g., (1 + 1e-16) − 1 in float64 ≈ 0, not 1e-16. Mitigate by reformulating: use compensated summation (Kahan), use stable identities (e.g., (1 − cos x) ≈ x²/2 for small x), or use libraries that handle this internally.
Numerical rank vs theoretical rank
In exact arithmetic, rank is a well-defined integer. In floating-point, “rank” depends on a tolerance: count σᵢ above ε·σ_max·max(m, n). NumPy’s matrix_rank uses this by default. Small singular values can represent either rank deficiency masked by noise or just noise; truncation must be motivated.
Pivoting in LU
Without pivoting, LU can fail (zero on diagonal) or amplify error. Partial pivoting (row exchanges to put the largest available entry on the diagonal) is standard and stable. Full pivoting (row + column exchanges) is more stable but rarely needed.
Iterative refinement
After computing x̂ = solve(A, b), residual r = b − A·x̂ may reveal lost precision. Solve A·δ = r and update x̂ ← x̂ + δ. Cheap correction, often regains a few digits.
BLAS levels
- Level 1. Vector-vector ops (axpy, dot, scal). O(n) work on O(n) data; memory-bound.
- Level 2. Matrix-vector (gemv, trsv). O(n²) work on O(n²) data; still memory-bound.
- Level 3. Matrix-matrix (gemm, syrk, trsm). O(n³) work on O(n²) data; compute-bound, achieves close to peak FLOPS.
Algorithm design that maximizes Level-3 usage (blocked / panel-factored versions of LU, QR, Cholesky) is the reason LAPACK is fast.
LAPACK and friends
- BLAS — low-level kernels. Reference impl is slow; production uses OpenBLAS, Intel MKL, Apple Accelerate, AOCL, cuBLAS (NVIDIA), rocBLAS (AMD), oneMKL (Intel oneAPI), Arm Performance Libraries, Eigen.
- LAPACK — higher-level: linear systems, eigenproblems, SVD, least squares. Calls BLAS.
- ScaLAPACK / ELPA — distributed-memory versions.
- SuiteSparse / MUMPS / PARDISO — sparse direct solvers.
- Trilinos / PETSc / Hypre — sparse iterative + multigrid + preconditioners.
NumPy and SciPy bind to whichever BLAS/LAPACK they’re built against; on most distros it’s OpenBLAS or MKL. PyTorch and JAX layer their own dispatch on top of BLAS + cuBLAS + custom CUDA/Triton kernels. Knowing the stack matters: a misconfigured single-threaded BLAS in a multi-threaded Python program can silently throttle performance by 10×.
14. Applications matrix — what technique solves what
| Problem | Technique |
|---|---|
| Solve square Ax = b, general A | LU with partial pivoting |
| Solve Ax = b, A symmetric positive-definite | Cholesky |
| Solve Ax = b, A symmetric indefinite | LDLᵀ (Bunch-Kaufman) |
| Overdetermined least squares (m > n, full rank) | QR (Householder) |
| Underdetermined min-norm (m < n) | SVD / Moore-Penrose pseudoinverse |
| Rank-deficient / ill-conditioned | Truncated SVD; Tikhonov regularization |
| PCA / latent factors | SVD of centered data; eigendecomp of covariance |
| Image / data compression | Truncated SVD per Eckart-Young |
| Matrix completion (sparse observations) | Nuclear-norm minimization (convex); ALS |
| PageRank / Markov stationary | Power iteration on transition matrix |
| Vibration modes / FEM eigenvalues | Symmetric generalized eigenproblem; Lanczos for large sparse |
| Linear ODE ẋ = A·x | Matrix exponential e^(At); via eigendecomp or Padé |
| Lyapunov stability | Solve A·X + X·Aᵀ + Q = 0 (Bartels-Stewart, Schur-based) |
| Kalman filter prediction / update | Matrix multiplications + Cholesky / Joseph form for stability |
| Robot Jacobian inverse | Damped least squares (J⁺ = Jᵀ(JJᵀ + λI)⁻¹) for singularity robustness |
| Sparse graph computations | Compressed-storage sparse formats + iterative solvers (CG, GMRES) |
| Recommender system | Low-rank matrix factorization via SGD / ALS / SVD |
| Word embeddings (latent semantic) | Truncated SVD of co-occurrence or PPMI matrix |
| Spectral clustering | Eigenvectors of graph Laplacian |
| Total least squares (errors-in-variables) | SVD-based estimator |
15. Cross-references
- _index — math reference hub.
- svd-pca-spectral — deep-dive on SVD, PCA, spectral methods (TBD).
- eigenvalue-problems — symmetric, generalized, large-sparse eigenproblems (TBD).
- fem-fea — large sparse linear systems in continuum mechanics.
- system-identification — least-squares + subspace ID methods.
- kinematics-dh — homogeneous transforms, DH parameters, Jacobians.
- bayesian-estimation — Kalman filters, covariance updates.
- transformer-architecture — attention as scaled dot-product, matrix factorizations in Q/K/V.
16. Citations
- Gilbert Strang, Linear Algebra and Its Applications, 5th ed., Wellesley-Cambridge Press, 2016. The classic engineering-flavored exposition.
- Gilbert Strang, Introduction to Linear Algebra, 5th ed., Wellesley-Cambridge Press, 2016. Companion to the MIT 18.06 OCW lectures.
- Lloyd N. Trefethen and David Bau III, Numerical Linear Algebra, SIAM, 1997. The reference for stable, modern numerical algorithms.
- Gene H. Golub and Charles F. Van Loan, Matrix Computations, 4th ed., Johns Hopkins University Press, 2013. The comprehensive practitioner’s bible.
- Carl Eckart and Gale Young, “The approximation of one matrix by another of lower rank,” Psychometrika 1(3): 211–218, 1936. The original Eckart-Young theorem.
- Roger A. Horn and Charles R. Johnson, Matrix Analysis, 2nd ed., Cambridge University Press, 2013. Definitive on matrix-theoretic results.
- James W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997. Strong coverage of numerical stability + iterative methods.
- L. Vandenberghe and S. Boyd, Introduction to Applied Linear Algebra, Cambridge University Press, 2018. Free PDF; great for the ML / engineering crossover audience.