Linear Algebra Essentials — Math Reference

1. At a glance

Linear algebra is the operational language of nearly every quantitative engineering and ML discipline. It underwrites:

Mechanics + structures — stiffness matrices, modal analysis, FEM/FEA assemblies are sparse linear systems Ax = b with millions of degrees of freedom.
Control + signal processing — state-space models x(k+1) = A·x(k) + B·u(k) live and die by eigenvalues (stability), observability matrices, and Kalman gain inversions.
Computer graphics + robotics — every rigid transform is a 4×4 homogeneous matrix; quaternion-to-rotation conversions, Jacobians, inverse kinematics, camera projections are all matrix algebra.
Machine learning — every dense layer is y = W·x + b; every attention block is QKᵀ; every PCA / autoencoder / collaborative filter is a low-rank factorization; gradient updates are tensor contractions.
Statistics — covariance matrices, least squares, generalized linear models, multivariate Gaussians, Mahalanobis distance.
Quantum + chemistry — operators are matrices, observables are eigenvalues, wavefunctions evolve via unitary matrices.
Graphs + networks — adjacency/Laplacian matrices, PageRank, spectral clustering.

Two complementary views of a matrix

Master both, switch fluidly:

Matrix as a list of column vectors. A = [a₁ | a₂ | … | aₙ] where each aᵢ ∈ ℝᵐ is a column. The product A·x = Σᵢ xᵢ·aᵢ is a linear combination of columns weighted by the entries of x. This view dominates when reasoning about the column space (which vectors can be reached) and when interpreting solutions of Ax = b (“which combination of the columns equals b?”).
Matrix as a linear transformation. A maps a vector x ∈ ℝⁿ to A·x ∈ ℝᵐ, preserving addition and scalar multiplication: A(αx + βy) = α·A·x + β·A·y. This view dominates when reasoning about geometry — rotations, reflections, projections, change of basis — and when composing transformations (matrix product = function composition).

A third view, matrix as a table of bilinear coefficients (xᵀ·A·y), shows up in quadratic forms, energy methods, and probabilistic models (precision matrices), but the column-vector and transformation views cover ~95% of working intuition.

Core mental models

Column space C(A) — the set of all vectors A·x can produce; lives in ℝᵐ; dimension equals the rank.
Row space C(Aᵀ) — span of the rows; lives in ℝⁿ; same dimension as the column space (a deep theorem: row rank = column rank).
Null space N(A) — all x with A·x = 0; lives in ℝⁿ; dimension n − rank. These are the “invisible” inputs.
Left null space N(Aᵀ) — all y with yᵀ·A = 0; lives in ℝᵐ; dimension m − rank. These are the “unreachable” output directions.
Rank — the universal “effective dimension” of A; equals number of linearly-independent columns = rows = number of nonzero singular values.

If you internalize these five concepts, almost every algorithm below becomes “which subspace are we manipulating, and how?“

2. Vectors

A vector v ∈ ℝⁿ has both an algebraic representation (an ordered tuple of n real numbers) and a geometric one (an arrow from the origin, or a point in n-dimensional space). The two views must agree: the algebra is the bookkeeping; the geometry is what we’re computing about.

Dot product (inner product)

For a, b ∈ ℝⁿ:

a · b = Σᵢ aᵢ·bᵢ  =  ‖a‖·‖b‖·cos θ

Algebraic form: scalar from coordinate-wise multiplication.
Geometric form: product of magnitudes times cosine of the angle between them.
a · b = 0 iff a ⊥ b (orthogonal).
a · a = ‖a‖² (norm-squared).

Worked example. a = (3, 4, 0), b = (1, 0, 2). Then a · b = 3·1 + 4·0 + 0·2 = 3. ‖a‖ = 5, ‖b‖ = √5 ≈ 2.236. cos θ = 3 / (5·√5) = 3/(5√5) ≈ 0.268, so θ ≈ 74.5°.

Cross product (3D only)

For a, b ∈ ℝ³:

a × b = (a₂b₃ − a₃b₂, a₃b₁ − a₁b₃, a₁b₂ − a₂b₁)
‖a × b‖ = ‖a‖·‖b‖·sin θ

The result is perpendicular to both a and b (right-hand rule).
Magnitude equals the area of the parallelogram spanned by a and b.
a × b = −b × a (anti-commutative).
Generalization to higher dimensions is the wedge product (a ∧ b) from exterior algebra; in 3D it happens to be expressible as a vector via the Hodge star.

Worked example. a = (1, 0, 0), b = (0, 1, 0). a × b = (0·0 − 0·1, 0·0 − 1·0, 1·1 − 0·0) = (0, 0, 1). Standard right-hand frame: x × y = z.

Norms

Norms measure “size.” Different norms emphasize different aspects:

L⁰ (count of nonzeros) — not technically a norm (fails homogeneity); used in sparse-recovery problems.
L¹ (taxicab / Manhattan) ‖v‖₁ = Σᵢ |vᵢ|. Encourages sparsity in regularization (Lasso).
L² (Euclidean) ‖v‖₂ = √(Σᵢ vᵢ²). The default geometric length.
Lᵖ ‖v‖ₚ = (Σᵢ |vᵢ|ᵖ)^(1/p) for p ≥ 1.
L∞ (max / Chebyshev) ‖v‖∞ = maxᵢ |vᵢ|.
Frobenius norm (for matrices) ‖A‖_F = √(Σᵢⱼ Aᵢⱼ²) = √tr(AᵀA) = √(Σ σᵢ²).

All Lᵖ-norms on ℝⁿ are equivalent (each bounded by constants times the others), but constants depend on dimension; this matters for high-dimensional analysis.

Unit vectors

A unit vector û = v / ‖v‖₂ has ‖û‖₂ = 1. Used as direction without magnitude. Computing a unit vector requires ‖v‖ ≠ 0; near-zero v causes catastrophic cancellation.

Orthogonality + orthonormality

Orthogonal set {v₁, …, vₖ}: vᵢ · vⱼ = 0 for i ≠ j.
Orthonormal set: orthogonal plus ‖vᵢ‖ = 1.
An orthonormal basis Q = [q₁ | … | qₙ] of ℝⁿ satisfies QᵀQ = I, hence Q⁻¹ = Qᵀ. This is the killer property of orthogonal matrices: they preserve dot products, lengths, and angles (isometries).

Projection onto a subspace

To project v onto the line spanned by unit vector û: proj_û v = (v · û)·û. For a general subspace S with orthonormal basis Q (columns of an m × k matrix):

proj_S v = Q·(Qᵀ·v)

For a non-orthonormal basis with columns of A (m × k, k ≤ m, full column rank):

proj_C(A) v = A·(AᵀA)⁻¹·Aᵀ·v

The matrix P = A·(AᵀA)⁻¹·Aᵀ is the projection matrix onto C(A); it satisfies P² = P (idempotent) and Pᵀ = P (symmetric). The residual v − P·v lies in N(Aᵀ).

3. Matrices

Operations

Addition / scalar multiplication. Entry-wise; A + B and α·A are defined when shapes match. Costs O(m·n).
Matrix-vector product. y = A·x, yᵢ = Σⱼ Aᵢⱼ·xⱼ. Either “linear combination of A’s columns weighted by x” or “each yᵢ is a row of A dotted with x.” O(m·n).
Matrix-matrix product. (A·B)ᵢⱼ = Σₖ Aᵢₖ·Bₖⱼ. For A (m × n) and B (n × p), result is (m × p). Inner dimensions must match. O(m·n·p) naive; Strassen-class algorithms achieve O(n^2.373…) asymptotically but are rarely competitive for sizes below ~1000.

AB ≠ BA in general. Multiplication is not commutative. Counterexample: rotations in 3D about different axes don’t commute (try R_x(90°) then R_y(90°) vs the reverse).

Multiplication is:

Associative: (AB)C = A(BC).
Distributive over addition.
Has identity: I·A = A·I = A.

Special matrix types

Type	Definition	Why it matters
Identity I	I·v = v; diagonal of 1s, zeros elsewhere.	Multiplicative identity; A·A⁻¹ = I.
Diagonal D	Nonzero only on the diagonal.	Stretches axes; cheap to invert (1/dᵢ).
Upper triangular U	Zeros below diagonal.	Back substitution solves U·x = b in O(n²).
Lower triangular L	Zeros above diagonal.	Forward substitution.
Symmetric	A = Aᵀ.	Real eigenvalues, orthogonal eigenvectors (spectral theorem).
Skew-symmetric	A = −Aᵀ (diagonal is zero).	Eigenvalues purely imaginary; cross-product matrix.
Orthogonal Q	QᵀQ = QQᵀ = I (real).	Isometries; det = ±1; condition number 1.
Unitary U	UU = I (complex; is conjugate transpose).	Complex analog of orthogonal.
Positive-definite	Symmetric and xᵀAx > 0 ∀ x ≠ 0.	All eigenvalues > 0; Cholesky applies; covariance matrices.
Positive-semidefinite	xᵀAx ≥ 0.	Eigenvalues ≥ 0; allows rank deficiency.
Permutation P	One 1 per row and column.	P·x permutes entries; det = ±1.
Stochastic	Rows (or columns) sum to 1, entries ≥ 0.	Markov chains; row-stochastic right-multiplies probability vectors.
Hessenberg	Upper triangular plus first sub-diagonal.	Intermediate form for QR iteration.
Tridiagonal	Nonzeros only on diagonal + adjacent sub/super-diagonals.	O(n) solvers (Thomas algorithm).

Transpose

(Aᵀ)ᵢⱼ = Aⱼᵢ. Properties:

(A + B)ᵀ = Aᵀ + Bᵀ.
(αA)ᵀ = α·Aᵀ.
(A·B)ᵀ = Bᵀ·Aᵀ (order flips).
(Aᵀ)ᵀ = A.
(A⁻¹)ᵀ = (Aᵀ)⁻¹.

Inverse

A⁻¹ exists iff A is square and det(A) ≠ 0, equivalently A has full rank, equivalently all singular values > 0, equivalently the only solution of Ax = 0 is x = 0.

(A·B)⁻¹ = B⁻¹·A⁻¹.
Computing A⁻¹ explicitly is almost always wrong in numerical work. To solve A·x = b, factor A (LU, Cholesky, QR) and back-solve. Explicit inversion costs O(n³) and amplifies error.

Trace

tr(A) = Σᵢ Aᵢᵢ

Linear: tr(αA + βB) = α·tr(A) + β·tr(B).
Cyclic: tr(ABC) = tr(BCA) = tr(CAB).
Equals the sum of eigenvalues (counted with algebraic multiplicity).
tr(AᵀA) = ‖A‖_F² (Frobenius norm squared).

Determinant

A scalar measure of how a linear transformation scales (signed) volume. For a 2×2 matrix [[a, b], [c, d]], det = a·d − b·c.

det(A·B) = det(A)·det(B).
det(Aᵀ) = det(A).
det(A⁻¹) = 1 / det(A).
det(αA) = αⁿ·det(A) for n × n.
det(triangular) = product of diagonal entries.
det(orthogonal Q) = ±1.
Equals the product of eigenvalues (with multiplicity).
|det(A)| = product of singular values (regardless of sign).

Geometrically, |det(A)| is the volume of the parallelepiped formed by the columns of A. Sign indicates orientation (positive = preserves handedness, negative = mirror).

Rank

Number of linearly-independent columns = number of linearly-independent rows = number of nonzero singular values = dim(C(A)).

Worked example. A = [[1, 2, 3], [2, 4, 6], [1, 1, 1]]. Row 2 = 2 · Row 1, so only two rows are independent. Rank = 2. Null space contains all multiples of (1, −2, 1) — check: A·(1, −2, 1)ᵀ = (1·1 + 2·(−2) + 3·1, …)ᵀ = (0, 0, 0)ᵀ. ✓

4. Linear transformations

A function T: ℝⁿ → ℝᵐ is linear iff T(αx + βy) = α·T(x) + β·T(y) for all scalars α, β and vectors x, y. Every linear transformation between finite-dimensional spaces is exactly a matrix multiplication (relative to a chosen basis).

Common transformations (2D / 3D)

Transformation	Matrix form
Identity	I
Scaling by factor k along each axis	k·I
Non-uniform scaling	diag(s₁, s₂, …)
Rotation by θ in 2D	[[cos θ, −sin θ], [sin θ, cos θ]]
Reflection across the x-axis	[[1, 0], [0, −1]]
Reflection across line y = x	[[0, 1], [1, 0]]
Shear (horizontal)	[[1, k], [0, 1]]
Projection onto x-axis	[[1, 0], [0, 0]]
Rotation about z-axis by θ in 3D	block diag(R_z(θ), 1)

For rigid-body motion in 3D (rotation + translation), we use 4×4 homogeneous matrices:

T = [ R  t ]
    [ 0  1 ]

where R ∈ SO(3) and t ∈ ℝ³. Composition is matrix multiplication; this is the backbone of computer graphics, robotics, and AR.

Change of basis

If B = [b₁ | … | bₙ] is a new basis (columns of B form a basis of ℝⁿ), then a vector v with coordinates [v]_B in the new basis has standard coordinates B·[v]_B. The same linear map A expressed in the new basis is:

A' = B⁻¹ · A · B

This is a similarity transformation. A and A’ have the same eigenvalues, trace, determinant, rank, and characteristic polynomial — they’re “the same transformation in different coordinates.” Diagonalization (Section 7) is the special case where the new basis consists of eigenvectors and A’ becomes diagonal.

5. Four fundamental subspaces

Gilbert Strang’s organizing framework for any m × n matrix A of rank r:

Subspace	Lives in	Dimension	Description
Column space C(A)	ℝᵐ	r	All possible A·x; the range.
Row space C(Aᵀ)	ℝⁿ	r	Span of rows; perpendicular to N(A).
Null space N(A)	ℝⁿ	n − r	All x with A·x = 0; perpendicular to C(Aᵀ).
Left null space N(Aᵀ)	ℝᵐ	m − r	All y with yᵀA = 0; perpendicular to C(A).

Rank-nullity theorem

rank(A) + nullity(A) = n   (number of columns)

In words: every dimension of the input space ℝⁿ is either reached as a row-space direction (counted in rank) or annihilated as a null-space direction.

Orthogonal decompositions

ℝⁿ = C(Aᵀ) ⊕ N(A), with C(Aᵀ) ⊥ N(A). ℝᵐ = C(A) ⊕ N(Aᵀ), with C(A) ⊥ N(Aᵀ).

This pairs nicely with SVD: the right singular vectors split ℝⁿ into row-space + null-space; left singular vectors split ℝᵐ into column-space + left null-space.

6. Solving linear systems Ax = b

The Swiss Army knife of numerical analysis. Pick the right tool:

Direct methods

Gaussian elimination + LU decomposition. Factor A = P·L·U where P is a permutation, L is unit lower triangular, U is upper triangular. Then solve L·y = P·b by forward substitution, then U·x = y by back substitution. Cost: O(n³) to factor, O(n²) per right-hand-side. Default for dense square systems.

Cholesky decomposition (SPD only). If A is symmetric positive-definite, A = L·Lᵀ with L lower triangular and positive diagonal. About 2× faster than LU because we exploit symmetry, and numerically stable without pivoting. Cost: O(n³/3). Used heavily in statistics (Gaussian likelihood, covariance manipulation), optimization (Newton steps for convex problems), and Kalman filters.

QR decomposition. A = Q·R where Q has orthonormal columns and R is upper triangular. For overdetermined systems (m > n), Ax = b has no exact solution; the least squares problem minimizes ‖A·x − b‖₂. Solution: x = R⁻¹·Qᵀ·b. Cost: O(m·n²). Compared to the normal equations AᵀA·x = Aᵀ·b solved by Cholesky, QR is more numerically stable because the condition number of A appears once, not squared.

SVD. A = U·Σ·Vᵀ; pseudoinverse A⁺ = V·Σ⁺·Uᵀ where Σ⁺ inverts only the nonzero singular values. Solves every linear system: square, over-, or underdetermined; full rank or rank-deficient. For underdetermined systems, returns the minimum-norm solution. Cost: O(min(m, n)²·max(m, n)). Slowest dense factorization but most robust.

Iterative methods (for large sparse A)

Conjugate Gradient (CG). For SPD A. Iterates over conjugate (A-orthogonal) search directions; converges in at most n steps in exact arithmetic, but in practice converges to acceptable accuracy in O(√κ) iterations where κ is the condition number. Cost per iteration: one matrix-vector product + O(n) work. Used in FEM, image reconstruction, large-scale ML.

GMRES (Generalized Minimum Residual). For general (non-symmetric) A. Builds a Krylov subspace and minimizes the residual at each step. Memory grows with iteration count; usually restarted.

BiCGSTAB, MINRES, IDR(s). Other Krylov methods for specific structures.

Preconditioning transforms Ax = b into M⁻¹A·x = M⁻¹b where M ≈ A is cheap to invert. Standard preconditioners: incomplete LU (ILU), incomplete Cholesky (IC), algebraic multigrid (AMG), domain decomposition.

Condition number

κ(A) = σ_max(A) / σ_min(A)   (2-norm condition number)

If κ(A) ≈ 10ᵏ, expect to lose roughly k digits of accuracy in floating-point solves. Double precision provides ~16 decimal digits, so κ above 10¹⁶ means no digits of accuracy remain. Practical thresholds:

κ < 10⁴: well-conditioned, any method works.
10⁴ < κ < 10⁸: moderate; use stable factorizations (QR, SVD).
κ > 10⁸: ill-conditioned; consider SVD with truncation, regularization (Tikhonov), or reformulation.

7. Eigenvalues + eigenvectors

A scalar λ is an eigenvalue of square A if there exists nonzero v with:

A·v = λ·v

The vector v is an eigenvector for λ. Geometrically, the linear transformation A merely stretches v by factor λ — it doesn’t rotate it.

Characteristic polynomial

Rearrange: (A − λI)·v = 0. Nontrivial v exists iff A − λI is singular, iff:

det(A − λI) = 0

This is a polynomial of degree n in λ with n complex roots (counted with multiplicity), the eigenvalues.

Worked example. A = [[2, 1], [1, 2]].

det(A − λI) = (2 − λ)² − 1 = λ² − 4λ + 3 = (λ − 1)(λ − 3) = 0.

Eigenvalues: λ₁ = 1, λ₂ = 3. Eigenvector for λ = 3: solve (A − 3I)v = 0 ⇒ [[−1, 1], [1, −1]]·v = 0 ⇒ v = (1, 1) (up to scaling). For λ = 1: v = (1, −1). The two eigenvectors are orthogonal because A is symmetric.

Multiplicities

Algebraic multiplicity of λ: the multiplicity of λ as a root of det(A − λI).
Geometric multiplicity of λ: dim N(A − λI), the dimension of its eigenspace.
Geometric ≤ algebraic, always.
When they’re equal for every eigenvalue, A is diagonalizable.

Diagonalization

If A has n linearly-independent eigenvectors, stack them as columns of P and the eigenvalues on the diagonal of D:

A = P · D · P⁻¹
A·P = P·D     (each column of P is an eigenvector)

This makes powers trivial: Aᵏ = P·Dᵏ·P⁻¹, where Dᵏ just raises each diagonal entry to the k-th power. Used to solve linear recurrences (Fibonacci closed form), linear ODEs ẋ = A·x with solution x(t) = e^(A·t)·x(0) = P·diag(e^(λᵢ·t))·P⁻¹·x(0), and Markov chain limiting distributions.

Spectral theorem (symmetric matrices)

If A is real and symmetric (A = Aᵀ), then:

All eigenvalues are real.
Eigenvectors for distinct eigenvalues are orthogonal.
A can always be diagonalized by an orthogonal matrix Q: A = Q·Λ·Qᵀ, with Q⁻¹ = Qᵀ.

This is the spectral decomposition and one of the most-used results in applied math.

Complex generalization. If A is Hermitian (A = A*), eigenvalues are real and A = U·Λ·U* with U unitary.

Applications

PCA. Eigendecompose the sample covariance matrix C = (1/(n−1))·Xᵀ·X (after centering X). Eigenvectors = principal axes; eigenvalues = variance along each axis.
Vibration modes. For mass M and stiffness K (both SPD), the generalized eigenvalue problem K·v = ω²·M·v gives natural frequencies ωᵢ and mode shapes vᵢ. Foundational for structural and mechanical engineering.
PageRank. Stationary distribution of a Markov chain is the eigenvector for λ = 1 of the transition matrix Pᵀ. Power iteration solves this on web-scale graphs.
Quantum mechanics. Observable operators are Hermitian; their eigenvalues are the measurable values; eigenvectors are the post-measurement states.
Markov chains. The mixing time depends on the second-largest eigenvalue of the transition matrix (spectral gap).
Graph Laplacian / spectral clustering. Eigenvectors of the Laplacian reveal community structure.

Numerical algorithms

Power iteration. v ← A·v / ‖A·v‖; converges to the eigenvector for the largest |λ|. Rate depends on |λ₂/λ₁|.
Inverse iteration. Power iteration on (A − μI)⁻¹; converges to the eigenvalue nearest the shift μ.
QR algorithm. Iteratively factor A = Q·R then form A’ = R·Q; eigenvalues appear on the diagonal in the limit. The workhorse for dense eigenproblems.
Lanczos (symmetric) / Arnoldi (general). Build a Krylov subspace and solve a small projected eigenproblem; the method of choice for large sparse matrices when only a few extreme eigenvalues are needed.
Jacobi-Davidson, LOBPCG. Advanced solvers for specific cases.

8. Singular Value Decomposition (SVD)

For any m × n matrix A (real or complex; full rank or not; square or not):

A = U · Σ · Vᵀ

U is m × m with orthonormal columns (left singular vectors).
Σ is m × n with non-negative diagonal entries σ₁ ≥ σ₂ ≥ … ≥ σ_min(m,n) ≥ 0 (singular values), zeros off the diagonal.
V is n × n with orthonormal columns (right singular vectors). Vᵀ in the factorization, V in the formula for the pseudoinverse.

The “economy” or “thin” SVD truncates U and V to min(m, n) columns and Σ to that size.

Geometric interpretation

Every linear map decomposes as rotation → axis-aligned stretch → rotation:

Vᵀ rotates the input frame so its axes line up with the right singular vectors.
Σ stretches each axis by σᵢ.
U rotates the result into the output frame.

The unit sphere in ℝⁿ is mapped to an ellipsoid in ℝᵐ; the σᵢ are the semi-axis lengths; the columns of U are their directions.

Worked example

A = [[3, 1], [1, 3]]. Symmetric, so SVD coincides with eigendecomposition (up to signs). Eigenvalues: 4 and 2 (both positive, so singular values too). Eigenvectors: (1,1)/√2 and (1,−1)/√2.

A = (1/√2)·[[1, 1], [1, −1]]  ·  diag(4, 2)  ·  (1/√2)·[[1, 1], [1, −1]]ᵀ

Connections to other concepts

σᵢ² are the nonzero eigenvalues of both AᵀA (n × n) and AAᵀ (m × m).
Columns of V are eigenvectors of AᵀA.
Columns of U are eigenvectors of AAᵀ.
Rank(A) = number of nonzero σᵢ.
‖A‖₂ (spectral norm) = σ₁.
‖A‖_F (Frobenius) = √(Σ σᵢ²).
‖A‖_∗ (nuclear) = Σ σᵢ.
|det(A)| = ∏ σᵢ (for square A).
κ₂(A) = σ₁ / σ_r (smallest nonzero).

Eckart-Young-Mirsky theorem

The best rank-k approximation of A in both Frobenius and spectral norm is:

Aₖ = Σᵢ₌₁ᵏ σᵢ · uᵢ · vᵢᵀ

obtained by zeroing out all but the top k singular values. The approximation error:

‖A − Aₖ‖₂ = σₖ₊₁
‖A − Aₖ‖_F = √(σₖ₊₁² + σₖ₊₂² + … )

This single result powers:

Image compression. Store only the top k singular vectors and values.
Recommender systems. Low-rank factorization of user-item rating matrices (Netflix Prize).
Latent Semantic Analysis (LSA). Term-document matrix → low-rank approximation reveals topics.
Denoising. Truncate small σᵢ (assumed noise) and reconstruct.
Model order reduction. Reduce dynamical-system dimensionality.

Pseudoinverse (Moore-Penrose)

A⁺ = V · Σ⁺ · Uᵀ

where Σ⁺ replaces each nonzero σᵢ with 1/σᵢ (and keeps zeros as zeros). Properties:

A·A⁺·A = A.
A⁺·A·A⁺ = A⁺.
(A·A⁺)ᵀ = A·A⁺.
(A⁺·A)ᵀ = A⁺·A.
For full-rank tall A: A⁺ = (AᵀA)⁻¹·Aᵀ (left inverse).
For full-rank wide A: A⁺ = Aᵀ·(AAᵀ)⁻¹ (right inverse).

The solution x = A⁺·b is the least-squares minimum-norm solution: among all x minimizing ‖A·x − b‖, it has the smallest ‖x‖.

9. PCA (Principal Component Analysis)

A workhorse algorithm for dimensionality reduction, visualization, decorrelation, and denoising.

Algorithm

Given data matrix X (n × d) of n samples, each with d features:

Center. X̃ = X − 1·μᵀ, where μ is the column-wise mean. (Sometimes also scale to unit variance — “standardization” — when feature units differ.)
Covariance. C = (1/(n − 1))·X̃ᵀ·X̃ (d × d, symmetric, positive-semidefinite).
Eigendecompose. C = V·Λ·Vᵀ, eigenvalues sorted λ₁ ≥ λ₂ ≥ … ≥ λ_d ≥ 0.
Principal components. Columns of V are the principal directions; each λᵢ = variance along principal direction i.
Project. Y = X̃·V[:, :k] gives the k-dimensional reduced representation.

Equivalently and more numerically stable: compute the SVD X̃ = U·Σ·Vᵀ directly. Then C = V·(Σᵀ Σ/(n−1))·Vᵀ, so V is the matrix of principal directions and λᵢ = σᵢ²/(n−1).

Explained variance

explained_variance_ratio_i = σᵢ² / Σⱼ σⱼ²

Choose k such that the cumulative ratio exceeds a threshold (e.g., 0.95).

Worked example

Three 2D points: (0, 0), (1, 2), (2, 4). Already exactly on the line y = 2x.

Mean: (1, 2). Centered: (−1, −2), (0, 0), (1, 2).
Covariance: (1/2)·[[1+0+1, 2+0+2], [2+0+2, 4+0+4]] = [[1, 2], [2, 4]].
Eigenvalues: λ² − 5λ + 0 = 0 ⇒ λ = 5, 0. Eigenvector for 5: (1, 2)/√5.
One PC captures 100% of variance; the data is intrinsically 1D.

Caveats

Linear method. Captures only linear structure. Nonlinear analogs: kernel PCA, Isomap, t-SNE, UMAP, autoencoders.
Scale-sensitive. Features with larger magnitudes dominate. Standardize when units differ.
Maximum-variance ≠ most-informative. For classification, LDA may serve better.
Centering matters. Failing to center gives “PCA of raw data” which mixes mean and shape.

10. Matrix factorizations summary

Factorization	Form	Requires	Cost (dense)	Primary use
LU	A = P·L·U	Square A	(2/3)·n³	General Ax = b
Cholesky	A = L·Lᵀ	SPD	(1/3)·n³	SPD Ax = b; sampling Gaussian
QR	A = Q·R	None	2·m·n² (m ≥ n)	Least squares; Gram-Schmidt; basis
SVD	A = U·Σ·Vᵀ	None	~14·m·n² (m ≥ n)	Robust solve; PCA; low-rank
Eigendecomp	A = P·D·P⁻¹	Diagonalizable square	~10·n³	Powers, exponential, modes
Schur	A = Q·T·Qᵀ	Square (always exists)	~25·n³	Numerically stable eigen-info
Polar	A = U·P	Square	via SVD	Best rotation; animation
Jordan	A = P·J·P⁻¹	Square (always exists)	unstable numerically	Theoretical structure
Bidiagonal / Hessenberg	Reductions	Square	intermediate step	Inside QR / SVD algorithms

Schur decomposition

Every square matrix A (over ℂ) is unitarily similar to an upper-triangular matrix: A = Q·T·Q*, with Q unitary and T upper triangular with eigenvalues on the diagonal. Always exists; numerically stable. For real A with possibly complex eigenvalues, the real Schur form uses 1×1 and 2×2 diagonal blocks.

Polar decomposition

A = U·P where U is orthogonal (rotation/reflection) and P is symmetric positive-semidefinite (stretch). Computed via SVD: if A = UΣVᵀ then U_polar = U·Vᵀ and P = V·Σ·Vᵀ. Used in:

Continuum mechanics — deformation gradient F = R·U (rotation · stretch).
Computer animation — smoothest interpolation of rotations.
Procrustes problem — finding the best rotation aligning two point clouds.

11. Tensors + multi-linear algebra

A tensor of rank k is a multi-dimensional array with k indices. A scalar is rank 0, a vector rank 1, a matrix rank 2, etc. (Note: “rank” here means “order” or “number of indices” — not the same as matrix rank.) In physics and differential geometry, “tensor” also implies coordinate-transformation rules; in ML/data science it’s usually just “multi-dim array.”

Einstein summation convention

Repeated indices are implicitly summed:

yᵢ = Aᵢⱼ · xⱼ    means    yᵢ = Σⱼ Aᵢⱼ·xⱼ
Cᵢⱼ = Aᵢₖ · Bₖⱼ  means    matrix product

Concise and a perfect match for einsum in NumPy / PyTorch / JAX.

Tensor product (outer product)

For vectors a ∈ ℝᵐ and b ∈ ℝⁿ, the outer product a ⊗ b is the m × n matrix with entries (a ⊗ b)ᵢⱼ = aᵢ·bⱼ. Rank 1. Generalizes to higher-order tensors.

Tensor contraction

Summing over a pair of indices, reducing the rank by 2. Matrix product is a contraction. Trace is a contraction.

Kronecker product

For A (m × n) and B (p × q):

A ⊗ B = block matrix with (i, j) block aᵢⱼ·B, total size mp × nq.

Properties:

(A ⊗ B)·(C ⊗ D) = (A·C) ⊗ (B·D).
(A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹.
vec(A·X·B) = (Bᵀ ⊗ A)·vec(X), where vec stacks columns.

Used in: signal processing (separable filters), Lyapunov equations, quantum state spaces (composite systems).

Higher-order decompositions

CP (CANDECOMP / PARAFAC). T = Σᵣ aᵣ ⊗ bᵣ ⊗ cᵣ; sum of rank-1 tensors. Generalization of “rank” to tensors (NP-hard in general).
Tucker decomposition. T = G ×₁ U₁ ×₂ U₂ ×₃ U₃ where G is a “core” tensor and Uₖ are factor matrices; higher-order SVD (HOSVD).
Tensor train (TT). Sequence of small 3D cores; the workhorse for very high-dimensional tensors in scientific computing.

Matrix derivatives (common identities)

For x ∈ ℝⁿ and A, B compatible:

Expression	Derivative w.r.t. x
aᵀ·x	a
xᵀ·a	a
xᵀ·A·x	(A + Aᵀ)·x (= 2A·x if A symmetric)
‖x‖² = xᵀ·x	2·x
log det(X) (w.r.t. X)	(X⁻¹)ᵀ
tr(A·X) (w.r.t. X)	Aᵀ
tr(Xᵀ·A·X) (w.r.t. X)	(A + Aᵀ)·X

These are the bread and butter of analytic gradients in optimization and ML; modern autograd computes them by chain rule but knowing the identities catches autograd bugs and informs custom backward passes.

12. Norms + inequalities

Vector norms

L⁰ — count of nonzero entries (not a true norm).
L¹ — sum of absolute values.
L² — Euclidean.
Lᵖ — (Σ |vᵢ|ᵖ)^(1/p).
L∞ — max absolute value.

Matrix norms

Induced (operator) p-norm. ‖A‖ₚ = sup_{x ≠ 0} ‖A·x‖ₚ / ‖x‖ₚ.
- ‖A‖₁ = max column sum.
- ‖A‖∞ = max row sum.
- ‖A‖₂ = σ_max(A) (spectral norm).
Frobenius. ‖A‖_F = √(Σᵢⱼ Aᵢⱼ²) = √tr(AᵀA) = √(Σ σᵢ²). Not an induced norm but submultiplicative.
Nuclear (trace) norm. ‖A‖_∗ = Σ σᵢ. Convex surrogate for rank; used in low-rank recovery (matrix completion).
Max norm. maxᵢⱼ |Aᵢⱼ|. Useful but not submultiplicative.

All induced norms and the Frobenius norm satisfy:

‖A·B‖ ≤ ‖A‖·‖B‖ (submultiplicativity).
‖A·x‖ ≤ ‖A‖·‖x‖.

Cauchy-Schwarz inequality

|⟨a, b⟩| ≤ ‖a‖·‖b‖

with equality iff a and b are linearly dependent. Universal across inner-product spaces.

Triangle inequality

‖a + b‖ ≤ ‖a‖ + ‖b‖

Together with positive-definiteness and absolute homogeneity, this defines a norm.

Hölder inequality

For 1/p + 1/q = 1:

|⟨a, b⟩| ≤ ‖a‖ₚ·‖b‖_q

Generalizes Cauchy-Schwarz (p = q = 2). Used in functional analysis and probability.

Matrix Hölder

‖A·B‖_F ≤ ‖A‖₂·‖B‖_F; many similar variants relating Schatten norms (Schatten-p = (Σ σᵢᵖ)^(1/p), recovering nuclear for p = 1, Frobenius for p = 2, spectral for p = ∞).

13. Numerical considerations

Floating-point precision

Format	Bits	~Decimal digits	Notes
float64 (double)	64	15.95	Scientific computing default
float32 (single)	32	7.22	GPU compute, ML training
float16 (half)	16	3.31	ML inference; narrow exponent range
bf16 (bfloat16)	16	2.36	Same exponent range as float32; ML training on TPUs/GPUs
fp8 (E4M3, E5M2)	8	1–2	Cutting-edge ML training; needs scaling

Choosing precision is a trade-off between speed, memory, and numerical stability. In ML, mixed-precision training keeps weights and accumulators in higher precision while doing matmuls in lower precision.

Catastrophic cancellation

Subtracting two nearly-equal floating-point numbers loses precision proportional to their relative closeness. E.g., (1 + 1e-16) − 1 in float64 ≈ 0, not 1e-16. Mitigate by reformulating: use compensated summation (Kahan), use stable identities (e.g., (1 − cos x) ≈ x²/2 for small x), or use libraries that handle this internally.

Numerical rank vs theoretical rank

In exact arithmetic, rank is a well-defined integer. In floating-point, “rank” depends on a tolerance: count σᵢ above ε·σ_max·max(m, n). NumPy’s matrix_rank uses this by default. Small singular values can represent either rank deficiency masked by noise or just noise; truncation must be motivated.

Pivoting in LU

Without pivoting, LU can fail (zero on diagonal) or amplify error. Partial pivoting (row exchanges to put the largest available entry on the diagonal) is standard and stable. Full pivoting (row + column exchanges) is more stable but rarely needed.

After computing x̂ = solve(A, b), residual r = b − A·x̂ may reveal lost precision. Solve A·δ = r and update x̂ ← x̂ + δ. Cheap correction, often regains a few digits.

BLAS levels

Level 1. Vector-vector ops (axpy, dot, scal). O(n) work on O(n) data; memory-bound.
Level 2. Matrix-vector (gemv, trsv). O(n²) work on O(n²) data; still memory-bound.
Level 3. Matrix-matrix (gemm, syrk, trsm). O(n³) work on O(n²) data; compute-bound, achieves close to peak FLOPS.

Algorithm design that maximizes Level-3 usage (blocked / panel-factored versions of LU, QR, Cholesky) is the reason LAPACK is fast.

LAPACK and friends

BLAS — low-level kernels. Reference impl is slow; production uses OpenBLAS, Intel MKL, Apple Accelerate, AOCL, cuBLAS (NVIDIA), rocBLAS (AMD), oneMKL (Intel oneAPI), Arm Performance Libraries, Eigen.
LAPACK — higher-level: linear systems, eigenproblems, SVD, least squares. Calls BLAS.
ScaLAPACK / ELPA — distributed-memory versions.
SuiteSparse / MUMPS / PARDISO — sparse direct solvers.
Trilinos / PETSc / Hypre — sparse iterative + multigrid + preconditioners.

NumPy and SciPy bind to whichever BLAS/LAPACK they’re built against; on most distros it’s OpenBLAS or MKL. PyTorch and JAX layer their own dispatch on top of BLAS + cuBLAS + custom CUDA/Triton kernels. Knowing the stack matters: a misconfigured single-threaded BLAS in a multi-threaded Python program can silently throttle performance by 10×.

14. Applications matrix — what technique solves what

Problem	Technique
Solve square Ax = b, general A	LU with partial pivoting
Solve Ax = b, A symmetric positive-definite	Cholesky
Solve Ax = b, A symmetric indefinite	LDLᵀ (Bunch-Kaufman)
Overdetermined least squares (m > n, full rank)	QR (Householder)
Underdetermined min-norm (m < n)	SVD / Moore-Penrose pseudoinverse
Rank-deficient / ill-conditioned	Truncated SVD; Tikhonov regularization
PCA / latent factors	SVD of centered data; eigendecomp of covariance
Image / data compression	Truncated SVD per Eckart-Young
Matrix completion (sparse observations)	Nuclear-norm minimization (convex); ALS
PageRank / Markov stationary	Power iteration on transition matrix
Vibration modes / FEM eigenvalues	Symmetric generalized eigenproblem; Lanczos for large sparse
Linear ODE ẋ = A·x	Matrix exponential e^(At); via eigendecomp or Padé
Lyapunov stability	Solve A·X + X·Aᵀ + Q = 0 (Bartels-Stewart, Schur-based)
Kalman filter prediction / update	Matrix multiplications + Cholesky / Joseph form for stability
Robot Jacobian inverse	Damped least squares (J⁺ = Jᵀ(JJᵀ + λI)⁻¹) for singularity robustness
Sparse graph computations	Compressed-storage sparse formats + iterative solvers (CG, GMRES)
Recommender system	Low-rank matrix factorization via SGD / ALS / SVD
Word embeddings (latent semantic)	Truncated SVD of co-occurrence or PPMI matrix
Spectral clustering	Eigenvectors of graph Laplacian
Total least squares (errors-in-variables)	SVD-based estimator

15. Cross-references

_index — math reference hub.
svd-pca-spectral — deep-dive on SVD, PCA, spectral methods (TBD).
eigenvalue-problems — symmetric, generalized, large-sparse eigenproblems (TBD).
fem-fea — large sparse linear systems in continuum mechanics.
system-identification — least-squares + subspace ID methods.
kinematics-dh — homogeneous transforms, DH parameters, Jacobians.
bayesian-estimation — Kalman filters, covariance updates.
transformer-architecture — attention as scaled dot-product, matrix factorizations in Q/K/V.

16. Citations

Gilbert Strang, Linear Algebra and Its Applications, 5th ed., Wellesley-Cambridge Press, 2016. The classic engineering-flavored exposition.
Gilbert Strang, Introduction to Linear Algebra, 5th ed., Wellesley-Cambridge Press, 2016. Companion to the MIT 18.06 OCW lectures.
Lloyd N. Trefethen and David Bau III, Numerical Linear Algebra, SIAM, 1997. The reference for stable, modern numerical algorithms.
Gene H. Golub and Charles F. Van Loan, Matrix Computations, 4th ed., Johns Hopkins University Press, 2013. The comprehensive practitioner’s bible.
Carl Eckart and Gale Young, “The approximation of one matrix by another of lower rank,” Psychometrika 1(3): 211–218, 1936. The original Eckart-Young theorem.
Roger A. Horn and Charles R. Johnson, Matrix Analysis, 2nd ed., Cambridge University Press, 2013. Definitive on matrix-theoretic results.
James W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997. Strong coverage of numerical stability + iterative methods.
L. Vandenberghe and S. Boyd, Introduction to Applied Linear Algebra, Cambridge University Press, 2018. Free PDF; great for the ML / engineering crossover audience.

Compendium

Explorer

Linear Algebra Essentials — Math Reference

Linear Algebra Essentials — Math Reference

1. At a glance

Two complementary views of a matrix

Core mental models

2. Vectors

Dot product (inner product)

Cross product (3D only)

Norms

Unit vectors

Orthogonality + orthonormality

Projection onto a subspace

3. Matrices

Operations

Special matrix types

Transpose

Inverse

Trace

Determinant

Rank

4. Linear transformations

Common transformations (2D / 3D)

Change of basis

5. Four fundamental subspaces

Rank-nullity theorem

Orthogonal decompositions

6. Solving linear systems Ax = b

Direct methods

Iterative methods (for large sparse A)

Condition number

7. Eigenvalues + eigenvectors

Characteristic polynomial

Multiplicities

Diagonalization

Spectral theorem (symmetric matrices)

Applications

Numerical algorithms

8. Singular Value Decomposition (SVD)

Geometric interpretation

Worked example

Connections to other concepts

Eckart-Young-Mirsky theorem

Pseudoinverse (Moore-Penrose)

9. PCA (Principal Component Analysis)

Algorithm

Explained variance

Worked example

Caveats

10. Matrix factorizations summary

Schur decomposition

Polar decomposition

11. Tensors + multi-linear algebra

Einstein summation convention

Tensor product (outer product)

Tensor contraction

Kronecker product

Higher-order decompositions

Matrix derivatives (common identities)

12. Norms + inequalities

Vector norms

Matrix norms

Cauchy-Schwarz inequality

Triangle inequality

Hölder inequality

Matrix Hölder

13. Numerical considerations

Floating-point precision

Catastrophic cancellation

Numerical rank vs theoretical rank

Pivoting in LU

Iterative refinement

BLAS levels

LAPACK and friends

14. Applications matrix — what technique solves what

15. Cross-references

16. Citations

Graph View

Table of Contents