Linear Algebra Essentials — Math Reference

1. At a glance

Linear algebra is the operational language of nearly every quantitative engineering and ML discipline. It underwrites:

  • Mechanics + structures — stiffness matrices, modal analysis, FEM/FEA assemblies are sparse linear systems Ax = b with millions of degrees of freedom.
  • Control + signal processing — state-space models x(k+1) = A·x(k) + B·u(k) live and die by eigenvalues (stability), observability matrices, and Kalman gain inversions.
  • Computer graphics + robotics — every rigid transform is a 4×4 homogeneous matrix; quaternion-to-rotation conversions, Jacobians, inverse kinematics, camera projections are all matrix algebra.
  • Machine learning — every dense layer is y = W·x + b; every attention block is QKᵀ; every PCA / autoencoder / collaborative filter is a low-rank factorization; gradient updates are tensor contractions.
  • Statistics — covariance matrices, least squares, generalized linear models, multivariate Gaussians, Mahalanobis distance.
  • Quantum + chemistry — operators are matrices, observables are eigenvalues, wavefunctions evolve via unitary matrices.
  • Graphs + networks — adjacency/Laplacian matrices, PageRank, spectral clustering.

Two complementary views of a matrix

Master both, switch fluidly:

  1. Matrix as a list of column vectors. A = [a₁ | a₂ | … | aₙ] where each aᵢ ∈ ℝᵐ is a column. The product A·x = Σᵢ xᵢ·aᵢ is a linear combination of columns weighted by the entries of x. This view dominates when reasoning about the column space (which vectors can be reached) and when interpreting solutions of Ax = b (“which combination of the columns equals b?”).

  2. Matrix as a linear transformation. A maps a vector x ∈ ℝⁿ to A·x ∈ ℝᵐ, preserving addition and scalar multiplication: A(αx + βy) = α·A·x + β·A·y. This view dominates when reasoning about geometry — rotations, reflections, projections, change of basis — and when composing transformations (matrix product = function composition).

A third view, matrix as a table of bilinear coefficients (xᵀ·A·y), shows up in quadratic forms, energy methods, and probabilistic models (precision matrices), but the column-vector and transformation views cover ~95% of working intuition.

Core mental models

  • Column space C(A) — the set of all vectors A·x can produce; lives in ℝᵐ; dimension equals the rank.
  • Row space C(Aᵀ) — span of the rows; lives in ℝⁿ; same dimension as the column space (a deep theorem: row rank = column rank).
  • Null space N(A) — all x with A·x = 0; lives in ℝⁿ; dimension n − rank. These are the “invisible” inputs.
  • Left null space N(Aᵀ) — all y with yᵀ·A = 0; lives in ℝᵐ; dimension m − rank. These are the “unreachable” output directions.
  • Rank — the universal “effective dimension” of A; equals number of linearly-independent columns = rows = number of nonzero singular values.

If you internalize these five concepts, almost every algorithm below becomes “which subspace are we manipulating, and how?“

2. Vectors

A vector v ∈ ℝⁿ has both an algebraic representation (an ordered tuple of n real numbers) and a geometric one (an arrow from the origin, or a point in n-dimensional space). The two views must agree: the algebra is the bookkeeping; the geometry is what we’re computing about.

Dot product (inner product)

For a, b ∈ ℝⁿ:

a · b = Σᵢ aᵢ·bᵢ  =  ‖a‖·‖b‖·cos θ
  • Algebraic form: scalar from coordinate-wise multiplication.
  • Geometric form: product of magnitudes times cosine of the angle between them.
  • a · b = 0 iff a ⊥ b (orthogonal).
  • a · a = ‖a‖² (norm-squared).

Worked example. a = (3, 4, 0), b = (1, 0, 2). Then a · b = 3·1 + 4·0 + 0·2 = 3. ‖a‖ = 5, ‖b‖ = √5 ≈ 2.236. cos θ = 3 / (5·√5) = 3/(5√5) ≈ 0.268, so θ ≈ 74.5°.

Cross product (3D only)

For a, b ∈ ℝ³:

a × b = (a₂b₃ − a₃b₂, a₃b₁ − a₁b₃, a₁b₂ − a₂b₁)
‖a × b‖ = ‖a‖·‖b‖·sin θ
  • The result is perpendicular to both a and b (right-hand rule).
  • Magnitude equals the area of the parallelogram spanned by a and b.
  • a × b = −b × a (anti-commutative).
  • Generalization to higher dimensions is the wedge product (a ∧ b) from exterior algebra; in 3D it happens to be expressible as a vector via the Hodge star.

Worked example. a = (1, 0, 0), b = (0, 1, 0). a × b = (0·0 − 0·1, 0·0 − 1·0, 1·1 − 0·0) = (0, 0, 1). Standard right-hand frame: x × y = z.

Norms

Norms measure “size.” Different norms emphasize different aspects:

  • L⁰ (count of nonzeros) — not technically a norm (fails homogeneity); used in sparse-recovery problems.
  • L¹ (taxicab / Manhattan) ‖v‖₁ = Σᵢ |vᵢ|. Encourages sparsity in regularization (Lasso).
  • L² (Euclidean) ‖v‖₂ = √(Σᵢ vᵢ²). The default geometric length.
  • Lᵖ ‖v‖ₚ = (Σᵢ |vᵢ|ᵖ)^(1/p) for p ≥ 1.
  • L∞ (max / Chebyshev) ‖v‖∞ = maxᵢ |vᵢ|.
  • Frobenius norm (for matrices) ‖A‖_F = √(Σᵢⱼ Aᵢⱼ²) = √tr(AᵀA) = √(Σ σᵢ²).

All Lᵖ-norms on ℝⁿ are equivalent (each bounded by constants times the others), but constants depend on dimension; this matters for high-dimensional analysis.

Unit vectors

A unit vector û = v / ‖v‖₂ has ‖û‖₂ = 1. Used as direction without magnitude. Computing a unit vector requires ‖v‖ ≠ 0; near-zero v causes catastrophic cancellation.

Orthogonality + orthonormality

  • Orthogonal set {v₁, …, vₖ}: vᵢ · vⱼ = 0 for i ≠ j.
  • Orthonormal set: orthogonal plus ‖vᵢ‖ = 1.
  • An orthonormal basis Q = [q₁ | … | qₙ] of ℝⁿ satisfies QᵀQ = I, hence Q⁻¹ = Qᵀ. This is the killer property of orthogonal matrices: they preserve dot products, lengths, and angles (isometries).

Projection onto a subspace

To project v onto the line spanned by unit vector û: proj_û v = (v · û)·û. For a general subspace S with orthonormal basis Q (columns of an m × k matrix):

proj_S v = Q·(Qᵀ·v)

For a non-orthonormal basis with columns of A (m × k, k ≤ m, full column rank):

proj_C(A) v = A·(AᵀA)⁻¹·Aᵀ·v

The matrix P = A·(AᵀA)⁻¹·Aᵀ is the projection matrix onto C(A); it satisfies P² = P (idempotent) and Pᵀ = P (symmetric). The residual v − P·v lies in N(Aᵀ).

3. Matrices

Operations

  • Addition / scalar multiplication. Entry-wise; A + B and α·A are defined when shapes match. Costs O(m·n).
  • Matrix-vector product. y = A·x, yᵢ = Σⱼ Aᵢⱼ·xⱼ. Either “linear combination of A’s columns weighted by x” or “each yᵢ is a row of A dotted with x.” O(m·n).
  • Matrix-matrix product. (A·B)ᵢⱼ = Σₖ Aᵢₖ·Bₖⱼ. For A (m × n) and B (n × p), result is (m × p). Inner dimensions must match. O(m·n·p) naive; Strassen-class algorithms achieve O(n^2.373…) asymptotically but are rarely competitive for sizes below ~1000.

AB ≠ BA in general. Multiplication is not commutative. Counterexample: rotations in 3D about different axes don’t commute (try R_x(90°) then R_y(90°) vs the reverse).

Multiplication is:

  • Associative: (AB)C = A(BC).
  • Distributive over addition.
  • Has identity: I·A = A·I = A.

Special matrix types

TypeDefinitionWhy it matters
Identity II·v = v; diagonal of 1s, zeros elsewhere.Multiplicative identity; A·A⁻¹ = I.
Diagonal DNonzero only on the diagonal.Stretches axes; cheap to invert (1/dᵢ).
Upper triangular UZeros below diagonal.Back substitution solves U·x = b in O(n²).
Lower triangular LZeros above diagonal.Forward substitution.
SymmetricA = Aᵀ.Real eigenvalues, orthogonal eigenvectors (spectral theorem).
Skew-symmetricA = −Aᵀ (diagonal is zero).Eigenvalues purely imaginary; cross-product matrix.
Orthogonal QQᵀQ = QQᵀ = I (real).Isometries; det = ±1; condition number 1.
Unitary UUU = I (complex; is conjugate transpose).Complex analog of orthogonal.
Positive-definiteSymmetric and xᵀAx > 0 ∀ x ≠ 0.All eigenvalues > 0; Cholesky applies; covariance matrices.
Positive-semidefinitexᵀAx ≥ 0.Eigenvalues ≥ 0; allows rank deficiency.
Permutation POne 1 per row and column.P·x permutes entries; det = ±1.
StochasticRows (or columns) sum to 1, entries ≥ 0.Markov chains; row-stochastic right-multiplies probability vectors.
HessenbergUpper triangular plus first sub-diagonal.Intermediate form for QR iteration.
TridiagonalNonzeros only on diagonal + adjacent sub/super-diagonals.O(n) solvers (Thomas algorithm).

Transpose

(Aᵀ)ᵢⱼ = Aⱼᵢ. Properties:

  • (A + B)ᵀ = Aᵀ + Bᵀ.
  • (αA)ᵀ = α·Aᵀ.
  • (A·B)ᵀ = Bᵀ·Aᵀ (order flips).
  • (Aᵀ)ᵀ = A.
  • (A⁻¹)ᵀ = (Aᵀ)⁻¹.

Inverse

A⁻¹ exists iff A is square and det(A) ≠ 0, equivalently A has full rank, equivalently all singular values > 0, equivalently the only solution of Ax = 0 is x = 0.

  • (A·B)⁻¹ = B⁻¹·A⁻¹.
  • Computing A⁻¹ explicitly is almost always wrong in numerical work. To solve A·x = b, factor A (LU, Cholesky, QR) and back-solve. Explicit inversion costs O(n³) and amplifies error.

Trace

tr(A) = Σᵢ Aᵢᵢ
  • Linear: tr(αA + βB) = α·tr(A) + β·tr(B).
  • Cyclic: tr(ABC) = tr(BCA) = tr(CAB).
  • Equals the sum of eigenvalues (counted with algebraic multiplicity).
  • tr(AᵀA) = ‖A‖_F² (Frobenius norm squared).

Determinant

A scalar measure of how a linear transformation scales (signed) volume. For a 2×2 matrix [[a, b], [c, d]], det = a·d − b·c.

  • det(A·B) = det(A)·det(B).
  • det(Aᵀ) = det(A).
  • det(A⁻¹) = 1 / det(A).
  • det(αA) = αⁿ·det(A) for n × n.
  • det(triangular) = product of diagonal entries.
  • det(orthogonal Q) = ±1.
  • Equals the product of eigenvalues (with multiplicity).
  • |det(A)| = product of singular values (regardless of sign).

Geometrically, |det(A)| is the volume of the parallelepiped formed by the columns of A. Sign indicates orientation (positive = preserves handedness, negative = mirror).

Rank

Number of linearly-independent columns = number of linearly-independent rows = number of nonzero singular values = dim(C(A)).

Worked example. A = [[1, 2, 3], [2, 4, 6], [1, 1, 1]]. Row 2 = 2 · Row 1, so only two rows are independent. Rank = 2. Null space contains all multiples of (1, −2, 1) — check: A·(1, −2, 1)ᵀ = (1·1 + 2·(−2) + 3·1, …)ᵀ = (0, 0, 0)ᵀ. ✓

4. Linear transformations

A function T: ℝⁿ → ℝᵐ is linear iff T(αx + βy) = α·T(x) + β·T(y) for all scalars α, β and vectors x, y. Every linear transformation between finite-dimensional spaces is exactly a matrix multiplication (relative to a chosen basis).

Common transformations (2D / 3D)

TransformationMatrix form
IdentityI
Scaling by factor k along each axisk·I
Non-uniform scalingdiag(s₁, s₂, …)
Rotation by θ in 2D[[cos θ, −sin θ], [sin θ, cos θ]]
Reflection across the x-axis[[1, 0], [0, −1]]
Reflection across line y = x[[0, 1], [1, 0]]
Shear (horizontal)[[1, k], [0, 1]]
Projection onto x-axis[[1, 0], [0, 0]]
Rotation about z-axis by θ in 3Dblock diag(R_z(θ), 1)

For rigid-body motion in 3D (rotation + translation), we use 4×4 homogeneous matrices:

T = [ R  t ]
    [ 0  1 ]

where R ∈ SO(3) and t ∈ ℝ³. Composition is matrix multiplication; this is the backbone of computer graphics, robotics, and AR.

Change of basis

If B = [b₁ | … | bₙ] is a new basis (columns of B form a basis of ℝⁿ), then a vector v with coordinates [v]_B in the new basis has standard coordinates B·[v]_B. The same linear map A expressed in the new basis is:

A' = B⁻¹ · A · B

This is a similarity transformation. A and A’ have the same eigenvalues, trace, determinant, rank, and characteristic polynomial — they’re “the same transformation in different coordinates.” Diagonalization (Section 7) is the special case where the new basis consists of eigenvectors and A’ becomes diagonal.

5. Four fundamental subspaces

Gilbert Strang’s organizing framework for any m × n matrix A of rank r:

SubspaceLives inDimensionDescription
Column space C(A)ℝᵐrAll possible A·x; the range.
Row space C(Aᵀ)ℝⁿrSpan of rows; perpendicular to N(A).
Null space N(A)ℝⁿn − rAll x with A·x = 0; perpendicular to C(Aᵀ).
Left null space N(Aᵀ)ℝᵐm − rAll y with yᵀA = 0; perpendicular to C(A).

Rank-nullity theorem

rank(A) + nullity(A) = n   (number of columns)

In words: every dimension of the input space ℝⁿ is either reached as a row-space direction (counted in rank) or annihilated as a null-space direction.

Orthogonal decompositions

ℝⁿ = C(Aᵀ) ⊕ N(A), with C(Aᵀ) ⊥ N(A). ℝᵐ = C(A) ⊕ N(Aᵀ), with C(A) ⊥ N(Aᵀ).

This pairs nicely with SVD: the right singular vectors split ℝⁿ into row-space + null-space; left singular vectors split ℝᵐ into column-space + left null-space.

6. Solving linear systems Ax = b

The Swiss Army knife of numerical analysis. Pick the right tool:

Direct methods

Gaussian elimination + LU decomposition. Factor A = P·L·U where P is a permutation, L is unit lower triangular, U is upper triangular. Then solve L·y = P·b by forward substitution, then U·x = y by back substitution. Cost: O(n³) to factor, O(n²) per right-hand-side. Default for dense square systems.

Cholesky decomposition (SPD only). If A is symmetric positive-definite, A = L·Lᵀ with L lower triangular and positive diagonal. About 2× faster than LU because we exploit symmetry, and numerically stable without pivoting. Cost: O(n³/3). Used heavily in statistics (Gaussian likelihood, covariance manipulation), optimization (Newton steps for convex problems), and Kalman filters.

QR decomposition. A = Q·R where Q has orthonormal columns and R is upper triangular. For overdetermined systems (m > n), Ax = b has no exact solution; the least squares problem minimizes ‖A·x − b‖₂. Solution: x = R⁻¹·Qᵀ·b. Cost: O(m·n²). Compared to the normal equations AᵀA·x = Aᵀ·b solved by Cholesky, QR is more numerically stable because the condition number of A appears once, not squared.

SVD. A = U·Σ·Vᵀ; pseudoinverse A⁺ = V·Σ⁺·Uᵀ where Σ⁺ inverts only the nonzero singular values. Solves every linear system: square, over-, or underdetermined; full rank or rank-deficient. For underdetermined systems, returns the minimum-norm solution. Cost: O(min(m, n)²·max(m, n)). Slowest dense factorization but most robust.

Iterative methods (for large sparse A)

Conjugate Gradient (CG). For SPD A. Iterates over conjugate (A-orthogonal) search directions; converges in at most n steps in exact arithmetic, but in practice converges to acceptable accuracy in O(√κ) iterations where κ is the condition number. Cost per iteration: one matrix-vector product + O(n) work. Used in FEM, image reconstruction, large-scale ML.

GMRES (Generalized Minimum Residual). For general (non-symmetric) A. Builds a Krylov subspace and minimizes the residual at each step. Memory grows with iteration count; usually restarted.

BiCGSTAB, MINRES, IDR(s). Other Krylov methods for specific structures.

Preconditioning transforms Ax = b into M⁻¹A·x = M⁻¹b where M ≈ A is cheap to invert. Standard preconditioners: incomplete LU (ILU), incomplete Cholesky (IC), algebraic multigrid (AMG), domain decomposition.

Condition number

κ(A) = σ_max(A) / σ_min(A)   (2-norm condition number)

If κ(A) ≈ 10ᵏ, expect to lose roughly k digits of accuracy in floating-point solves. Double precision provides ~16 decimal digits, so κ above 10¹⁶ means no digits of accuracy remain. Practical thresholds:

  • κ < 10⁴: well-conditioned, any method works.
  • 10⁴ < κ < 10⁸: moderate; use stable factorizations (QR, SVD).
  • κ > 10⁸: ill-conditioned; consider SVD with truncation, regularization (Tikhonov), or reformulation.

7. Eigenvalues + eigenvectors

A scalar λ is an eigenvalue of square A if there exists nonzero v with:

A·v = λ·v

The vector v is an eigenvector for λ. Geometrically, the linear transformation A merely stretches v by factor λ — it doesn’t rotate it.

Characteristic polynomial

Rearrange: (A − λI)·v = 0. Nontrivial v exists iff A − λI is singular, iff:

det(A − λI) = 0

This is a polynomial of degree n in λ with n complex roots (counted with multiplicity), the eigenvalues.

Worked example. A = [[2, 1], [1, 2]].

det(A − λI) = (2 − λ)² − 1 = λ² − 4λ + 3 = (λ − 1)(λ − 3) = 0.

Eigenvalues: λ₁ = 1, λ₂ = 3. Eigenvector for λ = 3: solve (A − 3I)v = 0 ⇒ [[−1, 1], [1, −1]]·v = 0 ⇒ v = (1, 1) (up to scaling). For λ = 1: v = (1, −1). The two eigenvectors are orthogonal because A is symmetric.

Multiplicities

  • Algebraic multiplicity of λ: the multiplicity of λ as a root of det(A − λI).
  • Geometric multiplicity of λ: dim N(A − λI), the dimension of its eigenspace.
  • Geometric ≤ algebraic, always.
  • When they’re equal for every eigenvalue, A is diagonalizable.

Diagonalization

If A has n linearly-independent eigenvectors, stack them as columns of P and the eigenvalues on the diagonal of D:

A = P · D · P⁻¹
A·P = P·D     (each column of P is an eigenvector)

This makes powers trivial: Aᵏ = P·Dᵏ·P⁻¹, where Dᵏ just raises each diagonal entry to the k-th power. Used to solve linear recurrences (Fibonacci closed form), linear ODEs ẋ = A·x with solution x(t) = e^(A·t)·x(0) = P·diag(e^(λᵢ·t))·P⁻¹·x(0), and Markov chain limiting distributions.

Spectral theorem (symmetric matrices)

If A is real and symmetric (A = Aᵀ), then:

  • All eigenvalues are real.
  • Eigenvectors for distinct eigenvalues are orthogonal.
  • A can always be diagonalized by an orthogonal matrix Q: A = Q·Λ·Qᵀ, with Q⁻¹ = Qᵀ.

This is the spectral decomposition and one of the most-used results in applied math.

Complex generalization. If A is Hermitian (A = A*), eigenvalues are real and A = U·Λ·U* with U unitary.

Applications

  • PCA. Eigendecompose the sample covariance matrix C = (1/(n−1))·Xᵀ·X (after centering X). Eigenvectors = principal axes; eigenvalues = variance along each axis.
  • Vibration modes. For mass M and stiffness K (both SPD), the generalized eigenvalue problem K·v = ω²·M·v gives natural frequencies ωᵢ and mode shapes vᵢ. Foundational for structural and mechanical engineering.
  • PageRank. Stationary distribution of a Markov chain is the eigenvector for λ = 1 of the transition matrix Pᵀ. Power iteration solves this on web-scale graphs.
  • Quantum mechanics. Observable operators are Hermitian; their eigenvalues are the measurable values; eigenvectors are the post-measurement states.
  • Markov chains. The mixing time depends on the second-largest eigenvalue of the transition matrix (spectral gap).
  • Graph Laplacian / spectral clustering. Eigenvectors of the Laplacian reveal community structure.

Numerical algorithms

  • Power iteration. v ← A·v / ‖A·v‖; converges to the eigenvector for the largest |λ|. Rate depends on |λ₂/λ₁|.
  • Inverse iteration. Power iteration on (A − μI)⁻¹; converges to the eigenvalue nearest the shift μ.
  • QR algorithm. Iteratively factor A = Q·R then form A’ = R·Q; eigenvalues appear on the diagonal in the limit. The workhorse for dense eigenproblems.
  • Lanczos (symmetric) / Arnoldi (general). Build a Krylov subspace and solve a small projected eigenproblem; the method of choice for large sparse matrices when only a few extreme eigenvalues are needed.
  • Jacobi-Davidson, LOBPCG. Advanced solvers for specific cases.

8. Singular Value Decomposition (SVD)

For any m × n matrix A (real or complex; full rank or not; square or not):

A = U · Σ · Vᵀ
  • U is m × m with orthonormal columns (left singular vectors).
  • Σ is m × n with non-negative diagonal entries σ₁ ≥ σ₂ ≥ … ≥ σ_min(m,n) ≥ 0 (singular values), zeros off the diagonal.
  • V is n × n with orthonormal columns (right singular vectors). Vᵀ in the factorization, V in the formula for the pseudoinverse.

The “economy” or “thin” SVD truncates U and V to min(m, n) columns and Σ to that size.

Geometric interpretation

Every linear map decomposes as rotation → axis-aligned stretch → rotation:

  1. Vᵀ rotates the input frame so its axes line up with the right singular vectors.
  2. Σ stretches each axis by σᵢ.
  3. U rotates the result into the output frame.

The unit sphere in ℝⁿ is mapped to an ellipsoid in ℝᵐ; the σᵢ are the semi-axis lengths; the columns of U are their directions.

Worked example

A = [[3, 1], [1, 3]]. Symmetric, so SVD coincides with eigendecomposition (up to signs). Eigenvalues: 4 and 2 (both positive, so singular values too). Eigenvectors: (1,1)/√2 and (1,−1)/√2.

A = (1/√2)·[[1, 1], [1, −1]]  ·  diag(4, 2)  ·  (1/√2)·[[1, 1], [1, −1]]ᵀ

Connections to other concepts

  • σᵢ² are the nonzero eigenvalues of both AᵀA (n × n) and AAᵀ (m × m).
  • Columns of V are eigenvectors of AᵀA.
  • Columns of U are eigenvectors of AAᵀ.
  • Rank(A) = number of nonzero σᵢ.
  • ‖A‖₂ (spectral norm) = σ₁.
  • ‖A‖_F (Frobenius) = √(Σ σᵢ²).
  • ‖A‖_∗ (nuclear) = Σ σᵢ.
  • |det(A)| = ∏ σᵢ (for square A).
  • κ₂(A) = σ₁ / σ_r (smallest nonzero).

Eckart-Young-Mirsky theorem

The best rank-k approximation of A in both Frobenius and spectral norm is:

Aₖ = Σᵢ₌₁ᵏ σᵢ · uᵢ · vᵢᵀ

obtained by zeroing out all but the top k singular values. The approximation error:

‖A − Aₖ‖₂ = σₖ₊₁
‖A − Aₖ‖_F = √(σₖ₊₁² + σₖ₊₂² + … )

This single result powers:

  • Image compression. Store only the top k singular vectors and values.
  • Recommender systems. Low-rank factorization of user-item rating matrices (Netflix Prize).
  • Latent Semantic Analysis (LSA). Term-document matrix → low-rank approximation reveals topics.
  • Denoising. Truncate small σᵢ (assumed noise) and reconstruct.
  • Model order reduction. Reduce dynamical-system dimensionality.

Pseudoinverse (Moore-Penrose)

A⁺ = V · Σ⁺ · Uᵀ

where Σ⁺ replaces each nonzero σᵢ with 1/σᵢ (and keeps zeros as zeros). Properties:

  • A·A⁺·A = A.
  • A⁺·A·A⁺ = A⁺.
  • (A·A⁺)ᵀ = A·A⁺.
  • (A⁺·A)ᵀ = A⁺·A.
  • For full-rank tall A: A⁺ = (AᵀA)⁻¹·Aᵀ (left inverse).
  • For full-rank wide A: A⁺ = Aᵀ·(AAᵀ)⁻¹ (right inverse).

The solution x = A⁺·b is the least-squares minimum-norm solution: among all x minimizing ‖A·x − b‖, it has the smallest ‖x‖.

9. PCA (Principal Component Analysis)

A workhorse algorithm for dimensionality reduction, visualization, decorrelation, and denoising.

Algorithm

Given data matrix X (n × d) of n samples, each with d features:

  1. Center. X̃ = X − 1·μᵀ, where μ is the column-wise mean. (Sometimes also scale to unit variance — “standardization” — when feature units differ.)
  2. Covariance. C = (1/(n − 1))·X̃ᵀ·X̃ (d × d, symmetric, positive-semidefinite).
  3. Eigendecompose. C = V·Λ·Vᵀ, eigenvalues sorted λ₁ ≥ λ₂ ≥ … ≥ λ_d ≥ 0.
  4. Principal components. Columns of V are the principal directions; each λᵢ = variance along principal direction i.
  5. Project. Y = X̃·V[:, :k] gives the k-dimensional reduced representation.

Equivalently and more numerically stable: compute the SVD X̃ = U·Σ·Vᵀ directly. Then C = V·(Σᵀ Σ/(n−1))·Vᵀ, so V is the matrix of principal directions and λᵢ = σᵢ²/(n−1).

Explained variance

explained_variance_ratio_i = σᵢ² / Σⱼ σⱼ²

Choose k such that the cumulative ratio exceeds a threshold (e.g., 0.95).

Worked example

Three 2D points: (0, 0), (1, 2), (2, 4). Already exactly on the line y = 2x.

  • Mean: (1, 2). Centered: (−1, −2), (0, 0), (1, 2).
  • Covariance: (1/2)·[[1+0+1, 2+0+2], [2+0+2, 4+0+4]] = [[1, 2], [2, 4]].
  • Eigenvalues: λ² − 5λ + 0 = 0 ⇒ λ = 5, 0. Eigenvector for 5: (1, 2)/√5.
  • One PC captures 100% of variance; the data is intrinsically 1D.

Caveats

  • Linear method. Captures only linear structure. Nonlinear analogs: kernel PCA, Isomap, t-SNE, UMAP, autoencoders.
  • Scale-sensitive. Features with larger magnitudes dominate. Standardize when units differ.
  • Maximum-variance ≠ most-informative. For classification, LDA may serve better.
  • Centering matters. Failing to center gives “PCA of raw data” which mixes mean and shape.

10. Matrix factorizations summary

FactorizationFormRequiresCost (dense)Primary use
LUA = P·L·USquare A(2/3)·n³General Ax = b
CholeskyA = L·LᵀSPD(1/3)·n³SPD Ax = b; sampling Gaussian
QRA = Q·RNone2·m·n² (m ≥ n)Least squares; Gram-Schmidt; basis
SVDA = U·Σ·VᵀNone~14·m·n² (m ≥ n)Robust solve; PCA; low-rank
EigendecompA = P·D·P⁻¹Diagonalizable square~10·n³Powers, exponential, modes
SchurA = Q·T·QᵀSquare (always exists)~25·n³Numerically stable eigen-info
PolarA = U·PSquarevia SVDBest rotation; animation
JordanA = P·J·P⁻¹Square (always exists)unstable numericallyTheoretical structure
Bidiagonal / HessenbergReductionsSquareintermediate stepInside QR / SVD algorithms

Schur decomposition

Every square matrix A (over ℂ) is unitarily similar to an upper-triangular matrix: A = Q·T·Q*, with Q unitary and T upper triangular with eigenvalues on the diagonal. Always exists; numerically stable. For real A with possibly complex eigenvalues, the real Schur form uses 1×1 and 2×2 diagonal blocks.

Polar decomposition

A = U·P where U is orthogonal (rotation/reflection) and P is symmetric positive-semidefinite (stretch). Computed via SVD: if A = UΣVᵀ then U_polar = U·Vᵀ and P = V·Σ·Vᵀ. Used in:

  • Continuum mechanics — deformation gradient F = R·U (rotation · stretch).
  • Computer animation — smoothest interpolation of rotations.
  • Procrustes problem — finding the best rotation aligning two point clouds.

11. Tensors + multi-linear algebra

A tensor of rank k is a multi-dimensional array with k indices. A scalar is rank 0, a vector rank 1, a matrix rank 2, etc. (Note: “rank” here means “order” or “number of indices” — not the same as matrix rank.) In physics and differential geometry, “tensor” also implies coordinate-transformation rules; in ML/data science it’s usually just “multi-dim array.”

Einstein summation convention

Repeated indices are implicitly summed:

yᵢ = Aᵢⱼ · xⱼ    means    yᵢ = Σⱼ Aᵢⱼ·xⱼ
Cᵢⱼ = Aᵢₖ · Bₖⱼ  means    matrix product

Concise and a perfect match for einsum in NumPy / PyTorch / JAX.

Tensor product (outer product)

For vectors a ∈ ℝᵐ and b ∈ ℝⁿ, the outer product a ⊗ b is the m × n matrix with entries (a ⊗ b)ᵢⱼ = aᵢ·bⱼ. Rank 1. Generalizes to higher-order tensors.

Tensor contraction

Summing over a pair of indices, reducing the rank by 2. Matrix product is a contraction. Trace is a contraction.

Kronecker product

For A (m × n) and B (p × q):

A ⊗ B = block matrix with (i, j) block aᵢⱼ·B, total size mp × nq.

Properties:

  • (A ⊗ B)·(C ⊗ D) = (A·C) ⊗ (B·D).
  • (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹.
  • vec(A·X·B) = (Bᵀ ⊗ A)·vec(X), where vec stacks columns.

Used in: signal processing (separable filters), Lyapunov equations, quantum state spaces (composite systems).

Higher-order decompositions

  • CP (CANDECOMP / PARAFAC). T = Σᵣ aᵣ ⊗ bᵣ ⊗ cᵣ; sum of rank-1 tensors. Generalization of “rank” to tensors (NP-hard in general).
  • Tucker decomposition. T = G ×₁ U₁ ×₂ U₂ ×₃ U₃ where G is a “core” tensor and Uₖ are factor matrices; higher-order SVD (HOSVD).
  • Tensor train (TT). Sequence of small 3D cores; the workhorse for very high-dimensional tensors in scientific computing.

Matrix derivatives (common identities)

For x ∈ ℝⁿ and A, B compatible:

ExpressionDerivative w.r.t. x
aᵀ·xa
xᵀ·aa
xᵀ·A·x(A + Aᵀ)·x (= 2A·x if A symmetric)
‖x‖² = xᵀ·x2·x
log det(X) (w.r.t. X)(X⁻¹)ᵀ
tr(A·X) (w.r.t. X)Aᵀ
tr(Xᵀ·A·X) (w.r.t. X)(A + Aᵀ)·X

These are the bread and butter of analytic gradients in optimization and ML; modern autograd computes them by chain rule but knowing the identities catches autograd bugs and informs custom backward passes.

12. Norms + inequalities

Vector norms

  • L⁰ — count of nonzero entries (not a true norm).
  • L¹ — sum of absolute values.
  • L² — Euclidean.
  • Lᵖ — (Σ |vᵢ|ᵖ)^(1/p).
  • L∞ — max absolute value.

Matrix norms

  • Induced (operator) p-norm. ‖A‖ₚ = sup_{x ≠ 0} ‖A·x‖ₚ / ‖x‖ₚ.
    • ‖A‖₁ = max column sum.
    • ‖A‖∞ = max row sum.
    • ‖A‖₂ = σ_max(A) (spectral norm).
  • Frobenius. ‖A‖_F = √(Σᵢⱼ Aᵢⱼ²) = √tr(AᵀA) = √(Σ σᵢ²). Not an induced norm but submultiplicative.
  • Nuclear (trace) norm. ‖A‖_∗ = Σ σᵢ. Convex surrogate for rank; used in low-rank recovery (matrix completion).
  • Max norm. maxᵢⱼ |Aᵢⱼ|. Useful but not submultiplicative.

All induced norms and the Frobenius norm satisfy:

  • ‖A·B‖ ≤ ‖A‖·‖B‖ (submultiplicativity).
  • ‖A·x‖ ≤ ‖A‖·‖x‖.

Cauchy-Schwarz inequality

|⟨a, b⟩| ≤ ‖a‖·‖b‖

with equality iff a and b are linearly dependent. Universal across inner-product spaces.

Triangle inequality

‖a + b‖ ≤ ‖a‖ + ‖b‖

Together with positive-definiteness and absolute homogeneity, this defines a norm.

Hölder inequality

For 1/p + 1/q = 1:

|⟨a, b⟩| ≤ ‖a‖ₚ·‖b‖_q

Generalizes Cauchy-Schwarz (p = q = 2). Used in functional analysis and probability.

Matrix Hölder

‖A·B‖_F ≤ ‖A‖₂·‖B‖_F; many similar variants relating Schatten norms (Schatten-p = (Σ σᵢᵖ)^(1/p), recovering nuclear for p = 1, Frobenius for p = 2, spectral for p = ∞).

13. Numerical considerations

Floating-point precision

FormatBits~Decimal digitsNotes
float64 (double)6415.95Scientific computing default
float32 (single)327.22GPU compute, ML training
float16 (half)163.31ML inference; narrow exponent range
bf16 (bfloat16)162.36Same exponent range as float32; ML training on TPUs/GPUs
fp8 (E4M3, E5M2)81–2Cutting-edge ML training; needs scaling

Choosing precision is a trade-off between speed, memory, and numerical stability. In ML, mixed-precision training keeps weights and accumulators in higher precision while doing matmuls in lower precision.

Catastrophic cancellation

Subtracting two nearly-equal floating-point numbers loses precision proportional to their relative closeness. E.g., (1 + 1e-16) − 1 in float64 ≈ 0, not 1e-16. Mitigate by reformulating: use compensated summation (Kahan), use stable identities (e.g., (1 − cos x) ≈ x²/2 for small x), or use libraries that handle this internally.

Numerical rank vs theoretical rank

In exact arithmetic, rank is a well-defined integer. In floating-point, “rank” depends on a tolerance: count σᵢ above ε·σ_max·max(m, n). NumPy’s matrix_rank uses this by default. Small singular values can represent either rank deficiency masked by noise or just noise; truncation must be motivated.

Pivoting in LU

Without pivoting, LU can fail (zero on diagonal) or amplify error. Partial pivoting (row exchanges to put the largest available entry on the diagonal) is standard and stable. Full pivoting (row + column exchanges) is more stable but rarely needed.

Iterative refinement

After computing x̂ = solve(A, b), residual r = b − A·x̂ may reveal lost precision. Solve A·δ = r and update x̂ ← x̂ + δ. Cheap correction, often regains a few digits.

BLAS levels

  • Level 1. Vector-vector ops (axpy, dot, scal). O(n) work on O(n) data; memory-bound.
  • Level 2. Matrix-vector (gemv, trsv). O(n²) work on O(n²) data; still memory-bound.
  • Level 3. Matrix-matrix (gemm, syrk, trsm). O(n³) work on O(n²) data; compute-bound, achieves close to peak FLOPS.

Algorithm design that maximizes Level-3 usage (blocked / panel-factored versions of LU, QR, Cholesky) is the reason LAPACK is fast.

LAPACK and friends

  • BLAS — low-level kernels. Reference impl is slow; production uses OpenBLAS, Intel MKL, Apple Accelerate, AOCL, cuBLAS (NVIDIA), rocBLAS (AMD), oneMKL (Intel oneAPI), Arm Performance Libraries, Eigen.
  • LAPACK — higher-level: linear systems, eigenproblems, SVD, least squares. Calls BLAS.
  • ScaLAPACK / ELPA — distributed-memory versions.
  • SuiteSparse / MUMPS / PARDISO — sparse direct solvers.
  • Trilinos / PETSc / Hypre — sparse iterative + multigrid + preconditioners.

NumPy and SciPy bind to whichever BLAS/LAPACK they’re built against; on most distros it’s OpenBLAS or MKL. PyTorch and JAX layer their own dispatch on top of BLAS + cuBLAS + custom CUDA/Triton kernels. Knowing the stack matters: a misconfigured single-threaded BLAS in a multi-threaded Python program can silently throttle performance by 10×.

14. Applications matrix — what technique solves what

ProblemTechnique
Solve square Ax = b, general ALU with partial pivoting
Solve Ax = b, A symmetric positive-definiteCholesky
Solve Ax = b, A symmetric indefiniteLDLᵀ (Bunch-Kaufman)
Overdetermined least squares (m > n, full rank)QR (Householder)
Underdetermined min-norm (m < n)SVD / Moore-Penrose pseudoinverse
Rank-deficient / ill-conditionedTruncated SVD; Tikhonov regularization
PCA / latent factorsSVD of centered data; eigendecomp of covariance
Image / data compressionTruncated SVD per Eckart-Young
Matrix completion (sparse observations)Nuclear-norm minimization (convex); ALS
PageRank / Markov stationaryPower iteration on transition matrix
Vibration modes / FEM eigenvaluesSymmetric generalized eigenproblem; Lanczos for large sparse
Linear ODE ẋ = A·xMatrix exponential e^(At); via eigendecomp or Padé
Lyapunov stabilitySolve A·X + X·Aᵀ + Q = 0 (Bartels-Stewart, Schur-based)
Kalman filter prediction / updateMatrix multiplications + Cholesky / Joseph form for stability
Robot Jacobian inverseDamped least squares (J⁺ = Jᵀ(JJᵀ + λI)⁻¹) for singularity robustness
Sparse graph computationsCompressed-storage sparse formats + iterative solvers (CG, GMRES)
Recommender systemLow-rank matrix factorization via SGD / ALS / SVD
Word embeddings (latent semantic)Truncated SVD of co-occurrence or PPMI matrix
Spectral clusteringEigenvectors of graph Laplacian
Total least squares (errors-in-variables)SVD-based estimator

15. Cross-references

16. Citations

  • Gilbert Strang, Linear Algebra and Its Applications, 5th ed., Wellesley-Cambridge Press, 2016. The classic engineering-flavored exposition.
  • Gilbert Strang, Introduction to Linear Algebra, 5th ed., Wellesley-Cambridge Press, 2016. Companion to the MIT 18.06 OCW lectures.
  • Lloyd N. Trefethen and David Bau III, Numerical Linear Algebra, SIAM, 1997. The reference for stable, modern numerical algorithms.
  • Gene H. Golub and Charles F. Van Loan, Matrix Computations, 4th ed., Johns Hopkins University Press, 2013. The comprehensive practitioner’s bible.
  • Carl Eckart and Gale Young, “The approximation of one matrix by another of lower rank,” Psychometrika 1(3): 211–218, 1936. The original Eckart-Young theorem.
  • Roger A. Horn and Charles R. Johnson, Matrix Analysis, 2nd ed., Cambridge University Press, 2013. Definitive on matrix-theoretic results.
  • James W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997. Strong coverage of numerical stability + iterative methods.
  • L. Vandenberghe and S. Boyd, Introduction to Applied Linear Algebra, Cambridge University Press, 2018. Free PDF; great for the ML / engineering crossover audience.