Numerical Linear Algebra — Math Reference

1. At a glance

Numerical linear algebra is the discipline of turning the abstract math of linear-algebra-essentials — matrices, vectors, eigenvalues, decompositions — into computations that run on real hardware, in finite precision, fast enough to solve problems with n in the millions or billions.

Two operational regimes dominate, and choosing between them is the first design decision:

  • Dense regime. Small-to-medium n (roughly up to 10⁴–10⁵ on a single node). Every entry of A is stored; algorithms cost O(n²) memory and O(n³) flops. The tools are BLAS/LAPACK with cache-blocked, SIMD-vectorized kernels. Examples: covariance matrices in statistics, small-scale eigenproblems, control system matrices, dense Hessians in optimization.
  • Sparse regime. Large n (10⁶–10⁹+) where A has O(n) or O(n·log n) nonzeros. Storing only nonzeros and exploiting structure is mandatory. Algorithms are dominated by sparse matrix-vector products (SpMV) inside iterative Krylov methods (CG, GMRES, BiCGSTAB) wrapped with preconditioners. Examples: discretized PDEs (FEM/FEA, CFD), graph Laplacians, recommender systems, network analysis.

A third hybrid regime — low-rank — sits between the two: A is dense in principle but compressible to rank k ≪ n, so randomized SVD or hierarchical matrices (H-matrices, HSS, HODLR) reduce both storage and flops to O(n·k) or O(n·log n).

Hardware awareness is non-negotiable. A modern CPU has L1/L2/L3 caches with bandwidths differing by an order of magnitude; an algorithm that reuses data inside L1 runs 10–100× faster than one that streams from DRAM. SIMD (AVX-512, NEON, SVE) gives 8–16× speedup on packed FP64 ops if the kernel is vectorizable. GPUs (NVIDIA H100, AMD MI300, Apple M-series) push another 10–100× via massive parallelism and Tensor Cores, but only for dense GEMM-heavy workloads — sparse SpMV is bandwidth-limited and gains far less. Distributed memory (MPI + ScaLAPACK or SLATE) scales out to thousands of nodes for truly large problems.

The literature is anchored by Trefethen and Bau (1997) for pedagogy, Golub and Van Loan (4th ed 2013) for the reference treatise, Davis (2006) for sparse direct methods, and Saad (2nd ed 2003) for iterative methods.

2. Floating-point and conditioning

All numerical linear algebra runs in finite precision, so the IEEE 754 standard (originally 1985, revised 2008 and 2019) is the foundation.

Formats in use:

  • binary64 (FP64, “double”). 1 sign + 11 exponent + 52 mantissa bits. ~15.95 decimal digits. Range ~10⁻³⁰⁸ to 10³⁰⁸. The default for scientific computing. Machine epsilon ε_64 ≈ 2.220446e-16.
  • binary32 (FP32, “single”). 1 + 8 + 23 bits. ~7.22 decimal digits. ε_32 ≈ 1.192e-7. Common in graphics, signal processing, ML training (legacy).
  • binary16 (FP16, “half”). 1 + 5 + 10 bits. ~3.3 decimal digits. Limited range ±65504. Used in ML inference and mixed-precision training.
  • bfloat16 (BF16). 1 + 8 + 7 bits. Same exponent range as FP32 (so no overflow when downcasting weights), but only ~2.3 decimal digits of mantissa. The de facto ML format on TPUs, Hopper, MI300.
  • FP8. Two variants in the Open Compute Project / NVIDIA Hopper spec: E4M3 (1+4+3, narrower range, more precision) for forward activations; E5M2 (1+5+2, wider range like FP16) for gradients. Used for ML inference and Hopper FP8 training.
  • TF32. NVIDIA’s 19-bit “tensorfloat” (1+8+10). Same exponent as FP32, mantissa as FP16. Hardware-only format inside Ampere+ Tensor Cores; never stored.

Catastrophic cancellation. Subtracting two near-equal values loses significant digits proportionally. If a = 1.000000123 and b = 1.000000122 are both correct to FP64, their difference a − b ≈ 10⁻⁹ has only ~7 significant digits left — the leading 9 digits cancelled. This is why naive variance computations using E[X²] − (E[X])² are unstable when the mean is large relative to the spread; use Welford’s online algorithm instead. The same hazard appears in Gram-Schmidt, normal equations, and any algorithm that subtracts close quantities.

Condition number. For a linear system A·x = b, the relative error in x is bounded by

‖Δx‖/‖x‖ ≤ κ(A) · (‖ΔA‖/‖A‖ + ‖Δb‖/‖b‖)

where κ(A) = ‖A‖ · ‖A⁻¹‖ = σ_max(A) / σ_min(A) (using the 2-norm). A condition number of 10ᵏ means you can lose up to k digits of accuracy. So:

  • κ ≈ 1: well-conditioned; expect full FP64 precision.
  • κ ≈ 10⁸: half precision lost; FP64 gives ~8 correct digits.
  • κ ≈ 10¹⁶: at the limit of FP64; results are essentially noise.

A matrix that’s “ill-conditioned” needs either higher precision, a stable algorithm (QR or SVD rather than normal equations), or regularization. Estimating κ cheaply: LAPACK provides dgecon (1-norm condition estimator) using Higham’s algorithm — O(n²) rather than computing the full inverse.

Backward stability. A backward-stable algorithm computes the exact answer for a slightly perturbed input: the computed solution x̃ satisfies (A + ΔA)·x̃ = b with ‖ΔA‖ ≤ O(ε)·‖A‖. Householder QR, LU with partial pivoting, and Cholesky on SPD matrices are all backward stable. Classical Gram-Schmidt is not; Modified Gram-Schmidt is partially.

Forward vs backward error. Forward error ‖x̃ − x‖ is what the user wants to bound; backward error ‖ΔA‖ is what the algorithm controls. The two are linked by κ(A): forward error ≤ κ(A) × backward error. So even a backward-stable algorithm gives meaningless answers on a badly-conditioned matrix — the user must look at κ(A) separately. Iterative refinement (compute residual in higher precision, solve correction in lower precision) recovers forward accuracy when the backward-stable factorization is in working precision but conditioning is borderline.

Numerical example. Solve the 5×5 Hilbert matrix H (H_{ij} = 1/(i+j−1)) for x = (1,1,1,1,1)ᵀ. κ(H) ≈ 4.77e5 in FP64. Even Cholesky (which is backward stable on this SPD matrix) computes x̃ with relative error of order κ·ε ≈ 10⁻¹⁰ — eight digits lost out of sixteen. By n = 12 the Hilbert matrix has κ > 10¹⁶ and FP64 Cholesky returns garbage; you must regularize or move to extended precision.

3. LU decomposition

LU factorizes a square matrix A as the product of a lower-triangular L (unit diagonal) and an upper-triangular U:

A = L · U

Solving A·x = b then becomes two triangular solves: L·y = b (forward substitution) and U·x = y (back substitution), each O(n²) once the factors exist.

Partial pivoting. Plain LU fails or is unstable when a diagonal pivot is zero or small. Partial pivoting permutes rows to put the largest-magnitude candidate on the diagonal at each step:

P · A = L · U

where P is a permutation matrix. With partial pivoting, LU is backward stable in practice (worst-case growth factor 2ⁿ⁻¹ is exponentially rare; empirical growth is O(√n)).

Cost. Factorization: (2/3)n³ flops. Solve: 2n² flops per right-hand side. Memory: O(n²), with L and U overwriting A in place.

LAPACK routines.

  • dgetrf — compute P·A = L·U (factorization with partial pivoting).
  • dgetrs — solve A·x = b using the factors.
  • dgesv — combined factor + solve (the high-level driver).

The d prefix is double precision; s, c, z are single, complex single, complex double.

Block LU. To exploit cache, the matrix is partitioned into blocks (typically 64×64 or 128×128), and the factorization uses BLAS-3 (matrix-matrix) operations on each block. This is what makes modern dgetrf reach 70–90% of peak FLOPS on a CPU.

Recursive LU. Toledo (1997) showed that recursing on halves gives even better cache behavior — the algorithm divides the matrix into a left and right block column, recursively factors the left, applies the update to the right, recurses on the right. This is the strategy in PLASMA and SLATE for tiled multicore LU.

Growth factor. Define ρ_n = max_k ‖U^{(k)}‖_∞ / ‖A‖_∞ where U^{(k)} is the partially-reduced matrix at step k. Backward error bound is O(n·ρ_n·ε). With partial pivoting ρ_n ≤ 2^{n−1} in worst case (Wilkinson 1961), but in practice ρ_n grows like O(n^{2/3}) on random matrices and is rarely a concern. Pathological constructions exist (Wilkinson matrix) where it saturates the bound. With complete pivoting (also permute columns), ρ_n ≤ n^{1/2}·(2·3^{1/2}·4^{1/3}·…·n^{1/(n−1)})^{1/2} — polynomial — but the O(n³) pivot search is too expensive in practice. Rook pivoting (Foster 1997) and tournament pivoting (Demmel-Grigori-Hoemmen-Langou 2008, for communication-avoiding LU) are middle grounds used in distributed and GPU codes.

Communication-avoiding LU (CALU). On distributed memory and GPUs, the bottleneck is not flops but data movement. CALU replaces partial pivoting with tournament pivoting: each panel is factored on each process, the best pivots are merged in a reduction tree. Cost in messages drops from O(n·log P) to O(√P·log P) at the cost of weaker stability (still backward stable but with a larger growth-factor bound). Implemented in CALU within ScaLAPACK and SLATE.

4. QR decomposition

QR factorizes A (m × n, m ≥ n) as

A = Q · R

where Q is m × m orthogonal (Qᵀ·Q = I) and R is m × n upper triangular. For “thin” QR, Q is m × n and R is n × n. QR is the workhorse for least-squares: solving min_x ‖A·x − b‖₂ becomes R·x = Qᵀ·b once factored.

Algorithms (in order of stability):

  • Classical Gram-Schmidt (CGS). Orthogonalize each column against the previous Q-columns using inner products. Theoretically correct but loses orthogonality severely in finite precision when columns are nearly dependent. Avoid.
  • Modified Gram-Schmidt (MGS). Same idea but subtracts projections one at a time, updating the column in place. Much better than CGS but still loses some orthogonality at κ ≈ 1/ε. Used in iterative Krylov methods (Arnoldi) where partial loss is acceptable.
  • Householder reflections. Build Q as a product of n − 1 reflectors H_k = I − 2·v_k·v_kᵀ / (v_kᵀ·v_k). Each reflector zeros out a subdiagonal column below the pivot. Backward stable; this is what LAPACK uses (dgeqrf). Cost: (2/3)·m·n² flops for thin QR (m = n: (4/3)n³, twice LU).
  • Givens rotations. Plane rotations zeroing one element at a time. More flops than Householder for dense matrices but lets you zero a single nonzero element without touching the rest — ideal for sparse QR, updating QR after adding a row or column, and orthogonal updates inside the QR eigenvalue algorithm.

LAPACK routines.

  • dgeqrf — compute QR; returns R in the upper triangle of A, Householder vectors in the lower part.
  • dorgqr — explicitly form Q from the compact reflectors.
  • dormqr — apply Q (or Qᵀ) to another matrix without forming it explicitly (faster).
  • dgels — high-level least-squares driver.

Least-squares via QR. For overdetermined A·x ≈ b with m > n:

  1. Factor A = Q·R (thin QR).
  2. Compute c = Qᵀ·b.
  3. Solve R·x = c by back-substitution.

This is numerically stable: the residual norm is preserved, and the conditioning is κ(A) (not κ(A)² as in normal equations).

Rank-deficient or nearly rank-deficient. When A has near-zero singular values, plain QR can return a triangular R with small diagonal entries that pollute the solution. Two remedies: QR with column pivoting (LAPACK dgeqp3) permutes columns so |R_{ii}| decreases, exposing the numerical rank — truncate at the first R_{ii} below a tolerance. Complete orthogonal decomposition (dgelsy) then applies a second orthogonal transformation to the trailing block. For true rank determination prefer SVD (dgelsd), which is slower but gives the minimum-norm least-squares solution and a robust singular-value gap.

Updating QR. Adding a row to A (Givens-based) costs O(m·n); adding a column costs O(m·n); deleting either costs O(m·n) with hyperbolic rotations. Critical in online regression, Kalman square-root filters, and recursive least squares. LAPACK does not ship these directly; libraries like qrupdate and SuiteSparse spqr do.

5. Cholesky decomposition

For a symmetric positive-definite (SPD) matrix A, the Cholesky factorization is

A = L · Lᵀ

where L is lower-triangular with positive diagonal. Half the flops of LU ((1/3)n³ vs (2/3)n³), no pivoting needed (numerically stable because the diagonal entries of L are bounded by √(a_ii)), half the storage.

If A is not SPD, the algorithm fails when it tries to take √ of a non-positive quantity — which is itself the cheapest SPD test.

LAPACK routines.

  • dpotrf — Cholesky factor (L in lower triangle, or U = Lᵀ in upper if requested).
  • dpotrs — solve A·x = b using the factor.
  • dposv — combined driver.
  • dpocon — condition number estimate.

LDLᵀ variant. For symmetric indefinite matrices, A = L · D · Lᵀ where D is block-diagonal (1×1 and 2×2 blocks). Bunch-Kaufman pivoting keeps it stable. LAPACK: dsytrf / dsytrs. Rook pivoting variant dsytrf_rk gives a smaller pivot growth bound at slightly higher cost; useful in KKT systems where conditioning is precarious.

Block Cholesky for SPD systems with right-hand sides. When solving A·X = B for an n × k matrix B (k right-hand sides), don’t loop column-by-column — call dpotrs once on the whole B; it uses BLAS-3 dtrsm internally for both the L and Lᵀ solves. This is dramatically faster than k separate solves on a threaded BLAS.

Use cases. Normal equations in regression (when stable to do so), Kalman filter covariance updates, sampling from multivariate Gaussians (z = L·u with u ~ N(0, I)), interior-point optimization (the KKT system is symmetric indefinite or SPD after Schur complement), preconditioners (incomplete Cholesky, see §8).

Rank-one update / downdate. A + α·v·vᵀ has Cholesky factor obtainable from L in O(n²) flops via a sequence of Givens or hyperbolic rotations (LAPACK auxiliary dlartg, or dchud/dchdd in LINPACK lineage). Indispensable in recursive least squares, particle filters, sequential Bayesian updating. Downdate (α < 0) is delicate — the result may not be SPD; check the diagonal of the updated factor at each step.

6. SVD

The singular value decomposition factors any m × n matrix as

A = U · Σ · Vᵀ

where U (m × m) and V (n × n) are orthogonal and Σ (m × n) is diagonal with σ_1 ≥ σ_2 ≥ … ≥ σ_min(m,n) ≥ 0 (the singular values).

Algorithm. The standard “Golub-Kahan-Reinsch” (1965, 1970) approach has two phases:

  1. Bidiagonalization. Apply Householder reflectors from both sides to reduce A to upper bidiagonal form B. O(m·n²) flops.
  2. Bidiagonal SVD. Compute the SVD of B using either implicit QR iteration with shifts (Demmel-Kahan 1990 zero-shift QR for high relative accuracy) or divide-and-conquer (Gu-Eisenstat 1995). O(n²) to O(n³) depending on whether singular vectors are wanted.

LAPACK routines.

  • dgesvd — classical QR-based SVD, robust but slower for full vectors.
  • dgesdd — divide-and-conquer, typically 2–10× faster when all singular vectors are wanted; default for large problems.

Singular values vs eigenvalues. For a general A, eigenvalues of AᵀA are σ_i² and eigenvalues of A·Aᵀ are also σ_i² (with the remaining eigenvalues zero on the larger side). Computing SVD via AᵀA-eigen is numerically poor (κ squares); use the direct Golub-Kahan bidiagonalization. For symmetric A, |λ_i(A)| = σ_i(A), and you can recover sign from the eigenvectors.

Truncated and randomized SVD. For low-rank approximation A ≈ U_k · Σ_k · V_kᵀ (best rank-k approximation by Eckart-Young), randomized methods (Halko, Martinsson, Tropp, SIREV 2011) dominate:

  1. Draw a random Gaussian Ω of size n × (k + p) where p is a small oversampling (typically 5–10).
  2. Form Y = A · Ω (a sketch). One matvec pass over A.
  3. Orthogonalize: Y = Q · R (thin QR).
  4. Form B = Qᵀ · A and compute the SVD of B (small: (k + p) × n).
  5. The approximate SVD is A ≈ Q · U_B · Σ_B · V_Bᵀ.

Cost: O(m·n·k) for dense, O(nnz(A)·k) for sparse. Power iteration (replace Y by (A·Aᵀ)^q · A · Ω) sharpens decay if singular values fall slowly. Standard tool in scipy.sparse.linalg.svds, sklearn.decomposition.TruncatedSVD, and the basis for many tensor methods.

Applications. PCA (the SVD of the centered data matrix gives principal components directly), pseudoinverse A⁺ = V · Σ⁺ · Uᵀ (the minimum-norm least-squares solver), total least squares, image compression, latent semantic indexing, recommender systems (matrix factorization), spectral clustering.

Polar decomposition. A = U_p · H where U_p is orthogonal (or unitary) and H is SPD. Derived trivially from SVD: U_p = U · Vᵀ and H = V · Σ · Vᵀ. Used in continuum mechanics (deformation gradient → rotation × stretch), procrustes alignment, and orthogonalization of nearly-orthogonal matrices. Higham’s iterative Newton-Schulz scheme computes U_p in a handful of matrix multiplies without an SVD — useful inside ML training loops (e.g. orthogonal weight constraints).

Numerical rank. Define rank_ε(A) = #{i : σ_i > ε·σ_1}. Picking ε is the central question: machine-epsilon level (ε ≈ ε_mach) is rarely useful — measurement noise sets the floor. In data analysis, look for a singular-value gap, plot σ_i on a log scale, or use cross-validation (e.g. Gavish-Donoho 2014 optimal hard threshold 4/√3·√n·σ_noise for square noisy matrices).

7. Eigensolvers

The eigenvalue problem A·v = λ·v is fundamentally harder than linear solving — it’s nonlinear in λ and only iterative algorithms converge.

Symmetric / Hermitian case. Real spectrum, orthogonal eigenvectors. Standard algorithm:

  1. Tridiagonalize via Householder reflectors: A → T (symmetric tridiagonal). O((4/3)n³) flops.
  2. Solve the tridiagonal eigenproblem via QR iteration with shifts, divide-and-conquer, or MRRR (multiple relatively robust representations, Dhillon-Parlett 2004 — O(n²) for full spectrum with vectors).

LAPACK routines:

  • dsyev — basic QR-iteration driver.
  • dsyevd — divide-and-conquer (D&C), faster when many or all eigenvectors are wanted.
  • dsyevr — MRRR-based; best when only a subset of eigenpairs is needed (specify index range or value range).
  • dsygv — generalized symmetric A·v = λ·B·v with B SPD.

Nonsymmetric case. Eigenvalues are complex in general; eigenvectors are not orthogonal; the matrix may not even be diagonalizable.

  1. Reduce to upper Hessenberg form H via Householder reflectors.
  2. Apply the implicit QR algorithm with multiple shifts (Francis 1961, plus modern variants) to converge H to (real or complex) Schur form T: A = Q·T·Qᵀ.

LAPACK routines:

  • dgees — Schur form (eigenvalues from the diagonal blocks of T).
  • dgeev — explicit eigenvalues and eigenvectors.
  • dggev — generalized A·v = λ·B·v.

Small / partial eigenproblems for large sparse A. Direct dense factorization is infeasible. Use Krylov subspace methods:

  • Power iteration. v_{k+1} = A·v_k / ‖A·v_k‖. Converges to the dominant eigenvector geometrically with rate |λ_2 / λ_1|. Cheap but slow.
  • Inverse iteration. Apply (A − μ·I)⁻¹ instead — converges to the eigenvalue closest to the shift μ. Each step is a linear solve.
  • Rayleigh quotient iteration. Adaptive shift μ_k = v_kᵀ·A·v_k / v_kᵀ·v_k — cubic convergence for symmetric A.
  • Lanczos (Lanczos 1950). Build a Krylov subspace K_m(A, v_1) and project A onto it. For symmetric A, the projected matrix is tridiagonal — eigenpairs of the small tridiagonal approximate the extremal eigenpairs of A. Suffers from loss of orthogonality in finite precision; needs selective or full reorthogonalization.
  • Arnoldi (Arnoldi 1951). The nonsymmetric analog. Projection is upper Hessenberg.
  • Implicitly Restarted Arnoldi / Lanczos (Sorensen 1992). The basis of ARPACK — restart the Krylov subspace with a polynomial filter to keep memory bounded.

Libraries.

  • ARPACK / ARPACK-NG — the standard sparse eigensolver since 1998; wrapped by scipy.sparse.linalg.eigs / eigsh. Computes a few eigenpairs at one end or near a shift.
  • PRIMME (Stathopoulos, McCombs) — newer, optimized for many eigenpairs, GD+K and JD+K methods.
  • SLEPc (Hernandez et al.) — Scalable Library for Eigenvalue Problem Computations, on top of PETSc; MPI-parallel.
  • Anasazi (part of Trilinos) — block Krylov, block Davidson, LOBPCG.
  • FEAST (Polizzi 2009) — contour integral approach; computes all eigenpairs in a given interval; in Intel MKL.

Jacobi-Davidson and LOBPCG. For interior eigenpairs (those not at the spectrum’s edge), Lanczos/Arnoldi converge slowly. Jacobi-Davidson (Sleijpen-van der Vorst 1996) corrects the current eigenvector by solving an inner correction equation, often with a few GMRES steps and a preconditioner. LOBPCG (Knyazev 2001, Locally Optimal Block Preconditioned CG) — block method for symmetric eigenproblems; combines preconditioning, blocking, and three-term recurrences. Standard in PETSc’s BLOPEX, Trilinos Anasazi, and SciPy scipy.sparse.linalg.lobpcg. Excellent for the lowest k eigenpairs of a discrete Laplacian.

Generalized eigenproblem A·v = λ·B·v. Common in FEM (B = mass matrix), structural vibration. When B is SPD, transform to standard form L⁻¹·A·L⁻ᵀ (with B = L·Lᵀ); when B is singular or indefinite, use the QZ algorithm (LAPACK dggev) which produces a generalized Schur decomposition.

8. Iterative solvers

For very large sparse linear systems A·x = b, direct factorization is intractable (fill-in fills the sparse matrix as L and U accumulate nonzeros). Iterative methods compute successive approximations x_0, x_1, x_2, … converging to x without ever factoring A — they need only A as a “black box” matvec.

The dominant family is Krylov subspace methods: x_k is chosen from x_0 + K_k(A, r_0) where K_k = span{r_0, A·r_0, …, Aᵏ⁻¹·r_0} and r_0 = b − A·x_0.

Conjugate Gradient (CG). Hestenes and Stiefel (1952). Only for SPD A. Minimizes the A-norm of the error over the Krylov subspace; equivalently, minimizes (1/2)·xᵀ·A·x − bᵀ·x. Each iteration: one matvec, two inner products, three AXPYs (vector updates). Memory: 4 vectors. Convergence:

‖x_k − x*‖_A / ‖x_0 − x*‖_A ≤ 2·((√κ − 1) / (√κ + 1))ᵏ

So convergence depends on √κ — half the precision exponent of κ. For κ = 10⁴, you get ~1 digit per 50 iterations without preconditioning. With a good preconditioner, dramatically faster.

The CG iteration in algorithmic form (preconditioned, PCG):

r_0 = b − A·x_0;  z_0 = M⁻¹·r_0;  p_0 = z_0
for k = 0, 1, …:
    α_k = (r_k · z_k) / (p_k · A·p_k)
    x_{k+1} = x_k + α_k·p_k
    r_{k+1} = r_k − α_k·A·p_k
    if ‖r_{k+1}‖ < tol: stop
    z_{k+1} = M⁻¹·r_{k+1}
    β_k = (r_{k+1} · z_{k+1}) / (r_k · z_k)
    p_{k+1} = z_{k+1} + β_k·p_k

One matvec with A, one preconditioner apply, two dot products per iteration. The dot products are the global-communication bottleneck on distributed systems; pipelined CG (Ghysels-Vanroose 2014) overlaps communication with computation by reordering the recurrences.

MINRES. Paige and Saunders (1975). Symmetric indefinite (e.g. KKT systems). Minimizes the 2-norm of the residual over the Krylov subspace. Memory: 5 vectors.

SYMMLQ. Same setting as MINRES, minimizes a different error norm; preferred in some KKT contexts.

GMRES. Saad and Schultz (1986). General nonsymmetric. Builds an orthonormal Krylov basis via Arnoldi and minimizes ‖r_k‖_2. Memory grows linearly with k (stores the full basis), so restarted GMRES(m) truncates every m iterations — common m = 20 to 100. Drawback: no monotonic convergence guarantee in the restarted form; can stagnate.

GMRES per iteration costs one matvec, one preconditioner apply, k+1 dot products (orthogonalization against all previous basis vectors), and k+1 AXPYs. The cost-per-iteration grows linearly with iteration count; an upper bound on m balances memory and cost-per-iteration against stagnation risk. A reasonable rule: m around 30–50 for moderately preconditioned PDE problems, m around 100+ for poorly preconditioned ones.

BiCG, BiCGSTAB. Two-sided Krylov for nonsymmetric A. BiCG (Fletcher 1976) needs Aᵀ matvecs. BiCGSTAB (van der Vorst 1992) avoids them and has smoother convergence than BiCG. Each iteration: two matvecs, four inner products. Memory: ~7 vectors.

CGS (Sonneveld 1989). Squared CG — avoids Aᵀ but convergence can be erratic. BiCGSTAB largely supersedes it.

IDR(s) (Sonneveld, van Gijzen 2008). Induced Dimension Reduction — competitive with BiCGSTAB for difficult nonsymmetric problems; finite termination in n + n/s matvecs.

QMR (Freund, Nachtigal 1991). Quasi-Minimal Residual for nonsymmetric — handles BiCG breakdowns.

Multigrid. Not a Krylov method but the optimal solver for elliptic PDEs. Idea: smooth out high-frequency error on the fine grid (a few Jacobi or Gauss-Seidel sweeps), restrict the residual to a coarser grid, solve there recursively, prolong the correction back, smooth again. With V-cycle, W-cycle, or full multigrid (FMG), the cost is O(n) and convergence is independent of n. Geometric multigrid uses the actual grid; algebraic multigrid (AMG) builds the coarse hierarchy from the matrix graph itself — works on unstructured problems.

A single V-cycle on an N-unknown 2D Poisson problem with 3 pre- and post-smoothing sweeps costs roughly 30·N flops and reduces the error by a factor of 0.1; ten cycles suffice for FP64 accuracy. The total work is ~300·N flops — independent of N, which is the punchline of multigrid theory (Brandt 1977, Hackbusch 1985). No competing iterative scheme attains this asymptotic bound for general elliptic problems.

Preconditioners. Replace A·x = b with M⁻¹·A·x = M⁻¹·b (or right/split variants) where M approximates A and M⁻¹·v is cheap. The condition number of M⁻¹·A drives convergence.

  • Jacobi — M = diag(A). Trivial; only helps a little.
  • Gauss-Seidel / SSOR — M from triangular parts. Symmetric SOR is symmetric for use with CG.
  • Incomplete LU (ILU) — compute approximate L̃, Ũ keeping only some fill (ILU(0): no fill outside the sparsity of A; ILU(k): allow fill up to level k; ILUT: drop-tolerance variant). Standard for nonsymmetric.
  • Incomplete Cholesky (IC) — same idea for SPD.
  • Algebraic Multigrid (AMG). Build a multigrid hierarchy purely from A. The two big production AMG libraries:
    • Hypre / BoomerAMG (Lawrence Livermore, Falgout-Yang 2002). Classical Ruge-Stüben AMG, scalable to millions of cores.
    • Trilinos / ML (Sandia, now MueLu). Smoothed aggregation AMG.
    • AMGX (NVIDIA). GPU-accelerated AMG.
  • Domain decomposition. Additive Schwarz, restricted additive Schwarz, BDDC, FETI — partition the domain, solve subproblems, glue together. Naturally parallel.
  • Approximate inverse (SPAI, FSAI). Compute M⁻¹ ≈ A⁻¹ directly with a sparsity pattern. Fully parallel in application.

Convergence example. Consider the 2D Poisson equation on an n × n grid with the standard 5-point stencil. The condition number κ(A) ≈ (n/π)² grows quadratically. Unpreconditioned CG needs O(n) iterations for fixed accuracy. With ILU(0), iterations drop to O(n^{1/2}). With AMG as a preconditioner, iterations become O(1) — independent of n. That is the practical reason AMG dominates large-scale elliptic solvers: it is the only family whose total cost is genuinely O(N) in the number of unknowns N.

Flexible / inner-outer Krylov. When the preconditioner is itself iterative (e.g. AMG inner solve, varying with each call), classical GMRES loses its convergence guarantee. FGMRES (Saad 1993) stores the preconditioned vectors explicitly to handle a varying M⁻¹. Doubles memory but enables nesting of Krylov methods, e.g. outer GMRES with inner CG on each subdomain.

9. Sparse matrix formats

Storage layout determines bandwidth, cache behavior, and parallel scaling of SpMV (sparse matrix-vector product), which is the inner loop of every iterative method.

COO (coordinate). Three arrays: row[k], col[k], val[k] for each of the nnz nonzeros. Easy to construct (append triples), trivial to read. Bad for SpMV — non-sequential memory access. Use only for assembly and conversion.

CSR (compressed sparse row). Three arrays:

  • val[nnz] — nonzero values, row by row.
  • col_idx[nnz] — column index of each value.
  • row_ptr[n+1]row_ptr[i] is the index in val where row i starts.

Row-i nonzeros are contiguous; SpMV streams through val and col_idx once. The de facto default in SciPy (csr_matrix), Eigen sparse, PETSc, Trilinos, cuSPARSE. Cost: 12·nnz + 4n bytes (FP64 + INT32).

CSC (compressed sparse column). Same as CSR but column-major. Natural for sparse direct solvers (column-oriented factorization in CHOLMOD, UMFPACK) and for Aᵀ·x.

DIA (diagonal). Store each nonzero diagonal as a dense vector with an offset. Excellent for banded matrices from structured grids (5-point or 7-point stencils). SpMV is embarrassingly vectorizable.

ELL (ELLPACK / ITPACK). Two dense 2D arrays of shape (n × max_nnz_per_row): values and column indices, padded with zeros. SpMV maps perfectly to GPU warps. Wasteful if row densities vary widely.

Sliced ELL (SELL-C-σ). ELL split into row-slices of size C, each padded to its own max. Combines GPU-friendliness with bounded waste. Used in NVIDIA cuSPARSE’s optimized SpMV.

HYB. Hybrid ELL + COO — store the regular bulk as ELL, the long-tail rows in COO. Original cuSPARSE default for highly irregular sparsity.

BCSR / BSR (block CSR). Group nonzeros into r × r dense blocks. Cuts col_idx overhead by r² and exposes BLAS-3 inside SpMV. Beneficial when matrix arises from vector-valued PDEs (Stokes, elasticity) where blocks are naturally dense.

CSR5, BCCOO, etc. Research formats tuned for specific architectures.

The pragmatic rule: use CSR by default, switch to BSR if the matrix has natural block structure, switch to ELL / SELL on GPUs for regular sparsity.

SpMV roofline. SpMV is almost always memory-bandwidth-bound. Flops per nonzero are 2 (multiply + add); bytes moved per nonzero are 12 (8 FP64 value + 4 INT32 column index) plus amortized row-pointer and input/output vector traffic. Operational intensity ≈ 0.167 flops/byte. On a CPU with 100 GB/s sustained DRAM bandwidth, peak SpMV is ~17 GFLOPS — far below the FPU peak. The same code on an H100 with 3 TB/s HBM3 hits ~500 GFLOPS. Implication: SpMV cannot benefit from Tensor Cores or aggressive FMA stacking; what matters is contiguous memory access (CSR row sweep), prefetching, and reducing index width (INT16 for short rows, INT64 only when needed).

10. Direct sparse solvers

Direct methods factor A symbolically and numerically, then solve. They handle ill-conditioning gracefully and are robust for moderate-size problems (up to ~10⁶ unknowns in 2D, ~10⁵ in 3D, depending on bandwidth).

The fill-in problem. When you factor a sparse matrix, the L and U factors typically have far more nonzeros than A. The pattern of fill-in depends on the elimination order: a good ordering (small fill-in) can be the difference between feasibility and out-of-memory.

Reordering algorithms.

  • AMD / COLAMD (Davis et al.) — Approximate Minimum Degree. Fast, good for symmetric and unsymmetric.
  • METIS (Karypis, Kumar 1998). Nested dissection: recursively partition the matrix graph and order separators last. Optimal asymptotic fill for many PDE meshes.
  • SCOTCH (Pellegrini). Similar to METIS, GPL-licensed alternative.
  • RCM (Reverse Cuthill-McKee). Bandwidth reduction; simple, often inferior to ND/MD on big problems.
  • ParMETIS / PT-SCOTCH — MPI-parallel versions for distributed meshes.

Symbolic vs numeric factorization. Sparse direct solvers split into:

  1. Analyze (symbolic). Compute the elimination tree, fill-reducing permutation, and supernodal structure. Depends only on the sparsity pattern of A. Done once, can be reused for multiple right-hand sides or matrices with identical pattern.
  2. Factor (numeric). Compute the actual L and U values. Reused across different RHS.
  3. Solve. Forward + back substitution.

This separation is essential for nonlinear PDE solvers (Newton iterations reuse symbolic factorization across linear steps) and parameter sweeps.

Sparse LU libraries.

  • SuperLU (Li, Demmel et al.). Multi-frontal LU; SuperLU_DIST is the MPI version.
  • MUMPS (Amestoy et al.). Multifrontal, MPI-parallel, widely used in FEM codes.
  • UMFPACK / SuiteSparse (Davis). Unsymmetric multifrontal; the backend of MATLAB’s \, scipy.sparse.linalg.spsolve.
  • Pardiso (Schenk, Gärtner — original; Intel MKL ships a derivative). Parallel direct solver; very fast for moderate-size symmetric indefinite KKT systems.
  • KLU (Davis, Natarajan). Optimized for circuit-simulation matrices (block triangular form).

Sparse Cholesky.

  • CHOLMOD (Davis 2005, part of SuiteSparse). The gold standard for sparse SPD; supernodal Cholesky, AMD/METIS ordering. Backend for MATLAB chol, R Matrix::Cholesky, scikit-sparse.
  • TAUCS (Toledo). Earlier supernodal Cholesky.

Sparse QR.

  • SuiteSparseQR (Davis 2011). Multifrontal sparse QR with column pivoting; rank-revealing.

When to choose direct vs iterative.

PropertyDirect sparseIterative (Krylov + preconditioner)
MemoryHigh (fill-in)Low (few vectors)
Predictable runtimeYesNo (depends on convergence)
Multiple RHS with same AExcellent (reuse factor)Fair (need block Krylov or Sylvester)
Sensitivity to conditioningLowHigh (needs good preconditioner)
Parallel scalabilityHard at extreme scaleGood (with right preconditioner)
2D problems up to ~10⁶ DOFOften bestCompetitive
3D problems > 10⁵ DOFOften infeasible memory-wiseMandatory

The decision boundary moves with hardware: more RAM and faster cores push direct further; growing problem sizes and exascale push iterative further. PETSc, Trilinos, and Firedrake / FEniCS let you switch between them with a parameter change.

11. BLAS, LAPACK, and the ecosystem

BLAS (Basic Linear Algebra Subprograms) is the standardized API (Fortran 1979, three levels):

  • Level 1. Vector-vector ops: daxpy (y ← α·x + y), ddot, dnrm2, dscal. O(n) flops, O(n) memory traffic → flop/byte ≈ 0.25. Memory-bound; little optimization headroom.
  • Level 2. Matrix-vector ops: dgemv (y ← α·A·x + β·y), dtrsv (triangular solve), dsymv. O(n²) flops, O(n²) memory → flop/byte ≈ 0.25. Still memory-bound.
  • Level 3. Matrix-matrix ops: dgemm (C ← α·A·B + β·C), dtrsm, dsyrk. O(n³) flops on O(n²) memory → flop/byte → ∞ as n grows. Compute-bound; this is where the speedups are. Every modern algorithm tries to recast itself in terms of BLAS-3 for this reason.

LAPACK (Linear Algebra PACKage, Anderson et al., first release 1992) sits on top of BLAS and provides the high-level routines: factorizations (LU, QR, Cholesky, SVD, Schur), eigensolvers, linear and least-squares solvers. It is written assuming BLAS-3 is fast — block-oriented throughout.

BLAS implementations.

  • Reference BLAS (Netlib). The portable Fortran source; correct but slow.
  • OpenBLAS (descendant of GotoBLAS, Goto-Geijn). Hand-tuned assembly kernels for most x86/ARM/Power architectures; open-source default on Linux distros.
  • BLIS (Van Zee, van de Geijn). Modern modular BLAS framework; the basis of AMD’s AOCL BLAS.
  • Intel MKL (now oneAPI Math Kernel Library). Highly optimized for Intel and AMD x86; ships threaded dgemm, dgemv, sparse BLAS, Pardiso, FFT. The performance reference.
  • Apple Accelerate. Apple Silicon: NEON + AMX matrix coprocessor; macOS/iOS default.
  • ARM Performance Libraries (ArmPL). ARM-tuned BLAS, LAPACK, FFTW, SVE-aware.
  • IBM ESSL. Power and AIX.
  • Cray LibSci. Cray supercomputers.
  • ATLAS (Automatically Tuned Linear Algebra Software, Whaley). Older; auto-tunes at install time. Largely supplanted by OpenBLAS / BLIS.
  • NVPL (NVIDIA Performance Libraries) — NVIDIA’s CPU BLAS for Grace ARM CPUs.
  • FlexiBLAS — runtime BLAS switcher; lets the user choose MKL, OpenBLAS, or BLIS without relinking. Standard on Fedora and RHEL.

Inside dgemm. The standard high-performance GEMM is structured as three nested levels of blocking — outer block (fits in L3), middle block (fits in L2), inner kernel (fits in registers + L1). The innermost kernel is an assembly-coded micro-kernel computing a small mc × nc panel update (typically 4 × 12, 8 × 8, or 6 × 16 depending on register count). On Skylake-X with AVX-512, this micro-kernel achieves ~90% of peak. The Goto-van de Geijn analysis (TOMS 2008) shows this packing-and-blocking structure is essentially optimal for the memory hierarchy.

GPU libraries.

  • cuBLAS — NVIDIA’s BLAS (dense). Includes batched GEMM (cublasDgemmBatched) and Tensor-Core paths (cublasGemmEx with FP16/BF16/TF32/INT8).
  • cuSPARSE — sparse BLAS and SpMV; recent versions include the unified cusparseSpMV interface.
  • cuSOLVER — dense and sparse solvers (LU, QR, Cholesky, SVD, sparse direct via SuperLU-like algorithms).
  • cuTENSOR — tensor contractions (used inside cuQuantum, Einstein-summation libraries).
  • CUTLASS — NVIDIA’s open-source GEMM kernel framework; lets you compose Tensor-Core matmuls for custom shapes / fusions. The base for many transformer kernels.
  • rocBLAS / hipBLAS / rocSPARSE / rocSOLVER — AMD’s equivalents for ROCm (MI200, MI300).
  • MAGMA (Tomov, Dongarra). Multi-GPU dense LAPACK with hybrid CPU+GPU algorithms.

Distributed memory.

  • ScaLAPACK (1996) — block-cyclic distribution, MPI; the classical distributed dense library.
  • Elemental (Poulson). Modern C++ alternative to ScaLAPACK; cleaner element-wise distribution.
  • SLATE (Software for Linear Algebra Targeting Exascale; ICL Tennessee). Replacement for ScaLAPACK for exascale machines (Frontier, Aurora); tiled algorithms with explicit task-DAG scheduling.

Threading model traps. OpenBLAS, MKL, and BLIS each thread internally with their own pool (OpenMP, TBB, or native). Calling threaded BLAS from inside an OpenMP-parallel region causes oversubscription — n threads × n threads = n² runnable tasks competing for cores; performance collapses. Fixes: set OMP_NUM_THREADS=1 inside parallel regions, use mkl_set_num_threads_local, or compile against a sequential BLAS. NumPy with threaded MKL inside a multiprocessing pool is a classic offender; threadpoolctl is the standard mitigation.

Batched GEMM. Modern workloads (transformer multi-head attention, batched ODE solvers) need thousands of small GEMMs simultaneously. cuBLAS cublasDgemmBatched and cublasDgemmStridedBatched, MKL dgemm_batch, and the BLAS Extensions standard (Demmel et al.) handle this directly — far more efficient than a Python loop over individual GEMMs.

12. Tensor cores and low-precision GEMM

Starting with the NVIDIA V100 (Volta, 2017), GPUs include Tensor Cores — dedicated matrix-multiply units that perform fused FP16-input / FP32-accumulate matmul on small tiles (originally 4×4×4, growing through Ampere/Hopper). Peak throughput vastly exceeds traditional CUDA-core FP32:

  • V100: ~125 TFLOPS Tensor Core (FP16-in/FP32-acc) vs ~15.7 TFLOPS FP32 CUDA cores.
  • A100: 312 TFLOPS BF16, 624 TFLOPS sparse (2:4 structured), 19.5 TFLOPS FP64.
  • H100: 990 TFLOPS BF16, 1979 TFLOPS FP8, 67 TFLOPS FP64 — and the Hopper Transformer Engine auto-manages FP8 scaling for training.
  • B100/B200 (Blackwell, 2024): adds FP4 paths, > 10 PFLOPS for inference.

Precisions.

  • FP16 / BF16 — training and inference. BF16 dominates for training because of the wide exponent.
  • TF32 — Ampere’s “free” upgrade: pass FP32 inputs, get Tensor Core speed at ~10-bit mantissa precision. Default for cublasGemmEx when compute_type = CUBLAS_COMPUTE_32F_FAST_TF32. Enough for most ML training; not enough for scientific FP32 work.
  • INT8 — quantized inference.
  • FP8 (E4M3, E5M2) — Hopper and Blackwell; ML inference + training.
  • FP4 (E2M1) — Blackwell; ultra-low-precision inference with per-tensor scaling.

Mixed-precision iterative refinement. A classical numerical technique enjoying a renaissance on Tensor Cores:

  1. Factor A in FP16 / BF16 / TF32 — cheap.
  2. Compute the residual r = b − A·x in FP64.
  3. Solve A·d = r using the low-precision factors.
  4. Update x ← x + d; iterate.

Carson, Higham, Pranesh (2018+) analyzed this rigorously. You can recover FP64 accuracy from FP16 factors as long as κ(A)·ε_16 < 1. The HPL-MxP / HPL-AI benchmark (which now leads the Top500-style list) measures exactly this: a mixed-precision linear solve verified in FP64.

Practical recipe (cuSOLVER cusolverDnIRSXgesv). Pass an FP64 A and b; the solver factors A in FP32 or FP16, then performs iterative refinement in FP64 until convergence. On H100, an FP16-factor / FP64-refine path runs roughly 3–4× faster than a pure FP64 cusolverDnDgesv on well-conditioned dense systems, with bit-identical FP64 output. The break-even κ for this path is around 10⁶; beyond that you need a richer refinement (GMRES-IR with the low-precision factor as preconditioner — Higham-Mary 2022).

Sparsity in Tensor Cores. Ampere added structured 2:4 sparsity: in every group of four contiguous weights, two must be zero. Hardware accelerates GEMM on such matrices to 2× dense throughput. Training procedures (NVIDIA’s APEX ASP) prune weights to the 2:4 pattern with minimal accuracy loss. This is structured pruning at the hardware level, not the unstructured CSR-style sparsity scientific computing uses.

13. Python ecosystem

  • NumPy (numpy.linalg) — wraps a BLAS (OpenBLAS by default, MKL in Intel distributions, Accelerate on macOS). np.linalg.solve calls dgesv, np.linalg.lstsq uses dgelsd (D&C SVD), np.linalg.eig uses dgeev.
  • SciPyscipy.linalg exposes more LAPACK (banded, tridiagonal, generalized eigen) with finer control (e.g. lapack_driver kwarg). scipy.sparse.linalg provides spsolve (UMFPACK/SuperLU), eigs/eigsh (ARPACK), svds, and iterative solvers cg, cgs, bicgstab, gmres, qmr, lgmres, gcrotmk, minres, tfqmr, plus the LinearOperator abstraction for matrix-free A.
  • JAXjax.numpy.linalg and jax.scipy.linalg with JIT compilation, XLA backend, and automatic differentiation. Runs on CPU, GPU, TPU. Good for differentiable linear algebra inside ML.
  • PyTorchtorch.linalg (re-engineered in 1.8+, mirroring NumPy/SciPy naming). Differentiable, GPU-native via cuBLAS/cuSOLVER; supports batched dense ops.
  • CuPy — drop-in numpy and scipy.sparse replacement that runs on cuBLAS / cuSOLVER / cuSPARSE. Easiest path from CPU prototype to GPU.
  • scikit-sparse — Python bindings to CHOLMOD, AMD, etc.
  • PETSc / petsc4py — distributed sparse linear algebra via MPI; the standard scientific computing toolkit for large PDE solvers.
  • Trilinos / pyTrilinos — Sandia’s parallel linear-algebra framework; Tpetra, Belos (Krylov), Ifpack2 (preconditioners), Anasazi (eigensolvers).
  • Eigen — C++ template library (header-only). Heavily used in robotics (ROS, Ceres), control, computer vision (OpenCV cv::Mat interop). Dense + sparse, with built-in solvers; can call MKL as a backend.
  • Armadillo / Bandicoot — C++ wrappers with MATLAB-like syntax; CPU and GPU (Bandicoot) targets.
  • Faiss (Meta) — though primarily a vector-similarity library, it exploits BLAS for batched dot products and quantized inner products at the scale of billions of vectors.
  • TensorLy — tensor decompositions (CP, Tucker, tensor train) on top of NumPy / PyTorch / JAX / CuPy backends.

14. Pitfalls

  • Solving the normal equations Aᵀ·A·x = Aᵀ·b. Squares the condition number: κ(AᵀA) = κ(A)². If κ(A) = 10⁸ you can already lose half your digits; via normal equations you’ve lost them all. Use QR or SVD for least squares, never the normal equations unless A is exceedingly well-conditioned.
  • Inverting a matrix to solve a system. x = inv(A) @ b is 3× the flops, far less stable, and pointless. Always use solve(A, b) (which calls LU/QR/Cholesky as appropriate).
  • Storing dense when sparse would work. A 10⁶ × 10⁶ matrix is 8 TB in dense FP64 — out of reach. If it has 10⁷ nonzeros, CSR is ~120 MB. Always check sparsity before allocating.
  • Skipping reordering for sparse factorization. Without AMD / METIS, fill-in can be 100× worse. SciPy and MATLAB do reorder by default; if you wire LAPACK directly, you must.
  • Silent downcasting. Mixing float32 and float64 arrays in NumPy yields float64, but mixing float32 weights with float16 activations in PyTorch can silently keep the lower precision. Audit dtypes in mixed-precision code.
  • Loss of orthogonality in Gram-Schmidt. CGS is dangerous. Use MGS at minimum, Householder for full QR. Inside Arnoldi, reorthogonalize (twice-is-enough rule, Giraud-Langou) when the new vector’s norm shrinks by more than ~1/√2 during projection.
  • Restarted GMRES stagnation. GMRES(m) can stall on hard problems. Switch to LGMRES (loose GMRES), GCROT, or a better preconditioner; do not just increase m blindly.
  • Forgetting that CG requires SPD. Apply CG to an indefinite matrix and it can break down or return garbage silently. Use MINRES for symmetric indefinite.
  • Matrix-vector vs vector-matrix layout confusion. Row-major (C / NumPy) vs column-major (Fortran / BLAS / MATLAB) is a constant source of unnecessary transposes. NumPy’s np.asfortranarray is sometimes worth a one-time cost before a long LAPACK pipeline.
  • Trusting tiny eigenvalues from eig on nonsymmetric A. Pseudospectra (Trefethen, Embree 2005) often show eigenvalues are wildly unstable to perturbation; rely on Schur form or SVD for sensitive analysis.
  • Using a dense triangular solve in a hot loop. np.linalg.solve(L, x) on a triangular L is correct but spends time checking; call scipy.linalg.solve_triangular(L, x, lower=True) or LAPACK dtrtrs directly. Same for cho_solve, lu_solve — reuse the factor.
  • Ignoring the BLAS your build is linked against. np.show_config() reveals whether NumPy uses MKL, OpenBLAS, or reference BLAS — a 10× factor on dense workloads. On Windows, NumPy from PyPI ships with OpenBLAS; the Intel numpy from conda-forge with MKL is markedly faster for dgemm-heavy work.
  • Allocating in the inner loop. r = b - A @ x allocates a new vector every CG iteration. Preallocate and use np.subtract(b, A @ x, out=r) (or write the kernel in Numba / C). On large problems, allocation can dominate the iteration cost.
  • Confusing condition number with sensitivity to the problem. κ(A) bounds worst-case sensitivity of solving A·x = b. The condition number of the eigenvalue problem is different (sep, gaps), and of the SVD different again. Use the right tool: dgecon for linear-solve conditioning, dtrsen for eigenvalue sensitivity.

15. Common applications

  • FEM / FEA (fem-fea). Stiffness matrices are large sparse SPD. Standard solve: CG preconditioned with AMG (BoomerAMG) or incomplete Cholesky. Direct: MUMPS for moderate size, especially nonlinear problems where factor reuse pays off.
  • CFD (cfd-deep). Pressure Poisson is SPD (CG + AMG). Momentum equations are nonsymmetric (GMRES + ILU or BiCGSTAB). Coupled Navier-Stokes: block preconditioners (SIMPLE, PCD, LSC) on the saddle-point KKT system.
  • Recommender systems. Matrix factorization R ≈ U·Vᵀ with users × items rating matrix. Solved by alternating least squares (closed-form ridge regression in each subproblem, hence Cholesky), or SGD. Implicit-feedback variant (Hu, Koren, Volinsky 2008) uses weighted ALS; Spark MLlib’s ALS is the canonical implementation.
  • PCA. Centered data X (n × p). Principal components are right singular vectors of X (or eigenvectors of XᵀX / (n−1) covariance). For sparse / large p, use randomized SVD or scipy.sparse.linalg.svds on X.
  • Optimization. Newton steps require solving H·Δx = −∇f. SPD H (convex): Cholesky. Indefinite H: LDLᵀ or trust-region modification (add λ·I). Quasi-Newton (BFGS, L-BFGS) avoids forming H. Interior-point methods solve a sequence of KKT systems — symmetric indefinite, ill-conditioned near the boundary; need careful regularization and iterative refinement.
  • Kalman filters. State-covariance updates P_{k|k} = (I − K·H)·P_{k|k−1} can lose symmetry in finite precision; use Joseph form or factored (square-root) filter — update Cholesky factor or U-D factors directly. Standard in aerospace navigation since Bierman (1977).
  • Transformer attention. Q · Kᵀ is a dense (sequence × sequence) GEMM at scale; softmax then weights V via another dense GEMM. FlashAttention (Dao et al. 2022) reformulates this as tiled blockwise GEMMs that stay in SRAM, drastically reducing HBM traffic. The entire transformer forward pass is a sequence of dense GEMMs (cuBLAS / CUTLASS) plus normalization and activation kernels. See transformer-architecture.
  • Graph problems. Laplacian linear systems: solved by Koutis-Miller-Peng style nearly-linear-time SDD solvers in theory; in practice by CG + multigrid or specialized solvers (CMG, LAMG).
  • Inverse problems / regularization. Tikhonov: min ‖A·x − b‖² + λ²·‖x‖² → augmented least squares (Aᵀ·A + λ²·I)·x = Aᵀ·b. Solve by QR on [A; λ·I] or by truncated SVD with a regularization filter.
  • Quantum chemistry / DFT. Self-consistent field iterations need a generalized symmetric eigenproblem F·C = S·C·ε at each step, with F (Fock) and S (overlap) dense and moderately sized; LAPACK dsygv is standard. For periodic / plane-wave DFT (VASP, Quantum ESPRESSO), iterative subspace methods (Davidson, RMM-DIIS, LOBPCG) dominate.
  • Markov chain steady-state. Find π with π·P = π, π·1 = 1. Power iteration on Pᵀ works for irreducible chains; for ill-conditioned (near-reducible) chains use GMRES on (I − Pᵀ + 1·1ᵀ/n)·x = 1/n (Grassmann-Taksar-Heyman variant for numerical stability).
  • Control theory. Lyapunov equation A·X + X·Aᵀ + Q = 0 — solved by Bartels-Stewart algorithm (Schur form + back-substitution, O(n³)). Sylvester equation A·X + X·B = C — similar. Riccati equation X·A + Aᵀ·X − X·B·R⁻¹·Bᵀ·X + Q = 0 — solved via the Hamiltonian Schur form or matrix sign function. All in SLICOT.

16. Cross-references

17. Citations

  • Anderson, E. et al. “LAPACK Users’ Guide” 3rd ed, SIAM, 1999 (the LAPACK reference; first release 1992).
  • Davis, T. A. “Direct Methods for Sparse Linear Systems.” SIAM, 2006. The reference for sparse direct solvers; CHOLMOD/UMFPACK author.
  • Demmel, J. W. “Applied Numerical Linear Algebra.” SIAM, 1997.
  • Golub, G. H., Van Loan, C. F. “Matrix Computations.” 4th ed, Johns Hopkins, 2013. The encyclopedic treatise.
  • Halko, N., Martinsson, P. G., Tropp, J. A. “Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.” SIAM Review 53(2), 2011. The randomized SVD reference.
  • Hestenes, M. R., Stiefel, E. “Methods of Conjugate Gradients for Solving Linear Systems.” J. Res. NBS 49(6), 1952. The original CG paper.
  • Higham, N. J. “Accuracy and Stability of Numerical Algorithms.” 2nd ed, SIAM, 2002. Authoritative on floating-point error analysis.
  • IEEE Std 754-2019, “IEEE Standard for Floating-Point Arithmetic.”
  • Saad, Y. “Iterative Methods for Sparse Linear Systems.” 2nd ed, SIAM, 2003. The reference for Krylov methods and preconditioners.
  • Saad, Y., Schultz, M. H. “GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems.” SIAM J. Sci. Stat. Comput. 7(3), 1986.
  • Trefethen, L. N., Bau, D. “Numerical Linear Algebra.” SIAM, 1997. The clearest pedagogical text.
  • Trefethen, L. N., Embree, M. “Spectra and Pseudospectra.” Princeton, 2005.
  • van der Vorst, H. A. “Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems.” SIAM J. Sci. Stat. Comput. 13(2), 1992.
  • Goto, K., van de Geijn, R. A. “Anatomy of High-Performance Matrix Multiplication.” ACM TOMS 34(3), 2008.
  • Demmel, J., Grigori, L., Hoemmen, M., Langou, J. “Communication-Optimal Parallel and Sequential QR and LU Factorizations.” SIAM J. Sci. Comput. 34(1), 2012.
  • Brandt, A. “Multi-Level Adaptive Solutions to Boundary-Value Problems.” Math. Comp. 31, 1977.
  • Hackbusch, W. “Multi-Grid Methods and Applications.” Springer, 1985.
  • Sorensen, D. C. “Implicit Application of Polynomial Filters in a k-Step Arnoldi Method.” SIAM J. Matrix Anal. Appl. 13(1), 1992. ARPACK foundation.
  • Higham, N. J., Mary, T. “Mixed Precision Algorithms in Numerical Linear Algebra.” Acta Numerica 31, 2022.
  • Gavish, M., Donoho, D. L. “The Optimal Hard Threshold for Singular Values is 4/√3.” IEEE Trans. Inform. Theory 60(8), 2014.