Tensor Calculus — Math Reference

1. At a glance

The word “tensor” carries two distinct meanings in modern technical use, and conflating them is a common source of confusion:

Abstract tensor (physics / differential geometry view) — a multilinear map between vector spaces (or, equivalently, a multi-indexed quantity whose components transform under change of basis according to fixed covariance/contravariance rules). This is the sense used by Ricci-Curbastro and Levi-Civita (1900) when they introduced “absolute differential calculus,” and by Einstein (1916) in the General Theory of Relativity.
Tensor as multi-dimensional array (numerical / machine learning view) — an n-dimensional array of numbers, as used in NumPy, PyTorch, TensorFlow, and JAX. No coordinate-transformation behavior is implied; “tensor” is just a generalization of scalar (0-d), vector (1-d), matrix (2-d) to higher dimensions.

Both senses share the same index notation, the same algebraic operations (addition, tensor product, contraction), and the same einsum syntax. The distinction matters when you need to change coordinate systems — physical tensors carry rules for how their components must transform; numerical tensors do not.

This note covers both perspectives, the algebra and calculus that link them, and the practical software/hardware that runs tensor computations at scale.

2. Abstract tensor (physics view)

Let V be an n-dimensional vector space over ℝ and V* its dual (covectors / linear forms). A tensor of type (r, s) on V is a multilinear map

T : V* × … × V* × V × … × V → ℝ (r copies of V*, then s copies of V)

that takes r covectors and s vectors and returns a scalar, linear in each argument.

The set of all (r, s) tensors forms a vector space of dimension n^(r+s), denoted T^r_s(V).

A tensor is a coordinate-free object. Once you pick a basis {e_i} for V and the dual basis {e^j} for V* (defined by e^j(e_i) = δ^j_i), the tensor acquires components:

T^{i_1 … i_r}_{j_1 … j_s} = T(e^{i_1}, …, e^{i_r}, e_{j_1}, …, e_{j_s})

If the basis changes by a matrix M (so e’_i = M^j_i · e_j), the components transform by a corresponding combination of M and M^{-1}: one factor of M^{-1} for each upper index and one factor of M for each lower index. This transformation law is what makes a quantity a tensor: the tensor itself is invariant under a change of basis even though its components are not.

3. Rank and order

Terminology varies between communities; this note uses:

Order (also called valence or, confusingly, rank) = total number of indices = r + s.
Type = the pair (r, s) — r upper indices (contravariant), s lower indices (covariant).

Order	Type examples	Concrete examples
0	(0, 0)	scalar — temperature, mass
1	(1, 0) or (0, 1)	vector v^i (velocity), covector ω_i (gradient of a scalar field)
2	(2, 0), (1, 1), (0, 2)	metric g_{ij}, linear map A^i_j, bilinear form B_{ij}
3	various	Levi-Civita ε_{ijk}, structure constants C^k_{ij}
4	various	Riemann curvature R^d_{abc}, elasticity tensor C_{ijkl}

Beware: in linear algebra “rank” means the dimension of the column space of a matrix (number of linearly independent rows/columns). In ML “rank” of a tensor means the number of dimensions (axes). These are different concepts. When in doubt, say “order” or “number of axes.”

4. Covariant vs contravariant

A vector v ∈ V has components v^i with respect to a basis. Under a change of basis e’_i = M^j_i · e_j, the components transform by the inverse matrix:

v’^i = (M^{-1})^i_j · v^j

The components transform opposite (“contra”) to the basis — hence contravariant, indicated by an upper index.

A covector ω ∈ V* (a linear form V → ℝ) has components ω_i transforming the same way as the basis:

ω’_i = M^j_i · ω_j

This is covariant, indicated by a lower index.

Mixed tensors have both kinds of indices, each transforming according to its position. A type (1, 1) tensor A^i_j (a linear map V → V) transforms as A’^i_j = (M^{-1})^i_k · M^l_j · A^k_l.
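The transformation rules above can be checked numerically. The following is a minimal NumPy sketch, assuming the change-of-basis matrix is stored as M[j, i] = M^j_i (upper index as row, lower index as column); the particular matrix and component values are illustrative choices, not taken from any source.

```python
import numpy as np

# Change-of-basis matrix, stored as M[j, i] = M^j_i so that e'_i = M^j_i e_j.
# This particular matrix has determinant 7, so it is invertible.
M = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
M_inv = np.linalg.inv(M)

v = np.array([1.0, -2.0, 0.5])             # contravariant components v^i
w = np.array([3.0, 0.0, -1.0])             # covariant components w_i
A = np.arange(9.0).reshape(3, 3)           # mixed components A^i_j (a linear map)

v_new = np.einsum("ij,j->i", M_inv, v)             # v'^i = (M^-1)^i_j v^j
w_new = np.einsum("ji,j->i", M, w)                 # w'_i = M^j_i w_j
A_new = np.einsum("ik,lj,kl->ij", M_inv, M, A)     # A'^i_j = (M^-1)^i_k M^l_j A^k_l

# Basis-independent quantities, computed from old and new components.
pairing_old, pairing_new = w @ v, w_new @ v_new    # w_i v^i
trace_old, trace_new = np.trace(A), np.trace(A_new)
```

The components change, but the pairing w_i v^i and the trace A^i_i agree before and after, which is the sense in which the underlying objects are basis-independent.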

Metric tensor and index gymnastics. A symmetric, non-degenerate (0, 2) tensor g_{ij} (the metric) and its inverse g^{ij} (where g^{ik}·g_{kj} = δ^i_j) let you convert between covariant and contravariant indices:

v_i = g_{ij} · v^j (lower an index)
v^i = g^{ij} · v_j (raise an index)
A^{ij} = g^{ik} · A_k^j (raise an index on a mixed tensor)

In ℝ^n with the standard Euclidean inner product and Cartesian coordinates, g_{ij} = δ_{ij}, so v_i = v^i numerically. This is why introductory linear algebra often ignores the distinction. The distinction becomes essential in curvilinear coordinates (spherical, cylindrical), on curved manifolds (general relativity), or when working with non-orthonormal bases.
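As a small illustration of raising and lowering in a non-Cartesian setting, the sketch below assumes the polar-coordinate metric g = diag(1, r²) on ℝ^2 evaluated at one point; the point r = 2 and the vector components are arbitrary example values.

```python
import numpy as np

r = 2.0                                   # radial coordinate of the chosen point
g = np.diag([1.0, r**2])                  # g_{ij} in polar coordinates (r, theta)
g_inv = np.linalg.inv(g)                  # g^{ij}

v_up = np.array([3.0, 0.5])               # contravariant components v^i
v_down = np.einsum("ij,j->i", g, v_up)           # lower: v_i = g_{ij} v^j
v_back = np.einsum("ij,j->i", g_inv, v_down)     # raise: v^i = g^{ij} v_j

norm_sq = np.einsum("i,i->", v_up, v_down)       # v^i v_i = g_{ij} v^i v^j
```

Here v_i and v^i differ numerically in the angular component (2.0 versus 0.5), which is exactly the distinction that disappears when g_{ij} = δ_{ij}.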

5. Einstein summation convention

Introduced by Einstein in his 1916 GR paper to compress notation: any index appearing exactly once as an upper index and once as a lower index in the same term is implicitly summed.

A_{ij} · B^{jk} ≡ Σ_{j=1}^n A_{ij} · B^{jk}

The summed index j is a dummy (or contracted) index — you can rename it (j → m) without changing meaning. The remaining indices i and k are free indices — they appear on both sides of an equation.

Rules of well-formed Einstein notation:

Each free index appears once on every term.
Each dummy index appears exactly twice in a term, once up + once down.
An index appearing three or more times indicates an error.
An index appearing twice in the same position (both up or both down) is not a valid contraction in general — only valid in Euclidean settings where g_{ij} = δ_{ij} and the distinction is moot.

In ML libraries the distinction between upper and lower indices is dropped (everything is Euclidean), and einsum strings simply name axes — see Section 9.

6. Tensor algebra

The fundamental operations on tensors:

Addition. Two tensors of the same type can be added componentwise: (A + B)^{ij}_k = A^{ij}_k + B^{ij}_k. Adding tensors of different type is not defined.

Scalar multiplication. (α·A)^{ij}_k = α · A^{ij}_k for α ∈ ℝ.

Tensor product ⊗. Given a (r₁, s₁) tensor A and a (r₂, s₂) tensor B, the tensor product A ⊗ B is a (r₁+r₂, s₁+s₂) tensor with components

(A ⊗ B)^{i_1 … i_{r_1} k_1 … k_{r_2}}_{j_1 … j_{s_1} l_1 … l_{s_2}} = A^{i_1 … i_{r_1}}_{j_1 … j_{s_1}} · B^{k_1 … k_{r_2}}_{l_1 … l_{s_2}}

The orders add. For order-1 tensors (vectors) u, v ∈ ℝ^n, u ⊗ v is the n×n matrix with (u ⊗ v)_{ij} = u_i · v_j (the outer product).

Contraction. Given a tensor with at least one upper and one lower index, you can contract by picking an upper index and a lower index and summing over them. For a (1, 1) tensor A^i_j, contraction gives A^i_i = Σ_i A^i_i = trace(A), a scalar. For a (1, 2) tensor T^i_{jk}, contracting i with j gives T^i_{ik}, a covector indexed by k. Contraction reduces the order by 2: a type (r, s) tensor becomes type (r-1, s-1).
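The tensor product and contraction can be sketched with NumPy arrays standing in for components in a fixed basis. The arrays below are arbitrary example data, and the axis order (i, j, k) for T is an assumption of this sketch.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

outer = np.einsum("i,j->ij", u, v)         # (u ⊗ v)_{ij} = u_i v_j, order 1 + 1 = 2

A = np.arange(9.0).reshape(3, 3)           # components A^i_j
trace_A = np.einsum("ii->", A)             # contraction A^i_i, order 2 - 2 = 0

T = np.arange(27.0).reshape(3, 3, 3)       # components T^i_{jk}, axes ordered (i, j, k)
T_contracted = np.einsum("iik->k", T)      # T^i_{ik}, order 3 - 2 = 1
```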

Symmetric and antisymmetric parts. Any (0, 2) or (2, 0) tensor decomposes:

T_{ij} = T_{(ij)} + T_{[ij]}
T_{(ij)} = ½(T_{ij} + T_{ji}) (symmetric part)
T_{[ij]} = ½(T_{ij} - T_{ji}) (antisymmetric part)

Higher-order tensors have richer symmetry decompositions (Young symmetrizers, representation theory of the symmetric group).
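For the order-2 case, the decomposition is a two-line computation. The sketch below uses an arbitrary 3×3 array as the components T_{ij}.

```python
import numpy as np

T = np.array([[1.0, 2.0, 0.0],
              [4.0, 3.0, 7.0],
              [6.0, 1.0, 5.0]])

T_sym = 0.5 * (T + T.T)      # T_(ij)
T_anti = 0.5 * (T - T.T)     # T_[ij]

# Fully contracting a symmetric array with an antisymmetric one gives zero.
cross_contraction = np.einsum("ij,ij->", T_sym, T_anti)
```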

Inner product. With a metric g_{ij}, the inner product of two vectors is u^i · v^j · g_{ij}. Without indexing: ⟨u, v⟩ = u^T g v.

7. Common tensors

Kronecker delta δ^i_j = 1 if i = j, 0 otherwise. The identity tensor; a (1, 1) tensor. Also written δ_{ij} or δ^{ij} when working in Euclidean space.

Levi-Civita symbol ε_{ijk} (Tullio Levi-Civita, 1900) — totally antisymmetric:

ε_{123} = ε_{231} = ε_{312} = +1
ε_{132} = ε_{213} = ε_{321} = -1
ε_{ijk} = 0 if any two indices are equal

The 3D cross product is (a × b)^i = ε^i_{jk} · a^j · b^k. The symbol generalizes to n dimensions as the n-index totally antisymmetric symbol. Note: ε_{ijk} is a pseudotensor; it picks up a sign under improper rotations (parity).
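The cross-product formula can be checked directly. This is a minimal sketch in Euclidean ℝ^3 (so upper and lower indices coincide), using 0-based indices and arbitrary example vectors.

```python
import numpy as np

eps = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:   # even permutations (0-based)
    eps[i, j, k] = 1.0
    eps[i, k, j] = -1.0                               # swapping two indices flips the sign

a = np.array([1.0, 2.0, 3.0])
b = np.array([-1.0, 0.5, 4.0])
cross = np.einsum("ijk,j,k->i", eps, a, b)   # (a x b)^i = eps^i_{jk} a^j b^k
```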

Metric tensor g_{ij} — encodes geometry (lengths and angles). Inverse g^{ij}. In Cartesian coordinates on ℝ^n: g_{ij} = δ_{ij}. In polar coordinates on ℝ^2: g = diag(1, r²). In Schwarzschild spacetime (GR): g = diag(-(1 - r_s/r), 1/(1 - r_s/r), r², r² sin²θ).

Christoffel symbols Γ^k_{ij} = ½ g^{kl} (∂_i g_{jl} + ∂_j g_{il} - ∂_l g_{ij}) — encode the connection (how to parallel-transport vectors). Symmetric in i, j for the Levi-Civita connection. Not a tensor: under change of coordinates, an inhomogeneous (non-tensorial) extra term appears. They vanish in flat space with Cartesian coordinates but are nonzero in polar coordinates of ℝ^2 — curvature of coordinates, not of space.
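To make the formula concrete, the sketch below evaluates it for polar coordinates on ℝ^2, assuming the metric g = diag(1, r²) and entering its only nonzero derivative, ∂_r g_{θθ} = 2r, by hand. The function name christoffel_polar and the array layout Gamma[k, i, j] are choices made for this example.

```python
import numpy as np

def christoffel_polar(r):
    """Return Gamma[k, i, j] = Γ^k_{ij} for g = diag(1, r^2), coordinates (r, theta)."""
    g_inv = np.diag([1.0, 1.0 / r**2])
    dg = np.zeros((2, 2, 2))          # dg[l, i, j] = ∂_l g_{ij}
    dg[0, 1, 1] = 2.0 * r             # only ∂_r g_{θθ} = 2r is nonzero
    # Γ^k_{ij} = ½ g^{kl} (∂_i g_{jl} + ∂_j g_{il} - ∂_l g_{ij})
    bracket = (np.einsum("ijl->lij", dg)     # ∂_i g_{jl}, rearranged to [l, i, j]
               + np.einsum("jil->lij", dg)   # ∂_j g_{il}, rearranged to [l, i, j]
               - dg)                         # ∂_l g_{ij}, already [l, i, j]
    return 0.5 * np.einsum("kl,lij->kij", g_inv, bracket)

Gamma = christoffel_polar(2.0)
```

At r = 2 the nonzero entries are Γ^r_{θθ} = -r = -2 and Γ^θ_{rθ} = Γ^θ_{θr} = 1/r = 0.5, even though the plane itself is flat.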

Riemann curvature tensor R^d_{abc} (Bernhard Riemann, 1854, formalized later by Ricci and Levi-Civita) — measures intrinsic curvature of a manifold:

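R^d_{cab} = ∂_a Γ^d_{bc} - ∂_b Γ^d_{ac} + Γ^d_{ae} Γ^e_{bc} - Γ^d_{be} Γ^e_{ac}

(The index order is chosen to match the commutator identity (∇_a ∇_b - ∇_b ∇_a) v^d = R^d_{cab} v^c.)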

Vanishes iff the manifold is locally flat. On the sphere it is nonzero; on ℝ^n it is zero.

Ricci tensor R_{ab} = R^c_{acb} — contraction of Riemann on the first and third indices. Symmetric (in the Levi-Civita case).

Scalar curvature R = g^{ab} R_{ab} — a scalar field on the manifold.

Stress tensor σ_{ij} (Augustin-Louis Cauchy, 1822) — continuum mechanics; σ_{ij} is the i-th component of force per unit area on a surface with outward normal in the j direction. Symmetric (in absence of body torques).

Strain tensor ε_{ij} = ½(∂_i u_j + ∂_j u_i) — measures local deformation of a continuous medium.

Inertia tensor I_{ij} = ∫ ρ (r² δ_{ij} - x_i x_j) dV — relates angular velocity to angular momentum: L^i = I^{ij} ω_j. A symmetric (2, 0) tensor whose eigenvectors are the principal axes.

Maxwell field tensor (Faraday tensor) F_{μν} — antisymmetric (0, 2) tensor in 4-d spacetime combining electric and magnetic fields:

F_{μν} = ∂_μ A_ν - ∂_ν A_μ

with A_μ the electromagnetic four-potential. Covariant Maxwell’s equations: ∂^μ F_{μν} = μ_0 J_ν and ∂_{[μ} F_{νρ]} = 0.

Stress-energy tensor T_{μν} (GR) — symmetric (0, 2) tensor encoding energy density, momentum density, momentum flux, and stress in spacetime. Source of gravity in Einstein’s field equations: R_{μν} - ½ g_{μν} R + Λ g_{μν} = (8πG/c⁴) T_{μν}.

8. Tensors as multi-dimensional arrays (ML view)

In NumPy / PyTorch / TensorFlow / JAX, a tensor is simply an n-dimensional array of numbers, characterized by:

dtype — float32, float16, bfloat16, int32, …
shape — tuple of dimension sizes, e.g. (B, C, H, W) for a batch of images.
device — CPU, GPU (cuda:0), TPU.
stride — for non-contiguous views; controls how multi-dim indices map to flat memory.

The “rank” of a tensor here means the number of axes (length of the shape tuple). A 4-d tensor of shape (32, 3, 224, 224) is a batch of 32 RGB images at 224×224 resolution.

Key operations:

Reshape / view — change the shape without moving data (when the new shape is compatible with current stride).
Permute / transpose — reorder axes (may require a copy or just a stride change).
Broadcasting — implicit expansion of dimensions of size 1 to match (rules below).
Reduction — sum, mean, max, min, norm along selected axes.
Element-wise ops — addition, multiplication (Hadamard), activations (ReLU, GELU).
Matrix multiplication and einsum — see next section.

No coordinate-transformation behavior is enforced. If you reshape a tensor incorrectly you get garbage; the library does not know that your tensor was meant to be (covariant, contravariant) under some basis change.
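The attributes listed above can be inspected directly. The following NumPy sketch uses a small float32 array as an example; the dictionary name info is just a convenience for this illustration.

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)   # C-contiguous, 4 bytes per element
xt = x.T                                            # transpose: a view, no data copied

info = {
    "dtype": x.dtype,                     # float32
    "shape": x.shape,                     # (3, 4)
    "ndim": x.ndim,                       # 2 axes
    "strides": x.strides,                 # bytes to step along each axis
    "transposed_strides": xt.strides,     # same buffer, strides swapped
}
shares_buffer = np.shares_memory(x, xt)
```

The transpose only swaps the strides, which is why it is cheap and also why the result is no longer contiguous in memory.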

9. Einsum

Einstein summation as a programming primitive: einsum(spec, *operands) where spec is a comma-separated list of axis labels for each operand, an arrow, and the output labels.

einsum("ij,jk->ik", A, B)         # matrix product C_ik = Σ_j A_ij · B_jk
einsum("ii->", A)                 # trace t = Σ_i A_ii
einsum("ij->ji", A)               # transpose T_ji = A_ij
einsum("ij,ij->", A, B)           # Frobenius inner product
einsum("i,j->ij", a, b)           # outer product M_ij = a_i · b_j
einsum("bij,bjk->bik", A, B)      # batched matmul
einsum("bhid,bhjd->bhij", Q, K)   # attention scores (batched, multi-head)
einsum("bhij,bhjd->bhid", S, V)   # attention output

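Labels that appear in the inputs but not in the output are contracted (summed over). A label repeated within a single operand selects that operand's diagonal. A label shared by several inputs and kept in the output is a batch axis: the operands are aligned along it and no sum is taken.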

Available in: NumPy (np.einsum), PyTorch (torch.einsum), TensorFlow (tf.einsum), JAX (jnp.einsum), and the standalone library opt_einsum which finds optimal contraction orderings (the contraction-path optimization problem is NP-hard in general; opt_einsum uses heuristics + dynamic programming).

Einsum is expressive (a one-liner can replace five or more lines of reshape/permute/matmul) and usually fast (implementations can map many contractions onto BLAS/GEMM calls, although the best contraction order is not always chosen automatically). It is a standard way to write transformer attention, tensor contractions in quantum chemistry, and general multi-tensor expressions.
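A few of these patterns, written as a runnable NumPy sketch and compared against the equivalent non-einsum calls. Shapes and random data are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))
B = rng.normal(size=(5, 6))
S = rng.normal(size=(5, 5))

matmul = np.einsum("ij,jk->ik", A, B)        # j is shared and absent from the output: summed
trace = np.einsum("ii->", S)                 # repeated label in one operand: diagonal, then sum
transpose = np.einsum("ij->ji", A)
outer = np.einsum("i,j->ij", A[0], B[0])

# Batched multi-head attention scores: b and h are kept, d is summed.
Q = rng.normal(size=(2, 3, 7, 8))            # (batch, heads, queries, head_dim)
K = rng.normal(size=(2, 3, 9, 8))            # (batch, heads, keys, head_dim)
scores = np.einsum("bhid,bhjd->bhij", Q, K)  # (batch, heads, queries, keys)
```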

10. Tensor operations in ML

Reshape / view. Changes how indices group into memory. View is zero-copy (if stride permits); reshape may copy. Example: a (32, 3, 224, 224) image batch can be flattened to (32, 150528) before a fully-connected layer.

Permute / transpose. Reorders axes. permute(0, 2, 3, 1) converts NCHW → NHWC. May not be contiguous after permute; call .contiguous() before some downstream ops.

Broadcasting. NumPy-style rules (inherited from NumPy's predecessor Numeric, building on conventions from APL and earlier array languages; popularized by NumPy 1.0, 2006):

Align shapes from the right; missing leading dimensions are treated as size 1.
Each pair of aligned dimensions must be equal, or one of them must be 1.
A size-1 dimension is virtually stretched to match.

Example: a tensor of shape (3, 4) plus a tensor of shape (4,) gives a result of shape (3, 4) (the 1-d tensor is broadcast over the first axis). A tensor of shape (3, 1, 5) plus one of shape (4, 5) gives (3, 4, 5).

Broadcasting is powerful but is the source of many silent bugs: adding tensors of shapes (5,) and (5, 1) yields a (5, 5) result, not a (5,) result.
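The shape arithmetic can be confirmed with a short NumPy sketch; the arrays of ones are placeholders, since only the shapes matter here.

```python
import numpy as np

a = np.ones((3, 4)) + np.ones(4)             # (3, 4) + (4,)      -> (3, 4)
b = np.ones((3, 1, 5)) + np.ones((4, 5))     # (3, 1, 5) + (4, 5) -> (3, 4, 5)
c = np.ones(5) + np.ones((5, 1))             # (5,) + (5, 1)      -> (5, 5), often unintended

# np.broadcast_shapes reports the result shape without allocating arrays.
predicted = np.broadcast_shapes((3, 1, 5), (4, 5))

try:
    np.ones((3, 4)) + np.ones(3)             # trailing sizes 4 and 3 are incompatible
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

Checking shapes with np.broadcast_shapes (or an assertion on .shape) before a binary operation is a cheap way to catch the silent (5, 5) case.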

Reduction. sum(axis=...), mean, max, argmax, norm. The reduced axes disappear (or remain as size-1 if keepdims=True).

Outer product. Rank-1 tensor product: u ⊗ v with shape (n, m) from u of shape (n,) and v of shape (m,). np.outer(u, v) or einsum("i,j->ij", u, v).

Element-wise ops. Hadamard product (A ⊙ B)_{ij} = A_{ij} · B_{ij}; activations like ReLU, GELU, SiLU; element-wise functions.

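Convolution. A discrete tensor contraction with a sliding window. For 2D convolution with input X of shape (B, C_in, H, W), kernel K of shape (C_out, C_in, k_h, k_w), and output Y of shape (B, C_out, H’, W’), with stride 1 and no padding (so H’ = H - k_h + 1 and W’ = W - k_w + 1):

Y[b, c_out, i, j] = Σ_{c_in, di, dj} K[c_out, c_in, di, dj] · X[b, c_in, i+di, j+dj]

Strictly, this is cross-correlation (the kernel is not flipped), which is what ML frameworks call convolution.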

Modern implementations use im2col, Winograd’s minimal-filtering algorithm (Shmuel Winograd, 1980; applied to ConvNets by Lavin and Gray, 2016), or implicit GEMM.
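Before any of those optimizations, the sum itself can be written directly. The sketch below is a deliberately naive reference implementation, assuming stride 1 and no padding; the kernel is called K here to avoid clashing with the width W, and conv2d_naive is a name invented for this example rather than a library function.

```python
import numpy as np

def conv2d_naive(X, K):
    """Stride-1, unpadded 2D cross-correlation.

    X: (B, C_in, H, W); K: (C_out, C_in, k_h, k_w).
    Returns Y of shape (B, C_out, H - k_h + 1, W - k_w + 1).
    """
    B, C_in, H, W = X.shape
    C_out, C_in_k, k_h, k_w = K.shape
    assert C_in == C_in_k, "input channels must match"
    H_out, W_out = H - k_h + 1, W - k_w + 1
    Y = np.zeros((B, C_out, H_out, W_out))
    for di in range(k_h):
        for dj in range(k_w):
            # Shifted window holding X[b, c_in, i + di, j + dj] for all output (i, j).
            window = X[:, :, di:di + H_out, dj:dj + W_out]
            # Contract over input channels for this kernel offset.
            Y += np.einsum("oc,bcij->boij", K[:, :, di, dj], window)
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3, 6, 5))
K = rng.normal(size=(4, 3, 3, 2))
Y = conv2d_naive(X, K)
```

Looping over the kernel offsets and contracting channels with einsum keeps the code close to the index formula; it is useful as a correctness reference, not as a fast kernel.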

Attention. Scaled dot-product attention (Vaswani et al., 2017): given Q, K, V each of shape (B, H, N, d),

Attention(Q, K, V) = softmax(Q · K^T / √d) · V

where the softmax is taken over the key axis. It is implemented as batched matmul; modern implementations use FlashAttention (Dao et al., 2022) for IO-aware tiling.
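A straightforward (non-tiled) version of scaled dot-product attention can be sketched in NumPy with two einsum calls. The function names softmax and attention are defined here for illustration; this materializes the full score matrix and is not how FlashAttention works.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax with the maximum subtracted first for numerical stability."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for arrays of shape (B, H, N, d)."""
    d = Q.shape[-1]
    scores = np.einsum("bhid,bhjd->bhij", Q, K) / np.sqrt(d)   # (B, H, N, N)
    weights = softmax(scores, axis=-1)                          # normalize over keys j
    return np.einsum("bhij,bhjd->bhid", weights, V), weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 3, 4, 8))
K = rng.normal(size=(2, 3, 4, 8))
V = rng.normal(size=(2, 3, 4, 8))
out, weights = attention(Q, K, V)
```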

Gather / scatter. Sparse indexing: gather(input, axis, indices) picks values at specified indices; scatter writes them back. Used in embedding lookups, sparse attention, MoE routing.

11. Tensor decomposition

For high-order tensors, decompositions reduce storage and computation:

CP (canonical polyadic) decomposition — also called PARAFAC (Harshman, 1970) or CANDECOMP (Carroll and Chang, 1970). Express a tensor T of order d as a sum of R rank-1 outer products:

T_{i_1 … i_d} = Σ_{r=1}^R a^{(1)}_{i_1, r} · a^{(2)}_{i_2, r} · … · a^{(d)}_{i_d, r}

The minimal R is the tensor rank. Determining tensor rank is NP-hard in general (Hillar and Lim, 2013).
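The CP format is easy to construct in the forward direction, even though finding the minimal R for a given tensor is hard. The sketch below builds an order-3 tensor from random factor matrices (example data; the names A1, A2, A3 are choices made here) and does not attempt to fit a decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
R = 2                                    # number of rank-1 terms
A1 = rng.normal(size=(3, R))             # factor matrix with entries a^(1)_{i_1, r}
A2 = rng.normal(size=(4, R))
A3 = rng.normal(size=(5, R))

T = np.einsum("ir,jr,kr->ijk", A1, A2, A3)    # sum over r of the outer products

# Same tensor built term by term, to make the "sum of rank-1 tensors" explicit.
T_terms = sum(
    np.multiply.outer(np.multiply.outer(A1[:, r], A2[:, r]), A3[:, r])
    for r in range(R)
)

# Flattening the last two axes gives a matrix whose rank is at most R.
unfolding_rank = np.linalg.matrix_rank(T.reshape(3, -1))

full_entries = T.size                          # 3 * 4 * 5 = 60
factor_entries = A1.size + A2.size + A3.size   # R * (3 + 4 + 5) = 24
```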

Tucker decomposition (Ledyard Tucker, 1966) — a core tensor times a matrix on each mode:

T_{i_1 … i_d} = Σ_{j_1 … j_d} G_{j_1 … j_d} · U^{(1)}_{i_1 j_1} · … · U^{(d)}_{i_d j_d}

Generalizes the SVD to higher orders. The variant whose orthonormal factor matrices come from SVDs of the mode unfoldings is the higher-order SVD (HOSVD; De Lathauwer, De Moor, Vandewalle, 2000).

Tensor Train (TT) decomposition (Ivan Oseledets, 2011) — a sequence of order-3 “carriages” multiplied along a chain:

T_{i_1 … i_d} = G_1[i_1] · G_2[i_2] · … · G_d[i_d]

with each G_k[i_k] a matrix of shape (r_{k-1}, r_k); r_0 = r_d = 1. Storage cost O(d · n · r²) versus O(n^d) for the full tensor — exponential compression when ranks are small.
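One way to obtain TT cores is by successive SVDs of reshaped unfoldings. The following is a minimal sketch in that spirit, not a tuned implementation: the function names tt_svd and tt_to_full and the relative tolerance tol are assumptions of this example, and truncation is by a simple singular-value threshold.

```python
import numpy as np

def tt_svd(T, tol=1e-12):
    """Split T into TT cores of shape (r_prev, n_k, r_next) by successive SVDs."""
    dims = T.shape
    cores, r_prev = [], 1
    C = T.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = max(1, int(np.sum(s > tol * s[0])))           # drop negligible singular values
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        C = (s[:r, None] * Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))
    return cores

def tt_to_full(cores):
    """Contract the chain of cores back into the full tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=1)   # last axis of full with first axis of core
    return full.reshape(full.shape[1:-1])         # drop the boundary ranks r_0 = r_d = 1

rng = np.random.default_rng(0)
T = rng.normal(size=(3, 4, 5, 2))
cores = tt_svd(T)
T_rebuilt = tt_to_full(cores)
```

For a generic random tensor the ranks come out large and nothing is saved; the compression appears when the singular values decay quickly, for example for a tensor that is a single outer product, where every TT rank is 1.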

Tensor Networks — generalization with arbitrary graphs of tensor contractions. Matrix Product States (MPS), Projected Entangled Pair States (PEPS), Multi-scale Entanglement Renormalization Ansatz (MERA) are tensor-network ansätze in quantum many-body physics (Vidal, 2003, 2007).

Applications:

Model compression — replace large fully-connected or convolutional layers with low-rank TT or Tucker approximations.
Recommender systems — factorize user-item-context interaction tensors.
Signal processing — multi-way analysis of EEG, fMRI, hyperspectral imaging.
Quantum simulation — MPS / DMRG for 1D quantum systems (White, 1992).

12. Differential geometry essentials

Manifold — a topological space that locally looks like ℝ^n. Formally, a Hausdorff second-countable space M with an atlas of charts (U_α, φ_α) where φ_α : U_α → ℝ^n is a homeomorphism, and chart transitions φ_β ∘ φ_α^{-1} are smooth (for a smooth manifold).

Tangent space T_p M at a point p ∈ M — the vector space of equivalence classes of curves through p, or equivalently the space of derivations on smooth functions at p. dim T_p M = dim M = n.

Tangent bundle TM = ⋃_{p ∈ M} T_p M — a 2n-dimensional manifold.

Vector field X — a smooth section of TM; assigns a tangent vector X_p ∈ T_p M to each point p. In coordinates: X = X^i ∂_i where ∂_i = ∂/∂x^i.

Differential 1-form ω — a smooth section of the cotangent bundle T*M; assigns a covector ω_p ∈ T*_p M to each point p. In coordinates: ω = ω_i dx^i. Example: the differential df of a smooth function f, with components ∂_i f.

Differential p-form — totally antisymmetric (0, p) tensor field. Space of p-forms denoted Ω^p(M). The wedge product α ∧ β combines a p-form and a q-form into a (p+q)-form, antisymmetric.

Exterior derivative d : Ω^p(M) → Ω^{p+1}(M) — generalizes gradient (d on 0-forms), curl (d on 1-forms in 3D), and divergence (d on 2-forms in 3D). Satisfies:

d(α ∧ β) = dα ∧ β + (-1)^p α ∧ dβ for a p-form α (graded Leibniz rule).
d² = 0 (every exact form is closed, dual to the fact that a boundary has no boundary; this single identity unifies “curl of grad = 0” and “div of curl = 0”).

Hodge star ★ : Ω^p(M) → Ω^{n-p}(M) — depends on a metric and orientation. Maps p-forms to (n-p)-forms. In ℝ^3 it identifies 1-forms with 2-forms (and is why curl makes sense in 3D specifically).

Stokes theorem — ∫_M dω = ∫_{∂M} ω for any (n-1)-form ω on an n-dimensional oriented manifold M with boundary ∂M. Specializes to:

Fundamental theorem of calculus (n = 1).
Green’s theorem (n = 2).
Classical Stokes / Kelvin-Stokes (n = 2 surface in ℝ^3).
Divergence theorem / Gauss (n = 3).

13. Riemannian geometry

A Riemannian manifold (M, g) is a smooth manifold with a smoothly varying inner product g_p on each tangent space T_p M — a (0, 2) symmetric positive-definite tensor field, the metric.

The metric lets you measure:

Lengths of curves: L(γ) = ∫_a^b √(g_{ij} γ’^i γ’^j) dt.
Angles between vectors at a point.
Volumes via the volume form dV = √|det g| dx^1 ∧ … ∧ dx^n.

Christoffel symbols (Levi-Civita connection) — uniquely determined by the metric:

Γ^k_{ij} = ½ g^{kl} (∂_i g_{jl} + ∂_j g_{il} - ∂_l g_{ij})

It is the unique connection that is (a) torsion-free (Γ^k_{ij} = Γ^k_{ji}) and (b) metric-compatible (∇g = 0).

Covariant derivative ∇_i — extends partial derivative to tensor fields on a manifold. For a vector field v:

∇_i v^k = ∂_i v^k + Γ^k_{ij} v^j

For a covector ω:

∇_i ω_k = ∂_i ω_k - Γ^j_{ik} ω_j

Generalizes to mixed tensors with one connection term per index.

Parallel transport — a vector v_0 ∈ T_{p_0} M is transported along a curve γ by solving (∇_{γ’} v)^k = 0 along γ. Path-dependent in general — the dependence is curvature.

Geodesics — locally length-minimizing curves; solutions of the geodesic equation

γ”^k + Γ^k_{ij} γ’^i γ’^j = 0

Straight lines in ℝ^n; great circles on the sphere.
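The claim that geodesics of flat ℝ^n are straight lines can be tested in curvilinear coordinates. The sketch below integrates the geodesic equation in polar coordinates on ℝ^2, assuming the metric g = diag(1, r²) with nonzero Christoffel symbols Γ^r_{θθ} = -r and Γ^θ_{rθ} = Γ^θ_{θr} = 1/r. The classical Runge-Kutta integrator, the step size, and the function names are choices made for this example.

```python
import math

def geodesic_rhs(state):
    """Right-hand side of the geodesic equation in polar coordinates (r, theta)."""
    r, theta, dr, dtheta = state
    ddr = r * dtheta**2                     # r'' = -Γ^r_{θθ} θ'^2
    ddtheta = -2.0 * dr * dtheta / r        # θ'' = -2 Γ^θ_{rθ} r' θ'
    return (dr, dtheta, ddr, ddtheta)

def rk4_step(state, h):
    """One classical Runge-Kutta step of size h."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))
    k1 = geodesic_rhs(state)
    k2 = geodesic_rhs(shifted(state, k1, h / 2))
    k3 = geodesic_rhs(shifted(state, k2, h / 2))
    k4 = geodesic_rhs(shifted(state, k3, h))
    return tuple(s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# Start at (x, y) = (1, 0), moving in the +y direction with unit speed.
state = (1.0, 0.0, 0.0, 1.0)    # (r, theta, r', theta')
for _ in range(1000):           # integrate up to parameter value t = 1
    state = rk4_step(state, 0.001)

x = state[0] * math.cos(state[1])
y = state[0] * math.sin(state[1])
```

Converted back to Cartesian coordinates, the endpoint lies on the straight line x = 1 (at approximately y = 1), even though r and θ both vary along the path.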

Riemann curvature R^d_{abc} — measures failure of covariant derivatives to commute:

(∇_a ∇_b - ∇_b ∇_a) v^d = R^d_{cab} v^c (up to torsion, zero here)

Has symmetries: R_{abcd} = -R_{bacd} = -R_{abdc} = R_{cdab}, plus the first Bianchi identity R^d_{[abc]} = 0 and the second (differential) Bianchi identity ∇_{[e} R^d_{|c|ab]} = 0.

Ricci curvature R_{ab} = R^c_{acb}. Scalar curvature R = g^{ab} R_{ab}.

Einstein field equations of General Relativity (Einstein, 1915–1916):

R_{μν} - ½ g_{μν} R + Λ g_{μν} = (8πG / c⁴) T_{μν}

with Λ the cosmological constant, G Newton’s gravitational constant, c the speed of light, T_{μν} the stress-energy tensor of matter and fields. Solutions: Minkowski (flat, vacuum), Schwarzschild (vacuum exterior of spherical mass), Kerr (rotating black hole), Friedmann-Lemaître-Robertson-Walker (cosmology).

14. Applications

Continuum mechanics. Cauchy stress tensor σ_{ij} relates force and area; strain tensor ε_{ij} measures local deformation (the symmetrized gradient of the displacement field); elasticity tensor C_{ijkl} relates stress to strain (generalized Hooke’s law: σ_{ij} = C_{ijkl} ε_{kl}). The viscous stress tensor appears in the Navier-Stokes equations. The piezoelectric coupling tensor d_{ijk} relates electric field to mechanical strain.

Rigid-body dynamics. Inertia tensor I_{ij} relates angular velocity to angular momentum: L^i = I^{ij} ω_j. Principal axes (eigenvectors of I) diagonalize it. Kinetic energy of rotation: T = ½ I^{ij} ω_i ω_j. See [[Robotics/dynamics-rigid-body]].

Electromagnetism. Faraday tensor F_{μν} encodes E and B fields covariantly. Covariant Maxwell equations: ∂_μ F^{μν} = μ_0 J^ν, ∂_{[μ} F_{νρ]} = 0. Stress-energy tensor of the EM field: T_{μν} = (1/μ_0)(F_{μα} F_ν^α - ¼ g_{μν} F_{αβ} F^{αβ}). See [[Engineering/electromagnetics-engineering]].

General relativity. Spacetime is a 4-d pseudo-Riemannian manifold with signature (-, +, +, +) or (+, -, -, -). Matter curves spacetime via Einstein’s equations; spacetime curvature dictates how matter and light move (geodesics). Predicts gravitational lensing, gravitational waves (detected by LIGO 2015), black holes, cosmological expansion.

Quantum mechanics. Composite systems live in tensor-product Hilbert spaces: H_AB = H_A ⊗ H_B. Density matrices for mixed states are (1, 1) tensors. Partial trace = contraction over one subsystem. Entanglement = non-product states.
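The statement that the partial trace is a contraction can be illustrated with two qubits. The sketch below uses a Bell state as example data and reshapes the 4×4 density matrix so that the subsystem indices become separate axes; the axis order (a, b, a', b') is an assumption of this example.

```python
import numpy as np

# Bell state (|00> + |11>) / sqrt(2), a vector in H_A ⊗ H_B with dim 2 x 2.
psi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
rho = np.outer(psi, psi.conj())                # density matrix on H_AB, shape (4, 4)

# Expose the subsystem indices: rho_tensor[a, b, a', b'].
rho_tensor = rho.reshape(2, 2, 2, 2)
rho_A = np.einsum("abcb->ac", rho_tensor)      # partial trace over B: contract b with b'

purity_A = np.trace(rho_A @ rho_A).real        # equals 1 only if the reduced state is pure
```

The reduced state comes out as half the identity with purity 1/2, the signature of a maximally entangled pair: the joint state is pure but neither subsystem is.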

Machine learning. Tensors = n-dimensional arrays of floats. Forward and backward passes are sequences of tensor operations (matmul, conv, attention, normalization). Autodiff records the computation graph and propagates gradients backward via vector-Jacobian products. Tensor cores in NVIDIA GPUs accelerate small matrix products at the heart of all these. See [[Compute/transformer-architecture]].

Computer graphics. Vertex positions transform as vectors (contravariant); surface normals transform as covectors and require the inverse-transpose of the model matrix (M^{-1})^T to transform correctly under non-uniform scaling. A common bug: applying M to normals and getting wrong lighting.
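The normal-vector bug is easy to reproduce. The sketch below uses a non-uniform scale as the linear part of a model matrix and a tangent/normal pair chosen for illustration; a correctly transformed normal must stay perpendicular to the transformed tangent.

```python
import numpy as np

M = np.diag([2.0, 1.0, 1.0])              # non-uniform scale along x

tangent = np.array([1.0, -1.0, 0.0])      # lies in the surface
normal = np.array([1.0, 1.0, 0.0])        # perpendicular to the tangent

tangent_new = M @ tangent                          # vectors transform with M
normal_wrong = M @ normal                          # the common bug
normal_right = np.linalg.inv(M).T @ normal         # covectors use the inverse-transpose

dot_wrong = tangent_new @ normal_wrong             # nonzero: no longer perpendicular
dot_right = tangent_new @ normal_right             # zero: perpendicularity preserved
```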

Signal processing. Multi-channel, multi-domain signals (time × channel × subject × condition in EEG) are naturally represented as tensors; decompositions like CP and Tucker reveal latent structure.

Diffusion MRI / DTI. At each voxel, water diffusion is modeled by a 3×3 symmetric tensor D_{ij}; eigenvectors give principal diffusion directions (revealing white-matter tracts), eigenvalues give magnitudes; derived scalars include fractional anisotropy (FA) and mean diffusivity (MD).

15. GPU tensor units

NVIDIA introduced Tensor Cores in the Volta architecture (V100, 2017) — specialized units that perform small matrix-multiply-and-accumulate operations (D = A · B + C) in a single instruction:

Volta (2017) — FP16 × FP16 → FP32 accumulate, 4×4×4 fused matmul.
Turing (2018) — adds INT8/INT4.
Ampere (A100, 2020) — adds BF16, TF32, structured sparsity, 8×8 matmul.
Hopper (H100, 2022) — adds FP8 (E4M3, E5M2), Transformer Engine.
Blackwell (B200, 2024) — adds FP4 (E2M1), 2nd-gen Transformer Engine.

Tensor Cores deliver roughly an order-of-magnitude throughput improvement over CUDA cores for the matrix multiplications that dominate transformer training and inference. They are a major reason that training large language models, which for the largest models costs on the order of trillions of floating-point operations per token, is practical.

Other vendors: AMD Matrix Cores (CDNA / MI300), Intel AMX (Sapphire Rapids), Google TPU Matrix Multiply Unit (systolic array), Apple AMX (private ISA), Tenstorrent Tensix.

16. Software

NumPy — np.einsum, np.tensordot, broadcasting, ufuncs. Reference CPU library; foundation for the Python scientific stack.

PyTorch — torch.einsum, autodiff via dynamic computation graph; torch.compile (TorchInductor, 2023) for kernel fusion. De facto research framework.

JAX — functional, NumPy-compatible API with composable transformations (jit, grad, vmap, pmap). XLA backend for fused kernels on TPU and GPU. Strong for research at scale.

TensorFlow — Google’s framework; tf.einsum, tf.tensordot; Keras high-level API. XLA backend.

opt_einsum — finds optimal contraction paths for multi-tensor einsum expressions. Backend-agnostic (works with NumPy, PyTorch, TensorFlow, JAX, CuPy, Dask). The contraction-ordering problem is NP-hard; opt_einsum uses dynamic programming, greedy, and heuristic algorithms.

TensorNetwork (Google, 2019) — Python library for tensor network computations in quantum physics and ML.

quimb — Python library for quantum information and many-body physics with tensor networks.

ITensor — high-performance tensor network library in Julia and C++; widely used for DMRG and tensor-network methods.

SymPy — symbolic tensor manipulation in Python; sympy.tensor module supports index notation, Lorentz tensors, GR computations.

SageMath — symbolic + numerical math system; differential geometry support via SageManifolds.

Wolfram Mathematica — symbolic tensor algebra; differential geometry via packages like xAct, Ricci.

Cadabra2 — open-source CAS focused on field theory and general relativity; designed for symbolic manipulation of tensor expressions in physics notation.

Tensor Algebra Compiler (TACO) (Kjolstad et al., 2017) — generates efficient code for sparse and dense tensor algebra expressions.

17. Performance

Memory layout. Row-major (C-order, the NumPy default) stores the last axis contiguously; column-major (Fortran-order, MATLAB / LAPACK / Eigen default) stores the first axis contiguously. Mismatches force copies and kill performance. Convolution kernels are highly sensitive to data layout: NCHW (the PyTorch default) vs NHWC (TensorFlow / many GPU kernels) trade off differently for different ops.

Vectorization (SIMD). CPUs execute the same operation on multiple data elements per cycle via SSE / AVX / AVX-512 / NEON / SVE. Compilers auto-vectorize loops over contiguous data; when they cannot, use intrinsics or a library such as Eigen.

GEMM batching. Many small matrix products are slow if launched individually; batched GEMM (cublasGemmBatchedEx, torch.bmm) amortizes launch overhead and exposes parallelism. Critical for transformer multi-head attention.

Kernel fusion. Combining multiple element-wise / reduction ops into one kernel avoids intermediate writes to HBM. Implementations:

PyTorch torch.compile (TorchInductor) — 2023+, generates Triton kernels.
JAX JIT — fuses through XLA.
TensorFlow XLA — same XLA backend.
Triton (OpenAI, 2021) — Python-like DSL for writing GPU kernels at a higher abstraction than CUDA.
MLIR — multi-level IR framework for compiler infrastructure; foundation for several tensor compilers.

FlashAttention (Tri Dao, 2022; v2 2023; v3 2024) — IO-aware attention algorithm that tiles Q, K, V to keep operands in SRAM (on-chip) and avoid materializing the full N × N attention matrix in HBM. 2-4× faster training and inference for transformers with long sequences.

Mixed precision. FP16 / BF16 / FP8 / FP4 for the bulk of operations; FP32 master weights and loss scaling to avoid underflow. Standard for LLM training since ~2018; FP8 mainstream since 2023 (H100).

18. Pitfalls

Forgetting the Jacobian under coordinate change. When transforming a tensor expression to new coordinates, every partial derivative pulls in chain-rule factors. Components transform by a product of M and M^{-1} matrices, one per index. Skipping this gives meaningless expressions.

Mixing covariant and contravariant indices without the metric. Writing v^i + ω_i is ill-defined; you must raise or lower one of them first. Einstein summation requires one up and one down index per dummy pair (except in Euclidean settings where the metric is δ_{ij} and the distinction collapses).

Wrong axis order in reshape (silent transposition). Reshaping (B, C, H, W) → (B*C, H, W) is fine; reshaping (B, H, W, C) → (B, C, H, W) by reshape (instead of permute + contiguous) silently scrambles channels. Always use permute (or transpose) for axis reordering, then contiguous before view/reshape.
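The difference between the two operations shows up on a tiny example. The sketch below uses NumPy, where transpose plays the role of permute and np.ascontiguousarray the role of .contiguous(); the tensor sizes are arbitrary.

```python
import numpy as np

B, H, W, C = 1, 2, 2, 3
x_nhwc = np.arange(B * H * W * C).reshape(B, H, W, C)   # channels-last layout

wrong = x_nhwc.reshape(B, C, H, W)             # reinterprets memory: channels get mixed
right = x_nhwc.transpose(0, 3, 1, 2)           # reorders axes: NHWC -> NCHW
right_contiguous = np.ascontiguousarray(right) # copy into NCHW memory order if needed

# Channel 0 of the image, read from the original layout.
channel0_expected = x_nhwc[0, :, :, 0]
right_ok = np.array_equal(right[0, 0], channel0_expected)
wrong_ok = np.array_equal(wrong[0, 0], channel0_expected)
```

Both results have shape (1, 3, 2, 2), so a shape check alone does not catch the mistake; only the values differ.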

Broadcasting unintended dimensions. Adding (5,) and (5, 1) gives (5, 5), not (5,). Adding (3,) and (1, 3) gives (1, 3). Always print shapes (or use assertions) before binary ops on tensors of different rank.

Christoffel symbols treated as a tensor. They are not — they have non-tensorial transformation under coordinate change. The difference of two connections is a tensor; a connection alone is not.

Confusing tensor “rank” senses. “Rank” in ML / NumPy means number of dimensions (axes). “Rank” in linear algebra means dimension of the column space of a matrix. “Tensor rank” in decomposition theory (CP) means the minimal number of rank-1 outer products summing to the tensor — different from both above. Specify which you mean.
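The first two senses can give different numbers for the same array, as this short NumPy sketch shows on a 3×3 matrix of ones (an example chosen because it is a single outer product).

```python
import numpy as np

ones = np.ones((3, 3))

number_of_axes = ones.ndim                        # "rank" in the ML / NumPy sense
matrix_rank = np.linalg.matrix_rank(ones)         # linear-algebra rank (all rows equal)

# ones is one outer product, so a single rank-1 term reproduces it exactly.
is_single_outer_product = np.allclose(ones, np.outer(np.ones(3), np.ones(3)))

# The identity has the same number of axes but full matrix rank.
identity_axes = np.eye(3).ndim
identity_rank = np.linalg.matrix_rank(np.eye(3))
```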

Pseudotensors and orientation. The Levi-Civita symbol changes sign under improper rotations (parity); strict tensors do not. Quantities like angular momentum L = r × p are pseudovectors. Confusing the two gives wrong signs in chiral or parity-violating contexts.

Numerical underflow / overflow in tensor reductions. Computing softmax as exp(x_i) / Σ exp(x_j) overflows for large x_i; subtract max(x) first. Computing variance as E[X²] - E[X]² loses precision; use Welford’s online algorithm (Welford, 1962).
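Both remedies fit in a few lines of plain Python. The sketch below is illustrative: the function names are chosen here, the variance is the population variance, and the data are picked so that the exact answer (22.5) is known.

```python
import math

def stable_softmax(xs):
    """Softmax that subtracts max(xs) before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def welford_variance(xs):
    """Population variance via Welford's single-pass update."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)      # uses the updated mean
    return m2 / count

# math.exp(1000.0) on its own raises OverflowError; the shifted version is fine.
probs = stable_softmax([1000.0, 1001.0, 1002.0])

# Large offset, small spread: the exact population variance is 22.5.
data = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]
var_welford = welford_variance(data)
var_naive = sum(x * x for x in data) / len(data) - (sum(data) / len(data)) ** 2
```

With values near 1e9, the squares in the naive formula are near 1e18, where double precision cannot resolve differences of this size, so var_naive should not be trusted while var_welford recovers the exact value.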

Order of contraction matters for cost. (A · B) · C vs A · (B · C) — same result, possibly wildly different memory and FLOPs. For chains of matrix multiplications, matrix-chain ordering is a classic dynamic-programming problem; for general tensor networks it is NP-hard, hence opt_einsum.
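A quick count of scalar multiplications shows how large the gap can be. The shapes below are illustrative, and matmul_cost is a helper defined for this example that counts multiplications for a dense product.

```python
def matmul_cost(m, k, n):
    """Scalar multiplications for an (m x k) times (k x n) dense product."""
    return m * k * n

# Illustrative shapes: A is 10 x 1000, B is 1000 x 10, C is 10 x 1000.
a, b, c = (10, 1000), (1000, 10), (10, 1000)

# (A · B) · C: a 10 x 10 intermediate, then a 10 x 1000 result.
left_first = matmul_cost(a[0], a[1], b[1]) + matmul_cost(a[0], b[1], c[1])

# A · (B · C): a 1000 x 1000 intermediate, then a 10 x 1000 result.
right_first = matmul_cost(b[0], b[1], c[1]) + matmul_cost(a[0], a[1], c[1])

ratio = right_first // left_first
```

Here the two orders differ by a factor of 100 in multiplications, and the second also materializes a 1000 × 1000 intermediate. In NumPy, np.einsum_path can report a contraction order and its estimated cost for a given einsum expression.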

19. Cross-references

Linear algebra background — [[Math/linear-algebra-essentials]], [[Math/svd-pca-spectral]], [[Math/numerical-linear-algebra]].

Calculus background — [[Math/multivariate-calculus]], [[Math/eigenvalue-problems]].

Lie groups (SO(3), SE(3)) on which tensors live — [[Math/lie-groups-so3-se3]].

Math hub — [[Math/_index]].

Physics and engineering applications:

[[Engineering/mechanics-of-materials]] — stress and strain tensors in elasticity.
[[Engineering/electromagnetics-engineering]] — Maxwell field tensor.
[[Robotics/dynamics-rigid-body]] — inertia tensor.

ML / compute applications:

[[Compute/transformer-architecture]] — attention as batched tensor contractions.
[[Compute/inference-optimization]] — tensor cores, FlashAttention, mixed precision.

20. Citations

Ricci-Curbastro, G. and Levi-Civita, T. (1900). Méthodes de calcul différentiel absolu et leurs applications. Mathematische Annalen 54. — origin of the absolute differential calculus (tensor calculus).
Einstein, A. (1916). Die Grundlage der allgemeinen Relativitätstheorie (“The Foundation of the General Theory of Relativity”). Annalen der Physik 49. — introduces the summation convention; tensor formulation of GR.
Misner, C. W.; Thorne, K. S.; Wheeler, J. A. (1973). Gravitation. W. H. Freeman. — the standard reference on GR (“MTW”, the “phone book”; reissued by Princeton 2017).
Lee, J. M. (2012). Introduction to Smooth Manifolds, 2nd edition. Graduate Texts in Mathematics 218. Springer.
Carroll, S. M. (2003). Spacetime and Geometry: An Introduction to General Relativity. Addison-Wesley.
Spivak, M. (1965). Calculus on Manifolds. Westview Press. — classic compact treatment of differential forms and Stokes theorem.
do Carmo, M. P. (1992). Riemannian Geometry. Birkhäuser.
Wald, R. M. (1984). General Relativity. University of Chicago Press.
Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika 31(3).
Harshman, R. A. (1970). Foundations of the PARAFAC procedure. UCLA Working Papers in Phonetics 16.
Oseledets, I. V. (2011). Tensor-train decomposition. SIAM Journal on Scientific Computing 33(5).
Kolda, T. G.; Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review 51(3). — definitive survey.
Hillar, C. J.; Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM 60(6).
White, S. R. (1992). Density matrix formulation for quantum renormalization groups. Physical Review Letters 69(19). — DMRG, foundation of MPS methods.
Vidal, G. (2007). Entanglement renormalization. Physical Review Letters 99(22). — MERA.
Vaswani, A. et al. (2017). Attention is all you need. NeurIPS. — transformer; einsum-friendly attention.
Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS.
Smith, D. G. A.; Gray, J. (2018+). opt_einsum documentation. https://optimized-einsum.readthedocs.io/.
NVIDIA (2017–2024). Tensor Core whitepapers for Volta, Turing, Ampere, Hopper, Blackwell architectures.
Kjolstad, F. et al. (2017). The Tensor Algebra Compiler. OOPSLA.
Tillet, P.; Kung, H. T.; Cox, D. (2019). Triton: An intermediate language and compiler for tiled neural network computations. MAPL workshop.
Levi-Civita, T. (1900). Sur l’écart géodésique. — origin of the Levi-Civita symbol and connection.
Cauchy, A.-L. (1822). Lectures on continuum mechanics introducing the stress tensor.
Riemann, B. (1854). Über die Hypothesen, welche der Geometrie zu Grunde liegen. Habilitation lecture; foundation of Riemannian geometry.
