Multivariate Calculus — Math Reference

1. At a glance

Multivariate calculus is calculus on functions of multiple variables: f: ℝⁿ → ℝ (scalar fields), F: ℝⁿ → ℝᵐ (vector fields), and the operators that act on them. Single-variable calculus generalizes in two directions simultaneously: more input variables (partial derivatives, gradients) and more output components (Jacobians). Higher-order behavior produces Hessians, Taylor expansions, and second-order optimization. Integration extends to multiple, line, and surface integrals, and the fundamental theorem of calculus generalizes through Green’s, Stokes’, and the Divergence theorems — all unified as ∫_∂M ω = ∫_M dω in the language of differential forms.

Foundation of:

  • Optimization — gradient ∇f drives descent; Hessian H drives Newton; convexity is a Hessian condition.
  • Vector analysis — divergence, curl, Laplacian describe how vector fields flow, rotate, and diffuse.
  • Physics — fluid mechanics (Navier-Stokes uses ∇, ∇·, ∇²), electromagnetism (Maxwell’s equations are div + curl statements), general relativity (curvature is a multivariate construct).
  • Machine learning — backpropagation is the chain rule applied to Jacobians on a computational graph; every gradient-based optimizer rests on multivariate calculus.
  • Robotics — the Jacobian of forward kinematics maps joint velocities to end-effector velocities; its inverse drives inverse kinematics.
  • Computer graphics — gradients of signed distance fields produce surface normals; ray marching relies on ∇ for stepping.
  • Statistics — Laplace approximations use the Hessian of log-likelihoods; Fisher information is an expected Hessian.

This note is a working reference: definitions, formulas, identities, theorems, and the bridge to automatic differentiation.

2. Partial derivatives

The partial derivative of f(x₁, …, xₙ) with respect to xᵢ holds the other variables constant:

∂f/∂xᵢ (a) = lim_{h→0} [f(a₁, …, aᵢ+h, …, aₙ) − f(a)] / h

Notation: ∂f/∂x, f_x, ∂_x f, or D_i f. The “del” symbol ∂ (curly d) distinguishes the partial from the total derivative d/dx.

Higher-order partials

∂²f/∂x∂y = ∂/∂x (∂f/∂y), ∂²f/∂x² = ∂/∂x (∂f/∂x)

Schwarz / Clairaut theorem (Clairaut 1740, Schwarz 1873)

If the second-order partials are continuous in a neighborhood, mixed partials commute:

∂²f/∂x∂y = ∂²f/∂y∂x

This is the foundation of the symmetry of the Hessian and the closure relation d² = 0 in differential forms. The continuity hypothesis matters: pathological counterexamples exist (e.g., f(x,y) = xy(x²−y²)/(x²+y²) at origin).

Total derivative vs partial

For y = f(x(t), t), the total derivative through t is:

dy/dt = ∂f/∂x · dx/dt + ∂f/∂t

The total derivative accounts for all paths; the partial accounts for only the explicit dependence.

3. Gradient

For f: ℝⁿ → ℝ, the gradient is the column vector of partials:

∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)ᵀ

Properties:

  • ∇f(a) points in the direction of steepest ascent at a.
  • ‖∇f(a)‖ = the maximum rate of ascent.
  • ∇f is perpendicular to level sets {x : f(x) = c}.
  • If ∇f(a) = 0, a is a stationary point (critical point).

Tangent plane / linearization

The first-order Taylor expansion gives the tangent (hyper)plane to z = f(x) at x₀:

z = f(x₀) + ∇f(x₀)ᵀ (x − x₀)

This is exact for linear f and a good local approximation for smooth f.

Gradient as a covector

Strictly speaking, ∇f is a covector (1-form df) in the cotangent space — it eats a vector and returns a scalar. The metric tensor (Cartesian: identity) converts between covectors and vectors. This distinction matters in non-Euclidean geometry and in tensor calculus.

4. Directional derivative

The rate of change of f at a in direction u (unit vector ‖u‖ = 1):

D_u f(a) = lim_{h→0} [f(a + hu) − f(a)] / h = ∇f(a) · u

For non-unit v, divide by ‖v‖: D_v f / ‖v‖ = ∇f · v̂.

The maximum directional derivative at a is ‖∇f(a)‖, achieved when u = ∇f(a) / ‖∇f(a)‖. The minimum is −‖∇f(a)‖ (steepest descent). Along level sets, D_u f = 0.

5. Jacobian

For f: ℝⁿ → ℝᵐ with components f₁, …, f_m, the Jacobian matrix is the m × n matrix of all first partials:

Jf(x) = ∂f/∂x = [∂fᵢ/∂xⱼ] (1 ≤ i ≤ m, 1 ≤ j ≤ n)

Layouts:

  • Numerator layout (Jacobian above): rows index outputs i, columns index inputs j. Standard in physics, robotics, and most math texts.
  • Denominator layout: transpose of the above (gradient is a column). Common in statistics and some ML texts. Choose one and stick with it.

The Jacobian is the best linear approximation of f at x:

f(x + h) ≈ f(x) + Jf(x) · h

For m = 1 (scalar output), Jf is the transpose of the gradient: Jf = (∇f)ᵀ. For m = n, det(Jf) measures local volume scaling — central to change of variables in integration.

Roles

  • Robotics kinematics — joint velocities → end-effector velocities (rate kinematics, Whitney 1969): ẋ = J(q) q̇.
  • Inverse kinematics — solve J q̇ = ẋ; singularities are det(J) = 0.
  • Force transmission — torque ↔ wrench: τ = Jᵀ F (the principle of virtual work).
  • Autodiff — every operation in a computational graph has a local Jacobian; the chain rule composes them.
  • Change of variables in integration — |det Jg|.

6. Hessian

For f: ℝⁿ → ℝ, the Hessian is the n × n matrix of second partials:

H = ∇²f = [∂²f / ∂xᵢ ∂xⱼ]

By Clairaut’s theorem (Section 2), H is symmetric for smooth f.

Roles

  • Local curvature of f at x.
  • Second-order optimization — Newton’s method: x_{k+1} = x_k − H⁻¹ ∇f.
  • Convexity — f convex on a region iff H ⪰ 0 (positive semidefinite) there; strictly convex iff H ≻ 0 (positive definite) (Boyd + Vandenberghe 2004).
  • Classification of stationary points:
    • H positive definite → local minimum.
    • H negative definite → local maximum.
    • H indefinite (mixed eigenvalue signs) → saddle point.
    • H singular → second-order test inconclusive, need higher order.
  • Laplace approximation in Bayesian inference — approximate posterior by a Gaussian centered at MAP with covariance H⁻¹.

Eigendecomposition of H

H = QΛQᵀ with Q orthogonal, Λ diagonal of eigenvalues. The eigenvectors give the principal axes of curvature; eigenvalues give curvatures along those axes. In ML, “ill-conditioned” loss landscapes have large eigenvalue spread (κ = λ_max / λ_min).

7. Chain rule (multivariate)

For y = f(x), x = g(t) with f: ℝⁿ → ℝ, g: ℝ → ℝⁿ:

dy/dt = ∇f(g(t)) · g’(t) = Σᵢ (∂f/∂xᵢ) (dxᵢ/dt)

Full matrix form

For composition h = f ∘ g with f: ℝⁿ → ℝᵐ and g: ℝᵏ → ℝⁿ:

J_h(t) = J_f(g(t)) · J_g(t) (m × n times n × k = m × k)

Backpropagation as chain rule

In a feedforward neural net with layers a^(L) = σ(W^(L) a^(L−1) + b^(L)) and loss L = ℓ(a^(L), y):

∂L/∂W^(L) = δ^(L) (a^(L−1))ᵀ δ^(L) = ∇_a ℓ ⊙ σ’(z^(L)) δ^(ℓ) = ((W^(ℓ+1))ᵀ δ^(ℓ+1)) ⊙ σ’(z^(ℓ))

These are direct consequences of the chain rule on a layered Jacobian (Rumelhart + Hinton + Williams 1986, Nature). The mathematical content of backprop is multivariate chain rule; the engineering content is reverse-mode accumulation. See Section 18 for the AD perspective.

Vector chain rule conventions

  • Numerator layout: J_{f∘g} = J_f · J_g (rows-by-columns).
  • Denominator layout: J_{f∘g} = J_gᵀ · J_fᵀ (transposed). The PyTorch autograd convention is denominator-like for gradients.

8. Taylor series

For f: ℝⁿ → ℝ smooth at x₀:

f(x₀ + h) = f(x₀) + ∇f(x₀)ᵀ h + ½ hᵀ H(x₀) h + O(‖h‖³)

The second-order truncation is the workhorse for optimization: minimizing the quadratic approximation gives Newton’s direction p = −H⁻¹ ∇f. When H ≻ 0 this direction is a descent direction; when H is indefinite, use trust-region or modified-Newton variants.

The general k-th order term involves the k-th order tensor of partials (symmetric if smooth) contracted with h^⊗k.

9. Optimization classics

Unconstrained

  1. Necessary condition for local extremum: ∇f(x*) = 0 (stationary point).
  2. Second-order classification at stationary point:
    • H ≻ 0 → local minimum (sufficient).
    • H ≺ 0 → local maximum (sufficient).
    • H indefinite → saddle point.
    • H singular → inconclusive.

Lagrange multipliers (Lagrange 1788)

For equality-constrained optimization: maximize f(x) subject to g(x) = 0. The Lagrangian L(x, λ) = f(x) − λᵀ g(x). At an optimum:

∇_x L = 0 ⟹ ∇f = J_gᵀ λ ∇_λ L = 0 ⟹ g(x) = 0

Geometric interpretation: at the optimum, ∇f lies in the span of the constraint gradients ∇gᵢ. Equivalently, the level set of f is tangent to the constraint manifold.

KKT conditions (Karush 1939, Kuhn + Tucker 1951)

For inequality-constrained: minimize f(x) s.t. gᵢ(x) ≤ 0, hⱼ(x) = 0. At an optimum:

Stationarity: ∇f + Σ μᵢ ∇gᵢ + Σ λⱼ ∇hⱼ = 0 Primal feas.: gᵢ(x) ≤ 0, hⱼ(x) = 0 Dual feas.: μᵢ ≥ 0 Compl. slack.: μᵢ gᵢ(x) = 0

KKT is necessary under constraint qualifications and sufficient for convex problems. See [[Math/convex-optimization]] for the full treatment.

10. Multiple integrals

Double integral

For R ⊂ ℝ² and f: R → ℝ continuous:

∫∫_R f(x, y) dA = lim_{‖P‖→0} Σ f(xᵢ*, yᵢ*) ΔAᵢ

Iterated form (Type I region): ∫_a^b ∫_{g(x)}^{h(x)} f(x,y) dy dx.

Triple integral

∫∫∫_V f(x,y,z) dV

Fubini’s theorem (Fubini 1907)

If f is integrable on R × S (product domain), the iterated integrals exist and agree:

∫∫_{R×S} f dA = ∫_R (∫_S f(x,y) dy) dx = ∫_S (∫_R f(x,y) dx) dy

For continuous f on a bounded box this is automatic; for Lebesgue integration the precise statement is the Fubini-Tonelli theorem.

Change of variables

For a smooth bijection g: S → R with Jacobian J_g:

∫_R f(x) dx = ∫_S f(g(u)) |det J_g(u)| du

This is THE multivariate generalization of u-substitution. The |det J_g| factor accounts for how g locally distorts volume.

Standard coordinate changes:

  • Polar (r, θ): x = r cos θ, y = r sin θ; dA = r dr dθ.
  • Cylindrical (r, θ, z): dV = r dr dθ dz.
  • Spherical (ρ, θ, φ) [physics convention: θ azimuth, φ polar]: x = ρ sin φ cos θ, y = ρ sin φ sin θ, z = ρ cos φ; dV = ρ² sin φ dρ dφ dθ.

11. Line integrals

Scalar field along curve C

For f: ℝⁿ → ℝ and C parameterized by r(t), a ≤ t ≤ b:

∫_C f ds = ∫_a^b f(r(t)) ‖r’(t)‖ dt

ds is arc length. Used for: mass of a wire with linear density f, average value along a path.

Vector field along curve C (work integral)

For F: ℝⁿ → ℝⁿ:

∫_C F · dr = ∫_a^b F(r(t)) · r’(t) dt

Physical interpretation: work done by force F moving a particle along C.

Conservative fields

F is conservative iff F = ∇φ for some scalar potential φ. Equivalences (on a simply-connected domain):

F = ∇φ ⟺ ∇ × F = 0 ⟺ ∮_C F · dr = 0 for every closed C

For a conservative field:

∫_C F · dr = φ(end) − φ(start) (the gradient theorem)

Path-independence is the multivariate analog of the Fundamental Theorem of Calculus.

12. Surface integrals

Scalar field on surface S

For S parameterized by r(u,v), (u,v) ∈ D:

∫∫_S f dS = ∫∫_D f(r(u,v)) ‖r_u × r_v‖ du dv

‖r_u × r_v‖ is the area element. Used for: mass of a thin shell, average over a surface.

Flux integral

For vector field F and oriented surface S with unit normal n:

Φ = ∫∫_S F · n dS = ∫∫_D F(r(u,v)) · (r_u × r_v) du dv

The sign depends on orientation (choice of n). Physical interpretation: rate at which F flows through S. Examples: water flow through a membrane, electric flux through a closed surface (Gauss’s law).

13. Vector calculus operators

For F = (P, Q, R) and scalar f in ℝ³:

Divergence (Heaviside 1880s, named earlier)

∇ · F = ∂P/∂x + ∂Q/∂y + ∂R/∂z

Scalar field. Interpretation: net outflow per unit volume at a point — “source density.” ∇ · F > 0 means F is locally expanding; < 0 means contracting; = 0 means F is divergence-free (incompressible, solenoidal).

Curl

∇ × F = (∂R/∂y − ∂Q/∂z, ∂P/∂z − ∂R/∂x, ∂Q/∂x − ∂P/∂y)

Vector field; defined in 3D specifically. Interpretation: rotational density — the axis and rate of local rotation of F. ∇ × F = 0 means F is irrotational (conservative on simply-connected domains).

In 2D, curl reduces to a scalar: (∂Q/∂x − ∂P/∂y). In n ≠ 3 dimensions, the proper generalization is the exterior derivative of a 1-form.

Laplacian

∇²f = ∇ · ∇f = ∂²f/∂x² + ∂²f/∂y² + ∂²f/∂z²

For vector fields, the vector Laplacian is component-wise: (∇²F)ᵢ = ∇² Fᵢ.

The Laplacian appears everywhere in physics: heat equation (∂u/∂t = α ∇²u), wave equation (∂²u/∂t² = c² ∇²u), Laplace equation (∇²u = 0, electrostatics in vacuum), Poisson equation (∇²u = −ρ/ε₀), Schrödinger equation.

Vector identities

∇ × (∇f) = 0 (curl of grad = 0) ∇ · (∇ × F) = 0 (div of curl = 0) ∇ · (∇f) = ∇²f (div of grad = Laplacian) ∇ × (∇ × F) = ∇(∇ · F) − ∇²F (curl-curl identity) ∇ · (fF) = f ∇ · F + F · ∇f ∇ × (fF) = f ∇ × F + ∇f × F ∇ · (F × G) = G · (∇ × F) − F · (∇ × G)

Both d² = 0 identities (curl-grad and div-curl) are special cases of the fact that d² = 0 for exterior derivatives — the foundation of de Rham cohomology.

14. Fundamental theorems

These are all special cases of the generalized Stokes’ theorem ∫_∂M ω = ∫_M dω (Cartan ca. 1899, Élie Cartan), which unifies them in the language of differential forms.

Gradient theorem (FTC for line integrals)

For smooth f and curve C from a to b:

∫_C ∇f · dr = f(b) − f(a)

Green’s theorem (Green 1828)

For a positively-oriented closed curve C bounding a region D in the plane:

∮_C P dx + Q dy = ∫∫_D (∂Q/∂x − ∂P/∂y) dA

Equivalent to 2D Stokes (curl form) and 2D divergence theorem (flux form).

Stokes’ theorem (Stokes 1854, originally Kelvin)

For an oriented surface S with boundary ∂S = C:

∮_C F · dr = ∫∫_S (∇ × F) · n dS

Circulation along the boundary = flux of curl through the surface.

Divergence theorem / Gauss’s theorem (Gauss 1813, Ostrogradsky 1826)

For a solid V with closed boundary ∂V = S (outward normal):

∫∫_S F · n dS = ∫∫∫_V (∇ · F) dV

Net flux out of V = total source density inside V. This is the math behind Gauss’s law in electrostatics: ∮ E · dA = Q_enc / ε₀.

General Stokes (Cartan)

∫_∂M ω = ∫_M dω

Here M is an oriented manifold with boundary ∂M, ω is a (k−1)-form, dω is its exterior derivative (a k-form). Setting M = curve, surface, or solid and ω appropriately recovers each special case above.

15. Coordinate systems

Cartesian (x, y, z)

The default. dV = dx dy dz, ∇ = (∂_x, ∂_y, ∂_z), metric g_ij = δ_ij.

Cylindrical (r, θ, z)

x = r cos θ, y = r sin θ, z = z.

dV = r dr dθ dz ∇f = (∂f/∂r, (1/r) ∂f/∂θ, ∂f/∂z) ∇·F = (1/r) ∂(rF_r)/∂r + (1/r) ∂F_θ/∂θ + ∂F_z/∂z ∇²f = (1/r) ∂/∂r (r ∂f/∂r) + (1/r²) ∂²f/∂θ² + ∂²f/∂z²

Useful for problems with axial symmetry: pipes, cylinders, vortex flows.

Spherical (ρ, θ, φ)

[Physics convention: ρ radial, θ azimuthal (0 to 2π), φ polar (0 to π).]

x = ρ sin φ cos θ, y = ρ sin φ sin θ, z = ρ cos φ dV = ρ² sin φ dρ dφ dθ ∇²f = (1/ρ²) ∂/∂ρ (ρ² ∂f/∂ρ) + (1/(ρ² sin φ)) ∂/∂φ (sin φ ∂f/∂φ) + (1/(ρ² sin²φ)) ∂²f/∂θ²

Useful for problems with radial symmetry: planets, atoms, gravitational and electrostatic fields. (Mathematicians often swap θ and φ — always check.)

Curvilinear / tensor calculus

For a general curvilinear system x^i:

  • Metric tensor g_ij encodes the geometry: ds² = g_ij dx^i dx^j (Einstein summation).
  • Covariant components V_i = g_ij V^j vs contravariant V^i.
  • Christoffel symbols Γ^k_ij = ½ g^kl (∂_i g_jl + ∂_j g_il − ∂_l g_ij) describe how coordinate frames twist.
  • Covariant derivative ∇_i V^j = ∂_i V^j + Γ^j_ik V^k generalizes ∂ to curved spaces. This is the basis of general relativity and Riemannian geometry.

16. Implicit Function Theorem

If F: ℝⁿ × ℝᵐ → ℝᵐ is C¹, F(a, b) = 0, and the partial Jacobian ∂F/∂y |_{(a,b)} is invertible, then near (a, b) there is a unique C¹ function y = g(x) such that F(x, g(x)) = 0 with g(a) = b, and:

∂g/∂x = −(∂F/∂y)⁻¹ (∂F/∂x)

In one scalar variable: if F(x, y) = 0 and F_y ≠ 0, then locally y = g(x) and:

dy/dx = −F_x / F_y

Used to extract level-set surfaces, to differentiate implicitly-defined functions, and in equilibrium analysis (perturb a parameter and track how equilibria move).

17. Inverse Function Theorem

If f: ℝⁿ → ℝⁿ is C¹ at a and J_f(a) is nonsingular (det J_f(a) ≠ 0), then f is locally invertible near a, the inverse is C¹, and:

J_{f⁻¹}(f(a)) = (J_f(a))⁻¹

Consequences: a smooth map is locally a diffeomorphism iff its Jacobian is nonsingular. Robotics singularities are precisely points where the kinematic Jacobian becomes singular — there, small end-effector motions require unbounded joint velocities.

18. Automatic differentiation (AD)

AD computes exact derivatives of programs by applying the chain rule mechanically to elementary operations (Wengert 1964, Speelpenning 1980 for reverse-mode; Linnainmaa 1970 for the algorithm in numerical analysis). Distinct from symbolic differentiation (manipulates expressions) and numerical differentiation (finite differences).

Forward mode

Propagate derivatives forward through the computation graph. For y = f(g(h(x))):

ḣ = h’(x) · ẋ ġ = g’(h) · ḣ ẏ = f’(g) · ġ

Cost: roughly n times forward pass if there are n input variables. Best when input dimension n is small and output dimension m is large.

Implementation: dual numbers x + ε ẋ where ε² = 0. Every operation is overloaded:

(a + εȧ) + (b + εḃ) = (a+b) + ε(ȧ + ḃ) (a + εȧ) · (b + εḃ) = ab + ε(ȧb + aḃ) f(a + εȧ) = f(a) + ε f’(a) ȧ

After evaluation, the ε-coefficient is exactly the derivative.

Reverse mode (backpropagation)

Record the computation as a tape (or computational graph) during the forward pass. After computing y = f(x), traverse the tape backward, computing adjoints ∂y/∂(intermediate). For y = f(g(h(x))):

∂y/∂g = f’(g) ∂y/∂h = ∂y/∂g · g’(h) ∂y/∂x = ∂y/∂h · h’(x)

Cost: roughly m times forward pass if there are m outputs. Best when output dimension m is small (often m = 1, a scalar loss) and input dimension n is huge (often millions of parameters). This is the regime of ML, hence backpropagation.

The fundamental theorem of reverse-mode AD: for a scalar loss L and parameters θ ∈ ℝⁿ, one backward pass computes ∇_θ L in time proportional to one forward pass, regardless of n. This O(1) blowup over forward cost is why deep learning is feasible.

Frameworks (as of 2026)

  • PyTorch autograd (Paszke et al. 2017+) — eager mode, dynamic tape per forward call; reverse-mode by default; torch.func adds forward-mode + vmap
    • jacrev / jacfwd.
  • JAX (Bradbury et al. 2018+) — jax.grad (reverse), jax.jacfwd / jax.jacrev, jax.jvp (forward) / jax.vjp (reverse); pure functional style; compiles via XLA.
  • TensorFlow GradientTape — explicit context manager; supports persistent tapes and higher-order.
  • autograd (Maclaurin + Duvenaud + Adams 2015 ICML workshop) — original Python AD by function overloading; spiritual predecessor to JAX.
  • Zygote.jl (Innes 2018) — source-to-source reverse AD in Julia.
  • Enzyme (Moses + Churavy 2020) — AD at the LLVM IR level; works across languages.
  • Mojo / MLIR-based AD systems emerging in 2024-2026.

Higher-order AD

  • Hessian-vector product Hv = ∇(∇f · v): one forward + one reverse, O(forward).
  • Full Hessian: reverse over reverse (or forward over reverse), O(n × forward).
  • Often Hv is enough — used in conjugate-gradient Newton, trust-region, and Lanczos for spectrum.

Checkpointing (Griewank + Walther 2000)

Reverse-mode needs to store activations for the backward pass. For very deep nets, memory blows up. Gradient checkpointing: drop intermediate activations and recompute on demand during backward. Trades compute for memory; revolutionized training of very deep models (transformer training, diffusion models).

19. Symbolic differentiation

Symbolic systems (SymPy in Python, Mathematica, Maple, Maxima) apply rules to manipulate expressions and produce closed-form derivatives. Strengths: exact algebraic form, simplification, suitable for hand-derived formulas. Weaknesses: expression swell — derivatives can grow exponentially in expression size, even when the underlying derivative is cheap to compute. AD avoids this by operating on the program rather than the expression.

Use symbolic differentiation for: deriving formulas you want to read, LaTeX-quality output, equations you’ll port to other tools. Use AD for: training models, optimization, anything where you only need the numerical gradient.

20. Numerical differentiation

Finite-difference approximations:

Forward: f’(x) ≈ (f(x+h) − f(x)) / h (error O(h)) Central: f’(x) ≈ (f(x+h) − f(x−h)) / (2h) (error O(h²)) Complex step (Squire + Trapp 1998): f’(x) ≈ Im(f(x + ih)) / h (error O(h²), no subtractive cancellation)

For double precision (ε ≈ 1e−16), the truncation error vs roundoff trade-off gives optimal h ≈ √ε ≈ 1e−8 for forward differences, h ≈ ε^(1/3) ≈ 1e−5 for central. Complex step doesn’t suffer subtractive cancellation, so h can be 1e−200; gives near-machine precision.

When to use: gradient checks against AD (smoke-test correctness); legacy or black-box code with no AD support; quick exploratory work. Not for production gradients — accuracy is poor and the cost is O(n) function evaluations.

21. Applications

Machine learning

  • Backprop = reverse-mode AD on a layered computational graph.
  • Optimizers — SGD, Adam, RMSProp all use ∇ (and Adam estimates a diagonal proxy for H). See [[Math/gradient-descent-variants]].
  • Second-order — K-FAC, Shampoo approximate the Fisher / Hessian using block structure; Hessian-free methods use Hv products.
  • Higher derivatives — meta-learning (MAML), implicit differentiation, neural ODEs all need second-order autodiff.

Robotics

  • Forward kinematics: end-effector pose = product of DH transforms. See [[Robotics/kinematics-dh]].
  • Velocity kinematics: ẋ = J(q) q̇.
  • Inverse kinematics: q̇ = J⁺ ẋ + (I − J⁺ J) z (damped least squares for singularities).
  • Force/torque: τ = Jᵀ F (virtual work).
  • Dynamics: Euler-Lagrange equations use partial derivatives of the Lagrangian L = T − V.

Sensitivity analysis

For a system y = f(x; θ) parameterized by θ, the sensitivity ∂y/∂θ tells you how robust y is to parameter perturbations. AD gives exact sensitivities; this underlies uncertainty quantification, design optimization, and adjoint methods in PDE-constrained optimization.

Electromagnetism

Maxwell’s equations are pure div + curl statements:

∇ · E = ρ/ε₀ (Gauss’s law) ∇ · B = 0 (no magnetic monopoles) ∇ × E = −∂B/∂t (Faraday’s law) ∇ × B = μ₀ J + μ₀ ε₀ ∂E/∂t (Ampère-Maxwell)

See [[Engineering/electromagnetics-engineering]].

Fluid mechanics

The Navier-Stokes equations for incompressible flow:

∇ · u = 0 ∂u/∂t + (u · ∇)u = −∇p/ρ + ν ∇²u + f

A nonlinear PDE system in ∇, ∇·, and ∇². See [[Engineering/fluid-mechanics]].

Heat transfer

Heat equation: ∂u/∂t = α ∇²u. Steady-state: Laplace ∇²u = 0.

Computer graphics

  • Surface normals = ∇f for level-set surface f = 0.
  • Signed distance fields — gradient gives normal; magnitude = 1 (eikonal equation: ‖∇d‖ = 1).
  • Ray marching — step along ∇ of SDF to reach surfaces.
  • Shading — Phong / PBR models use dot products with surface normals (which are gradients).

Optimization

  • First-order: gradient descent, momentum, Adam (use ∇ only).
  • Second-order: Newton, BFGS, L-BFGS, Gauss-Newton (use H or its approximation).
  • Constrained: Lagrange + KKT.
  • See [[Math/convex-optimization]].

Physics — Hamiltonian + Lagrangian

  • Lagrangian mechanics: Euler-Lagrange equations d/dt (∂L/∂q̇) − ∂L/∂q = 0.
  • Hamiltonian mechanics: q̇ = ∂H/∂p, ṗ = −∂H/∂q. Symplectic structure preserved by Hamilton’s flow.

Differential geometry

Multivariate calculus generalizes to manifolds: locally-Euclidean spaces with smooth structure. Tangent vectors, cotangent vectors, differential forms, exterior derivative, integration on manifolds, de Rham cohomology, Stokes’ theorem in its general form. Foundation of general relativity, gauge theory, modern geometric mechanics.

22. Pitfalls

  1. Forgetting the Jacobian determinant in change of variables. The single most common multivariate integration error: ∫ f(r cos θ, r sin θ) dr dθ is wrong — it should be ∫ f(r cos θ, r sin θ) r dr dθ.

  2. Layout confusion — numerator vs denominator layout. In numerator layout, ∂(Ax)/∂x = A; in denominator, ∂(Ax)/∂x = Aᵀ. Doctrine: pick one, write it down, never mix.

  3. Sign errors in curl/div. Curl uses a cyclic determinant; getting the sign wrong is easy. Always verify on a known field (e.g., F = (−y, x, 0) should give ∇ × F = (0, 0, 2)).

  4. Singular Jacobians at robotic singularities. Inverse kinematics via J⁻¹ blows up at det(J) = 0. Use damped least squares: q̇ = (JᵀJ + λ²I)⁻¹ Jᵀ ẋ.

  5. Numerical differentiation noise. Catastrophic cancellation makes FD unreliable in floating-point. Use AD or complex step.

  6. Memory blow-up in reverse-mode AD. Storing all activations of a deep network can exhaust GPU memory. Use gradient checkpointing, mixed precision, or activation recomputation.

  7. Confusing partial with total derivative. ∂L/∂t (treats other variables as constants) ≠ dL/dt (accounts for chain through other variables). Critical in PDEs and ODEs.

  8. Non-simply-connected domains and conservativity. ∇ × F = 0 implies conservative only on simply-connected domains. The classic counterexample F = (−y, x) / (x² + y²) on the punctured plane.

  9. Orientation in surface integrals. Flux integrals depend on the choice of normal direction; flipping it flips the sign. Always specify orientation (outward, upward, or by parameterization).

  10. Mixed partial commutativity assumed without smoothness. Pathological f with discontinuous second partials at a point can have ∂²f/∂x∂y ≠ ∂²f/∂y∂x at that point.

  11. Forgetting the metric in non-Cartesian coordinates. The “gradient” in spherical isn’t just (∂_ρ, ∂_θ, ∂_φ) — there are scale factors. Use standard formulas or work in basis-free notation.

  12. Higher-order AD performance traps. grad(grad(f)) can be much slower than necessary; in JAX, prefer jvp(grad(f)) or hessian directly.

23. Cross-references

  • [[Math/linear-algebra-essentials]] — vectors, matrices, eigendecomposition; the algebraic substrate of gradients and Jacobians.
  • [[Math/convex-optimization]] — KKT, duality, convex sets and functions.
  • [[Math/gradient-descent-variants]] — SGD, momentum, Adam, second-order methods.
  • [[Math/ode-numerical-methods]] — Runge-Kutta, stiff solvers, adjoint method (uses chain rule + reverse AD).
  • [[Math/_index]] — math reference index.
  • [[Robotics/kinematics-dh]] — Denavit-Hartenberg parameters, forward and inverse kinematics, Jacobian-based control.
  • [[Engineering/electromagnetics-engineering]] — Maxwell’s equations, EM field theory; div + curl + Laplacian in action.
  • [[Engineering/fluid-mechanics]] — Navier-Stokes, vorticity (curl of velocity), stream functions.
  • [[Compute/transformer-architecture]] — backprop through attention, layer norm, MLP blocks; reverse-mode AD at scale.

24. Citations

  • Clairaut, A.C. 1740. Recherches sur les courbes à double courbure. (Mixed partial equality).
  • Schwarz, H.A. 1873. Communication on equality of mixed partials.
  • Lagrange, J.-L. 1788. Mécanique Analytique. (Lagrange multipliers, the Lagrangian formulation of mechanics).
  • Gauss, C.F. 1813. Theory of biquadratic residues. (Divergence theorem, independently by Ostrogradsky 1826).
  • Green, G. 1828. An Essay on the Application of Mathematical Analysis to the Theories of Electricity and Magnetism.
  • Stokes, G.G. 1854. Smith’s Prize Examination. (Stokes’ theorem; originally suggested by Kelvin in correspondence).
  • Cartan, É. 1899. Sur certaines expressions différentielles et le problème de Pfaff. (Exterior differential forms, general Stokes).
  • Fubini, G. 1907. Sugli integrali multipli. (Fubini’s theorem).
  • Karush, W. 1939. Minima of functions of several variables with inequalities as side conditions. M.Sc. thesis, U. Chicago. Kuhn, H.W. + Tucker, A.W. 1951. Nonlinear programming. Proc. 2nd Berkeley Symp.
  • Marsden, J.E. + Tromba, A.J. 2011. Vector Calculus, 6th ed. W.H. Freeman.
  • Stewart, J. 2020. Calculus: Early Transcendentals, 9th ed. Cengage.
  • Spivak, M. 1965. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. Benjamin.
  • Wengert, R.E. 1964. A simple automatic derivative evaluation program. CACM 7(8). (Forward-mode AD, first description).
  • Linnainmaa, S. 1970. The representation of the cumulative rounding error of an algorithm as a Taylor expansion. M.Sc. thesis, U. Helsinki. (Reverse-mode error analysis, the algorithmic precursor to backprop).
  • Speelpenning, B. 1980. Compiling fast partial derivatives of functions given by algorithms. Ph.D. thesis, UIUC. (Reverse-mode AD formalized).
  • Rumelhart, D.E. + Hinton, G.E. + Williams, R.J. 1986. Learning representations by back-propagating errors. Nature 323. (Backprop for neural networks).
  • Griewank, A. + Walther, A. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed. SIAM. (The AD reference text; checkpointing analysis).
  • Squire, W. + Trapp, G. 1998. Using complex variables to estimate derivatives of real functions. SIAM Review 40(1).
  • Maclaurin, D. + Duvenaud, D. + Adams, R.P. 2015. Autograd: Effortless gradients in NumPy. ICML AutoML Workshop.
  • Whitney, D.E. 1969. Resolved motion rate control of manipulators and human prostheses. IEEE Trans. Man-Machine Systems 10(2). (Jacobian-based robot control).
  • Boyd, S. + Vandenberghe, L. 2004. Convex Optimization. Cambridge UP.
  • Paszke, A. et al. 2017+. PyTorch: An imperative style, high-performance deep learning library. NeurIPS 2019.
  • Bradbury, J. et al. 2018+. JAX: composable transformations of Python+NumPy programs. github.com/google/jax.
  • Innes, M. 2018. Don’t unroll adjoint: Differentiating SSA-form programs. (Zygote.jl).
  • Moses, W. + Churavy, V. 2020. Instead of rewriting foreign code for machine learning, automatically synthesize fast gradients. NeurIPS. (Enzyme).