Stochastic Calculus & SDEs

1. At a Glance

Stochastic calculus extends ordinary calculus to processes driven by randomness — most famously the Brownian motion W_t whose sample paths are continuous but nowhere differentiable. Classical Riemann-Stieltjes integration fails because the driving signal has infinite total variation; Itô’s construction (1944) replaces it with an L² isometry over non-anticipating integrands. The resulting machinery — Itô’s lemma, stochastic differential equations (SDEs), the Fokker-Planck PDE, martingale theory — is the lingua franca of any field that models a noisy, time-evolving system.

Foundation of:

  • Quantitative finance — option pricing (Black-Scholes 1973), interest rate term structure (Vasicek 1977, CIR 1985), portfolio dynamics, risk-neutral pricing.
  • Statistical physics — Langevin equation (1908), Brownian motion (Einstein 1905), molecular dynamics, diffusion in porous media, plasma physics.
  • Mathematical biology — Wright-Fisher genetic drift, population dynamics, ion-channel noise on neuronal membranes (stochastic Hodgkin-Huxley).
  • Stochastic control — LQG (Linear Quadratic Gaussian) regulation, Hamilton-Jacobi-Bellman PDEs, Merton’s portfolio problem (Merton 1969).
  • Machine learning — diffusion generative models (Sohl-Dickstein 2015, Ho 2020, Song 2021), stochastic gradient Langevin dynamics (Welling + Teh 2011), neural SDEs (Li 2020), flow-matching (Lipman 2023).

This note covers the constructive theory (BM → Itô integral → Itô lemma → SDE → Fokker-Planck), the canonical SDEs, the Itô-vs-Stratonovich split, numerical solvers, and the application landscape spanning finance, physics, and modern ML.

2. Stochastic Processes — Overview

A stochastic process is a collection of random variables {X_t : t ∈ T} on a common probability space (Ω, F, P), indexed by a time parameter t. Each ω ∈ Ω realises a sample path t ↦ X_t(ω).

Classifications:

  • Discrete vs continuous time — T = ℕ (Markov chain, random walk) vs T = [0,∞) (Brownian motion, SDE solution).
  • Discrete vs continuous state — counting processes (Poisson) vs ℝⁿ-valued (BM, Itô diffusion).
  • Markov — conditional law of future given past = law given present only.
  • Martingale — fair-game property E[X_t | F_s] = X_s for s ≤ t.
  • Stationary — joint distribution invariant under time translation.
  • Gaussian — every finite-dimensional marginal multivariate normal.

Brownian motion is the unique (up to scaling) centred Gaussian process, started at 0, with continuous paths and stationary, independent increments; it is also Markov and a martingale. It is the central object of the theory.

3. Random Walk — The Discrete Prototype

Let ξ_1, ξ_2, … be iid with P(ξ_i = ±1) = ½. Define S_n = Σᵢ₌₁ⁿ ξᵢ. Then:

  • E[S_n] = 0; Var[S_n] = n.
  • Reflection principle: for integer a > 0, P(max_{k≤n} S_k ≥ a) = 2·P(S_n > a) + P(S_n = a).
  • By the central limit theorem, S_n / √n → N(0,1) in distribution.

Donsker’s invariance principle (1951): the linearly interpolated, rescaled path X_t^(n) (equal to S_{nt} / √n at the grid points t = k/n) converges weakly, as a process on C[0,1], to Brownian motion W_t. This is the rigorous bridge from discrete random walk to continuous BM and justifies BM as the universal scaling limit of any walk with iid zero-mean, finite-variance steps (after dividing by the step standard deviation).
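The scaling S_n / √n can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper name `rescaled_walk_endpoint` and the sample sizes are illustrative choices, not part of the text above.

```python
import numpy as np

def rescaled_walk_endpoint(n_steps, n_paths, rng):
    """Return S_n / sqrt(n) for n_paths independent +/-1 random walks."""
    steps = rng.choice(np.array([-1, 1], dtype=np.int8), size=(n_paths, n_steps))
    return steps.sum(axis=1) / np.sqrt(n_steps)

rng = np.random.default_rng(0)
x1 = rescaled_walk_endpoint(n_steps=1000, n_paths=20_000, rng=rng)

sample_mean = x1.mean()               # should be near 0
sample_var = x1.var()                 # should be near 1 = Var[W_1]
frac_below_one = np.mean(x1 <= 1.0)   # should be near Phi(1), about 0.84
```

Only the endpoint t = 1 is examined here; Donsker’s theorem is a statement about the whole path, so this is a consistency check rather than a demonstration of the full result.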

4. Brownian Motion / Wiener Process

History: Bachelier 1900 (PhD thesis, stock prices); Einstein 1905 (diffusion of small suspended particles); Wiener 1923 (rigorous mathematical construction). Hence “Wiener process” = “Brownian motion”.

Definition. A standard Brownian motion W_t is a stochastic process satisfying:

  1. W_0 = 0 (a.s.).
  2. Independent increments — for 0 ≤ t_0 < t_1 < … < t_n, the increments W_{t_k} − W_{t_{k-1}} are mutually independent.
  3. Gaussian increments — W_t − W_s ~ N(0, t − s) for s < t.
  4. Continuous sample paths — t ↦ W_t(ω) continuous for a.e. ω.

These four axioms uniquely determine BM (up to versions). Existence is non-trivial — Lévy’s construction (1948) builds W_t via a wavelet-style hierarchical interpolation; Kolmogorov’s continuity theorem provides the regularity.

Multi-dimensional: W_t = (W_t^1, …, W_t^d) where each component is an independent 1D BM.

5. Properties of Brownian Motion

Mean and variance

E[W_t] = 0, Var[W_t] = t, Cov(W_s, W_t) = min(s,t).

Quadratic variation

For a partition 0 = t_0 < t_1 < … < t_n = T with mesh δ → 0,

Σ_k (W_{t_{k+1}} − W_{t_k})² → T in L² (hence in probability), and almost surely along nested partitions such as the dyadic ones.

So [W, W]_T = T — the quadratic variation is deterministic and linear in time. This is the single algebraic fact that drives all of Itô calculus. Heuristically: (dW)² = dt.

Total variation, by contrast, is infinite almost surely — sample paths zigzag too violently for classical Riemann-Stieltjes integrals.
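Both statements are easy to see on a simulated path. A minimal sketch, assuming NumPy, with T and the number of increments chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 2.0, 200_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)   # increments ~ N(0, T/n)

quadratic_variation = np.sum(dW**2)    # concentrates around T as n grows
total_variation = np.sum(np.abs(dW))   # grows like sqrt(n), so it diverges
```

Refining the partition leaves the first sum near T while the second keeps growing, which is the numerical face of (dW)² = dt.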

Nowhere differentiability

For almost every ω, t ↦ W_t(ω) is continuous but has no derivative at any point. Sample paths are α-Hölder for every α < ½, but not for α = ½ (law of iterated logarithm).

Reflection principle

For a > 0, P(max_{s≤t} W_s ≥ a) = 2·P(W_t ≥ a) = 2·(1 − Φ(a/√t)). Used in barrier-option pricing and first-passage statistics.

Time-reversal and scaling symmetries

  • Scaling: (1/√c)·W_{ct} =_d W_t for any c > 0.
  • Time reversal: W̃_t := W_T − W_{T-t} is also a BM on [0,T].
  • Time inversion: t·W_{1/t} (with W̃_0 := 0) is a BM.

Strong Markov property

For any a.s.-finite stopping time τ, the process Ŵ_s := W_{τ+s} − W_τ is a BM independent of F_τ. This is the foundation of optimal stopping theory and option pricing for American claims.

Law of iterated logarithm

lim sup_{t→∞} W_t / √(2t·log log t) = 1 (a.s.) — the precise growth envelope of a BM path.

6. The Itô Integral

We want ∫₀ᵀ f(s, ω) dW_s. Classical Riemann-Stieltjes fails because W has infinite total variation. Itô 1944 constructed it as follows.

Step 1 — Simple processes. Let f(s,ω) = Σ_{k=0}^{n-1} ξ_k(ω)·1_{[t_k, t_{k+1})}(s), where ξ_k is F_{t_k}-measurable (i.e., non-anticipating / left-evaluated). Define

∫₀ᵀ f dW := Σ_k ξ_k·(W_{t_{k+1}} − W_{t_k}).

Step 2 — L² isometry. For simple f, E[(∫₀ᵀ f dW)²] = E[∫₀ᵀ f² ds]. This is Itô’s isometry — it identifies the stochastic integral with an isometric embedding of L²(dt × dP)_adapted into L²(dP).

Step 3 — Extension. Approximate any adapted f ∈ L²(dt × dP) by simple processes; the isometry gives a unique L²-limit. The result is the Itô integral I_t(f) := ∫₀ᵗ f dW, a continuous martingale in t with [I(f), I(f)]_t = ∫₀ᵗ f² ds.

Key: the choice of left endpoint in Step 1 is what makes ∫f dW a martingale. Using the midpoint gives Stratonovich (§11), which is not a martingale but obeys the classical chain rule.
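The effect of the evaluation point shows up already for the integrand f = W. A minimal sketch, assuming NumPy: the left-endpoint sum approximates the Itô value ½(W_T² − T), while the endpoint-average sum (used here as the discrete stand-in for the midpoint rule) telescopes to ½W_T².

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 100_000
dW = rng.normal(0.0, np.sqrt(T / n), size=n)
W = np.concatenate(([0.0], np.cumsum(dW)))        # W[k] is the path at t_k

ito_sum = np.sum(W[:-1] * dW)                     # left endpoint
strat_sum = np.sum(0.5 * (W[:-1] + W[1:]) * dW)   # endpoint average

ito_exact = 0.5 * (W[-1]**2 - T)      # Ito value of the integral of W dW
strat_exact = 0.5 * W[-1]**2          # classical-calculus value
```

The gap between the two sums is ½ Σ (ΔW)², which tends to T/2.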

7. Itô Isometry

For adapted f, g ∈ L²(dt × dP):

E[(∫₀ᵀ f dW)²] = E[∫₀ᵀ f² ds],

E[(∫₀ᵀ f dW)(∫₀ᵀ g dW)] = E[∫₀ᵀ f g ds].

Two companion properties, which follow from the same construction (adapted integrands, independent mean-zero increments):

  • The Itô integral is a martingale: E[∫₀ᵗ f dW | F_s] = ∫₀ˢ f dW.
  • It has mean zero: E[∫₀ᵀ f dW] = 0.

The isometry is what makes the L² extension well-defined and what underlies the martingale representation theorem: every L² martingale adapted to a Brownian filtration is of the form M_t = M_0 + ∫₀ᵗ H_s dW_s for a unique adapted H ∈ L².

8. Itô’s Lemma — Chain Rule for Stochastic Calculus

Suppose X_t = X_0 + ∫₀ᵗ b(s) ds + ∫₀ᵗ σ(s) dW_s, written dX_t = b dt + σ dW_t. Let f(x, t) ∈ C^{2,1}. Then

df(X_t, t) = (∂_t f + b·∂_x f + ½ σ² ·∂_{xx} f) dt + σ·∂_x f dW_t.

The ½ σ² ∂_{xx} f term is the Itô correction — the classical chain rule would only have ∂_t f + b·∂_x f. It arises because (dW)² = dt is non-negligible at second order; the formal calculus is

(dX)² = σ² dt, (dX)(dt) = 0, (dt)² = 0.

A Taylor expansion of f keeping up to second order in dX then yields the lemma.

Multidimensional version. For dX^i = b^i dt + Σ_j σ^{ij} dW^j and f ∈ C^{2,1}(ℝⁿ × ℝ):

df = (∂_t f + b^i ∂_i f + ½ a^{ij} ∂_{ij} f) dt + σ^{ij} ∂_i f dW^j,

where a = σσᵀ is the diffusion matrix.

Worked example — log of GBM. dS = µS dt + σS dW; let Y = log S. Then ∂_S Y = 1/S, ∂_{SS} Y = −1/S². Itô gives

dY = (µ − ½σ²) dt + σ dW.

Integrating: Y_t = Y_0 + (µ − ½σ²)t + σ W_t, so S_t = S_0·exp((µ − σ²/2)t + σ W_t). The −σ²/2 drift correction is invisible to classical calculus and is the source of “volatility drag” in compound returns.
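The drift correction can be confirmed by sampling the closed form. A minimal sketch, assuming NumPy, with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, T, S0 = 0.10, 0.40, 1.0, 1.0
W_T = rng.normal(0.0, np.sqrt(T), size=400_000)
S_T = S0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * W_T)   # exact GBM solution

mean_S = S_T.mean()                # near S0 * exp(mu * T)
mean_log_S = np.log(S_T).mean()    # near log(S0) + (mu - sigma^2 / 2) * T
```

The mean of S grows at rate µ, while the mean of log S grows at the smaller rate µ − σ²/2; the difference is the volatility drag.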

9. SDE — General Form

A stochastic differential equation is

dX_t = b(X_t, t) dt + σ(X_t, t) dW_t, X_0 = x_0.

  • b — drift coefficient (ℝⁿ-valued).
  • σ — diffusion coefficient (ℝ^{n×m}-valued; W is m-dim).
  • A solution X_t is an adapted continuous process satisfying X_t = x_0 + ∫₀ᵗ b ds + ∫₀ᵗ σ dW.

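Existence + uniqueness (Itô 1951): if b and σ are globally Lipschitz in x with at most linear growth, then a strong solution exists, is pathwise unique, and X_t is Markov. Weaker hypotheses still give useful results: Stroock-Varadhan (1979) obtain weak solutions via the martingale problem, and Yamada-Watanabe (1971) show that weak existence plus pathwise uniqueness yields a strong solution (in one dimension, pathwise uniqueness holds for ½-Hölder σ, which covers CIR).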

Markov + generator. The infinitesimal generator of the diffusion is

𝓛 f = b·∇f + ½ tr(σσᵀ ∇²f),

which appears on the right-hand side of Itô’s lemma as the dt part. The generator drives backward Kolmogorov and the Feynman-Kac formula.

10. Classical SDEs

Geometric Brownian Motion (GBM)

dS = µS dt + σS dW, S_0 > 0. Closed form: S_t = S_0·exp((µ − σ²/2)t + σW_t). Always positive, log-normal marginal. The Black-Scholes underlying asset model. Captures multiplicative noise + exponential growth.

Ornstein-Uhlenbeck (OU) — Uhlenbeck + Ornstein 1930

dX = −θ(X − µ) dt + σ dW. Mean-reverting linear SDE. Closed-form solution:

X_t = µ + (X_0 − µ)e^{−θt} + σ ∫₀ᵗ e^{−θ(t−s)} dW_s.

Stationary distribution N(µ, σ²/(2θ)). The “exactly solvable Gaussian SDE”. Used in physics (velocity of a Brownian particle under friction), finance (Vasicek interest rate), neuroscience (leaky integrate-and-fire), control theory.
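Because the transition law is Gaussian, OU can be simulated without discretisation error. A minimal sketch, assuming NumPy; `ou_exact_step` is a hypothetical helper name and the parameters are illustrative.

```python
import numpy as np

def ou_exact_step(x, dt, theta, mu, sigma, rng):
    """Sample X_{t+dt} given X_t = x from the exact Gaussian transition."""
    decay = np.exp(-theta * dt)
    std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * theta))
    return mu + (x - mu) * decay + std * rng.normal(size=np.shape(x))

rng = np.random.default_rng(4)
theta, mu, sigma = 2.0, 1.0, 0.5
x = np.full(50_000, 5.0)              # start far from the long-run mean
for _ in range(100):                  # 100 steps of dt = 0.1, so t = 10
    x = ou_exact_step(x, 0.1, theta, mu, sigma, rng)

stationary_var = sigma**2 / (2.0 * theta)   # 0.0625
```

After many relaxation times the sample mean and variance should match µ and σ²/(2θ).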

Cox-Ingersoll-Ross (CIR) — CIR 1985

dX = a(b − X) dt + σ√X dW. Mean-reverting with state-dependent volatility; remains non-negative (Feller condition 2ab ≥ σ² ensures strictly positive). Stationary marginal is a Gamma. Standard short-rate model in fixed income; squared Bessel processes underlie CIR.

Vasicek — Vasicek 1977

dr = a(b − r) dt + σ dW. OU under another name; it can give negative interest rates (a feature, not a bug, since 2014).

Heston — Heston 1993

Stochastic-volatility model for equity: dS = µS dt + √v·S dW^1, dv = κ(θ − v) dt + ξ√v dW^2, d⟨W^1, W^2⟩ = ρ dt. Variance v_t is CIR; closed-form characteristic function enables Fourier option pricing.

Langevin equation (physics)

m·dv = −γv dt + √(2γ k_B T) dW. The OU velocity process for a Brownian particle in a heat bath at temperature T. Fluctuation-dissipation theorem links friction γ and noise variance.

11. Itô vs Stratonovich

Two integration conventions, differing in where the integrand is evaluated within each subinterval.

Itô — left endpoint

∫₀ᵀ f dW ≈ Σ f(t_k)·(W_{t_{k+1}} − W_{t_k}).

  • Non-anticipating → integral is a martingale.
  • E[∫f dW] = 0.
  • Does not obey classical chain rule; needs Itô’s lemma.
  • Natural in finance (information up to t_k is all we have when we decide a trade).

Stratonovich — midpoint

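∫₀ᵀ f ∘ dW ≈ Σ ½(f(t_k) + f(t_{k+1}))·(W_{t_{k+1}} − W_{t_k}).

The endpoint average stands in for the midpoint value here; for a smooth function of an Itô process both choices converge to the same limit.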

  • Not a martingale in general.
  • Obeys ordinary chain rule — df(X_t) = f’(X_t) ∘ dX_t.
  • Limit of smooth approximations to BM (Wong-Zakai 1965), so natural in physics where W is an idealisation of a fast coloured noise.

Conversion

If dX_t = b dt + σ dW_t (Itô), then in Stratonovich:

dX_t = (b − ½ σ·∂_x σ) dt + σ ∘ dW_t.

Equivalently, Itô ↔ Stratonovich drifts differ by ½ σ·σ_x (the Stratonovich correction). For additive noise (σ independent of X), the two conventions coincide. Modelling decision: use Stratonovich when calibrating to a coloured-noise limit (physics) or when changing coordinates on a manifold (geometry of SDEs, rough paths). Use Itô for finance, control, ML.
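As a worked instance of the conversion, take GBM from §10, where the diffusion coefficient is σ(S) = σS. The correction ½ σ(S)·∂_S σ(S) equals ½σ²S, and the ordinary chain rule applied to the Stratonovich form recovers the log-price dynamics obtained from Itô’s lemma in §8:

```latex
\begin{aligned}
\text{Itô:}\quad & dS = \mu S\,dt + \sigma S\,dW,
  \qquad \sigma(S) = \sigma S,\quad \partial_S \sigma(S) = \sigma,\\
\text{Stratonovich:}\quad & dS = \big(\mu - \tfrac12 \sigma^2\big) S\,dt + \sigma S \circ dW,\\
\text{classical chain rule:}\quad & d(\log S) = \tfrac{1}{S} \circ dS
  = \big(\mu - \tfrac12 \sigma^2\big)\,dt + \sigma\,dW .
\end{aligned}
```

In the last line the noise is additive, so the Itô and Stratonovich differentials coincide.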

12. Numerical SDE Solvers

The classical ODE-solver hierarchy (Euler → RK4 → adaptive) extends to SDEs but the order theory bifurcates into strong (pathwise) and weak (distributional) convergence.

  • Strong order γ_s: E[|X_T^Δ − X_T|] = O(Δ^{γ_s}).
  • Weak order γ_w: |E[f(X_T^Δ)] − E[f(X_T)]| = O(Δ^{γ_w}) for smooth f.

Euler-Maruyama — Maruyama 1955

ΔX = b(X_n, t_n)·Δt + σ(X_n, t_n)·ΔW, where ΔW = √Δt·Z, Z ~ N(0,1). Strong order 0.5, weak order 1.0. The workhorse — simple, robust, but slow to converge in the strong sense.
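The update rule translates almost line by line into code. A minimal sketch for scalar SDEs, assuming NumPy; the function name `euler_maruyama` and its signature are illustrative, not a library API.

```python
import numpy as np

def euler_maruyama(b, sigma, x0, T, n_steps, n_paths, rng):
    """Simulate dX = b(X,t) dt + sigma(X,t) dW on [0, T]; return X_T per path."""
    dt = T / n_steps
    x = np.full(n_paths, float(x0))
    for k in range(n_steps):
        t = k * dt
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)   # dW = sqrt(dt) * Z
        x = x + b(x, t) * dt + sigma(x, t) * dW
    return x

rng = np.random.default_rng(5)
mu, vol = 0.05, 0.2
# Example: GBM, whose exact mean is exp(mu * T).
S_T = euler_maruyama(lambda x, t: mu * x, lambda x, t: vol * x,
                     x0=1.0, T=1.0, n_steps=200, n_paths=100_000, rng=rng)
```

Comparing the sample mean against exp(µT) tests weak accuracy only; strong accuracy requires comparing against the exact solution driven by the same Brownian increments.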

Milstein — Milstein 1974

Adds a correction from the σ·∂_x σ term in the Itô-Taylor expansion:

X_{n+1} = X_n + b·Δt + σ·ΔW + ½ σ·σ_x·(ΔW² − Δt).

Strong order 1.0. Recommended whenever σ depends on X: Euler-Maruyama reaches strong order 1.0 only for additive noise (where the correction term vanishes) and falls back to 0.5 otherwise.
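The difference in strong order can be measured on GBM, where the exact solution is available for the same Brownian path. A minimal sketch, assuming NumPy; `strong_errors` is a hypothetical helper and the parameters are illustrative.

```python
import numpy as np

def strong_errors(n_steps, n_paths=20_000, mu=0.05, vol=0.5, T=1.0, seed=6):
    """Mean absolute error at time T of Euler-Maruyama and Milstein for GBM."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x_em = np.ones(n_paths)
    x_mil = np.ones(n_paths)
    W = np.zeros(n_paths)
    for _ in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        x_em = x_em + mu * x_em * dt + vol * x_em * dW
        # For GBM, sigma * sigma_x = vol^2 * x.
        x_mil = (x_mil + mu * x_mil * dt + vol * x_mil * dW
                 + 0.5 * vol**2 * x_mil * (dW**2 - dt))
        W += dW
    exact = np.exp((mu - 0.5 * vol**2) * T + vol * W)
    return np.mean(np.abs(x_em - exact)), np.mean(np.abs(x_mil - exact))

em_coarse, mil_coarse = strong_errors(n_steps=50)
em_fine, mil_fine = strong_errors(n_steps=200)
```

Reducing Δt by a factor of 4 should roughly halve the Euler-Maruyama error (order 0.5) and roughly quarter the Milstein error (order 1.0).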

Stochastic Runge-Kutta

Higher-order derivative-free schemes (Burrage + Burrage 1996, Rößler 2010) avoid evaluating σ_x; the commonly used variants reach strong order 1.0 to 1.5. Implementation is much more complex than ODE RK because of iterated Itô integrals.

Implicit + drift-implicit schemes

For stiff SDEs (large drift Lipschitz constant), explicit Euler-Maruyama requires tiny Δt. Drift-implicit Euler stabilises:

X_{n+1} = X_n + b(X_{n+1})·Δt + σ(X_n)·ΔW.

Fully implicit + multi-step methods (Higham 2000, Higham + Mao + Stuart 2002) exist but require care: ΔW is unbounded, so treating the diffusion term implicitly can lead to division by a factor arbitrarily close to zero.
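For a linear drift the implicit equation can be solved in closed form, which makes the stability gain easy to see. A minimal sketch on a stiff OU process, assuming NumPy; the helper names and the value θ·Δt = 10 are illustrative choices.

```python
import numpy as np

def ou_explicit_step(x, dt, dW, theta, mu, sigma):
    return x - theta * (x - mu) * dt + sigma * dW

def ou_drift_implicit_step(x, dt, dW, theta, mu, sigma):
    # Solve x_new = x - theta * (x_new - mu) * dt + sigma * dW for x_new.
    return (x + theta * mu * dt + sigma * dW) / (1.0 + theta * dt)

rng = np.random.default_rng(7)
theta, mu, sigma, dt = 1000.0, 0.0, 1.0, 0.01   # theta * dt = 10
x_exp = x_imp = 1.0
for _ in range(50):
    dW = rng.normal(0.0, np.sqrt(dt))
    x_exp = ou_explicit_step(x_exp, dt, dW, theta, mu, sigma)
    x_imp = ou_drift_implicit_step(x_imp, dt, dW, theta, mu, sigma)
```

The explicit iterate is multiplied by 1 − θΔt = −9 at every step and blows up, while the drift-implicit iterate is multiplied by 1/(1 + θΔt) and stays near the mean.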

Adaptive step size

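Adaptive Δt needs strong-error estimates and a way to reject a step without biasing the Brownian path (Burrage + Burrage + Tian 2004, Rackauckas + Nie 2017). DifferentialEquations.jl implements such rejection-sampling adaptive SDE solvers, which can be substantially faster than fixed-step methods at the same accuracy; the often-quoted 10×–100× speedup is problem-dependent.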

Software

  • Julia DifferentialEquations.jl (Rackauckas + Nie) — most comprehensive SDE solver suite (50+ algorithms, adaptive, GPU).
  • sdeint, pysde (Python); SDE Toolbox (MATLAB).
  • JAX diffrax (Kidger 2021) — differentiable SDE solvers; a common choice for neural-SDE training.
  • PyTorch torchsde (Li 2020) — adjoint-method gradients through SDE solves.

13. Fokker-Planck (Forward Kolmogorov) Equation

For dX_t = b(x,t) dt + σ(x,t) dW_t, the marginal density p(x, t) of X_t satisfies

∂_t p = −∂_x(b·p) + ½ ∂_{xx}(σ²·p).

In multi-D:

∂_t p = −∂_i(b^i p) + ½ ∂_{ij}(a^{ij} p), a = σσᵀ.

Interpretation: probability mass flows under drift b, spreads by diffusion of strength σ². Steady states satisfy the right-hand side = 0; for OU the stationary p is Gaussian.
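For the OU process of §10 (b = −θ(x − µ), constant σ) the steady-state condition can be solved by hand. Setting the right-hand side to zero and integrating once, with vanishing probability flux at infinity:

```latex
\begin{aligned}
0 &= \partial_x\big(\theta (x-\mu)\,p\big) + \tfrac12 \sigma^2\,\partial_{xx} p
  && \text{(stationary Fokker-Planck)}\\
0 &= \theta (x-\mu)\,p + \tfrac12 \sigma^2\,\partial_x p
  && \text{(integrate once; zero flux at } \pm\infty\text{)}\\
p(x) &\propto \exp\!\Big(-\frac{\theta (x-\mu)^2}{\sigma^2}\Big)
  && \Longrightarrow\quad p = \mathcal N\big(\mu,\ \sigma^2/(2\theta)\big).
\end{aligned}
```

This agrees with the stationary law quoted for OU in §10.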

Why it matters for diffusion models. Song et al. (2021), building on Sohl-Dickstein et al. (2015), recognised that running an SDE forward to corrupt data and estimating the score function ∇log p_t(x) is exactly what’s needed to write the time-reversed SDE (§20). Fokker-Planck is the bridge from sample-level SDE to density-level PDE that makes score-based generative modelling tractable.

See [[Math/pde-methods]] for the broader PDE framework — Fokker-Planck is a parabolic PDE in conservation form.

14. Backward Kolmogorov + Feynman-Kac

For u(x, t) = E[f(X_T) | X_t = x]:

−∂_t u = b·∂_x u + ½ σ²·∂_{xx} u, u(x, T) = f(x).

This backward Kolmogorov equation is a parabolic PDE in reverse time, propagating the terminal condition backward via the diffusion’s generator 𝓛.

Feynman-Kac formula (Kac 1949) — for a discount/source term V(x):

u(x,t) = E^x[exp(−∫_t^T V(X_s) ds)·f(X_T) + ∫_t^T exp(−∫_t^s V) g(X_s) ds]

solves ∂_t u + 𝓛u − Vu + g = 0, u(x,T) = f(x). This is the probabilistic representation of parabolic PDEs and the engine of risk-neutral option pricing (§17).
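A small case where both sides are computable: dX = dW (so 𝓛 = ½∂_{xx}), constant discount V = r, no source (g = 0), and terminal payoff f(y) = y². These choices are assumptions made for illustration; the PDE then has the closed-form solution used below. A minimal sketch, assuming NumPy:

```python
import numpy as np

def feynman_kac_mc(x, t, T, r, n_paths, rng):
    """Monte Carlo estimate of E[exp(-r (T - t)) * X_T^2 | X_t = x] for dX = dW."""
    X_T = x + rng.normal(0.0, np.sqrt(T - t), size=n_paths)
    return np.exp(-r * (T - t)) * np.mean(X_T**2)

def u_closed_form(x, t, T, r):
    # Solves u_t + 0.5 * u_xx - r * u = 0 with u(x, T) = x^2.
    return np.exp(-r * (T - t)) * (x**2 + (T - t))

rng = np.random.default_rng(8)
u_mc = feynman_kac_mc(x=1.5, t=0.0, T=2.0, r=0.1, n_paths=400_000, rng=rng)
u_pde = u_closed_form(x=1.5, t=0.0, T=2.0, r=0.1)
```

The same pattern, with the risk-neutral SDE in place of dX = dW, is Monte Carlo option pricing.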

Backward Kolmogorov also underlies Hamilton-Jacobi-Bellman in stochastic optimal control (§16).

15. Martingales

Definition. Adapted X_t is a martingale w.r.t. (F_t) if E[|X_t|] < ∞ and E[X_t | F_s] = X_s for s ≤ t. Supermartingale if ≤; submartingale if ≥.

Key examples: BM W_t; the Itô integral ∫f dW; W_t² − t; exp(λW_t − ½λ²t) for any λ (exponential martingale).

Doob’s optional stopping theorem

For bounded stopping time τ and a martingale X, E[X_τ] = E[X_0]. The result extends to unbounded stopping times under uniform integrability (Doob 1953). It is the technical backbone of pricing American options + first-passage problems.

Martingale representation theorem

Every L² random variable F measurable with respect to the BM filtration F_T is

F = E[F] + ∫₀ᵀ H_s dW_s,

for a unique adapted H ∈ L²(dt × dP). Foundation of completeness in Black-Scholes — every contingent claim is replicable by trading the underlying.

Girsanov’s theorem

Under suitable conditions on θ_t, the measure dQ/dP = exp(∫₀ᵀ θ dW − ½∫₀ᵀ θ² ds) is equivalent to P, and W̃_t = W_t − ∫₀ᵗ θ ds is a Q-Brownian motion. Girsanov = change of measure absorbs drift into a different probability. The technical core of risk-neutral pricing.
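For constant θ the change of measure can be checked by reweighting samples drawn under P. A minimal sketch, assuming NumPy; θ, T, and the threshold a are illustrative values.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(9)
theta, T, a = 1.5, 1.0, 2.0
W_T = rng.normal(0.0, sqrt(T), size=500_000)        # samples under P
Z_T = np.exp(theta * W_T - 0.5 * theta**2 * T)      # dQ/dP for constant theta

mean_Z = Z_T.mean()                                  # near 1 (exponential martingale)
q_prob_reweighted = np.mean(Z_T * (W_T > a))         # Q(W_T > a) by reweighting
# Under Q, W_T ~ N(theta * T, T), so Q(W_T > a) = 1 - Phi((a - theta*T) / sqrt(T)).
q_prob_exact = 0.5 * (1.0 - erf((a - theta * T) / sqrt(2.0 * T)))
```

The density Z_T has mean 1 under P, and weighting by it reproduces probabilities of a Brownian motion with drift θ, which is the content of the theorem in this simple case.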

16. Stochastic Control + HJB

Cost functional: J(x, t, u) = E[∫_t^T L(X_s, u_s) ds + φ(X_T) | X_t = x], over admissible controls u_t with state SDE dX = b(X, u) dt + σ(X, u) dW.

The value function V(x,t) = inf_u J(x, t, u) satisfies the Hamilton-Jacobi-Bellman (HJB) PDE:

−∂_t V = inf_u {L(x, u) + b(x, u)·∂_x V + ½ σ(x, u)²·∂_{xx} V}, V(x, T) = φ(x).

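This is a fully nonlinear, backward parabolic PDE. The optimal control is the minimiser u*(x, t) = arg min{…}, a feedback law obtained from Bellman’s dynamic-programming principle (Pontryagin’s maximum principle is the alternative, trajectory-wise route).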

Linear Quadratic Gaussian (LQG) — dX = (AX + Bu) dt + σ dW, cost ∫(XᵀQX + uᵀRu) dt + X_TᵀQ_T X_T. Optimal u* = −R⁻¹BᵀP(t)X where P solves a deterministic Riccati ODE. Certainty equivalence: optimal control is the same as the deterministic problem; noise only inflates the cost. See [[Robotics/optimal-control]].
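In the scalar case the Riccati ODE reads −dP/dt = q + 2aP − (b²/r)P² with P(T) = q_T, and the gain is (b/r)·P(t). A minimal sketch, assuming NumPy and explicit Euler integration backward in time; the lowercase scalars a, b, q, r, q_T are stand-ins for A, B, Q, R, Q_T (here r is the control weight, not an interest rate).

```python
import numpy as np

def riccati_backward(a, b, q, r, q_T, T, n_steps):
    """Integrate -dP/dt = q + 2 a P - (b^2 / r) P^2 backward from P(T) = q_T."""
    dt = T / n_steps
    P = np.empty(n_steps + 1)
    P[-1] = q_T
    for k in range(n_steps, 0, -1):
        P[k - 1] = P[k] + dt * (q + 2.0 * a * P[k] - (b**2 / r) * P[k]**2)
    return P

a, b, q, r = 1.0, 1.0, 1.0, 1.0
P = riccati_backward(a, b, q, r, q_T=0.0, T=10.0, n_steps=10_000)
gain_at_0 = (b / r) * P[0]                                   # u*(x, 0) = -gain_at_0 * x
P_steady = r * (a + np.sqrt(a**2 + b**2 * q / r)) / b**2     # algebraic Riccati root
```

The noise level σ never enters the recursion, which is certainty equivalence in action: the gain is the same as for the deterministic problem.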

Merton’s portfolio problem (Merton 1969) — log/CRRA utility maximisation gives explicit u*; “Merton fraction” = (µ − r)/(γσ²) of wealth in the risky asset.

17. Black-Scholes — Option Pricing

Black + Scholes + Merton 1973 (Nobel 1997 to Scholes + Merton; Black died 1995). Assumes underlying S_t follows GBM, frictionless trading, constant r, σ. By Itô + replication (or risk-neutral pricing under Girsanov):

∂_t V + ½ σ² S² ∂_{SS} V + r S ∂_S V − r V = 0,

a backward parabolic PDE on (S, t) ∈ (0,∞) × [0,T]. Terminal condition is the payoff (e.g., (S − K)_+ for a call).

Risk-neutral pricing. Under the unique equivalent martingale measure Q (Girsanov shifts drift µ → r):

V(S, t) = e^{−r(T−t)}·E^Q[payoff(S_T) | S_t = S].

Closed form for European call: V = S·Φ(d_1) − K·e^{−r(T−t)}·Φ(d_2), with d_1 = (log(S/K) + (r + σ²/2)(T−t))/(σ√(T−t)), d_2 = d_1 − σ√(T−t).

Greeks

Partial derivatives of V used for hedging:

  • Δ = ∂V/∂S — sensitivity to spot price (= Φ(d_1) for a call).
  • Γ = ∂²V/∂S² — curvature.
  • 𝒱 (vega) = ∂V/∂σ — sensitivity to volatility.
  • Θ = ∂V/∂t — time decay.
  • ρ = ∂V/∂r — interest rate sensitivity.
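The closed form and the delta identity can be evaluated with the standard library alone. A minimal sketch; `bs_call` and `norm_cdf` are illustrative helper names, and τ denotes time to expiry T − t.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, r, sigma, tau):
    """European call value with time to expiry tau = T - t > 0."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * tau) / (sigma * sqrt(tau))
    d2 = d1 - sigma * sqrt(tau)
    return S * norm_cdf(d1) - K * exp(-r * tau) * norm_cdf(d2)

price = bs_call(S=100.0, K=100.0, r=0.05, sigma=0.2, tau=1.0)

# Delta by central finite difference, to compare with Phi(d1); here d1 = 0.35.
h = 0.01
delta_fd = (bs_call(100.0 + h, 100.0, 0.05, 0.2, 1.0)
            - bs_call(100.0 - h, 100.0, 0.05, 0.2, 1.0)) / (2.0 * h)
delta_formula = norm_cdf(0.35)
```

For these inputs the call is worth about 10.45, and the finite-difference delta agrees with Φ(d_1).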

Implied volatility surface

Black-Scholes assumes constant σ; market prices imply σ(K, T) — the volatility surface. Persistent smile + skew patterns motivate models beyond BS.

Beyond Black-Scholes

  • Heston 1993 — stochastic volatility (§10).
  • Merton jump-diffusion 1976 — adds compound Poisson jumps.
  • SABR (Hagan 2002) — local-stochastic vol for fixed-income smile.
  • Lévy processes (Cont + Tankov 2003) — replaces BM with infinitely divisible Lévy noise.
  • Local volatility (Dupire 1994) — calibrates σ(S, t) to match the full surface.

18. Lévy Processes + Jump SDEs

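A Lévy process L_t starts at 0, has stationary, independent increments and is càdlàg (right-continuous with left limits). By the Lévy-Itô decomposition (the pathwise counterpart of the Lévy-Khintchine formula for the characteristic function), every Lévy process decomposes: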

L_t = bt + σ W_t + (jumps)

with jumps governed by a Lévy measure ν(dy) on ℝ∖{0}, ∫ min(1, y²)·ν(dy) < ∞.

Special cases:

  • Compound Poisson — finitely many jumps; ν finite.
  • α-stable (0 < α < 2) — heavy-tailed jumps, ν(dy) = c·|y|^{−1−α} dy in the symmetric case.
  • Variance Gamma, CGMY, Normal Inverse Gaussian — finance-relevant pure-jump processes.

Jump-diffusion SDE: dX = b dt + σ dW + ∫ γ(x,y)·Ñ(dt, dy), with Ñ the compensated Poisson random measure. Itô’s lemma generalises with an extra integral over jump amplitudes. Used for asset returns with crashes, insurance claim processes, network packet arrivals.

19. Stochastic Gradient Langevin Dynamics (SGLD)

Welling + Teh 2011: combine SGD with Gaussian noise to sample from a Bayesian posterior. Update rule (in practice the gradient is replaced by an unbiased minibatch estimate):

θ_{k+1} = θ_k + (ε_k/2)·∇log p(θ_k | data) + √ε_k·η_k, η_k ~ N(0, I).

This is the Euler-Maruyama discretisation of the Langevin SDE

dθ = ½∇log π(θ) dt + dW,

whose stationary density is π. For step sizes ε_k → 0 with Σ ε_k = ∞ and Σ ε_k² < ∞, the chain converges to the posterior; with ε_k fixed, SGLD samples from a biased approximation. Covered in detail in [[Math/mcmc-sampling]]. The discretisation bias can be removed by adding a Metropolis-Hastings accept/reject step, as in Metropolis-adjusted Langevin (MALA) and Hamiltonian Monte Carlo (Duane 1987, Neal 2011), at the cost of full-data evaluations.

SGLD is one rung of the ladder that leads from score-based sampling to diffusion models: the Langevin corrector in §20 uses essentially the same update, with a learned score in place of ∇log π.
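The update rule and its fixed-step bias can be seen on a one-dimensional Gaussian target. A minimal sketch, assuming NumPy: it uses the exact gradient rather than a minibatch estimate, so strictly it is the unadjusted Langevin algorithm underlying SGLD, and the target N(2, 0.5²) is an illustrative stand-in for a posterior.

```python
import numpy as np

def grad_log_target(theta, m=2.0, s=0.5):
    """Score of the illustrative target N(m, s^2)."""
    return -(theta - m) / s**2

rng = np.random.default_rng(10)
eps = 0.01                                # fixed step size
theta = np.zeros(20_000)                  # 20k independent chains
for _ in range(500):
    noise = rng.normal(size=theta.shape)
    theta = theta + 0.5 * eps * grad_log_target(theta) + np.sqrt(eps) * noise

chain_mean, chain_var = theta.mean(), theta.var()
```

For this linear-Gaussian case the chain's stationary variance works out to s²/(1 − ε/(4s²)), slightly above s²; that excess is the fixed-step bias, and it shrinks as ε → 0.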

20. Diffusion Models in ML

A pure stochastic-calculus story that broke open generative modelling.

Forward (noising) SDE

Start at data x_0 ~ p_{data}; corrupt by

dx = f(x, t) dt + g(t) dW_t, t ∈ [0, T],

with f, g chosen so x_T ~ N(0, σ_T²·I) is nearly pure Gaussian. Common choices:

  • Variance-exploding (VE), Song-Ermon 2019: f = 0, g(t) growing.
  • Variance-preserving (VP), Ho 2020 (DDPM): Ornstein-Uhlenbeck style, f(x,t) = −½β(t)·x, g(t) = √β(t).
  • Sub-VP, Song 2021 — variance bounded above by that of the VP-SDE; reported to give better likelihoods.

Reverse (denoising) SDE — Anderson 1982

The time-reversed SDE is

dx = [f(x,t) − g(t)²·∇_x log p_t(x)] dt + g(t) dW̄_t,

run backward from t = T to 0, with W̄ a reverse-time BM. The score ∇_x log p_t(x) is the only unknown; it is approximated by a neural net s_θ(x, t) trained via denoising score matching (Vincent 2011):

L = E_{t, x_0, ε}[‖s_θ(x_t, t) − ∇log p(x_t | x_0)‖²].

For VP-SDE, ∇log p(x_t | x_0) has a closed form proportional to ε (the added noise), so the training target is “predict the noise” — the DDPM objective (Ho 2020).
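The closed form can be written out explicitly. The shorthand α_t and σ_t below is notation introduced for this derivation, not taken from the text above:

```latex
\begin{aligned}
\alpha_t &= \exp\!\Big(-\tfrac12 \int_0^t \beta(s)\,ds\Big),
  \qquad \sigma_t^2 = 1 - \alpha_t^2,\\
p(x_t \mid x_0) &= \mathcal N\big(x_t;\ \alpha_t x_0,\ \sigma_t^2 I\big),
  \qquad x_t = \alpha_t x_0 + \sigma_t\,\varepsilon,\quad \varepsilon \sim \mathcal N(0, I),\\
\nabla_{x_t} \log p(x_t \mid x_0) &= -\frac{x_t - \alpha_t x_0}{\sigma_t^2}
  = -\frac{\varepsilon}{\sigma_t}.
\end{aligned}
```

The mean and variance follow from the linear SDE dx = −½β(t)x dt + √β(t) dW: the mean decays as α_t x_0 and the variance v satisfies dv/dt = β(t)(1 − v) with v(0) = 0.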

Probability-flow ODE

There exists an equivalent deterministic ODE with the same marginal densities:

dx = [f(x,t) − ½ g(t)²·s_θ(x,t)] dt.

This makes diffusion models invertible + provides exact likelihood estimation (via the instantaneous change-of-variables formula from neural ODEs).

Sampling

  • Predictor-corrector — Euler-Maruyama on the reverse SDE + Langevin MCMC correction at each step.
  • DDPM ancestral sampling (Ho 2020) — discrete-time DDPM is a numerical scheme for the VP-SDE.
  • DDIM (Song 2020) — fast deterministic sampler, interpretable as a discretisation of the probability-flow ODE; 10–50 steps instead of 1000.
  • DPM-Solver, EDM (Karras 2022) — higher-order ODE solvers tailored to the score-SDE.
  • Consistency models (Song 2023), flow matching (Lipman 2023) — newer single-step or few-step variants built on the same SDE/ODE backbone.
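The reverse-SDE sampler can be exercised end to end when the score is known analytically. A minimal sketch, assuming NumPy, a constant-β VP-SDE, and toy one-dimensional data N(m0, s0²); all of these are illustrative assumptions, and the exact Gaussian score stands in for a trained network s_θ.

```python
import numpy as np

beta, T = 2.0, 5.0        # constant-beta VP-SDE: f = -0.5 * beta * x, g = sqrt(beta)
m0, s0 = 3.0, 0.5         # toy data distribution N(m0, s0^2)

def score(x, t):
    """Exact score of p_t = N(m0 * exp(-beta t / 2), s0^2 e^{-beta t} + 1 - e^{-beta t})."""
    decay = np.exp(-beta * t)
    mean_t = m0 * np.sqrt(decay)
    var_t = s0**2 * decay + 1.0 - decay
    return -(x - mean_t) / var_t

rng = np.random.default_rng(11)
n_steps, n_samples = 1000, 20_000
dt = T / n_steps
x = rng.normal(size=n_samples)                      # x_T ~ N(0, 1), the prior
for k in range(n_steps, 0, -1):                     # Euler-Maruyama, backward in time
    t = k * dt
    drift = -0.5 * beta * x - beta * score(x, t)    # f - g^2 * score
    x = x - drift * dt + np.sqrt(beta * dt) * rng.normal(size=n_samples)

sample_mean, sample_std = x.mean(), x.std()
```

Starting from pure noise, the samples should end close to N(3, 0.5²), up to discretisation error.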

Production deployments

DALL-E 2 (OpenAI 2022), Imagen (Google 2022), Stable Diffusion (Stability AI 2022), Midjourney v5+, Sora video (OpenAI 2024), AlphaFold-3 protein structure (DeepMind 2024), Mochi video (Genmo 2024), Flux image (BFL 2024), HunyuanVideo (Tencent 2024). The publicly documented systems among these are diffusion or flow-matching models, members of the same SDE/ODE family built on score matching and the Anderson reverse-SDE formula or its probability-flow ODE.

See [[Compute/transformer-architecture]] for the score-net backbone (U-Net + cross-attention + DiT — diffusion transformer, Peebles 2023).

21. Statistical Physics Connection

Langevin 1908 wrote the equation of motion for a Brownian particle in a fluid:

m·dv = −γ v·dt + √(2γ k_B T)·dW.

The drift −γv is Stokes drag; the noise √(2γk_BT) dW models thermal collisions. Fluctuation-dissipation theorem (Einstein 1905, Kubo 1957) links the dissipation coefficient γ to the noise amplitude — both originate in the same molecular collisions.

The Langevin equation is the OU process (§10) for the velocity, and its Fokker-Planck PDE is a drift-diffusion equation in velocity space (the heat equation plus a friction term). Stationary distribution is the Maxwell-Boltzmann density ∝ exp(−mv²/(2 k_B T)).

Molecular dynamics: for a system at temperature T, Langevin dynamics provides a thermostat. Newton’s equations are integrated with a friction + noise term that drives the ensemble to the Gibbs measure ∝ exp(−H/k_BT). Used in protein folding (GROMACS, OpenMM), materials simulation (LAMMPS), and for running machine-learned interatomic potentials (NequIP, DimeNet, MACE) under Langevin or Nosé-Hoover dynamics.

22. Applications

Finance

  • Option pricing (BS, Heston, jump-diffusion).
  • Interest rate models (Vasicek, CIR, Hull-White, HJM framework).
  • Credit risk (Merton 1974 structural model; reduced-form intensity-based models).
  • Portfolio dynamics + Merton’s problem.
  • Risk-neutral Monte Carlo for path-dependent derivatives (Asians, barriers, exotics).
  • Market microstructure — order-flow imbalance modelled as a Hawkes / Lévy process.

Physics

  • Brownian motion of pollen, colloids, nanoparticles.
  • Langevin + Fokker-Planck for non-equilibrium statistical mechanics.
  • Plasma turbulence, cosmological perturbations (stochastic inflation).
  • Quantum stochastic calculus (Hudson + Parthasarathy 1984) for open quantum systems.

Biology

  • Wright-Fisher diffusion — allele frequency drift; dX = √(X(1−X))·dW. Backbone of population genetics.
  • Predator-prey with environmental noise (Lotka-Volterra + dW).
  • Neuronal membrane voltage — stochastic Hodgkin-Huxley + Fokker-Planck firing-rate equations.
  • Single-cell gene expression: chemical Langevin equation + Gillespie SSA.

Engineering + control

  • Continuous-time Kalman-Bucy filter (Kalman + Bucy 1961) — optimal linear filter for SDE-driven state with Gaussian observation noise. Detailed in [[Robotics/bayesian-estimation]].
  • LQG control (§16) — aerospace, robotics, process control.
  • Signal processing — coloured noise as a filtered BM.
  • Reliability — failure times as first-passage of an OU.

Machine learning

  • Diffusion generative models (§20).
  • SGLD posterior sampling (§19).
  • Neural SDEs (Li 2020) — learnable drift + diffusion, trained via adjoint.
  • Flow matching + rectified flow (Lipman 2023, Liu 2023) — continuous-time normalising flows that subsume score-SDE sampling.
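  • Stochastic optimisation as an Itô-SDE limit: constant-step SGD modelled as an OU process near a minimum (Mandt + Hoffman + Blei 2017); noise-injection regularisers such as Dropout are sometimes analysed in the same way.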

23. Common Pitfalls

  • Treating dW like dt. dW is √dt-scale; (dW)² = dt at leading order. Forgetting this is the source of every Itô-correction bug.
  • Confusing Itô vs Stratonovich. A simulation that converts the wrong way picks up a spurious ½σ·σ_x drift. Always state which convention.
  • Forgetting ½ σ² ∂_{xx} f in Itô’s lemma. Classical chain rule gives the wrong answer whenever σ ≠ 0 and f is nonlinear in x (∂_{xx} f ≠ 0), even for constant σ.
  • Euler-Maruyama with state-dependent σ. Strong order silently drops to 0.5 — use Milstein.
  • Differentiating a BM sample path. Paths are α-Hölder for α < ½ but nowhere differentiable. Methods that require Ẇ_t (a finite difference ΔW/Δt) blow up as Δt → 0, since ΔW/Δt has standard deviation Δt^{−½}.
  • Treating ∫ f dW as an ordinary integral. It’s a stochastic process with zero mean and quadratic variation ∫ f² ds. Bounds, comparisons, and chain rules all require the L² / martingale framework.
  • Using non-adapted integrands — leads to Skorohod / Malliavin calculus, not Itô. Make sure your integrand depends only on the past.
  • Confusing strong and weak convergence — for distributional quantities (expectations, prices), weak order is what matters; for pathwise quantities (max, hitting times), strong order matters.
  • Numerical positivity — Euler-Maruyama on CIR can produce negative X (square root undefined). Use full truncation, log-transform, or specialised CIR schemes (Alfonsi 2010).
  • Boundary behaviour — Feller test for explosion / accessibility; matters for SDEs on bounded domains (e.g., Wright-Fisher).
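The positivity pitfall for CIR can be handled with the full-truncation fix named above. A minimal sketch, assuming NumPy; `cir_full_truncation` is a hypothetical helper, and the parameters deliberately violate the Feller condition so that the unmodified scheme would hit negative values.

```python
import numpy as np

def cir_full_truncation(x0, a, b, sigma, T, n_steps, n_paths, rng):
    """Euler-Maruyama for dX = a(b - X) dt + sigma sqrt(X) dW with full truncation."""
    dt = T / n_steps
    x = np.full(n_paths, float(x0))        # auxiliary process, may dip below zero
    for _ in range(n_steps):
        x_pos = np.maximum(x, 0.0)         # truncate inside drift and diffusion
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        x = x + a * (b - x_pos) * dt + sigma * np.sqrt(x_pos) * dW
    return np.maximum(x, 0.0)              # reported value is the positive part

rng = np.random.default_rng(12)
# 2ab = 0.08 < sigma^2 = 0.25, so the Feller condition fails on purpose.
X_T = cir_full_truncation(x0=0.04, a=1.0, b=0.04, sigma=0.5, T=5.0,
                          n_steps=1000, n_paths=50_000, rng=rng)
```

The square root is only ever taken of a non-negative number, and the sample mean stays close to the long-run level b.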

24. Software

  • Julia DifferentialEquations.jl (Rackauckas + Nie) — 50+ SDE algorithms, GPU, adaptive, jump SDEs. State of the art.
  • SciML — the Julia ecosystem around DifferentialEquations.jl; adds sensitivity analysis through SDE solves. (SDE Toolbox is a separate MATLAB package.)
  • SciPy — no native SDE solver; community packages sdeint, pysde.
  • JAX diffrax (Kidger 2021) — JIT-compiled differentiable SDE solvers; well suited to neural-SDE + diffusion training.
  • PyTorch torchsde (Li 2020) — adjoint-method gradients through SDE solves; used in latent SDEs.
  • QuantLib (C++) — fixed-income + derivatives library; production-grade BS, Heston, LMM.
  • PyMC / Stan — Bayesian inference for parametric SDEs via NUTS / HMC.
  • Hugging Face Diffusers — pretrained score-SDE models (SD, SDXL, Flux, video) with k-diffusion / DPM-Solver samplers.
  • OpenMM / GROMACS / LAMMPS — molecular dynamics with Langevin thermostats.

25. Cross-References

  • [[Math/probability-fundamentals]] — measure-theoretic probability, conditional expectation, modes of convergence.
  • [[Math/probability-distributions]] — Gaussian, Gamma, stable, Poisson families.
  • [[Math/ode-numerical-methods]] — Euler / RK base for Euler-Maruyama / Milstein.
  • [[Math/pde-methods]] — parabolic PDEs (Fokker-Planck, backward Kolmogorov, Black-Scholes).
  • [[Math/mcmc-sampling]] — SGLD, MALA, HMC; Langevin sampler.
  • [[Math/multivariate-calculus]] — chain rule (classical) for contrast with Itô.
  • [[Math/bayesian-inference]] — posterior sampling via stochastic dynamics.
  • [[Math/_index]] — math hub.
  • [[Compute/transformer-architecture]] — score-network backbones (U-Net + DiT) in diffusion models.
  • [[Robotics/bayesian-estimation]] — continuous-time Kalman-Bucy filter; LQG.
  • [[Robotics/optimal-control]] — HJB, dynamic programming, LQG.

26. Citations

  • Bachelier, L. (1900). Théorie de la spéculation. Annales Scientifiques de l’École Normale Supérieure.
  • Einstein, A. (1905). On the motion of small particles suspended in liquids at rest required by the molecular-kinetic theory of heat. Annalen der Physik 17, 549–560.
  • Langevin, P. (1908). Sur la théorie du mouvement brownien. Comptes Rendus 146, 530–533.
  • Wiener, N. (1923). Differential space. J. Math. Phys. 2, 131–174.
  • Uhlenbeck, G. + Ornstein, L. (1930). On the theory of the Brownian motion. Phys. Rev. 36, 823.
  • Itô, K. (1944). Stochastic integral. Proc. Imperial Acad. Tokyo 20, 519–524.
  • Kac, M. (1949). On distributions of certain Wiener functionals. Trans. AMS 65, 1–13.
  • Itô, K. (1951). On Stochastic Differential Equations. Memoirs AMS 4.
  • Donsker, M. (1951). An invariance principle for certain probability limit theorems. Memoirs AMS 6.
  • Maruyama, G. (1955). Continuous Markov processes and stochastic equations. Rend. Circ. Mat. Palermo 4, 48–90.
  • Kalman, R. + Bucy, R. (1961). New results in linear filtering and prediction theory. J. Basic Eng. 83, 95–108.
  • Wong, E. + Zakai, M. (1965). On the convergence of ordinary integrals to stochastic integrals. Ann. Math. Stat. 36, 1560–1564.
  • Merton, R. (1969). Lifetime portfolio selection under uncertainty. Rev. Econ. Stat. 51, 247–257.
  • Black, F. + Scholes, M. (1973). The pricing of options and corporate liabilities. J. Polit. Econ. 81, 637–654.
  • Merton, R. (1973). Theory of rational option pricing. Bell J. Econ. Manag. Sci. 4, 141–183.
  • Merton, R. (1976). Option pricing when underlying stock returns are discontinuous. J. Fin. Econ. 3, 125–144.
  • Vasicek, O. (1977). An equilibrium characterization of the term structure. J. Fin. Econ. 5, 177–188.
  • Stroock, D. + Varadhan, S. (1979). Multidimensional Diffusion Processes. Springer.
  • Anderson, B. (1982). Reverse-time diffusion equation models. Stoch. Proc. Appl. 12, 313–326.
  • Cox, J., Ingersoll, J. + Ross, S. (1985). A theory of the term structure of interest rates. Econometrica 53, 385–407.
  • Karatzas, I. + Shreve, S. (1991). Brownian Motion and Stochastic Calculus, 2nd ed. Springer.
  • Heston, S. (1993). A closed-form solution for options with stochastic volatility. Rev. Fin. Studies 6, 327–343.
  • Dupire, B. (1994). Pricing with a smile. Risk 7, 18–20.
  • Burrage, K. + Burrage, P. (1996). High strong order explicit Runge-Kutta methods for stochastic ODEs. Appl. Num. Math. 22, 81–101.
  • Higham, D. (2000). Mean-square and asymptotic stability of the stochastic theta method. SIAM J. Num. Anal. 38, 753–769.
  • Hagan, P. et al. (2002). Managing smile risk. Wilmott Magazine 1, 84–108.
  • Cont, R. + Tankov, P. (2003). Financial Modelling with Jump Processes. Chapman & Hall.
  • Øksendal, B. (2003). Stochastic Differential Equations, 6th ed. Springer.
  • Shreve, S. (2004). Stochastic Calculus for Finance II: Continuous-Time Models. Springer.
  • Welling, M. + Teh, Y. (2011). Bayesian learning via stochastic gradient Langevin dynamics. ICML.
  • Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation 23, 1661–1674.
  • Sohl-Dickstein, J. et al. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. ICML.
  • Rackauckas, C. + Nie, Q. (2017). DifferentialEquations.jl. J. Open Research Software 5.
  • Ho, J., Jain, A. + Abbeel, P. (2020). Denoising diffusion probabilistic models. NeurIPS.
  • Li, X. et al. (2020). Scalable gradients for stochastic differential equations. AISTATS.
  • Song, J., Meng, C. + Ermon, S. (2020). Denoising diffusion implicit models. arXiv:2010.02502 (ICLR 2021).
  • Song, Y. et al. (2021). Score-based generative modeling through stochastic differential equations. ICLR.
  • Kidger, P. (2021). On Neural Differential Equations. PhD thesis, University of Oxford.
  • Karras, T. et al. (2022). Elucidating the design space of diffusion-based generative models. NeurIPS.
  • Peebles, W. + Xie, S. (2023). Scalable diffusion models with transformers (DiT). ICCV.
  • Lipman, Y. et al. (2023). Flow matching for generative modeling. ICLR.
  • Liu, X., Gong, C. + Liu, Q. (2023). Flow straight and fast: learning to generate and transfer data with rectified flow. ICLR.
  • Song, Y. et al. (2023). Consistency models. ICML.