Gradient Descent Variants — Math Reference

1. At a Glance

Gradient descent (GD) is the canonical first-order iterative optimization method: at each step, take a step opposite the local gradient. It is the workhorse of machine learning — 99%+ of neural networks are trained by some variant of gradient descent (SGD with momentum, Adam, AdamW, Lion, etc.).

Convergence theory is mature and well-understood for convex problems: precise rates exist for vanilla GD, accelerated methods (Nesterov), and stochastic methods (SGD). For non-convex problems — which is the regime of all deep learning — global convergence guarantees do not exist in general, but practical methods rely on:

  • Empirical evidence that local minima found by SGD generalize well.
  • Favorable geometry of overparameterized neural-net loss surfaces (many saddle points but few bad local minima; wide valleys around good solutions).
  • Implicit regularization from stochasticity, step size, and architecture.

This note covers the canonical update equations, convergence guarantees where they exist, learning-rate schedules used in practice, and the practical recipes for modern LLM and vision training.

Notation: minimize f : R^n → R. Iterates x_k ∈ R^n. Step size (learning rate) α > 0. Gradient g_k = ∇f(x_k). Smoothness constant L (Lipschitz gradient: ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖). Strong convexity constant µ > 0 (f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + ½µ‖y − x‖²). Condition number κ = L/µ.


2. Vanilla Gradient Descent

The base update:

x_{k+1} = x_k − α · ∇f(x_k)

Step size α (learning rate) is the single most important hyperparameter. If α is too large, iterates diverge (oscillate with growing amplitude in steep directions). If α is too small, convergence is slow (linear-in-iteration rate but with a tiny base).

Convergence (Cauchy 1847 originated the method; modern analysis Nesterov 2004):

  • Convex + L-smooth, α = 1/L: f(x_k) − f* = O(1/k). Achieves ε-suboptimality in O(1/ε) iterations.
  • Strongly convex + L-smooth, α = 2/(µ+L): linear convergence ‖x_k − x*‖² ≤ ((κ − 1)/(κ + 1))^{2k} · ‖x_0 − x*‖² i.e., contraction factor (1 − µ/L)^k for the standard analysis with α = 1/L.
  • Non-convex + L-smooth, α = 1/L: min_{i ≤ k} ‖∇f(x_i)‖² = O(1/k). Converges to a stationary point (not necessarily a global minimum or even local minimum — could be a saddle point in principle, though for ML loss surfaces this rarely matters).

The constant α = 1/L is in practice unknown, so backtracking line search, adaptive methods, or hand-tuned step sizes dominate.


3. Step-Size Selection

3.1 Constant step size

Simplest: pick α once. Works if you know L (or have tuned by grid search). Practical default for ML: try {1e−1, 1e−2, 1e−3, 1e−4, 1e−5} on a log scale.

3.2 Backtracking line search (Armijo 1966)

Start with α₀ (say, α₀ = 1). While the Armijo sufficient-decrease condition fails:

f(x_k − α · g_k) > f(x_k) − c · α · ‖g_k‖²    (c ∈ (0, 1), typically c = 1e−4)

halve α (α ← α / 2). Then take the step. Guaranteed to terminate for differentiable f with bounded sublevel sets.

3.3 Wolfe conditions (Wolfe 1969)

Strengthens Armijo with a curvature condition:

⟨∇f(x_k − α · p_k), p_k⟩ ≥ c₂ · ⟨∇f(x_k), p_k⟩    (c₂ ∈ (c, 1), typically 0.9)

Ensures the step is not too short. Required for quasi-Newton methods (BFGS, L-BFGS).

α_k = argmin_{α ≥ 0} f(x_k − α · g_k). Closed-form only for quadratics: α* = ‖g_k‖² / ⟨g_k, A · g_k⟩ for f(x) = ½xᵀAx − bᵀx. Rarely used in practice; the cost of solving the 1D problem usually exceeds the savings.

3.5 Polyak step size (Polyak 1987)

If you know the optimal value f*(e.g., for interpolating models in ML where f* ≈ 0):

α_k = (f(x_k) − f*) / ‖∇f(x_k)‖²

Adaptive without tuning. Subgradient-method version works for non-smooth convex problems.


4. Stochastic Gradient Descent (SGD)

Robbins + Monro 1951 introduced the stochastic approximation framework. For f(x) = E_i[f_i(x)] (or a finite sum ∑ f_i(x) / n), replace the full gradient with a single-sample estimate:

x_{k+1} = x_k − α_k · ∇f_i(x_k),    i ~ Uniform{1..n}

Why it works: E[∇f_i(x)] = ∇f(x), so the expected update direction is correct. Variance introduces noise that must be controlled by decreasing α_k.

Cost: O(1) per step instead of O(n). For ML with n = 10^6 to 10^9 examples, this is 100×–10000× cheaper per step than full-batch GD, and total wall-clock convergence is much faster despite the higher iteration count.

Convergence (Robbins-Monro): Almost-sure convergence to a stationary point for diminishing step sizes satisfying:

∑ α_k = ∞,    ∑ α_k² < ∞

Classical choice: α_k = α₀ / (k + 1) or α_k = α₀ / √(k + 1).

Rates:

  • Convex + L-smooth: E[f(x̄_k) − f*] = O(1/√k) with α = O(1/√k), or O(1/k) with averaging.
  • Strongly convex: O(1/k).

These are slower than full-batch GD per iteration but the per-iteration cost is so much lower that SGD wins in total compute.


5. Mini-Batch SGD

The standard compromise: average gradients over a batch B of |B| samples drawn uniformly:

g̃_k = (1 / |B|) · ∑_{i ∈ B} ∇f_i(x_k)
x_{k+1} = x_k − α · g̃_k

Variance reduction: Var(g̃) = Var(∇f_i) / |B|, so standard deviation shrinks as 1/√|B|.

Practical batch sizes in ML:

  • Vision (ResNet on ImageNet): 256–4096.
  • LLM pre-training: 256–8192 sequences, often with gradient accumulation to reach effective batches in the millions of tokens.
  • Fine-tuning: 8–64.

Linear scaling rule (Goyal 2017): when batch size is multiplied by k, multiply α by k (up to a point — typically saturates around batch 8k–32k). Combined with warmup, allows training ResNet-50 in 1 hour on 256 GPUs.

Critical batch size (McCandlish 2018): beyond a certain batch size, doubling batch no longer doubles training speed — gradient noise scale dictates the regime.


6. Momentum (Polyak 1964, “Heavy Ball”)

Augment GD with a velocity term that accumulates past gradients:

v_{k+1} = β · v_k + ∇f(x_k)
x_{k+1} = x_k − α · v_{k+1}

(Some references write v_{k+1} = β · v_k + (1 − β) · ∇f(x_k); the choice rescales α.)

Intuition: A ball rolling down the loss surface with mass. Builds speed along consistent downhill directions; damps oscillations across narrow valleys (where the gradient changes sign every step).

Hyperparameter β (momentum coefficient): typical 0.9. β = 0 reduces to plain GD. β → 1 approaches full averaging.

Convergence: for strongly convex quadratics, Polyak’s heavy-ball method achieves the optimal accelerated rate (1 − √(µ/L))^k — much better than GD’s (1 − µ/L)^k when condition number κ = L/µ is large. For general strongly convex functions, the heavy-ball method does not achieve acceleration globally (Lessard 2016 counterexample), motivating Nesterov.

Effective step size: α / (1 − β). With β = 0.9, momentum amplifies effective steps by 10× in consistent directions.


7. Nesterov Accelerated Gradient (NAG, 1983)

Yurii Nesterov 1983 (“A method of solving a convex programming problem with convergence rate O(1/k²)”). The key idea: evaluate the gradient at the look-ahead point, not the current iterate.

v_{k+1} = β · v_k + ∇f(x_k − α · β · v_k)
x_{k+1} = x_k − α · v_{k+1}

Convergence: for convex + L-smooth, O(1/k²) rate — provably optimal for any first-order method (Nemirovski-Yudin 1983 lower bound). For strongly convex, achieves the accelerated rate (1 − √(µ/L))^k.

Equivalent formulations. Sutskever 2013 popularized the deep-learning-friendly version:

y_k = x_k + β · (x_k − x_{k−1})    (look-ahead)
x_{k+1} = y_k − α · ∇f(y_k)

These are mathematically equivalent up to reparametrization.

Why it matters: before NAG, no first-order method achieved O(1/k²) for convex problems. NAG is a foundational result in convex optimization — used in FISTA (Beck + Teboulle 2009) for L1-regularized problems, ISTA-family proximal methods, and many ML solvers.


8. AdaGrad (Duchi, Hazan + Singer 2011)

Per-parameter adaptive learning rate — scale α by accumulated past squared gradients (element-wise):

G_k = G_{k−1} + ∇f(x_k) ⊙ ∇f(x_k)    (element-wise square)
x_{k+1} = x_k − α / (√G_k + ε) · ∇f(x_k)

ε ≈ 1e−8 for numerical stability. α typical 0.01.

Intuition: parameters with large historical gradients (frequently active features) get smaller effective step sizes; parameters with small gradients (rare features) get larger steps. Adapts especially well to sparse features (NLP bag-of-words, recommender systems with rare items).

Convergence: O(1/√k) regret in online convex optimization (Duchi 2011 JMLR). Notable theoretical guarantees in the adversarial online setting.

Drawback: G_k grows monotonically → effective LR decays monotonically → progress stalls in long training. Motivates RMSprop and Adam.


9. RMSprop (Tieleman + Hinton 2012)

Introduced by Geoffrey Hinton in his Coursera lecture 6e (2012), never formally published. Replaces AdaGrad’s accumulation with an exponential moving average of squared gradients:

G_k = γ · G_{k−1} + (1 − γ) · ∇f(x_k) ⊙ ∇f(x_k)
x_{k+1} = x_k − α / (√G_k + ε) · ∇f(x_k)

γ typical 0.99 or 0.9. ε ≈ 1e−8. α typical 1e−3.

Fix vs AdaGrad: the moving average has a “forgetting” effect so G_k does not blow up; effective LR stays roughly constant. Works well for non-stationary objectives (online learning, RL) and was widely used in RNN training before Adam took over.


10. Adam (Adaptive Moment Estimation, Kingma + Ba 2014)

The de facto default optimizer for deep learning since 2015. Published at ICLR 2015 (paper title: “Adam: A Method for Stochastic Optimization”). Combines momentum (first moment) and RMSprop (second moment):

m_k = β₁ · m_{k−1} + (1 − β₁) · ∇f(x_k)              (first moment / momentum)
v_k = β₂ · v_{k−1} + (1 − β₂) · ∇f(x_k) ⊙ ∇f(x_k)     (second moment / RMSprop)

Initialize m_0 = v_0 = 0. Since m, v start near zero (biased toward zero in early steps), bias-correct:

m̂_k = m_k / (1 − β₁^k)
v̂_k = v_k / (1 − β₂^k)

Update:

x_{k+1} = x_k − α · m̂_k / (√v̂_k + ε)

Defaults (from the paper): β₁ = 0.9, β₂ = 0.999, ε = 1e−8, α = 1e−3.

Why it works in practice:

  • Per-parameter adaptive LR (handles different feature scales).
  • Momentum (smooths gradient noise, accelerates).
  • Bias correction (well-behaved from step 1).
  • Robust to LR choice across many problems.

Theory caveat: the original Adam paper had a convergence-proof gap (Reddi 2018 “On the Convergence of Adam and Beyond”). Adam can fail to converge on a simple convex example. AMSGrad fixes this in theory; in practice, Adam works fine for non-convex deep learning despite the gap.


11. AdamW (Loshchilov + Hutter 2017/2019)

Decoupled weight decay. Standard Adam with L2 regularization adds λ · x to the gradient, then divides by √v̂ — so the regularization strength is inversely scaled by gradient magnitude, which is incorrect.

AdamW decouples weight decay from the gradient update:

m_k = β₁ · m_{k−1} + (1 − β₁) · ∇f(x_k)
v_k = β₂ · v_{k−1} + (1 − β₂) · ∇f(x_k) ⊙ ∇f(x_k)
m̂_k = m_k / (1 − β₁^k)
v̂_k = v_k / (1 − β₂^k)
x_{k+1} = x_k − α · m̂_k / (√v̂_k + ε) − α · λ · x_k

The weight-decay term λ · x is applied directly to x, not through the gradient and not scaled by v̂.

Published: ICLR 2019 (“Decoupled Weight Decay Regularization”). Defaults: λ ∈ [1e−2, 1e−1] for large transformers, β₁ = 0.9, β₂ = 0.95 (often) or 0.999, α tuned with LR schedule.

Now the default in modern training: BERT (2018 actually used Adam, but the modern recipe is AdamW), GPT-2/3/4, LLaMA, T5, almost all transformer pre-training.


12. Lion (Chen, Liang et al. 2023, Google)

EvoLved Sign Momentum. Discovered via symbolic search over optimizer programs (NeurIPS 2023). Sign-based update with momentum:

c = β₁ · m_{k−1} + (1 − β₁) · ∇f(x_k)
x_{k+1} = x_k − α · sign(c) − α · λ · x_k
m_k = β₂ · m_{k−1} + (1 − β₂) · ∇f(x_k)

Defaults: β₁ = 0.9, β₂ = 0.99, λ ≈ 3× AdamW’s weight decay, α ≈ AdamW’s α / 3 to / 10.

Why it matters:

  • Half the memory of Adam (no v second-moment buffer — only m).
  • Matches or exceeds AdamW on vision and language at scale.
  • The sign function produces uniform step magnitudes per parameter, similar in spirit to signSGD.

Adoption is growing but AdamW remains the dominant default as of 2025.


13. Other Notable Optimizers

13.1 AMSGrad (Reddi 2018)

Fixes Adam’s convergence-proof gap by using max of past v_k instead of the EMA — i.e., the effective second moment is non-decreasing like AdaGrad. Theoretically sound; in practice rarely outperforms Adam.

13.2 RAdam (Liu 2019)

“Rectified Adam.” Adam’s variance estimate has high variance in early steps (warmup territory). RAdam computes the variance correction factor analytically and skips adaptive step until variance is reliable. Often eliminates the need for manual warmup.

13.3 AdamP (Heo 2020)

Adds a projection step to prevent the norm of weights from inflating in scale-invariant layers (batchnorm + linear).

13.4 LAMB (You 2019)

“Layer-wise Adaptive Moments.” Adam with layer-wise gradient normalization — divide the update by ‖update‖/‖weight‖ ratio per layer. Enables huge batch sizes (32k+) for BERT pre-training (74-min BERT training on TPU pod). Used heavily in early LLM scaling.

13.5 Adafactor (Shazeer + Stern 2018)

Memory-efficient Adam variant. Instead of storing v as a full second-moment tensor, factor it as a row × column outer-product → O(d) memory instead of O(d²) for matrix-shaped parameters. Used in T5, PaLM. Essential for training models with billions of parameters when GPU memory is tight.

13.6 Shampoo (Gupta, Koren + Singer 2018)

Preconditioned method using block-diagonal approximation to the Gauss-Newton / Hessian. Second-order in spirit; per-step cost higher than Adam but per-step progress better. Resurging in 2024 (Distributed Shampoo at Google) for very large model training.

13.7 NAdam (Dozat 2016)

Nesterov + Adam. Apply the Nesterov look-ahead trick inside Adam. Marginal improvement over Adam in some settings.

13.8 Sophia (Liu 2023)

“Second-order Clipped Stochastic Optimization.” Uses a cheap diagonal Hessian estimate (via Hutchinson’s stochastic trace estimator) to precondition the gradient, with element-wise clipping. Claimed 2× speedup over AdamW on LLM pre-training. Adoption ongoing.

13.9 signSGD (Bernstein 2018)

x_{k+1} = x_k − α · sign(∇f(x_k)). Theoretically interesting for distributed training (extreme gradient compression — 1 bit per parameter). Ancestor of Lion.


14. Learning Rate Schedules

The learning rate is rarely held constant; modern training uses schedules that evolve α over the course of training.

14.1 Step decay

α_k = α₀ · γ^{⌊k / T⌋} for some γ ∈ (0, 1) (typically 0.1) and decay interval T (epochs). Classic in vision: ResNet on ImageNet decays LR by 10× at epochs 30, 60, 90.

14.2 Exponential decay

α_k = α₀ · γ^k. Smoother version of step decay.

14.3 Polynomial decay

α_k = α₀ · (1 − k / T)^p for some p ≥ 1. Used in BERT (p = 1, linear decay).

14.4 Cosine annealing (Loshchilov + Hutter 2016, ICLR 2017)

α_k = α_min + ½ · (α_max − α_min) · (1 + cos(π · k / T))

Smoothly decays from α_max to α_min over T steps. Standard in modern transformer training. Empirically outperforms step / exponential decay; smooth final approach to small LR.

14.5 Cosine with warm restarts (SGDR, same paper)

Periodic restarts: cosine over T_0 steps, then reset to α_max and decay over 2·T_0, then 4·T_0, etc. Often helps escape local minima.

14.6 Linear warmup

First T_warm steps (typically 5–10% of total training, or 500–4000 steps for LLMs): α ramps linearly from 0 (or near-0) to α_max. Essential for transformers — adaptive optimizers like Adam have unreliable variance estimates in early steps; without warmup, early steps can produce destructive updates and training diverges.

14.7 1cycle / superconvergence (Smith 2017, 2018)

Single up-down cycle: warmup to α_max over half of training, decay to α_min < α₀ over the other half. Combined with momentum cycling (down then up). Demonstrated “superconvergence” — training ResNet on CIFAR-10 in 70 epochs at high LR.

14.8 Trapezoidal / WSD (warmup-stable-decay)

Three phases: warmup → constant α_max → decay. Popular for LLM pre-training because the “stable” phase allows extending training (just continue at α_max and decay later), avoiding the need to re-train from scratch.

14.9 Inverse-square-root (Vaswani 2017, “Attention Is All You Need”)

α_k = α₀ · min(1/√k, k · 1/√(T_warm³))

Originally used for the original Transformer. Largely superseded by cosine + warmup.


15. Convergence Theory

15.1 Convex + L-smooth

  • Plain GD, α = 1/L: f(x_k) − f≤ L · ‖x_0 − x‖² / (2k) = O(1/k).
  • Nesterov, α = 1/L: f(x_k) − f≤ 2L · ‖x_0 − x‖² / (k + 1)² = O(1/k²).

15.2 Strongly convex (µ-strongly convex) + L-smooth

Let κ = L / µ (condition number).

  • Plain GD: ‖x_k − x*‖² ≤ (1 − 1/κ)^k · ‖x_0 − x*‖² (linear rate).
  • Nesterov: ‖x_k − x*‖² ≤ (1 − 1/√κ)^k · ‖x_0 − x*‖² (accelerated linear rate). For ill-conditioned problems (κ = 10^6), Nesterov needs √κ = 1000 iterations, plain GD needs κ = 10^6 — orders of magnitude faster.

15.3 SGD

  • Strongly convex, decreasing α_k = O(1/k): E[‖x_k − x*‖²] = O(1/k) — variance-dominated rate; constant determined by gradient noise.
  • Convex, α_k = O(1/√k) with averaging: E[f(x̄_k) − f*] = O(1/√k).
  • Interpolating models (f_i(x*) = 0 for all i): SGD with constant α achieves the same rate as full GD (Vaswani 2019, Ma 2018) — important for overparameterized neural networks where the training loss can reach zero.

15.4 Non-convex + L-smooth

  • GD, α = 1/L: min_{i ≤ k} ‖∇f(x_i)‖² ≤ 2L · (f(x_0) − f*) / k = O(1/k) (in terms of gradient norm).
  • Converges to a stationary point — could be a saddle point in principle, but Lee et al. 2016 + Jin et al. 2017: GD almost surely escapes strict saddle points; perturbed GD escapes efficiently. Practical implication: SGD’s noise also escapes saddles in poly time.

15.5 Sharp vs Flat Minima (Hochreiter + Schmidhuber 1997; Keskar 2016)

Flat minima (large basin of low loss around the optimum) generalize better than sharp minima (deep narrow well). Large-batch training tends to find sharp minima; small-batch SGD’s noise prefers flat minima. Linear scaling rule + warmup partially recovers small-batch generalization at large batch.

15.6 SGD implicit regularization

  • Smith + Le 2018 (“Don’t Decay the Learning Rate, Increase the Batch Size”): SGD with constant α and small batch is equivalent to Langevin dynamics on a regularized loss — the noise injects a temperature ~ α / |B|.
  • Zhang 2017 (“Understanding Deep Learning Requires Rethinking Generalization”): deep nets can memorize random labels; generalization comes from algorithm (SGD) + data + architecture, not capacity alone.

16. Practical Tips

16.1 Tune LR first

The learning rate is the #1 hyperparameter. Before tuning batch size, optimizer choice, schedule, or weight decay, sweep α on a log scale. Order of magnitude matters far more than fine-tuning.

16.2 Warmup for transformers

Without warmup, transformer training routinely diverges at step ≤ 1000 due to early-step Adam variance estimate being noisy. Standard: linear warmup over 2000–10000 steps, or 1–5% of total training.

16.3 The 2024 LLM recipe

AdamW + cosine schedule + linear warmup + gradient clipping (norm 1.0) + bf16 mixed-precision + weight decay 0.1. β₁ = 0.9, β₂ = 0.95, ε = 1e−8. Peak LR scales inversely with √d_model (or per Chinchilla / scaling-law-tuned heuristics).

16.4 Monitor loss curves

  • Training loss + validation loss together.
  • Plateau in training loss → decay LR, or schedule has already done so.
  • Validation loss diverging from training → overfitting; add regularization, more data, or early stop.
  • Sudden spike → numerical instability (NaN gradients, exploding activation). Lower LR, clip gradients harder, check data preprocessing.

16.5 Mixed precision

bf16 (Google TPU, Ampere+ GPU) is generally preferred over fp16 — same dynamic range as fp32, no loss scaling needed. fp16 (older GPUs) needs loss scaling (multiply loss by 2^k, divide gradients back) to avoid underflow, and gradient clipping to avoid overflow. Master weights in fp32, computations in fp16/bf16.

16.6 Per-parameter LR caveat

Adam’s per-parameter adaptive LR is a strength but also a weakness: it loses gradient direction information (each component is scaled separately). SGD with momentum is sometimes preferred for vision (e.g., the original ResNet recipe) where the direction matters more than the per-axis scaling.

16.7 LR finder (Smith 2017)

Sweep α exponentially from 1e−8 to 1e1 over a few hundred mini-batches. Plot loss vs LR (log scale). Pick α slightly below the point where loss starts diverging (typically α_max ≈ 10× the divergence point’s α / 10). Practical, fast, and widely used in fastai-style training.

16.8 Optimizer state memory

  • SGD: 1× model size (gradients).
  • SGD + momentum: 2× model size (gradients + momentum).
  • Adam / AdamW: 3× model size (gradients + m + v). For an 8B param model in bf16: 16 GB params + 16 GB grad + 16 GB m + 16 GB v + 32 GB fp32 master = ~96 GB.
  • Lion: 2× model size (no v). Saves ~25% memory vs AdamW.
  • Adafactor: ~1× plus O(√d) for the factored second moment. Critical at >10B params.

17. Gradient Clipping

Prevent destructive updates from outlier gradient magnitudes:

Norm clipping (global): g ← g · min(1, c / ‖g‖) (typically c = 1.0)

Per-parameter clipping: g_i ← clip(g_i, −c, c) (less common, can break direction)

Why critical:

  • RNNs (LSTM, GRU): backprop through time amplifies gradients; norms can be 10^6+ for long sequences. Without clipping, training is unusable.
  • Transformer training: even with attention, early-step outliers can spike. Standard clip norm 1.0.
  • RL with policy gradients: action variance can produce huge gradients on rare events. Clip to 0.5.

Adaptive gradient clipping (AGC, Brock 2021): clip per-parameter relative to the parameter norm — used in NFNets (high-performance ResNet replacement without batchnorm).


18. Distributed Training

18.1 Data parallelism (single-host)

PyTorch DataParallel (single-process, single-host, multi-GPU): split batch across GPUs, average gradients via gather → scatter. Now mostly superseded by DDP.

18.2 DistributedDataParallel (DDP)

Multi-process, multi-host. NCCL all-reduce on gradients after backward. Each GPU has full model copy; each computes gradient on its slice of the batch; all-reduce averages gradients. Linear scaling up to network-bandwidth limit (typically 64–256 GPUs before all-reduce dominates).

18.3 ZeRO (Rajbhandari 2020, DeepSpeed)

Partition optimizer state, gradients, and parameters across data-parallel ranks.

  • ZeRO-1: partition optimizer state (m, v in Adam). Saves 4× memory.
  • ZeRO-2: partition gradients too. Saves 8×.
  • ZeRO-3: partition parameters too. Saves N× (where N = data-parallel size), but adds all-gather before forward/backward.
  • ZeRO-Infinity: offload to CPU / NVMe. Trains models far larger than aggregate GPU memory.

18.4 FSDP (Fully Sharded Data Parallel, PyTorch 2021)

PyTorch-native equivalent of ZeRO-3. Now the standard for LLM training in PyTorch.

18.5 Tensor parallelism (Megatron-LM 2019)

Split individual layer weights across GPUs — e.g., the attention QKV projection’s output dimension split across 8 GPUs. Required for models > 70B that don’t fit on a single GPU. Adds all-reduce inside each layer’s forward/backward.

18.6 Pipeline parallelism (GPipe 2018, PipeDream 2019)

Split layers across GPUs (e.g., GPUs 0–7 hold layers 1–10, GPUs 8–15 hold layers 11–20). Micro-batch through the pipeline to keep all stages busy. Bubble overhead non-trivial.

18.7 3D parallelism

Data × tensor × pipeline. Used for >100B-param training (Megatron-Turing NLG 530B, GPT-3 175B, GPT-4, etc.). Modern practice: FSDP for data sharding + tensor parallelism for layer sharding, optional pipeline for very deep models.


19. Convergence + Generalization

19.1 SGD as MCMC / Langevin dynamics

Mandt 2017, Welling + Teh 2011: SGD with constant α and gradient noise approximates Langevin dynamics:

dx = −∇f(x) dt + √(2T) · dW

with effective temperature T = α · σ² / 2 · |B|^{-1} (σ² = gradient variance per sample). SGD’s stationary distribution is Boltzmann ∝ exp(−f(x) / T), concentrating on wider minima (which have larger entropic volume). Theoretical foundation for why SGD generalizes better than full-batch optimization.

19.2 Loss landscape visualization (Li 2018)

“Visualizing the Loss Landscape of Neural Nets” — random 2D projections of trained ResNet loss surfaces. Surprising findings:

  • Neural net loss surfaces are far less rugged than feared.
  • Skip connections (ResNet) flatten the landscape compared to plain CNNs.
  • Wider networks have more benign landscapes.

19.3 Lottery ticket hypothesis (Frankle + Carbin 2019)

Dense neural networks contain sparse subnetworks (“winning tickets”) that, when trained in isolation from the same initialization, reach comparable accuracy. Implication: full-network training is largely searching for the right sparse subnetwork to train; SGD is good at this search.

19.4 Neural Tangent Kernel (NTK, Jacot 2018)

In the infinite-width limit, neural network training reduces to kernel regression with a fixed kernel — and SGD on that kernel converges linearly. Bridges optimization theory and deep learning for the lazy / NTK regime. Doesn’t fully explain feature learning at finite width.

19.5 Double descent (Belkin 2019)

Test error decreases, then increases, then decreases again as model capacity grows past the interpolation threshold. SGD’s implicit bias toward low-norm solutions in the overparameterized regime explains the second descent. Contradicts classical bias-variance theory; foundational result for modern overparameterized learning theory.


20. Cross-References


21. Citations

  • Robbins + Monro 1951. “A Stochastic Approximation Method.” Annals of Mathematical Statistics 22(3). Origin of SGD.
  • Cauchy 1847. “Méthode générale pour la résolution des systèmes d’équations simultanées.” Comptes Rendus. First explicit gradient method.
  • Polyak 1964. “Some methods of speeding up the convergence of iteration methods.” USSR Computational Mathematics. Heavy-ball momentum.
  • Nesterov 1983. “A method of solving a convex programming problem with convergence rate O(1/k²).” Soviet Mathematics Doklady 27.
  • Armijo 1966. “Minimization of functions having Lipschitz continuous first partial derivatives.” Pacific Journal of Mathematics.
  • Wolfe 1969. “Convergence conditions for ascent methods.” SIAM Review 11.
  • Polyak 1987. Introduction to Optimization. Polyak step size.
  • Duchi, Hazan + Singer 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” JMLR 12. AdaGrad.
  • Tieleman + Hinton 2012. “Lecture 6e — RMSprop.” Coursera Neural Networks for Machine Learning. Unpublished.
  • Kingma + Ba 2014. “Adam: A Method for Stochastic Optimization.” ICLR 2015. arXiv:1412.6980.
  • Reddi, Kale + Kumar 2018. “On the Convergence of Adam and Beyond.” ICLR 2018. AMSGrad.
  • Loshchilov + Hutter 2017/2019. “Decoupled Weight Decay Regularization.” ICLR 2019. arXiv:1711.05101. AdamW.
  • Loshchilov + Hutter 2016/2017. “SGDR: Stochastic Gradient Descent with Warm Restarts.” ICLR 2017. arXiv:1608.03983. Cosine annealing.
  • Chen, Liang et al. 2023. “Symbolic Discovery of Optimization Algorithms.” NeurIPS 2023. arXiv:2302.06675. Lion.
  • Smith 2017. “Cyclical Learning Rates for Training Neural Networks.” WACV 2017. 1cycle.
  • Smith 2018. “A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay.” arXiv:1803.09820.
  • Sutskever, Martens, Dahl + Hinton 2013. “On the importance of initialization and momentum in deep learning.” ICML 2013. Practical NAG for deep learning.
  • You, Li + Hseu 2019. “Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes.” ICLR 2020. LAMB.
  • Shazeer + Stern 2018. “Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.” ICML 2018.
  • Gupta, Koren + Singer 2018. “Shampoo: Preconditioned Stochastic Tensor Optimization.” ICML 2018.
  • Dozat 2016. “Incorporating Nesterov Momentum into Adam.” ICLR Workshop. NAdam.
  • Liu, Li + Wang 2023. “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training.” arXiv:2305.14342.
  • Liu, Jiang + Sun 2019. “On the Variance of the Adaptive Learning Rate and Beyond.” ICLR 2020. RAdam.
  • Heo, Chun + Oh 2020. “AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights.” ICLR 2021.
  • Rajbhandari, Rasley + Ruwase 2020. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” SC20. DeepSpeed.
  • Goyal et al. 2017. “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.” arXiv:1706.02677. Linear scaling + warmup.
  • McCandlish, Kaplan + Amodei 2018. “An Empirical Model of Large-Batch Training.” arXiv:1812.06162.
  • Hochreiter + Schmidhuber 1997. “Flat Minima.” Neural Computation 9(1).
  • Keskar et al. 2016. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.” ICLR 2017.
  • Smith + Le 2018. “A Bayesian Perspective on Generalization and Stochastic Gradient Descent.” ICLR 2018.
  • Zhang et al. 2017. “Understanding Deep Learning Requires Rethinking Generalization.” ICLR 2017.
  • Lee et al. 2016. “Gradient Descent Only Converges to Minimizers.” COLT 2016.
  • Jin et al. 2017. “How to Escape Saddle Points Efficiently.” ICML 2017.
  • Frankle + Carbin 2019. “The Lottery Ticket Hypothesis.” ICLR 2019.
  • Jacot, Gabriel + Hongler 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” NeurIPS 2018.
  • Belkin et al. 2019. “Reconciling modern machine-learning practice and the classical bias–variance trade-off.” PNAS. Double descent.
  • Li et al. 2018. “Visualizing the Loss Landscape of Neural Nets.” NeurIPS 2018.
  • Bernstein et al. 2018. “signSGD: Compressed Optimisation for Non-Convex Problems.” ICML 2018.
  • Brock, De + Smith 2021. “High-Performance Large-Scale Image Recognition Without Normalization.” ICML 2021. NFNets + AGC.
  • Vaswani et al. 2017. “Attention Is All You Need.” NeurIPS 2017. Inverse-square-root schedule for Transformer.
  • Vaswani, Bach + Schmidt 2019. “Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron.” AISTATS 2019.
  • Ma, Bassily + Belkin 2018. “The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning.” ICML 2018.
  • Mandt, Hoffman + Blei 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR 18.
  • Welling + Teh 2011. “Bayesian Learning via Stochastic Gradient Langevin Dynamics.” ICML 2011.
  • Lessard, Recht + Packard 2016. “Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints.” SIAM J. Optimization.
  • Nemirovski + Yudin 1983. Problem Complexity and Method Efficiency in Optimization. Wiley. Lower bound on first-order methods.
  • Beck + Teboulle 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM J. Imaging Sciences. FISTA.
  • Nesterov 2004. Introductory Lectures on Convex Optimization: A Basic Course. Springer.