Optimization Methods — Cross-Cutting Comparison
This note compares every optimization method discussed in the Math library — exact (LP / QP / MIP / SDP), convex first-order (gradient descent, accelerated, proximal, ADMM), non-convex stochastic (SGD, Adam, AdamW, Lion, Muon, Shampoo, SOAP, LAMB), evolutionary (GA, DE, CMA-ES, NSGA-II), Bayesian (GP-EI, Hyperband, BOHB, Optuna TPE, SMAC), and derivative-free (Nelder-Mead, COBYLA, Powell) — on the axes that decide which one fits a real problem: convexity required, smoothness required, derivative availability, scalability, batch/streaming, parallelism, and memory footprint. Use the table-by-axis structure to triage; decision tree at the bottom.
See also
- convex-optimization
- gradient-descent-variants
- combinatorial-optimization
- riemannian-optimization
- bayesian-inference
- reinforcement-learning-theory
- variational-inference
- optimization-algorithm-taxonomy
- fine-tuning-rlhf
- inference-optimization
1. The taxonomy
EXACT (polynomial-time on the class) CONVEX FIRST-ORDER (gradient-based)
LP — simplex, interior-point Karmarkar vanilla GD
QP — active-set, interior-point Nesterov accelerated
MIP — branch-and-bound, cut, Benders heavy-ball / Polyak momentum
SDP — interior-point (SDPT3, Mosek) proximal gradient
FISTA (proximal + Nesterov)
ADMM (Boyd 2011)
dual decomposition
coordinate descent
NON-CONVEX STOCHASTIC (deep learning) EVOLUTIONARY / POPULATION
SGD GA — genetic algorithm
SGD + momentum DE — differential evolution
Adam, AdamW, AdamWR CMA-ES (Hansen 2001)
AdaGrad, RMSProp PSO — particle swarm
Adadelta, Nadam NSGA-II (Deb 2002) multi-obj
Lion (Chen et al 2023) MOEA/D (Zhang-Li 2007) multi-obj
Sophia (Liu et al 2023)
LAMB (You et al 2019)
Shampoo (Gupta-Koren-Singer 2018)
SOAP (Vyas et al 2024)
Muon (Jordan et al 2024)
BAYESIAN / SURROGATE-BASED DERIVATIVE-FREE / BLACK-BOX
GP-EI (Mockus 1978; Snoek et al 2012) Nelder-Mead (1965)
Hyperband (Li et al 2018) Powell's method
BOHB (Falkner-Klein-Hutter 2018) COBYLA (Powell 1994)
TPE (Optuna, Bergstra et al 2011) Brent / golden-section (1D)
SMAC (Hutter-Hoos-Leyton-Brown 2011) ES (evolutionary strategies, OpenAI 2017)
BoTorch (Facebook/Meta)
Vizier (Google)
2. Convexity required vs not
| Method | Requires convexity? | Behavior on non-convex |
|---|---|---|
| Simplex (LP) | yes | undefined — LP must be convex |
| Interior-point (LP/QP/SDP) | yes | undefined |
| Branch-and-bound (MIP) | yes (on relaxation) | works (integer non-convex by nature) |
| Vanilla GD | no | converges to stationary point (saddle / local min) |
| Nesterov accelerated | requires convexity for the optimal rate | works heuristically |
| ADMM | requires convex separable for convergence proof | works heuristically (deep learning, neural ODE) |
| SGD + Adam | no | de facto default in deep learning |
| Lion, Muon, Shampoo, SOAP | no | designed for non-convex deep learning |
| GA / DE / PSO | no | designed for non-convex |
| CMA-ES | no | gold standard for moderate-dim non-convex |
| Nelder-Mead | no | works for low-dim non-convex |
| Bayesian (GP-EI, TPE) | no | designed for non-convex expensive black-box |
The convex/non-convex split is the first filter: if your problem is convex, you can prove global convergence and use the polynomial-time methods. If not, you live in deep-learning-land where convergence guarantees vanish and benchmarks substitute for theory.
3. Smoothness and derivative availability
| Method | Needs ∇f | Needs ∇²f | Smoothness assumption |
|---|---|---|---|
| Simplex | — | — | LP (piecewise linear by definition) |
| Interior-point | — (uses ∇²) | yes (or quasi-Newton) | smooth convex barrier |
| Newton | yes | yes | smooth (Lipschitz ∇²) |
| Gauss-Newton, Levenberg-Marquardt | yes | structure of ∇² | smooth (least-squares) |
| Quasi-Newton (BFGS, L-BFGS) | yes | no (estimated) | smooth |
| Vanilla GD | yes | no | Lipschitz ∇ (smoothness constant L) |
| Nesterov | yes | no | smooth + (optionally) strongly convex |
| Proximal gradient | yes (smooth part) + prox of non-smooth part | no | smooth + non-smooth proximable |
| FISTA | same | no | smooth + non-smooth |
| ADMM | needs to evaluate prox of each block | no | block-separable |
| Subgradient | needs subgradient | no | convex non-smooth (or just Lipschitz f) |
| SGD + Adam | yes | no | non-smooth OK (popular for ReLU networks!) |
| Lion, Muon | yes (sign or projected) | no | non-smooth OK |
| Shampoo, SOAP | yes | no (preconditioner = second-moment of gradients) | non-smooth OK |
| BFGS/L-BFGS | yes | no | smooth |
| CMA-ES | no | no | none required (black-box) |
| Nelder-Mead | no | no | none required |
| Bayesian opt | no | no | none required (assumes f is GP-smooth) |
| Trust-region | yes | yes or Gauss-Newton | smooth |
Derivative availability is the practical question. If you have ∇f cheaply (via autograd), first-order methods are best. If ∇f is expensive (each call requires a simulation), use derivative-free or Bayesian opt. If ∇²f is cheap (small dim), use Newton-type.
4. Scalability — dimension N and data n
| Method | Dim N (variables) | Data n (constraints / samples) | Memory |
|---|---|---|---|
| Simplex | up to ~10⁶ in commercial solvers (Gurobi, CPLEX) | up to ~10⁶ | sparse Cholesky factorization |
| Interior-point LP/QP | up to ~10⁵–10⁶ | up to ~10⁶ | needs full factor |
| Interior-point SDP | up to ~10³ (variables in semidef matrix) | up to ~10³ constraints | O(N⁴) — bottleneck |
| Branch-and-bound (MIP) | up to ~10⁴ integer + ~10⁶ continuous | up to ~10⁶ constraints | depends on tree depth |
| Vanilla GD | unlimited (1B+ for deep learning) | full batch | O(N) |
| Nesterov | unlimited | full batch | O(N) |
| L-BFGS | up to ~10⁹ | full batch typical | O(mN), m = 5–20 history |
| FISTA | unlimited | full batch | O(N) |
| ADMM | unlimited | block-separable | O(N) per block |
| SGD + Adam | trillions (DeepSeek-V3 671B, GPT-4 ~1.8T) | streamed minibatches | O(N) + O(N) for moments (Adam: 3×) |
| Lion | trillions | minibatch | O(N) + O(N) (half memory of Adam) |
| AdamW | trillions | minibatch | 3× |
| LAMB | trillions | minibatch | 3× |
| Shampoo | trillions, w/ layerwise blocks | minibatch | larger (preconditioner blocks) |
| SOAP | trillions | minibatch | larger |
| Muon | trillions | minibatch | smaller than Adam (uses sign-like update on momentum) |
| CMA-ES | up to ~10³ practical (~10⁴ w/ sep-CMA-ES) | varies | O(N²) covariance |
| GA / DE | up to ~10³ practical | varies | O(pop × N) |
| Nelder-Mead | up to ~30 | varies | O(N²) simplex |
| Bayesian opt (GP) | up to ~30 (saturated kernel) | up to ~10³ observations | O(n³) GP fit |
| Bayesian opt (TPE) | up to ~100 | up to ~10⁴ observations | O(n) |
| Hyperband | up to ~10² (HPs) | up to ~10⁴ trials | minimal state |
The N≈30 ceiling on Bayesian opt and Nelder-Mead is a hard practical wall — kernel-based GPs degrade in high dim, and Nelder-Mead’s simplex deflates to a singular configuration. Above 30 dim, switch to TPE / SMAC (random-forest surrogate) or population methods.
5. Parallelism and batch/streaming
| Method | Embarrassingly parallel? | Online / streaming? |
|---|---|---|
| Simplex / interior-point | no (single solver) | no (batch) |
| Branch-and-bound | yes (subtrees) | no |
| Vanilla GD (full batch) | yes (gradient over data) | no |
| SGD + minibatch | yes (per-replica) | yes |
| ADMM | yes (per block) | no |
| L-BFGS | partial | no |
| Synchronous data-parallel SGD | yes; gradient all-reduce | yes |
| Asynchronous SGD (Hogwild!) | yes; relaxed | yes |
| ZeRO / FSDP (DeepSpeed, FSDP) | yes (parameter sharded) | yes |
| GA / DE / PSO | yes (population) | n/a |
| CMA-ES | partial (population eval) | n/a |
| Bayesian opt | partial (batch acquisition, q-EI) | yes (sequential) |
| Hyperband / BOHB | yes (rungs) | yes |
For modern deep learning, data-parallel SGD with all-reduce scales to ~1024 GPUs efficiently. Beyond that, you need tensor-parallel + pipeline-parallel + ZeRO/FSDP. See cuda-triton-gpu-programming and inference-optimization.
6. Deep-learning optimizer landscape (2024–2026)
| Optimizer | Year | What’s new vs Adam | When it wins |
|---|---|---|---|
| SGD + momentum | 1960s (heavy-ball Polyak) | baseline | small models, well-tuned LR schedules; ResNet ImageNet |
| AdaGrad | Duchi-Hazan-Singer 2011 | per-parameter LR | sparse features, NLP-bag-of-words |
| RMSProp | Hinton lecture 2012 (unpublished) | EMA of squared grads | RNN training |
| Adam | Kingma-Ba 2015 (DiP) | combine momentum + RMSProp | default since 2015 |
| AdamW | Loshchilov-Hutter 2019 | decoupled weight decay | NLP transformers; default since GPT-2 era |
| LAMB | You et al 2019 | layer-wise normalization | large-batch BERT |
| Adafactor | Shazeer-Stern 2018 | factored second moment, saves memory | T5 trained with this |
| Lion | Chen et al 2023 (Google) | sign of momentum; ½ memory of Adam | competitive at LLM scale, much cheaper |
| Sophia | Liu et al 2023 | clipped diagonal Hessian estimate | 2× speedup vs AdamW for LLM pretraining |
| Shampoo | Gupta-Koren-Singer 2018; revived 2023 | block-diagonal full-matrix preconditioner | competitive at small/mid scale; expensive memory |
| Distributed Shampoo | Anil et al 2020 | scalable Shampoo | adopted at Google for some LLMs |
| CASPR | Li et al 2024 | Shampoo + LR-free | research |
| SOAP | Vyas et al 2024 | Shampoo + Adam-in-eigenbasis (best of both) | strong empirical, growing adoption |
| Muon | Jordan-Jin-Boza-Bernstein 2024 | Newton-Schulz iteration on momentum + linear-algebra trick; ½ memory of Adam | Llama-style architectures, current SOTA on small/mid LLM benchmarks |
| PSGD | Li 2018 (revisited 2024) | Preconditioned SGD | small but competitive |
The 2024 lesson: Adam is no longer the obvious default. Muon, SOAP, and Shampoo are competitive or better for LLM-scale training; Lion ties Adam at half the memory. The gap is small (~5–15% loss-equivalent compute) but compounds over trillion-parameter runs.
7. Convex first-order methods — convergence rates
| Method | Smooth (L) | Smooth + strong convex (L, μ) | Non-smooth (Lipschitz f) |
|---|---|---|---|
| Vanilla GD | O(1/k) | O(exp(-k/κ)) | O(1/√k) (with subgradient) |
| Nesterov accelerated | O(1/k²) | O(exp(-k/√κ)) | n/a (Nesterov-smooth required) |
| Proximal gradient | O(1/k) (smooth part) | O(exp(-k/κ)) | composite (smooth + prox) |
| FISTA | O(1/k²) | accelerated rate | composite |
| ADMM | O(1/k) | linear w/ strong convexity | composite block-separable |
| Subgradient | n/a | n/a | O(1/√k) |
| Mirror descent | matches GD on dual | matches | O(1/√k) |
| Polyak heavy-ball | not always faster than GD asymptotically | sometimes optimal | n/a |
κ = L/μ = condition number. Nesterov 1983 achieved the lower-bound O(1/k²) rate. FISTA (Beck-Teboulle 2009) extended to the composite case. Optimal rates are information-theoretic lower bounds (Nemirovski-Yudin 1983) — no first-order method using only function + gradient queries can do better.
8. Coordinate descent + proximal methods
For separable problems (objective decomposes into per-coordinate terms), block-coordinate descent matches accelerated rates in practice with much lower per-iteration cost.
- Coordinate Descent for Lasso (Friedman-Hastie-Tibshirani GLMNET 2010) — fastest in practice for ℓ₁-regularized linear models. Used by GLMNET, scikit-learn Lasso, R glmnet.
- Proximal mappings — soft-thresholding for ℓ₁, projection for indicator functions, prox of group-norm for group Lasso.
- ADMM decomposes problems with affine constraints; convergent on convex separable; heuristic on non-convex (Wang-Yin 2019 study).
9. Combinatorial optimization
For discrete decision variables: MIP, set cover, TSP, scheduling, network flow.
| Approach | When it works | Real solver |
|---|---|---|
| LP relaxation + rounding | structure permits | textbook |
| Branch-and-bound | exact, exponential worst-case | Gurobi, CPLEX, COIN-OR CBC, SCIP, HiGHS |
| Branch-and-cut | adds cutting planes | Gurobi, CPLEX |
| Benders decomposition | structure with complicating variables | CPLEX Benders, Gurobi |
| Column generation | many variables, few binding constraints | CPLEX, FICO Xpress |
| Cutting-plane (Gomory, Chvátal-Gomory, MIR) | strengthens LP relaxation | most MIP solvers |
| Lagrangian relaxation | dualize hard constraints | research / specialized |
| Constraint programming | scheduling, allocation w/ logic constraints | Google OR-Tools CP-SAT, IBM CP Optimizer |
| Local search (TS, SA) | NP-hard heuristic | LocalSolver, OR-Tools |
| MILP solvers as black box | most production | Gurobi 11.x (2024), CPLEX 22.x |
See combinatorial-optimization for set cover, TSP, max-flow / min-cut, vehicle routing, scheduling.
10. Bayesian optimization and HPO
For expensive black-box functions (each evaluation costs minutes-hours; e.g., training an ML model):
| Method | Surrogate model | Acquisition function | Library |
|---|---|---|---|
| GP-EI (Mockus 1978, Snoek-Larochelle-Adams 2012) | Gaussian process | expected improvement | BoTorch, GPyOpt |
| GP-UCB | GP | upper confidence bound | BoTorch |
| TPE (Bergstra et al 2011) | tree-structured Parzen estimator | quantile-based | Optuna (default), Hyperopt |
| SMAC (Hutter-Hoos-Leyton-Brown 2011) | random forest | EI | SMAC3 |
| Hyperband (Li et al 2018) | n/a (multi-fidelity) | successive halving | Optuna, Ray Tune |
| BOHB (Falkner-Klein-Hutter 2018) | TPE + Hyperband | EI w/ multi-fidelity | HpBandSter |
| Vizier (Google) | proprietary | EI variants | Google internal + Open Source Vizier |
| ASHA | n/a | async successive halving | Ray Tune |
| PBT (DeepMind 2017) | n/a — population-based | learn schedule | Ray, AutoML libraries |
In 2025, Optuna with TPE + ASHA pruner is the most common open-source HPO stack. Ray Tune dominates for distributed HPO. Sweeps in Weights & Biases for managed HPO.
11. Multi-objective optimization
For multiple conflicting objectives:
| Method | Approach | Output |
|---|---|---|
| Weighted-sum scalarization | reduce to single-obj | one Pareto point per weight |
| ε-constraint | constrain all but one | one Pareto point per ε |
| NSGA-II (Deb et al 2002) | population + non-dominated sorting + crowding | full Pareto front |
| NSGA-III (Deb-Jain 2014) | reference-direction NSGA | better for ≥3 objectives |
| MOEA/D (Zhang-Li 2007) | decomposition-based | full front |
| ParEGO (Knowles 2006) | BoTorch-style Bayesian multi-obj | front via random scalarization |
| qEHVI (qExpected Hypervolume Improvement) | BoTorch | batched Bayesian multi-obj |
| Goal programming | targeted | ”best” wrt targets |
NSGA-II is the default for evolutionary multi-objective (cited 50k+ times). qEHVI is the Bayesian-multiobjective state of the art.
12. Riemannian optimization
For optimization on manifolds (Stiefel, Grassmann, sphere, SPD, hyperbolic):
| Method | Manifold version |
|---|---|
| Riemannian GD | gradient + retraction back to manifold |
| Riemannian Adam (Becigneul-Ganea 2019) | Adam on manifold |
| Trust-region (RTR, Absil-Mahony-Sepulchre 2008) | Riemannian Newton-like |
| Pymanopt, Manopt.jl, GeoOpt | Python/Julia/PyTorch libraries |
Used for: orthogonal matrices (Stiefel) for neural net weights, SPD matrices in BCI / EEG, hyperbolic embeddings for hierarchy data, dictionary learning. See riemannian-optimization and lie-groups-so3-se3.
13. Reinforcement-learning-style optimization
When the objective is the return of a sequential decision process:
| Method | When |
|---|---|
| REINFORCE / policy gradient | small policy networks |
| Actor-critic | stable when value function is learnable |
| TRPO (Schulman et al 2015) | trust-region; expensive |
| PPO (Schulman et al 2017) | de facto default for RLHF |
| SAC (Haarnoja et al 2018) | continuous control, max-entropy |
| DDPG / TD3 | continuous control, deterministic policy |
| iLQR / DDP | model-based, trajectory opt |
| MPC w/ shooting | model-based, finite horizon |
| GRPO (group relative policy optimization, DeepSeek 2024) | post-training LLMs; no value model |
| DPO (Rafailov et al 2023) | direct preference optimization for RLHF |
| RLOO (Ahmadian et al 2024) | RL on language models |
See reinforcement-learning-theory and fine-tuning-rlhf.
14. Memory footprint comparison (LLM training era)
| Optimizer | State per parameter | Example: 70B model (fp32) |
|---|---|---|
| SGD | 0 | 0 GB |
| SGD + momentum | 1 (1× param) | 280 GB |
| Adam / AdamW | 2 (m + v) | 560 GB |
| AdaFactor | ≪ 2 (factored) | ~70 GB |
| Lion | 1 (momentum only) | 280 GB |
| Shampoo (block-diagonal) | varies; can be larger | depends on block size |
| SOAP | larger than Adam (preconditioner) | depends |
| Muon | 1 (momentum) | 280 GB |
| 8-bit Adam (bnb) | 2× int8 | ~140 GB |
| 4-bit Adam | 2× int4 | ~70 GB |
Optimizer state often dominates GPU memory for trillion-parameter training. ZeRO Stage 3 / FSDP shards optimizer state across GPUs. 8-bit Adam (bitsandbytes, Tim Dettmers 2021) is widely used.
15. Numerical solvers — the practical ecosystem
| Tool | Class | Languages |
|---|---|---|
| Gurobi | LP / QP / MIP / MIQP | C, Python, Julia, R |
| CPLEX | LP / QP / MIP / CP | C, Python, Java |
| Mosek | LP / QP / SOCP / SDP / MIP | C, Python, MATLAB |
| COPT | LP / QP / SDP / MIP | C, Python |
| HiGHS (open) | LP / QP / MIP | C, Python, R, Julia |
| SCIP (open) | MIP, MINLP | C, Python |
| CBC (COIN-OR) | LP / MIP | C, Python |
| OR-Tools (Google) | CP-SAT, MIP, routing | C, Python, Java |
| CVXPY (Diamond-Boyd) | disciplined convex programming | Python |
| JuMP.jl | mathematical programming DSL | Julia |
| scipy.optimize | first-order, derivative-free, root-finding | Python |
| PyTorch / JAX optimizers | autograd-based first-order | Python |
| Optuna / Ray Tune / Hyperopt | HPO / Bayesian opt | Python |
| Pymanopt / Manopt | Riemannian opt | Python / MATLAB |
The 2025 commercial-MIP leader is Gurobi 11; the academic-OSS leader is HiGHS. CVXPY is the modeling-language default for convex optimization in Python.
16. Decision tree — pick by problem class
What's the problem?
├─ Linear program (linear obj, linear constraints)
│ → Simplex (Gurobi/CPLEX/HiGHS) for sparse, large-scale; interior-point for dense.
├─ Quadratic program (quadratic obj, linear constraints)
│ → QP active-set or interior-point (Gurobi, OSQP, qpOASES, MOSEK).
├─ Semidefinite program (linear matrix inequalities)
│ → Interior-point (Mosek, SDPT3, SeDuMi); CVXPY for modeling.
├─ Mixed-integer linear (some variables integer)
│ → Branch-and-cut (Gurobi, CPLEX, SCIP, HiGHS).
│ → If structure: Benders, column generation.
├─ Convex non-smooth (regularized regression, Lasso, group-Lasso)
│ → Coordinate descent (GLMNET); FISTA; ADMM.
├─ Convex smooth large-scale (logistic regression, deep linear)
│ → L-BFGS (small-mid); SGD/Adam (large).
├─ Non-convex deep learning
│ → AdamW for most NLP; SGD+momentum for ResNets.
│ → Try Muon, SOAP, Shampoo for SOTA on small-mid LLMs.
│ → Lion if memory-constrained.
├─ Expensive black-box (CFD, lab experiment, neural arch search, ML hyperparams)
│ → Bayesian opt (BoTorch, Optuna TPE, SMAC).
│ → If multi-fidelity: Hyperband / BOHB / ASHA.
├─ Sequential decision (RL, control)
│ → PPO (default LLM-RLHF), SAC (continuous control), iLQR / MPC (model-based).
│ → GRPO / DPO for LLM preference tuning.
├─ Population-based black-box, moderate dim (10–1000)
│ → CMA-ES (Hansen) if continuous; GA/DE if discrete.
├─ Multi-objective (2–5 objectives)
│ → NSGA-II (DEAP, pymoo) for evolutionary; qEHVI for Bayesian.
├─ Low-dim derivative-free (<30 dim)
│ → Nelder-Mead (scipy.optimize.minimize method='Nelder-Mead');
│ → COBYLA if constraints.
├─ On a Riemannian manifold (Stiefel, Grassmann, SPD)
│ → Pymanopt / GeoOpt; Riemannian Adam.
├─ Optimal control / trajectory optimization
│ → DDP / iLQR; direct collocation; MPC w/ IPOPT.
│ → CasADi for symbolic gradient; Drake for robotics.
├─ Combinatorial scheduling / routing / assignment
│ → Google OR-Tools CP-SAT, IBM CP Optimizer, LocalSolver.
└─ Game-theoretic / minimax
→ Extragradient, OGDA, optimistic-gradient; Nash-EGT for adversarial training.
17. Anti-patterns
- Using SGD when the problem is convex and small — L-BFGS converges 100× faster.
- Using Adam when SGD-momentum would do — ResNet ImageNet was famously trained with SGD-momentum; Adam is worse for it.
- Using Bayesian opt for cheap functions — every BO step costs O(n³) GP fit; pointless if function is cheap.
- Using GP-BO above ~30 dim — kernel saturates; use TPE / SMAC / random search.
- Solving an LP with MIP solver — MIP solver adds integer-handling overhead.
- Mixing optimizer with LR schedule wrong — AdamW + cosine schedule + linear warmup is the deep-learning recipe; Adam without weight decay overfits.
- Random search for HPO of >5 hyperparams without budget pruning — use ASHA or Hyperband.
- CMA-ES in 10,000 dim — covariance matrix is 10⁸ × 10⁸. Use sep-CMA or random search.
18. The 2024–2026 frontier
- Muon (Jordan et al 2024) — half-memory Adam alternative; rapidly adopted in 2025 LLM pretraining.
- SOAP (Vyas et al 2024) — Shampoo + Adam, combines best of both.
- Schedule-Free SGD (Defazio et al 2024) — adaptive without LR schedule; emerging.
- Distributed Shampoo — scaled Shampoo for multi-trillion parameter.
- JAX-based optimization stacks — Optax (Google) is the cleanest API.
- Gurobi 11 (2024) — significant MIP speedup (~30% over v10).
- HiGHS 1.7 — open-source MIP catching up to commercial.
- CP-SAT (OR-Tools) — won MiniZinc Challenge every year 2018–2024.
- End-to-end differentiable convex layers (cvxpylayers, Agrawal-Amos-Barratt-Boyd 2019) — embed CVXPY problem inside a neural network.
- NPB / Neural ODE optimizers — back-prop through ODE solvers.
- Implicit gradients / OptNet (Amos-Kolter 2017) — differentiate through QP solutions.
Adjacent
- Linear algebra — linear-algebra-essentials for the matrix operations underlying every method.
- Numerical linear algebra — numerical-linear-algebra for QR / LU / Cholesky / preconditioned-CG used by interior-point.
- Eigenvalue / SVD — eigenvalue-problems, svd-pca-spectral for Hessian + preconditioner spectra.
- Convex optimization theory — convex-optimization for the duality + KKT + interior-point story.
- Gradient descent — gradient-descent-variants for the full deep-learning optimizer survey.
- Combinatorial — combinatorial-optimization for MIP, TSP, scheduling.
- Riemannian — riemannian-optimization for manifold-constrained.
- Reinforcement learning — reinforcement-learning-theory for policy gradient, value methods, RLHF.
- Variational inference — variational-inference for ELBO maximization (a non-convex optimization in disguise).
- Optimization algorithm taxonomy — optimization-algorithm-taxonomy for the full method catalog.
- GPU programming — cuda-triton-gpu-programming for kernel-level acceleration of SGD/Adam.
- Distributed training — fine-tuning-rlhf for ZeRO, FSDP, RLHF.
- Engineering control — mpc-control for MPC’s quadratic-program at the core.
When to pick what
The fastest narrowing: convex + small/medium → interior-point or L-BFGS; convex + huge → SGD or proximal-FISTA; non-convex deep learning → AdamW (or try Muon/SOAP); black-box expensive → Bayesian opt or evolutionary; discrete decisions → MIP solver (Gurobi/HiGHS) or CP-SAT; multi-objective → NSGA-II or qEHVI; sequential decision → PPO / SAC / iLQR; manifold-constrained → Pymanopt / GeoOpt. The single biggest mistake is applying SGD-class methods to problems where exact methods exist — if your problem is an LP, use a simplex solver; if it’s convex, prove it convex and use interior-point or proximal methods. SGD/Adam earn their keep on non-convex deep-learning loss surfaces where exact methods don’t apply.