Optimization Methods — Cross-Cutting Comparison

This note compares every optimization method discussed in the Math library — exact (LP / QP / MIP / SDP), convex first-order (gradient descent, accelerated, proximal, ADMM), non-convex stochastic (SGD, Adam, AdamW, Lion, Muon, Shampoo, SOAP, LAMB), evolutionary (GA, DE, CMA-ES, NSGA-II), Bayesian (GP-EI, Hyperband, BOHB, Optuna TPE, SMAC), and derivative-free (Nelder-Mead, COBYLA, Powell) — on the axes that decide which one fits a real problem: convexity required, smoothness required, derivative availability, scalability, batch/streaming, parallelism, and memory footprint. Use the table-by-axis structure to triage; decision tree at the bottom.

See also

1. The taxonomy

EXACT (polynomial-time on the class)         CONVEX FIRST-ORDER (gradient-based)
  LP — simplex, interior-point Karmarkar       vanilla GD
  QP — active-set, interior-point              Nesterov accelerated
  MIP — branch-and-bound, cut, Benders         heavy-ball / Polyak momentum
  SDP — interior-point (SDPT3, Mosek)          proximal gradient
                                               FISTA (proximal + Nesterov)
                                               ADMM (Boyd 2011)
                                               dual decomposition
                                               coordinate descent

NON-CONVEX STOCHASTIC (deep learning)         EVOLUTIONARY / POPULATION
  SGD                                            GA — genetic algorithm
  SGD + momentum                                 DE — differential evolution
  Adam, AdamW, AdamWR                            CMA-ES (Hansen 2001)
  AdaGrad, RMSProp                               PSO — particle swarm
  Adadelta, Nadam                                NSGA-II (Deb 2002) multi-obj
  Lion (Chen et al 2023)                         MOEA/D (Zhang-Li 2007) multi-obj
  Sophia (Liu et al 2023)
  LAMB (You et al 2019)
  Shampoo (Gupta-Koren-Singer 2018)
  SOAP (Vyas et al 2024)
  Muon (Jordan et al 2024)

BAYESIAN / SURROGATE-BASED                    DERIVATIVE-FREE / BLACK-BOX
  GP-EI (Mockus 1978; Snoek et al 2012)          Nelder-Mead (1965)
  Hyperband (Li et al 2018)                      Powell's method
  BOHB (Falkner-Klein-Hutter 2018)               COBYLA (Powell 1994)
  TPE (Optuna, Bergstra et al 2011)              Brent / golden-section (1D)
  SMAC (Hutter-Hoos-Leyton-Brown 2011)           ES (evolutionary strategies, OpenAI 2017)
  BoTorch (Facebook/Meta)
  Vizier (Google)

2. Convexity required vs not

MethodRequires convexity?Behavior on non-convex
Simplex (LP)yesundefined — LP must be convex
Interior-point (LP/QP/SDP)yesundefined
Branch-and-bound (MIP)yes (on relaxation)works (integer non-convex by nature)
Vanilla GDnoconverges to stationary point (saddle / local min)
Nesterov acceleratedrequires convexity for the optimal rateworks heuristically
ADMMrequires convex separable for convergence proofworks heuristically (deep learning, neural ODE)
SGD + Adamnode facto default in deep learning
Lion, Muon, Shampoo, SOAPnodesigned for non-convex deep learning
GA / DE / PSOnodesigned for non-convex
CMA-ESnogold standard for moderate-dim non-convex
Nelder-Meadnoworks for low-dim non-convex
Bayesian (GP-EI, TPE)nodesigned for non-convex expensive black-box

The convex/non-convex split is the first filter: if your problem is convex, you can prove global convergence and use the polynomial-time methods. If not, you live in deep-learning-land where convergence guarantees vanish and benchmarks substitute for theory.

3. Smoothness and derivative availability

MethodNeeds ∇fNeeds ∇²fSmoothness assumption
SimplexLP (piecewise linear by definition)
Interior-point— (uses ∇²)yes (or quasi-Newton)smooth convex barrier
Newtonyesyessmooth (Lipschitz ∇²)
Gauss-Newton, Levenberg-Marquardtyesstructure of ∇²smooth (least-squares)
Quasi-Newton (BFGS, L-BFGS)yesno (estimated)smooth
Vanilla GDyesnoLipschitz ∇ (smoothness constant L)
Nesterovyesnosmooth + (optionally) strongly convex
Proximal gradientyes (smooth part) + prox of non-smooth partnosmooth + non-smooth proximable
FISTAsamenosmooth + non-smooth
ADMMneeds to evaluate prox of each blocknoblock-separable
Subgradientneeds subgradientnoconvex non-smooth (or just Lipschitz f)
SGD + Adamyesnonon-smooth OK (popular for ReLU networks!)
Lion, Muonyes (sign or projected)nonon-smooth OK
Shampoo, SOAPyesno (preconditioner = second-moment of gradients)non-smooth OK
BFGS/L-BFGSyesnosmooth
CMA-ESnononone required (black-box)
Nelder-Meadnononone required
Bayesian optnononone required (assumes f is GP-smooth)
Trust-regionyesyes or Gauss-Newtonsmooth

Derivative availability is the practical question. If you have ∇f cheaply (via autograd), first-order methods are best. If ∇f is expensive (each call requires a simulation), use derivative-free or Bayesian opt. If ∇²f is cheap (small dim), use Newton-type.

4. Scalability — dimension N and data n

MethodDim N (variables)Data n (constraints / samples)Memory
Simplexup to ~10⁶ in commercial solvers (Gurobi, CPLEX)up to ~10⁶sparse Cholesky factorization
Interior-point LP/QPup to ~10⁵–10⁶up to ~10⁶needs full factor
Interior-point SDPup to ~10³ (variables in semidef matrix)up to ~10³ constraintsO(N⁴) — bottleneck
Branch-and-bound (MIP)up to ~10⁴ integer + ~10⁶ continuousup to ~10⁶ constraintsdepends on tree depth
Vanilla GDunlimited (1B+ for deep learning)full batchO(N)
Nesterovunlimitedfull batchO(N)
L-BFGSup to ~10⁹full batch typicalO(mN), m = 5–20 history
FISTAunlimitedfull batchO(N)
ADMMunlimitedblock-separableO(N) per block
SGD + Adamtrillions (DeepSeek-V3 671B, GPT-4 ~1.8T)streamed minibatchesO(N) + O(N) for moments (Adam: 3×)
LiontrillionsminibatchO(N) + O(N) (half memory of Adam)
AdamWtrillionsminibatch
LAMBtrillionsminibatch
Shampootrillions, w/ layerwise blocksminibatchlarger (preconditioner blocks)
SOAPtrillionsminibatchlarger
Muontrillionsminibatchsmaller than Adam (uses sign-like update on momentum)
CMA-ESup to ~10³ practical (~10⁴ w/ sep-CMA-ES)variesO(N²) covariance
GA / DEup to ~10³ practicalvariesO(pop × N)
Nelder-Meadup to ~30variesO(N²) simplex
Bayesian opt (GP)up to ~30 (saturated kernel)up to ~10³ observationsO(n³) GP fit
Bayesian opt (TPE)up to ~100up to ~10⁴ observationsO(n)
Hyperbandup to ~10² (HPs)up to ~10⁴ trialsminimal state

The N≈30 ceiling on Bayesian opt and Nelder-Mead is a hard practical wall — kernel-based GPs degrade in high dim, and Nelder-Mead’s simplex deflates to a singular configuration. Above 30 dim, switch to TPE / SMAC (random-forest surrogate) or population methods.

5. Parallelism and batch/streaming

MethodEmbarrassingly parallel?Online / streaming?
Simplex / interior-pointno (single solver)no (batch)
Branch-and-boundyes (subtrees)no
Vanilla GD (full batch)yes (gradient over data)no
SGD + minibatchyes (per-replica)yes
ADMMyes (per block)no
L-BFGSpartialno
Synchronous data-parallel SGDyes; gradient all-reduceyes
Asynchronous SGD (Hogwild!)yes; relaxedyes
ZeRO / FSDP (DeepSpeed, FSDP)yes (parameter sharded)yes
GA / DE / PSOyes (population)n/a
CMA-ESpartial (population eval)n/a
Bayesian optpartial (batch acquisition, q-EI)yes (sequential)
Hyperband / BOHByes (rungs)yes

For modern deep learning, data-parallel SGD with all-reduce scales to ~1024 GPUs efficiently. Beyond that, you need tensor-parallel + pipeline-parallel + ZeRO/FSDP. See cuda-triton-gpu-programming and inference-optimization.

6. Deep-learning optimizer landscape (2024–2026)

OptimizerYearWhat’s new vs AdamWhen it wins
SGD + momentum1960s (heavy-ball Polyak)baselinesmall models, well-tuned LR schedules; ResNet ImageNet
AdaGradDuchi-Hazan-Singer 2011per-parameter LRsparse features, NLP-bag-of-words
RMSPropHinton lecture 2012 (unpublished)EMA of squared gradsRNN training
AdamKingma-Ba 2015 (DiP)combine momentum + RMSPropdefault since 2015
AdamWLoshchilov-Hutter 2019decoupled weight decayNLP transformers; default since GPT-2 era
LAMBYou et al 2019layer-wise normalizationlarge-batch BERT
AdafactorShazeer-Stern 2018factored second moment, saves memoryT5 trained with this
LionChen et al 2023 (Google)sign of momentum; ½ memory of Adamcompetitive at LLM scale, much cheaper
SophiaLiu et al 2023clipped diagonal Hessian estimate2× speedup vs AdamW for LLM pretraining
ShampooGupta-Koren-Singer 2018; revived 2023block-diagonal full-matrix preconditionercompetitive at small/mid scale; expensive memory
Distributed ShampooAnil et al 2020scalable Shampooadopted at Google for some LLMs
CASPRLi et al 2024Shampoo + LR-freeresearch
SOAPVyas et al 2024Shampoo + Adam-in-eigenbasis (best of both)strong empirical, growing adoption
MuonJordan-Jin-Boza-Bernstein 2024Newton-Schulz iteration on momentum + linear-algebra trick; ½ memory of AdamLlama-style architectures, current SOTA on small/mid LLM benchmarks
PSGDLi 2018 (revisited 2024)Preconditioned SGDsmall but competitive

The 2024 lesson: Adam is no longer the obvious default. Muon, SOAP, and Shampoo are competitive or better for LLM-scale training; Lion ties Adam at half the memory. The gap is small (~5–15% loss-equivalent compute) but compounds over trillion-parameter runs.

7. Convex first-order methods — convergence rates

MethodSmooth (L)Smooth + strong convex (L, μ)Non-smooth (Lipschitz f)
Vanilla GDO(1/k)O(exp(-k/κ))O(1/√k) (with subgradient)
Nesterov acceleratedO(1/k²)O(exp(-k/√κ))n/a (Nesterov-smooth required)
Proximal gradientO(1/k) (smooth part)O(exp(-k/κ))composite (smooth + prox)
FISTAO(1/k²)accelerated ratecomposite
ADMMO(1/k)linear w/ strong convexitycomposite block-separable
Subgradientn/an/aO(1/√k)
Mirror descentmatches GD on dualmatchesO(1/√k)
Polyak heavy-ballnot always faster than GD asymptoticallysometimes optimaln/a

κ = L/μ = condition number. Nesterov 1983 achieved the lower-bound O(1/k²) rate. FISTA (Beck-Teboulle 2009) extended to the composite case. Optimal rates are information-theoretic lower bounds (Nemirovski-Yudin 1983) — no first-order method using only function + gradient queries can do better.

8. Coordinate descent + proximal methods

For separable problems (objective decomposes into per-coordinate terms), block-coordinate descent matches accelerated rates in practice with much lower per-iteration cost.

  • Coordinate Descent for Lasso (Friedman-Hastie-Tibshirani GLMNET 2010) — fastest in practice for ℓ₁-regularized linear models. Used by GLMNET, scikit-learn Lasso, R glmnet.
  • Proximal mappings — soft-thresholding for ℓ₁, projection for indicator functions, prox of group-norm for group Lasso.
  • ADMM decomposes problems with affine constraints; convergent on convex separable; heuristic on non-convex (Wang-Yin 2019 study).

9. Combinatorial optimization

For discrete decision variables: MIP, set cover, TSP, scheduling, network flow.

ApproachWhen it worksReal solver
LP relaxation + roundingstructure permitstextbook
Branch-and-boundexact, exponential worst-caseGurobi, CPLEX, COIN-OR CBC, SCIP, HiGHS
Branch-and-cutadds cutting planesGurobi, CPLEX
Benders decompositionstructure with complicating variablesCPLEX Benders, Gurobi
Column generationmany variables, few binding constraintsCPLEX, FICO Xpress
Cutting-plane (Gomory, Chvátal-Gomory, MIR)strengthens LP relaxationmost MIP solvers
Lagrangian relaxationdualize hard constraintsresearch / specialized
Constraint programmingscheduling, allocation w/ logic constraintsGoogle OR-Tools CP-SAT, IBM CP Optimizer
Local search (TS, SA)NP-hard heuristicLocalSolver, OR-Tools
MILP solvers as black boxmost productionGurobi 11.x (2024), CPLEX 22.x

See combinatorial-optimization for set cover, TSP, max-flow / min-cut, vehicle routing, scheduling.

10. Bayesian optimization and HPO

For expensive black-box functions (each evaluation costs minutes-hours; e.g., training an ML model):

MethodSurrogate modelAcquisition functionLibrary
GP-EI (Mockus 1978, Snoek-Larochelle-Adams 2012)Gaussian processexpected improvementBoTorch, GPyOpt
GP-UCBGPupper confidence boundBoTorch
TPE (Bergstra et al 2011)tree-structured Parzen estimatorquantile-basedOptuna (default), Hyperopt
SMAC (Hutter-Hoos-Leyton-Brown 2011)random forestEISMAC3
Hyperband (Li et al 2018)n/a (multi-fidelity)successive halvingOptuna, Ray Tune
BOHB (Falkner-Klein-Hutter 2018)TPE + HyperbandEI w/ multi-fidelityHpBandSter
Vizier (Google)proprietaryEI variantsGoogle internal + Open Source Vizier
ASHAn/aasync successive halvingRay Tune
PBT (DeepMind 2017)n/a — population-basedlearn scheduleRay, AutoML libraries

In 2025, Optuna with TPE + ASHA pruner is the most common open-source HPO stack. Ray Tune dominates for distributed HPO. Sweeps in Weights & Biases for managed HPO.

11. Multi-objective optimization

For multiple conflicting objectives:

MethodApproachOutput
Weighted-sum scalarizationreduce to single-objone Pareto point per weight
ε-constraintconstrain all but oneone Pareto point per ε
NSGA-II (Deb et al 2002)population + non-dominated sorting + crowdingfull Pareto front
NSGA-III (Deb-Jain 2014)reference-direction NSGAbetter for ≥3 objectives
MOEA/D (Zhang-Li 2007)decomposition-basedfull front
ParEGO (Knowles 2006)BoTorch-style Bayesian multi-objfront via random scalarization
qEHVI (qExpected Hypervolume Improvement)BoTorchbatched Bayesian multi-obj
Goal programmingtargeted”best” wrt targets

NSGA-II is the default for evolutionary multi-objective (cited 50k+ times). qEHVI is the Bayesian-multiobjective state of the art.

12. Riemannian optimization

For optimization on manifolds (Stiefel, Grassmann, sphere, SPD, hyperbolic):

MethodManifold version
Riemannian GDgradient + retraction back to manifold
Riemannian Adam (Becigneul-Ganea 2019)Adam on manifold
Trust-region (RTR, Absil-Mahony-Sepulchre 2008)Riemannian Newton-like
Pymanopt, Manopt.jl, GeoOptPython/Julia/PyTorch libraries

Used for: orthogonal matrices (Stiefel) for neural net weights, SPD matrices in BCI / EEG, hyperbolic embeddings for hierarchy data, dictionary learning. See riemannian-optimization and lie-groups-so3-se3.

13. Reinforcement-learning-style optimization

When the objective is the return of a sequential decision process:

MethodWhen
REINFORCE / policy gradientsmall policy networks
Actor-criticstable when value function is learnable
TRPO (Schulman et al 2015)trust-region; expensive
PPO (Schulman et al 2017)de facto default for RLHF
SAC (Haarnoja et al 2018)continuous control, max-entropy
DDPG / TD3continuous control, deterministic policy
iLQR / DDPmodel-based, trajectory opt
MPC w/ shootingmodel-based, finite horizon
GRPO (group relative policy optimization, DeepSeek 2024)post-training LLMs; no value model
DPO (Rafailov et al 2023)direct preference optimization for RLHF
RLOO (Ahmadian et al 2024)RL on language models

See reinforcement-learning-theory and fine-tuning-rlhf.

14. Memory footprint comparison (LLM training era)

OptimizerState per parameterExample: 70B model (fp32)
SGD00 GB
SGD + momentum1 (1× param)280 GB
Adam / AdamW2 (m + v)560 GB
AdaFactor≪ 2 (factored)~70 GB
Lion1 (momentum only)280 GB
Shampoo (block-diagonal)varies; can be largerdepends on block size
SOAPlarger than Adam (preconditioner)depends
Muon1 (momentum)280 GB
8-bit Adam (bnb)2× int8~140 GB
4-bit Adam2× int4~70 GB

Optimizer state often dominates GPU memory for trillion-parameter training. ZeRO Stage 3 / FSDP shards optimizer state across GPUs. 8-bit Adam (bitsandbytes, Tim Dettmers 2021) is widely used.

15. Numerical solvers — the practical ecosystem

ToolClassLanguages
GurobiLP / QP / MIP / MIQPC, Python, Julia, R
CPLEXLP / QP / MIP / CPC, Python, Java
MosekLP / QP / SOCP / SDP / MIPC, Python, MATLAB
COPTLP / QP / SDP / MIPC, Python
HiGHS (open)LP / QP / MIPC, Python, R, Julia
SCIP (open)MIP, MINLPC, Python
CBC (COIN-OR)LP / MIPC, Python
OR-Tools (Google)CP-SAT, MIP, routingC, Python, Java
CVXPY (Diamond-Boyd)disciplined convex programmingPython
JuMP.jlmathematical programming DSLJulia
scipy.optimizefirst-order, derivative-free, root-findingPython
PyTorch / JAX optimizersautograd-based first-orderPython
Optuna / Ray Tune / HyperoptHPO / Bayesian optPython
Pymanopt / ManoptRiemannian optPython / MATLAB

The 2025 commercial-MIP leader is Gurobi 11; the academic-OSS leader is HiGHS. CVXPY is the modeling-language default for convex optimization in Python.

16. Decision tree — pick by problem class

What's the problem?
├─ Linear program (linear obj, linear constraints)
│    → Simplex (Gurobi/CPLEX/HiGHS) for sparse, large-scale; interior-point for dense.
├─ Quadratic program (quadratic obj, linear constraints)
│    → QP active-set or interior-point (Gurobi, OSQP, qpOASES, MOSEK).
├─ Semidefinite program (linear matrix inequalities)
│    → Interior-point (Mosek, SDPT3, SeDuMi); CVXPY for modeling.
├─ Mixed-integer linear (some variables integer)
│    → Branch-and-cut (Gurobi, CPLEX, SCIP, HiGHS).
│    → If structure: Benders, column generation.
├─ Convex non-smooth (regularized regression, Lasso, group-Lasso)
│    → Coordinate descent (GLMNET); FISTA; ADMM.
├─ Convex smooth large-scale (logistic regression, deep linear)
│    → L-BFGS (small-mid); SGD/Adam (large).
├─ Non-convex deep learning
│    → AdamW for most NLP; SGD+momentum for ResNets.
│    → Try Muon, SOAP, Shampoo for SOTA on small-mid LLMs.
│    → Lion if memory-constrained.
├─ Expensive black-box (CFD, lab experiment, neural arch search, ML hyperparams)
│    → Bayesian opt (BoTorch, Optuna TPE, SMAC).
│    → If multi-fidelity: Hyperband / BOHB / ASHA.
├─ Sequential decision (RL, control)
│    → PPO (default LLM-RLHF), SAC (continuous control), iLQR / MPC (model-based).
│    → GRPO / DPO for LLM preference tuning.
├─ Population-based black-box, moderate dim (10–1000)
│    → CMA-ES (Hansen) if continuous; GA/DE if discrete.
├─ Multi-objective (2–5 objectives)
│    → NSGA-II (DEAP, pymoo) for evolutionary; qEHVI for Bayesian.
├─ Low-dim derivative-free (<30 dim)
│    → Nelder-Mead (scipy.optimize.minimize method='Nelder-Mead');
│    → COBYLA if constraints.
├─ On a Riemannian manifold (Stiefel, Grassmann, SPD)
│    → Pymanopt / GeoOpt; Riemannian Adam.
├─ Optimal control / trajectory optimization
│    → DDP / iLQR; direct collocation; MPC w/ IPOPT.
│    → CasADi for symbolic gradient; Drake for robotics.
├─ Combinatorial scheduling / routing / assignment
│    → Google OR-Tools CP-SAT, IBM CP Optimizer, LocalSolver.
└─ Game-theoretic / minimax
     → Extragradient, OGDA, optimistic-gradient; Nash-EGT for adversarial training.

17. Anti-patterns

  1. Using SGD when the problem is convex and small — L-BFGS converges 100× faster.
  2. Using Adam when SGD-momentum would do — ResNet ImageNet was famously trained with SGD-momentum; Adam is worse for it.
  3. Using Bayesian opt for cheap functions — every BO step costs O(n³) GP fit; pointless if function is cheap.
  4. Using GP-BO above ~30 dim — kernel saturates; use TPE / SMAC / random search.
  5. Solving an LP with MIP solver — MIP solver adds integer-handling overhead.
  6. Mixing optimizer with LR schedule wrong — AdamW + cosine schedule + linear warmup is the deep-learning recipe; Adam without weight decay overfits.
  7. Random search for HPO of >5 hyperparams without budget pruning — use ASHA or Hyperband.
  8. CMA-ES in 10,000 dim — covariance matrix is 10⁸ × 10⁸. Use sep-CMA or random search.

18. The 2024–2026 frontier

  • Muon (Jordan et al 2024) — half-memory Adam alternative; rapidly adopted in 2025 LLM pretraining.
  • SOAP (Vyas et al 2024) — Shampoo + Adam, combines best of both.
  • Schedule-Free SGD (Defazio et al 2024) — adaptive without LR schedule; emerging.
  • Distributed Shampoo — scaled Shampoo for multi-trillion parameter.
  • JAX-based optimization stacks — Optax (Google) is the cleanest API.
  • Gurobi 11 (2024) — significant MIP speedup (~30% over v10).
  • HiGHS 1.7 — open-source MIP catching up to commercial.
  • CP-SAT (OR-Tools) — won MiniZinc Challenge every year 2018–2024.
  • End-to-end differentiable convex layers (cvxpylayers, Agrawal-Amos-Barratt-Boyd 2019) — embed CVXPY problem inside a neural network.
  • NPB / Neural ODE optimizers — back-prop through ODE solvers.
  • Implicit gradients / OptNet (Amos-Kolter 2017) — differentiate through QP solutions.

Adjacent

When to pick what

The fastest narrowing: convex + small/medium → interior-point or L-BFGS; convex + huge → SGD or proximal-FISTA; non-convex deep learning → AdamW (or try Muon/SOAP); black-box expensive → Bayesian opt or evolutionary; discrete decisions → MIP solver (Gurobi/HiGHS) or CP-SAT; multi-objective → NSGA-II or qEHVI; sequential decision → PPO / SAC / iLQR; manifold-constrained → Pymanopt / GeoOpt. The single biggest mistake is applying SGD-class methods to problems where exact methods exist — if your problem is an LP, use a simplex solver; if it’s convex, prove it convex and use interior-point or proximal methods. SGD/Adam earn their keep on non-convex deep-learning loss surfaces where exact methods don’t apply.