Optimization Methods — Cross-Cutting Comparison

This note compares every optimization method discussed in the Math library — exact (LP / QP / MIP / SDP), convex first-order (gradient descent, accelerated, proximal, ADMM), non-convex stochastic (SGD, Adam, AdamW, Lion, Muon, Shampoo, SOAP, LAMB), evolutionary (GA, DE, CMA-ES, NSGA-II), Bayesian (GP-EI, Hyperband, BOHB, Optuna TPE, SMAC), and derivative-free (Nelder-Mead, COBYLA, Powell) — on the axes that decide which one fits a real problem: convexity required, smoothness required, derivative availability, scalability, batch/streaming, parallelism, and memory footprint. Use the table-by-axis structure to triage; decision tree at the bottom.

1. The taxonomy

EXACT (polynomial-time on the class)         CONVEX FIRST-ORDER (gradient-based)
  LP — simplex, interior-point Karmarkar       vanilla GD
  QP — active-set, interior-point              Nesterov accelerated
  MIP — branch-and-bound, cut, Benders         heavy-ball / Polyak momentum
  SDP — interior-point (SDPT3, Mosek)          proximal gradient
                                               FISTA (proximal + Nesterov)
                                               ADMM (Boyd 2011)
                                               dual decomposition
                                               coordinate descent

NON-CONVEX STOCHASTIC (deep learning)         EVOLUTIONARY / POPULATION
  SGD                                            GA — genetic algorithm
  SGD + momentum                                 DE — differential evolution
  Adam, AdamW, AdamWR                            CMA-ES (Hansen 2001)
  AdaGrad, RMSProp                               PSO — particle swarm
  Adadelta, Nadam                                NSGA-II (Deb 2002) multi-obj
  Lion (Chen et al 2023)                         MOEA/D (Zhang-Li 2007) multi-obj
  Sophia (Liu et al 2023)
  LAMB (You et al 2019)
  Shampoo (Gupta-Koren-Singer 2018)
  SOAP (Vyas et al 2024)
  Muon (Jordan et al 2024)

BAYESIAN / SURROGATE-BASED                    DERIVATIVE-FREE / BLACK-BOX
  GP-EI (Mockus 1978; Snoek et al 2012)          Nelder-Mead (1965)
  Hyperband (Li et al 2018)                      Powell's method
  BOHB (Falkner-Klein-Hutter 2018)               COBYLA (Powell 1994)
  TPE (Optuna, Bergstra et al 2011)              Brent / golden-section (1D)
  SMAC (Hutter-Hoos-Leyton-Brown 2011)           ES (evolutionary strategies, OpenAI 2017)
  BoTorch (Facebook/Meta)
  Vizier (Google)

2. Convexity required vs not

Method	Requires convexity?	Behavior on non-convex
Simplex (LP)	yes	undefined — LP must be convex
Interior-point (LP/QP/SDP)	yes	undefined
Branch-and-bound (MIP)	yes (on relaxation)	works (integer non-convex by nature)
Vanilla GD	no	converges to stationary point (saddle / local min)
Nesterov accelerated	requires convexity for the optimal rate	works heuristically
ADMM	requires convex separable for convergence proof	works heuristically (deep learning, neural ODE)
SGD + Adam	no	de facto default in deep learning
Lion, Muon, Shampoo, SOAP	no	designed for non-convex deep learning
GA / DE / PSO	no	designed for non-convex
CMA-ES	no	gold standard for moderate-dim non-convex
Nelder-Mead	no	works for low-dim non-convex
Bayesian (GP-EI, TPE)	no	designed for non-convex expensive black-box

The convex/non-convex split is the first filter: if your problem is convex, you can prove global convergence and use the polynomial-time methods. If not, you live in deep-learning-land where convergence guarantees vanish and benchmarks substitute for theory.

3. Smoothness and derivative availability

Method	Needs ∇f	Needs ∇²f	Smoothness assumption
Simplex	—	—	LP (piecewise linear by definition)
Interior-point	— (uses ∇²)	yes (or quasi-Newton)	smooth convex barrier
Newton	yes	yes	smooth (Lipschitz ∇²)
Gauss-Newton, Levenberg-Marquardt	yes	structure of ∇²	smooth (least-squares)
Quasi-Newton (BFGS, L-BFGS)	yes	no (estimated)	smooth
Vanilla GD	yes	no	Lipschitz ∇ (smoothness constant L)
Nesterov	yes	no	smooth + (optionally) strongly convex
Proximal gradient	yes (smooth part) + prox of non-smooth part	no	smooth + non-smooth proximable
FISTA	same	no	smooth + non-smooth
ADMM	needs to evaluate prox of each block	no	block-separable
Subgradient	needs subgradient	no	convex non-smooth (or just Lipschitz f)
SGD + Adam	yes	no	non-smooth OK (popular for ReLU networks!)
Lion, Muon	yes (sign or projected)	no	non-smooth OK
Shampoo, SOAP	yes	no (preconditioner = second-moment of gradients)	non-smooth OK
BFGS/L-BFGS	yes	no	smooth
CMA-ES	no	no	none required (black-box)
Nelder-Mead	no	no	none required
Bayesian opt	no	no	none required (assumes f is GP-smooth)
Trust-region	yes	yes or Gauss-Newton	smooth

Derivative availability is the practical question. If you have ∇f cheaply (via autograd), first-order methods are best. If ∇f is expensive (each call requires a simulation), use derivative-free or Bayesian opt. If ∇²f is cheap (small dim), use Newton-type.

4. Scalability — dimension N and data n

Method	Dim N (variables)	Data n (constraints / samples)	Memory
Simplex	up to ~10⁶ in commercial solvers (Gurobi, CPLEX)	up to ~10⁶	sparse Cholesky factorization
Interior-point LP/QP	up to ~10⁵–10⁶	up to ~10⁶	needs full factor
Interior-point SDP	up to ~10³ (variables in semidef matrix)	up to ~10³ constraints	O(N⁴) — bottleneck
Branch-and-bound (MIP)	up to ~10⁴ integer + ~10⁶ continuous	up to ~10⁶ constraints	depends on tree depth
Vanilla GD	unlimited (1B+ for deep learning)	full batch	O(N)
Nesterov	unlimited	full batch	O(N)
L-BFGS	up to ~10⁹	full batch typical	O(mN), m = 5–20 history
FISTA	unlimited	full batch	O(N)
ADMM	unlimited	block-separable	O(N) per block
SGD + Adam	trillions (DeepSeek-V3 671B, GPT-4 ~1.8T)	streamed minibatches	O(N) + O(N) for moments (Adam: 3×)
Lion	trillions	minibatch	O(N) + O(N) (half memory of Adam)
AdamW	trillions	minibatch	3×
LAMB	trillions	minibatch	3×
Shampoo	trillions, w/ layerwise blocks	minibatch	larger (preconditioner blocks)
SOAP	trillions	minibatch	larger
Muon	trillions	minibatch	smaller than Adam (uses sign-like update on momentum)
CMA-ES	up to ~10³ practical (~10⁴ w/ sep-CMA-ES)	varies	O(N²) covariance
GA / DE	up to ~10³ practical	varies	O(pop × N)
Nelder-Mead	up to ~30	varies	O(N²) simplex
Bayesian opt (GP)	up to ~30 (saturated kernel)	up to ~10³ observations	O(n³) GP fit
Bayesian opt (TPE)	up to ~100	up to ~10⁴ observations	O(n)
Hyperband	up to ~10² (HPs)	up to ~10⁴ trials	minimal state

The N≈30 ceiling on Bayesian opt and Nelder-Mead is a hard practical wall — kernel-based GPs degrade in high dim, and Nelder-Mead’s simplex deflates to a singular configuration. Above 30 dim, switch to TPE / SMAC (random-forest surrogate) or population methods.

5. Parallelism and batch/streaming

Method	Embarrassingly parallel?	Online / streaming?
Simplex / interior-point	no (single solver)	no (batch)
Branch-and-bound	yes (subtrees)	no
Vanilla GD (full batch)	yes (gradient over data)	no
SGD + minibatch	yes (per-replica)	yes
ADMM	yes (per block)	no
L-BFGS	partial	no
Synchronous data-parallel SGD	yes; gradient all-reduce	yes
Asynchronous SGD (Hogwild!)	yes; relaxed	yes
ZeRO / FSDP (DeepSpeed, FSDP)	yes (parameter sharded)	yes
GA / DE / PSO	yes (population)	n/a
CMA-ES	partial (population eval)	n/a
Bayesian opt	partial (batch acquisition, q-EI)	yes (sequential)
Hyperband / BOHB	yes (rungs)	yes

For modern deep learning, data-parallel SGD with all-reduce scales to ~1024 GPUs efficiently. Beyond that, you need tensor-parallel + pipeline-parallel + ZeRO/FSDP. See cuda-triton-gpu-programming and inference-optimization.

6. Deep-learning optimizer landscape (2024–2026)

Optimizer	Year	What’s new vs Adam	When it wins
SGD + momentum	1960s (heavy-ball Polyak)	baseline	small models, well-tuned LR schedules; ResNet ImageNet
AdaGrad	Duchi-Hazan-Singer 2011	per-parameter LR	sparse features, NLP-bag-of-words
RMSProp	Hinton lecture 2012 (unpublished)	EMA of squared grads	RNN training
Adam	Kingma-Ba 2015 (DiP)	combine momentum + RMSProp	default since 2015
AdamW	Loshchilov-Hutter 2019	decoupled weight decay	NLP transformers; default since GPT-2 era
LAMB	You et al 2019	layer-wise normalization	large-batch BERT
Adafactor	Shazeer-Stern 2018	factored second moment, saves memory	T5 trained with this
Lion	Chen et al 2023 (Google)	sign of momentum; ½ memory of Adam	competitive at LLM scale, much cheaper
Sophia	Liu et al 2023	clipped diagonal Hessian estimate	2× speedup vs AdamW for LLM pretraining
Shampoo	Gupta-Koren-Singer 2018; revived 2023	block-diagonal full-matrix preconditioner	competitive at small/mid scale; expensive memory
Distributed Shampoo	Anil et al 2020	scalable Shampoo	adopted at Google for some LLMs
CASPR	Li et al 2024	Shampoo + LR-free	research
SOAP	Vyas et al 2024	Shampoo + Adam-in-eigenbasis (best of both)	strong empirical, growing adoption
Muon	Jordan-Jin-Boza-Bernstein 2024	Newton-Schulz iteration on momentum + linear-algebra trick; ½ memory of Adam	Llama-style architectures, current SOTA on small/mid LLM benchmarks
PSGD	Li 2018 (revisited 2024)	Preconditioned SGD	small but competitive

The 2024 lesson: Adam is no longer the obvious default. Muon, SOAP, and Shampoo are competitive or better for LLM-scale training; Lion ties Adam at half the memory. The gap is small (~5–15% loss-equivalent compute) but compounds over trillion-parameter runs.

7. Convex first-order methods — convergence rates

Method	Smooth (L)	Smooth + strong convex (L, μ)	Non-smooth (Lipschitz f)
Vanilla GD	O(1/k)	O(exp(-k/κ))	O(1/√k) (with subgradient)
Nesterov accelerated	O(1/k²)	O(exp(-k/√κ))	n/a (Nesterov-smooth required)
Proximal gradient	O(1/k) (smooth part)	O(exp(-k/κ))	composite (smooth + prox)
FISTA	O(1/k²)	accelerated rate	composite
ADMM	O(1/k)	linear w/ strong convexity	composite block-separable
Subgradient	n/a	n/a	O(1/√k)
Mirror descent	matches GD on dual	matches	O(1/√k)
Polyak heavy-ball	not always faster than GD asymptotically	sometimes optimal	n/a

κ = L/μ = condition number. Nesterov 1983 achieved the lower-bound O(1/k²) rate. FISTA (Beck-Teboulle 2009) extended to the composite case. Optimal rates are information-theoretic lower bounds (Nemirovski-Yudin 1983) — no first-order method using only function + gradient queries can do better.

8. Coordinate descent + proximal methods

For separable problems (objective decomposes into per-coordinate terms), block-coordinate descent matches accelerated rates in practice with much lower per-iteration cost.

Coordinate Descent for Lasso (Friedman-Hastie-Tibshirani GLMNET 2010) — fastest in practice for ℓ₁-regularized linear models. Used by GLMNET, scikit-learn Lasso, R glmnet.
Proximal mappings — soft-thresholding for ℓ₁, projection for indicator functions, prox of group-norm for group Lasso.
ADMM decomposes problems with affine constraints; convergent on convex separable; heuristic on non-convex (Wang-Yin 2019 study).

9. Combinatorial optimization

For discrete decision variables: MIP, set cover, TSP, scheduling, network flow.

Approach	When it works	Real solver
LP relaxation + rounding	structure permits	textbook
Branch-and-bound	exact, exponential worst-case	Gurobi, CPLEX, COIN-OR CBC, SCIP, HiGHS
Branch-and-cut	adds cutting planes	Gurobi, CPLEX
Benders decomposition	structure with complicating variables	CPLEX Benders, Gurobi
Column generation	many variables, few binding constraints	CPLEX, FICO Xpress
Cutting-plane (Gomory, Chvátal-Gomory, MIR)	strengthens LP relaxation	most MIP solvers
Lagrangian relaxation	dualize hard constraints	research / specialized
Constraint programming	scheduling, allocation w/ logic constraints	Google OR-Tools CP-SAT, IBM CP Optimizer
Local search (TS, SA)	NP-hard heuristic	LocalSolver, OR-Tools
MILP solvers as black box	most production	Gurobi 11.x (2024), CPLEX 22.x

See combinatorial-optimization for set cover, TSP, max-flow / min-cut, vehicle routing, scheduling.

10. Bayesian optimization and HPO

For expensive black-box functions (each evaluation costs minutes-hours; e.g., training an ML model):

Method	Surrogate model	Acquisition function	Library
GP-EI (Mockus 1978, Snoek-Larochelle-Adams 2012)	Gaussian process	expected improvement	BoTorch, GPyOpt
GP-UCB	GP	upper confidence bound	BoTorch
TPE (Bergstra et al 2011)	tree-structured Parzen estimator	quantile-based	Optuna (default), Hyperopt
SMAC (Hutter-Hoos-Leyton-Brown 2011)	random forest	EI	SMAC3
Hyperband (Li et al 2018)	n/a (multi-fidelity)	successive halving	Optuna, Ray Tune
BOHB (Falkner-Klein-Hutter 2018)	TPE + Hyperband	EI w/ multi-fidelity	HpBandSter
Vizier (Google)	proprietary	EI variants	Google internal + Open Source Vizier
ASHA	n/a	async successive halving	Ray Tune
PBT (DeepMind 2017)	n/a — population-based	learn schedule	Ray, AutoML libraries

In 2025, Optuna with TPE + ASHA pruner is the most common open-source HPO stack. Ray Tune dominates for distributed HPO. Sweeps in Weights & Biases for managed HPO.

11. Multi-objective optimization

For multiple conflicting objectives:

Method	Approach	Output
Weighted-sum scalarization	reduce to single-obj	one Pareto point per weight
ε-constraint	constrain all but one	one Pareto point per ε
NSGA-II (Deb et al 2002)	population + non-dominated sorting + crowding	full Pareto front
NSGA-III (Deb-Jain 2014)	reference-direction NSGA	better for ≥3 objectives
MOEA/D (Zhang-Li 2007)	decomposition-based	full front
ParEGO (Knowles 2006)	BoTorch-style Bayesian multi-obj	front via random scalarization
qEHVI (qExpected Hypervolume Improvement)	BoTorch	batched Bayesian multi-obj
Goal programming	targeted	”best” wrt targets

NSGA-II is the default for evolutionary multi-objective (cited 50k+ times). qEHVI is the Bayesian-multiobjective state of the art.

12. Riemannian optimization

For optimization on manifolds (Stiefel, Grassmann, sphere, SPD, hyperbolic):

Method	Manifold version
Riemannian GD	gradient + retraction back to manifold
Riemannian Adam (Becigneul-Ganea 2019)	Adam on manifold
Trust-region (RTR, Absil-Mahony-Sepulchre 2008)	Riemannian Newton-like
Pymanopt, Manopt.jl, GeoOpt	Python/Julia/PyTorch libraries

Used for: orthogonal matrices (Stiefel) for neural net weights, SPD matrices in BCI / EEG, hyperbolic embeddings for hierarchy data, dictionary learning. See riemannian-optimization and lie-groups-so3-se3.

13. Reinforcement-learning-style optimization

When the objective is the return of a sequential decision process:

Method	When
REINFORCE / policy gradient	small policy networks
Actor-critic	stable when value function is learnable
TRPO (Schulman et al 2015)	trust-region; expensive
PPO (Schulman et al 2017)	de facto default for RLHF
SAC (Haarnoja et al 2018)	continuous control, max-entropy
DDPG / TD3	continuous control, deterministic policy
iLQR / DDP	model-based, trajectory opt
MPC w/ shooting	model-based, finite horizon
GRPO (group relative policy optimization, DeepSeek 2024)	post-training LLMs; no value model
DPO (Rafailov et al 2023)	direct preference optimization for RLHF
RLOO (Ahmadian et al 2024)	RL on language models

See reinforcement-learning-theory and fine-tuning-rlhf.

14. Memory footprint comparison (LLM training era)

Optimizer	State per parameter	Example: 70B model (fp32)
SGD	0	0 GB
SGD + momentum	1 (1× param)	280 GB
Adam / AdamW	2 (m + v)	560 GB
AdaFactor	≪ 2 (factored)	~70 GB
Lion	1 (momentum only)	280 GB
Shampoo (block-diagonal)	varies; can be larger	depends on block size
SOAP	larger than Adam (preconditioner)	depends
Muon	1 (momentum)	280 GB
8-bit Adam (bnb)	2× int8	~140 GB
4-bit Adam	2× int4	~70 GB

Optimizer state often dominates GPU memory for trillion-parameter training. ZeRO Stage 3 / FSDP shards optimizer state across GPUs. 8-bit Adam (bitsandbytes, Tim Dettmers 2021) is widely used.

15. Numerical solvers — the practical ecosystem

Tool	Class	Languages
Gurobi	LP / QP / MIP / MIQP	C, Python, Julia, R
CPLEX	LP / QP / MIP / CP	C, Python, Java
Mosek	LP / QP / SOCP / SDP / MIP	C, Python, MATLAB
COPT	LP / QP / SDP / MIP	C, Python
HiGHS (open)	LP / QP / MIP	C, Python, R, Julia
SCIP (open)	MIP, MINLP	C, Python
CBC (COIN-OR)	LP / MIP	C, Python
OR-Tools (Google)	CP-SAT, MIP, routing	C, Python, Java
CVXPY (Diamond-Boyd)	disciplined convex programming	Python
JuMP.jl	mathematical programming DSL	Julia
scipy.optimize	first-order, derivative-free, root-finding	Python
PyTorch / JAX optimizers	autograd-based first-order	Python
Optuna / Ray Tune / Hyperopt	HPO / Bayesian opt	Python
Pymanopt / Manopt	Riemannian opt	Python / MATLAB

The 2025 commercial-MIP leader is Gurobi 11; the academic-OSS leader is HiGHS. CVXPY is the modeling-language default for convex optimization in Python.

16. Decision tree — pick by problem class

What's the problem?
├─ Linear program (linear obj, linear constraints)
│    → Simplex (Gurobi/CPLEX/HiGHS) for sparse, large-scale; interior-point for dense.
├─ Quadratic program (quadratic obj, linear constraints)
│    → QP active-set or interior-point (Gurobi, OSQP, qpOASES, MOSEK).
├─ Semidefinite program (linear matrix inequalities)
│    → Interior-point (Mosek, SDPT3, SeDuMi); CVXPY for modeling.
├─ Mixed-integer linear (some variables integer)
│    → Branch-and-cut (Gurobi, CPLEX, SCIP, HiGHS).
│    → If structure: Benders, column generation.
├─ Convex non-smooth (regularized regression, Lasso, group-Lasso)
│    → Coordinate descent (GLMNET); FISTA; ADMM.
├─ Convex smooth large-scale (logistic regression, deep linear)
│    → L-BFGS (small-mid); SGD/Adam (large).
├─ Non-convex deep learning
│    → AdamW for most NLP; SGD+momentum for ResNets.
│    → Try Muon, SOAP, Shampoo for SOTA on small-mid LLMs.
│    → Lion if memory-constrained.
├─ Expensive black-box (CFD, lab experiment, neural arch search, ML hyperparams)
│    → Bayesian opt (BoTorch, Optuna TPE, SMAC).
│    → If multi-fidelity: Hyperband / BOHB / ASHA.
├─ Sequential decision (RL, control)
│    → PPO (default LLM-RLHF), SAC (continuous control), iLQR / MPC (model-based).
│    → GRPO / DPO for LLM preference tuning.
├─ Population-based black-box, moderate dim (10–1000)
│    → CMA-ES (Hansen) if continuous; GA/DE if discrete.
├─ Multi-objective (2–5 objectives)
│    → NSGA-II (DEAP, pymoo) for evolutionary; qEHVI for Bayesian.
├─ Low-dim derivative-free (<30 dim)
│    → Nelder-Mead (scipy.optimize.minimize method='Nelder-Mead');
│    → COBYLA if constraints.
├─ On a Riemannian manifold (Stiefel, Grassmann, SPD)
│    → Pymanopt / GeoOpt; Riemannian Adam.
├─ Optimal control / trajectory optimization
│    → DDP / iLQR; direct collocation; MPC w/ IPOPT.
│    → CasADi for symbolic gradient; Drake for robotics.
├─ Combinatorial scheduling / routing / assignment
│    → Google OR-Tools CP-SAT, IBM CP Optimizer, LocalSolver.
└─ Game-theoretic / minimax
     → Extragradient, OGDA, optimistic-gradient; Nash-EGT for adversarial training.

17. Anti-patterns

Using SGD when the problem is convex and small — L-BFGS converges 100× faster.
Using Adam when SGD-momentum would do — ResNet ImageNet was famously trained with SGD-momentum; Adam is worse for it.
Using Bayesian opt for cheap functions — every BO step costs O(n³) GP fit; pointless if function is cheap.
Using GP-BO above ~30 dim — kernel saturates; use TPE / SMAC / random search.
Solving an LP with MIP solver — MIP solver adds integer-handling overhead.
Mixing optimizer with LR schedule wrong — AdamW + cosine schedule + linear warmup is the deep-learning recipe; Adam without weight decay overfits.
Random search for HPO of >5 hyperparams without budget pruning — use ASHA or Hyperband.
CMA-ES in 10,000 dim — covariance matrix is 10⁸ × 10⁸. Use sep-CMA or random search.

18. The 2024–2026 frontier

Muon (Jordan et al 2024) — half-memory Adam alternative; rapidly adopted in 2025 LLM pretraining.
SOAP (Vyas et al 2024) — Shampoo + Adam, combines best of both.
Schedule-Free SGD (Defazio et al 2024) — adaptive without LR schedule; emerging.
Distributed Shampoo — scaled Shampoo for multi-trillion parameter.
JAX-based optimization stacks — Optax (Google) is the cleanest API.
Gurobi 11 (2024) — significant MIP speedup (~30% over v10).
HiGHS 1.7 — open-source MIP catching up to commercial.
CP-SAT (OR-Tools) — won MiniZinc Challenge every year 2018–2024.
End-to-end differentiable convex layers (cvxpylayers, Agrawal-Amos-Barratt-Boyd 2019) — embed CVXPY problem inside a neural network.
NPB / Neural ODE optimizers — back-prop through ODE solvers.
Implicit gradients / OptNet (Amos-Kolter 2017) — differentiate through QP solutions.

Adjacent

Linear algebra — linear-algebra-essentials for the matrix operations underlying every method.
Numerical linear algebra — numerical-linear-algebra for QR / LU / Cholesky / preconditioned-CG used by interior-point.
Eigenvalue / SVD — eigenvalue-problems, svd-pca-spectral for Hessian + preconditioner spectra.
Convex optimization theory — convex-optimization for the duality + KKT + interior-point story.
Gradient descent — gradient-descent-variants for the full deep-learning optimizer survey.
Combinatorial — combinatorial-optimization for MIP, TSP, scheduling.
Riemannian — riemannian-optimization for manifold-constrained.
Reinforcement learning — reinforcement-learning-theory for policy gradient, value methods, RLHF.
Variational inference — variational-inference for ELBO maximization (a non-convex optimization in disguise).
Optimization algorithm taxonomy — optimization-algorithm-taxonomy for the full method catalog.
GPU programming — cuda-triton-gpu-programming for kernel-level acceleration of SGD/Adam.
Distributed training — fine-tuning-rlhf for ZeRO, FSDP, RLHF.
Engineering control — mpc-control for MPC’s quadratic-program at the core.

When to pick what

The fastest narrowing: convex + small/medium → interior-point or L-BFGS; convex + huge → SGD or proximal-FISTA; non-convex deep learning → AdamW (or try Muon/SOAP); black-box expensive → Bayesian opt or evolutionary; discrete decisions → MIP solver (Gurobi/HiGHS) or CP-SAT; multi-objective → NSGA-II or qEHVI; sequential decision → PPO / SAC / iLQR; manifold-constrained → Pymanopt / GeoOpt. The single biggest mistake is applying SGD-class methods to problems where exact methods exist — if your problem is an LP, use a simplex solver; if it’s convex, prove it convex and use interior-point or proximal methods. SGD/Adam earn their keep on non-convex deep-learning loss surfaces where exact methods don’t apply.

Compendium

Explorer

Optimization Methods — Cross-Cutting Comparison

Optimization Methods — Cross-Cutting Comparison

See also

1. The taxonomy

2. Convexity required vs not

3. Smoothness and derivative availability

4. Scalability — dimension N and data n

5. Parallelism and batch/streaming

6. Deep-learning optimizer landscape (2024–2026)

7. Convex first-order methods — convergence rates

8. Coordinate descent + proximal methods

9. Combinatorial optimization

10. Bayesian optimization and HPO

11. Multi-objective optimization

12. Riemannian optimization

13. Reinforcement-learning-style optimization

14. Memory footprint comparison (LLM training era)

15. Numerical solvers — the practical ecosystem

16. Decision tree — pick by problem class

17. Anti-patterns

18. The 2024–2026 frontier

Adjacent

When to pick what

Graph View

Table of Contents

Backlinks