Optimization Algorithm Taxonomy

A family index of mathematical optimization methods. Optimization is the discipline of selecting the best element from a feasible set with respect to one or more objective functions. The field spans linear and nonlinear programming, discrete combinatorial problems, stochastic and online variants, and a wide ecosystem of metaheuristics. This note maps the major branches, their canonical algorithms, the people who built them, and the software that ships them today.

1. Categorization Schemes

A given problem can be classified along several orthogonal axes; the right algorithm depends on the combination.

1.1 By Algebraic Structure

LP (Linear Program) — linear objective, linear constraints. Convex polyhedral.
QP (Quadratic Program) — quadratic objective, linear constraints. Convex iff Hessian PSD.
SOCP (Second-Order Cone Program) — linear objective, constraints of form ‖Ax + b‖₂ ≤ cᵀx + d. Convex.
SDP (Semidefinite Program) — variables are matrices constrained to the PSD cone. Convex.
NLP (Nonlinear Program) — general smooth nonlinear objective and constraints. May be nonconvex.
MILP (Mixed-Integer Linear) — LP with some variables integer-valued. NP-hard.
MINLP (Mixed-Integer Nonlinear) — combination of MILP and NLP; the hardest mainstream class.

LP ⊂ QP ⊂ SOCP ⊂ SDP gives a nested hierarchy of convex conic programs. Beyond SDP lies general convex programming, beyond that nonconvex NLP, and orthogonal to all is integrality.

1.2 By Constraint Type

Unconstrained — minimize f(x) over all x ∈ ℝⁿ.
Box constraints — lᵢ ≤ xᵢ ≤ uᵢ. Often handled by projection or active-set.
Equality constraints — g(x) = 0. Lagrange multipliers; KKT conditions.
Inequality constraints — h(x) ≤ 0. Slack variables, barrier methods.
Mixed equality/inequality — the general NLP form min f(x) s.t. g(x) = 0, h(x) ≤ 0.

1.3 By Smoothness

Smooth — f, g, h ∈ C¹ or C². Enables gradient and Hessian methods.
Nonsmooth — subgradients (Clarke), proximal operators, bundle methods.
Composite — f(x) = g(x) + h(x) with g smooth, h nonsmooth-but-proximable. Enables proximal-gradient.

1.4 By Dimension

Low-dimensional (n < 10²) — direct/derivative-free methods feasible.
Medium (10² ≤ n ≤ 10⁵) — second-order methods (Newton, IPOPT) shine.
High (n > 10⁵) — first-order / stochastic methods dominate (SGD, ADMM, L-BFGS).
Extreme (n > 10⁸, deep learning) — stochastic gradient with mini-batches is essentially mandatory.

1.5 By Determinism

Deterministic — fixed objective, fixed data, exact gradients.
Stochastic — objective is an expectation E[F(x, ξ)]; gradient estimates are noisy.
Robust — worst-case over an uncertainty set (Ben-Tal–El Ghaoui–Nemirovski 2009).
Distributionally robust — worst-case over a set of plausible distributions (Esfahani–Kuhn 2018).

1.6 By Convexity

Convex — every local optimum is global; strong duality (Slater); efficient algorithms.
Nonconvex smooth — local search returns stationary points; global optimization needed for guarantees.
Geodesically convex / manifold convex — Riemannian optimization on Stiefel, Grassmann, SPD manifolds.

2. Linear Programming (LP)

The historical foundation of mathematical optimization. Standard form: min cᵀx s.t. Ax = b, x ≥ 0.

2.1 Simplex Method (Dantzig 1947)

George Dantzig’s pivoting algorithm walks along edges of the feasible polyhedron from vertex to vertex, improving the objective each step. Worst-case exponential (Klee–Minty cube, 1972), but average-case polynomial in practice. Revised simplex maintains an LU factorization of the basis; dual simplex exploits primal infeasibility.

2.2 Interior-Point Methods

Karmarkar (1984) showed polynomial-time LP via a projective interior-point method, ending the long-standing question opened by Khachiyan’s ellipsoid method (1979). Modern primal-dual path-following methods (Mehrotra predictor-corrector, 1992) follow the central path defined by a logarithmic barrier and converge in O(√n log(1/ε)) iterations.

2.3 Primal vs Dual Formulations

Every LP has a dual; strong duality always holds at optimum (no duality gap). Solvers exploit whichever is smaller or more tractable.

2.4 LP Solvers

Open source — GLPK (Makhorin), CLP (COIN-OR), HiGHS (Huangfu–Hall, Edinburgh).
Commercial — Gurobi, CPLEX (originally CPLEX Inc., now IBM), Mosek, FICO Xpress.

HiGHS has emerged in the last few years as the strongest open-source LP/MIP solver, used as the default in SciPy 1.9+ and embedded in Pyomo, JuMP, and others.

3. Quadratic Programming (QP)

Standard form: min ½xᵀHx + cᵀx s.t. Ax ≤ b. Convex iff H ⪰ 0.

3.1 Active-Set Methods

Generalize simplex: maintain a working set of active inequality constraints, solve the equality-constrained subproblem, then update the set. Effective for medium scale, warm-starts well (sequential QP).

3.2 Interior-Point QP

Mehrotra-style primal-dual methods extend naturally. Mosek and Gurobi both ship robust interior-point QP. OSQP (Stellato et al 2020) is a popular operator-splitting QP solver based on ADMM, particularly suited to embedded and MPC applications.

3.3 Augmented Lagrangian QP

QPALM (Hermans, Pipeleers, Patrinos 2022) blends semismooth Newton with augmented Lagrangian for sparse, ill-conditioned QPs.

4. Second-Order Cone Programming (SOCP)

Variables constrained to live in a Lorentz cone {(t, x) : ‖x‖₂ ≤ t}. Encompasses LP, convex QP, and many robust formulations.

4.1 Applications

Robust LP — uncertainty in coefficients lifts to SOCP constraints.
Portfolio optimization — Markowitz mean-variance with second-order moment constraints.
Trajectory optimization — quadratic state cost with conic input constraints.
Truss topology, antenna design, filter design.

4.2 SOCP Solvers

Mosek — gold standard for conic programming.
ECOS (Domahidi–Chu–Boyd 2013) — embedded conic solver, single-header C, used inside CVXPY.
SCS (O’Donoghue et al 2016) — splitting conic solver for large-scale cone programs.

5. Semidefinite Programming (SDP)

Variables are symmetric matrices X constrained to X ⪰ 0. Objective and constraints linear in X.

5.1 Theory and Applications

MaxCut relaxation — Goemans and Williamson (1995) gave a 0.878-approximation via SDP rounding, the first nontrivial approximation for an NP-hard graph problem.
Control via LMIs (linear matrix inequalities) — Boyd–El Ghaoui–Feron–Balakrishnan 1994. Lyapunov stability, H∞ synthesis, gain scheduling all reduce to SDP.
Polynomial optimization via Lasserre hierarchy (2001) and SOS programming (Parrilo 2000).
Quantum information — channel capacity, entanglement detection.

5.2 SDP Solvers

SeDuMi (Sturm 1999) — historical reference solver.
SDPT3 (Toh–Todd–Tütüncü) — interior-point.
COSMO (Garstka–Cannon–Goulart 2021) — operator-splitting, exploits sparsity and chordal decomposition.
SCS, Mosek — large-scale.

6. Mixed-Integer Linear Programming (MILP)

NP-hard in general but practically solvable for many real instances thanks to decades of engineering.

6.1 Branch-and-Bound (Land–Doig 1960)

Recursively partition the search space on fractional integer variables, using LP relaxations as bounds to prune subtrees. Selection rules (best-first, depth-first, best-estimate), branching rules (most-infractional, pseudo-cost, strong branching, reliability branching) are critical for performance.

6.2 Cutting Planes (Gomory 1958)

Add valid inequalities that exclude fractional LP solutions but no integer solutions. Mixed-integer Gomory cuts, mixed-integer rounding (MIR), Chvátal-Gomory cuts, lift-and-project (Balas-Ceria-Cornuéjols 1993), zero-half cuts.

6.3 Branch-and-Cut and Branch-and-Price

Branch-and-cut combines branch-and-bound with cutting planes at every node.
Branch-and-price is column generation embedded in branch-and-bound, used for huge LPs like vehicle routing and crew scheduling.

6.4 MILP Solvers

Commercial — Gurobi, CPLEX, FICO Xpress — currently leading benchmarks (Mittelmann).
Open source — SCIP (Achterberg, ZIB) for academic use, HiGHS for production.

7. Mixed-Integer Nonlinear Programming (MINLP)

Combination of integrality and nonlinearity. Often nonconvex.

Outer Approximation (Duran–Grossmann 1986) — alternate MILP master with NLP subproblems.
Generalized Benders Decomposition (Geoffrion 1972).
Bonmin (Bonami et al 2008) — branch-and-bound + outer approximation, COIN-OR.
BARON (Tawarmalani–Sahinidis 2005) — branch-and-reduce, deterministic global optimization.
Couenne (Belotti et al 2009) — convex over- and under-estimators, COIN-OR.
SCIP also handles MINLP via constraint-integer-programming framework.

8. Unconstrained Smooth Optimization

The bedrock of continuous optimization, and where many famous algorithms originate.

8.1 First-Order Methods

Gradient descent — xₖ₊₁ = xₖ − α∇f(xₖ). Converges sublinearly O(1/k) on smooth convex.
Steepest descent — exact line search along the negative gradient.
Conjugate gradient (Hestenes–Stiefel 1952) — generates A-orthogonal search directions, finite termination on quadratic problems in n steps. Nonlinear CG variants: Fletcher–Reeves (1964), Polak–Ribière (1969), Hager–Zhang (2005).

8.2 Newton’s Method

xₖ₊₁ = xₖ − [∇²f(xₖ)]⁻¹∇f(xₖ). Quadratic convergence near a strict minimum, but Hessian assembly and factorization cost O(n³). Modifications: damped Newton (with line search), modified Cholesky for nonconvex regions, truncated Newton (use CG to solve the Newton system approximately).

8.3 Quasi-Newton

Build a Hessian approximation from gradient differences. The BFGS update (Broyden–Fletcher–Goldfarb–Shanno 1970, independently) is the workhorse: rank-2 PSD update that satisfies the secant equation. SR1 (symmetric rank-1) is non-PSD but better for trust regions.

L-BFGS (Liu–Nocedal 1989) stores only the last m (typically 5–20) (s, y) pairs and applies the inverse Hessian implicitly via a two-loop recursion. O(nm) per iteration — the standard for large-scale smooth optimization.

8.4 Trust Region Methods

Instead of a line search, solve a constrained quadratic subproblem min m(p) = f + gᵀp + ½pᵀBp s.t. ‖p‖ ≤ Δ. Update Δ based on actual-to-predicted-reduction ratio. Conn–Gould–Toint authored the definitive monograph (2000). Algorithms: dogleg, Steihaug-CG, exact (More-Sorensen).

8.5 Line Search Theory

Wolfe conditions (Wolfe 1969):

Armijo (sufficient decrease) — f(xₖ + αpₖ) ≤ f(xₖ) + c₁α∇fᵀpₖ.
Curvature — ∇f(xₖ + αpₖ)ᵀpₖ ≥ c₂∇fᵀpₖ. Strong Wolfe replaces the curvature condition with its absolute-value form, important for nonlinear CG.

9. Constrained Nonlinear Programming

9.1 Sequential Quadratic Programming (SQP)

Han (1977), Powell (1978). At each iterate, solve a QP that linearizes constraints and uses a quadratic model of the Lagrangian. Pros: fast on small/medium problems, warm-starts well; cons: requires globalization (line search or trust region with merit function or filter).

9.2 Interior-Point NLP

IPOPT (Wächter–Biegler 2006, CMU, then CMU/Carnegie-Mellon and now COIN-OR). Primal-dual interior-point with a filter line search and inertia-corrected linear systems. The reference open-source NLP solver, used in process engineering, robotics, energy markets.

9.3 Augmented Lagrangian

Add a quadratic penalty term to the Lagrangian: Lₐ(x, λ) = f(x) + λᵀg(x) + (ρ/2)‖g(x)‖². LANCELOT (Conn–Gould–Toint 1992), ALADIN (Houska 2016 for distributed). Useful when KKT systems are too large to assemble directly.

9.4 Trust-Region NLP

TRON (Lin–Moré 1999) — for bound-constrained problems. KNITRO (Byrd–Nocedal–Waltz, commercial) — combines interior-point, active-set, and SQP modes.

10. Stochastic Gradient Methods

The dominant class for machine learning, where the objective is an expectation over data and exact gradients are prohibitive.

10.1 SGD (Robbins–Monro 1951)

Replace the gradient with a noisy unbiased estimate; converge with decreasing step size αₖ such that Σαₖ = ∞, Σαₖ² < ∞.

10.2 Momentum

Heavy-ball (Polyak 1964) — vₖ₊₁ = βvₖ − α∇f(xₖ); xₖ₊₁ = xₖ + vₖ₊₁.
Nesterov accelerated gradient (1983) — evaluate gradient at the look-ahead point, achieves the optimal O(1/k²) rate on smooth convex.

10.3 Adaptive Step Sizes

AdaGrad (Duchi–Hazan–Singer 2011) — per-parameter step size 1/√Σgₜ². Aggressive decay; good for sparse data.
RMSprop (Hinton lecture, 2012) — replace cumulative sum with EMA of squared gradients.
Adam (Kingma–Ba 2014) — momentum + RMSprop + bias correction. The default for deep learning since 2015.
AdamW (Loshchilov–Hutter 2019) — decouple weight decay from gradient-based update; widely adopted for transformers.

10.4 Newer Optimizers

Lion (Chen et al, Google 2023) — sign-momentum, memory-efficient, beats Adam on some vision/language tasks.
Shampoo (Gupta–Koren–Singer 2018) — preconditioner that approximates a block-diagonal full-matrix Adagrad.
Sophia (Liu et al 2023, Stanford) — clipped second-moment via Hutchinson trace estimator, ~2× faster than AdamW on LLM pretraining in their benchmarks.

10.5 Variance-Reduced SGD

For finite-sum f(x) = (1/n)Σfᵢ(x):

SAG / SAGA (Defazio–Bach–Lacoste-Julien 2014) — keep stored gradient table, update one component each step.
SVRG (Johnson–Zhang 2013) — periodically compute full gradient as control variate.
SPIDER (Fang et al 2018), STORM, SARAH — recursive variance reduction with better complexity for nonconvex finite sums.

These achieve linear convergence on strongly convex finite sums, matching full-batch rates with stochastic-batch cost.

11. Composite and Proximal Optimization

For f(x) = g(x) + h(x), g smooth, h nonsmooth but with cheap prox.

11.1 ISTA / FISTA

Iterative Shrinkage-Thresholding (ISTA) — xₖ₊₁ = proxₐₕ(xₖ − α∇g(xₖ)). FISTA (Beck–Teboulle 2009) adds Nesterov momentum, O(1/k²) rate. Standard for L1-regularized regression (Lasso), total-variation denoising.

11.2 ADMM

Alternating Direction Method of Multipliers, originating with Glowinski–Marrocco (1975) and Gabay–Mercier (1976), popularized in ML by Boyd et al’s 2011 survey “Distributed Optimization and Statistical Learning via ADMM”. Splits min f(x) + g(z) s.t. Ax + Bz = c into alternating x-update, z-update, dual update. Excellent for distributed and consensus problems.

11.3 Other Operator Splitting

Douglas–Rachford — for sums of two nonsmooth functions.
Forward-Backward — generalizes prox-gradient.
Primal-Dual (Chambolle–Pock 2011) — saddle-point splitting for imaging.

12. Online and Bandit Optimization

12.1 Online Convex Optimization

Online Gradient Descent (OGD) — Zinkevich (2003). Regret O(√T) on convex losses.
Follow-The-Regularized-Leader (FTRL) — McMahan (2011). Used in Google’s ad ranking.
Multiplicative Weights (Littlestone–Warmuth 1994; Freund–Schapire 1997) — exponentiated-gradient over the simplex; basis of boosting and game-theoretic learning.

12.2 Multi-Armed Bandits

UCB (Auer–Cesa-Bianchi–Fischer 2002) — Upper Confidence Bound, optimism under uncertainty.
Thompson Sampling (Thompson 1933, revived ~2010) — Bayesian posterior sampling. Often empirically best.
EXP3 — adversarial bandits.
LinUCB, Lin-TS — contextual linear bandits.

13. Derivative-Free / Direct Search

When gradients are unavailable, expensive, or noisy.

Nelder–Mead simplex (1965) — reflect/expand/contract/shrink a simplex of n+1 points. Heuristic but ubiquitous.
Powell’s methods — COBYLA (linear approximations under constraints), BOBYQA (bound-constrained quadratic interpolation), NEWUOA.
Pattern search — Hooke–Jeeves (1961), Generalized Pattern Search (Torczon 1997), MADS (Audet–Dennis 2006).
DIRECT (Jones–Perttunen–Stuckman 1993) — global Lipschitzian optimization without Lipschitz constant.
Implicit Filtering (Kelley) — gradient descent with noisy finite-difference gradients.

14. Bayesian Optimization

Sample-efficient global optimization for expensive black-box functions.

GP-EI — Gaussian process surrogate, Expected Improvement acquisition (Mockus 1989, Jones–Schonlau–Welch 1998).
Spearmint (Snoek–Larochelle–Adams 2012) — popularized BO for ML hyperparameter tuning.
GP-UCB (Srinivas–Krause–Kakade–Seeger 2010) — confidence-bound acquisition with regret bounds.
Thompson Sampling, Entropy Search (Hennig–Schuler 2012), Predictive Entropy Search, Knowledge Gradient.
Multi-fidelity BO (Kandasamy et al 2016) — combine cheap approximations with expensive evaluations.

Libraries: Optuna (Akiba et al, Preferred Networks), BoTorch (Balandat et al, Meta), GPyOpt, scikit-optimize, HyperOpt (Bergstra), Ax (Meta).

15. Metaheuristics

Population- or trajectory-based heuristics for hard nonconvex problems. No global convergence guarantees, but practically effective.

15.1 Evolutionary Algorithms

Genetic Algorithm (GA) — Holland (1975). Binary or real-valued chromosomes, crossover + mutation + selection.
Evolution Strategies (ES) — Rechenberg, Schwefel (1960s–70s). (1+1)-ES, (μ/ρ, λ)-ES.
CMA-ES — Covariance Matrix Adaptation Evolution Strategy (Hansen–Ostermeier 1996). Adapts a full-rank multivariate normal sampling distribution. State of the art for low-dimensional continuous black-box optimization.
Differential Evolution (DE) — Storn–Price (1997). Vector difference mutation; simple, robust.
Genetic Programming — Koza (1992). Evolves programs.

15.2 Swarm Intelligence

Particle Swarm Optimization (PSO) — Kennedy–Eberhart (1995). Particles in search space with velocity updates toward personal-best and global-best.
Ant Colony Optimization (ACO) — Dorigo (1992). Pheromone trail reinforcement; strong on TSP and routing.
Artificial Bee Colony (Karaboga 2005), Firefly Algorithm (Yang 2008), Cuckoo Search (Yang–Deb 2009), Grey Wolf Optimizer (Mirjalili 2014) — the metaheuristic zoo continues to grow.

15.3 Trajectory Methods

Simulated Annealing (SA) — Kirkpatrick–Gelatt–Vecchi (1983), based on Metropolis–Hastings. Cooling schedule essential.
Tabu Search — Glover (1986). Memory-based local search with forbidden moves.
GRASP — Feo–Resende (1995). Greedy Randomized Adaptive Search Procedure.
Variable Neighborhood Search — Mladenović–Hansen (1997).
Iterated Local Search (Lourenço–Martin–Stützle 2003).

16. Deterministic Global Optimization

Provably global optima for nonconvex NLP.

Branch-and-bound for NLP — partition the box, compute relaxation bounds, prune.
αBB (Adjiman–Floudas 1998) — α-based branch-and-bound, convex underestimators via interval Hessians.
alphaECP (Westerlund) — extended cutting-plane for MINLP.
BARON (Sahinidis) — branch-and-reduce.

These are practical for problems up to ~100 continuous variables.

17. Combinatorial Optimization

Discrete problems with combinatorial structure.

17.1 Traveling Salesman Problem (TSP)

Concorde (Applegate–Bixby–Chvátal–Cook) — combines cutting planes (subtour elimination, comb, clique-tree) with branch-and-cut. Solved instances up to ~85,900 cities (pla85900) to optimality.
LKH (Lin–Kernighan–Helsgaun 2000) — heuristic but produces optimal or near-optimal solutions in seconds.

17.2 Vehicle Routing Problem (VRP)

Classic benchmarks: Solomon (1987), Gehring–Homberger (1999).
Methods: column generation (branch-and-price), large-neighborhood search (Ropke–Pisinger 2006), genetic algorithms with split (Prins 2004).

17.3 SAT and MaxSAT

DPLL — Davis–Putnam–Logemann–Loveland (1962). Backtracking with unit propagation.
CDCL — Conflict-Driven Clause Learning (Marques-Silva–Sakallah 1996; Moskewicz et al GRASP 2001). Learn no-good clauses from conflicts.
MiniSat (Eén–Sörensson 2003), Glucose (Audemard–Simon 2009), CaDiCaL and kissat (Biere) — state-of-the-art modern SAT solvers.
MaxSAT — weighted/partial extensions; solvers Open-WBO, EvalMaxSAT.

17.4 QUBO and Ising

Quadratic Unconstrained Binary Optimization: min xᵀQx with x ∈ {0,1}ⁿ. Equivalent to Ising via spin {−1, +1}.

D-Wave quantum annealer (since 2011) — physical adiabatic optimization.
Simulated Bifurcation Machine (Toshiba 2019) — classical heuristic inspired by Kerr-nonlinear parametric oscillators.
Tabu search + simulated annealing classical baselines.

18. Distributed and Parallel Optimization

18.1 Decomposition Methods

Lagrangian decomposition — relax linking constraints with multipliers; decompose into subproblems.
Benders decomposition — split into master (complicating variables) and subproblems.
Dantzig-Wolfe decomposition — column generation for LPs with block-angular structure.

18.2 Consensus and Async

ADMM consensus — local agents optimize copies of x, coupled by consensus constraint.
Hogwild! (Niu–Recht–Ré–Wright 2011) — lock-free async SGD; correctness under sparsity.
Async SGD (Recht–Re, others) — parameter-server architectures.
Federated Averaging (FedAvg) — McMahan et al (2017). Local SGD epochs on each client, server averages weights.

19. Robust Optimization

Optimize for worst case over a deterministic uncertainty set.

Ben-Tal–El Ghaoui–Nemirovski “Robust Optimization” (Princeton 2009) is the canonical text. Robust LP over an ellipsoidal set lifts to SOCP; over a box stays LP; over a polyhedral set stays LP via duality. Robust QP / SDP analogously.

20. Distributionally Robust Optimization (DRO)

Optimize for worst-case distribution within an ambiguity set.

Moment-based DRO — Delage–Ye (2010).
Wasserstein DRO — Esfahani–Kuhn (2018), Blanchet et al. Wasserstein-1 balls give tractable reformulations via duality; finite-sample guarantees out-of-sample.
f-divergence DRO — Ben-Tal et al.

DRO has become a major framework for trustworthy ML in the last five years.

21. Multi-Objective Optimization

Multiple competing objectives: min (f₁(x), …, fₘ(x)).

21.1 Pareto Concepts

Pareto optimum — no other point dominates it on all objectives.
Pareto front — set of all Pareto-optimal points.
Hypervolume, inverted generational distance — quality indicators for approximations.

21.2 Scalarization

Weighted sum — min Σwᵢfᵢ(x). Misses non-convex portions of front.
ε-constraint — minimize one, bound others.
Tchebycheff scalarization — captures non-convex fronts.

21.3 Evolutionary Multi-Objective

NSGA-II (Deb–Pratap–Agarwal–Meyarivan 2002) — non-dominated sorting + crowding distance. The reference algorithm; thousands of citations per year.
NSGA-III (Deb–Jain 2014) — for many-objective (≥4) via reference directions.
MOEA/D (Zhang–Li 2007) — decomposition-based.
SPEA2 (Zitzler–Laumanns–Thiele 2001), SMS-EMOA (hypervolume-based).

22. Software Ecosystem

22.1 Modeling Languages

AMPL (Fourer–Gay–Kernighan, 1985) — the original; commercial.
GAMS (Brooke–Kendrick–Meeraus) — heavily used in operations research and energy.
Pyomo (Hart et al, Sandia 2017) — Python-based algebraic modeling.
JuMP (Dunning–Huchette–Lubin 2017) — Julia-based; speed advantage from Julia’s type system.
CVXPY (Diamond–Boyd 2016) — Python, automatic conversion of disciplined convex programs to conic form.
CVX (Grant–Boyd, MATLAB) — the original DCP modeler.
YALMIP (Löfberg 2004, MATLAB) — supports SDP, MILP, NLP, robust optimization.

22.2 General-Purpose Solver Front-Ends

CVXOPT (Andersen–Dahl–Vandenberghe) — Python convex optimization library, also a low-level conic solver.
OR-Tools (Google) — open-source suite covering LP/MIP/CP-SAT/routing.
Optaplanner (Red Hat) — Java constraint-satisfaction and scheduling.

22.3 Specialized

CasADi (Andersson–Gillis et al) — algorithmic differentiation + optimal control NLP.
Drake (Russ Tedrake, MIT) — robotics MPC and trajectory optimization.
acados (Verschueren et al) — real-time embedded NLP for MPC.

22.4 Auto-Diff for Optimization

Modern optimization frequently relies on JAX, PyTorch, or TensorFlow for gradient computation, even for non-ML problems. JAXopt, Optax (DeepMind), scipy.optimize, NLopt (Johnson) round out the Python landscape.

23. How to Pick an Algorithm

A pragmatic checklist:

Is it convex? If yes, you can usually find a polynomial-time solver. Try CVXPY first.
Smooth or nonsmooth? Nonsmooth → proximal methods (ISTA/ADMM). Smooth → gradient/quasi-Newton.
Constraints? Equality only → SQP or projected gradient. Box → L-BFGS-B (Byrd–Lu–Nocedal 1995). Mixed → IPOPT.
Scale? > 10⁴ variables → first-order or L-BFGS. > 10⁷ → stochastic. < 10² → derivative-free OK.
Integers? Yes → MILP/MINLP solvers; consider relaxations.
Stochastic or noisy? SGD-family + variance reduction.
Expensive function evaluations? Bayesian optimization.
Multiple objectives? NSGA-II or scalarization.
Global guarantees required? Branch-and-bound, BARON, or accept heuristic with multi-start.

24. Open Problems and Directions

Theory of nonconvex optimization — escape rates from saddle points, landscape analysis for deep networks (Jin et al, Ge et al, Lee et al on strict saddle).
Mixed-integer convex — closing the gap between MILP and MISDP performance.
Quantum and quantum-inspired — QAOA (Farhi–Goldstone–Gutmann 2014), VQE for combinatorial optimization.
Differentiable optimization layers — Amos–Kolter “OptNet” (2017); embedding QP/LP/cone programs inside neural networks.
Learning-augmented algorithms — neural branching heuristics for MILP (Gasse et al 2019, Khalil et al 2016).

Adjacent

probability-distribution-zoo — distributions used in stochastic and robust optimization.
linear-algebra-essentials — matrix factorizations underlying interior-point methods.
numerical-linear-algebra — conditioning, stability, finite-difference gradients.
machine learning loss functions — objectives that SGD-family methods optimize.
convex analysis — subgradients, duality, prox operators.
graph algorithms — combinatorial substrate for routing and assignment.
control theory LMIs — SDP applications.
bayesian-inference — surrogate models for Bayesian optimization.
reinforcement learning — online and bandit optimization in disguise.
transformer-architecture — consumers of Adam/AdamW/Lion at scale.

Compendium

Explorer

Optimization Algorithm Taxonomy — LP, QP, NLP, Combinatorial, Stochastic, Metaheuristic