Convex Optimization — Math Reference

1. At a glance

Convex optimization is the study of minimizing a convex objective function over a convex feasible set. The defining property: any local minimum is a global minimum. This single property, plus the existence of polynomial-time algorithms for the major problem classes, makes convex optimization the workhorse of modern quantitative engineering.

Why it matters:

Global optimum guaranteed. No worry about local minima, saddle points, or restart heuristics. The KKT conditions are both necessary and sufficient.
Scalable. Interior-point methods solve LPs/QPs/SOCPs/SDPs in polynomial time. First-order methods (gradient descent, proximal gradient, FISTA) scale to billions of variables.
Foundational across disciplines. Machine learning (SVM, Lasso, logistic regression), control (MPC, LQR), signal processing (compressed sensing, denoising, beamforming), finance (Markowitz portfolios, risk parity), operations research (LP scheduling, transportation, assignment), robotics (inverse kinematics QP, collision-avoidance QP), and statistics (MLE for log-concave families) all reduce to convex problems.
Disciplined convex programming (DCP). Modeling languages (CVX, CVXPY, Convex.jl) let you write the problem in a near-mathematical form and automatically transform it to a solver’s standard form.

The canonical reference is Boyd + Vandenberghe, “Convex Optimization” (Cambridge, 2004) — freely available as a PDF from the authors. Together with Nesterov, “Lectures on Convex Optimization” (2nd ed, Springer, 2018) and Nocedal + Wright, “Numerical Optimization” (2nd ed, Springer, 2006), these three texts cover essentially everything in this note in detail.

See also: [[Math/_index]], [[Math/linear-algebra-essentials]].

2. Convex sets

A set C ⊆ Rⁿ is convex if for all x, y ∈ C and λ ∈ [0, 1]:

λx + (1 − λ)y ∈ C

Equivalently, the line segment between any two points of C lies entirely in C.

Canonical examples

Hyperplane: {x : aᵀx = b}, where a ≠ 0, b ∈ R.
Halfspace: {x : aᵀx ≤ b}. The basic building block of polyhedra.
Ball (Euclidean): {x : ‖x − x_c‖₂ ≤ r}.
Ellipsoid: {x : (x − x_c)ᵀ P⁻¹ (x − x_c) ≤ 1}, with P ≻ 0.
Polyhedron: {x : Ax ≤ b, Cx = d}. Intersection of finitely many halfspaces and hyperplanes.
Polytope: bounded polyhedron.
Simplex: {x ∈ Rⁿ : x ≥ 0, 1ᵀx = 1} (the probability simplex).
Positive orthant: Rⁿ₊ = {x : xᵢ ≥ 0}.
Norm ball: {x : ‖x‖ ≤ r} for any norm.
Norm cone: {(x, t) : ‖x‖ ≤ t} ⊂ Rⁿ⁺¹. The second-order cone (Lorentz cone) is the case of the 2-norm.
Positive semidefinite cone: Sⁿ₊ = {X ∈ Sⁿ : X ⪰ 0}. The set of n × n symmetric PSD matrices. This is the cone over which SDPs are posed.

Operations preserving convexity

Intersection. Arbitrary intersections of convex sets are convex. This is why polyhedra (intersections of halfspaces) are convex.
Affine image and preimage. If C is convex and f(x) = Ax + b, then f(C) and f⁻¹(C) are convex.
Perspective: P(x, t) = x/t for t > 0 preserves convexity.
Linear-fractional: (Ax + b)/(cᵀx + d) on the set {x : cᵀx + d > 0} preserves convexity.
Conic hull and convex hull. The convex hull of any set is the smallest convex set containing it.
Minkowski sum: C + D = {x + y : x ∈ C, y ∈ D}.

Separating and supporting hyperplanes

Separating hyperplane theorem. Two disjoint convex sets can be separated by a hyperplane. This is the geometric heart of duality.
Supporting hyperplane theorem. For any boundary point x₀ of a convex set C, there exists a ≠ 0 with aᵀx ≤ aᵀx₀ for all x ∈ C.

3. Convex functions

A function f : Rⁿ → R (with convex domain) is convex if for all x, y in dom f and λ ∈ [0, 1]:

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)

Strictly convex if strict inequality holds for x ≠ y and λ ∈ (0, 1). Strongly convex with modulus m > 0 if f(x) − (m/2)‖x‖² is convex; equivalently, ∇²f(x) ⪰ mI when twice differentiable.

Equivalent characterizations

Epigraph. epi(f) = {(x, t) : f(x) ≤ t} is convex iff f is convex.
First-order condition. For differentiable f, f is convex iff f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y in dom f. The graph lies above all tangent hyperplanes.
Second-order condition. For twice differentiable f, f is convex iff ∇²f(x) ⪰ 0 for all x ∈ dom f.

Canonical examples

Affine: f(x) = aᵀx + b. Both convex and concave.
Quadratic: f(x) = (1/2) xᵀPx + qᵀx + r is convex iff P ⪰ 0.
Norms: every norm is convex (triangle inequality).
Exponential: e^(ax) is convex on R.
Power: xᵖ on R₊ is convex for p ≥ 1 or p ≤ 0; concave for 0 ≤ p ≤ 1.
Logarithm: log x is concave on R₊; −log x is convex.
Negative entropy: x log x on R₊ is convex.
Quadratic-over-linear: f(x, y) = x²/y is convex on R × R₊₊.
Log-sum-exp: f(x) = log(Σ exp(xᵢ)) is convex; a smooth approximation of max.
Max function: max(x₁, …, xₙ) is convex.
Spectral norm: ‖X‖₂ = σ_max(X) is convex.
Log-determinant: −log det X is convex on Sⁿ₊₊.
Indicator of a convex set: I_C(x) = 0 if x ∈ C, +∞ otherwise. Convex; central to proximal methods.

Operations preserving convexity

Nonnegative weighted sum: if f_i convex and α_i ≥ 0, then Σ α_i f_i is convex.
Composition with affine: f(Ax + b) is convex if f is.
Pointwise maximum: max_i f_i is convex if each f_i is.
Pointwise supremum over an arbitrary family preserves convexity.
Composition rules: f(g(x)) is convex if f is convex and nondecreasing, g convex; or f convex and nonincreasing, g concave.
Minimization over one argument: if f(x, y) is convex jointly and C is convex, then g(x) = inf_{y ∈ C} f(x, y) is convex.
Perspective: tf(x/t) is convex on dom f × R₊₊.

4. Standard problem form

A general optimization problem in standard form:

minimize    f₀(x)
subject to  f_i(x) ≤ 0,   i = 1, …, m
            h_j(x) = 0,   j = 1, …, p

The problem is convex if:

f₀ is convex (the objective),
f_i is convex for each i (inequality constraints define a convex feasible region),
h_j is affine, i.e., h_j(x) = a_jᵀx − b_j (equality constraints are linear).

The feasible set {x : f_i(x) ≤ 0, h_j(x) = 0} is convex (intersection of sublevel sets of convex functions with an affine subspace).

The optimal value p* = inf{f₀(x) : x feasible}; the optimal set is the set of minimizers. For convex problems, the optimal set is convex.

A point x is ε-suboptimal if f₀(x) ≤ p* + ε. Locally optimal means optimal in some neighborhood. Key fact: for convex problems, every locally optimal point is globally optimal.

5. Problem classes (in increasing generality)

5.1 Linear Program (LP)

minimize    cᵀx
subject to  Ax ≤ b
            Cx = d

Linear objective and linear constraints. Feasible region is a polyhedron; optimum (if finite) is attained at a vertex. The Dantzig simplex method (1947) pivots between vertices; interior-point methods (Karmarkar 1984) traverse the interior.

Applications: scheduling, transportation, assignment, network flow, diet, blending, production planning.

5.2 Quadratic Program (QP)

minimize    (1/2) xᵀPx + qᵀx + r
subject to  Ax ≤ b
            Cx = d

Convex iff P ⪰ 0. QPs are the workhorse of model predictive control (MPC) — one QP per control step. Also: SVM (hinge loss + L2), Lasso reformulation, portfolio mean-variance, least-squares with linear constraints.

5.3 Quadratically Constrained QP (QCQP)

minimize    (1/2) xᵀP₀x + q₀ᵀx + r₀
subject to  (1/2) xᵀP_i x + q_iᵀx + r_i ≤ 0,  i = 1, …, m
            Cx = d

Convex iff all P_i ⪰ 0 (including i = 0). Trust-region subproblems, certain robust LPs, and second-order cone constraints in disguise.

5.4 Second-Order Cone Program (SOCP)

minimize    fᵀx
subject to  ‖A_i x + b_i‖₂ ≤ c_iᵀx + d_i,  i = 1, …, m
            Fx = g

Generalizes LP and QP (every convex QP is an SOCP) and accommodates 2-norm constraints, robust LP under ellipsoidal uncertainty, sum-of-norms minimization, and certain antenna design problems. Solved efficiently by primal-dual interior-point.

5.5 Semidefinite Program (SDP)

minimize    tr(CX)
subject to  tr(A_i X) = b_i,  i = 1, …, m
            X ⪰ 0

(Or equivalently in inequality form: min cᵀx s.t. F₀ + Σ xᵢ F_i ⪰ 0.) Linear objective and constraints over the cone of PSD matrices. Generalizes LP, QP, QCQP, SOCP. Applications: relaxations of NP-hard combinatorial problems (Goemans-Williamson MaxCut), control synthesis (LMIs — linear matrix inequalities, Boyd et al. 1994), experiment design, polynomial optimization (Lasserre 2001 hierarchy), quantum information.

5.6 Conic Program (most general)

minimize    cᵀx
subject to  Ax = b
            x ∈ K

where K is a closed convex cone. The full hierarchy:

LP ⊂ SOCP ⊂ SDP ⊂ Conic

Modern solvers (MOSEK, SCS, ECOS) handle problems posed as conic programs over the product cone of {free, zero, nonneg, second-order cone, PSD cone, exponential cone, power cone}. The exponential and power cones broaden the reach to geometric programs, entropy maximization, and a wide class of “exotic” but still convex problems.

5.7 Nonsmooth convex

Many convex problems have nonsmooth components: L1 regularizer ‖x‖₁, hinge loss max(0, 1 − yᵢxᵀwᵢ), nuclear norm ‖X‖_*, total variation. These are handled by subgradient methods, proximal methods, ADMM (Alternating Direction Method of Multipliers, Glowinski + Marrocco 1975, popularized by Boyd et al. 2011), and bundle methods.

6. Duality + KKT

6.1 Lagrangian

For the standard-form problem in §4, the Lagrangian is:

L(x, λ, ν) = f₀(x) + Σᵢ λᵢ f_i(x) + Σⱼ νⱼ h_j(x)

with λ ∈ Rᵐ (multipliers for inequalities) and ν ∈ Rᵖ (multipliers for equalities).

6.2 Dual function

The Lagrange dual function:

g(λ, ν) = inf_x L(x, λ, ν)

g is always concave (pointwise infimum of affine functions in (λ, ν)), regardless of whether the primal problem is convex.

Weak duality: g(λ, ν) ≤ pfor any λ ≥ 0 and any ν. Setting d = sup_{λ ≥ 0, ν} g(λ, ν), we always have d≤ p. The duality gap is p− d ≥ 0.

6.3 Dual problem

maximize    g(λ, ν)
subject to  λ ≥ 0

This is always a convex problem (concave maximization). For LP, QP, SOCP, SDP, the dual is in the same class.

6.4 Strong duality + Slater’s condition

Strong duality holds when p*= d*. For convex problems, strong duality holds under Slater’s condition: there exists a strictly feasible point x with f_i(x) < 0 for all nonlinear inequalities (linear inequalities and equalities need only be satisfied with equality / nonstrict inequality). Slater’s condition is sufficient, not necessary; many problems satisfy strong duality without it.

6.5 KKT conditions (Karush 1939 master’s thesis; Kuhn + Tucker 1951)

For a convex problem with differentiable f_i, h_j and strong duality, xis primal-optimal and (λ, ν*) dual-optimal iff the Karush-Kuhn-Tucker (KKT) conditions hold:

Primal feasibility: f_i(x*) ≤ 0 for all i; h_j(x*) = 0 for all j.
Dual feasibility: λ_i* ≥ 0 for all i.
Complementary slackness: λ_i*· f_i(x*) = 0 for all i. (Either the constraint is active, f_i(x*) = 0, or its multiplier is zero, λ_i* = 0.)
Stationarity: ∇f₀(x*) + Σᵢ λ_i* ∇f_i(x*) + Σⱼ ν_j* ∇h_j(x*) = 0.

For non-convex problems, KKT is necessary (under a constraint qualification such as LICQ — linear independence of active constraint gradients) but not sufficient. For convex problems, KKT is both necessary and sufficient for optimality.

KKT is the workhorse of optimization: every convex solver eventually drives a KKT residual below tolerance. KKT also gives the shadow-price interpretation: λ_i* is the rate at which the optimal value would decrease if we relaxed constraint i.

6.6 Examples of duality

LP dual of LP: min cᵀx s.t. Ax = b, x ≥ 0 dualizes to max bᵀν s.t. Aᵀν ≤ c.
Quadratic dual: dual of a QP is another QP.
SDP dual of SDP: dual of an SDP is another SDP. Sometimes the dual is more tractable than the primal.
Conjugate function: f*(y) = sup_x (yᵀx − f(x)). Used throughout proximal and splitting methods. (f*) * = f for closed proper convex f.

7. Gradient methods (smooth convex)

(See [[Math/gradient-descent-variants]] — TBD — for deeper treatment.)

7.1 Gradient descent

x_{k+1} = x_k − α_k · ∇f(x_k)

Step-size choices:

Constant α_k = α. Converges if α ≤ 1/L where L is the Lipschitz constant of ∇f.
Exact line search: α_k = argmin_α f(x_k − α ∇f(x_k)). Optimal but expensive.
Backtracking line search (Armijo rule, 1966): start with α = 1, shrink by factor β until f(x_k − α ∇f) ≤ f(x_k) − c · α · ‖∇f‖². Typical c ∈ (0, 0.5), β ∈ (0, 1).

Convergence rates (for smooth convex f with L-Lipschitz gradient):

Convex: f(x_k) − f≤ O(L‖x₀ − x‖² / k). Sublinear, O(1/k).
Strongly convex (modulus m): ‖x_k − x*‖² ≤ ((L − m)/(L + m))^(2k) · ‖x₀ − x*‖². Linear convergence; condition number κ = L/m controls the rate.

7.2 Subgradient method

For nonsmooth convex f, replace ∇f by any subgradient g ∈ ∂f(x):

x_{k+1} = x_k − α_k · g_k

A subgradient g satisfies f(y) ≥ f(x) + gᵀ(y − x) for all y. The subdifferential ∂f(x) is the set of all subgradients.

Convergence: O(1/√k) under diminishing step sizes (α_k → 0, Σ α_k = ∞). Slower than gradient descent on smooth problems, but applies to nonsmooth.

7.3 Proximal gradient + ISTA + FISTA

For composite problems f(x) = g(x) + h(x) where g is smooth and h is convex (possibly nonsmooth, e.g., L1 or indicator), use the proximal operator:

prox_{αh}(v) = argmin_x { h(x) + (1/(2α)) ‖x − v‖² }

ISTA (Iterative Shrinkage-Thresholding Algorithm):

x_{k+1} = prox_{α h}(x_k − α ∇g(x_k))

Converges at O(1/k). For h = ‖·‖₁, the proximal operator is soft-thresholding: prox(v) = sign(v) ⊙ max(|v| − α, 0).

FISTA (Beck + Teboulle 2009, SIAM J Imaging Sci): adds Nesterov-style momentum to ISTA, achieving O(1/k²) — the optimal rate among first-order methods on this problem class.

y_k = x_k + (t_{k-1} − 1)/t_k · (x_k − x_{k-1})
x_{k+1} = prox_{α h}(y_k − α ∇g(y_k))
t_{k+1} = (1 + √(1 + 4 t_k²)) / 2

7.4 Conjugate gradient (CG)

For minimizing (1/2) xᵀAx − bᵀx with A ⪰ 0:

p_0 = r_0 = b − Ax_0
α_k = (r_kᵀ r_k) / (p_kᵀ A p_k)
x_{k+1} = x_k + α_k p_k
r_{k+1} = r_k − α_k A p_k
β_k = (r_{k+1}ᵀ r_{k+1}) / (r_kᵀ r_k)
p_{k+1} = r_{k+1} + β_k p_k

Converges in at most n steps in exact arithmetic. In practice, converges much faster than gradient descent on well-conditioned QP. Nonlinear conjugate gradient (Fletcher-Reeves, Polak-Ribière) extends to general smooth convex.

7.5 Accelerated methods (Nesterov 1983)

Nesterov’s accelerated gradient method achieves O(1/k²) for smooth convex f — provably optimal among methods using only first-order information (Nesterov’s lower bound).

y_{k+1} = x_k − α ∇f(x_k)
x_{k+1} = y_{k+1} + ((k − 1)/(k + 2)) · (y_{k+1} − y_k)

The momentum term is a “lookahead” that anticipates the next step. The same Θ(1/k²) bound holds, and for strongly convex problems, accelerated gradient achieves linear convergence with rate dependent on √κ instead of κ — a square-root improvement in the condition number.

8. Stochastic + first-order methods for ML

8.1 SGD (Robbins-Monro 1951, Ann Math Stat)

For finite-sum f(x) = (1/N) Σᵢ fᵢ(x), the stochastic gradient is a single sample (or mini-batch):

x_{k+1} = x_k − α_k · ∇f_{i_k}(x_k)

E[∇f_{i_k}(x_k)] = ∇f(x_k), so the update is unbiased. With diminishing step sizes Σ α_k = ∞, Σ α_k² < ∞, SGD converges almost surely for convex objectives.

Convergence rate: O(1/√k) without strong convexity; O(1/k) with strong convexity (Bottou + Curtis + Nocedal 2018 survey).

8.2 Momentum (Polyak 1964; heavy-ball)

v_{k+1} = β v_k − α ∇f(x_k)
x_{k+1} = x_k + v_{k+1}

Typical β ∈ [0.9, 0.99]. Accelerates convergence on ill-conditioned problems by accumulating velocity along consistent gradient directions.

8.3 Nesterov accelerated gradient (NAG)

The “lookahead” variant — gradient evaluated at the predicted next point:

v_{k+1} = β v_k − α ∇f(x_k + β v_k)
x_{k+1} = x_k + v_{k+1}

Equivalent to the §7.5 form under reparameterization. In deep learning practice, NAG and heavy-ball momentum often give similar results.

8.4 AdaGrad (Duchi + Hazan + Singer 2011, JMLR)

G_k = G_{k-1} + ∇f_k ⊙ ∇f_k         (elementwise)
x_{k+1} = x_k − α · ∇f_k / (√G_k + ε)

Per-coordinate adaptive learning rate. Excellent for sparse gradients. Accumulating squared gradients indefinitely causes the effective step size to decay to zero — addressed by RMSProp.

8.5 RMSProp (Hinton, Coursera 2012)

Exponential moving average instead of sum:

G_k = γ G_{k-1} + (1 − γ) ∇f_k ⊙ ∇f_k
x_{k+1} = x_k − α · ∇f_k / (√G_k + ε)

8.6 Adam (Kingma + Ba 2014, ICLR 2015)

Combines momentum (first moment) and RMSProp (second moment) with bias correction:

m_k = β₁ m_{k-1} + (1 − β₁) ∇f_k
v_k = β₂ v_{k-1} + (1 − β₂) ∇f_k ⊙ ∇f_k
m̂_k = m_k / (1 − β₁^k)
v̂_k = v_k / (1 − β₂^k)
x_{k+1} = x_k − α · m̂_k / (√v̂_k + ε)

Default hyperparameters β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸. The default optimizer in PyTorch / TensorFlow for ten years and counting.

8.7 AdamW (Loshchilov + Hutter 2019, ICLR)

Decouples weight decay from the gradient update. In Adam, adding L2 to the loss couples the regularization with the adaptive learning-rate scaling, weakening it. AdamW applies decay directly to the parameters:

x_{k+1} = (1 − α · w) · x_k − α · m̂_k / (√v̂_k + ε)

Now the dominant optimizer for large-model training.

8.8 Lion (Chen et al. 2023, NeurIPS — “EvoLved Sign Momentum”)

Discovered by program search. Uses sign of momentum:

c_k = β₁ m_{k-1} + (1 − β₁) ∇f_k
x_{k+1} = x_k − α · sign(c_k) − α · w · x_k
m_k = β₂ m_{k-1} + (1 − β₂) ∇f_k

Cheaper than Adam (no second moment), comparable or better on large transformer training in some regimes. Adoption mixed; AdamW still default.

8.9 LBFGS (limited-memory BFGS)

Quasi-Newton method that approximates H⁻¹ using the last m gradient differences (typically m = 5-20). Batch-only (needs full gradient); memory O(mn). Excellent for moderate-scale smooth problems; popular in scientific computing and logistic regression. Implementations: scipy.optimize.minimize(method="L-BFGS-B"), torch.optim.LBFGS.

9. Newton + second-order methods

9.1 Newton’s method

x_{k+1} = x_k − [∇²f(x_k)]⁻¹ · ∇f(x_k)

Quadratic local convergence near a minimum when ∇²f ⪰ 0 and Lipschitz. Cost: O(n³) per iteration to factor the Hessian (cubic in dimension).

Globalization: pure Newton may diverge from a bad initial point. Standard fixes:

Damped Newton: x_{k+1} = x_k − α_k · H⁻¹ ∇f with backtracking line search on α_k.
Trust region: at each step, minimize a local quadratic model over a ball ‖x − x_k‖ ≤ Δ_k; expand/contract Δ_k based on agreement ratio.
Modified Newton: if H is indefinite, add λI to make it PSD (cf. Levenberg-Marquardt).

9.2 Self-concordance (Nesterov + Nemirovskii 1994)

A convex function f is self-concordant if |f‴x| ≤ 2 (uᵀ ∇²f(x) u)^(3/2). Self-concordant barriers (log, log det) admit Newton’s method with global linear convergence and an explicit complexity bound — the foundation of interior-point complexity theory.

9.3 Quasi-Newton: BFGS, L-BFGS

BFGS (Broyden 1970, Fletcher 1970, Goldfarb 1970, Shanno 1970): rank-2 update to an approximate inverse Hessian B_k ≈ H⁻¹.

s_k = x_{k+1} − x_k,  y_k = ∇f_{k+1} − ∇f_k
B_{k+1} = (I − ρ_k s_k y_kᵀ) B_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ
where ρ_k = 1 / (y_kᵀ s_k)

Cost: O(n²) per step; no matrix factorization. Superlinear convergence under mild assumptions.

L-BFGS stores only the last m pairs (s, y) and reconstructs the action of B implicitly via a two-loop recursion (Nocedal 1980). Memory O(mn); now O(n) per step. Standard for n in the millions.

10. Interior-point methods

10.1 Background

Khachiyan’s ellipsoid method (1979) was the first polynomial-time algorithm for LP, but impractical. Karmarkar’s algorithm (1984, STOC) triggered the interior-point revolution: polynomial-time and practical. Nesterov + Nemirovskii (1994, “Interior-Point Polynomial Algorithms in Convex Programming”) generalized to all conic convex programs via self-concordant barriers.

10.2 Barrier method

Replace inequality constraints f_i(x) ≤ 0 with a smooth penalty:

minimize    f₀(x) + (1/t) · Σᵢ φ(f_i(x))
subject to  h_j(x) = 0

Common barrier: logarithmic, φ(u) = −log(−u), defined for u < 0. As t → ∞, the penalty vanishes for interior points and explodes on the boundary, so solutions trace the central path x*(t), converging to the optimum.

Algorithm: solve the inner subproblem with Newton’s method, then increase t (typically t ← μ · t with μ ∈ [10, 100]) and re-solve, warm-starting from the previous solution. Total complexity for LP: O(√n · log(1/ε)) outer iterations.

10.3 Primal-dual interior-point

Apply Newton’s method directly to a perturbed KKT system (relax complementary slackness λ_i · f_i = −1/t for some duality measure parameter). Update primal and dual simultaneously. Tighter complexity bound and more numerically robust than the pure barrier method. The standard in MOSEK, Gurobi, ECOS, SDPT3.

10.4 Complexity

For LP: O(n^{3.5} L) bit operations (Karmarkar). Modern interior-point implementations are typically O(n³) per iteration with O(√n · log(1/ε)) iterations. The simplex method, by contrast, has exponential worst-case but typically O(n) pivots in practice.

10.5 Implementations

MOSEK (commercial) — LP, QP, SOCP, SDP, exponential-cone.
Gurobi (commercial) — LP, QP, MIP.
ECOS (Domahidi + Chu + Boyd 2013) — open-source SOCP.
CVXOPT — Python, open.
SDPT3 (Tütüncü + Toh + Todd 2003) — SDP-SOCP, Matlab.
SeDuMi (Sturm 1999) — SDP-SOCP, Matlab.
SCS (O’Donoghue 2016) — operator-splitting first-order conic solver, scales to very large problems with moderate accuracy.

11. Simplex method (LP)

Dantzig (1947). Pivots between vertices of the polyhedron Ax ≤ b, x ≥ 0. At each pivot, exchanges one basic for one non-basic variable to improve the objective. Terminates at an optimal vertex (or detects unboundedness / infeasibility).

Worst case: exponential (Klee-Minty 1972 cubes force 2ⁿ pivots). Practical case: typically O(n) to O(m) pivots — extremely fast on most real problems.

Smoothed analysis (Spielman + Teng 2004, J ACM): expected polynomial pivots under perturbation. Helps explain the empirical efficiency.

Simplex remains competitive with interior-point on many LPs, particularly very large sparse LPs, and integrates naturally with branch-and-bound for MIP. Modern dual-simplex with steepest-edge pricing (in Gurobi, CPLEX, HiGHS) is state-of-the-art.

12. Specialized solvers + libraries

12.1 Modeling languages (DSLs)

CVX (Grant + Boyd, Matlab) — pioneer of disciplined convex programming.
CVXPY (Diamond + Boyd 2016, Python) — Python port; widely used in ML / control / signal-processing research.
Convex.jl — Julia.
JuMP (Dunning + Huchette + Lubin 2017) — Julia, covers convex and nonconvex.
YALMIP (Löfberg, Matlab) — broad SDP / LMI support.
PICOS, PyOMO — alternative Python interfaces.

12.2 Commercial solvers

MOSEK — best-in-class for SDP, second-order cone, exponential cone.
Gurobi — best-in-class for LP, QP, MIP.
CPLEX (IBM) — LP, MIP, QP.
KNITRO — interior-point + active-set for general NLP.
XPRESS (FICO) — LP, MIP.

12.3 Open-source solvers

OSQP (Stellato + Banjac + Goulart + Bemporad + Boyd 2020, Math Programming Computation) — operator-splitting QP solver, embedded-friendly, used in MPC. C with Python / Matlab / Julia bindings.
HiGHS (Huangfu + Hall) — open LP and QP; simplex and interior-point.
IPOPT (Wächter + Biegler 2006) — interior-point for general NLP; widely used in chemical engineering and trajectory optimization.
SCS (O’Donoghue + Chu + Parikh + Boyd 2016) — splitting conic solver, scales to very large conic problems at moderate accuracy.
ECOS — embedded SOCP solver.

12.4 Differentiable + ML-flavored

JAXopt (Blondel et al. 2021) — JAX-native solvers + implicit differentiation through optima.
cvxpylayers (Agrawal + Amos + Barratt + Boyd + Diamond + Kolter 2019) — differentiable CVXPY layers for PyTorch / TensorFlow / JAX.
scipy.optimize — linprog (LP via HiGHS), minimize (BFGS, L-BFGS-B, Nelder-Mead, etc.).

13. Applications

13.1 Machine learning

Support Vector Machine (SVM) — soft-margin: minimize (1/2)‖w‖² + C Σ max(0, 1 − yᵢ (wᵀxᵢ + b)). Convex QP. Vapnik 1995.
Lasso (Tibshirani 1996, JRSS-B) — minimize (1/2)‖Ax − b‖² + λ‖x‖₁. Convex; solved by FISTA, ADMM, coordinate descent (glmnet).
Elastic net (Zou + Hastie 2005) — Lasso + ridge: λ₁‖x‖₁ + λ₂‖x‖².
Logistic regression — convex (negative log-likelihood is log-sum-exp); solved by L-BFGS or IRLS.
Kernel methods — convex in the dual when kernel matrix is PSD.

13.2 Control

MPC (Model Predictive Control) — at each control step, solve a (typically small) QP over a finite horizon. Workhorse of modern process control, autonomous vehicles, drones. See [[Engineering/mpc-control]], [[Robotics/mpc-for-robots]].
LQR — closed-form solution via the Riccati equation, but the discrete-time finite-horizon formulation is a QP.
H∞ control — LMI / SDP formulation (Boyd et al. 1994).
Trajectory optimization — direct collocation produces a sparse NLP, often locally convex.

13.3 Signal processing

Compressed sensing (Candès + Romberg + Tao 2006, Donoho 2006) — L1 minimization recovers sparse signals from few measurements.
Total variation denoising (Rudin + Osher + Fatemi 1992) — min ‖∇x‖₁ + (λ/2)‖x − y‖². Solved by primal-dual splitting.
Beamforming — SOCP for antenna weight design.
Phase retrieval (PhaseLift, Candès et al.) — SDP relaxation.

13.4 Statistics

Maximum likelihood for log-concave families (Gaussian, Poisson, multinomial logit) is convex.
Quantile regression (Koenker + Bassett 1978) — LP.
Generalized linear models — convex when the link is canonical.

13.5 Operations research

Transportation problem — LP, solved by simplex in O(n³).
Assignment problem — LP / Hungarian algorithm.
Scheduling — often MIP, but LP relaxations give bounds.
Network flow — LP with special structure; specialized algorithms (Edmonds-Karp, push-relabel) outperform general LP.

13.6 Finance

Mean-variance portfolio (Markowitz 1952, J Finance — Nobel 1990) — minimize wᵀΣw subject to μᵀw ≥ r, 1ᵀw = 1, w ≥ 0. Convex QP.
Risk parity / risk budgeting — convex under specific formulations.
Index tracking — QP with tracking-error objective.
CVaR optimization (Rockafellar + Uryasev 2000, J Risk) — LP under scenario representation.

13.7 Robotics

Inverse kinematics — QP for end-effector tracking with joint-limit constraints.
Collision-avoidance QP — at each control step, solve a QP enforcing safe-set membership.
Whole-body control for humanoids — large QP solving force / torque distribution.
Grasp force optimization — SOCP enforcing friction-cone constraints.

13.8 Engineering

Truss / structural optimization — LP and SDP for compliance minimization.
Network design — convex relaxations of NP-hard topology problems.
Power flow — convex relaxations (SDP, SOCP) approximate the AC power flow equations (Lavaei + Low 2012).
Antenna array design — SOCP for sidelobe suppression.

14. Non-convex optimization (brief)

Most real-world problems are non-convex. Key issues:

Local minima: many — gradient methods get trapped.
Saddle points: in high dimensions, saddles vastly outnumber local minima; second-order methods or perturbations escape them.
No duality gap guarantees: weak duality still holds (d≤ p), but strong duality may fail.

Algorithms for non-convex NLP

SQP (Sequential Quadratic Programming) — at each iterate, solve a QP approximating the Lagrangian. SNOPT (Gill + Murray + Saunders 2002), filterSQP.
Interior-point NLP — IPOPT (Wächter + Biegler 2006), KNITRO. Treat barrier penalty + Newton step on KKT.
Augmented Lagrangian — LANCELOT, ALGENCAN.
Trust-region methods — robust globalization for nonsmooth or indefinite Hessian.

Global optimization

Branch-and-bound: BARON (Sahinidis 1996), Couenne, αBB. Provably global for many NLP classes.
CMA-ES (Hansen 2001) — evolutionary, derivative-free.
Simulated annealing (Kirkpatrick et al. 1983).
Bayesian optimization — GP surrogate models; popular in hyperparameter tuning.
Basin-hopping — local minimization + random restarts.

Neural networks

Massively non-convex (millions to trillions of parameters).
SGD + Adam empirically find good local optima that generalize well — an active research area (loss landscape geometry, lottery tickets, NTK limits).
Convex relaxations underlie certification (Wong + Kolter 2018, dual bounds on robustness).

15. Numerical considerations

15.1 Conditioning

The condition number κ = L/m (or κ(A) = σ_max(A)/σ_min(A) for an LP / QP data matrix) controls convergence rate of first-order methods and numerical accuracy of factorizations. Preconditioning (diagonal scaling, Jacobi, Ruiz equilibration) is essential for ill-conditioned problems.

15.2 Scaling

Rescale variables and constraints to similar magnitudes. Most commercial solvers do this automatically; in custom solvers, neglecting it leads to slow convergence or false infeasibility.

15.3 Warm-starting

For sequential problems (MPC: one QP per control step with slightly different data), reusing the previous solution as the starting point can reduce solve time by 10-100×. Interior-point methods warm-start poorly (need a strictly interior point); active-set methods and ADMM warm-start excellently. This is why OSQP and Gurobi’s dual simplex / active-set QP are favored in real-time control.

15.4 Tolerances

Solvers report:

Primal-dual gap: |p_k − d_k|. Direct measure of optimality.
KKT residual: ‖∇L‖ + complementarity gap + feasibility.
Primal residual: ‖Ax − b‖, max(0, f_i(x)).
Dual residual: dual feasibility violation.

Typical tolerances: 10⁻⁶ to 10⁻⁸ for academic problems; 10⁻³ to 10⁻⁴ acceptable for MPC where speed matters and the solution is consumed by an actuator anyway.

15.5 Feasibility vs. optimality trade-off

In real-time systems, returning a feasible but suboptimal point quickly is often preferable to waiting for full optimality. Anytime algorithms (interior-point with early termination, ADMM, primal-dual splitting) provide this. Hard constraints can be softened with slack penalties to guarantee feasibility.

16. Modeling tips

16.1 Recognize convexity

Convexity is often hidden. Useful tricks:

Epigraph form: replace min max(f_i(x)) with min t s.t. f_i(x) ≤ t.
Log-sum-exp: smooth approximation of max; convex.
Quadratic-over-linear: x²/y for y > 0; convex.
Geometric mean: (Π xᵢ)^(1/n) is concave; arises in geometric programming.
Schur complement: turns nonlinear matrix conditions into linear ones over the PSD cone.

16.2 Disciplined Convex Programming (DCP)

The DCP ruleset (Grant + Boyd + Ye 2006) imposes structural constraints on how convex expressions can be combined, guaranteeing the resulting problem is convex. CVX/CVXPY enforce DCP at parse time, rejecting expressions whose convexity cannot be verified compositionally.

Common DCP-friendly idioms:

norm(x, 1) + norm(x, 2) — sum of convex norms is convex.
square(x - y) — convex.
log_sum_exp(x) — convex.
inv_pos(x) — 1/x for x > 0; convex.
geo_mean(x) — concave.

16.3 Reformulations

L1 minimization: minimize ‖x‖₁ ↔ minimize 1ᵀt s.t. −t ≤ x ≤ t. LP.
L∞ minimization: minimize ‖x‖∞ ↔ minimize t s.t. −t ≤ x ≤ t. LP.
Maximum: minimize max(f_i(x)) ↔ minimize t s.t. f_i(x) ≤ t.
Soft constraints: replace hard f_i(x) ≤ 0 with f_i(x) ≤ s, s ≥ 0, penalize ρ·s in the objective (huge ρ for “almost-hard”).
Robust LP (ellipsoidal uncertainty) → SOCP. Robust LP (polyhedral uncertainty) → LP.
Indicator I_C(x) is convex iff C is convex; used in proximal splitting to enforce constraints implicitly.

16.4 Sparsity

Many problems have block-sparse structure (banded, tree-structured, network). Specialized solvers exploit this for orders-of-magnitude speedup. OSQP, qpOASES, FORCES Pro all exploit MPC-typical sparsity patterns.

17. Cross-references

[[Math/_index]] — top-level Math MOC.
[[Math/linear-algebra-essentials]] — vectors, matrices, eigenvalues, SVD, PSD.
[[Math/gradient-descent-variants]] (TBD) — deeper coverage of GD, momentum, Adam, etc.
[[Math/lagrangian-kkt]] (TBD — but largely covered here in §6).
[[Math/probability-essentials]] (TBD) — for stochastic gradient analysis.
[[Engineering/mpc-control]] — convex MPC as the canonical embedded QP application.
[[Robotics/mpc-for-robots]] — robot whole-body QP, force distribution.
[[Compute/transformer-architecture]] — training transformers via Adam / AdamW.
[[ML/sgd-deep-learning]] (TBD) — non-convex first-order in practice.
[[Finance/portfolio-optimization]] (TBD) — Markowitz and beyond.

18. Citations + further reading

Textbooks

Boyd, S. + Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Free PDF: https://web.stanford.edu/~boyd/cvxbook/. The standard reference.
Nesterov, Y. (2018). Lectures on Convex Optimization (2nd ed). Springer. Rigorous complexity-theory treatment.
Nocedal, J. + Wright, S. (2006). Numerical Optimization (2nd ed). Springer. Numerical algorithms in depth.
Bertsekas, D. (2009). Convex Optimization Theory. Athena Scientific.
Bertsekas, D. (2015). Convex Optimization Algorithms. Athena Scientific.
Bubeck, S. (2015). Convex Optimization: Algorithms and Complexity. Foundations and Trends in ML. Free online.
Beck, A. (2017). First-Order Methods in Optimization. SIAM.

Seminal papers

Dantzig, G. (1947). The simplex method for LP.
Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side Conditions. MSc thesis, U Chicago.
Kuhn, H. W. + Tucker, A. W. (1951). Nonlinear programming. Proc 2nd Berkeley Symposium.
Markowitz, H. (1952). Portfolio selection. J Finance, 7(1), 77-91. Nobel 1990.
Robbins, H. + Monro, S. (1951). A stochastic approximation method. Ann Math Stat, 22(3), 400-407.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Comput Math + Math Phys.
Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady.
Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. STOC ‘84, 302-311.
Nesterov, Y. + Nemirovskii, A. (1994). Interior-Point Polynomial Algorithms in Convex Programming. SIAM Studies in Applied Mathematics.
Khachiyan, L. G. (1979). A polynomial algorithm in linear programming. Soviet Math Doklady.
Boyd, S. + El Ghaoui, L. + Feron, E. + Balakrishnan, V. (1994). Linear Matrix Inequalities in System and Control Theory. SIAM.
Beck, A. + Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sciences, 2(1), 183-202. FISTA.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. JRSS-B, 58(1), 267-288.
Candès, E. + Romberg, J. + Tao, T. (2006). Robust uncertainty principles. IEEE Trans Info Theory.
Donoho, D. (2006). Compressed sensing. IEEE Trans Info Theory, 52(4), 1289-1306.
Kingma, D. + Ba, J. (2014). Adam: A method for stochastic optimization. ICLR 2015.
Loshchilov, I. + Hutter, F. (2019). Decoupled weight decay regularization. ICLR 2019. AdamW.
Chen, X. et al. (2023). Symbolic discovery of optimization algorithms. NeurIPS. Lion.
Duchi, J. + Hazan, E. + Singer, Y. (2011). Adaptive subgradient methods. JMLR 12, 2121-2159. AdaGrad.
Stellato, B. et al. (2020). OSQP: An operator splitting solver for quadratic programs. Math Programming Computation 12, 637-672.
Wächter, A. + Biegler, L. T. (2006). On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math Programming 106, 25-57. IPOPT.
Diamond, S. + Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. JMLR 17(83), 1-5.
Agrawal, A. + Amos, B. + Barratt, S. + Boyd, S. + Diamond, S. + Kolter, J. Z. (2019). Differentiable convex optimization layers. NeurIPS.
Nocedal, J. (1980). Updating quasi-Newton matrices with limited storage. Math Comput 35(151), 773-782. L-BFGS.
Rockafellar, R. T. (1970). Convex Analysis. Princeton. The mathematical foundation.

Software documentation

CVXPY: https://www.cvxpy.org/
MOSEK: https://docs.mosek.com/
Gurobi: https://www.gurobi.com/documentation/
OSQP: https://osqp.org/
IPOPT: https://coin-or.github.io/Ipopt/
SCS: https://www.cvxgrp.org/scs/
HiGHS: https://highs.dev/
scipy.optimize: https://docs.scipy.org/doc/scipy/reference/optimize.html

Tier-1 math reference. Last revised 2026-05-16.

Compendium

Explorer

Convex Optimization — Math Reference

Convex Optimization — Math Reference

1. At a glance

2. Convex sets

Canonical examples

Operations preserving convexity

Separating and supporting hyperplanes

3. Convex functions

Equivalent characterizations

Canonical examples

Operations preserving convexity

4. Standard problem form

5. Problem classes (in increasing generality)

5.1 Linear Program (LP)

5.2 Quadratic Program (QP)

5.3 Quadratically Constrained QP (QCQP)

5.4 Second-Order Cone Program (SOCP)

5.5 Semidefinite Program (SDP)

5.6 Conic Program (most general)

5.7 Nonsmooth convex

6. Duality + KKT

6.1 Lagrangian

6.2 Dual function

6.3 Dual problem

6.4 Strong duality + Slater’s condition

6.5 KKT conditions (Karush 1939 master’s thesis; Kuhn + Tucker 1951)

6.6 Examples of duality

7. Gradient methods (smooth convex)

7.1 Gradient descent

7.2 Subgradient method

7.3 Proximal gradient + ISTA + FISTA

7.4 Conjugate gradient (CG)

7.5 Accelerated methods (Nesterov 1983)

8. Stochastic + first-order methods for ML

8.1 SGD (Robbins-Monro 1951, Ann Math Stat)

8.2 Momentum (Polyak 1964; heavy-ball)

8.3 Nesterov accelerated gradient (NAG)

8.4 AdaGrad (Duchi + Hazan + Singer 2011, JMLR)

8.5 RMSProp (Hinton, Coursera 2012)

8.6 Adam (Kingma + Ba 2014, ICLR 2015)

8.7 AdamW (Loshchilov + Hutter 2019, ICLR)

8.8 Lion (Chen et al. 2023, NeurIPS — “EvoLved Sign Momentum”)

8.9 LBFGS (limited-memory BFGS)

9. Newton + second-order methods

9.1 Newton’s method

9.2 Self-concordance (Nesterov + Nemirovskii 1994)

9.3 Quasi-Newton: BFGS, L-BFGS

10. Interior-point methods

10.1 Background

10.2 Barrier method

10.3 Primal-dual interior-point

10.4 Complexity

10.5 Implementations

11. Simplex method (LP)

12. Specialized solvers + libraries

12.1 Modeling languages (DSLs)

12.2 Commercial solvers

12.3 Open-source solvers

12.4 Differentiable + ML-flavored

13. Applications

13.1 Machine learning

13.2 Control

13.3 Signal processing

13.4 Statistics

13.5 Operations research

13.6 Finance

13.7 Robotics

13.8 Engineering

14. Non-convex optimization (brief)

Algorithms for non-convex NLP

Global optimization

Neural networks

15. Numerical considerations

15.1 Conditioning

15.2 Scaling

15.3 Warm-starting

15.4 Tolerances

15.5 Feasibility vs. optimality trade-off