Riemannian Optimization

Many machine-learning and engineering problems are most naturally written as optimization over a smooth manifold rather than over an unconstrained Euclidean space: rotations live on SO(3), orthonormal frames on the Stiefel manifold, low-rank matrices on a fixed-rank variety, covariance matrices on the cone of symmetric positive-definite (SPD) matrices, taxonomy embeddings in hyperbolic space, discrete distributions on the probability simplex. Riemannian optimization treats the manifold as a first-class object — gradient steps, line searches, conjugate gradient, quasi-Newton, trust-region, and stochastic methods all generalize cleanly once the manifold is equipped with a metric and a way to move along it.

1. Motivation

Three common reactions to a manifold-constrained problem:

  1. Project after each step. Cheap, but loses the geometry; convergence rates degrade and the iterate may leave any meaningful neighborhood.
  2. Lagrangian / penalty methods. Drag the constraint into the cost. The constraint surface becomes a basin instead of a hard boundary; conditioning is awful near the manifold.
  3. Riemannian optimization. Keep iterates exactly on the manifold and replace gradient / line-search machinery with manifold-aware analogs. Cleanest, exposes invariances, and convergence theory mirrors the Euclidean theory once the curvature terms are accounted for.

The third path is what Edelman-Arias-Smith 1998 (“The Geometry of Algorithms with Orthogonality Constraints”, SIAM J. Matrix Analysis) crystallized for the Stiefel and Grassmann manifolds, and what Absil-Mahony-Sepulchre 2008 (“Optimization Algorithms on Matrix Manifolds”, Princeton) wrote into a textbook treatment. Boumal 2023 (“An Introduction to Optimization on Smooth Manifolds”, Cambridge) is the current standard reference.

2. Mathematical setup

2.1 Smooth manifold

A d-dimensional smooth manifold M is a topological space locally homeomorphic to R^d with smooth transition maps between charts. Examples we care about are embedded submanifolds of a Euclidean space R^n (then tangent spaces are linear subspaces of R^n) and quotient manifolds like the Grassmann.

2.2 Tangent space and Riemannian metric

The tangent space T_p M at p ∈ M is the d-dimensional vector space of velocities of smooth curves through p. A Riemannian metric g is a smooth assignment of an inner product <·, ·>_p on each T_p M. The pair (M, g) is a Riemannian manifold.

For embedded submanifolds the natural metric is the restriction of the ambient Euclidean inner product to T_p M.

2.3 Exponential map

Given p ∈ M and v ∈ T_p M, the exponential map exp_p(v) ∈ M is obtained by following the geodesic starting at p with initial velocity v for unit time. Geodesics are length-minimizing curves; on R^n with the standard metric they are straight lines and exp_p(v) = p + v.

Computing exp_p requires solving a second-order ODE in general and is expensive on most manifolds. Two common cheap surrogates:

2.4 Retraction

A retraction R_p : T_p M -> M is a smooth map with R_p(0) = p and dR_p(0) = id on T_p M. Geometrically: a first-order approximation to exp_p. Examples:

  • Sphere S^{n-1}: R_p(v) = (p + v) / ||p + v|| — normalize.
  • Stiefel St(n, k): R_p(v) = qf(p + v) (QR factor) or polar R_p(v) = (p + v)((p + v)^T (p + v))^{-1/2}.
  • SPD Sym+(n): R_p(v) = p + v + (1/2) v p^{-1} v (second-order retraction) or simpler R_p(v) = p^{1/2} expm(p^{-1/2} v p^{-1/2}) p^{1/2} (matches the affine-invariant exponential).

Retractions preserve first-order convergence theory for gradient methods; for Newton and trust-region you typically want second-order retractions (Absil-Malick 2012).

2.5 Vector transport

To move tangent vectors between tangent spaces (essential for conjugate-gradient and quasi-Newton on manifolds), one needs parallel transport Γ_{p->q} (the exact, geodesic-preserving move) or a cheaper vector transport T_{p->q} that is a smooth approximation. For embedded submanifolds, projection transportT_{p->q}(v) = Π_{T_q M}(v) — is the dominant practical choice.

2.6 Riemannian gradient

For a smooth f : M -> R, the Riemannian gradient grad f(p) ∈ T_p M is defined by

<grad f(p), v>_p = D f(p)[v]   for all v ∈ T_p M

For embedded submanifolds with the induced metric this reduces to projection of the ambient gradient:

grad f(p) = Π_{T_p M} ∇f(p)

where ∇f(p) is the Euclidean gradient computed as if p lived in the ambient space.

2.7 Riemannian Hessian

The Riemannian Hessian at p is a self-adjoint linear operator on T_p M:

Hess f(p)[v] = ∇_v grad f(p)

where is the Levi-Civita connection (the unique torsion-free metric-compatible connection). For embedded submanifolds it is the projected directional derivative of the gradient field plus a curvature correction:

Hess f(p)[v] = Π_{T_p M}(D grad f(p)[v]) + curvature term

For matrix manifolds the curvature term comes from differentiating the projection itself. Mishra-Sepulchre 2014 and Boumal 2023 give clean per-manifold expressions.

3. Algorithms

3.1 Riemannian gradient descent

p_{k+1} = R_{p_k}(- α_k grad f(p_k))

Step size α_k from Armijo / Wolfe line search on the manifold (back-track along the retraction curve). Convergence to a critical point under standard smoothness; convergence rate O(1/k) for L-smooth objectives, O(c^k) for geodesically μ-strongly-convex objectives where c = 1 - μ/L. Geodesic strong convexity is rare but is satisfied for some interesting problems (e.g. Karcher mean on Hadamard manifolds).

3.2 Riemannian conjugate gradient

Generalize Fletcher-Reeves or Polak-Ribière:

η_{k+1} = - grad f(p_{k+1}) + β_{k+1} T_{p_k -> p_{k+1}}(η_k)

with β chosen from the manifold analog of FR (||grad f(p_{k+1})||^2 / ||grad f(p_k)||^2) or PR. Absil-Mahony-Sepulchre 2008 lay out the convergence theory; Sato-Iwai 2015 give the modern formulation with vector transport.

3.3 Riemannian L-BFGS / quasi-Newton

Huang-Gallivan-Absil 2015 (“A Broyden Class of Quasi-Newton Methods for Riemannian Optimization”, SIAM J. Optimization) — careful definition of the manifold update such that the inverse Hessian approximation stays positive definite under vector transport. Memory-limited variant LRBFGS is the default workhorse in Manopt and Pymanopt for medium-sized problems.

3.4 Riemannian trust region (RTR)

Absil-Baker-Gallivan 2007 (“Trust-Region Methods on Riemannian Manifolds”, Found. Comput. Math.) — solve

min_{η ∈ T_p M, ||η|| <= Δ}   f(p) + <grad f(p), η>_p + (1/2) <Hess f(p)[η], η>_p

via a truncated CG inner solver, then accept / reject with the standard trust-region ratio test and update Δ. Quadratic local convergence under standard regularity; the gold standard for medium-dimensional Riemannian problems.

3.5 Riemannian Newton

Hess f(p) η = - grad f(p), then p_{k+1} = R_{p_k}(η_k). Locally quadratic, but globally fragile; trust-region replaces it in practice.

3.6 Riemannian stochastic gradient (RSGD)

Bonnabel 2013 (“Stochastic Gradient Descent on Riemannian Manifolds”, IEEE TAC) — same as Euclidean SGD with retraction and projected stochastic gradient:

p_{k+1} = R_{p_k}(- α_k g_k),   E[g_k] = grad f(p_k)

O(1/k) rate for strongly geodesically convex objectives; almost-sure convergence to a stationary point with diminishing step sizes.

3.7 Riemannian Adam / RAdam

Bécigneul-Ganea 2018 (“Riemannian Adaptive Optimization Methods”) — port Adam to manifolds with vector transport of the first and second moment estimates. The hyperbolic-space variant powers Poincaré embedding training. Kasai-Sato-Mishra 2018 covers Riemannian variance reduction (SVRG on manifolds).

3.8 Constraints inside manifolds

For manifold-and-inequality problems (e.g. SPD matrices with bounded trace) one combines Riemannian descent with augmented Lagrangian or barrier terms — Liu-Boumal 2019 give the Riemannian augmented Lagrangian recipe. See convex-optimization for the Euclidean baseline.

4. Key manifolds

4.1 Stiefel manifold St(n, k)

Orthonormal n x k matrices: St(n, k) = { X ∈ R^{n x k} : X^T X = I_k }. Dimension nk - k(k+1)/2.

Tangent space: T_X St(n, k) = { V ∈ R^{n x k} : X^T V + V^T X = 0 }.

Projection: Π_{T_X}(Z) = Z - X sym(X^T Z) where sym(A) = (A + A^T)/2.

Retraction: QR or polar (see 2.4).

Used in: orthogonal-constrained NN layers, PCA / generalized eigenvalue problems, low-rank matrix completion factor U. Edelman-Arias-Smith 1998 is the canonical reference.

4.2 Grassmann manifold Gr(n, k)

k-dimensional subspaces of R^n. Quotient of St(n, k) by O(k) — two Stiefel points represent the same Grassmann point if they share the column space.

Tangent space at [X]: T_{[X]} Gr(n, k) ≃ { V ∈ R^{n x k} : X^T V = 0 } (the horizontal lift).

Used when only the subspace, not the specific basis, matters: subspace tracking, robust PCA, subspace clustering, low-rank matrix completion (e.g. GROUSE, Balzano-Recht-Nowak 2010).

4.3 SPD cone Sym+(n)

Symmetric positive-definite n x n matrices. Open convex cone in Sym(n), but the geometry that matters is Riemannian, not the flat one.

Three common metrics:

  • Affine-invariant (Pennec-Fillard-Ayache 2006) — <U, V>_P = tr(P^{-1} U P^{-1} V). Geodesics: P^{1/2} expm(t P^{-1/2} V P^{-1/2}) P^{1/2}. Invariant under congruence transforms P -> A P A^T; reflects the natural geometry for covariance matrices. Distances are huge near the boundary, so the cone “feels” infinitely deep.
  • Log-Euclidean (Arsigny-Fillard-Pennec-Ayache 2007) — d(P, Q) = ||logm(P) - logm(Q)||_F. Cheaper to compute (linear in log-domain), used as a fast approximation but loses affine invariance.
  • Bures-Wasserstein — quotient of St(n, n) by O(n); metric d_{BW}^2(P, Q) = tr(P) + tr(Q) - 2 tr((P^{1/2} Q P^{1/2})^{1/2}). Coincides with the 2-Wasserstein metric between zero-mean Gaussians and is foundational in optimal transport.

Used in: diffusion-tensor imaging (DTI), brain-computer interfaces (Riemannian classifiers of EEG covariance, Barachant-Bonnet-Congedo-Jutten 2013), portfolio covariance estimation, Gaussian process kernels.

4.4 Fixed-rank matrix manifold M_r

M_r = { X ∈ R^{n x m} : rank(X) = r }. A smooth manifold of dimension (n + m - r) r. Parametrized via thin SVD X = U Σ V^T with U ∈ St(n, r), V ∈ St(m, r), Σ ∈ R^{r x r}_+.

Workhorse for matrix completion (Netflix-style problems): minimize (1/2) sum_{(i,j) ∈ Ω} (X_{ij} - M_{ij})^2 over M_r. Boumal-Absil 2011 (“RTRMC: A Riemannian Trust-Region Method for Low-Rank Matrix Completion”) was state of the art in 2011 and remains a strong baseline. Mishra-Meyer-Bach-Sepulchre 2014 covers the full family of fixed-rank algorithms.

4.5 Sphere S^{n-1} and projective spaces

S^{n-1} = { x ∈ R^n : ||x|| = 1 }. Trivial tangent space T_x S^{n-1} = x^⊥. Exp map closed form: exp_x(v) = cos(||v||) x + sin(||v||) v / ||v||. Used in direction-only problems (gravity vectors, axis estimation), rank-1 tensor approximation, neural-network weight normalization (Salimans-Kingma 2016).

RP^{n-1} (real projective space) and CP^{n-1} (complex projective) — quotients of spheres by the action of ±1 and S^1 respectively; the natural manifold when global phase is meaningless.

4.6 Hyperbolic spaces

Constant negative-curvature spaces. Three equivalent models all in active use:

  • Poincaré disk / ball — unit ball with metric 4 / (1 - ||x||^2)^2 (dx)^2. Conformally flat. Used by Nickel-Kiela 2017 (Facebook, “Poincaré Embeddings for Learning Hierarchical Representations”, NeurIPS 2017) — embed WordNet taxonomy; tree structure embeds with low distortion in H^d whereas Euclidean needs much higher d.
  • Lorentz / hyperboloid model{ x ∈ R^{n+1} : -x_0^2 + x_1^2 + ... + x_n^2 = -1, x_0 > 0 } with Minkowski inner product. Numerically more stable than Poincaré near the boundary; Nickel-Kiela 2018 (“Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry”) revisited.
  • Klein model — open ball with straight-line geodesics but non-conformal angles.

Hyperbolic neural networks (Ganea-Becigneul-Hofmann 2018) generalize MLPs to H^n; used in tree-structured NLP, hierarchical graph embeddings, knowledge graphs.

4.7 Lie groups SO(n), SE(n)

SO(n) — rotations; SE(n) = SO(n) ⋉ R^n — rigid motions. Exponential map = matrix exponential of skew-symmetric matrix (for SO(n)); Rodrigues’ formula for SO(3). Foundational in robotics — see state-space-lqr, lie-groups-so3-se3, slam. Pose-graph SLAM is Riemannian optimization on SE(3)^N plus correlations.

4.8 Manifolds of probability distributions

The space of probability densities equipped with the Fisher information metric is a Riemannian manifold — information geometry (Amari 1998, “Natural Gradient Works Efficiently in Learning”, Neural Computation). The natural gradient is the Riemannian gradient under the Fisher metric. For exponential families it has a closed form, and modern stochastic gradient methods (K-FAC by Martens-Grosse 2015 for neural networks) are practical approximations.

The probability simplex Δ_n = { p ∈ R^n_+ : sum p_i = 1 } carries the Fisher metric and equals an open subset of a sphere under a square-root change of variable — the Hellinger embedding.

5. Applications

5.1 PCA, sparse PCA, robust PCA

PCA = max_{X ∈ St(n, k)} tr(X^T C X) for sample covariance C. Classical Euclidean solution = top eigenvectors; Riemannian solvers are competitive at scale and naturally generalize to:

  • Sparse PCASt(n, k) plus l_1 penalty on X; Journée-Nesterov-Richtárik-Sepulchre 2010 for the manifold formulation.
  • Robust PCA — Tyler’s M-estimator on the SPD cone (Wiesel 2012); robust to outliers.
  • Online PCA / subspace tracking — Grassmann gradient descent (GROUSE, Balzano et al 2010), streaming version of PCA.

5.2 Matrix completion

The Netflix prize era cemented the importance of fixed-rank matrix optimization. LMaFit, OptSpace (Keshavan-Montanari-Oh 2010), RTRMC (Boumal-Absil 2011), Mishra-Sepulchre 2014 are all Riemannian solvers; trust-region competitive with or beating ALS / convex nuclear-norm methods on large rated entries. Used internally at Netflix, Pandora, Amazon for collaborative filtering.

5.3 Tensor decomposition

Tucker decomposition has core tensor G ∈ R^{r_1 x ... x r_d} plus factors U_i ∈ St(n_i, r_i). Optimization is naturally on the product manifold of Stiefel × Stiefel × … × R^{r_1 x ... x r_d}. Eldén-Savas 2009, Ishteva-Absil-Van Huffel-De Lathauwer 2011 for Riemannian Tucker. CP decomposition has its own manifold (joint of unit-norm columns).

5.4 Robust covariance, Tyler M-estimator

Tyler’s M-estimator (Tyler 1987) is the fixed point of an iteration on the SPD cone; reframed as a Riemannian optimization problem (Wiesel 2012, Sun-Babu-Palomar 2014), it has convergence guarantees under geodesic convexity. Used in radar STAP, financial risk, BCI.

5.5 Independent Component Analysis (ICA)

Whitened ICA is optimization over the orthogonal group O(n) — find rotation maximizing source non-Gaussianity. FastICA (Hyvärinen 1999) is fixed-point iteration on O(n); Riemannian gradient on O(n) gives a more general, slower-but-more-robust solver.

5.6 Pose-graph SLAM

SE-Sync (David Rosen et al 2019, “SE-Sync: A Certifiably Correct Algorithm for Synchronization over the Special Euclidean Group”, Int. J. Robotics Research) — solve pose-graph SLAM by Riemannian optimization over St(d, dN) / O(d)-style manifold with a semidefinite relaxation that is provably exact in low-noise regimes. The frontier of certifiable SLAM. See slam and lie-groups-so3-se3.

5.7 Diffeomorphic image registration

LDDMM (Large Deformation Diffeomorphic Metric Mapping, Beg-Miller-Trouvé-Younes 2005) — register two images by finding a diffeomorphism that minimizes a path-length functional on the manifold of diffeomorphisms. ANTs (Avants-Epstein-Grossman-Gee 2008, Medical Image Analysis) — Advanced Normalization Tools; widely used in NIH-funded neuroimaging. Modern variant Voxelmorph (Balakrishnan-Zhao-Sabuncu-Guttag-Dalca 2018) replaces the optimization with a learned NN but inherits the same manifold formulation.

5.8 Hyperbolic embeddings

Tree-structured data (taxonomies, hierarchical labels, citation networks) embeds in hyperbolic space with much lower dimension and distortion than Euclidean. Poincaré embeddings (Nickel-Kiela 2017, Facebook AI Research) for WordNet; HGCN (Chami-Ying-Re-Leskovec 2019) hyperbolic graph convolution networks. RAdam (Bécigneul-Ganea 2018) is the standard optimizer.

5.9 Continuous attention and soft routing

Niculae-Blondel 2020 (“Sparse and Constrained Attention for Neural Machine Translation”) — generalize softmax attention to general convex polytopes on the simplex; the manifold is the simplex and the optimizer is Riemannian Frank-Wolfe.

5.10 Information-geometric optimization

CMA-ES (Hansen-Ostermeier 2001) and natural-gradient evolution strategies (Wierstra-Schaul-Glasmachers-Sun-Peters-Schmidhuber 2014) — evolutionary algorithms on the information manifold of Gaussian search distributions. K-FAC (Martens-Grosse 2015) — natural gradient for deep networks via Kronecker-factored Fisher approximation.

6. Software

ToolLanguageStrength
Manopt (Boumal et al, since 2014)MATLABThe original; full algorithm zoo and broad manifold library
Pymanopt (Townsend-Koep-Weichwald 2016)PythonDirect port of Manopt; uses autodiff (autograd / JAX / TF / PyTorch) for gradients and Hessians
ROPTLIB (Huang et al)C++Performance-focused, supports very large problems
Geomstats (Miolane et al 2020, JMLR)PythonStrong info-geometry side; statistical analysis on manifolds
GeoTorch (Lezcano-Casado 2019)PyTorchParametrize NN layers as living on a manifold; differentiable retractions
McTorch (Meghwanshi et al 2018)PyTorchPyTorch-native Riemannian autograd
Manifolds.jl (Mateusz Baran et al)JuliaComposable manifolds; pairs with Manopt.jl for solvers

Use Pymanopt or GeoTorch for new ML projects; Manopt or Manopt.jl for prototyping with the broadest manifold catalogue; ROPTLIB for production C++ deployment.

7. Theory

7.1 Convergence rates

  • Gradient descent: O(1/k) for L-smooth f; O((1 - μ/L)^k) linear rate for geodesically μ-strongly-convex f.
  • Conjugate gradient: super-linear under standard regularity once iterates land in a neighborhood of a strict local minimum.
  • L-BFGS: super-linear under standard regularity.
  • Trust-region: quadratic if Hessian is non-singular at the minimum.

Geodesic convexity is rare in ML (the SPD cone with affine-invariant metric admits a sizable class — geodesic convex programming, Sra-Hosseini 2015) but local strong convexity is the operative regime in practice.

7.2 Cost of retraction vs exponential

Exponential map on Stiefel: matrix exponential plus a 2k x 2k SVD per step; retraction: nk^2 QR. For large n and modest k, retraction is 10-100x cheaper and the convergence-rate exponents are the same. The exception is when the curvature term matters (Newton, trust-region with very tight tolerances), where second-order retractions or true exp_p are needed.

7.3 Benchmarking and bench problems

Boumal et al maintain the Manopt benchmark suite: low-rank matrix completion on Netflix-like data, Karcher mean on SPD, dominant invariant subspace, joint diagonalization. The ManOpt OPTIMUM benchmark (Boumal 2023 appendix) tabulates iteration counts and timings across solvers.

8. Frontiers

8.1 Stochastic Riemannian optimization for ML

Variance-reduced SVRG (Kasai-Sato-Mishra 2018), Riemannian SAGA, momentum methods (Alimisis-Orvieto-Bécigneul-Lucchi 2020 — “Practical Accelerated Optimization on Riemannian Manifolds”). Open problems: principled acceleration on manifolds with bounded curvature, large-batch parallel Riemannian SGD.

8.2 Hyperbolic deep learning

Beyond Poincaré embeddings — hyperbolic attention (Gulcehre-Denil-Malinowski 2019), hyperbolic neural ODEs (Bose et al 2020), product spaces mixing Euclidean / spherical / hyperbolic components (Gu-Sala-Gunel-Re 2019 “Learning Mixed-Curvature Representations in Product Spaces”). Pinned by the fact that hyperbolic geometry is the natural metric for many tree-like objects in language, knowledge graphs, and biology.

8.3 Manifold-aware diffusion and generative models

Generative diffusion on the manifold of densities and on geometric data — Riemannian Score-Based Generative Models (De Bortoli-Mathieu-Hutchinson-Thornton-Teh-Doucet 2022) — define forward / reverse SDEs on manifolds (sphere, SO(3), SPD, tori). Used for molecular conformer generation, protein structure design (Watson et al RFdiffusion 2023 — diffusion on SE(3)^N), and crystal structure prediction.

8.4 Certifiable nonconvex optimization

SE-Sync (Rosen et al 2019) and Cartan-Sync (Briales-Gonzalez-Jimenez 2017) — SDP relaxations of pose-graph problems that are provably tight, hence Riemannian optimization yields globally optimal solutions in low-noise regimes. The frontier of guaranteed-optimal SLAM and rotation averaging.

8.5 Riemannian / geometric deep learning

A broader umbrella that includes Riemannian optimization plus equivariant architectures (Bronstein-Bruna-Cohen-Velickovic 2021 “Geometric Deep Learning”). Equivariant networks on SO(3), SE(3), gauge-equivariant CNNs, and physics-informed networks all consume manifold geometry; Riemannian optimization is one of the tools they bring along.

9. Worked manifold computations

9.1 Sphere S^{n-1} — gradient and exponential

For f : S^{n-1} -> R extended smoothly to R^n \ {0},

grad f(p) = ∇f(p) - <∇f(p), p> p

(the orthogonal projection of the Euclidean gradient onto the tangent plane). The exponential map is

exp_p(v) = cos(||v||) p + sin(||v||) v / ||v||,   ||v|| ∈ (0, π)

(at ||v|| = π we land at the antipodal point; the cut locus is a single point). Geodesic distance is arccos(<p, q>).

A concrete example: dominant-eigenvector iteration as Riemannian gradient ascent of f(x) = (1/2) x^T A x on S^{n-1}. Gradient: grad f(p) = A p - (p^T A p) p. Setting grad f = 0 recovers the eigenvalue equation A p = λ p with λ = p^T A p. Power iteration is the simplest Riemannian gradient method here.

9.2 Stiefel St(n, k) — tangent and projection

Stiefel: St(n, k) = { X : X^T X = I_k }. A tangent vector V at X satisfies X^T V + V^T X = 0. Write V = X Ω + X_⊥ K with Ω skew-symmetric (k x k) and K ∈ R^{(n-k) x k}. Dimension count: k(k-1)/2 + k(n-k) = nk - k(k+1)/2. The canonical metric (Edelman-Arias-Smith 1998) gives equal weight to the Ω and K parts; the embedded metric weights them unequally.

Riemannian gradient (embedded metric):

grad f(X) = ∇f(X) - X sym(X^T ∇f(X))

QR retraction:

R_X(V) = qf(X + V)   (Q-factor of QR decomposition of X + V)

9.3 SPD Sym+(n) — log map and Karcher mean

Affine-invariant metric at P: <U, V>_P = tr(P^{-1} U P^{-1} V). Exponential and log maps:

exp_P(V) = P^{1/2} expm(P^{-1/2} V P^{-1/2}) P^{1/2}
log_P(Q) = P^{1/2} logm(P^{-1/2} Q P^{-1/2}) P^{1/2}

Geodesic between P and Q:

γ(t) = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}

Karcher / Fréchet mean of P_1, ..., P_N on SPD is argmin_M sum_i d(M, P_i)^2. Unique under the affine-invariant metric (Hadamard manifold — negative curvature, geodesic convex). Iteration:

M_{k+1} = exp_{M_k}((1/N) sum_i log_{M_k}(P_i))

Used in diffusion-tensor imaging atlas construction (Fletcher-Joshi 2007) and in Riemannian classifiers for EEG BCI.

9.4 Fixed-rank — embedded vs quotient parametrization

Two equivalent ways to handle M_r:

  • EmbeddedX = U Σ V^T with U ∈ St(n, r), V ∈ St(m, r), Σ diagonal positive. Manifold dimension (n - r) r + (m - r) r + r = (n + m - r) r. Tangent vectors have the canonical “thin-SVD perturbation” form (Vandereycken 2013).
  • QuotientX = L R^T with L ∈ R^{n x r}, R ∈ R^{m x r}, modded out by GL(r) since (L Q^{-1}, R Q^T) represents the same X. The “factored” representation; tangent space is the horizontal subspace orthogonal to the orbit.

For matrix completion both are used; the quotient form is conceptually simpler but the embedded form has cleaner conditioning.

10. Comparison to Euclidean optimization

AspectEuclideanRiemannian
Iterate updatex + α dR_x(α d)
Gradient∇f(x)Π_{T_x M}(∇f(x))
Hessian∇^2 f(x)∇^2 f plus curvature correction
Conjugate-grad transporttrivialneeds vector transport
Quasi-Newton storageR^n vectorstangent vectors with transport
Cost per iterO(n) ops typicalO(n) ops + retraction (often O(nk^2) or matrix exp)
Convergence theorymatureparallels Euclidean modulo curvature

Two practical guidelines:

  1. If the manifold is low-dimensional relative to ambient (e.g. SO(3) in R^9), the savings from working intrinsically dominate.
  2. If retraction is expensive (matrix exp on 1000 x 1000 matrices), reconsider whether a projected-Euclidean method is competitive. Boumal 2023 has cost tables.

11. Connection to numerical linear algebra

Many classical NLA algorithms are Riemannian-optimization in disguise:

  • Power iteration = Riemannian gradient ascent on S^{n-1} for f(x) = x^T A x.
  • Subspace iteration = Riemannian gradient ascent on Grassmann Gr(n, k).
  • Jacobi rotations = coordinate-wise descent on SO(n).
  • Procrustes alignment = gradient zero of f(R) = ||R A - B||_F^2 on SO(n); closed form via SVD.
  • Generalized eigenvalue problems = Brockett’s gradient flow on Stiefel (Brockett 1991, “Dynamical Systems that Sort Lists, Diagonalize Matrices, and Solve Linear Programming Problems”).

See eigenvalue-problems and numerical-linear-algebra.

12. Practical recipes

12.1 Recipe: orthogonally-constrained NN layer

Use GeoTorch to parametrize a weight matrix W ∈ R^{n x n} as living on O(n). Replaces explicit orthogonality penalties and matches or beats the Cayley parametrization in expressivity. Useful for recurrent networks (orthogonal RNNs — Arjovsky-Shah-Bengio 2016) and normalizing flows (Lezcano-Casado-Martínez-Rubio 2019).

12.2 Recipe: low-rank Adam-style optimizer on M_r

For matrix-completion style problems with rank r << min(n, m):

  1. Parametrize iterate as (U, Σ, V) ∈ St(n, r) × diag_+(r) × St(m, r).
  2. Use RAdam (Bécigneul-Ganea 2018) on the product manifold.
  3. Vector-transport first / second moment estimates between iterations via projection transport.

Faster wall-clock than Euclidean SGD on the dense matrix when r < 50 and n, m > 10^4.

12.3 Recipe: covariance estimation under SPD constraint

Minimize negative log-likelihood of a Gaussian on Sym+(n). Riemannian gradient descent on Sym+(n) with affine-invariant metric converges in 10-30 iterations to machine precision; standard Euclidean Newton diverges near the cone boundary. Pymanopt one-liner.

12.4 Recipe: pose-graph SLAM warm-start

Run SE-Sync (Rosen et al 2019) on a product of SO(3)s to get a certifiably optimal rotation estimate, then solve translations as a sparse linear least-squares. The SDP relaxation is provably tight for pose-graph noise levels typical of LIDAR / RGB-D SLAM.

13. Pitfalls and gotchas

13.1 Wrong projection formula

The most common bug: using Π_M (the closest-point projection onto M) instead of Π_{T_p M} (the projection onto the tangent space at p). These coincide only at the trivial case p = closest point. Always project the gradient onto the tangent space; then retract.

13.2 Forgetting the curvature term in the Hessian

For Newton-type methods on embedded submanifolds, the Riemannian Hessian is the projected ambient Hessian plus a Weingarten-map term that accounts for how the tangent space rotates as p moves. Skipping the curvature correction yields linear, not quadratic, local convergence.

13.3 Vector-transport drift in L-BFGS

Naive transport of stored difference vectors between tangent spaces can violate the secant equation and lose positive-definiteness of the inverse-Hessian approximation. Use the Huang-Gallivan-Absil 2015 transport-preserving update or fall back to non-transport-preserving RBFGS with explicit re-initialization.

13.4 Hyperbolic numerical underflow

Working in the Poincaré ball near ||x|| ≈ 1 causes (1 - ||x||^2) to underflow. Switch to the Lorentz model for long training runs; Nickel-Kiela 2018 reports 10x more stable training compared to Poincaré at moderate dimensions.

13.5 Fixed-rank manifold is non-closed

M_r = {rank = r} is open inside {rank <= r}; iterates can drift towards rank < r matrices where M_r is not a manifold. Modern fixed-rank solvers detect rank drop and either restart with smaller rank or use a closure-aware retraction (Schneider-Uschmajew 2015).

14. History and milestones

  • 1989 — Brockett (“Dynamical systems that sort lists”) implicitly does Riemannian flow on Stiefel for eigenvalue computation.
  • 1998 — Edelman-Arias-Smith (“The Geometry of Algorithms with Orthogonality Constraints”) establishes Riemannian gradient / conjugate-gradient / Newton on Stiefel and Grassmann.
  • 2007 — Absil-Baker-Gallivan publish Riemannian trust-region (RTR).
  • 2008 — Absil-Mahony-Sepulchre textbook “Optimization Algorithms on Matrix Manifolds”.
  • 2013 — Bonnabel publishes Riemannian SGD, opening Riemannian methods to large-scale ML.
  • 2014Manopt released (Boumal-Mishra-Absil-Sepulchre, JMLR).
  • 2016Pymanopt released (Townsend-Koep-Weichwald, JMLR).
  • 2017 — Nickel-Kiela Poincaré embeddings; hyperbolic deep learning takes off.
  • 2018 — Bécigneul-Ganea generalize Adam to manifolds (RAdam).
  • 2019 — Rosen et al SE-Sync for certifiably-optimal pose-graph SLAM.
  • 2020 — Geomstats JMLR paper; info-geometry tooling reaches parity with classical Manopt.
  • 2022 — Riemannian diffusion models (De Bortoli et al).
  • 2023 — Boumal textbook “An Introduction to Optimization on Smooth Manifolds” (Cambridge); RFdiffusion uses SE(3)^N diffusion for protein design.

15. Geodesic convexity

A function f : M -> R is geodesically convex (g-convex) if for every geodesic γ : [0, 1] -> M,

f(γ(t)) <= (1 - t) f(γ(0)) + t f(γ(1))

The Karcher-mean problem on a Hadamard manifold (complete, simply connected, non-positive curvature) is g-convex; so is log-determinant on the SPD cone with the affine-invariant metric. On positively-curved manifolds (sphere, SO(n)) g-convexity is rare and is usually a local property.

Geodesic-convex programming (Sra-Hosseini 2015, SIAM J. Optimization) — a theory of duality, KKT conditions, and interior-point methods analogous to Euclidean convex optimization. Open problem: efficient barrier-based interior-point methods for g-convex programs on broad classes of manifolds.

16. Stochastic Riemannian optimization theory

For Riemannian SGD with constant step α, on an L-smooth f on a manifold with sectional curvature lower-bounded by κ, Zhang-Sra 2016 (“First-order Methods for Geodesically Convex Optimization”) prove

E[||grad f(p_k)||^2] <= O(1/(α k)) + O(α L σ^2)

where σ^2 is the gradient-noise variance. Same rate as Euclidean SGD up to curvature constants. For geodesically strongly-convex f, the rate becomes linear, again with curvature-dependent constants.

Variance reduction: R-SVRG (Kasai-Sato-Mishra 2018) achieves O(1/k) deterministic rate by periodic snapshot gradients; mirrors Euclidean SVRG with a curvature penalty.

Acceleration: Zhang-Sra 2018 conjectured Nesterov-style O(1/k^2) rates on manifolds; Alimisis-Orvieto-Bécigneul-Lucchi 2020 deliver a practical accelerator with provable acceleration in a neighborhood of the optimum, though full global acceleration on positively-curved manifolds remains open.

17. Optimization on quotient manifolds

When a problem has a symmetry — f(g · X) = f(X) for g in some group G — working on the quotient manifold M / G removes the redundancy. Two canonical examples:

  • Grassmann = Stiefel / O(k) — orthogonal-basis change in PCA is a O(k) symmetry; quotient out.
  • Fixed-rank quotientX = L R^T has GL(r) symmetry; quotient out.

Mishra-Meyer-Bach-Sepulchre 2014 (“Fixed-rank matrix factorizations and Riemannian low-rank optimization”, Computational Statistics) is the canonical treatment. Working on the quotient yields strictly lower-dimensional iterates with no parametric redundancy; the price is that tangent vectors must be lifted to the horizontal subspace of the total space.

18. Product manifolds

Many real problems live on a product M_1 × M_2 × ... × M_d. Example: deep tensor factorization with each factor on a different Stiefel. The tangent space is the product, retraction is per-factor, and the Riemannian gradient is the concatenation of per-factor projections. Block-coordinate descent on product manifolds — alternating among factors — is widely used and often easier to tune than joint updates. Pymanopt and Manopt both expose ProductManifold containers.

19. Manifold-aware regularization

Beyond hard manifold constraints, “soft” regularizers encourage iterates near a manifold:

  • Orthogonal regularizationλ ||W^T W - I||_F^2 pushes weight matrices toward St(n, n). Used in StyleGAN2 (Karras et al 2020) and BigGAN (Brock-Donahue-Simonyan 2019).
  • Spectral normalization (Miyato-Kataoka-Koyama-Yoshida 2018) — divide weight matrix by its top singular value; effectively projects to the operator-norm-1 ball. Used in GANs and robust classifiers.
  • Cayley transform parametrization — express weight matrix as (I - A)(I + A)^{-1} with A skew-symmetric; iterates live in O(n) by construction. Lezcano-Casado 2019 generalizes to other Lie groups via the matrix exponential parametrization, which underlies GeoTorch.

These are not Riemannian optimization per se but borrow the same geometric intuition.

20. Common cost-model comparison

For matrix completion on a n x m = 10^4 x 10^4 problem with r = 50, observed entries |Ω| = 10^7:

MethodPer-iter costTypical itersWall-clock (CPU)
Alternating least squaresO(m_obs r^2 + (n+m) r^3)10-50minutes
Soft-Impute (nuclear-norm SVT)O(n m r) per SVD50-200hours
Riemannian trust-region (RTRMC)O(m_obs r + (n+m) r^2)20-100minutes
Riemannian L-BFGSsimilar to RTRMC50-200minutes
SGD on factored formO(r) per entrymany epochsdepends on noise

Boumal-Absil 2011 benchmarks: RTRMC matches ALS on small problems and dominates by 5-10x on hard, near-rank-degenerate problems thanks to the trust-region globalization.

Adjacent

  • convex-optimization — Euclidean baseline; geodesic convexity is the manifold extension.
  • gradient-descent-variants — Euclidean SGD / Adam / momentum; Riemannian analogs port directly.
  • lie-groups-so3-se3 — concrete realization of Riemannian optimization on the Lie groups that drive robotics.
  • eigenvalue-problems — PCA / generalized eigenproblems are the gateway use case.
  • svd-pca-spectral — low-rank machinery that underlies fixed-rank manifold optimization.
  • tensor-calculus — index-calculus prerequisite for manifold derivatives and curvature.
  • slam — pose-graph SLAM via SE-Sync is the canonical robotics example.
  • inference-optimization — natural-gradient / K-FAC sit at the intersection of information geometry and large-scale model training.