Foreword to lecture notes by Irène Waldspurger from her Cours Peccot

Tags: nonconvex, rank

Author: Nicolas Boumal

Published: May 24, 2024

Abstract
Irène Waldspurger is a laureate of the Cours Peccot, and as such gave a series of lectures at the Collège de France in 2020. She wrote matching lecture notes for the Cours Spécialisés de la Société Mathématique de France. At their request, this is a foreword I wrote for these notes.

Link to the lecture notes for which this foreword is written.

Link to the Cours Peccot.

Irène’s lectures at the Collège de France (2 hours each): lecture 1, lecture 2, lecture 3, lecture 4.

As Irène Waldspurger explains through examples in the introduction, there are at least two broad reasons why optimization problems with rank constraints tend to come up in applied mathematics.

In the first broad category, say data (that is, numbers) are arranged into a matrix. Some of the entries may be corrupted by noise or even missing: we would like to compensate for that. The complete and clean matrix (our target) may have a small rank for structural reasons (one example in photometry is detailed by Wu et al. (2011)). Or it may have fast-decaying singular values and hence be well approximated by a low-rank matrix (a common occurrence, as argued by Udell and Townsend (2019)).

In that context, it is wise to restrict our search to low-rank matrices. Indeed, since the set of low-rank matrices has low dimension, doing so is advantageous both for computation and for accuracy (because enforcing structure also acts as a regularizer). That set is non-convex, though: this calls for special algorithms and dedicated theory.

One approach could be to sidestep the non-convexity entirely and relax the problem to a convex one: see Chapter 2. This allows for striking mathematical guarantees, some of which are reviewed in these notes for two classical tasks: matrix completion and phase retrieval. However, this (seemingly inevitably) involves recasting the problem in a higher-dimensional space. For example, whereas the set of \(m \times n\) matrices of rank \(r\) has dimension \((m+n-r)r\), convex formulations may require us to optimize over general matrices, that is, in a space of dimension \(mn\). (See the discussion at the end of that chapter for algorithmic ways to handle the dimensional increase in some cases.)
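For a sense of scale, here is a small back-of-the-envelope computation (the values of \(m\), \(n\) and \(r\) below are chosen purely for illustration):

```python
# Compare the dimension of the set of m-by-n matrices of rank r, namely (m + n - r) * r,
# with the dimension m * n of the space of all m-by-n matrices.
m, n, r = 1000, 1000, 10
dim_low_rank = (m + n - r) * r   # 19_900
dim_full = m * n                 # 1_000_000
print(dim_low_rank, dim_full)    # the low-rank set is roughly 50 times smaller here
```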

Alternatively, we may embrace the non-convexity. Indeed, part of the appeal of convexity is its rich collection of algorithms that come with off-the-shelf guarantees stating that they can find global optima. If we go down a different path, we may very well still find global optima—only, we have to work harder to prove that we do so reliably. Chapters 3, 4 and 5 describe some such efforts. Interestingly, the algorithms themselves are simple: it is the theory required to explain why they perform so well that is intricate.

In the second broad category, say we are searching for a vector \(x\) whose entries \(x_k\) satisfy certain quadratic constraints. For example, with \(x\) real, forcing \(x_k^2 = 1\) encodes the binary requirement \(x_k = \pm 1\), whereas with \(x\) complex the constraint \(|x_k|^2 = 1\) requires \(x_k\) to be a phase: \(x_k = e^{i\theta_k}\). Now say the optimal \(x\) must minimize a function which is itself quadratic in \(x\). Such problems can be challenging: the combinatorial case is rich enough to encode all sorts of NP-hard problems. And likewise for the complex case: it is continuous but hard all the same, because some vectors \(x\) may locally look optimal yet be far from it.

Continuing with the latter, a classical technique (often attributed to N.Z. Shor in the late 1980s) consists in introducing the matrix \(X = xx^*\) and rewriting the optimization problem (cost function and constraints) in terms of \(X\). Quadratic terms become linear, which should make the problem easier. But there is no free lunch: to make sure \(X\) can be decomposed into a product \(xx^*\), we must require \(X\) to be positive semidefinite and (crucially) to have rank one.
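In symbols (for the real case with constraints \(x_k^2 = 1\); this is standard notation, not taken verbatim from the notes), the lift reads:

\[
x^* C x = \operatorname{trace}(C\,xx^*) = \operatorname{trace}(CX), \qquad x_k^2 = 1 \;\Longleftrightarrow\; X_{kk} = 1,
\]
\[
X = xx^* \text{ for some } x \;\Longleftrightarrow\; X \succeq 0 \ \text{and}\ \operatorname{rank}(X) \leq 1.
\]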

If we simply ignore the rank condition, then the problem may become convex (akin to the first approach above). Solving it may yield a solution of rank one anyway (which feels magical), or it may not. Even if it does not, the solution thus obtained may still be useful: a famous example of that is the analysis by Goemans and Williamson (1995) for the Max-Cut problem. But relaxing the rank constraint effectively means that we go from optimizing in \(n\) variables to roughly \(n^2\). Just as in the first broad category, this creates computational bottlenecks.
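As a concrete illustration of this relaxation and of the Goemans–Williamson random-hyperplane rounding, here is a minimal sketch; the function and parameter names are mine, the numpy and cvxpy packages are assumed to be available, and it is meant to convey the idea rather than to be an efficient implementation.

```python
import numpy as np
import cvxpy as cp

def maxcut_sdp_gw(W, n_rounds=100, seed=0):
    """Solve the SDP relaxation of Max-Cut for a symmetric, nonnegative weight
    matrix W, then round with random hyperplanes (Goemans-Williamson)."""
    n = W.shape[0]
    X = cp.Variable((n, n), PSD=True)                  # relaxes X = x x^T, rank constraint dropped
    objective = cp.Maximize(0.25 * cp.sum(cp.multiply(W, 1 - X)))
    constraints = [cp.diag(X) == 1]                    # encodes x_k^2 = 1
    cp.Problem(objective, constraints).solve()

    # Factor X = V V^T, then round: the sign pattern of a random projection gives a cut.
    w, U = np.linalg.eigh(X.value)
    V = U * np.sqrt(np.clip(w, 0, None))
    rng = np.random.default_rng(seed)
    best_cut, best_x = -np.inf, None
    for _ in range(n_rounds):
        x = np.sign(V @ rng.standard_normal(n))
        x[x == 0] = 1
        cut = 0.25 * np.sum(W * (1 - np.outer(x, x)))  # weight of the cut defined by x
        if cut > best_cut:
            best_cut, best_x = cut, x
    return best_x, best_cut
```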

Burer and Monteiro (2003) offered an interesting and algorithmically actionable take on this rank-relaxation step (somewhere in between the two approaches above). The starting point for Chapter 6 goes as follows:

  1. The original problem, once reformulated in terms of \(X\), includes a constraint of the form \(\mathrm{rank}(X) \leq 1\).
  2. The full relaxation removes that constraint, which amounts to replacing it with \(\mathrm{rank}(X) \leq n\) where \(n\) is the size of \(X\).
  3. Instead, why not relax progressively and consider intermediate ranks, say, \(\mathrm{rank}(X) \leq p\) with \(p \leq n\) a parameter of our choosing?

Now, we get a new problem for each value of \(p\). Each of these is non-convex, and as such still calls for special algorithms and theory. Nevertheless, as experience shows, there is much to gain computationally from this approach.
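To make the idea concrete, here is a minimal sketch of the Burer-Monteiro approach for the diagonal-constrained case \(X_{kk} = 1\): factor \(X = YY^*\) with \(Y\) of size \(n \times p\), so that the constraint becomes that each row of \(Y\) has unit norm, and run a simple projected gradient heuristic. The function name, step size and iteration count below are my own choices; the notes study more careful algorithms and their guarantees.

```python
import numpy as np

def burer_monteiro(C, p, n_iters=500, step=None, seed=0):
    """Minimize trace(C Y Y^T) over Y in R^{n x p} with unit-norm rows, i.e. the
    rank-p Burer-Monteiro factorization of the SDP with diag(X) = 1.
    Assumes C is symmetric. Simple projected gradient: step, then renormalize rows."""
    n = C.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, p))
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)      # rows on the unit sphere
    if step is None:
        step = 1.0 / (2 * np.linalg.norm(C, 2) + 1e-12)   # crude step from a Lipschitz bound
    for _ in range(n_iters):
        G = 2 * C @ Y                                     # gradient of Y -> trace(C Y Y^T), C symmetric
        Y = Y - step * G
        norms = np.linalg.norm(Y, axis=1, keepdims=True)
        Y = Y / np.maximum(norms, 1e-12)                  # re-project each row onto the sphere
    return Y, np.trace(C @ Y @ Y.T)
```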

Theory is catching up, revealing subtleties and serendipities due to geometric structures of various kinds (convex geometry and differential geometry in particular).

For brevity, let us say that a non-convex problem is benign if its second-order critical points are globally optimal (that is, there are no bad local traps).
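For a smooth unconstrained problem \(\min_x f(x)\), for instance, this reads (with the appropriate Riemannian generalizations when there are smooth constraints):

\[
\nabla f(x) = 0 \ \text{ and } \ \nabla^2 f(x) \succeq 0 \quad \Longrightarrow \quad x \text{ is a global minimizer.}
\]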

One task is to argue precisely that we can indeed compute (approximate) global minimizers of benign problems: that is an algorithmic consideration.

Another task is to argue that problems of interest, non-convex as they may be, are benign. From high to low values of the rank relaxation \(p\), here is what is known for the particular case described above where \(x\) is real and we aim to minimize a quadratic function \(f(x) = x^* C x\) under the constraints \(x_k^2 = 1\) for all \(k\) via the Burer-Monteiro approach:

Together, these results paint a precise picture of what can happen for both general and generic cost matrices \(C\). Since \(C\) contains the data of the problem instance (e.g., for Max-Cut, \(C\) would encode the adjacency matrix of the graph one wants to cut), this offers good understanding of general and generic problem instances.

Interestingly, if we wish to take \(p\) much smaller than \(\sqrt{2n}\), then we must change gears:

And indeed, continuing with the latter, in some applications involving variations of the problem above, it was observed numerically (and sometimes partially confirmed theoretically) that it is sufficient to set \(p\) to a constant only mildly larger than the target rank. Rosen et al. (2021) discuss this in the context of robotics. Given the potential numerical gains, those observations call for more research.

These lecture notes and accompanying video lectures are an excellent entry point for anyone keen on pursuing these gains. And indeed, in the years since their writing, additional mathematical structure has come to play a role: let us discuss one such aspect in closing.

Rank constraints break convexity, and as such allow bad local minimizers to appear. This is already challenging enough (and is the main object of these notes), but that is not all. Indeed, rank constraints are also nonsmooth. Specifically, the set of matrices of size \(m \times n\) with rank at most \(r\) is not a smooth set: it is an algebraic variety with singularities. These singularities can create difficulties for optimization algorithms, causing them to converge to non-stationary points; see Levin, Kileel, and Boumal (2023).

In recent work that came after this Cours Peccot, Pauwels (2024) on the one hand and Olikier and Waldspurger (2024) on the other showed that the classical projected gradient descent algorithm is capable of navigating the intricacies of nonsmooth sets such as the bounded-rank variety in order to find truly stationary points of optimization problems, all under mild conditions on the cost function. This is not a fast algorithm, but it has the merit of being conceptually simple, and we now know that it achieves a goal that had been elusive. This, too, is surely the starting point for more discoveries.
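To fix ideas, here is a minimal sketch of projected gradient descent on the bounded-rank variety; the cost (a masked least-squares reminiscent of matrix completion) and all names below are my own choices for illustration, and the metric projection onto the set of matrices of rank at most \(r\) is a truncated SVD. The cited guarantees involve care with step sizes and stationarity notions that this sketch does not attempt to reproduce.

```python
import numpy as np

def project_rank(X, r):
    """Metric projection onto {rank <= r}: keep the r largest singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def pgd_bounded_rank(A, Omega, r, step=0.5, n_iters=200):
    """Projected gradient descent for min_X 0.5 * || Omega * (X - A) ||_F^2
    subject to rank(X) <= r, where Omega is a 0/1 (or boolean) mask of observed
    entries of A. The cost is chosen here only as an example."""
    X = np.zeros(A.shape)
    for _ in range(n_iters):
        grad = Omega * (X - A)                 # gradient of the masked least-squares cost
        X = project_rank(X - step * grad, r)   # gradient step, then project back onto the variety
    return X
```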

References

Boumal, N., V. Voroninski, and A. S. Bandeira. 2020. “Deterministic Guarantees for Burer-Monteiro Factorizations of Smooth Semidefinite Programs.” Communications on Pure and Applied Mathematics 73 (3): 581–608. https://doi.org/10.1002/cpa.21830.
Burer, S., and R. D. C. Monteiro. 2003. “A Nonlinear Programming Algorithm for Solving Semidefinite Programs via Low-Rank Factorization.” Mathematical Programming 95 (2): 329–57. https://doi.org/10.1007/s10107-002-0352-8.
Goemans, M. X., and D. P. Williamson. 1995. “Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming.” Journal of the ACM (JACM) 42 (6): 1115–45. https://doi.org/10.1145/227683.227684.
Levin, E., J. Kileel, and N. Boumal. 2023. “Finding Stationary Points on Bounded-Rank Matrices: A Geometric Hurdle and a Smooth Remedy.” Mathematical Programming 199 (1–2): 831–64. https://doi.org/10.1007/s10107-022-01851-2.
Mei, S., T. Misiakiewicz, A. Montanari, and R. I. Oliveira. 2017. “Solving SDPs for Synchronization and MaxCut Problems via the Grothendieck Inequality.” In Proceedings of the 2017 Conference on Learning Theory, edited by S. Kale and O. Shamir, 65:1476–1515. Proceedings of Machine Learning Research. Amsterdam, Netherlands: PMLR. http://proceedings.mlr.press/v65/mei17a.html.
O’Carroll, L., V. Srinivas, and A. Vijayaraghavan. 2022. “The Burer–Monteiro SDP Method Can Fail Even Above the Barvinok–Pataki Bound.” In Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Advances in Neural Information Processing Systems. Neural information processing systems foundation.
Olikier, G., and I. Waldspurger. 2024. “Projected Gradient Descent Accumulates at Bouligand Stationary Points.” arXiv Preprint arXiv:2403.02530.
Pauwels, E. 2024. “Generic Fréchet Stationarity in Constrained Optimization.” arXiv Preprint arXiv:2402.09831.
Rosen, D. M., K. J. Doherty, A. Terán Espinoza, and J. J. Leonard. 2021. “Advances in Inference and Representation for Simultaneous Localization and Mapping.” Annual Review of Control, Robotics, and Autonomous Systems 4: 215–42.
Udell, M., and A. Townsend. 2019. “Why Are Big Data Matrices Approximately Low Rank?” SIAM Journal on Mathematics of Data Science 1 (1): 144–60. https://doi.org/10.1137/18m1183480.
Waldspurger, I., and A. Waters. 2020. “Rank Optimality for the Burer–Monteiro Factorization.” SIAM Journal on Optimization 30 (3): 2577–2602. https://doi.org/10.1137/19m1255318.
Wu, L., A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, and Y. Ma. 2011. “Robust Photometric Stereo via Low-Rank Matrix Completion and Recovery.” In Lecture Notes in Computer Science, 703–17. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-19318-7_55.