Benign landscape of low-rank approximation: Part I

Categories: nonconvex, rank
Authors

Nicolas Boumal

Christopher Criscitiello

Published

December 11, 2023

Abstract
The minimizers of \(\sqfrobnorm{X-M}\) subject to \(\rank(X) \leq r\) are given by the SVD of \(M\) truncated at rank \(r\). The optimization problem is nonconvex but it has a benign landscape. That’s folklore, here with a proof.

The fact

The Eckart–Young–Mirsky theorem states that the best approximation of a matrix \(M\) by a matrix of rank at most \(r\) can be computed by truncating an SVD of \(M\) to \(r\) dominant components. This is widely known. It is also well known (but rarely proved) that the corresponding optimization problem has a nonconvex yet benign landscape. This post spells out a proof of that fact:

Theorem 1 (Benign landscape) Let \(M \in \Rmn\) be arbitrary. Consider \[ \min_{X \in \Rmn} \frac{1}{2}\sqfrobnorm{X - M} \qquad \textrm{ subject to } \qquad \rank(X) \leq r. \] Second-order necessary optimality conditions are sufficient for global optimality, in the following sense:

  1. If \(\rank(X) < r\) and \(X\) is first-order critical, then \(X\) is optimal.
  2. If \(\rank(X) = r\) and \(X\) is second-order critical, then \(X\) is optimal.

In particular, local minima are global minima and saddle points are strict.
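Before the proof, here is a quick numerical sanity check. It is not part of the argument; it is a small NumPy sketch (the sizes, seed and sample count are arbitrary choices) in which the rank-\(r\) truncation of an SVD of a random \(M\) beats every random feasible point it is compared against:

import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
M = rng.standard_normal((m, n))

def f(X):
    return 0.5 * np.linalg.norm(X - M, "fro") ** 2

# Candidate minimizer: truncate an SVD of M to its r dominant components.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
X_star = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# It beats every random feasible point (rank at most r) we throw at it.
for _ in range(10_000):
    L = rng.standard_normal((m, r))
    R = rng.standard_normal((n, r))
    assert f(X_star) <= f(L @ R.T) + 1e-12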

The oldest proof we could locate is in the PhD thesis of Ngoc-Diep Ho (Ho 2008, Thm. 1.14), where he argues succinctly that local minima are global. A paper by Helmke and Shayman (1995) comes close but doesn’t prove the statement here. Please e-mail us if you know an older reference: this result should be ancient. (There are many sources for the fact that the optimal solution is the SVD of \(M\) truncated at rank \(r\). The point here is to have a proof that the optimization landscape is benign.)

We follow up with neat corollaries of that theorem in Part II.

Proof structure

One particularity of this optimization problem is that its search space \[ \Rmnlr = \{ X \in \Rmn : \rank(X) \leq r \} \] is not smooth: it is an algebraic variety. Fortunately, it splits in two parts:

  • The matrices of rank strictly less than \(r\), and
  • The set \(\Rmnr\) of matrices of rank exactly \(r\), which is a smooth manifold embedded in \(\Rmn\).

Accordingly, the proof of Theorem 1 has two parts. First:

  1. Show that if \(X\) has rank strictly less than \(r\) and it is stationary then it is optimal.

Intuitively, this is because if the rank constraint is not active then it is “as if” there was no constraint. And since the cost function is convex, stationary implies optimal.

Second, for the cost function \[ f(X) = \frac{1}{2}\sqfrobnorm{X - M} \] specifically, do the following:

  2. Write down the necessary optimality conditions of order two.
  3. Deduce that if \(X\) has rank \(r\) and it is second-order critical then \(f(X)\) is a certain value, independent of \(X\).
  4. Conclude that all such \(X\) are optimal, because one of them must be.

Item 2 takes a bit of know-how but it’s straightforward since \(\Rmnr\) is smooth. Item 3 requires a little insight into the problem. Item 4 is direct.

Notice how items 3 and 4 together allow us to conclude without knowing that the optima are truncated SVDs of \(M\). Actually, we recover that fact as a by-product.

Matrices of lesser rank

This part works for \(\min_{\rank(X) \leq r} f(X)\) with any continuously differentiable \(f \colon \Rmn \to \reals\).

Let \(\nabla f\) denote the Euclidean gradient of \(f\) with respect to the usual inner product \(\inner{A}{B} = \trace(A\transpose B)\). First-order criticality at a matrix \(X\) of rank strictly less than \(r\) is equivalent to the condition \(\nabla f(X) = 0\). This can be formalized in terms of tangent cones and their polars (Schneider and Uschmajew 2015). Here is a short proof:

Lemma 1 If \(\rank(X) < r\), the first-order necessary optimality conditions at \(X\) are: \(\nabla f(X) = 0\).

Proof. First-order conditions require in particular that \((f \circ c)'(0) \geq 0\) for all curves \(c \colon [0, 1] \to \Rmnlr\) which start at \(c(0) = X\) and which are differentiable at that point. (That is, the value of \(f\) cannot decrease at first order if we move away from \(X\) along the curve.) For all \(x \in \Rm, y \in \Rn\), the smooth curve \(c(t) = X + txy\transpose\) indeed satisfies \(\rank(c(t)) \leq \rank(X) + 1 \leq r\), and so first-order criticality requires \[ 0 \leq (f \circ c)'(0) = \inner{\nabla f(c(0))}{c'(0)} = \inner{\nabla f(X)}{xy\transpose} = x\transpose \nabla f(X) y. \] This holds for all \(x, y\) (in particular, for both \(x\) and \(-x\)), so \(\nabla f(X) = 0\). Conversely, if \(\nabla f(X) = 0\), then the full first-order conditions (not spelled out here) hold.
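To make the lemma concrete, here is a small NumPy illustration (not part of the proof; the sizes, seed and step size are arbitrary choices): at a feasible point of rank \(r - 1\) with nonzero gradient, moving along a rank-one curve as in the proof keeps the rank at most \(r\) and decreases \(f\).

import numpy as np

rng = np.random.default_rng(1)
m, n, r = 6, 5, 3
M = rng.standard_normal((m, n))

def f(X):
    return 0.5 * np.linalg.norm(X - M, "fro") ** 2

# A feasible point of rank r - 1 < r: the rank constraint is not active there.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
X = U[:, :r - 1] @ np.diag(s[:r - 1]) @ Vt[:r - 1, :]

G = X - M                                 # Euclidean gradient of f at X (nonzero)
u, sig, vt = np.linalg.svd(G)
x, y = u[:, 0], vt[0, :]                  # top singular vectors: x^T G y = sigma_max(G) > 0

# The curve c(t) = X - t x y^T stays feasible and decreases f for small t > 0.
t = 1e-3
c_t = X - t * np.outer(x, y)
print(np.linalg.matrix_rank(c_t) <= r)    # True
print(f(c_t) < f(X))                      # True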

For \(X \mapsto \frac{1}{2} \sqfrobnorm{X - M}\) which is convex, a zero gradient implies optimality: this takes care of the first part of Theorem 1. The general statement is:

Corollary 1 Assume \(f\) is convex. If \(\rank(X) < r\) and \(X\) is first-order critical, then \(X\) is optimal.

Matrices of maximal rank

When \(X\) has rank \(r\), first-order necessary optimality conditions typically are not sufficient for optimality. Writing down second-order conditions for minimization of a function \(f\) on an arbitrary set is a tad uncomfortable. Luckily, the set \(\Rmnr\) of matrices of rank exactly \(r\) is an embedded submanifold of \(\Rmn\). As a result, necessary optimality conditions are well understood; they even take a reasonably nice form. We spell them out below: see for example (Boumal 2023, sec. 7.5) for details.

Second-order conditions

Let \(X = U\Sigma V\transpose\) be a thin SVD of \(X\), so that \(U, V\) each have \(r\) orthonormal columns and \(\Sigma\) is diagonal of size \(r \times r\) with diagonal entries \(\sigma_1 \geq \cdots \geq \sigma_r > 0\). Select \(U_\perp, V_\perp\) to complete the orthonormal bases, that is, \([U, U_\perp] \in \Om\) and \([V, V_\perp] \in \On\) are orthogonal. The tangent space to \(\Rmnr\) at \(X\) is the subspace \(\T_X\Rmnr\) consisting of all \(\dot X \in \Rmn\) of the form \[ \dot X = \begin{bmatrix} U & U_\perp \end{bmatrix} \begin{bmatrix} A & B \\ C & 0 \end{bmatrix} \begin{bmatrix} V & V_\perp \end{bmatrix}\transpose, \tag{1}\] where \(A \in \reals^{r \times r}, B \in \reals^{r \times (n - r)}, C \in \reals^{(m - r) \times r}\) are arbitrary.
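As a quick numerical check of Eq. 1 (an illustration with arbitrary sizes and seed, not needed for the proof), the velocity of a smooth curve of matrices of rank at most \(r\) through \(X\) indeed has a vanishing bottom-right block when expressed in the bases \([U, U_\perp]\) and \([V, V_\perp]\):

import numpy as np

rng = np.random.default_rng(2)
m, n, r = 6, 5, 2

# A smooth curve c(t) = L(t) R(t)^T of matrices of rank at most r, through X = c(0).
L0, R0 = rng.standard_normal((m, r)), rng.standard_normal((n, r))
L1, R1 = rng.standard_normal((m, r)), rng.standard_normal((n, r))

def c(t):
    return (L0 + t * L1) @ (R0 + t * R1).T

X = c(0.0)
Uf, _, Vtf = np.linalg.svd(X, full_matrices=True)
U, U_perp = Uf[:, :r], Uf[:, r:]
V, V_perp = Vtf.T[:, :r], Vtf.T[:, r:]

# Finite-difference velocity of the curve at t = 0.
h = 1e-6
Xdot = (c(h) - c(-h)) / (2 * h)

# Its bottom-right block in the bases [U, U_perp] and [V, V_perp] vanishes, as in Eq. 1.
print(np.linalg.norm(U_perp.T @ Xdot @ V_perp))    # ~1e-10 or smaller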

The first-order necessary optimality conditions at \(X\) require that \((f \circ c)'(0) = 0\) for all smooth curves on \(\Rmnr\) passing through \(c(0) = X\). This holds exactly if \(\nabla f(X)\) is orthogonal to the tangent space, that is, there exists \(W \in \reals^{(m-r) \times (n-r)}\) such that \[ \nabla f(X) = \begin{bmatrix} U & U_\perp \end{bmatrix} \begin{bmatrix} 0 & 0 \\ 0 & -W \end{bmatrix} \begin{bmatrix} V & V_\perp \end{bmatrix}\transpose. \] (We could also express this as: \(\nabla f(X) X\transpose = 0\) and \(X\transpose \nabla f(X) = 0\).)
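For instance (a NumPy check with arbitrary sizes and seed), the rank-\(r\) truncation of an SVD of \(M\) satisfies these first-order conditions, with \(W\) collecting the discarded singular values:

import numpy as np

rng = np.random.default_rng(3)
m, n, r = 6, 5, 2
M = rng.standard_normal((m, n))

# Take X to be the rank-r truncation of an SVD of M.
Uf, s, Vtf = np.linalg.svd(M, full_matrices=True)
U, U_perp = Uf[:, :r], Uf[:, r:]
V, V_perp = Vtf.T[:, :r], Vtf.T[:, r:]
X = U @ np.diag(s[:r]) @ V.T

G = X - M                                    # Euclidean gradient of f at X
print(np.linalg.norm(G @ X.T), np.linalg.norm(X.T @ G))    # both ~0

# The gradient sits entirely in the U_perp ( . ) V_perp^T block:
W = -U_perp.T @ G @ V_perp
print(np.linalg.norm(G + U_perp @ W @ V_perp.T))           # ~0
print(np.linalg.svd(W, compute_uv=False), s[r:])           # same singular values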

If \(X\) is first-order critical for \(f\), then it is also second-order critical if and only if \((f \circ c)''(0) \geq 0\) for all smooth curves on \(\Rmnr\) passing through \(c(0) = X\). Particularizing (Boumal 2023, Ex. 7.6) to \(f\), we see that this holds exactly if \[ 0 \leq \innerbig{\dot X}{\dot X + (X-M)(X^\dagger \dot X)\transpose + (\dot X X^\dagger)\transpose (X-M)} \] for all \(\dot X \in \T_X\Rmnr\) (a dagger denotes pseudo-inverse). This uses that the Euclidean gradient and Hessian of \(f\) are \(\nabla f(X) = X - M\) and \(\nabla^2 f(X)[\dot X] = \dot X\). If we plug in an arbitrary tangent vector of the form Eq. 1 and work through the expression, it follows that \[ 0 \leq \sqfrobnorm{A} + \sqfrobnorm{B} + \sqfrobnorm{C} - 2\inner{C \Sigma^{-1} B}{W} \tag{2}\] for all \(A, B, C\).
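Working through that expression is the only tedious step, so here is a small NumPy check of the outcome (arbitrary sizes, seed and \(\Sigma\); not part of the proof): at a first-order critical \(X\) of rank \(r\), the curvature expression above agrees with the right-hand side of Eq. 2 for random \(A, B, C\).

import numpy as np

rng = np.random.default_rng(4)
m, n, r = 6, 5, 2

# Orthonormal bases and a rank-r first-order critical point X of f.
Q_m, _ = np.linalg.qr(rng.standard_normal((m, m)))
Q_n, _ = np.linalg.qr(rng.standard_normal((n, n)))
U, U_perp = Q_m[:, :r], Q_m[:, r:]
V, V_perp = Q_n[:, :r], Q_n[:, r:]
Sigma = np.diag([3.0, 2.0])                  # sigma_1 >= sigma_2 > 0 (here r = 2)
X = U @ Sigma @ V.T
W = rng.standard_normal((m - r, n - r))
M = X + U_perp @ W @ V_perp.T                # so that grad f(X) = -U_perp W V_perp^T

# A random tangent vector, parametrized as in Eq. 1.
A = rng.standard_normal((r, r))
B = rng.standard_normal((r, n - r))
C = rng.standard_normal((m - r, r))
Xdot = U @ A @ V.T + U @ B @ V_perp.T + U_perp @ C @ V.T

# Left-hand side: the curvature expression with the pseudo-inverse.
pinv = np.linalg.pinv(X)
H = Xdot + (X - M) @ (pinv @ Xdot).T + (Xdot @ pinv).T @ (X - M)
lhs = np.sum(Xdot * H)

# Right-hand side: the simplified form appearing in Eq. 2.
rhs = (np.sum(A**2) + np.sum(B**2) + np.sum(C**2)
       - 2 * np.sum((C @ np.linalg.inv(Sigma) @ B) * W))
print(lhs, rhs)                              # equal up to round-off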

Second-order points have the same cost value

It only remains to choose \(A, B, C\) in Eq. 2 to reveal interesting conclusions about \(X\). Clearly, we should set \(A = 0\) (that only makes the inequality stronger). Also let \(B = e_r y\transpose\) and \(C = xe_r\transpose\), where \(e_r\) is the \(r\)th column of the identity matrix of size \(r\), and \(x \in \reals^{m-r}, y \in \reals^{n-r}\) are unit-norm vectors. Then, since \(C \Sigma^{-1} B = \sigma_r^{-1} xy\transpose\), the inequality becomes: \[ 0 \leq 2 - 2 (x\transpose W y) / \sigma_r. \tag{3}\] Choose \(x,y\) to be singular vectors of \(W\) associated with its largest singular value to conclude that \[ \sigmamax(\nabla f(X)) = \sigmamax(W) \leq \sigma_r(X). \] In words: the \(r\) positive singular values of \(X\) are all at least as large as the singular values of \(\nabla f(X)\).
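This inequality is exactly what rules out spurious critical points. As an illustration (a NumPy sketch with arbitrary sizes and seed, not part of the proof), a rank-\(r\) point built from non-dominant singular triplets of \(M\) is first-order critical, yet \(\sigmamax(W) > \sigma_r(X)\) there, so the choice of \(A, B, C\) above makes the right-hand side of Eq. 3 negative: such a point is a strict saddle.

import numpy as np

rng = np.random.default_rng(5)
m, n, r = 6, 5, 2
M = rng.standard_normal((m, n))
Uf, s, Vtf = np.linalg.svd(M, full_matrices=True)

# Build X from the 2nd and 3rd singular triplets of M (skipping the dominant one).
keep, skip = [1, 2], [0, 3, 4]
U, V = Uf[:, keep], Vtf.T[:, keep]
U_perp = Uf[:, skip + [5]]            # [5] completes the basis, since m = 6
V_perp = Vtf.T[:, skip]
X = U @ np.diag(s[keep]) @ V.T

G = X - M                             # Euclidean gradient of f at X
print(np.linalg.norm(G @ X.T), np.linalg.norm(X.T @ G))   # both ~0: first-order critical

# But sigma_max(W) = sigma_1(M) exceeds sigma_r(X) = sigma_3(M) ...
W = -U_perp.T @ G @ V_perp
sigma_max_W = np.linalg.svd(W, compute_uv=False)[0]
print(sigma_max_W, s[keep].min())                         # first exceeds second

# ... so the right-hand side of Eq. 3 is negative: a direction of negative curvature.
print(2 - 2 * sigma_max_W / s[keep].min())                # < 0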

After the fact, this choice of \(A, B, C\) can also be turned into an explicit curve \(c\) for which the inequality \((f \circ c)''(0) \geq 0\) yields Eq. 3 directly. With such a curve in hand we do not need much in the way of Riemannian optimization, but that route obscures how one could discover the proof.

On the other hand, we also know that \(\nabla f(X) = X - M\), so that \[ M = X - \nabla f(X) = \begin{bmatrix} U & U_\perp \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & W \end{bmatrix} \begin{bmatrix} V & V_\perp \end{bmatrix}\transpose. \] Thus, the singular values of \(M\) are exactly those of \(X\) together with those of \(\nabla f(X)\). Moreover, owing to the inequality \(\sigmamax(W) \leq \sigma_r\), the matrix \(X\) holds the \(r\) largest singular values of \(M\) while \(\nabla f(X)\) holds the rest. It follows that \[ f(X) = \frac{1}{2} \sqfrobnorm{X - M} = \frac{1}{2} \sqfrobnorm{\nabla f(X)} = \frac{1}{2} \sum_{i = r+1}^{\min(m, n)} \sigma_i(M)^2. \tag{4}\] In particular, all second-order critical points have the same cost function value.
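As a quick numerical confirmation of Eq. 4 (NumPy, arbitrary sizes and seed): at the rank-\(r\) truncation of an SVD of \(M\), which is one such second-order critical point, the singular values of \(M\) indeed split between \(X\) and \(\nabla f(X)\), and the cost matches \(\frac{1}{2}\sum_{i > r} \sigma_i(M)^2\).

import numpy as np

rng = np.random.default_rng(6)
m, n, r = 6, 5, 2
M = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
X = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]    # one second-order critical point

# The singular values of M split between X and grad f(X) = X - M ...
print(np.linalg.svd(X - M, compute_uv=False)[:len(s) - r], s[r:])           # match

# ... and the cost at X is the value predicted by Eq. 4.
print(0.5 * np.linalg.norm(X - M, "fro") ** 2, 0.5 * np.sum(s[r:] ** 2))    # equal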

Hence second-order points are optimal

We know \(f\) has a minimizer on \(\Rmnlr\) because the latter is closed and the former is strongly convex, hence has compact sublevel sets; let’s call it \(X^\star\). There are two scenarios to entertain:

  • Either \(\rank(M) \leq r\), in which case the function value in Eq. 4 is zero. This confirms that second-order critical points are optimal. (Actually, in this case, first-order criticality is enough.)
  • Or \(\rank(M) > r\), in which case \(\rank(X^\star) = r\). (Indeed, if \(X^\star\) had lesser rank, we would have \(\nabla f(X^\star) = 0\) so that \(X^\star = M\): a contradiction.) In particular, \(X^\star\) is second-order critical on \(\Rmnr\), and we know from Eq. 4 that all such points have the same cost function value; hence, they are all optimal.

This takes care of the second part of Theorem 1.

Notice that a by-product of this proof is that the minimizers of \(f\) are exactly the truncations of SVDs of \(M\) to rank \(r\). In other words: we re-proved the Eckart–Young–Mirsky theorem for the Frobenius norm.

References

Boumal, N. 2023. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press. https://doi.org/10.1017/9781009166164.
Helmke, U., and M. A. Shayman. 1995. “Critical Points of Matrix Least Squares Distance Functions.” Linear Algebra and Its Applications 215: 1–19. https://doi.org/10.1016/0024-3795(93)00070-G.
Ho, N.-D. 2008. “Nonnegative Matrix Factorization: Algorithms and Applications.” PhD thesis, Louvain-la-Neuve, Belgium: Université catholique de Louvain. https://perso.uclouvain.be/paul.vandooren/ThesisHo.pdf.
Schneider, R., and A. Uschmajew. 2015. “Convergence Results for Projected Line-Search Methods on Varieties of Low-Rank Matrices via Łojasiewicz Inequality.” SIAM Journal on Optimization 25 (1): 622–46. https://doi.org/10.1137/140957822.

Citation

BibTeX citation:
@online{boumal2023,
  author = {Boumal, Nicolas and Criscitiello, Christopher},
  title = {Benign Landscape of Low-Rank Approximation: {Part} {I}},
  date = {2023-12-11},
  url = {racetothebottom.xyz/posts/low-rank-approx},
  langid = {en},
  abstract = {The minimizers of \$\textbackslash sqfrobnorm\{X-M\}\$
    subject to \$\textbackslash rank(X) \textbackslash leq r\$ are given
    by the SVD of \$M\$ truncated at rank \$r\$. The optimization
    problem is nonconvex but it has a benign landscape. That’s folklore,
    here with a proof.}
}