Benign landscape of low-rank approximation: Part II

Authors

Nicolas Boumal

Christopher Criscitiello

Published

December 13, 2023

Abstract
The landscape of \(X \mapsto \sqfrobnorm{AXB - C}\) subject to \(\rank(X) \leq r\) is benign for full rank \(A, B\). Same for the landscape of \((W_1, W_2) \mapsto \|W_2W_1 B - C\|_{\mathrm{F}}^2\) (2-layer linear neural networks). We show these by combining a few basic facts.

The facts

In Part I, we noted that the optimization problem \[ \min_{X \in \Rmn} \frac{1}{2} \sqfrobnorm{X - M} \qquad \textrm{ subject to } \qquad \rank(X) \leq r \] has a benign landscape: despite the nonconvex constraint, there are no local traps.
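As a reminder of why: the global optimum of that problem is a truncated SVD (Eckart–Young). Here is a quick numerical illustration in numpy (sizes chosen arbitrarily), checking the optimal value and that random rank-\(r\) competitors cannot do better:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 5, 4, 2
M = rng.standard_normal((m, n))

# Best rank-r approximation: truncate the SVD (Eckart-Young).
U, s, Vt = np.linalg.svd(M)
X_star = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
f_star = 0.5 * np.linalg.norm(X_star - M, 'fro')**2

# The optimal value is half the sum of the squared discarded singular values.
assert np.isclose(f_star, 0.5 * np.sum(s[r:]**2))

# No randomly drawn rank-r competitor does better.
for _ in range(100):
    L = rng.standard_normal((m, r))
    R = rng.standard_normal((n, r))
    assert 0.5 * np.linalg.norm(L @ R.T - M, 'fro')**2 >= f_star
```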

Part of that fact stems from convexity of the cost function. And indeed, Ha, Liu, and Foygel Barber (2020) (among others) have shown results of the following nature:

If the cost function is \(\alpha\)-strongly convex (in some rank-restricted sense) and \(\beta\)-smooth, and if the condition number \(\beta / \alpha\) is less than 2, and if the unconstrained minimizer is sufficiently close to the bounded-rank matrices, then the landscape is benign despite the nonconvex constraint \(\rank(X) \leq r\).

The limitation on the condition number is stringent, but unfortunately necessary even for a quadratic cost function: we circle back to this at the end.

Still, we can depart from \(\sqfrobnorm{X - M}\) in other ways. In particular, using a change-of-variable argument as in (Lu and Kawaguchi 2017), it is easy to extend the theorem from Part I as follows:

Theorem 1 (Benign landscape) Let \(A, B, C\) be given matrices with \(\rank(A) = m\) and \(\rank(B) = n\). Consider \[ \min_{X \in \Rmn} \frac{1}{2}\sqfrobnorm{AXB - C} \qquad \textrm{ subject to } \qquad \rank(X) \leq r. \] Second-order necessary optimality conditions are sufficient for global optimality, in the following sense:

  1. If \(\rank(X) < r\) and \(X\) is first-order critical, then \(X\) is optimal.
  2. If \(\rank(X) = r\) and \(X\) is second-order critical, then \(X\) is optimal.

In particular, local minima are global minima and saddle points are strict.

The set \(\Rmnlr\) of matrices with rank at most \(r\) can be parameterized via the map \((L, R) \mapsto LR\transpose\), where \(L\) and \(R\) each have \(r\) columns. Ha, Liu, and Foygel Barber (2020) studied the general properties of this parameterization. Applying their observations to Theorem 1, the following claim is immediate:

Corollary 1 (Through a lift) Let \(A, B, C\) be given matrices with \(\rank(A) = m\) and \(\rank(B) = n\). Consider minimizing \(g \colon \reals^{m \times r} \times \reals^{n \times r} \to \reals\) where \[ g(L, R) = \frac{1}{2} \sqfrobnorm{A LR\transpose B - C}. \] Its second-order critical points are global minima, i.e., local minima are global and saddle points are strict.

We prove all of this below, after a brief discussion of linear networks.

Interpretation for linear networks

You may recognize in Corollary 1 a classical result in the machine learning literature.

Indeed, set \(A\) to identity, and think of the \(p\) columns of \(B\) and \(C\) as input-output pairs. Rename \(L, R\) to \(W_1, W_2\) to evoke the weight matrices of a neural network. If the activation function is linear (which is only of theoretical interest), then the neural network encodes the function \(x \mapsto W_2W_1x\) and the least-squares loss is \(\sqfrobnorm{W_2 W_1 B - C}\).

Baldi and Hornik (1989) investigated that landscape long ago. They obtained a result akin to the following (with some additional conditions on the data).

Corollary 2 (Linear neural network, two layers) Consider minimizing \(g \colon \reals^{r \times n} \times \reals^{m \times r} \to \reals\) where \[ g(W_1, W_2) = \frac{1}{2} \sqfrobnorm{W_2 W_1 B - C}. \] If \(\rank(B) = n\), then the local minima of \(g\) are global minima and saddle points are strict.

From this post’s perspective, Corollary 2 is a consequence of

  1. benign non-convexity of low-rank approximation (from Part I),
  2. extended via a change of variable argument to become Theorem 1,
  3. then further extended via a lift argument from \(X\) to \(W_2W_1\).
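That pipeline can be traced numerically. The numpy sketch below (with arbitrary small sizes) computes the global minimizer of the rank-constrained problem through the change of variable described below, factors it as \(W_2 W_1\), and checks that the Euclidean gradients of \(g\) vanish there, as they must at a global minimum:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p, r = 4, 3, 6, 2      # B is n x p with rank n; C is m x p
B = rng.standard_normal((n, p))
C = rng.standard_normal((m, p))

# Global minimizer of 0.5 * ||X B - C||_F^2 over rank(X) <= r:
# reduce to plain low-rank approximation, truncate, map back.
UB, sB, VBt = np.linalg.svd(B)
Ctil = (C @ VBt.T)[:, :n]
U, s, Vt = np.linalg.svd(Ctil)
Zstar = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
Xstar = Zstar @ np.diag(1 / sB) @ UB.T

# Factor X* = W2 W1 and check the gradients of g vanish at (W1, W2).
Ux, sx, Vxt = np.linalg.svd(Xstar)
W2 = Ux[:, :r] @ np.diag(sx[:r])   # m x r
W1 = Vxt[:r, :]                    # r x n
G = (W2 @ W1 @ B - C) @ B.T        # Euclidean gradient of f at X*
assert np.linalg.norm(W2.T @ G) < 1e-10   # gradient with respect to W1
assert np.linalg.norm(G @ W1.T) < 1e-10   # gradient with respect to W2
```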

Baldi (1988) also entertained (though without proofs) deep linear networks, where the weight matrices of the \(L\) layers \((W_1, \ldots, W_L)\) are mapped to the product \(W_L \cdots W_1\). That was in 1988, at the second edition of what is now NeurIPS. More recently, Lu and Kawaguchi (2017) proved a version of Corollary 2 for that setting. Their claim is that local minima (but, correctly, not necessarily second-order critical points) are global minima. Their proof is rather direct, also splitting concerns between the bounded-rank version and the factor version. Laurent and von Brecht (2018) give a different proof (more of a “lifts” flavor) that allows for other convex losses (but no rank constraint).

Accommodating data via a change of variable

In Theorem 1, we minimize \(f(X) = \frac{1}{2} \sqfrobnorm{AXB - C}\). The first part of the theorem states that if \(\rank(X) < r\) and \(X\) is first-order critical, then it is optimal. That holds because \(f\) is convex: see the corresponding corollary in Part I.

For the second part of the claim, assume \(X\) is second-order critical with \(\rank(X) = r\). Moreover, we assume \(A\) has full column rank and \(B\) has full row rank. This makes it possible to operate a change of variable, as follows. The essence of this readily appears in (Baldi and Hornik 1989), and very explicitly so in (Lu and Kawaguchi 2017).

Let \(A = U_A^{} S_A^{} V_A\transpose\) be a full SVD of \(A\). Here, \(S_A = [\Sigma_A, 0]\transpose\) has the same size as \(A\), and \(\Sigma_A\) is \(m \times m\), diagonal and invertible. Likewise, let \(B = U_B^{} S_B^{} V_B\transpose\) be a full SVD of \(B\), with \(S_B = [\Sigma_B, 0]\), where \(\Sigma_B\) is \(n \times n\), diagonal and invertible. Then, \[\begin{align} f(X) & = \frac{1}{2} \sqfrobnorm{AXB - C} \\ & = \frac{1}{2} \sqfrobnorm{S_A^{} V_A\transpose X U_B^{} S_B^{} - U_A\transpose C V_B^{}} \\ & = \frac{1}{2} \sqfrobnorm{\Sigma_A^{} V_A\transpose X U_B^{} \Sigma_B^{} - \tilde C} + \mathrm{ constant}, \end{align}\] where \(\tilde C\) is the upper-left block of \(U_A\transpose C V_B^{}\) of appropriate size, and the constant is (half) the squared Frobenius norm of the rest of that matrix.
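This identity, and the global minimizer it yields (truncate the SVD of \(\tilde C\), then map back through \(\psi^{-1}\)), can be checked numerically. A numpy sketch, with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, n, q, r = 6, 4, 3, 5, 2   # A is p x m with rank m; B is n x q with rank n
A = rng.standard_normal((p, m))
B = rng.standard_normal((n, q))
C = rng.standard_normal((p, q))

UA, sA, VAt = np.linalg.svd(A)   # full SVD: UA is p x p, VAt is m x m
UB, sB, VBt = np.linalg.svd(B)   # UB is n x n, VBt is q x q

Cfull = UA.T @ C @ VBt.T
Ctil = Cfull[:m, :n]             # upper-left block
const = 0.5 * (np.linalg.norm(Cfull, 'fro')**2 - np.linalg.norm(Ctil, 'fro')**2)

f = lambda X: 0.5 * np.linalg.norm(A @ X @ B - C, 'fro')**2
h = lambda Z: 0.5 * np.linalg.norm(Z - Ctil, 'fro')**2
psi = lambda X: np.diag(sA) @ VAt @ X @ UB @ np.diag(sB)

# The identity f = h o psi + constant holds for any X.
X = rng.standard_normal((m, n))
assert np.isclose(f(X), h(psi(X)) + const)

# Global minimizer: truncate the SVD of Ctil, then map back through psi^{-1}.
U, s, Vt = np.linalg.svd(Ctil)
Zstar = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
Xstar = VAt.T @ np.diag(1 / sA) @ Zstar @ np.diag(1 / sB) @ UB.T
assert np.isclose(f(Xstar), 0.5 * np.sum(s[r:]**2) + const)
```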

Now notice that \(f = h \circ \psi + \mathrm{ constant}\), where \(h(Z) = \frac{1}{2} \sqfrobnormsmall{Z - \tilde C}\) and \[ \psi(X) = \Sigma_A^{} V_A\transpose X U_B^{} \Sigma_B^{} \] is a diffeomorphism from \(\Rmnr\) to \(\Rmnr\) (the manifold of matrices of rank exactly \(r\)). This naturally implies the following:

  • If \(X\) is second-order critical for \(f|_{\Rmnr}\), then \(Z = \psi(X)\) is second-order critical for \(h|_{\Rmnr}\) (this is because \(\D\psi(X)\) is surjective, see (Levin, Kileel, and Boumal 2022, sec. 2.6)).
  • From Part I, it follows that \(Z\) is a global minimum for \(h\) on \(\Rmnlr\).
  • Therefore, \(X\) is a global minimum for \(f\) on \(\Rmnr\) and (by taking the closure) on \(\Rmnlr\).

This concludes the proof of Theorem 1.

Landscape through a lift

Insight into the landscape for \(f\) on the set \(\Rmnlr\) of bounded rank matrices immediately delivers insight into the landscape of \(g(L, R) = f(LR\transpose)\).

This is thanks to an excellent observation by Ha, Liu, and Foygel Barber (2020), who proved that if \((L, R)\) is second-order critical for \(g\), then \(LR\transpose\) is always stationary for \(f\) on \(\Rmnlr\). Versions of that fact appeared many times before, though (as far as we know) limited to cases where \(L\) and \(R\) have identical or even full rank. The neat proof in (Ha, Liu, and Foygel Barber 2020) elegantly handles all cases at once.

Moreover, if \(X = \varphi(L, R) = LR\transpose\) has (maximal) rank \(r\), then necessarily \(L\) and \(R\) both have full rank \(r\). It is then easy to check that \(\D\varphi(L, R)\) is surjective to the tangent space of \(\Rmnr\) at \(X\). It follows that if such a pair \((L, R)\) is second-order critical for \(g = f \circ \varphi\), then \(X = \varphi(L, R)\) is second-order critical for \(f\) on \(\Rmnr\) (we used that argument on \(\psi\) in the previous section). See for example (Levin, Kileel, and Boumal 2022, secs. 2.3, 2.6).
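That surjectivity claim is easy to probe numerically: with full-rank factors, the image of \(\D\varphi(L, R)\), namely \(\{\dot L R\transpose + L \dot R\transpose\}\), has dimension \((m+n-r)r\), which is the dimension of the tangent spaces of \(\Rmnr\). A small numpy sketch (arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 5, 4, 2
L = rng.standard_normal((m, r))
R = rng.standard_normal((n, r))

# Matrix of Dphi(L, R)[dL, dR] = dL R^T + L dR^T, as a linear map
# from R^(mr + nr) to R^(mn), built column by column.
D = np.zeros((m * n, (m + n) * r))
for k in range(m * r):
    dL = np.zeros((m, r)); dL.flat[k] = 1
    D[:, k] = (dL @ R.T).ravel()
for k in range(n * r):
    dR = np.zeros((n, r)); dR.flat[k] = 1
    D[:, m * r + k] = (L @ dR.T).ravel()

# The image has the dimension of the tangent spaces of the rank-r manifold.
assert np.linalg.matrix_rank(D) == (m + n - r) * r
```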

Combined, these observations warrant the following lemma:

Lemma 1 (Factorization lift) Let \(f \colon \Rmn \to \reals\) be twice continuously differentiable, and let \[ \varphi \colon \Rmr \times \Rnr \to \Rmn, \qquad \varphi(L, R) = LR\transpose \] be a smooth parameterization of \(\Rmnlr\). If \((L, R)\) is second-order critical for \(g = f \circ \varphi\), then

  1. \(X = LR\transpose\) is first-order critical for \(f\) on \(\Rmnlr\), and
  2. If \(X\) has rank \(r\), then it is also second-order critical for \(f\) on \(\Rmnr\).

Corollary 1 is a direct consequence of Theorem 1 and Lemma 1.

Such arguments (based on the properties of a smooth map used to parameterize a nonsmooth set) are quite powerful. You can find many more examples in (Levin, Kileel, and Boumal 2022), including other ways of parameterizing \(\Rmnlr\) which satisfy a version of Lemma 1, thus also preserving the benign landscape.

Not benign for general quadratics

One might wonder if Theorem 1 should hold for all strongly convex quadratics, or even for a much broader class of strongly convex cost functions. The results in Ha, Liu, and Foygel Barber (2020) are limited (in particular) to condition numbers strictly less than 2. Uschmajew and Vandereycken (2020) have also encountered such limitations.

Unfortunately, these are inescapable. Zhang et al. (2018) (also Zhang, Sojoudi, and Lavaei (2019)) showed as much for bounded-rank optimization over symmetric, positive semidefinite matrices.

Here is an adaptation of their third example, to work with general bounded-rank matrices. Let \(\calA \colon \reals^{2 \times 2} \to \reals^4\) be the linear map defined by \(\calA(X)_i = \inner{A_i}{X}\) with \(\inner{A}{B} = \trace(A\transpose B)\) and \[ A_1 = \begin{bmatrix} \sqrt{2} & 0 \\ 0 & \frac{1}{\sqrt{2}} \end{bmatrix}, \qquad A_2 = \begin{bmatrix} 0 & 2 \\ 0 & 0 \end{bmatrix}, \qquad A_3 = \begin{bmatrix} 0 & 0 \\ 2 & 0 \end{bmatrix}, \qquad A_4 = \begin{bmatrix} 0 & 0 \\ 0 & \sqrt{\frac{3}{2}} \end{bmatrix}. \] Also let \(Z = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\). The quadratic cost function is \(f \colon \reals^{2 \times 2} \to \reals\), defined by \[ f(X) = \frac{1}{2}\|\calA(X - Z)\|_2^2. \] This is indeed strongly convex, with condition number 4 (that’s the condition number of \(\calA\) squared).

Yet, one can check that \(X = \begin{bmatrix} 0 & 0 \\ 0 & \frac{1}{2} \end{bmatrix}\) is a strict, non-global local minimum of \(\min_{\rank(X) \leq 1} f(X)\). To do so, notice that \(X\) has maximal rank (equal to 1), that the Riemannian gradient is zero at \(X\), and that the Riemannian Hessian is positive definite at \(X\); yet \(f(X) = 3/4\) whereas the minimal value is \(f(Z) = 0\).

Here is Matlab code to run those checks using Manopt:

% This code requires Manopt: https://www.manopt.org
% Call the manopt factory for the manifold of m x n matrices of rank r,
% with m = n = 2 and r = 1
manifold = fixedrankembeddedfactory(2, 2, 1);
problem.M = manifold;
Z = [1, 0 ; 0, 0];
A = [sqrt(2), 0, 0, 1/sqrt(2)
           0, 2, 0, 0
           0, 0, 2, 0
           0, 0, 0, sqrt(3/2)];
% Points and tangent vectors on this manifold are represented as triplets
% rather than matrices (to allow scaling up to high dimensions). In this
% small setup, we use a few tools to transform those triplets to matrices.
vec = @(M) M(:);
mat = @(X) manifold.triplet2matrix(X);
matv = @(X, V) manifold.triplet2matrix(manifold.tangent2ambient(X, V));
% Define the cost function and its Euclidean gradient and Hessian.
% Manopt automatically figures out the Riemannian versions.
problem.cost = @(X) .5*norm(A*vec(mat(X) - Z), 'fro')^2;
problem.egrad = @(X) reshape(A'*A*vec(mat(X) - Z), 2, 2);
problem.ehess = @(X, V) reshape(A'*A*vec(matv(X, V)), 2, 2);

% That point is a second-order critical point
X = manifold.matrix2triplet([0, 0 ; 0, .5]);
fprintf('Function value at X: %.2g\n', getCost(problem, X));
fprintf('Riemannian gradient norm at X: %.2g\n', ...
        manifold.norm(X, getGradient(problem, X)));
fprintf('Eigenvalues of Riemannian Hessian at X:\n');
fprintf('  %.2g ', eig(hessianmatrix(problem, X)));
fprintf('\n');
fprintf('Singular values of operator A:\n');
fprintf('  %.2g ', svd(A));
fprintf('\n');

Running that code produces the following output:

Function value at X: 0.75
Riemannian gradient norm at X: 2.2e-16
Eigenvalues of Riemannian Hessian at X:
  1   2   7 
Singular values of operator A:
  2   2   1.7   1 

Of course, this example carries over to Corollary 1. That is because if \(X\) is a local minimum for \(f\) on \(\Rmnlr\), then any \((L, R)\) such that \(\varphi(L, R) = LR\transpose = X\) is a local minimum for \(f \circ \varphi\) (by continuity of \(\varphi\)). Thus, there are spurious local minima through the \(LR\transpose\) lift as well.
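One can double-check this through the lift in plain numpy, without Manopt: at any \((L, R)\) with \(LR\transpose = X\), the gradient of \(g = f \circ \varphi\) vanishes and its (exact) Hessian is positive semidefinite, with the zero eigenvalue reflecting the scaling symmetry \((L, R) \mapsto (tL, R/t)\) of the lift, even though \(g(L, R) = 3/4\) exceeds the global minimum \(0\). A sketch:

```python
import numpy as np

# The four measurement matrices and the target Z from the example above.
A1 = np.array([[np.sqrt(2), 0], [0, 1/np.sqrt(2)]])
A2 = np.array([[0, 2], [0, 0]])
A3 = np.array([[0, 0], [2, 0]])
A4 = np.array([[0, 0], [0, np.sqrt(1.5)]])
As = [A1, A2, A3, A4]
Z = np.array([[1.0, 0], [0, 0]])

calA = lambda M: np.array([np.sum(Ai * M) for Ai in As])
f = lambda M: 0.5 * np.linalg.norm(calA(M - Z))**2

# The candidate point through the lift: X = L R^T has rank 1.
L = np.array([[0.0], [1.0]])
R = np.array([[0.0], [0.5]])
X = L @ R.T
assert np.isclose(f(X), 0.75) and np.isclose(f(Z), 0.0)

# Euclidean gradient of f at X; gradient of g = f(L R^T) vanishes.
G = sum(calA(X - Z)[i] * As[i] for i in range(4))
assert np.linalg.norm(G @ R) < 1e-12      # gradient with respect to L
assert np.linalg.norm(G.T @ L) < 1e-12    # gradient with respect to R

# Exact Hessian of g at (L, R), assembled from its quadratic form
# H(d, d) = ||calA(dL R^T + L dR^T)||^2 + 2 <G, dL dR^T>.
def quad(d):
    dL, dR = d[:2].reshape(2, 1), d[2:].reshape(2, 1)
    W = dL @ R.T + L @ dR.T
    return np.linalg.norm(calA(W))**2 + 2 * np.sum(G * (dL @ dR.T))

E = np.eye(4)
H = np.array([[0.5 * (quad(E[i] + E[j]) - quad(E[i]) - quad(E[j]))
               for j in range(4)] for i in range(4)])
# PSD: no negative eigenvalues (one zero, along the scaling symmetry).
assert np.min(np.linalg.eigvalsh(H)) > -1e-9
```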

References

Baldi, P. 1988. “Linear Learning: Landscapes and Algorithms.” In Advances in Neural Information Processing Systems, edited by D. Touretzky. Vol. 1. Morgan–Kaufmann. https://proceedings.neurips.cc/paper_files/paper/1988/file/202cb962ac59075b964b07152d234b70-Paper.pdf.
Baldi, P., and K. Hornik. 1989. “Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima.” Neural Networks 2 (1): 53–58. https://doi.org/10.1016/0893-6080(89)90014-2.
Ha, W., H. Liu, and R. Foygel Barber. 2020. “An Equivalence Between Critical Points for Rank Constraints Versus Low-Rank Factorizations.” SIAM Journal on Optimization 30 (4): 2927–55. https://doi.org/10.1137/18m1231675.
Laurent, T., and J. von Brecht. 2018. “Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global.” In Proceedings of the 35th International Conference on Machine Learning, edited by J. Dy and A. Krause, 80:2902–7. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v80/laurent18a.html.
Levin, E., J. Kileel, and N. Boumal. 2022. “The Effect of Smooth Parametrizations on Nonconvex Optimization Landscapes.” arXiv 2207.03512.
Lu, H., and K. Kawaguchi. 2017. “Depth Creates No Bad Local Minima.” arXiv Preprint arXiv:1702.08580. https://arxiv.org/abs/1702.08580.
Uschmajew, A., and B. Vandereycken. 2020. “On Critical Points of Quadratic Low-Rank Matrix Optimization Problems.” IMA Journal of Numerical Analysis 40 (4): 2626–51. https://doi.org/10.1093/imanum/drz061.
Zhang, R. Y., C. Josz, S. Sojoudi, and J. Lavaei. 2018. “How Much Restricted Isometry Is Needed in Nonconvex Matrix Recovery?” In Advances in Neural Information Processing Systems, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/file/f8da71e562ff44a2bc7edf3578c593da-Paper.pdf.
Zhang, R. Y., S. Sojoudi, and J. Lavaei. 2019. “Sharp Restricted Isometry Bounds for the Inexistence of Spurious Local Minima in Nonconvex Matrix Recovery.” Journal of Machine Learning Research 20 (114): 1–34. http://jmlr.org/papers/v20/19-020.html.

Citation

BibTeX citation:
@online{boumal2023,
  author = {Boumal, Nicolas and Criscitiello, Christopher},
  title = {Benign Landscape of Low-Rank Approximation: {Part} {II}},
  date = {2023-12-13},
  url = {racetothebottom.xyz/posts/low-rank-approx-corollaries},
  langid = {en},
  abstract = {The landscape of \$X \textbackslash mapsto \textbackslash
    sqfrobnorm\{AXB - C\}\$ subject to \$\textbackslash rank(X)
    \textbackslash leq r\$ is benign for full rank \$A, B\$. Same for
    the landscape of \$(W\_1, W\_2) \textbackslash mapsto
    \textbackslash\textbar W\_2W\_1 B -
    C\textbackslash\textbar\_\{\textbackslash mathrm\{F\}\}\^{}2\$
    (2-layer linear neural networks). We show these by combining a few
    basic facts.}
}