The usual smooth lift of the nuclear norm regularizer enjoys \(2 \Rightarrow 1\)

nonconvex
lift
rank
Authors

Nicolas Boumal

Andrew D. McRae

Published

October 3, 2024

Abstract
Minimizing \(f(X)\) with a nuclear-norm regularizer is equivalent to minimizing \(g(L, R) = f(LR^\top)\) with the squared Frobenius norm regularizer. To this classical result, we add that second-order critical points \((L, R)\) map to first-order stationary points \(LR^\top\) under a bounded-rank constraint.

Consider the problem of minimizing a \(C^2\) function \(f \colon \Rmn \to \reals\) with a nuclear norm regularizer: \[ f_\lambda(X) = f(X) + \lambda \nucnorm{X}. \qquad(1)\] The nuclear norm promotes low-rank solutions in a soft way (the larger \(\lambda > 0\), the more we push for low-rank solutions). Favoring “simple” solutions may be useful for statistical reasons, and also to ease computation. However, for the latter, we have the issue that the search space \(\Rmn\) is still high dimensional. To fix this, we may additionally impose a hard rank constraint: \[ \min_{X \in \Rmn} f_\lambda(X) \qquad \textrm{ subject to } \qquad \rank(X) \leq r. \qquad(2)\] This reduces the dimension of the search space from \(mn\) down to \(r(m+n-r)\).

Problem (Eq. 2) is nonsmooth for two reasons:

  1. The feasible set (bounded-rank matrices) is a nonsmooth algebraic variety.
  2. The regularizer is a nonsmooth function.

Let us smooth both with standard tricks.

The smooth lift of our problem

To smooth out the first part, we can just parameterize the set of matrices of rank at most \(r\) via the lift \((L, R) \mapsto LR^\top\) where \(L\) and \(R\) have \(r\) columns. For the second part, it is known (e.g., (Srebro, Rennie, and Jaakkola 2004)) that \[ \nucnorm{X} = \min_{L, R : LR^\top = X} \frac{1}{2}\Big(\sqfrobnorm{L} + \sqfrobnorm{R}\Big). \qquad(3)\] This leads to the following equivalent but smooth formulation of our problem: \[ \min_{L \in \Rmr, R \in \Rnr} g_\lambda(L, R) \qquad \textrm{ where } \qquad g_\lambda(L, R) = f(LR^\top) + \frac{\lambda}{2}\Big(\sqfrobnorm{L} + \sqfrobnorm{R}\Big). \qquad(4)\] This sequence of steps is common; see also (Mishra et al. 2013), (Li, Zhu, and Tang 2019) and (Lee et al. 2023) for just three of many examples. What we are about to discuss closely connects with the recent results in (McRae 2024).
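
Before moving on, here is a quick numerical sanity check of Eq. 3: a hypothetical numpy sketch (the sizes and the particular factorizations are illustrative assumptions, not part of the argument). It confirms that the balanced factors \(L = U\Sigma^{1/2}\) and \(R = V\Sigma^{1/2}\) obtained from a compact SVD attain the minimum in Eq. 3, whereas other factorizations of the same \(X\) can only give larger values.

import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 3

# A random matrix of rank r and its nuclear norm.
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
nuc = np.linalg.norm(X, 'nuc')

# Balanced factors built from a compact SVD attain the minimum in Eq. 3.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
L = U * np.sqrt(s)                  # L = U Sigma^{1/2}
R = Vt.T * np.sqrt(s)               # R = V Sigma^{1/2}
print(nuc, 0.5 * (np.sum(L**2) + np.sum(R**2)))    # equal up to rounding

# Any other factorization L' R'^T = X, e.g. L' = L M, R' = R M^{-T}, gives a value >= nuc.
M = rng.standard_normal((r, r)) + 3 * np.eye(r)    # generically invertible
Lp, Rp = L @ M, R @ np.linalg.inv(M).T
print(0.5 * (np.sum(Lp**2) + np.sum(Rp**2)) >= nuc - 1e-10)   # True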

When \(\lambda = 0\) (no regularization), it was shown in (Ha, Liu, and Foygel Barber 2020) that if \((L, R)\) satisfies first- and second-order necessary optimality conditions for \(g_\lambda\), then \(X = LR^\top\) satisfies first-order necessary optimality conditions for \(f_\lambda\): see also the discussion in this other post.

One may naturally wonder if that still holds with the added nonsmooth regularizer (\(\lambda > 0\)). In this post, we show that this is indeed the case.

Theorem 1 (\(2 \Rightarrow 1\) for nuclear norm regularization) Fix \(\lambda > 0\). Let \((L, R) \in \Rmr \times \Rnr\) be a critical point of \(g_\lambda\) (that is, \(\nabla g_\lambda(L, R) = 0\)), and let \(X = LR^\top\).

  1. If \(\rank(X) = r\) (maximal rank), then \(X\) is stationary for (Eq. 2).
  2. If \(\rank(X) < r\) (lesser rank) and \(\nabla^2 g_\lambda(L, R) \succeq 0\), then \(X\) is stationary for (Eq. 2).

Overall, if \((L, R)\) is second-order critical for (Eq. 4), then \(LR^\top\) is first-order stationary for (Eq. 2).

(In fact, if \((L, R)\) is second-order critical for (Eq. 4) and \(\rank(X) = r\) then \(X\) is second-order critical for (Eq. 2), because in this scenario \(f_\lambda\) is \(C^2\) on the manifold of rank-\(r\) matrices and \((L, R) \mapsto LR^\top\) is a submersion to that manifold; see (Levin, Kileel, and Boumal 2024, Prop. 2.12) or (Boumal 2023, Prop. 9.6 and Ex. 9.20).)

The sections below, combined, give a proof of Theorem 1.

Let us also spell out an immediate consequence that has appeared in several variations:

Corollary 1 (Convexity) Assume \(f\) is convex and \(C^2\), and fix \(\lambda > 0\). If \((L, R)\) is second-order critical for (Eq. 4) and \(X = LR^\top\) has rank strictly less than \(r\), then \(X\) is a global minimizer of \(f_\lambda\) and \((L, R)\) is a global minimizer of \(g_\lambda\).

Proof. Owing to Theorem 1, \(X\) is stationary for the convex function \(f_\lambda\) (both with and without the constraint \(\rank(X) \leq r\), since the latter is not active), hence it is a global minimizer of \(f_\lambda\). Moreover, the factors of a critical point are balanced (Lemma 1 below), so that \(\frac{1}{2}\big(\sqfrobnorm{L} + \sqfrobnorm{R}\big) = \nucnorm{X}\) and therefore \(g_\lambda(L, R) = f_\lambda(X)\). Since Eq. 3 gives \(g_\lambda(L', R') \geq f_\lambda(L'R'^\top) \geq f_\lambda(X) = g_\lambda(L, R)\) for all \((L', R')\), the pair \((L, R)\) is a global minimizer of \(g_\lambda\).

In particular, if \(r \geq \min(m, n)+1\) in that corollary, then \(g_\lambda\) has a benign landscape: its second-order critical points are globally optimal. (The “+1” can be removed.) Picking such a high value for \(r\) is often not necessary, but it makes for a simple statement: it allows us to replace the nonsmooth \(f_\lambda\) with the smooth \(g_\lambda\) without creating spurious local minima and with only double the dimension.
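
To make the corollary concrete, here is a small, hypothetical numpy experiment (the cost \(f(X) = \frac{1}{2}\sqfrobnorm{X - A}\), the spectrum of \(A\), the step size, and the iteration count are all illustrative assumptions of ours). For that \(f\), the global minimizer of \(f_\lambda\) has a closed form: soft-threshold the singular values of \(A\) at level \(\lambda\). Running plain gradient descent on \(g_\lambda\) from a random initialization with \(r = \min(m, n)\) typically converges to a second-order critical point, and \(LR^\top\) then matches that closed-form solution, as the corollary predicts.

import numpy as np

rng = np.random.default_rng(1)
m, n, lam = 8, 6, 0.5
r = min(m, n)                       # overparameterize; the solution below has rank 4 < r

# A test matrix with a chosen spectrum, so thresholding at lam removes two singular values.
U0 = np.linalg.qr(rng.standard_normal((m, n)))[0]
V0 = np.linalg.qr(rng.standard_normal((n, n)))[0]
svals = np.array([3.0, 2.5, 1.0, 0.8, 0.2, 0.1])
A = (U0 * svals) @ V0.T

# Closed-form global minimizer of f_lambda for f(X) = 0.5*||X - A||_F^2:
# soft-threshold the singular values of A at level lam.
X_star = (U0 * np.maximum(svals - lam, 0)) @ V0.T

# Plain gradient descent on g_lambda from a random initialization (step size ad hoc).
L = 0.3 * rng.standard_normal((m, r))
R = 0.3 * rng.standard_normal((n, r))
step = 0.005
for _ in range(50000):
    G = L @ R.T - A                 # Euclidean gradient of f at X = L R^T
    L, R = L - step * (G @ R + lam * L), R - step * (G.T @ L + lam * R)

print(np.linalg.norm(L @ R.T - X_star))   # typically tiny: the lift found the global minimizer

(In practice, first-order methods tend to escape the strict saddles of this benign landscape, which is why such an experiment typically succeeds.)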

Necessary optimality conditions in \(L, R\)

Fix \(\lambda > 0\). Pick any \((L, R) \in \Rmr \times \Rnr\) and let \(X = LR^\top\). The function \(g_\lambda \colon \Rmr \times \Rnr \to \reals\) in Eq. 4 has the following gradient at \((L, R)\): \[ \nabla g_\lambda(L, R) = \Big( \nabla f(X) R + \lambda L, \nabla f(X)^\top L + \lambda R \Big). \] Also, its Hessian at \((L, R)\) along \((\dot L, \dot R) \in \Rmr \times \Rnr\) is \[\begin{align*} \nabla^2 g_\lambda(L, R)[\dot L, \dot R] = \Big( & \nabla^2 f(X)[\dot X] R + \nabla f(X) \dot R + \lambda \dot L, \\ & \nabla^2 f(X)[\dot X]^\top L + \nabla f(X)^\top \dot L + \lambda \dot R \Big), \end{align*}\] where \(\dot X = \dot L R^\top + L \dot R^\top\).

Thus, a pair \((L, R)\) is first-order critical for \(g_\lambda\) if and only if \[\begin{align*} \nabla f(X) R + \lambda L & = 0, \textrm{ and } \\ \nabla f(X)^\top L + \lambda R & = 0. \end{align*} \qquad(5)\] Such a pair is also a second-order critical point if and only if \[\begin{align*} 0 & \leq \inner{(\dot L, \dot R)}{\nabla^2 g_\lambda(L, R)[\dot L, \dot R]} \\ & = \innersmall{\dot X}{\nabla^2 f(X)[\dot X]} + 2 \innersmall{\dot L \dot R^\top}{\nabla f(X)} + \lambda \big( \sqfrobnormsmall{\dot L} + \sqfrobnormsmall{\dot R} \big) \end{align*} \qquad(6)\] for all \(\dot L, \dot R\), where \(\inner{A}{B} = \trace(A^\top B)\) is the usual inner product on matrices (extended to pairs of matrices as a product metric: \(\inner{(A_1, A_2)}{(B_1, B_2)} = \inner{A_1}{B_1} + \inner{A_2}{B_2}\)).
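
These formulas are easy to verify numerically. The following hypothetical numpy snippet uses the illustrative choice \(f(X) = \frac{1}{2}\sqfrobnorm{X - A}\) (so that \(\nabla f(X) = X - A\) and \(\nabla^2 f(X)[\dot X] = \dot X\)); none of these choices appear in the argument. It compares the gradient formula and the expanded quadratic form in Eq. 6 against finite differences of \(g_\lambda\) along a random direction.

import numpy as np

rng = np.random.default_rng(2)
m, n, r, lam = 5, 4, 3, 0.7
A = rng.standard_normal((m, n))

# Illustrative cost: f(X) = 0.5*||X - A||_F^2, so grad f(X) = X - A and Hess f(X)[Xdot] = Xdot.
def g(L, R):
    X = L @ R.T
    return 0.5 * np.sum((X - A)**2) + 0.5 * lam * (np.sum(L**2) + np.sum(R**2))

L, R = rng.standard_normal((m, r)), rng.standard_normal((n, r))
Ld, Rd = rng.standard_normal((m, r)), rng.standard_normal((n, r))
X, Xd = L @ R.T, Ld @ R.T + L @ Rd.T
h = 1e-4

# Gradient formula vs. a centered finite difference along the direction (Ld, Rd).
gL, gR = (X - A) @ R + lam * L, (X - A).T @ L + lam * R
fd1 = (g(L + h*Ld, R + h*Rd) - g(L - h*Ld, R - h*Rd)) / (2*h)
print(fd1, np.sum(gL*Ld) + np.sum(gR*Rd))             # agree to many digits

# Quadratic form of the Hessian, expanded as in Eq. 6, vs. a second finite difference.
quad = np.sum(Xd**2) + 2*np.sum((Ld @ Rd.T)*(X - A)) + lam*(np.sum(Ld**2) + np.sum(Rd**2))
fd2 = (g(L + h*Ld, R + h*Rd) - 2*g(L, R) + g(L - h*Ld, R - h*Rd)) / h**2
print(fd2, quad)                                      # agree to several digits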

Properties of first-order critical points

In the next two lemmas, we show the following: If \((L, R)\) is first-order critical for \(g_\lambda\), then its factors are balanced (\(L^\top L = R^\top R\)) and the gradient of \(f\) at \(X = LR^\top\) has a specific form that will play well with the subgradient of the nuclear norm.

Lemma 1 (Balanced factors) Let \((L, R)\) be first-order critical for \(g_\lambda\) with \(\lambda > 0\). A compact SVD of \(X = LR^\top\) takes the form \[ X = U \Sigma V^\top, \] where \(U, V\) each have \(k = \rank(X)\) orthonormal columns, and \(\Sigma\) is \(k \times k\), diagonal and positive definite (in particular, invertible). We can select such an SVD and a matrix \(P \in \reals^{r \times k}\) (with \(k \leq r\)) whose columns are orthonormal and such that \[\begin{align*} L = U\Sigma^{1/2}P^\top && \textrm{ and } && R = V\Sigma^{1/2}P^\top. \end{align*}\] In particular, we find that \(L^\top L = R^\top R\) and \(\ker L = \ker R\).

Proof. For convenience, let \(H = -\frac{1}{\lambda} \nabla f(X)\). First-order criticality (Eq. 5) implies \(H^\top L = R\) and \(HR = L\). Thus, \(L^\top L = R^\top H^\top L = R^\top R\). It follows that \(L\) and \(R\) have the same singular values and right singular vectors. Thus, they admit compact SVDs of the form \(L = U S P^\top\) and \(R = V S P^\top\) with orthonormal \(U, V, P\) and diagonal, positive definite \(S\) of size \(k \times k\). Clearly, \(X = LR^\top = U S^2 V^\top\) provides a valid compact SVD of \(X\) upon setting \(\Sigma = S^2\), and then \(\Sigma^{1/2} = S\), so that \(L = U\Sigma^{1/2}P^\top\) and \(R = V\Sigma^{1/2}P^\top\) as announced.

Lemma 2 (Gradient downstairs) If \((L, R)\) is first-order critical for \(g_\lambda\) with \(\lambda > 0\), then \[ \nabla f(X) = -\lambda UV^\top + W \] with \(X = LR^\top = U\Sigma V^\top\) and some \(W \in \Rmn\) which satisfies \(WV = 0\) and \(W^\top U = 0\).

Proof. From Lemma 1, we have \(L = U \Sigma^{1/2} P^\top\) and \(R = V \Sigma^{1/2} P^\top\). Continuing from the proof of that lemma, multiply the equations \(H^\top L = R\) and \(HR = L\) on the right by \(P\Sigma^{-1/2}\). Then, \[\begin{align*} H^\top U = V && \textrm{ and } && HV = U. \end{align*}\] This implies that the columns of \(U\) and \(V\) are left and right singular vectors of \(H\) for the singular value 1. All other singular vectors of \(H\) are orthogonal to those. Thus, \(H = UV^\top + \tilde W\) where \(\tilde W V = 0\) and \(\tilde W^\top U = 0\). Substitute \(H = -\frac{1}{\lambda} \nabla f(X)\) to conclude.
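
To see Lemmas 1 and 2 in action, take the illustrative cost \(f(X) = \frac{1}{2}\sqfrobnorm{X - A}\) (an assumption of ours, not used in the proofs). For that \(f\), the matrix \(X\) obtained by soft-thresholding the singular values of \(A\) at level \(\lambda\), with balanced factors \(L = U\Sigma^{1/2}\) and \(R = V\Sigma^{1/2}\), satisfies the criticality conditions (Eq. 5) exactly, so the hypothetical numpy check below can verify both lemmas directly.

import numpy as np

rng = np.random.default_rng(3)
m, n, lam = 7, 5, 1.0
A = rng.standard_normal((m, n))

# For f(X) = 0.5*||X - A||_F^2, soft-thresholding the singular values of A at lam
# gives an X whose balanced factors satisfy the criticality conditions (Eq. 5) exactly.
Ua, s, Vat = np.linalg.svd(A, full_matrices=False)
keep = s > lam                      # generically, no singular value equals lam exactly
U, V, sig = Ua[:, keep], Vat[keep, :].T, s[keep] - lam
X = (U * sig) @ V.T
L, R = U * np.sqrt(sig), V * np.sqrt(sig)   # L = U Sigma^{1/2}, R = V Sigma^{1/2} (here P = I)
G = X - A                                   # grad f(X)

# Eq. 5 holds (up to rounding):
print(np.linalg.norm(G @ R + lam * L), np.linalg.norm(G.T @ L + lam * R))

# Lemma 1: the factors are balanced.
print(np.linalg.norm(L.T @ L - R.T @ R))

# Lemma 2: grad f(X) = -lam*U*V^T + W with W V = 0 and W^T U = 0.
W = G + lam * U @ V.T
print(np.linalg.norm(W @ V), np.linalg.norm(W.T @ U))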

Non-maximal rank case

Let \(X = LR^\top = U\Sigma V^\top\) as before, with \((L, R)\) second-order critical for \(g_\lambda\). If \(\rank(X) < r\), we know from Lemma 1 that \(L\) and \(R\) both have the same non-trivial kernel. Thus, we can select a vector \(z \in \reals^r\) such that \[\begin{align*} Lz = 0, && Rz = 0 && \textrm{ and } && \|z\| = 1. \end{align*}\] Now, with arbitrary unit-norm vectors \(x \in \Rm\) and \(y \in \Rn\), form the following direction along which we shall perturb \((L, R)\): \[\begin{align*} \dot L = xz^\top && \textrm{ and } && \dot R = yz^\top. \end{align*}\] This is convenient because it then follows that \(\sqfrobnormsmall{\dot L} = \sqfrobnormsmall{\dot R} = 1\), and \(\dot L \dot R^\top = xy^\top\), and, importantly, \[ \dot X = L\dot R^\top + \dot L R^\top = 0. \] Therefore, the second-order conditions in (Eq. 6) imply \[ 0 \leq x^\top \nabla f(X) y + \lambda \] for all \(x, y\). Recall from Lemma 2 that \[ \nabla f(X) = -\lambda UV^\top + W \] with some \(W \in \Rmn\). If \(W\) is nonzero, pick \(x, y\) as a dominant singular vector pair of \(W\) with signs such that \(x^\top W y = -\opnorm{W}\). Notice that \(V^\top y = 0\) and \(x^\top U = 0\) since \(WV = 0\) and \(W^\top U = 0\). Thus, the inequality above becomes \[ \opnorm{W} \leq \lambda. \] (This also holds if \(W = 0\).) Conveniently, the subdifferential of the nuclear norm at \(X\) is as follows (see (Watson 1992)): \[ \partial \nucnorm{X} = \{ UV^\top\! + \tilde W \, | \, \tilde W V = 0, \tilde W^\top U = 0 \textrm{ and } \opnorm{\tilde W} \leq 1 \}. \qquad(7)\] Thus, \(-\nabla f(X)\) belongs to the subdifferential of \(\lambda \nucnorm{\cdot}\) at \(X\). In other words, \(X\) is stationary for \(f_\lambda\) as defined in (Eq. 1), even without the rank constraint. Thus, \(X\) is a fortiori stationary for \(f_\lambda\) constrained to the matrices of rank at most \(r\). (Indeed, adding constraints only makes it easier for a point to be stationary.)
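
Continuing the same illustrative example (\(f(X) = \frac{1}{2}\sqfrobnorm{X - A}\), a hypothetical numpy sketch, not part of the argument), we can verify the certificate numerically: at the soft-thresholded critical point, the matrix \(W\) from Lemma 2 indeed satisfies \(\opnorm{W} \leq \lambda\), so that \(-\nabla f(X)\) belongs to the subdifferential of \(\lambda \nucnorm{\cdot}\) at \(X\) as per Eq. 7.

import numpy as np

rng = np.random.default_rng(3)
m, n, lam = 7, 5, 2.0
A = rng.standard_normal((m, n))

# Same construction as above: X obtained by soft-thresholding the singular values of A.
# With lam = 2, typically several singular values of A fall below lam and are dropped,
# so rank(X) < min(m, n): with any r > rank(X) we are in the non-maximal rank case.
Ua, s, Vat = np.linalg.svd(A, full_matrices=False)
keep = s > lam
U, V, sig = Ua[:, keep], Vat[keep, :].T, s[keep] - lam
X = (U * sig) @ V.T
W = (X - A) + lam * U @ V.T        # residual from Lemma 2: grad f(X) = -lam*U*V^T + W

# The certificate: ||W||_op <= lam, hence -grad f(X) is a subgradient of lam*||.||_* at X (Eq. 7).
print(np.linalg.norm(W, 2), lam)   # the first number is at most the second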

Maximal rank case

Fix \((L, R)\) first-order critical for \(g_\lambda\). Suppose \(X = LR^\top\) has maximal rank \(r\). Then the rank constraint in problem (Eq. 2) is active, and therefore we must take it into account when discussing stationarity of \(X\).

Fortunately, the set of matrices of size \(m \times n\) and rank exactly \(r\)—which we shall denote by \(\calM_r\)—is a smooth manifold embedded in \(\Rmn\), and it is open within the set of matrices of rank at most \(r\) (indeed, small perturbations of a matrix of rank \(r\) cannot have rank smaller than \(r\)). Moreover, \(f_\lambda\) is differentiable on \(\calM_r\) (as we shall confirm below).

This makes it easy to define stationarity: \(X\) is stationary for (Eq. 2) if \(\grad f_\lambda\)—the Riemannian gradient of \(f_\lambda\) restricted to the Riemannian submanifold \(\calM_r\) embedded in \(\Rmn\)—is zero. (See for example (Boumal 2023, sec. 7.5).)

Since \(f_\lambda = f + \lambda \nucnorm{\cdot}\) is a sum of two components, the same is true of its Riemannian gradient. Assuming \(f\) is continuously differentiable, the Riemannian gradient of \(f\) restricted to \(\calM_r\) is \[ \Proj_X(\nabla f(X)), \] where \(\nabla f(X)\) is the Euclidean gradient of \(f\) on \(\Rmn\) and \(\Proj_X\) is the orthogonal projector from \(\Rmn\) to the tangent space of \(\calM_r\) at \(X\).

As \((L, R)\) is first-order critical, Lemma 2 implies \(\nabla f(X) = -\lambda UV^\top + W\). The component \(UV^\top\) is already tangent at \(X\), and \(W\) is orthogonal to the tangent space (Boumal 2023, sec. 7.5). Therefore, \[ \grad f(X) = \Proj_X(\nabla f(X)) = -\lambda UV^\top. \] It remains to obtain the gradient of the nuclear norm part of \(f_\lambda\). Conveniently, even though the nuclear norm is not differentiable over all of \(\Rmn\), it is differentiable on each manifold of constant rank.

Lemma 3 The Riemannian gradient of the nuclear norm restricted to the rank-\(r\) matrices is \[ \grad(\nucnorm{\cdot})(X) = UV^\top, \] where \(X = U \Sigma V^\top\) is any compact SVD of \(X\).

Proof. Use for example (Yang, Zhang, and Song 2014, sec. 5). There, it is stated that the (Riemannian) subdifferential of \(\nucnorm{\cdot}\) restricted to \(\calM_r\) is simply the orthogonal projection of the subdifferential of \(\nucnorm{\cdot}\) in \(\Rmn\) to the tangent space of \(\calM_r\) at \(X\). Yet, owing to Eq. 7, the subgradients of the nuclear norm are exactly of the form “something tangent at \(X\)” (namely, \(UV^\top\)) + “something orthogonal to the tangent space at \(X\)” (namely, the \(W\) matrices). Thus, the orthogonal projection of that entire subdifferential to the tangent space at \(X\) is the singleton \(\{UV^\top\}\).

Therefore, when \((L, R)\) is stationary for \(g_\lambda\), the Riemannian gradient of \(f_\lambda\) at \(X = LR^\top\) is \[ \grad f_\lambda(X) = \grad f(X) + \lambda \grad(\nucnorm{\cdot})(X) = -\lambda UV^\top + \lambda UV^\top = 0, \] and it follows that \(X\) is stationary for (Eq. 2).
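
Here is a matching numerical check for the maximal rank case, again a hypothetical numpy sketch with the illustrative cost \(f(X) = \frac{1}{2}\sqfrobnorm{X - A}\). It uses the standard orthogonal projector onto the tangent space of \(\calM_r\) at \(X = U\Sigma V^\top\) (see (Boumal 2023, sec. 7.5)), namely \(\Proj_X(Z) = UU^\top Z + ZVV^\top - UU^\top Z VV^\top\), to confirm that \(UV^\top\) is tangent at \(X\) and that \(\grad f_\lambda(X) = 0\).

import numpy as np

rng = np.random.default_rng(4)
m, n, lam = 7, 5, 1.0
A = rng.standard_normal((m, n))

# A critical point of g_lambda for f(X) = 0.5*||X - A||_F^2, with r set equal to rank(X)
# so that the rank constraint in Eq. 2 is active.
Ua, s, Vat = np.linalg.svd(A, full_matrices=False)
keep = s > lam
U, V, sig = Ua[:, keep], Vat[keep, :].T, s[keep] - lam
X = (U * sig) @ V.T
G = X - A                          # Euclidean gradient of f at X

# Orthogonal projector onto the tangent space of the fixed-rank manifold at X
# (Boumal 2023, sec. 7.5): Proj_X(Z) = U U^T Z + Z V V^T - U U^T Z V V^T.
def proj(Z):
    return U @ (U.T @ Z) + (Z @ V) @ V.T - U @ (U.T @ Z @ V) @ V.T

print(np.linalg.norm(proj(U @ V.T) - U @ V.T))   # U V^T is tangent at X (cf. Lemma 3)
print(np.linalg.norm(proj(G) + lam * U @ V.T))   # grad f_lambda(X) = grad f(X) + lam*U*V^T = 0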

Other nonsmooth regularizers?

As discussed in another post, an analogous result holds for the 1-norm regularizer through a difference-of-Hadamard lift. (Additional such lifts are discussed, for example, in (Kolb et al. 2023).)

With these examples at hand, one wonders if there is a broader story, e.g., within the formalism discussed in (Levin, Kileel, and Boumal 2024). So let us close with a question:

Say we are minimizing a cost function \(f\) with a (possibly nonsmooth) regularizer \(r(x)\) over a (possibly nonsmooth) set \(\calX\), and that we have a smooth, surjective lift \(\varphi \colon \calM \to \calX\) and a smooth function \(\rho(y)\) on \(\calM\) such that \(r(x) = \inf_{y : \varphi(y) = x} \rho(y)\). Is it always the case that if \(y\) is second-order critical for \(f \circ \varphi + \rho\) then \(x = \varphi(y)\) is stationary for \(f + r\) on \(\calX\)? What extra conditions might we need? What are other relevant examples?

References

Boumal, N. 2023. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press. https://doi.org/10.1017/9781009166164.
Ha, W., H. Liu, and R. Foygel Barber. 2020. “An Equivalence Between Critical Points for Rank Constraints Versus Low-Rank Factorizations.” SIAM Journal on Optimization 30 (4): 2927–55. https://doi.org/10.1137/18m1231675.
Kolb, C., C. L. Müller, B. Bischl, and D. Rügamer. 2023. “Smoothing the Edges: Smooth Optimization for Sparse Regularization Using Hadamard Overparametrization.” arXiv 2307.03571. https://doi.org/10.48550/arXiv.2307.03571.
Lee, C.-P., L. Liang, T. Tang, and K.-C. Toh. 2023. “Accelerating Nuclear-Norm Regularized Low-Rank Matrix Optimization Through Burer–Monteiro Decomposition.” arXiv 2204.14067. https://doi.org/10.48550/arXiv.2204.14067.
Levin, E., J. Kileel, and N. Boumal. 2024. “The Effect of Smooth Parametrizations on Nonconvex Optimization Landscapes.” Mathematical Programming. https://doi.org/10.1007/s10107-024-02058-3.
Li, Q., Z. Zhu, and G. Tang. 2019. “The Non-Convex Geometry of Low-Rank Matrix Optimization.” Information and Inference: A Journal of the IMA 8 (1): 51–96. https://doi.org/10.1093/imaiai/iay003.
McRae, A. D. 2024. “Low Solution Rank of the Matrix LASSO Under RIP with Consequences for Rank-Constrained Algorithms.” arXiv 2404.12828.
Mishra, B., G. Meyer, F. Bach, and R. Sepulchre. 2013. “Low-Rank Optimization with Trace Norm Penalty.” SIAM Journal on Optimization 23 (4): 2124–49. https://doi.org/10.1137/110859646.
Srebro, N., J. Rennie, and T. Jaakkola. 2004. “Maximum-Margin Matrix Factorization.” In Advances in Neural Information Processing Systems, edited by L. Saul, Y. Weiss, and L. Bottou. Vol. 17. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2004/file/e0688d13958a19e087e123148555e4b4-Paper.pdf.
Watson, G. A. 1992. “Characterization of the Subdifferential of Some Matrix Norms.” Linear Algebra and Its Applications 170 (June): 33–45. https://doi.org/10.1016/0024-3795(92)90407-2.
Yang, W. H., L.-H. Zhang, and R. Song. 2014. “Optimality Conditions for the Nonlinear Programming Problems on Riemannian Manifolds.” Pacific Journal of Optimization 10 (2): 415–34.

Citation

BibTeX citation:
@online{boumal2024,
  author = {Boumal, Nicolas and McRae, Andrew D.},
  title = {The Usual Smooth Lift of the Nuclear Norm Regularizer Enjoys
    \$2 {\textbackslash Rightarrow} 1\$},
  date = {2024-10-03},
  url = {racetothebottom.xyz/posts/lift-regularizer-nuclear/},
  langid = {en},
  abstract = {Minimizing \$f(X)\$ with a nuclear-norm regularizer is
    equivalent to minimizing \$g(L, R) = f(LR\^{}\textbackslash top)\$
    with the squared Frobenius norm regularizer. To this classical
    result, we add that second-order critical points \$(L, R)\$ map to
    first-order stationary points \$LR\^{}\textbackslash top\$ under a
    bounded-rank constraint.}
}