Benign landscape of the Brockett cost function: asymmetric case

Author

Nicolas Boumal

Published

April 28, 2024

Abstract
The second-order critical points of \(f(X, Y) = \trace(AX\transpose BY)\) for \(X, Y\) orthogonal are globally optimal. This implies a number of equally well-known corollaries, including benign non-convexity of \(f(X) = \trace(AX)\) and \(f(X) = \sqfrobnorm{A-X}\) for \(X\) orthogonal.

The fact

Let \(A, B\) be two arbitrary matrices of size \(n \times n\) (not necessarily symmetric). Some notation:

  • \(\sigma_1 \geq \cdots \geq \sigma_n\) are the singular values of \(A\).
  • \(\mu_1 \geq \cdots \geq \mu_n\) are the singular values of \(B\).
  • \(\inner{A}{B} = \trace(A\transpose B)\) is the usual inner product for matrices.
  • \(\On = \{ X \in \Rnn : X\transpose X = I_n\}\) is the orthogonal group.
  • \(\SOn = \{ X \in \On : \det(X) = +1\}\) is the special orthogonal group (rotation group).

The following is well known. It is an asymmetric version of the analogous result detailed in a previous post about the Brockett cost function \(X \mapsto \innersmall{A}{X\transpose B X}\).

Theorem 1 (Benign landscape) Consider the problem of minimizing the asymmetric Brockett cost function, namely, \[\begin{align} f \colon \SOn \times \SOn \to \reals, \qquad f(X, Y) = \innersmall{A}{X\transpose BY}. \end{align} \tag{1}\] Its second-order critical points are its global minimizers. The minimal value is \[\begin{align} -\left( (-1)^{n} \sign(\det(AB)) \sigma_{n}\mu_{n} + \sum_{i = 1}^{n-1} \sigma_{i} \mu_{i}\right). \end{align}\]

In particular, \(f\) has the so-called strict-saddle property: if \((X, Y)\) is a critical point, then either it is a global minimum or the Riemannian Hessian of \(f\) at \((X, Y)\) has a negative eigenvalue (and hence reasonable optimization algorithms are overwhelmingly likely to find a global optimum).
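Before getting to the proof, here is a quick numerical sanity check of the value claimed in Theorem 1. This is my own NumPy sketch (not part of the post's argument): it builds a candidate minimizer from the SVDs of \(A\) and \(B\), with diagonal sign corrections chosen so that the candidate lies in \(\SOn \times \SOn\), then confirms that random points of \(\SOn \times \SOn\) never do better.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(n):
    """A random element of SO(n): QR of a Gaussian matrix, signs fixed."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    Q = Q * np.sign(np.diag(R))          # Haar-distributed on O(n)
    if np.linalg.det(Q) < 0:
        Q[:, -1] = -Q[:, -1]             # flip one column to land in SO(n)
    return Q

def f(A, B, X, Y):
    """The asymmetric Brockett cost <A, X^T B Y>."""
    return np.trace(A.T @ X.T @ B @ Y)

n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
sigma = np.linalg.svd(A, compute_uv=False)   # sigma_1 >= ... >= sigma_n
mu = np.linalg.svd(B, compute_uv=False)

s = (-1.0)**n * np.sign(np.linalg.det(A @ B))
min_value = -(s * sigma[-1] * mu[-1] + np.dot(sigma[:-1], mu[:-1]))

# Candidate optimum: align both SVDs, with diagonal sign matrices chosen so
# that X and Y are rotations and each product sigma_i mu_i picks up a minus
# sign (except possibly the last one, when parity forbids it).
UA, _, VAt = np.linalg.svd(A)                # A = UA @ diag(sigma) @ VAt
UB, _, VBt = np.linalg.svd(B)
r = -np.ones(n)
if np.sign(np.linalg.det(A @ B)) != (-1.0)**n:
    r[-1] = 1.0
p = np.ones(n)
p[0] = np.linalg.det(UA) * np.linalg.det(UB)  # forces det(X) = +1
q = r * p                                     # then det(Y) = +1 automatically
X = UB @ np.diag(p) @ UA.T                    # i.e., X^T = UA @ diag(p) @ UB^T
Y = VBt.T @ np.diag(q) @ VAt

never_below = all(f(A, B, random_rotation(n), random_rotation(n))
                  >= min_value - 1e-12 for _ in range(1000))
```

The candidate attains `min_value` exactly (up to roundoff), and no sampled pair of rotations goes below it.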

Much has been known for a long time about the optimal solutions of that problem. They become quite clear once you diagonalize \(A\) and \(B\) (which we will do too). The goal here is to focus on the landscape of \(f\), not just on the solution of the problem. This is spelled out less often.

von Neumann (1937) established the optimal value of \(f\) to prove his celebrated trace inequality. As far as I could tell, his proof does not imply that the second-order critical points are optimal. Helmke and Moore (1992) provide a proof of Theorem 1 with some assumptions on singular values of \(A\) or \(B\) being distinct. Please write to me if you know a reference that implies the theorem as is, without assumptions on \(A, B\) (surely, an old one exists).

It is useful to prove the theorem for all \(A, B\) to enable a collection of corollaries listed later.

Proof structure

We follow the same proof structure as in the post about the Brockett cost function: see that one for additional high-level explanations. It goes as follows:

  1. Reduce the problem to simpler data, that is, diagonalize \(A, B\) via a change of variable.
  2. Write down the first- and second-order necessary optimality conditions.
  3. The first-order conditions provide equalities: extract structure from them.
  4. The second-order conditions provide a family of inequalities: identify some that are useful.
  5. Show that any second-order critical point \((X, Y)\) can be transformed into another one without changing the value of \(f\), and argue that this value is independent of \((X, Y)\).
  6. Conclude as follows:
    1. Note that all second-order critical points have the same cost function value, and
    2. Observe that at least one of them is optimal since \(\SOn \times \SOn\) is compact,
    so that all second-order critical points are optimal.

Extension from \(\SOn\) to \(\On\)

Above, \(f\) is defined on \(\SOn \times \SOn\). Sometimes we care about \(f\) on the larger space \(\On \times \On\). The latter has four connected components, one of which is \(\SOn \times \SOn\).

It is convenient to split \(\On \times \On\) into two parts, \(\calM_+\) and \(\calM_-\), with: \[\begin{align} \calM_\pm & = \{ (X, Y) \in \On \times \On : \det(X) = \pm \det(Y) \}. \end{align}\]

Corollary 1 (Extension to \(\On\)) Extend the domain of \(f\) to \(\On \times \On\) in Theorem 1.

Second-order critical points for \(f|_{\calM_+}\) are optimal, and likewise second-order critical points for \(f|_{\calM_-}\) are optimal. The minimal value on \(\calM_\pm\) is the same as above only with \(\sign(\det(AB))\) replaced by \(\pm\sign(\det(AB))\).

Thus, the minimal value of \(f\) on \(\On \times \On\) is \(-\sum_{i = 1}^{n} \sigma_{i} \mu_{i}\), attained either on \(\calM_+\) or \(\calM_-\) (or both if \(\sigma_{n}\mu_{n} = 0\), that is, if \(AB\) is singular).

Proof. The manifold \(\On \times \On\) has four connected components. The theorem elucidates the landscape of \(f\) on one of them, namely, \(\{(X, Y) : \det(X) = \det(Y) = +1\}\). To control the landscape on each of the other components, it is enough to set up a diffeomorphism to \(\SOn \times \SOn\). For example, \(\{(X, Y) : \det(X) = -1, \det(Y) = +1\}\) is mapped diffeomorphically to \(\SOn \times \SOn\) by \((X, Y) \mapsto (XJ_n, Y)\), where \(J_n\) denotes the diagonal matrix of size \(n\) with diagonal entries \(1, \ldots, n-1\) set to 1 and the last diagonal entry set to \(-1\). Thus, the landscape of \(f\) on the former is identical to the landscape of \(f\) on the latter with \((A, B)\) replaced by \((J_n A, B)\). This change of data leaves the singular values unchanged and flips the sign as \(\det(J_nAB) = -\det(AB)\).
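The conclusion of Corollary 1 is easy to probe numerically. In this sketch of mine, the candidate below aligns both SVDs with a global minus sign, which is allowed on \(\On \times \On\) since there is no determinant constraint, and attains \(-\sum_i \sigma_i \mu_i\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
sigma = np.linalg.svd(A, compute_uv=False)
mu = np.linalg.svd(B, compute_uv=False)
target = -np.dot(sigma, mu)              # claimed minimum over O(n) x O(n)

UA, _, VAt = np.linalg.svd(A)            # A = UA @ diag(sigma) @ VAt
UB, _, VBt = np.linalg.svd(B)
X = UB @ UA.T                            # i.e., X^T = UA @ UB^T
Y = -VBt.T @ VAt                         # minus sign pairs each sigma_i mu_i with -1
attained = np.trace(A.T @ X.T @ B @ Y)

def random_orthogonal(n):
    """Haar-random element of O(n)."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

never_below = all(
    np.trace(A.T @ random_orthogonal(n).T @ B @ random_orthogonal(n))
    >= target - 1e-12 for _ in range(1000))
```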

Corollaries

Similarly to the post about the Brockett cost function, it is easy to derive a number of corollaries from Theorem 1.

For starters, linear functions over \(\SOn\) have a benign landscape. This could have been proved more directly, of course. See (Brockett 1989, Thm. 3) for the special case where \(A\) has distinct, positive singular values; and (Bozma, Gillam, and Öztürk 2019, sec. 3) for a full study of the Hessian at all critical points, with an older reference to (Frankel 1965).

Corollary 2 For all \(A \in \Rnn\), the second-order critical points of \[ g \colon \SOn \to \reals, \qquad g(Q) = \innersmall{A}{Q} \] are its global minimizers.

Proof. Let \(B = I_n\). Then, \(f(X, Y) = \innersmall{A}{X\transpose BY} = \innersmall{A}{X\transpose Y} = g(X\transpose Y)\). The map \(\varphi \colon \SOn \times \SOn \to \SOn \colon (X, Y) \mapsto X\transpose Y\) is surjective and smooth. We have that \(f = g \circ \varphi\). We can now reason as follows (see for example (Levin, Kileel, and Boumal 2024)). If \(Q\) is second-order critical for \(g\), then:

  • Any \((X, Y) \in \varphi^{-1}(Q)\) is second-order critical for \(f\).

    (To see this, consider any curve \(c \colon \reals \to \SOn \times \SOn\) with \(c(0) = (X, Y)\) and check that \((f \circ c)'(0) = 0\) and \((f \circ c)''(0) \geq 0\) owing to \(f \circ c = g \circ \gamma\) with \(\gamma = \varphi \circ c \colon \reals \to \SOn\).)

  • By Theorem 1, that implies \((X, Y)\) is a global minimizer for \(f\).

  • This implies \(Q = \varphi(X, Y)\) is a global minimizer for \(g\) since \(\varphi\) is a surjection.
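Theorem 1 with \(B = I_n\) also yields the minimal value of \(g\), namely \(-\left((-1)^n \sign(\det(A))\, \sigma_n + \sum_{i=1}^{n-1} \sigma_i\right)\). Here is a small numerical check of that value (my own sketch; the construction of the attaining rotation is mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
sigma = np.linalg.svd(A, compute_uv=False)
s = (-1.0)**n * np.sign(np.linalg.det(A))
g_min = -(s * sigma[-1] + np.sum(sigma[:-1]))

# Candidate: Q = U @ diag(r) @ V^T with r = (-1, ..., -1), flipping the last
# sign if needed so that det(Q) = +1.
U, _, Vt = np.linalg.svd(A)              # A = U @ diag(sigma) @ Vt
r = -np.ones(n)
if np.sign(np.linalg.det(A)) != (-1.0)**n:
    r[-1] = 1.0
Q = U @ np.diag(r) @ Vt
attained = np.trace(A.T @ Q)             # equals sum_i sigma_i r_i

def random_rotation(n):
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    Q = Q * np.sign(np.diag(R))
    if np.linalg.det(Q) < 0:
        Q[:, -1] = -Q[:, -1]
    return Q

never_below = all(np.trace(A.T @ random_rotation(n)) >= g_min - 1e-12
                  for _ in range(1000))
```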

Metric projection to \(\SOn\) can be computed using the SVD: see the Kabsch algorithm. From the above, we now deduce that the corresponding optimization problem has a benign landscape. Metric projection to \(\On\) corresponds to the polar factor: we get a similar result for this one too upon considering the two components of \(\On\) separately.

Corollary 3 For all \(A \in \Rnn\), the second-order critical points of \[ h \colon \SOn \to \reals, \qquad h(Q) = \sqfrobnorm{A - Q} \] are its global minimizers.

Proof. Apply Corollary 2 after noting that \(h(Q) = \sqfrobnormsmall{A} + \sqfrobnormsmall{Q} - 2\innersmall{A}{Q}\) and the first two terms are constant.
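As a numerical illustration (my own sketch, not from the post): the Kabsch-style projection of \(A\) to \(\SOn\) is \(Q^\star = U \diag(1, \ldots, 1, \det(UV\transpose)) V\transpose\) from an SVD \(A = U\Sigma V\transpose\), and no sampled rotation gets closer to \(A\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n))
U, _, Vt = np.linalg.svd(A)              # A = U @ diag(s) @ Vt
d = np.ones(n)
d[-1] = np.sign(np.linalg.det(U @ Vt))   # flip last sign to force det = +1
Q_star = U @ np.diag(d) @ Vt             # nearest rotation to A

def random_rotation(n):
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    Q = Q * np.sign(np.diag(R))
    if np.linalg.det(Q) < 0:
        Q[:, -1] = -Q[:, -1]
    return Q

best = np.linalg.norm(A - Q_star)**2
no_better = all(np.linalg.norm(A - random_rotation(n))**2 >= best - 1e-12
                for _ in range(1000))
```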

The next corollary notes that \(f\) is an effective cost function to compute a singular value decomposition (or a truncated version of it) for a given matrix \(B\). This is part of the original motivation for its study.

Corollary 4 If \(A\) is diagonal with sorted entries (e.g., \(A = \diag(n, \ldots, 1)\)), then the columns of optimal \(X\) and \(Y\) reveal singular vectors of \(B\), sorted.

Proof. In the proof of Theorem 1 below, track the characterization of second-order critical points. Alternatively, use the known value of \(f\) at global optima to justify that \(X, Y\) composed of singular vectors of \(B\) (properly sorted according to singular values) is optimal.

Corollary 5 It is clear that all of the above, including Theorem 1, extends to maximization instead of minimization (simply replace \(A\) by \(-A\) and track consequences).

Moreover, many of the results above extend to optimization on Stiefel manifolds \(\Stnp\) (matrices of size \(n \times p\) with orthonormal columns) by zero-padding \(A, B\) and using the fact that the map \(\varphi \colon \SOn \to \Stnp\) which extracts the \(p < n\) first columns of a rotation matrix is a surjection (together with the same reasoning as in the proof of Corollary 2).

It is an exercise to deduce von Neumann’s matrix trace inequality from Theorem 1.
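The inequality in question, \(|\innersmall{A}{B}| \leq \sum_i \sigma_i \mu_i\), is easy to check on random matrices. This is a quick sketch of mine, not a substitute for the exercise:

```python
import numpy as np

rng = np.random.default_rng(4)
ok = True
for _ in range(1000):
    n = int(rng.integers(1, 7))
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    sigma = np.linalg.svd(A, compute_uv=False)
    mu = np.linalg.svd(B, compute_uv=False)
    # von Neumann: |trace(A^T B)| <= sum_i sigma_i mu_i
    ok = ok and abs(np.trace(A.T @ B)) <= np.dot(sigma, mu) + 1e-9
```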

Reduce the problem to diagonal \(A, B\)

The main theorem only references how the minima and second-order critical points of \(f\) relate to the singular values of \(A\) and \(B\). Thus, in the process of proving the theorem, we are allowed to replace \(A\) and \(B\) as long as we do not change that relation.

Lemma 1 It is sufficient to prove Theorem 1 for the case where \(A\) and \(B\) are diagonal with sorted entries.

Proof. Let \(A = U \Sigma V\transpose\) be an SVD of \(A\), so that \(U, V\) are in \(\On\) and \(\Sigma = \diag(\sigma_{1}, \ldots, \sigma_{n})\). Given a matrix \(Q \in \On\), let \(J_Q = \diag(1, \ldots, 1, \det(Q))\) (so that \(\det(J_Q) = \det(Q)\)). Note that \(A = (U J_U) J_U \Sigma J_V (V J_V)\transpose\). Let \(\tilde U = U J_U\) and \(\tilde V = V J_V\): these are in \(\SOn\). Since \(\Sigma\) is diagonal, \(A = \tilde U (\Sigma J_{UV}) \tilde V\transpose\). It follows that, for all \(X, Y \in \SOn\), \[\begin{align} f(X, Y) = \innersmall{A}{X\transpose B Y} & = \innersmall{\tilde U (\Sigma J_{UV}) \tilde V\transpose}{X\transpose B Y} \nonumber \\ & = \innersmall{\Sigma}{(X\tilde U)\transpose (BJ_{UV}) J_{UV} (Y\tilde V) J_{UV}} \nonumber \\ & = \innersmall{\Sigma}{\tilde X\transpose (B J_{UV}) \tilde Y}, \end{align}\] where we let \(\tilde X = X\tilde U\) and \(\tilde Y = J_{UV} Y\tilde V J_{UV}\). The change of variable from \((X, Y)\) to \((\tilde X, \tilde Y)\) is a diffeomorphism on \(\SOn \times \SOn\): it does not affect the landscape of \(f\), in the sense that first-order critical points are mapped to first-order critical points, and likewise for second-order critical points, local minima and global minima. Also, the optimal value is unchanged, and the data \((A, B)\) become \((\Sigma, BJ_{UV})\): their singular values are unchanged, as is the determinant of their product since \[\begin{align} \det(AB) = \det(\Sigma) \det(UV\transpose) \det(B) = \det(\Sigma) \det(BJ_{UV}) = \det(\Sigma B J_{UV}). \end{align}\] This confirms that we can assume \(A\) is diagonal. Now do it again to argue we can assume \(B\) is diagonal.
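The chain of identities in this proof is easy to check numerically. The sketch below (mine; the helper names are mine too) verifies that \(f_{A,B}(X, Y) = f_{\Sigma,\, BJ_{UV}}(\tilde X, \tilde Y)\) for random rotations \(X, Y\):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
U, svals, Vt = np.linalg.svd(A)          # A = U @ Sigma @ V^T
V = Vt.T
Sigma = np.diag(svals)

def J_of(Q):
    """J_Q = diag(1, ..., 1, det(Q)) for orthogonal Q."""
    J = np.eye(len(Q))
    J[-1, -1] = np.sign(np.linalg.det(Q))
    return J

JU, JV = J_of(U), J_of(V)
J = JU @ JV                              # this is J_{UV}
Utilde, Vtilde = U @ JU, V @ JV          # both in SO(n)

def random_rotation(n):
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    Q = Q * np.sign(np.diag(R))
    if np.linalg.det(Q) < 0:
        Q[:, -1] = -Q[:, -1]
    return Q

match = True
for _ in range(100):
    X, Y = random_rotation(n), random_rotation(n)
    Xt, Yt = X @ Utilde, J @ Y @ Vtilde @ J   # the change of variable
    lhs = np.trace(A.T @ X.T @ B @ Y)         # f with data (A, B) at (X, Y)
    rhs = np.trace(Sigma.T @ Xt.T @ (B @ J) @ Yt)  # f with (Sigma, B J_{UV})
    match = match and np.isclose(lhs, rhs)
```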

Accordingly, for the remainder of the proof we assume \[ A = \diag(\sigma_{1}, \ldots, \sigma_{n}) \qquad \textrm{ and } \qquad B = \diag(\mu_{1}, \ldots, \mu_{n}). \tag{2}\] We won’t use the fact that \(B\) is diagonal, but it’s free.

Write down necessary optimality conditions

The first-order optimality conditions for \(f\) can be obtained by deriving an expression for the Riemannian gradient of \(f\) and requiring it to be zero. Likewise, second-order optimality conditions follow by expressing that the Riemannian Hessian is positive semidefinite. See this post about optimality conditions for functions on \(\On\): it extends without friction to optimization on \(\SOn \times \SOn\), leading to the two lemmas below. See also (Boumal 2023, Table 7.2) for additional background on working with product manifolds.

Lemma 2 (First-order conditions) \((X, Y) \in \SOn \times \SOn\) is a first-order critical point for \(f\) if and only if \[\begin{align} (X\transpose B Y) \cdot A\transpose & = A \cdot (X\transpose B Y)\transpose, \textrm{ and } \\ (X\transpose B Y)\transpose \cdot A & = A\transpose \cdot (X\transpose B Y). \end{align} \tag{3}\]

Proof. We have \(f(X, Y) = \innersmall{A}{X\transpose BY}\). Its Euclidean gradient is \(\nabla f(X, Y) = (BYA\transpose, B\transpose XA)\). Plug this into the first theorem of that post, adapted to \(\SOn \times \SOn\).
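The Euclidean gradient claimed in this proof can be checked against a finite difference (a sketch of mine; since \(f\) is bilinear in \((X, Y)\), the central difference is exact up to roundoff):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))   # gradient check needs no orthogonality
Y = rng.standard_normal((n, n))

def f(X, Y):
    return np.trace(A.T @ X.T @ B @ Y)

GX, GY = B @ Y @ A.T, B.T @ X @ A        # claimed Euclidean gradient
dX, dY = rng.standard_normal((n, n)), rng.standard_normal((n, n))
h = 1e-6
fd = (f(X + h * dX, Y + h * dY) - f(X - h * dX, Y - h * dY)) / (2 * h)
exact = np.sum(GX * dX) + np.sum(GY * dY)   # <grad f, (dX, dY)>
```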

Lemma 3 (Second-order conditions) A first-order critical point \((X, Y) \in \SOn \times \SOn\) is second-order critical for \(f\) if and only if \[ \innerbig{\Omega^2 A + A\Theta^2 - 2\Omega A \Theta}{X\transpose B Y} \geq 0 \tag{4}\] for all skew-symmetric \(\Omega, \Theta\) of size \(n \times n\).

Proof. For \(f(X, Y) = \innersmall{A}{X\transpose BY}\), the Euclidean Hessian is \(\nabla^2 f(X, Y)[\dot X, \dot Y] = (B\dot YA\transpose, B\transpose \dot XA)\). Plug this with \(\dot X = X\Omega\) and \(\dot Y = Y\Theta\) into the second theorem of the same post adapted to \(\SOn \times \SOn\), and reorganize the expression to isolate \(X\transpose B Y\).
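Behind Lemma 3 is the computation \((f \circ c)''(0) = \innersmall{\Omega^2 A + A\Theta^2 - 2\Omega A \Theta}{X\transpose B Y}\) along curves \(c(t) = (X e^{t\Omega}, Y e^{t\Theta})\). That identity can be checked numerically, as in this sketch of mine (the exponentials are Taylor-expanded to second order, which suffices for a second derivative at \(t = 0\)):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

def random_rotation(n):
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    Q = Q * np.sign(np.diag(R))
    if np.linalg.det(Q) < 0:
        Q[:, -1] = -Q[:, -1]
    return Q

def random_skew(n):
    S = rng.standard_normal((n, n))
    return S - S.T

X, Y = random_rotation(n), random_rotation(n)
Om, Th = random_skew(n), random_skew(n)
M = X.T @ B @ Y

# Quadratic form from Eq. 4, with <P, Q> = trace(P^T Q).
form = np.trace((Om @ Om @ A + A @ Th @ Th - 2.0 * Om @ A @ Th).T @ M)

def f_along_curve(t):
    """f at (X e^{t Om}, Y e^{t Th}), exponentials truncated to order t^2."""
    RX = np.eye(n) + t * Om + 0.5 * t * t * Om @ Om
    RY = np.eye(n) + t * Th + 0.5 * t * t * Th @ Th
    return np.trace(A.T @ (X @ RX).T @ B @ (Y @ RY))

h = 1e-4
second_derivative = (f_along_curve(h) - 2.0 * f_along_curve(0.0)
                     + f_along_curve(-h)) / h**2
```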

Extract structure from first-order criticality

From now on, we use additional notation as follows:

  • \(\sigma_{(1)} > \cdots > \sigma_{(\ell)}\) are the distinct singular values of \(A\).
  • \(n_1, \ldots, n_\ell\) are their multiplicities, so that \(n_1 + \cdots + n_\ell = n\).

In particular, from Eq. 2 we can write \(A\) in block-diagonal form: \[ A = \begin{bmatrix} \sigma_{1} & & \\ & \ddots & \\ & & \sigma_{n} \end{bmatrix} = \begin{bmatrix} \sigma_{(1)} I_{n_1} & & \\ & \ddots & \\ & & \sigma_{(\ell)} I_{n_\ell} \end{bmatrix}. \tag{5}\]

When \((X, Y)\) is first-order critical, Lemma 2 provides exploitable structure.

Lemma 4 If \((X, Y)\) is first-order critical, then \(X\transpose B Y\) is block diagonal with blocks \(M_1, \ldots, M_\ell\) whose sizes match the multiplicities of the singular values of \(A\): \[\begin{align} X\transpose B Y = \begin{bmatrix} M_1 & & \\ & \ddots & \\ & & M_\ell \end{bmatrix}, && M_k \in \reals^{n_k \times n_k}, \quad k = 1, \ldots, \ell. \end{align} \tag{6}\] Blocks \(M_1, \ldots, M_{\ell-1}\) are symmetric. If \(A\) is invertible (\(\sigma_{(\ell)} > 0\)), then \(M_\ell\) is symmetric as well.

Proof. Since \((X, Y)\) is first-order critical, conditions Eq. 3 provide \[\begin{align} \left( (X\transpose B Y) \cdot A\transpose \right)_{ij} & = \left( A \cdot (X\transpose B Y)\transpose \right)_{ij}, \textrm{ and } \\ \left( (X\transpose B Y)\transpose \cdot A \right)_{ij} & = \left( A\transpose \cdot (X\transpose B Y) \right)_{ij} \end{align}\] for all \(i, j\). Since \(A = \diag(\sigma_{1}, \ldots, \sigma_{n})\) as in Eq. 5, the above reads: \[\begin{align} (X\transpose B Y)_{ij} \sigma_{j} & = (X\transpose B Y)_{ji} \sigma_{i}, \textrm{ and } \\ (X\transpose B Y)_{ji} \sigma_{j} & = (X\transpose B Y)_{ij} \sigma_{i}. \end{align}\] We can rewrite this linear system in matrix form: \[\begin{align} \begin{bmatrix} \sigma_{j} & -\sigma_{i} \\ \sigma_{i} & -\sigma_{j} \end{bmatrix} \begin{bmatrix} (X\transpose B Y)_{ij} \\ (X\transpose B Y)_{ji} \end{bmatrix} = \begin{bmatrix} 0 \\ 0\end{bmatrix}. \end{align}\] The determinant of the \(2 \times 2\) matrix is \(\sigma_{i}^2 - \sigma_{j}^2\). There are several cases to consider:

  • If \(i, j\) index distinct singular values of \(A\), then \(\sigma_{i} \neq \sigma_{j}\). The \(2 \times 2\) matrix is invertible and hence \((X\transpose B Y)_{ij} = (X\transpose B Y)_{ji} = 0\). This confirms that \(X\transpose B Y\) has the block-diagonal structure as claimed.

  • If \(i, j\) both index the \(k\)th singular value of \(A\) with \(k < \ell\), then \(\sigma_{i} = \sigma_{j} = \sigma_{(k)} > 0\). The null space of the \(2 \times 2\) matrix consists of all constant vectors, so \((X\transpose B Y)_{ij} = (X\transpose B Y)_{ji}\). This confirms that the blocks \(M_1, \ldots, M_{\ell-1}\) are symmetric.

  • If \(i, j\) both index the \(\ell\)th singular value of \(A\), then there are two further possibilities.

    • If \(\sigma_{(\ell)} = 0\), then the \(2\times 2\) matrix is null so that \(M_\ell\) is unconstrained.
    • If \(\sigma_{(\ell)} > 0\), then \(\sigma_{i} = \sigma_{j} = \sigma_{(\ell)} > 0\) and (as before) it follows that \(M_\ell\) is symmetric.

This concludes all cases.

Extract meaningful inequalities from second-order criticality

The next part is a bit of an art. Lemma 3 provides us with a large collection of inequalities: one for each choice of \(\Omega\) and \(\Theta\). Which ones shall we choose? The fact that we diagonalized \(A\) and \(B\) makes it so that the canonical basis vectors \(e_1, \ldots, e_n\) (the columns of the identity matrix) are no longer arbitrary. Thus, it is reasonable to try canonical skew-symmetric matrices such as \(e_i^{}e_j\transpose - e_j^{}e_i\transpose\). That is indeed fruitful.

The lemma below provides us with a good understanding of the diagonal entries of the blocks \(M_k\) provided by Lemma 4. Later, we shall diagonalize \(X\transpose BY\) so that understanding the diagonal means we understand it as a whole.

Lemma 5 (Special inequalities) If \((X, Y)\) is second-order critical, then:

  • All diagonal entries of \(M_k\) are less than or equal to all diagonal entries of \(M_{k+1}, \ldots, M_{\ell}\).
  • All diagonal entries of the blocks \(M_1, \ldots, M_{\ell-1}\) are nonpositive.
  • In absolute value, the diagonal entries of \(M_\ell\) are smaller than those of \(M_1, \ldots, M_{\ell-1}\).
  • If \(A\) is invertible, then \(M_\ell\) can have at most one positive entry on its diagonal; if so, that entry is the smallest (in absolute value) on the whole diagonal of \(X\transpose B Y\).

Proof. Pick \(i \neq j\) and let \(\Omega = e_i^{} e_j\transpose - e_j^{} e_i\transpose\), where \(e_1, \ldots, e_n\) are the columns of \(I_n\). Setting \(\Theta = \Omega\), the second-order conditions Eq. 4 provide \[\begin{align} \inner{\Omega^2 A + A\Omega^2 - 2\Omega A \Omega}{X\transpose B Y} \geq 0. \end{align}\] Notice that \(\Omega A = \sigma_{j} e_i^{} e_j\transpose - \sigma_{i} e_j^{} e_i\transpose\) and \(-A\Omega\) is the transpose of that. Thus, \(\Omega^2 A = -( \sigma_{i} e_i^{} e_i\transpose + \sigma_{j} e_j^{} e_j\transpose )\) and \(A\Omega^2 = (\sigma_{i} e_i^{} e_j\transpose - \sigma_{j} e_j^{} e_i\transpose) \Omega = \Omega^2 A\). Therefore, \[\begin{align} \inner{\Omega^2 A - \Omega A \Omega}{X\transpose B Y} \geq 0. \end{align}\] After some computations, we see that \[\begin{align} \Omega^2 A - \Omega A \Omega = \Omega(\Omega A - A\Omega) = (\sigma_{j} - \sigma_{i}) (e_i^{} e_i\transpose - e_j^{} e_j\transpose). \end{align}\] Hence, we have shown that \[\begin{align} (\sigma_{j} - \sigma_{i}) \cdot ((X\transpose B Y)_{ii} - (X\transpose B Y)_{jj}) \geq 0 \end{align}\] for all \(i, j\). Say \(i\) indexes into the \(k\)th diagonal block with \(k < \ell\) (so, not the last one). Say \(j\) indexes into the \(q\)th diagonal block with \(q > k\). Then, \(\sigma_{j} = \sigma_{(q)} < \sigma_{(k)} = \sigma_{i}\). It follows that \((X\transpose B Y)_{ii} \leq (X\transpose B Y)_{jj}\). Since \((X\transpose B Y)_{ii}\) is a diagonal entry of \(M_k\) while \((X\transpose B Y)_{jj}\) is a diagonal entry of \(M_q\), this confirms that the diagonal entries of those blocks compare as claimed.

Now consider Eq. 4 again, with the same \(\Omega\) but this time with \(\Theta = 0\). Then, \[\begin{align} \inner{\Omega^2 A}{X\transpose B Y} \geq 0. \end{align}\] Since \(\Omega^2 A = -( \sigma_{i} e_i^{} e_i\transpose + \sigma_{j} e_j^{} e_j\transpose )\) as before, we have found that \[\begin{align} \sigma_{i} (X\transpose B Y)_{ii} + \sigma_{j} (X\transpose B Y)_{jj} \leq 0 \end{align}\] for all \(i \neq j\). There are two cases to consider:

  • If \(\sigma_{(\ell)} = 0\), let \(i\) index into the \(k\)th block with \(k < \ell\) and let \(j\) index into the \(\ell\)th block. Then, \(\sigma_{i} = \sigma_{(k)} > 0\) and \(\sigma_{j} = 0\) so we conclude that all diagonal entries of the blocks \(M_1, \ldots, M_{\ell-1}\) are nonpositive. (See below for \(M_\ell\).)

  • If \(\sigma_{(\ell)} > 0\), then \(\sigma_{1}, \ldots, \sigma_{n}\) are all positive. Thus, the diagonal of \(X\transpose B Y\) can contain at most one positive entry. If there is such a positive entry, then

    1. by the reasoning from earlier, it must belong to the last block (\(M_\ell\)), and
    2. by taking \(i\) to index that positive entry and \(j\) to index another entry in the last block (if any), we see that \(\sigma_{i} = \sigma_{j} = \sigma_{(\ell)} > 0\) and therefore \((X\transpose B Y)_{ii} \leq -(X\transpose B Y)_{jj} = |(X\transpose B Y)_{jj}|\).

Thus, the positive entry (if any) is no larger than the absolute value of any other diagonal entry of \(M_\ell\), and therefore also no larger than the absolute value of any other diagonal entry of \(X\transpose B Y\).

We need more control over \(M_\ell\) in the case where \(A\) is singular (\(\sigma_{(\ell)} = 0\)). To this end, let us consider Eq. 4 once again, with the same \(\Omega\) but this time with \(\Theta = -\Omega\). Then, \[\begin{align} \inner{\Omega^2 A + \Omega A \Omega}{X\transpose B Y} \geq 0. \end{align}\] With similar computations as before, we find \[\begin{align} \Omega^2 A + \Omega A \Omega = \Omega(\Omega A + A\Omega) = -(\sigma_{i} + \sigma_{j}) (e_i^{} e_i\transpose + e_j^{} e_j\transpose), \end{align}\] so that \[\begin{align} (\sigma_{i} + \sigma_{j}) \cdot ((X\transpose B Y)_{ii} + (X\transpose B Y)_{jj}) \leq 0 \end{align}\] for all \(i \neq j\). Take \(j\) to index into the \(\ell\)th block and \(i\) to index into the \(k\)th block with \(k < \ell\). Then, assuming \(\sigma_{(k)} > \sigma_{(\ell)} = 0\), it follows that \((X\transpose B Y)_{jj} \leq -(X\transpose B Y)_{ii}\). But we already knew from above that \(-(X\transpose B Y)_{jj} \leq -(X\transpose B Y)_{ii}\), and also that \((X\transpose B Y)_{ii} \leq 0\). It follows that the diagonal entries of \(M_\ell\) are smaller than all diagonal entries of \(M_1, \ldots, M_{\ell-1}\) in absolute value when \(\sigma_{(\ell)} = 0\). We had already shown this to be the case for \(\sigma_{(\ell)} > 0\), hence the proof is complete.

Jump to a particularly nice second-order point

Given a second-order critical point \((X, Y)\), we want to jump to another one that has as much structure as we can, without losing in function value. To do so, we may exploit the natural symmetries of the problem and try to rotate \(X, Y\). Lemma 4 suggests that we pick \(U\) so as to diagonalize \(X\transpose B Y\). Then, Lemma 5 tells us exactly how those diagonal entries are ordered.

Lemma 6 (Diagonalize \(X\transpose BY\)) For each \(1 \leq k \leq \ell-1\), let \(U_k \in \mathrm{SO}(n_k)\) be such that \(M_k^{} = U_k^{} \Lambda_k^{} U_k\transpose\) with \(\Lambda_k\) diagonal. If \(\sigma_{(\ell)} > 0\), do the same for \(k = \ell\). If \(\sigma_{(\ell)} = 0\), let \(U_\ell = I_{n_\ell}\) and \(\Lambda_\ell = M_\ell\). Define \[\begin{align} U = \begin{bmatrix} U_1 & & \\ & \ddots & \\ & & U_\ell \end{bmatrix} && \textrm{ and } && \Lambda = \begin{bmatrix} \Lambda_1 & & \\ & \ddots & \\ & & \Lambda_\ell \end{bmatrix}, \end{align}\] so that \(U\) itself is in \(\SOn\) and \(\Lambda\) is diagonal of size \(n\). Let \(\tilde X = XU\) and \(\tilde Y = YU\).

Then, \(f(\tilde X, \tilde Y) = f(X, Y)\) and \((\tilde X, \tilde Y)\) is first-order critical. If moreover \((X, Y)\) is second-order critical, then so is \((\tilde X, \tilde Y)\).

Proof. By the definition of \(f\), the special form of \(A\) Eq. 5 and the special form of \(X\transpose B Y\) Eq. 6, \[\begin{align} f(X, Y) & = \innersmall{A}{X\transpose B Y} = \sum_{k = 1}^\ell \sigma_{(k)} \trace(M_k) = \sum_{k = 1}^\ell \sigma_{(k)} \trace(\Lambda_k). \end{align} \tag{7}\] Likewise, by construction of \((\tilde X, \tilde Y)\) we have \(\tilde X\transpose B \tilde Y = U\transpose X\transpose B Y U = \Lambda\), and so \[\begin{align} f(\tilde X, \tilde Y) & = \innersmall{A}{\tilde X\transpose B \tilde Y} = \sum_{k = 1}^\ell \sigma_{(k)} \trace(\Lambda_k). \end{align}\] This confirms that \(f(\tilde X, \tilde Y) = f(X, Y)\).

To see that \((\tilde X, \tilde Y)\) is first-order critical, note that both \(A\) and \(\tilde X\transpose B \tilde Y = \Lambda\) are diagonal then plug into Eq. 3.

Now assume \((X, Y)\) is second-order critical Eq. 4, that is, \[\begin{align} \inner{\Omega^2 A + A\Theta^2 - 2\Omega A \Theta}{X\transpose B Y} \geq 0 \end{align}\] for all skew-symmetric \(\Omega, \Theta\). We must show that the same holds with \(X, Y\) replaced by \(\tilde X, \tilde Y\). To this end, use \(U\transpose U = I_n\) and both \(AU\transpose = U\transpose A\) and \(UA = AU\) to compute \[\begin{align*} \inner{\Omega^2 A + A\Theta^2 - 2\Omega A \Theta}{\tilde X\transpose B \tilde Y} & = \inner{U\Omega^2 AU\transpose + UA\Theta^2U\transpose - 2U\Omega A \Theta U\transpose}{X\transpose B Y} \\ & = \inner{U\Omega^2U\transpose A + A U\Theta^2U\transpose - 2U\Omega U\transpose (UAU\transpose) U \Theta U\transpose}{X\transpose B Y} \\ & = \inner{\tilde \Omega^2 A + A \tilde\Theta^2 - 2\tilde\Omega A \tilde\Theta}{X\transpose B Y}, \end{align*}\] where \(\tilde \Omega = U\Omega U\transpose\) and \(\tilde \Theta = U \Theta U\transpose\). Since these are skew-symmetric, it follows from second-order criticality of \((X, Y)\) that the right-hand side is nonnegative, and hence that \((\tilde X, \tilde Y)\) is second-order critical.

Deduce second-order critical points are optimal

Say \((X, Y)\) is second-order critical. Lemma 6 tells us we can modify it (without changing the value of \(f\) nor changing the fact that this point is second-order critical) in such a way that \(X\transpose BY\) is diagonal. Lemma 5 further tells us precisely what this diagonal looks like. From there, we can deduce the value \(f(X, Y)\), and notice that it is the same for all second-order critical points. Since there is a global optimum, and since it is second-order critical, it follows that the value of \(f\) at all second-order critical points is optimal. Lemma 7 below provides the details and as such concludes the proof of Theorem 1.

Recall that \(\mu_{1} \geq \cdots \geq \mu_{n}\) denote the singular values of \(B\).

Lemma 7 If \((X, Y)\) is second-order critical, then \[\begin{align*} f(X, Y) = -\left( (-1)^{n} \sign(\det(AB)) \sigma_{n}\mu_{n} + \sum_{i = 1}^{n-1} \sigma_{i} \mu_{i}\right) \end{align*}\] and that is optimal.

Proof. We only need to show that all second-order critical points attain the claimed cost function value, which is independent of \((X, Y)\). Then, we will have that all second-order critical points have the same value. Moreover, \(f\) attains its global minimum at a second-order critical point (since \(\SOn \times \SOn\) is compact and \(f\) is smooth). That is indeed sufficient to claim that all second-order critical points are optimal. Thus, we proceed to compute \(f(X, Y)\).

Since \((X, Y)\) is second-order critical, Lemma 6 provides \(U \in \SOn\) such that \((\tilde X, \tilde Y) = (XU, YU)\) is also second-order critical, with the same function value. By definition of \(U\), the matrix \(\tilde X\transpose B \tilde Y = \Lambda\) is diagonal, with blocks \(\Lambda_1, \ldots, \Lambda_\ell\). Since \(\tilde X, \tilde Y\) are orthogonal, the diagonal entries of \(\Lambda\) are the singular values of \(B\), in some order and up to signs.

Lemma 5 applied to \((\tilde X, \tilde Y)\) states that

  • The diagonal entries of \(\Lambda_1\) are all \(\leq\) those of \(\Lambda_2\) which are \(\leq\) those of \(\Lambda_3\) etc.
  • The diagonal entries of \(\Lambda_1, \ldots, \Lambda_{\ell-1}\) are all nonpositive.
  • In absolute value, the diagonal entries of \(\Lambda_\ell\) are \(\leq\) those of \(\Lambda_1, \ldots, \Lambda_{\ell-1}\).

Therefore, \(\Lambda_1\) must have the \(n_1\) largest singular values of \(B\) on its diagonal with negative signs, so that \(\trace(\Lambda_1) = -(\mu_{1} + \cdots + \mu_{n_1})\). Then, \(\Lambda_2\) must have \(n_2\) remaining largest singular values of \(B\) on its diagonal with negative signs, so that \(\trace(\Lambda_2) = -(\mu_{n_1+1} + \cdots + \mu_{n_1+n_2})\), etc. Eventually, \(\Lambda_\ell\) has the \(n_\ell\) smallest singular values of \(B\) on its diagonal, with signs to be determined.

By Eq. 7, it follows that \[\begin{align} f(X, Y) & = \sum_{k = 1}^{\ell} \sigma_{(k)} \trace(\Lambda_k) \\ & = -\sigma_{(1)} (\mu_{1} + \cdots + \mu_{n_1}) - \sigma_{(2)} (\mu_{n_1 + 1} + \cdots + \mu_{n_1 + n_2}) - \cdots + \sigma_{(\ell)} \trace(\Lambda_\ell) \\ & = \sigma_{(\ell)} \trace(\Lambda_\ell) - \sum_{i=1}^{n-n_\ell} \sigma_{i} \mu_{i}, \end{align}\] because \(\sigma_{(1)} = \sigma_{1} = \cdots = \sigma_{n_1}\), \(\sigma_{(2)} = \sigma_{n_1+1} = \cdots = \sigma_{n_1 + n_2}\) and so on.

If \(\sigma_{(\ell)} = 0\), we are done. Now consider \(\sigma_{(\ell)} > 0\). Then, Lemma 5 further provides that:

  • \(\Lambda_\ell\) has at most one positive entry on its diagonal; if so, that entry is the smallest one (in absolute value).

Thus, the diagonal entries of \(\Lambda_\ell\) are the \(n_\ell\) smallest singular values of \(B\), all with a negative sign except possibly for \(\mu_{n}\) (the smallest one), so that \[\begin{align} \sigma_{(\ell)} \trace(\Lambda_\ell) = s\sigma_{n}\mu_{n} - (\sigma_{n-n_\ell+1}\mu_{n-n_\ell+1} + \cdots + \sigma_{n-1}\mu_{n-1}) \end{align}\] where \(s = \pm1\) is to be determined. If \(\mu_{n} = 0\), we are also done. Now consider \(\mu_{n} > 0\). Both \(A, B\) are invertible and we have \[\begin{align} \sign(\det(AB)) = \sign(\det(B)) = \sign(\det(\tilde X\transpose B \tilde Y)) = \sign(\det(\Lambda)) = (-1)^{n-1} s. \end{align}\] Therefore, \(s = (-1)^{n-1} \sign(\det(AB))\) and the proof is complete.

References

Boumal, N. 2023. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press. https://doi.org/10.1017/9781009166164.
Bozma, H. I., W. D. Gillam, and F. Öztürk. 2019. “Morse-Bott Functions on Orthogonal Groups.” Topology and Its Applications 265. https://doi.org/10.1016/j.topol.2019.07.001.
Brockett, R. W. 1989. “Least Squares Matching Problems.” Linear Algebra and Its Applications 122–124 (September): 761–77. https://doi.org/10.1016/0024-3795(89)90675-7.
Frankel, T. 1965. “Critical Submanifolds of the Classical Groups and Stiefel Manifolds.” In Differential and Combinatorial Topology (a Symposium in Honor of Marston Morse), 37–53. Princeton, NJ: Princeton University Press.
Helmke, U., and J. B. Moore. 1992. “Singular-Value Decomposition via Gradient and Self-Equivalent Flows.” Linear Algebra and Its Applications 169 (May): 223–48. https://doi.org/10.1016/0024-3795(92)90180-i.
Levin, E., J. Kileel, and N. Boumal. 2024. “The Effect of Smooth Parametrizations on Nonconvex Optimization Landscapes.” Mathematical Programming. https://doi.org/10.1007/s10107-024-02058-3.
von Neumann, J. 1937. “Some Matrix-Inequalities and Metrization of Matric-Space.” Tomsk University Review 1: 286–300. Reprinted in A. H. Taub (Ed.), John von Neumann: Collected Works, Vol. IV, pp. 205–218. Oxford: Pergamon.

Citation

BibTeX citation:
@online{boumal2024,
  author = {Boumal, Nicolas},
  title = {Benign Landscape of the {Brockett} Cost Function: Asymmetric
    Case},
  date = {2024-04-28},
  url = {racetothebottom.xyz/posts/brockett-asymmetric},
  langid = {en},
  abstract = {The second-order critical points of \$f(X, Y) =
    \textbackslash trace(AX\textbackslash transpose BY)\$ for \$X, Y\$
    orthogonal are globally optimal. This implies a number of equally
    well-known corollaries, including benign non-convexity of \$f(X) =
    \textbackslash trace(AX)\$ and \$f(X) = \textbackslash
    sqfrobnorm\{A-X\}\$ for \$X\$ orthogonal.}
}