The fact
Let \(A, B\) be two symmetric matrices of size \(n\). Some notation:
- \(\lambda_1 \geq \cdots \geq \lambda_n\) are the eigenvalues of \(A\) in descending order.
- \(\mu_1 \leq \cdots \leq \mu_n\) are the eigenvalues of \(B\) in ascending order.
- \(\inner{A}{B} = \trace(A\transpose B)\) is the usual inner product for matrices.
- \(\On = \{ X \in \Rnn : X\transpose X = I_n\}\) is the orthogonal group.
The following is well known. A proof follows notably from a paper by Bozma, Gillam, and Öztürk (2019), where the authors show that the function \(f\) below is Morse–Bott. They use linear algebra arguments much like the construction below, and cite Duistermaat, Kolk, and Varadarajan (1983) for an older proof based on group theoretic arguments. These facts are also discussed eloquently in the context of dynamical systems (gradient flows) in a fantastic book by Helmke and Moore (1996). That book offers insights into the already rich literature at the time, and includes a foreword by Brockett. The cost function bears his name following the papers (Brockett 1989, 1991), in which he notably works out the situation where \(A\) and/or \(B\) have distinct eigenvalues.
Theorem 1 (Benign landscape) Consider the problem of minimizing the Brockett cost function, namely, \[\begin{align} f \colon \On \to \reals, \qquad f(X) = \innersmall{A}{X\transpose BX}. \end{align} \tag{1}\] Its second-order critical points are its global minimizers. The optimal value is \(\sum_{i=1}^n \lambda_i \mu_i\).
In particular, \(f\) has the so-called strict-saddle property: if \(X\) is a critical point, then either it is a global minimum or the Riemannian Hessian of \(f\) at \(X\) has a negative eigenvalue (and hence reasonable optimization algorithms are overwhelmingly likely to find a global optimum).
Much has been known for a long time about the optimal solutions of that problem. They become quite clear once you diagonalize \(A\) and \(B\) (which we will too). The goal here is to focus on the landscape of \(f\), not just on the solution of the problem. This is spelled out less often, with Bozma et al.’s paper being a great positive example outside the optimization literature. Please write to me if you know other explicit references.
Versions of Theorem 1 are sometimes proved with additional assumptions on \(A\) and \(B\) such as invertibilty or that their eigenvalues are distinct or positive. These are unnecessary, and it is useful to prove the theorem for all \(A, B\) to enable a collection of corollaries listed later.
Added April 28, 2024: See this follow-up post about the Brockett cost function in the asymmetric case.
Proof structure
Part of the point of this post is to underline a proof structure that might be useful when analyzing other nonconvex landscapes. It goes as follows:
- Reduce the problem to simpler data. Here, we reduce the task to diagonal \(A\) and \(B\) via a smooth change of variable (a diffeomorphism \(\varphi\) from \(\On\) to itself). This works because diffeomorphisms do not qualitatively change the landscape: if \(X\) is a local/global minimum for \(f \circ \varphi\), then \(\varphi(X)\) is a local/global minimum for \(f\); and likewise for first-/second-order critical points. (This is standard; see for example (Boumal 2023, Prop. 9.6) or (Levin, Kileel, and Boumal 2024, sec. 2.6) for additional optimization context.)
- Write down the first- and second-order necessary optimality conditions. This is straightforward (though cumbersome at times). Here, we can use this post.
- The first-order conditions provide equalities: extract structure from them.
- The second-order conditions provide a family of inequalities. Explore what you can get out of them. This step typically requires inspiration and insight to select the right subfamily of inequalities. Here, a fairly obvious selection provides all we need.
- Show that any second-order critical point \(X\) can be transformed into another one without changing the value of \(f\), and argue that this value is independent of \(X\).
- Conclude as follows:
- Note that all second-order critical points have the same cost function value, and
- Observe that at least one of them is optimal since \(\On\) is compact,
Contrast this to a more direct approach, which would exploit the optimality conditions to deduce that any second-order critical point \(X\) is of a certain form. This would be tricky here, because the set of global minimizers of \(f\) is not a singleton. Its structure is (somewhat) intricate, due to the fact that \(A\) and \(B\) may have eigenvalues with multiplicities larger than one, and these multiplicities may not be the same for \(A\) and \(B\). The approach we follow here says:
- Don’t try to deduce that \(X\) is so-and-so (this would force us to branch over many cases to cover the possible optimal \(X\)s).
- Rather, transform \(X\) into a more friendly \(\tilde X\) (one of the minimizers), and argue \(X\) is just as good at \(\tilde X\).
Corollaries
Consider two more manifolds:
- \(\SOn = \{ X \in \On : \det(X) = 1 \}\) is the group of rotations.
- \(\Stnp = \{ X \in \Rnp : X\transpose X = I_p \}\) is the Stiefel manifold.
First, it is clear that the result still holds if we optimize only over rotation matrices.
Corollary 1 (Rotations) Theorem 1 and corollaries below still hold with \(\On\) replaced by \(\SOn\).
Proof. This holds because second-order critical points on \(\SOn\) are also second-order critical on \(\On\), and \(\SOn\) is a compact connected component of \(\On\) hence there is at least one second-order critical point in \(\SOn\). To adapt the proof, simply check in Lemma 1 (where we diagonalize \(A\) and \(B\)) and Lemma 7 (where we diagonalize \(M_1, \ldots, M_\ell\)) that a symmetric matrix of size \(d\) can be diagonalized with an orthogonal matrix in \(\SOd\).
Second, restricting optimization to Stiefel also preserves benign landscapes.
Corollary 2 (Stiefel) For all \(H \in \Rpp\) and \(B \in \Rnn\) symmetric, the second-order critical points of \[\begin{align} g \colon \Stnp \to \reals, \qquad g(Y) = \innersmall{H}{Y\transpose B Y} \end{align}\] are global minima. The minimal value is \(\sum_{i = 1}^p \nu_i \mu_i\), where \(\nu_1 \geq \cdots \geq \nu_p\) are the eigenvalues of \(H\).
Proof. Create \(A \in \Rnn\), symmetric, by padding \(H\) with zeros. Let \(\varphi \colon \On \to \Stnp\) extract the first \(p\) columns of an orthogonal matrix. Notice that \(f = g \circ \varphi\). If \(Y \in \Stnp\) is second-order critical for \(g\), then any \(X \in \On\) such that \(\varphi(X) = Y\) is second-order critical for \(f\) (Levin, Kileel, and Boumal 2024, sec. 3.2). (Indeed, for any curve \(c\) on \(\On\) satisfying \(c(0) = X\), let \(\gamma = \varphi \circ c\) to see that \(\gamma(0) = Y\) and \(f \circ c = g \circ \gamma\); since \(Y\) is second-order critical we know \((g \circ \gamma)'(0) = 0\) and \((g \circ \gamma)''(0) \geq 0\); it follows that \((f \circ c)'(0) = 0\) and \((f \circ c)''(0) \geq 0\).) Theorem 1 then provides that \(X\) attains the minimal value of \(f\) on \(\On\). That is the same as the minimal value of \(g\) on \(\Stnp\) since \(\varphi\) surjects to \(\Stnp\): \(f(\On) = g(\varphi(\On)) = g(\Stnp)\). Yet, \(f(X) = g(\varphi(X)) = g(Y)\). Thus, \(Y\) is a global minimum. All we used is that \(\varphi\) is a \(C^2\) surjection from \(\On\) to \(\Stnp\).
Corollary 3 (Rayleigh quotient) The second-order critical points of the function \[\begin{align} g \colon \Stnp \to \reals, \qquad g(Y) = \trace(Y\transpose B Y) \end{align}\] are global minima. The minimal value is \(\sum_{i = 1}^p \mu_i\).
Proof. Use Corollary 2 with \(H = I_p\). This recovers classical results à la Courant–Fischer and (Fan 1949).
Corollary 4 It is clear that all of the above extends to maximization instead of minimization (simply replace \(A\) by \(-A\) or \(H\) by \(-H\) and track consequences). Also, if \(A\) or \(H\) are diagonal with strictly decreasing entries, then the columns of optimal \(X\) or \(Y\) reveal eigenvectors of \(B\), sorted by eigenvalue.
As an exercise, one can recover the Hoffman-Wielandt inequality from Theorem 1 (also here).
Reduce the problem to diagonal \(A, B\)
The main theorem only references how the minima and second-order critical points of \(f\) relate to the eigenvalues of \(A\) and \(B\). Thus, in the process of proving the theorem, we are allowed to replace \(A\) and \(B\) as long as we do not change that relation.
Lemma 1 It is sufficient to prove Theorem 1 for the case where \(A\) and \(B\) are diagonal with sorted entries.
Proof. Let \(D = \diag(\lambda_{1}, \ldots, \lambda_{n})\) and \(E = \diag(\mu_{1}, \ldots, \mu_{n})\). We can choose \(X_A \in \On\) such that \(A = X_A^\top D X_A^{}\), and likewise \(X_B \in \On\) such that \(B = X_B^\top E X_B^{}\). Then, \[ f(X) = \innersmall{A}{X\transpose BX} = \innersmall{D}{\tilde X\transpose E \tilde X}, \] where \(\tilde X = X_B X X_A\transpose\). The change of variable from \(X\) to \(\tilde X\) is a diffeomorphism on \(\On\): it does not affect the landscape of \(f\). Explicitly: first-order critical points are mapped to first-order critical points, and likewise for second-order critical points, local minima and global minima (Boumal 2023, Prop. 9.6). The optimal value is unchanged, and the data \((A, B)\) become \((D, E)\): their spectra are unchanged.
Accordingly, for the remainder of the proof we assume \[ A = \diag(\lambda_{1}, \ldots, \lambda_{n}) \qquad \textrm{ and } \qquad B = \diag(\mu_{1}, \ldots, \mu_{n}). \tag{2}\] We won’t use the fact that \(B\) is diagonal, but at this point it would be strange not to diagonalize both.
Write down necessary optimality conditions
First- and second-order necessary optimality conditions for the minimizers of a function \(f\) on the orthogonal group \(\On\) are well known. Specialized to our setting, here is what they say.
Lemma 2 (First-order conditions) \(X \in \On\) is a first-order critical point for \(f\) if and only if \(A \cdot X\transpose BX = X\transpose BX \cdot A\).
Proof. We have \(f(X) = \innersmall{A}{X\transpose BX}\). Its Euclidean gradient is \(\nabla f(X) = 2BXA\). Plug this into the first theorem of this post, which expresses the fact that the Riemannian gradient at \(X\) is zero.
Lemma 3 (Second-order conditions) A first-order critical point \(X \in \On\) is second-order critical for \(f\) if and only if \[ \innerbig{\Omega^2 A - \Omega A \Omega}{X\transpose B X} \geq 0 \] for all \(\Omega \in \Rnn\) skew-symmetric.
Proof. We have \(f(X) = \innersmall{A}{X\transpose BX}\). Its Euclidean gradient and Hessian are \(\nabla f(X) = 2BXA\) and \(\nabla^2 f(X)[\dot X] = 2B\dot X A\). Plug these with \(\dot X = X\Omega\) into the second theorem of the same post, which expresses the fact that the Riemannian Hessian at \(X\) is positive semidefinite.
Extract structure from first-order criticality
From now one, we use additional notation as follows:
- \(\lambda_{(1)} > \cdots > \lambda_{(\ell)}\) are the distinct eigenvalues of \(A\).
- \(n_1, \ldots, n_\ell\) are their multiplicities, so that \(n_1 + \cdots + n_\ell = n\).
In particular, from Eq. 2 we can write \(A\) in block-diagonal form: \[ A = \begin{bmatrix} \lambda_{1} & & \\ & \ddots & \\ & & \lambda_{n} \end{bmatrix} = \begin{bmatrix} \lambda_{(1)} I_{n_1} & & \\ & \ddots & \\ & & \lambda_{(\ell)} I_{n_\ell} \end{bmatrix}. \tag{3}\]
When \(X\) is first-order critical, it satisfies certain equality conditions that provide exploitable structure.
Lemma 4 If \(X\) is first-order critical, then \(X\transpose B X\) is block diagonal with symmetric blocks \(M_1, \ldots, M_\ell\) whose sizes match the multiplicities of the eigenvalues of \(A\): \[\begin{align} X\transpose B X = \begin{bmatrix} M_1 & & \\ & \ddots & \\ & & M_\ell \end{bmatrix}, && M_k \in \reals^{n_k \times n_k}, \quad k = 1, \ldots \ell. \end{align} \tag{4}\]
Proof. Since \(X\) is first-order critical, Lemma 2 provides \[ (A \cdot X\transpose BX)_{ij} = (X\transpose BX \cdot A)_{ij} \] for all \(i, j\). Since \(A = \diag(\lambda_{1}, \ldots, \lambda_{n})\) as in Eq. 3, the above reads: \[ (\lambda_{i} - \lambda_{j}) (X\transpose BX)_{ij} = 0. \] Therefore, whenever \(i, j\) index distinct eigenvalues of \(A\), the corresponding entry of \(X\transpose B X\) is zero.
This block structure allows us to recognize that each critical point is part of a submanifold of critical points of equal value. Moreover, if any point in that submanifold is second-order critical, then they all are.
Lemma 5 (Submanifolds of critical points) Consider \(X \in \On\). Then, for all \(U \in \On\) of the form \(U = \diag(U_1, \ldots, U_\ell)\) with each \(U_k \in \mathrm{O}(n_k)\), the following hold:
- \(f(XU) = f(X)\).
- If \(X\) is first-order critical, then so is \(XU\).
- If \(X\) is second-order critical, then so is \(XU\).
Proof. By Eq. 3, we have \(A = \diag(\lambda_{(1)} I_{n_1}, \ldots, \lambda_{(\ell)} I_{n_\ell})\). Since \(U = \diag(U_1, \ldots, U_\ell)\) has orthogonal diagonal block of the same sizes as those of \(A\), we see that \(U AU\transpose = A\) and hence \[ f(XU) = \innersmall{A}{U\transpose \cdot X\transpose B X \cdot U} = \innersmall{UAU\transpose}{X\transpose BX} = \innersmall{A}{X\transpose B X} = f(X). \]
Now assume \(X\) is first-order critical. From Lemma 4, we know \(X\transpose B X = \diag(M_1, \ldots, M_\ell)\) with symmetric blocks of sizes \(n_1, \ldots, n_\ell\). From Lemma 2, \(XU\) is first-order critical if and only if \[ A \cdot U\transpose (X\transpose BX) U = \diag\!\left(\lambda_{(1)} U_1\transpose M_1^{} U_1^{}, \ldots, \lambda_{(\ell)} U_{\ell}\transpose M_{\ell}^{} U_{\ell}^{} \right) \] is symmetric, which is indeed the case.
Finally, assume \(X\) is second-order critical. To check that \(XU\) is second-order critical too, apply Lemma 3: For any \(\Omega\) skew-symmetric, let \(\tilde \Omega = U \Omega U\transpose\) and use \(U AU\transpose = A\) twice to find \[ \innerbig{\Omega^2 A - \Omega A \Omega}{U\transpose X\transpose B X U} = \innerbig{\tilde \Omega^2 A - \tilde \Omega A \tilde \Omega}{X\transpose B X}. \] The right-hand side is indeed nonnegative since \(\tilde \Omega\) is skew-symmetric and \(X\) is second-order critical.
Extract meaningful inequalities from second-order criticality
The next part is a bit of an art. Lemma 3 provides us with a large collection of inequalities: one for each choice of \(\Omega\). Which ones shall we choose? This typically takes a fair amount of trial-and-error. In this case, the fact that we diagonalized \(A\) and \(B\) makes it so that the canonical basis vectors \(e_1, \ldots, e_n\) (the columns of the identity matrix) are no longer arbitrary. Thus, it is reasonable to try canonical skew-symmetric matrices such as \(e_i^{}e_j\transpose - e_j^{}e_i\transpose\). That is indeed fruitful.
Lemma 6 (Special inequalities) If \(X\) is second-order critical, then for all \(1 \leq i, j \leq n\) we have \[ \left( \lambda_j - \lambda_i \right) \left( (X\transpose B X)_{ii} - (X\transpose B X)_{jj} \right) \geq 0. \] In particular, if \(\lambda_i > \lambda_j\) then \((X\transpose B X)_{ii} \leq (X\transpose B X)_{jj}\).
Proof. Let \(\Omega = e_i^{}e_j\transpose - e_j^{}e_i\transpose\). For \(i \neq j\), this allows us to range over a natural basis of the skew-symmetric matrices. Plug this \(\Omega\) in Lemma 3, and work through the expressions using the fact that \(Ae_j = \lambda_j e_j\) and \(Be_i = \mu_i e_i\). (For \(i = j\), the claim is clear but vacuous.)
Jump to a particularly nice second-order point
Given a second-order critical point \(X\), we want to jump to another one that has as much structure as we can, without losing in function value. Lemma 5 provides us with a lot of leeway: we can rotate \(X\) to \(XU\) with \(U = \diag(U_1, \ldots, U_k)\). Lemma 4 suggests how to pick \(U\): try and diagonalize \(X\transpose B X\). Finally, Lemma 6 tells us exactly how those diagonal entries are ordered.
Lemma 7 If \(X\) is second-order critical, then there exists a second-order critical point \(\tilde X\) such that \(f(X) = f(\tilde X)\) and \(\tilde X\transpose B \tilde X = \diag(\mu_1, \ldots, \mu_n)\). In particular, \(f(X) = \sum_{i=1}^n \lambda_i \mu_i\).
Proof. Since \(X\) is critical, Lemma 4 implies \(X\transpose B X = \diag(M_1, \ldots, M_\ell)\) with symmetric blocks \(M_k\) of size \(n_k\). For each \(k\), choose \(U_k\) (orthogonal of size \(n_k\)) such that \(U_k\transpose M_k U_k = \Lambda_k\), where \(\Lambda_k\) is diagonal with increasing diagonal entries. By Lemma 5, we know that \(\tilde X = X U\) with \(U = \diag(U_1, \ldots, U_k)\) is second-order critical (because \(X\) is so) and \(f(X) = f(\tilde X)\). Then, \(\tilde X\transpose B \tilde X = \diag(\Lambda_1, \ldots, \Lambda_\ell)\) is a diagonal matrix. Of course, its diagonal entries must be the eigenvalues of \(B\), namely, \(\mu_1, \ldots, \mu_n\). In what order? Within each block \(\Lambda_k\), the diagonal entries appear in increasing order. Moreover, from Lemma 6 applied to \(\tilde X\), we see that if \(i < j\) index distinct eigenvalues of \(A\) (corresponding to distinct diagonal blocks) then \(\lambda_i > \lambda_j\) and so \[ (\tilde X\transpose B \tilde X)_{ii} \leq (\tilde X\transpose B \tilde X)_{jj}. \] This means that all the diagonal entries of \(\Lambda_1\) are less than or equal to all the diagonal entries of \(\Lambda_2\), which are less than or equal to all the diagonal entries of \(\Lambda_3\), etc. Overall, we deduce that \[ \tilde X\transpose B \tilde X = \diag(\mu_1, \ldots, \mu_n), \] as announced. To conclude, compute as follows: \[ f(X) = f(\tilde X) = \innersmall{A}{\tilde X\transpose B \tilde X} = \innersmall{\diag(\lambda_1, \ldots, \lambda_n)}{\diag(\mu_1, \ldots, \mu_n)} = \sum_{i=1}^n \lambda_i \mu_i, \] as claimed.
Deduce second-order critical points are optimal
By Lemma 7, all second-order critical points \(X\) have the same value, namely, \(f(X) = \sum_{i=1}^n \lambda_{i} \mu_{i}\). Also, \(\On\) is compact, so that there exists a global minimizer. That point is second-order critical. Thus, \(\sum_{i=1}^{n} \lambda_{i} \mu_{i}\) is the optimal value, and therefore all second-order critical points are optimal. The proof of Theorem 1 is complete.
Red herrings
In the process of extracting inequalities from the second-order conditions (Lemma 3), we may try many things. It is common for some of these attempts to look promising. For example, I obtained the following and tried for a while, but I didn’t manage to show the theorem from there.
Lemma 8 For all \(1 \leq i, j \leq n\), we have \[ \left( \mu_i - (X\transpose B X)_{jj} \right) \left( \lambda_j - (XAX\transpose)_{ii} \right) \geq 0. \]
Proof. Consider the third theorem in this post about general optimality conditions on \(\On\). Plug in \(Z = e_i^{}e_j\transpose\) and work through the expressions, using the fact that \(Ae_j = \lambda_j e_j\) and \(Be_i = \mu_i e_i\).
This type of inequality is sufficient when \(A\) and \(B\) have distinct eigenvalues (consider \(i = j = 1\) and deduce the first column of \(X\) and of \(X\transpose\) is \(\pm e_1\); then proceed to \(i = j = 2\) etc.) However, once we allow general \(A, B\), the global minimum is no longer unique, and this type of reasoning becomes intricate. The “jump” to a nicer second-order critical point takes care of all of that.
Beyond homogeneous quadratics?
This section was added April 28, 2024
The Brockett cost funtion is a (special) homogeneous quadratic function of \(X \in \On\). One could hope that the landscape is still benign if we add a linear term, that is, if we optimize \[ h(X) = \innersmall{A}{X\transpose B X} + \innersmall{C}{X} \] for some symmetric matrices \(A, B, C\). Unfortunately, that is not the case in general. (See also (Brockett 1989) for more discussion of that mixed cost function.)
Run the following numerical experiment in Matlab using the Manopt toolbox to confirm that \(h \colon \SOn \to \reals\) can have strict local minimizers:
% This Matlab code requires the Manopt toolbox: manopt.org
n = 5;
A = randsym(n); % Three random symmetric matrices of size n
B = randsym(n);
C = randsym(n);
problem.M = rotationsfactory(n); % Optimize on SO(n)
inner = @(M, N) M(:).'*N(:); % Trace inner product
problem.cost = @(X) inner(A, X'*B*X) + inner(C, X); % Cost function h(X)
problem = manoptAD(problem); % Automatic differentation
X1 = trustregions(problem); % Minimize with a random initialization
disp(hessianspectrum(problem, X1)'); % Check the eigenvalues of the Hessian
X2 = trustregions(problem); % again, with another one
disp(hessianspectrum(problem, X2)');
X3 = trustregions(problem); % again, with another one
disp(hessianspectrum(problem, X3)');
In a typical run, the value of \(h\) is not the same at all three of X1, X2, X3
, yet the gradient is numerically zero and the Hessian is positive definite at all three, confirming that they are strict local minimizers.
A different generalization which does turn out to preserve the benign quality of the optimization landscape is the trace ratio or trace quotient cost function, for appropriate matrices \(A, B, G\): \[ h(X) = \frac{\innersmall{G}{X\transpose A X}}{\innersmall{G}{X\transpose B X}}. \] See for example (Yang, Zheng, and Xia 2022), where the conditions on \(A, B, G\) are spelled out.
References
Citation
@online{boumal2023,
author = {Boumal, Nicolas},
title = {Benign Landscape of the {Brockett} Cost Function},
date = {2023-12-04},
url = {racetothebottom.xyz/posts/brockett-symmetric},
langid = {en},
abstract = {The second-order critical points of \$f(X) =
\textbackslash trace(AX\textbackslash transpose BX)\$ for \$X\$
orthogonal are globally optimal. This implies a number of equally
well-known corollaries.}
}