In (Zhai et al. 2020, sec. 2.2), the authors consider the following optimization problem: \[ \max_{X \in \On} \|X\|_4^4, \qquad(1)\] where \(\On = \{ X \in \Rnn : X\transpose X = I_n \}\) is the orthogonal group, and \(\|X\|_4^4 = \sum_{i,j} X_{ij}^4\). This occurs as a simplification in the context of orthogonal dictionary learning (see at the end).
Part of the interest in this problem is that it has discrete symmetry: it is invariant under the change of variable \(X \mapsto PX\) for all signed permutations \(P\). (That is, each row and column of \(P\) contains exactly one nonzero entry, each of which is either \(+1\) or \(-1\).) This creates a nonconvex landscape with exponentially many equivalent extrema—and also exponentially many saddle points.
As Zhai et al. note, the maximizers are exactly the signed permutation matrices. They further conjecture that the landscape of this nonconvex optimization problem is benign, in the sense that the local maximizers are global maximizers (Zhai et al. 2020, sec. 4.1).
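As a quick illustration in Matlab (a sketch, not from the original post), one can check that a random signed permutation matrix attains the value \(n\), and that the cost is invariant under \(X \mapsto PX\):
% Sketch: a signed permutation attains ||X||_4^4 = n, and the cost is
% invariant under left-multiplication by a signed permutation.
n = 5;
nrm44 = @(X) sum(X(:).^4);                               % ||X||_4^4
P = eye(n); P = P(randperm(n), :) .* sign(randn(n, 1));  % random signed permutation
[X, ~] = qr(randn(n));                                   % random orthogonal matrix
fprintf('||P||_4^4 = %g (n = %d)\n', nrm44(P), n);
fprintf('Invariance gap: %g\n', nrm44(P*X) - nrm44(X));  % ~0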
This post confirms the conjecture, showing a slightly stronger version: the optimization problem above has the strict saddle property, that is, the first- and second-order necessary optimality conditions are sufficient for global optimality.
Theorem 1 (Benign landscape) For each \(X \in \On\), the following are equivalent:
- \(X\) satisfies the first- and second-order necessary optimality conditions for Eq. 1 (that is, \(X\) is second-order critical);
- \(X\) is a local maximizer of Eq. 1;
- \(X\) is a global maximizer of Eq. 1;
- \(X\) is a signed permutation matrix.
The proof is in three steps:
- First, spell out the necessary optimality conditions (this is mechanical).
- Second, argue that if \(X\) satisfies those conditions then at least one of its entries is equal to \(+1\) or \(-1\). This involves a sequence of fruitful inequalities.
- Finally, extend that argument to peel off one row and one column of \(X\) at a time, deducing each time that there must be another entry equal to \(+1\) or \(-1\) among the remaining ones.
We will use notation as follows:
- \(X \odot Y\) denotes entrywise product.
- \(X^{(k)}\) denotes entrywise power \(k\), that is, \(X \odot \cdots \odot X\) (\(k\) times).
- \(\inner{A}{B} = \sum_{i,j} A_{ij} B_{ij}\) is the usual inner product for matrices.
- \(\frobnorm{A} = \sqrt{\inner{A}{A}}\) is the Frobenius norm.
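For reference, here is how this notation translates in Matlab, which is used for the numerical checks and experiments below (a small sketch):
% Notation in Matlab: entrywise product and power, inner product, Frobenius norm.
A = randn(4); B = randn(4);
hadamard = A.*B;                  % entrywise product of A and B
cubed    = A.^3;                  % entrywise third power of A
inner    = sum(sum(A.*B));        % <A, B>
frob     = sqrt(sum(sum(A.^2)));  % Frobenius norm, same as norm(A, 'fro')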
Necessary optimality conditions
Our problem amounts to minimizing \(f(X) = -\frac{1}{4}\|X\|_4^4\) for \(X\) restricted to the orthogonal group \(\On\). The first- and second-order necessary optimality conditions can be expressed in terms of the gradient and Hessian of \(f \colon \Rnn \to \reals\), as described in a previous post about optimality conditions on \(\On\). This is slightly tedious, but entirely mechanical.
Explicitly, let \[ f(X) = -\frac{1}{4}\|X\|_4^4 = -\frac{1}{4}\innersmall{X}{X^{(3)}}. \] The Euclidean gradient at \(X \in \Rnn\) and the Euclidean Hessian at \(X \in \Rnn\) along \(Z \in \Rnn\) are \[\begin{align*} \nabla f(X) & = -X^{(3)}, & \nabla^2 f(X)[Z] & = -3 X^{(2)} \odot Z. \end{align*}\] Plugging these formulas in Theorems 1 and 3 of the aforementioned post yields the following lemma.
First-order criticality states that the (Riemannian) gradient of \(f|_{\On}\) at \(X\) is zero. Second-order criticality states that the (Riemannian) Hessian of \(f|_{\On}\) at \(X\) is positive semidefinite; here, it is reworked in a way that incorporates first-order conditions too. Local maximizers satisfy both.
Lemma 1 The necessary optimality conditions are as follows: \(X \in \On\) is first-order critical if and only if \[\begin{align*} X^{(3)} = X(X^{(3)})\transpose X. \end{align*}\] It is second-order critical if and only if, in addition to the above, we also have \[\begin{align*} 0 & \leq \innersmall{ZZ\transpose X + XZ\transpose Z - 2ZX\transpose Z}{X^{(3)}} \\ & \qquad + 6\innersmall{X^{(2)}}{Z\odot(XZ\transpose X)} \\ & \qquad - 3\sqfrobnormsmall{X\odot Z} - 3\sqfrobnormsmall{X \odot (XZ\transpose X)} \end{align*}\] for all \(Z \in \Rnn\).
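Both the Euclidean formulas above and the first-order condition can be sanity-checked numerically. Here is a small Matlab sketch (not from the original post): it verifies the gradient and Hessian by finite differences, then checks that a random signed permutation matrix satisfies the first-order condition exactly, whereas a generic orthogonal matrix does not.
% Sketch: (a) finite-difference check of the Euclidean gradient and Hessian of
% f(X) = -(1/4)*||X||_4^4; (b) a signed permutation satisfies the first-order
% condition X^(3) = X*(X^(3))'*X exactly, unlike a generic orthogonal matrix.
n = 5;
X = randn(n); Z = randn(n);
f     = @(X) -0.25*sum(X(:).^4);
egrad = @(X) -X.^3;              % claimed Euclidean gradient
ehess = @(X, Z) -3*(X.^2).*Z;    % claimed Euclidean Hessian along Z
t = 1e-6;
gap1 = (f(X + t*Z) - f(X - t*Z))/(2*t) - sum(sum(egrad(X).*Z));
gap2 = norm((egrad(X + t*Z) - egrad(X - t*Z))/(2*t) - ehess(X, Z), 'fro');
fprintf('Gradient gap: %g, Hessian gap: %g\n', gap1, gap2);   % both tiny
P = eye(n); P = P(randperm(n), :) .* sign(randn(n, 1));       % signed permutation
[Q, ~] = qr(randn(n));                                        % random orthogonal
fprintf('First-order residual at P: %g\n', norm(P.^3 - P*(P.^3)'*P, 'fro'));  % 0
fprintf('First-order residual at Q: %g\n', norm(Q.^3 - Q*(Q.^3)'*Q, 'fro'));  % > 0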
Proof core: \(X\) has an entry equal to \(\pm 1\)
Since \(X\) is orthogonal, its entries lie in the interval \([-1, 1]\). We aim to show that if \(X\) is a second-order critical point of Eq. 1 then it is a signed permutation matrix. In particular, this entails showing that at least one entry of \(X\) is exactly \(+1\) or \(-1\). First, a short lemma.
Lemma 2 For all \(x \in \Rn\), we have \(\|x\|_4^4 \leq \|x\|_2^2 \|x\|_\infty^2\).
Proof. Let \(z = x \odot x\). Then, \(\|x\|_4^4 = \|z\|_2^2 = \inner{z}{z} \leq \|z\|_1 \|z\|_\infty = \|x\|_2^2 \|x\|_\infty^2\).
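A one-line numeric spot-check of Lemma 2 (a sketch):
% Spot-check of Lemma 2: ||x||_4^4 <= ||x||_2^2 * ||x||_inf^2.
x = randn(10, 1);
fprintf('%g <= %g\n', sum(x.^4), sum(x.^2)*max(abs(x))^2);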
The main part of the proof comes now. As usual in landscape analyses, a key step is to identify fruitful inequalities provided by the second-order criticality conditions. In this case, it so happens that we can simply use \(Z = e_i^{}e_j\transpose\) in Lemma 1 (that is, matrices \(Z\) with \(Z_{ij} = 1\) and all other entries equal to zero). The crux then is to exploit these inequalities appropriately.
Lemma 3 If \(X\) is second-order critical, then its largest entry in absolute value is \(\pm 1\).
Proof. Let \(i, j\) be such that \(|X_{ij}| = \max_{k,\ell} |X_{k\ell}|\). Note that \(|X_{ij}|\) is in \((0, 1]\) as \(X\) is orthogonal. Since \(X\) is second-order critical, we can let \(Z = e_i^{}e_j\transpose\) in Lemma 1 to deduce \[\begin{align} 0 & \leq \innersmall{e_i^{}e_i\transpose X + Xe_j^{}e_j\transpose - 2X_{ij}e_i^{}e_j\transpose}{X^{(3)}} \\ & \qquad + 6X_{ij}^4 - 3X_{ij}^2 - 3\sqfrobnormsmall{X \odot (X e_j^{}e_i\transpose X)} \\ & = \|X_{i:}\|_4^4 + \|X_{:j}\|_4^4 - 2X_{ij}^4 \\ & \qquad + 6X_{ij}^4 - 3X_{ij}^2 - 3\sum_{k\ell} (X_{k\ell} X_{kj} X_{i\ell})^2. \end{align} \qquad(2)\] We can now claim \(\|X_{i:}\|_4^4 \leq |X_{ij}|^2\). To see this, apply Lemma 2 with the facts that \(\|X_{i:}\|_2 = 1\) (since \(X\) is orthogonal) and \(\|X_{i:}\|_\infty = |X_{ij}|\) (owing to our choice of \((i, j)\)). Likewise, \(\|X_{:j}\|_4^4 \leq |X_{ij}|^2\). It follows that \[\begin{align} 0 & \leq 4X_{ij}^4 - X_{ij}^2 - 3\sum_{k\ell} (X_{k\ell} X_{kj} X_{i\ell})^2 \\ & \leq 4X_{ij}^4 - X_{ij}^2 - 3X_{ij}^6, \end{align}\] where in the last inequality we used that the sum is at least as large as the term \((k, \ell) = (i, j)\), that is, \(X_{ij}^6\). (This is rather crude, but we will bootstrap from here momentarily.)
Studying that last polynomial inequality reveals that \(|X_{ij}|\) is in the interval \([1/\sqrt{3}, 1]\). Indeed, writing \(t = X_{ij}^2 \in (0, 1]\), the inequality reads \(0 \leq -3t^3 + 4t^2 - t = -t\,(3t - 1)(t - 1)\), which forces \(t \geq 1/3\).
Let us now return to Eq. 2 and bound the sum more carefully. Notice that it is at least as large as the sum with \(k = i\) plus the sum with \(\ell = j\) minus the term \((k, \ell) = (i, j)\) (to make up for double-counting). Thus, \[\begin{align*} \sum_{k\ell} (X_{k\ell} X_{kj} X_{i\ell})^2 & \geq X_{ij}^2 \sum_\ell X_{i\ell}^4 + X_{ij}^2 \sum_k X_{kj}^4 - X_{ij}^6 \\ & = X_{ij}^2 (\|X_{i:}\|_4^4 + \|X_{:j}\|_4^4) - X_{ij}^6. \end{align*}\] Plugging this back into Eq. 2 reveals that \[\begin{align*} 0 \leq (1 - 3X_{ij}^2)(\|X_{i:}\|_4^4 + \|X_{:j}\|_4^4) + 4X_{ij}^4 - 3X_{ij}^2 + 3X_{ij}^6. \end{align*}\] From the earlier (crude) bound, we know that \(1-3X_{ij}^2 \leq 0\). Thus, we can use the bounds \(\|X_{i:}\|_4^4 \geq X_{ij}^4\) and \(\|X_{:j}\|_4^4 \geq X_{ij}^4\) to deduce that \[\begin{align*} 0 & \leq 2X_{ij}^4(1 - 3X_{ij}^2) + 4X_{ij}^4 - 3X_{ij}^2 + 3X_{ij}^6 \\ & = -3 X_{ij}^2 (X_{ij} - 1)^2 (X_{ij} + 1)^2 \\ & \leq 0. \end{align*}\] (To verify the factorization, expand: the middle line equals \(-3X_{ij}^6 + 6X_{ij}^4 - 3X_{ij}^2 = -3X_{ij}^2(X_{ij}^2 - 1)^2\).) This implies that \(X_{ij}\) is in \(\{-1, 0, 1\}\), and we know it is not zero. Therefore, \(|X_{ij}| = 1\), as announced.
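Both the specialization of the second-order condition to \(Z = e_i^{}e_j\transpose\) (Eq. 2) and the final factorization can be spot-checked numerically; here is a Matlab sketch (not from the original post):
% Sketch: (a) with Z = e_i*e_j', the general second-order expression of Lemma 1
% matches the specialized form in Eq. 2; (b) the factorization used at the end
% of the proof of Lemma 3 holds.
n = 6; i = 2; j = 4;
[X, ~] = qr(randn(n));                  % random orthogonal matrix
Z = zeros(n); Z(i, j) = 1;
X3 = X.^3; W = X*Z'*X;                  % W(k,l) = X(k,j)*X(i,l)
general = sum(sum((Z*Z'*X + X*Z'*Z - 2*Z*X'*Z).*X3)) ...
        + 6*sum(sum((X.^2).*(Z.*W))) ...
        - 3*norm(X.*Z, 'fro')^2 - 3*norm(X.*W, 'fro')^2;
special = sum(X(i,:).^4) + sum(X(:,j).^4) + 4*X(i,j)^4 - 3*X(i,j)^2 ...
        - 3*sum(sum((X.*W).^2));
fprintf('Eq. 2 gap: %g\n', general - special);   % ~1e-15
t = 2*rand - 1;                         % stand-in for X_ij
lhs = 2*t^4*(1 - 3*t^2) + 4*t^4 - 3*t^2 + 3*t^6;
rhs = -3*t^2*(t - 1)^2*(t + 1)^2;
fprintf('Factorization gap: %g\n', lhs - rhs);   % ~0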
Proof finish: peeling rows one at a time
Consider the proof of Lemma 3. The assumption that \(|X_{ij}|\) is maximal over \((i, j)\) is used in exactly one way: to claim \[ \|X_{i:}\|_\infty = \|X_{:j}\|_\infty = |X_{ij}|. \] That is enough to conclude \(X_{ij} = \pm 1\).
Thus, first select \((i, j)\) so that \(|X_{ij}|\) is maximal, and deduce \(X_{ij} = \pm 1\). It also follows that all other entries on row \(i\) and column \(j\) are zero (since \(X\) is orthogonal).
Now, let \((k, \ell) \neq (i, j)\) be such that \(|X_{k\ell}|\) is maximal. Necessarily, \(k \neq i\) and \(\ell \neq j\): the other entries of row \(i\) and column \(j\) are zero, while the remaining \((n-1)\times(n-1)\) block of \(X\) is itself orthogonal, hence not identically zero. Therefore, we still have \(\|X_{k:}\|_\infty = \|X_{:\ell}\|_\infty = |X_{k\ell}|\). It follows that \(X_{k\ell} = \pm1\) and all other entries on row \(k\) and column \(\ell\) are zero.
Repeat this process to deduce that \(X\) has \(n\) entries equal to \(\pm 1\), with each row and each column containing exactly one of them.
This shows that all second-order critical points are signed permutation matrices. Since \(\On\) is compact, the maximum of Eq. 1 is attained, necessarily at a second-order critical point, hence at a signed permutation matrix. As all signed permutation matrices attain the same value (namely \(n\)), they are all globally optimal. Thus, the second-order critical points coincide with the global maximizers (and with the signed permutation matrices).
This concludes the proof of Theorem 1.
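As a numerical sanity check of Theorem 1 (a sketch, using the Manopt toolbox as in the experiment further below), one can run a Riemannian solver on Eq. 1 and observe that the returned matrix rounds to a signed permutation:
% Sketch: numerically maximize ||X||_4^4 over SO(n) with Manopt; by Theorem 1,
% the solver should return (approximately) a signed permutation matrix.
n = 5;
problem.M = rotationsfactory(n);
problem.cost  = @(X) -.25*sum(X(:).^4);
problem.egrad = @(X) -X.^3;
X = trustregions(problem);
disp(round(X));                                           % expect a signed permutation
fprintf('Rounding gap: %g\n', norm(X - round(X), 'fro'));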
Open question
Going back to the dictionary learning context in (Zhai et al. 2020), it would be interesting to understand the landscape of \[ \max_{X \in \On} \|XB\|_4^4, \qquad(3)\] where \(B \in \reals^{n \times p}\) is some data matrix.
Based on numerical experiments, it seems credible that this is benign too (but I’ve been hurt before). For example, running the following code in Matlab with the Manopt toolbox, I was not able to contradict the hypothesis.
clear; clc;
n = 5;
m = 12;
B = randn(n, m); % maybe try 'adversarial' data?
% Get Manopt from manopt.org to optimize
% over the special orthogonal group SO(n).
problem.M = rotationsfactory(n);
% Specify the cost function to *minimize*.
nrm44 = @(M) sum(M(:).^4);
problem.cost = @(X) -.25*nrm44(X*B);
% Also provide the (Euclidean) gradient.
problem.egrad = @(X) -(X*B).^3 * B.';
% Minimize f starting from some random initial point.
[X, fX] = trustregions(problem);
% Do it again, from another random point.
[Y, fY] = trustregions(problem);
% The two attained function values
% appear to be the same. Always?
fprintf('\n');
fprintf('Attained function values:\n\t%g\n\t%g\n', fX, fY);
fprintf('Difference: %g\n', fX-fY);
% The relative rotation between X and Y seems to
% be a signed permutation. Always?
fprintf('\n');
fprintf('Relative rotation X*Y.'':\n');
disp(X*Y.')
Here is an example output:
Attained function values:
-62.419
-62.419
Difference: 4.9738e-14
Relative rotation X*Y.':
-0.0000 -1.0000 0.0000 0.0000 0.0000
0.0000 0.0000 1.0000 0.0000 0.0000
-1.0000 0.0000 0.0000 0.0000 -0.0000
-0.0000 0.0000 -0.0000 0.0000 1.0000
-0.0000 -0.0000 0.0000 -1.0000 0.0000
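To automate that last observation, one could append a check along these lines to the script above (a hypothetical addition reusing the variables X, Y and n from the script; not part of the original):
% Hypothetical follow-up: round the relative rotation and verify that it is
% (numerically) a signed permutation matrix.
P = round(X*Y.');
ok = isequal(sum(abs(P), 2), ones(n, 1)) && ...   % one nonzero per row
     isequal(sum(abs(P), 1), ones(1, n)) && ...   % one nonzero per column
     norm(X*Y.' - P, 'fro') < 1e-6;               % and X*Y' is close to it
fprintf('Relative rotation is a signed permutation: %d\n', ok);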
As a side note, for the particular case where \(B = (1, 0, 0, \ldots, 0)^\top\), the optimization problem reduces essentially to \(\max_{x \in \Rn : \|x\|_2 = 1} \|x\|_4^4\) (maximizing the 4-norm over a sphere). This case is handled, for example, in (Ge et al. 2015, sec. C.1).
References
Ge, Rong, Furong Huang, Chi Jin, and Yang Yuan. 2015. “Escaping from Saddle Points – Online Stochastic Gradient for Tensor Decomposition.” In Proceedings of the 28th Conference on Learning Theory (COLT).
Zhai, Yuexiang, Zitong Yang, Zhenyu Liao, John Wright, and Yi Ma. 2020. “Complete Dictionary Learning via ℓ4-Norm Maximization over the Orthogonal Group.” Journal of Machine Learning Research.
Citation
@online{boumal2024,
author = {Boumal, Nicolas},
title = {Maximizing the 4-Norm over Orthogonal Matrices Has a Benign
Landscape},
date = {2024-12-30},
url = {racetothebottom.xyz/posts/four-norm-orthogonal/},
langid = {en},
  abstract = {Consider maximizing $\sum_{i,j} X_{ij}^4$ for $X$ orthogonal.
    This problem has the strict saddle property, that is, its second-order
    critical points are globally optimal.}
}