Norm of linear combination of unit norm vectors: benign up and down

Author

Nicolas Boumal

Published

January 19, 2025

Abstract
Some nonconvex functions are easy to minimize but not to maximize, or the other way around. Here is one more example where both are easy.

Fix a vector of coefficients \(c \in \Rn\). Consider the following function, defined for \(n\) points \(x_1, \ldots, x_n\) on the unit sphere \(\Sd = \{ x \in \Rd : \|x\|^2 = 1 \}\), where \(\|x\| = \sqrt{x\transpose x}\) is the 2-norm: \[ f(x_1, \ldots, x_n) = \| c_1 x_1 + \cdots + c_n x_n \|^2. \qquad(1)\] Maximizing or minimizing this function is a nonconvex problem because the domain (a product of spheres) is nonconvex. We are interested in the landscape of \(f\). As we shall see, the nonconvexity is benign both for minimization and for maximization.
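
To make the setup concrete, here is a minimal numerical sketch (Python with NumPy; the function name and dimensions are arbitrary choices, not from the post) that evaluates Eq. 1 at a random feasible point.

```python
import numpy as np

def f(X, c):
    """Evaluate f(x_1, ..., x_n) = ||c_1 x_1 + ... + c_n x_n||^2,
    where the x_i are the (unit-norm) columns of X."""
    return np.linalg.norm(X @ c) ** 2

rng = np.random.default_rng(0)
d, n = 3, 5                              # arbitrary dimensions
c = rng.standard_normal(n)
X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)           # normalize each column onto the unit sphere
print(f(X, c))
```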

Particular case

Let \(c_1 = \cdots = c_n = 1\) as a special case, so \[ f(x_1, \ldots, x_n) = \|x_1 + \cdots + x_n\|^2 = \sum_{ij} x_i\transpose x_j^{}. \] Then,

  • The minimizers of \(f\) are exactly the tuples of \(n\) unit vectors that sum to zero. This comes up in some applications; see for example (Tang and Toh 2024).
  • The maximizers of \(f\) are the tuples of \(n\) identical unit vectors.

The latter is relevant in the study of Kuramoto oscillator networks. Indeed, with \(d = 2\), \(\Sd\) is a circle and so each point \(x_i\) corresponds to an angle \(\theta_i\). Explicitly, \(x_i = (\cos \theta_i, \sin \theta_i)\) and we have \[ f(x_1, \ldots, x_n) = \sum_{ij} \cos(\theta_i - \theta_j). \] Then, gradient flow (upwards) on \(f\) coincides with the (simplified) dynamics of the Kuramoto oscillator network with a complete graph. This is well known to be benign.
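
As a quick illustration of this benign maximization landscape (a rough sketch, not part of the argument; step size and iteration count are arbitrary): plain gradient ascent on the angles, which is exactly the simplified Kuramoto dynamics on the complete graph, typically drives a random configuration to a synchronized state with \(f = n^2\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
theta = rng.uniform(0, 2 * np.pi, n)     # random initial angles on the circle

def value(theta):
    # f = sum_{ij} cos(theta_i - theta_j) = ||sum_i (cos theta_i, sin theta_i)||^2
    return np.cos(theta[:, None] - theta[None, :]).sum()

eta = 0.01                                # arbitrary step size
for _ in range(2000):
    grad = -2 * np.sin(theta[:, None] - theta[None, :]).sum(axis=1)
    theta += eta * grad                   # gradient ascent = simplified Kuramoto flow

print(value(theta), n ** 2)               # typically both close to n^2 (synchronized)
```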

More generally, we might consider variations of \(f\) where the inner products \(x_i\transpose x_j^{}\) go through a nonlinearity. Then, depending on the scenario, it may be that \(f\) is benign to maximize but hard to minimize (think “packing on a sphere”). You can read more in (Criscitiello et al. 2024).

The fact

We are about to prove the following fact:

Theorem 1 (Benign landscape up and down) Fix \(d \geq 2\) and \(c \in \Rn\). The landscape of \(f\) (Eq. 1) is benign in the sense that second-order necessary optimality conditions are also sufficient for global optimality. This is true both for minimizing and for maximizing \(f\).

Note that if some coefficients \(c_i\) are zero, then the corresponding \(x_i\) play no role in \(f\). Thus:

Without loss of generality, we assume \(c_i \neq 0\) for all \(i\).

Another way to state Theorem 1 is that if \((x_1, \ldots, x_n)\) is a critical point of \(f\) that is neither a global minimizer nor a global maximizer, then the (Riemannian) Hessian of \(f\) at \((x_1, \ldots, x_n)\) has at least one positive and one negative eigenvalue.

Necessary optimality conditions

Concisely, the cost function can be written as \[ f(X) = \|X c\|^2 = \inner{X\transpose X}{cc\transpose}, \] where \(X = \Big[ x_1, \ldots, x_n \Big] \in \Rdn\) has unit-norm columns, and \(\inner{A}{B} = \trace(A\transpose B)\) is the usual inner product over matrix spaces.

It is slightly tedious but straightforward to derive the usual first- and second-order necessary optimality conditions for minimizing or maximizing \(f\). The first-order conditions state that the (Riemannian) gradient of \(f\) is zero. The second-order conditions state that the (Riemannian) Hessian of \(f\) is positive semidefinite (if we are minimizing \(f\)) or negative semidefinite (if we are maximizing \(f\)).

These conditions are best expressed after introducing a symmetric matrix \(S\) of size \(n\). Let \[ A = cc\transpose, \] and define \[ S = A - \ddiag(AX\transpose X), \qquad(2)\] where \(\ddiag \colon \Rnn \to \Rnn\) zeroes out the off-diagonal entries of a matrix.
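
As a small sanity check (a Python sketch; helper names are my own), the matrix form \(f(X) = \innersmall{X\transpose X}{cc\transpose}\) and the definition of \(S\) in Eq. 2 translate directly into code.

```python
import numpy as np

def S_matrix(X, c):
    """S = A - ddiag(A X^T X) with A = c c^T (Eq. 2)."""
    A = np.outer(c, c)
    return A - np.diag(np.diag(A @ X.T @ X))

rng = np.random.default_rng(2)
d, n = 3, 5
c = rng.standard_normal(n)
X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)            # unit-norm columns

S = S_matrix(X, c)
print(np.allclose(S, S.T))                # True: S is symmetric

# f(X) = ||Xc||^2 agrees with the matrix form <X^T X, c c^T>.
print(np.linalg.norm(X @ c) ** 2, np.trace(np.outer(c, c) @ X.T @ X))
```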

Lemma 1 A matrix \(X \in \Rdn\) with unit-norm columns is first-order critical for \(f\) if and only if \[ XS = 0. \] Such a matrix is second-order critical for minimizing \(f\) if and only if \[ \innersmall{S}{\dot X\transpose \dot X} \geq 0 \] for all \(\dot X \in \Rdn\) such that \(\diag(X\transpose \dot X) = 0\). Likewise, when maximizing \(f\), flip the inequality to read \(\innersmall{S}{\dot X\transpose \dot X} \leq 0\) instead.

Proof. Work out the Riemannian gradient and Hessian of \(f \colon (\Sd)^n \to \reals\). See for example (Boumal, Voroninski, and Bandeira 2020, sec. 2.1) for details in a more general context.
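
The following sketch checks the first-order condition numerically: projecting the Euclidean gradient \(2XA\) onto the tangent space of the product of spheres yields \(2XS\), so the Riemannian gradient vanishes exactly when \(XS = 0\). (This is a sanity check of the computation behind Lemma 1, not part of the proof.)

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 6
c = rng.standard_normal(n)
A = np.outer(c, c)
X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)            # a random (generally non-critical) point

S = A - np.diag(np.diag(A @ X.T @ X))     # Eq. 2

# Riemannian gradient: project the Euclidean gradient 2XA onto the tangent space,
# column by column (remove from each column its component along the corresponding x_i).
G = 2 * X @ A
rgrad = G - X * np.sum(X * G, axis=0)

print(np.linalg.norm(rgrad - 2 * X @ S))  # ~1e-15: the projected gradient equals 2XS
```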

Second order and rank deficient is optimal

It is well known (from a more general setting) that if \(X\) is second-order critical and rank deficient, then it is globally optimal. (That sentence applies both for minimizing and for maximizing \(f\).) The proof below, split into two lemmas, follows a general argument by Journée et al. (2010).

Lemma 2 If \(X\) is first-order critical, then \(f(\hat X) - f(X) = \innersmall{\hat X\transpose \hat X}{S}\) for all feasible \(\hat X\), with \(S\) as in Eq. 2.

Proof. Start with the definitions of \(f\) and \(S\), then use the fact that \(\hat X\transpose \hat X\) and \(X\transpose X\) have the same diagonal (all ones) to get started as follows: \[\begin{align*} f(\hat X) - f(X) & = \innersmall{\hat X\transpose \hat X - X\transpose X}{A} \\ & = \innersmall{\hat X\transpose \hat X - X\transpose X}{S + \ddiag(AX\transpose X)} \\ & = \innersmall{\hat X\transpose \hat X - X\transpose X}{S} + 0 \\ & = \innersmall{\hat X\transpose \hat X}{S}. \end{align*}\] To reach the last line, use \(XS = 0\) since \(X\) is critical (Lemma 1).
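
To illustrate Lemma 2 (a numerical aside, not needed for the proof): one can check that the point whose columns all equal the same unit vector is first-order critical for any \(c\), and the identity of the lemma can then be verified against random feasible \(\hat X\).

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 5
c = rng.standard_normal(n)
A = np.outer(c, c)

def f(X):
    return np.linalg.norm(X @ c) ** 2

# A first-order critical point: all columns equal to the same unit vector x.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
X = np.tile(x[:, None], (1, n))
S = A - np.diag(np.diag(A @ X.T @ X))     # Eq. 2
print(np.linalg.norm(X @ S))              # ~0: X is first-order critical

# Lemma 2: f(Xhat) - f(X) = <Xhat^T Xhat, S> for any feasible Xhat.
Xhat = rng.standard_normal((d, n))
Xhat /= np.linalg.norm(Xhat, axis=0)
print(f(Xhat) - f(X), np.trace(Xhat.T @ Xhat @ S))   # the two numbers agree
```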

Lemma 3 If \(X\) is second-order critical for \(f\) and \(\rank(X) < d\), then \(X\) is a global optimizer for \(f\).

Proof. Since \(\rank(X) < d\), there exists a unit-norm vector \(z \in \Rd\) such that \(X\transpose z = 0\). Thus, for all \(v \in \Rn\), the matrix \(\dot X = z v\transpose\) satisfies \(\diag(X\transpose \dot X) = 0\). (In words: \(z\) is tangent to the sphere at all points \(x_1, \ldots, x_n\).)

If we are minimizing \(f\), it follows from Lemma 1 that \[ 0 \leq \innersmall{S}{\dot X\transpose \dot X} = \innersmall{S}{vv\transpose} = v\transpose S v. \] Therefore, \(S\) is positive semidefinite. Lemma 2 then confirms that \(X\) is a global minimizer for \(f\).

If we are maximizing \(f\), Lemma 1 reveals \(S\) is negative semidefinite, and therefore Lemma 2 confirms that \(X\) is a global maximizer of \(f\).

Exploiting the rank-1 cost matrix

From Lemma 3, we know that rank-deficient second-order critical points are fine. The next lemma shows that the full-rank critical points are special in their own way. This exploits the fact that the cost matrix \(A = cc\transpose\) is rank one.

Lemma 4 If \(X\) is first-order critical for \(f\), then \(\rank(X) = 1\) or \(Xc = 0\) (or both).

Proof. Lemma 1’s first-order condition \(XS = 0\) (equivalently, \(SX\transpose = 0\), since \(S\) is symmetric) implies that \(S\) has a kernel of dimension at least as large as the rank of \(X\), that is, \[\begin{align*} \rank(X) \leq \dim \ker(S) & = n - \rank(S) \\ & = n - \rank(A - \ddiag(AX\transpose X)) \\ & \leq \rank(A) + n - \rank(\ddiag(AX\transpose X)), \end{align*}\] where we used \(\rank(A+B) \geq \rank(B) - \rank(A)\). (That first step is similar to (Wen and Yin 2013, Thm. 3) and (Boumal, Voroninski, and Bandeira 2020, Lem. 3.3).)

There are two scenarios to entertain:

  • Either \(\rank(\ddiag(AX\transpose X)) = n\), in which case \(\rank(X) \leq \rank(A) = 1\) (because \(A = cc^\top\)) and we are done.
  • Or \(\rank(\ddiag(AX\transpose X)) < n\), in which case there exists an index \(i\) such that \((AX\transpose X)_{ii} = 0\). Returning to the first-order condition \(XS = 0\) and looking at the \(i\)th column specifically, it follows (with \(A = cc\transpose\) again) that \[\begin{align*} 0 = (XS)_{:i} & = (XA)_{:i} - (X\ddiag(AX\transpose X))_{:i} \\ & = c_i(Xc) - (AX\transpose X)_{ii} x_i\\ & = c_i(Xc). \end{align*}\] We assumed in the very beginning that \(c_i \neq 0\). Thus, \(Xc = 0\); this branch is illustrated numerically below.
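
Here is a concrete instance of the second branch (a small numerical aside): with \(c = (1, 1, 1)\) and \(d = 2\), three unit vectors at angles \(0\), \(2\pi/3\), \(4\pi/3\) sum to zero, so they form a critical point of rank 2 with \(Xc = 0\).

```python
import numpy as np

# Three unit vectors in the plane at angles 0, 2*pi/3, 4*pi/3, with c = (1, 1, 1).
angles = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
X = np.vstack([np.cos(angles), np.sin(angles)])       # d = 2, n = 3
c = np.ones(3)
A = np.outer(c, c)
S = A - np.diag(np.diag(A @ X.T @ X))                 # Eq. 2

print(np.linalg.matrix_rank(X))                       # 2: not rank one
print(np.linalg.norm(X @ S), np.linalg.norm(X @ c))   # both ~0: critical, and Xc = 0
```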

Finishing the proof

We can now prove Theorem 1.

Proof. First, say we are minimizing \(f\). Let \(X\) be second-order critical. Then, Lemma 4 offers two (not mutually exclusive) possibilities: \(\rank(X) = 1 < d\) (in which case Lemma 3 implies that \(X\) is a global minimizer for \(f\)), or \(Xc = 0\) (in which case \(f(X) = \|Xc\|^2 = 0\) and so \(X\) is a global minimizer for \(f\)).

Now say we are maximizing \(f\). Let \(X\) be second-order critical, and apply Lemma 4 again. If \(\rank(X) = 1 < d\), we conclude with Lemma 3 as before. If not, then \(Xc = 0\) and so \(S = cc^\top\) (Eq. 2). Let us show that this cannot happen. Pick any index \(i\) with \(c_i \neq 0\) and a unit-norm vector \(u \in \Rd\) orthogonal to \(x_i\) (this exists since \(d \geq 2\)). Then, \(\dot X = u e_i\transpose\) indeed satisfies \(\diag(X\transpose \dot X) = 0\). Therefore, the second-order conditions from Lemma 1 imply \[ 0 \geq \innersmall{S}{\dot X \transpose \dot X} = \innersmall{cc\transpose}{e_i^{}e_i\transpose} = c_i^2 \geq 0. \] This shows \(c_i = 0\): a contradiction.
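
To close, here is a rough numerical illustration of Theorem 1 (a sketch with arbitrary step size, iteration budget, and dimensions; not a careful solver). Projected Riemannian gradient iterations from a random start typically reach the global maximum value \((\sum_i |c_i|)^2\) (attained by rank-one configurations \(x_i = \mathrm{sign}(c_i)\, u\)) when ascending, and, for \(c = (1, \ldots, 1)\), the global minimum value \(0\) (attained by tuples summing to zero) when descending.

```python
import numpy as np

def optimize(c, d, maximize, iters=5000, seed=0):
    """Projected Riemannian gradient ascent/descent for f(X) = ||Xc||^2 over
    matrices X with unit-norm columns. A rough sketch with a fixed step size."""
    rng = np.random.default_rng(seed)
    n = c.size
    A = np.outer(c, c)
    X = rng.standard_normal((d, n))
    X /= np.linalg.norm(X, axis=0)
    step = 0.1 / (c ** 2).sum()                   # arbitrary; small enough in practice
    for _ in range(iters):
        S = A - np.diag(np.diag(A @ X.T @ X))     # Eq. 2
        G = 2 * X @ S                             # Riemannian gradient
        X = X + (step if maximize else -step) * G
        X /= np.linalg.norm(X, axis=0)            # retract back onto the spheres
    return np.linalg.norm(X @ c) ** 2

rng = np.random.default_rng(5)
c = rng.standard_normal(6)
print(optimize(c, d=3, maximize=True), np.abs(c).sum() ** 2)   # typically equal: global max
c1 = np.ones(5)
print(optimize(c1, d=3, maximize=False), 0.0)                  # typically ~0: global min
```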

References

Boumal, N., V. Voroninski, and A. S. Bandeira. 2020. “Deterministic Guarantees for Burer-Monteiro Factorizations of Smooth Semidefinite Programs.” Communications on Pure and Applied Mathematics 73 (3): 581–608. https://doi.org/10.1002/cpa.21830.
Criscitiello, C., Q. Rebjock, A. D. McRae, and N. Boumal. 2024. “Synchronization on Circles and Spheres with Nonlinear Interactions.” arXiv 2405.18273. https://arxiv.org/abs/2405.18273.
Journée, M., F. Bach, P.-A. Absil, and R. Sepulchre. 2010. “Low-Rank Optimization on the Cone of Positive Semidefinite Matrices.” SIAM Journal on Optimization 20 (5): 2327–51. https://doi.org/10.1137/080731359.
Tang, T., and K.-C. Toh. 2024. “Solving Graph Equipartition SDPs on an Algebraic Variety.” Mathematical Programming 204 (1–2): 299–347. https://doi.org/10.1007/s10107-023-01952-6.
Wen, Z., and W. Yin. 2013. “A Feasible Method for Optimization with Orthogonality Constraints.” Mathematical Programming 142 (1–2): 397–434. https://doi.org/10.1007/s10107-012-0584-1.

Citation

BibTeX citation:
@online{boumal2025,
  author = {Boumal, Nicolas},
  title = {Norm of Linear Combination of Unit Norm Vectors: Benign up
    and Down},
  date = {2025-01-19},
  url = {www.racetothebottom.xyz/posts/sum-sphere-norm/},
  langid = {en},
  abstract = {Some nonconvex functions are easy to minimize but not to
    maximize, or the other way around. Here is one more example where
    both are easy.}
}