Several interesting non-convex optimization problems are of the form \[ \min_{X \in \Rnn} f(X) \qquad \textrm{ subject to } \qquad X \in \On, \qquad(1)\] where \(\On = \{ X \in \Rnn : X\transpose X = I_n \}\) is the set of orthogonal matrices and \(f \colon \Rnn \to \reals\) is twice continuously differentiable.
Everything here applies verbatim if we replace \(\On\) with the group of rotations, that is, \(\SOn = \{ X \in \Rnn : X\transpose X = I_n \textrm{ and } \det(X) = +1\}\).
First- and second-order necessary optimality conditions for such problems are well known (Edelman, Arias, and Smith 1998). We spell them out as they are likely to be useful in future posts.
As we proceed, it is useful to define the following standard inner product over \(\Rnn\): \[\inner{A}{B} = \trace(A\transpose B).\] The tangent space to \(\On\) at \(X\) is the subspace of \(\Rnn\) defined by \[\T_X\On = \{ X\Omega : \Omega + \Omega\transpose = 0 \}.\] The orthogonal projector from \(\Rnn\) to \(\T_X\On\) is \[\Proj_X(Z) = X\skeww(X\transpose Z),\] where \(\skeww(M) = (M-M\transpose)/2\) extracts the skew-symmetric part of a matrix.
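To make these objects concrete, here is a minimal numerical sketch in Python with NumPy (not part of the original material; the helper names `skew` and `proj` are ours). It builds the projector and checks that its output lies in \(\T_X\On\), that is, that \(X\transpose \dot X\) is skew-symmetric, and that projecting a tangent vector leaves it unchanged.

```python
import numpy as np

def skew(M):
    """Skew-symmetric part of a matrix: (M - M^T) / 2."""
    return (M - M.T) / 2

def proj(X, Z):
    """Orthogonal projection of Z onto the tangent space to O(n) at X."""
    return X @ skew(X.T @ Z)

rng = np.random.default_rng(0)
n = 5
X, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix
Z = rng.standard_normal((n, n))                   # arbitrary ambient matrix

Xdot = proj(X, Z)
print(np.allclose(X.T @ Xdot + Xdot.T @ X, 0))    # True: X^T Xdot is skew-symmetric
print(np.allclose(proj(X, Xdot), Xdot))           # True: projecting twice changes nothing
```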
Below, \(\nabla f\) and \(\nabla^2 f\) denote the Euclidean gradient and Hessian of \(f\), viewed as a function from \(\Rnn\) to \(\reals\). Inside proofs, we also write \(\grad f\) and \(\Hess f\) to denote the Riemannian gradient and Hessian of \(f|_{\On}\) (restricted to \(\On\)), where \(\On\) is equipped with the Riemannian submanifold structure inherited from the above inner product on \(\Rnn\). These considerations only appear in the proofs, because the notions of first- and second-order critical point do not depend on any choice of metric (Boumal 2023, see §4.2, §6.1). Stated differently, the theorem statements are independent of any Riemannian considerations, but it’s helpful to use Riemannian concepts to prove the theorems. See (Boumal 2023, sec. 7.4) for some background on the Riemannian tools for \(\On\).
Theorem 1 (First-order) If \(X \in \On\) is a local minimum for Eq. 1, then \(\nabla f(X) = X \nabla f(X)\transpose X\).
Proof. The claim holds because, at a local minimum, the Riemannian gradient is zero. The Riemannian gradient of \(f|_{\On}\) is \(\grad f(X) = \Proj_X(\nabla f(X)) = X\skeww(X\transpose \nabla f(X))\). This vanishes if and only if \(X\transpose \nabla f(X)\) is a symmetric matrix, that is, \(X\transpose \nabla f(X) = \nabla f(X)\transpose X\); multiply on the left by \(X\) to obtain the stated condition.
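As a quick illustration of Theorem 1 (a sketch, not part of the original post), take the linear cost \(f(X) = \trace(A\transpose X)\), whose Euclidean gradient is simply \(\nabla f(X) = A\). A global minimizer over \(\On\) is \(X^\star = -UV\transpose\), where \(A = U\Sigma V\transpose\) is a singular value decomposition. The snippet below (Python/NumPy; the names `A`, `skew`, `rgrad`, `Xstar` are ours) checks that the Riemannian gradient vanishes there and that the first-order condition holds.

```python
import numpy as np

# Illustrative cost: f(X) = trace(A^T X), so nabla f(X) = A.
rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))

skew = lambda M: (M - M.T) / 2
rgrad = lambda X: X @ skew(X.T @ A)   # Riemannian gradient X skew(X^T nabla f(X))

# A global minimizer over O(n) is X* = -U V^T, where A = U S V^T is an SVD.
U, _, Vt = np.linalg.svd(A)
Xstar = -U @ Vt

# At X*, X*^T A = -V S V^T is symmetric, so the Riemannian gradient vanishes
# and the condition of Theorem 1, nabla f(X) = X nabla f(X)^T X, holds.
print(np.allclose(rgrad(Xstar), 0))             # True
print(np.allclose(A, Xstar @ A.T @ Xstar))      # True
```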
Theorem 2 (Second-order) If \(X \in \On\) is a local minimum for Eq. 1, then \[ \innerbig{\dot X}{\nabla^2 f(X)[\dot X]} + \innerbig{\dot X X\transpose \dot X}{\nabla f(X)} \geq 0 \] for all \(\dot X \in \T_X\On\).
Proof. The claim holds because, at a local minimum, the Riemannian Hessian is positive semidefinite. Let’s make this explicit. Standard results in Riemannian optimization give, for all tangent \(\dot X\), \[ \Hess f(X)[\dot X] = \Proj_X(\D\grad f(X)[\dot X]), \] where \(\D\) denotes a standard directional derivative. Since \(\dot X\) is tangent and \(\Proj_X\) is self-adjoint, the projector has no effect on the quadratic form: \[ 0 \leq \innersmall{\dot X}{\Hess f(X)[\dot X]} = \innersmall{\dot X}{\D\grad f(X)[\dot X]}. \] From the previous proof, we had \(\grad f(X) = X\skeww(X\transpose \nabla f(X))\). Thus, \[ \D\grad f(X)[\dot X] = \dot X \skeww\big(X\transpose \nabla f(X)\big) + X \skeww\big(\dot X\transpose \nabla f(X) + X\transpose \nabla^2 f(X)[\dot X]\big). \] When we plug this back into the quadratic form, two things happen:
- The term \(\dot X \skeww(X\transpose \nabla f(X))\) vanishes: its contribution is \(\innersmall{\dot X\transpose \dot X}{\skeww(X\transpose \nabla f(X))}\), the inner product of a symmetric matrix with a skew-symmetric one, which is zero.
- In the other term, the \(\skeww\) can be dropped: that term contributes \(\innersmall{X\transpose \dot X}{\skeww\big(\dot X\transpose \nabla f(X) + X\transpose \nabla^2 f(X)[\dot X]\big)}\), and since \(X\transpose \dot X\) is already skew-symmetric, its inner product with the symmetric part of that matrix vanishes.
Thus, \[ 0 \leq \innerbig{X\transpose \dot X}{\dot X\transpose \nabla f(X) + X\transpose \nabla^2 f(X)[\dot X]}. \] To finish, note that \(\innersmall{X\transpose \dot X}{X\transpose \nabla^2 f(X)[\dot X]} = \innersmall{\dot X}{\nabla^2 f(X)[\dot X]}\) (since \(XX\transpose = I_n\)) and \(\innersmall{X\transpose \dot X}{\dot X\transpose \nabla f(X)} = \trace(\dot X\transpose X \dot X\transpose \nabla f(X)) = \innersmall{\dot X X\transpose \dot X}{\nabla f(X)}\).
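Notice that the chain of equalities above does not use criticality: for any \(X \in \On\) and any tangent \(\dot X\), we have \(\innersmall{\dot X}{\D\grad f(X)[\dot X]} = \innersmall{\dot X}{\nabla^2 f(X)[\dot X]} + \innersmall{\dot X X\transpose \dot X}{\nabla f(X)}\). This is easy to check numerically. The sketch below (Python/NumPy, not from the original post) uses the assumed test cost \(f(X) = \tfrac12\|X - B\|_F^2\), for which \(\nabla f(X) = X - B\) and \(\nabla^2 f(X)[\dot X] = \dot X\), and compares the right-hand side with a central finite difference of the ambient gradient formula \(X \mapsto X\skeww(X\transpose \nabla f(X))\).

```python
import numpy as np

# Assumed test cost: f(X) = 0.5 * ||X - B||_F^2,
# so nabla f(X) = X - B and nabla^2 f(X)[Xdot] = Xdot.
rng = np.random.default_rng(2)
n = 5
B = rng.standard_normal((n, n))

skew = lambda M: (M - M.T) / 2
egrad = lambda X: X - B
ehess = lambda X, Xdot: Xdot
rgrad = lambda X: X @ skew(X.T @ egrad(X))   # ambient extension of the Riemannian gradient

# Random point on O(n) and random tangent vector Xdot = X Omega with Omega skew.
X, _ = np.linalg.qr(rng.standard_normal((n, n)))
Xdot = X @ skew(rng.standard_normal((n, n)))

# Left: <Xdot, D rgrad(X)[Xdot]> via a central finite difference of the gradient formula.
t = 1e-5
lhs = np.sum(Xdot * (rgrad(X + t * Xdot) - rgrad(X - t * Xdot)) / (2 * t))

# Right: the quadratic form appearing in Theorem 2.
rhs = np.sum(Xdot * ehess(X, Xdot)) + np.sum((Xdot @ X.T @ Xdot) * egrad(X))

print(abs(lhs - rhs))   # close to zero (finite-difference error only)
```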
Sometimes, it is convenient to extract from the two conditions above a set of inequalities that hold for all \(Z \in \Rnn\), and not just for all \(\dot X \in \T_X\On\). Here is one way to do just that.
Theorem 3 (Second-order, bis) If \(X \in \On\) is a local minimum for Eq. 1, then \[ \innerbig{Z - XZ\transpose X}{\nabla^2 f(X)[Z - XZ\transpose X]} + \innerbig{2ZX\transpose Z - XZ\transpose Z - ZZ\transpose X}{\nabla f(X)} \geq 0 \] for all \(Z \in \Rnn\).
Proof. This holds because, for every \(Z \in \Rnn\), the matrix \[ \dot X = \Proj_X(2Z) = 2X\skeww(X\transpose Z) = Z - XZ\transpose X \] is a valid tangent vector. (Conversely, every tangent vector can be written in this form, so nothing is lost.) Plug this \(\dot X\) into the second-order optimality conditions of Theorem 2 to obtain: \[ \innerbig{Z - XZ\transpose X}{\nabla^2 f(X)[Z - XZ\transpose X]} + \innerbig{(Z - XZ\transpose X) (X\transpose Z - Z\transpose X)}{\nabla f(X)} \geq 0 \] for all \(Z \in \Rnn\). Expand: \[ (Z - XZ\transpose X) (X\transpose Z - Z\transpose X) = ZX\transpose Z + XZ\transpose X Z\transpose X - XZ\transpose Z - ZZ\transpose X. \] We can simplify the second term because first-order criticality also holds. Indeed, \[ \innerbig{XZ\transpose X Z\transpose X}{\nabla f(X)} = \innerbig{Z\transpose X Z\transpose}{X\transpose \nabla f(X) X\transpose} = \innerbig{Z X\transpose Z}{X \nabla f(X)\transpose X}, \] and from Theorem 1 we know \(X \nabla f(X)\transpose X = \nabla f(X)\). Simplify accordingly and reorganize to finish the claim.
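To close, here is a small numerical sanity check of Theorem 3 (a sketch, not from the original post; the names `A`, `skew`, `Xstar` are ours). It verifies that \(\Proj_X(2Z) = Z - XZ\transpose X\) is indeed tangent for arbitrary \(Z\), and that the inequality of Theorem 3 holds at the known minimizer of the linear cost \(f(X) = \trace(A\transpose X)\) used earlier, for which \(\nabla f(X) = A\) and \(\nabla^2 f \equiv 0\).

```python
import numpy as np

skew = lambda M: (M - M.T) / 2
rng = np.random.default_rng(3)
n = 5

X, _ = np.linalg.qr(rng.standard_normal((n, n)))
Z = rng.standard_normal((n, n))

# The map Z -> Proj_X(2Z) yields the tangent vector Z - X Z^T X.
Xdot = 2 * X @ skew(X.T @ Z)
print(np.allclose(Xdot, Z - X @ Z.T @ X))           # True
print(np.allclose(X.T @ Xdot + Xdot.T @ X, 0))      # True: Xdot is tangent

# At the minimizer X* = -U V^T of f(X) = trace(A^T X) (nabla f = A, nabla^2 f = 0),
# the quantity of Theorem 3 must be nonnegative for arbitrary Z.
A = rng.standard_normal((n, n))
U, _, Vt = np.linalg.svd(A)
Xstar = -U @ Vt
val = np.sum((2 * Z @ Xstar.T @ Z - Xstar @ Z.T @ Z - Z @ Z.T @ Xstar) * A)
print(val >= 0)                                     # True
```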