Saddle avoidance and center-stable manifold: a proof from first principles for a general setup

nonconvex
saddles
CSMT
Authors

Andreea Musat

Nicolas Boumal

Published

September 16, 2024

Modified

September 23, 2024

Abstract
The center-stable manifold theorem is old. It has become important in non-convex optimization, but the proof is rarely spelled out. We reconstruct one here, at a level of generality that is useful to optimizers.

Given a continuous map \(g \colon \Rd \to \Rd\), we consider the dynamical system which generates sequences \(x_0, x_1, x_2, \ldots\) starting from a given \(x_0\) by applying \(g\) iteratively: \[ x_{t+1} = g(x_t) \qquad \textrm{ for } \qquad t = 0, 1, 2, \ldots \] If the sequence converges, it must do so to a fixed point of \(g\), that is, a point \(x^\ast\) such that \(g(x^\ast) = x^\ast\).

The center-stable manifold theorem (CSMT) is an old result. Among other things, it can help us understand why, if \(x_0\) is selected at random, then it is almost certain that the sequence won’t converge to some of the fixed points of \(g\). In other words, the system almost surely avoids those fixed points.

This has become important in non-convex optimization, where it was used notably by Lee et al. (2016) and Panageas and Piliouras (2016) (then together in (Lee et al. 2019)) to argue that gradient descent with an appropriately chosen fixed-step size almost surely avoids saddle points of the cost function where the Hessian has at least one negative eigenvalue.

The CSMT comes in various forms. Typically, they involve linearizing \(g\) around a fixed point \(x^\ast\) of interest. The statement provides that if \(\D g(x^\ast)\) has at least one expanding direction (that is, an eigenvalue with magnitude larger than 1), then locally around \(x^\ast\) a sequence can only converge to \(x^\ast\) if it does so by moving along a zero-measure set (which one then argues away for random initialization).

The usual version of the CSMT invoked in the saddle avoidance literature (Shub 1987, Thm. III.7) requires that \(g\) be a local diffeomorphism (the eigenvalues of \(\D g(x^\ast)\) are nonzero). However, as stated in Exercise III.2 of that same book, this restriction is unnecessary. Other versions of the CSMT (e.g., in a previous blog post) require the eigenvalues of \(\D g(x^\ast)\) to have magnitude not equal to 1 (“hyperbolic”). This too is unnecessary. And it matters for optimization: the former restricts the step-size, and the latter can only handle isolated saddle points. Yet, gradient descent with step size \(1/3\) does avoid the saddles of \(f(x, y, z) = (3x^2 - z^2)/2\), even though \(\D g(x^\ast) = \diag(0, 1, 4/3)\).

Fortunately, a more general CSMT appears as Theorem 5.1 in (Hirsch, Pugh, and Shub 1977). They propose a broad statement that holds in Banach spaces. Theorem 1 below is a corollary of their statement.

Theorem 1 (CSMT without local diffeomorphism) Let \(x^\ast\) be a fixed point for the \(C^1\) iteration map \(g \colon \Rd \to \Rd\). Assume \(\D g(x^\ast)\) has at least one eigenvalue whose magnitude is strictly larger than 1. Then, there exist

  1. a compact neighborhood \(B\) of \(x^\ast\), and
  2. a zero-measure set \(\Wcs\) in \(\Rd\)

such that the following holds:

Let the sequence \((x_t)_{t \geq 0}\) be generated by \(x_{t+1} = g(x_t)\) starting from any \(x_0 \in \Rd\). If there exists an index \(\bar{t}\) such that \(x_t\) is in \(B\) for all \(t \geq \bar{t}\), then \(x_t\) is also in \(\Wcs\) for all \(t \geq \bar{t}\).

The set \(\Wcs\) is the graph of a \(1\)-Lipschitz function \(\varphi\) whose construction is the bulk of the proof. With additional work, one could also verify that this set is a manifold that inherits regularity from \(g\). Its dimension is the number of eigenvalues of \(\D g(x^\star)\) with magnitude at most 1.

The proof presented in this post is reconstructed from (Hirsch, Pugh, and Shub 1977), with (limited) technical changes and a fairly different exposition. The most relevant parts in that book start at page 56 with “But \(T\) may well have a kernel”. Hirsch et al. attribute the essence of the proof to Hadamard (1901), and retrace a wealth of related literature.

To reiterate, the strengths of Hirsch et al.’s version of the CSMT include:

Calibration before setup: gradient descent

We will need to split the domain \(\Rd\) into a direct sum of two subspaces, and to endow them with appropriate norms. It’s a bit abstract, so let us first anchor things for the particular case of gradient descent.

Given a \(C^2\) cost function \(f \colon \Rd \to \reals\), gradient descent with step-size \(\alpha > 0\) iterates the map \[ g(x) = x - \alpha \nabla f(x). \] Let \(x^\ast\) be a fixed point of \(g\) (that is, a critical point of \(f\)). Then, the linearization of \(g\) around \(x^\ast\) is the map \(T \colon \Rd \to \Rd\) obtained by differentiating \(g\): \[ T := \D g(x^\ast) = I_d - \alpha \nabla^2 f(x^\ast). \] This map is symmetric. Thus, it admits an orthonormal basis of eigenvectors \(v_1, \ldots, v_d\) with associated real eigenvalues \(\lambda_1, \ldots, \lambda_d\).

We can set \(\Ecs\) (“center-stable”) to be the subspace spanned by all \(v_i\) such that \(|\lambda_i| \leq 1\). By construction, \(T\) is invariant on \(\Ecs\), that is, if \(x\) is in \(\Ecs\), then \(Tx\) is also in \(\Ecs\). Moreover, the singular values of \(T\) restricted to \(\Ecs\) are exactly the magnitudes of the eigenvalues we selected (because \(T\) is symmetric). These are all at most 1, so we get to claim this (with the usual 2-norm): \[ \|Tx\| \leq \|x\| \qquad \textrm{ for all } \qquad x \in \Ecs. \] In words: components along \(\Ecs\) are non-expanding under application of \(T\).

Likewise, let \(\Eu\) (“unstable”) be spanned by the remaining eigenvectors. Again, \(T\) is invariant on \(\Eu\). The eigenvalues we selected for this one are all strictly larger than 1 in magnitude, so let’s say these are all at least \(\mu > 1\). Thus, the singular values of \(T\) restricted to \(\Eu\) are all at least \(\mu\), so that we get to claim this (still in 2-norm): \[ \|Tx\| \geq \mu\|x\| \qquad \textrm{ for all } \qquad x \in \Eu. \] In words: components along \(\Eu\) are strictly expanding under application of \(T\).

Of course, \(\Rd = \Ecs \oplus \Eu\). In the construction above, it so happens that \(\Ecs\) and \(\Eu\) are orthogonal complements of one another. However, we won’t need this going forward, and it is useful not to force it. Indeed, in general, \(\D g(x^\ast)\) might not be symmetric—it may not even be diagonalizable.

This is why, in the next section, we allow a more general split of \(\Rd\) into a direct sum of two subspaces “compatible” with \(T\). We also endow the subspaces and \(\Rd\) itself with norms that are attuned to that split. Since these norms are no longer necessarily Euclidean, it is best not to reference singular values and to express all assumptions in terms of norms directly.

Setup and assumptions

Let \(T \colon \Rd \to \Rd\) be a linear map—think of \(T\) as being the differential of \(g\) at the fixed point we are investigating. What we are about to do (split \(\Rd\) and choose appropriate norms) can always be done if \(T\) has at least one eigenvalue of magnitude strictly larger than 1. (See Lemma 7 in the appendix about renorming.)

Split \(\Rd\) into a direct sum of two subspaces as \[ \Rd = \Ecs \oplus \Eu, \] with subscripts standing for “center-stable” and “unstable”. The split should meet some requirements with respect to \(T\): we will spell them out momentarily.

Each point \(x \in \Rd\) splits uniquely into \(y \in \Ecs\) and \(z \in \Eu\) such that \(x = y + z\). It is also convenient to think of \(x\) as the pair \(x = (y, z)\). We let \(\pcs \colon \Rd \to \Ecs\) and \(\pu \colon \Rd \to \Eu\) denote the projectors to these subspaces so that \(y = \pcs(x)\) and \(z = \pu(x)\).

Each subspace should be endowed with its own norm, and \(\Rd\) itself inherits a norm from them: \[ \|x\|_{\Rd} = \max\!\left( \|\pcs(x)\|_{\Ecs}, \|\pu(x)\|_{\Eu} \right). \tag{1}\] With these norms, notice that the projectors \(\pcs\) and \(\pu\) are 1-Lipschitz continuous. (We drop subscripts on norms going forward, without ambiguity.)

The split of \(\Rd\) and the chosen norms must satisfy the following conditions with respect to \(T\), with some constants \(1 \leq \lambda < \mu\):

  1. The subspaces are invariant under \(T\), that is, if \(x\) is in \(\Ecs\) then \(Tx\) is also in \(\Ecs\), and likewise for \(\Eu\).
  2. The restriction of \(T\) to those subspaces satisfies some growth conditions, as follows: \[\begin{align*} \|Ty\| \leq \lambda \|y\| && \textrm{ and } && \|Tz\| \geq \mu \|z\| && \textrm{ for all } && y \in \Ecs, z \in \Eu. \end{align*}\]

Invariance implies that \(T\) commutes with the projectors \(\pcs\) and \(\pu\), that is, \[\begin{align*} \pcs \circ T = T \circ \pcs && \textrm{ and} && \pu \circ T = T \circ \pu. \end{align*} \tag{2}\] (To verify this, use that \(T\) is invariant on \(\Ecs\) and \(\Eu\), and \(\pcs + \pu\) is identity while \(\pcs \circ \pu\) is zero.)

We still need to relate \(g\) and \(T\). Without loss of generality, say the fixed point of interest is \(x^\ast = 0\):

  1. The origin is a fixed point: \(g(0) = 0\).

If \(g\) is \(C^1\), we can set \(T := \D g(0)\) (as we should). Then \[ g(x) = g(0) + \D g(0)[x] + e(x) = Tx + e(x), \] where the error term \(e\) is \(C^1\) and satisfies \(\D e(0) = 0\) (and \(e(0) = 0\)). Thus, for any \(\varepsilon > 0\), we know that \(e = g - T\) is \(\varepsilon\)-Lipschitz continuous in a sufficiently small ball around \(0\). Accordingly, let us exploit the gap \(\mu - \lambda > 0\) to fix \(\varepsilon\) such that \[\begin{align*} \varepsilon \geq 0, && \lambda \geq 1 && \textrm{ and } && \mu > \lambda + 4\varepsilon, \end{align*} \tag{3}\] and require the following global Lipschitz condition (with respect to the norm in Eq. 1):

  1. The iteration map \(g\) is well approximated by \(T\), namely, \(\Lip(g - T) \leq \varepsilon\).

We would naturally prefer to assume this only in a small enough ball around the fixed point under consideration. This is easily handled after the fact: see the section “From global to local”.

(To be clear, we do not require \(g\) to be \(C^1\); but if it is, then we can secure the assumptions above more easily.)

Reduction to invariance

We now argue the following:

All we need is to identify a “nice” set which is invariant under the iteration map \(g\).

By “nice”, we mean that this set is the graph of a Lipschitz continuous function. In particular, it has measure zero. The following theorem describes that special function. In later sections, we will prove its existence using a nested contraction mapping argument.

Theorem 2 (Invariant graph) There exists a function \(\varphi \colon \Ecs \to \Eu\) whose graph is invariant under the iteration map \(g\), that is, \[\begin{align*} x \in \graph \varphi && \implies && g(x) \in \graph \varphi, \end{align*}\] where \(\graph \varphi = \{ (y, z) : z = \varphi(y) \textrm{ with } y \in \Ecs \} \subset \Rd\). Moreover, \(\varphi\) is 1-Lipschitz continuous.

Using this special function \(\varphi\), we can now build a real-valued “potential” function \(V \colon \Rd \to \reals\) which is positive away from the graph of \(\varphi\), as follows: \[ V(x) = \| \pu(x) - \varphi(\pcs(x)) \|. \tag{4}\] Thinking of \(x\) as the pair \((y, z)\), we can also write \(V(y, z) = \| z - \varphi(y) \|\). Notice that \(V\) is continuous and nonnegative, and it vanishes exactly on the graph of \(\varphi\).

The key feature of \(V\) is that its value along a sequence generated by \(g\) is either unbounded or always zero.

Theorem 3 (Growth of \(V\)) For all \(x \in \Rd\), we have \(V(g(x)) \geq (\mu - 2\varepsilon) V(x)\), with \(\mu - 2\varepsilon > 1\).

Corollary 1 (Zero or unbounded) Consider a sequence generated by \(x_{t+1} = g(x_t)\). If \(V(x_{\bar t}) > 0\) for some \(\bar t\), then \(V(x_t) \to \infty\) with \(t \to \infty\). By contraposition, if \(V(x_t)\) remains bounded, then in fact \(V(x_t) = 0\) for all \(t\).

Proof. Let us prove Theorem 3. Fix any \(x = (y, z)\) in \(\Rd\). Consider also the modified point \(x' = (y, \varphi(y))\). That point is on the graph of \(\varphi\), which is invariant under \(g\) (Theorem 2). Thus, \(g(x')\) is also on the graph of \(\varphi\), that is, \[ \pu(g(x')) = \varphi(\pcs(g(x'))). \] We are interested in the quantity \(V(g(x))\), so let’s first write it out and then squeeze in \(0 = -\pu(g(x')) + \varphi(\pcs(g(x')))\) inside the norm, followed by a triangle inequality: \[\begin{align*} V(g(x)) & = \| \pu(g(x)) - \varphi(\pcs(g(x))) \| \\ & = \| \pu(g(x)) - \pu(g(x')) + \varphi(\pcs(g(x'))) - \varphi(\pcs(g(x))) \| \\ & \geq \| \pu(g(x)) - \pu(g(x')) \| - \| \varphi(\pcs(g(x'))) - \varphi(\pcs(g(x))) \|. \end{align*}\] In that last term, we can further remove \(\varphi\) since \(\Lip(\varphi) \leq 1\) (by Theorem 2). We know that \(g\) is close to the linear map \(T\), so let’s replace each occurrence of \(g\) by \(g = T + (g-T)\). Then, we extract all terms involving \(g-T\) from the norms with a couple of triangle inequalities: \[\begin{align*} V(g(x)) & \geq \| \pu(T(x)) - \pu(T(x')) \| \\ & \qquad - \| \pu((g-T)(x)) - \pu((g-T)(x')) \| \\ & \qquad - \| \pcs(T(x')) - \pcs(T(x)) \| \\ & \qquad - \| \pcs((g-T)(x')) - \pcs((g-T)(x)) \|. \end{align*}\] Consider the two terms with \(g-T\). The Lipschitz constants of \(\pu \circ (g-T)\) and \(\pcs \circ (g-T)\) are both at most \(\varepsilon\) because \(\Lip(\pu), \Lip(\pcs) \leq 1\) (by design with our choices of norms) and \(\Lip(g-T) \leq \varepsilon\) (by assumption). Moreover, \[ \|x - x'\| = \|(y, z) - (y, \varphi(y))\| = \|z - \varphi(y)\| = V(x), \] so we are off to a good start. Let us now focus on the terms with \(T\). First, note that \[ \pcs(T(x')) - \pcs(T(x)) = 0. \] This is because \(\pcs \circ T = T \circ \pcs\) (Eq. 2) and \(\pcs(x') = y = \pcs(x)\). It remains to show that the very first term is large. For this, we use the fact that \(\pu \circ T = T \circ \pu\) (same reasoning as before) together with our assumption that \(T\) expands vectors in \(\Eu\) by a factor of at least \(\mu > 1\), so that \[ \| \pu(T(x)) - \pu(T(x')) \| = \| T(\pu(x - x')) \| \geq \mu \|z - \varphi(y)\| = \mu V(x). \] All combined, we have shown this: \[ V(g(x)) \geq (\mu - 2\varepsilon) V(x), \] as announced.

We are now good to go. Indeed, assume the sequence \((x_t)_{t \geq 0}\) remains bounded. Then, the value of the continuous function \(V\) along that sequence, namely, \((V(x_t))_{t \geq 0}\) also remains bounded. Corollary 1 then implies that actually \(V(x_t) = 0\) for all \(t\). In other words: the entire sequence is confined to the graph of \(\varphi\), which has measure zero.

Remark: it seems we have not yet used the assumptions \(g(0) = 0\) and \(\|Ty\| \leq \lambda \|y\|\) for all \(y \in \Ecs\). We will use them to prove Theorem 2.

Strategy to build \(\varphi\)

It remains to prove Theorem 2. That is, we must show existence of a 1-Lipschitz continuous function \(\varphi \colon \Ecs \to \Eu\) whose graph is invariant under the iteration map \(g\). As an intermediate step, given a 1-Lipschitz function \(\varphi\), let us try to build a 1-Lipschitz function \(\tilde \varphi\) such that \[\begin{align*} x \in \graph \tilde\varphi && \implies && g(x) \in \graph \varphi. \end{align*}\] If we can do this in such a way that \(\tilde \varphi = \varphi\), then we are done.

The requirement on \(\tilde \varphi\) is as follows: for all \(y \in \Ecs\), the point \(x = (y, \tilde\varphi(y))\) is on the graph on \(\tilde\varphi\) and we want its image \(g(x)\) to be on the graph of \(\varphi\), that is, we require \[ \pu(g(y, \tilde\varphi(y))) = \varphi(\pcs(g(y, \tilde\varphi(y)))). \] Since we aim to build \(\tilde\varphi\), one way to look at this is as an equation whose unknown is \(\tilde\varphi(y)\).

Thus, replace \(\tilde\varphi(y)\) with \(z\) (that is the unknown). Then, replace \(\pu \circ g\) with \(\pu \circ(g-T) + \pu \circ T\) and notice that \(\pu(T(y, z)) = T(\pu(y, z)) = T|_{\Eu} z\) since \(T\) and \(\pu\) commute by the invariance assumption (Eq. 2). Finally, re-organize to find: \[ T|_{\Eu} z = \varphi(\pcs(g(y, z))) - \pu((g-T)(y, z)). \] By assumption, \(T|_{\Eu}\) is invertible. Thus, \(z\) is a solution of the above equation if and only if it is a fixed point of the function \[ h^\varphi_y(z) = \big( T|_{\Eu} \big)^{-1} \! \Big( \varphi(\pcs(g(y, z))) - \pu((g-T)(y, z)) \Big), \tag{5}\] where \(h^\varphi_y\) is a function from \(\Eu\) to \(\Eu\). It depends on the parameter \(y\), and on a prior choice of \(\varphi\).

The plan is as follows:

  1. Fix any \(\varphi\) that is 1-Lipschitz continuous; and satisfies \(\varphi(0) = 0\).
  2. We will show that \(h^\varphi_y\) is a contraction, so that it has a unique fixed point: call it \(z(y)\).
  3. We shall show that the contraction is uniform, so that \(y \mapsto z(y)\) is also 1-Lipschitz; and \(z(0) = 0\).
  4. Echoing the discussion above, this tells us that we can set \(\tilde\varphi\) to be the function \(y \mapsto z(y)\).
  5. Call this new function \(\Gamma \varphi\). We would like for \(\tilde \varphi = \Gamma \varphi\) to be the same as \(\varphi\)
  6. To tie it all together, we show that \(\Gamma\) is a contraction, and hence that there exists a suitable \(\varphi\).

The map \(h^\varphi_y\) is a uniform contraction

For the next two lemmas, fix a 1-Lipschitz function \(\varphi \colon \Ecs \to \Eu\) that satisfies \(\varphi(0) = 0\). For each \(y \in \Ecs,\) define \(h^\varphi_y \colon \Eu \to \Eu\) as in Eq. 5. Recall that \(\mu > \lambda + 4\varepsilon\) with \(\varepsilon \geq 0\).

Lemma 1 (\(h^\varphi_y\) is a uniform contraction) For all \(y \in \Ecs\) and \(z, z' \in \Eu\), we have \(\|h^\varphi_y(z) - h^\varphi_y(z')\| \leq \frac{2\varepsilon}{\mu} \|z-z'\|\).

Proof. Recall the definition of \(h^\varphi_y(z)\) (Eq. 5). Apply \(T|_{\Eu}\) on both sides and use our assumption on that operator to find that \[\begin{align*} \mu \|h^\varphi_y(z) - h^\varphi_y(z')\| & \leq \left\| T|_{\Eu}\!\left( h^\varphi_y(z) - h^\varphi_y(z') \right) \right\| \\ & = \big\| \varphi(\pcs(g(y, z))) - \pu((g-T)(y, z)) \\ & \qquad - \varphi(\pcs(g(y, z'))) + \pu((g-T)(y, z')) \big\| \\ & \leq \Lip(\varphi) \| \pcs(g(y, z)) - \pcs(g(y, z')) \| \\ & \qquad + \Lip(\pu \circ (g-T)) \| (y, z) - (y, z') \|. \end{align*}\] Recall that \(\Lip(\pu) \leq 1\) and that we assume \(\Lip(g-T) \leq \varepsilon\). Of course, \(\| (y, z) - (y, z') \| = \|z-z'\|\), so the last term is at most \(\varepsilon \|z-z'\|\). For the first term, use \(\Lip(\varphi) \leq 1\) and split \(g\) into \((g-T) + T\): \[\begin{align*} \| \pcs(g(y, z)) - \pcs(g(y, z')) \| & \leq \| \pcs((g-T)(y, z)) - \pcs((g-T)(y, z')) \| \\ & \qquad + \| \pcs(T(y, z)) - \pcs(T(y, z')) \|. \end{align*}\] The first term is at most \(\varepsilon \|z - z'\|\). For the second term, use invariance (Eq. 2) to claim \(\pcs \circ T = T \circ \pcs\), then observe that \(\pcs(y, z) = y = \pcs(y, z')\) to see that the term is zero. Overall, we showed \[ \mu \|h^\varphi_y(z) - h^\varphi_y(z')\| \leq 2\varepsilon \|z - z'\|, \] so the proof is complete.

Lemma 2 (\(h^\varphi_y\) is Lipschitz in its parameter) For all \(z \in \Eu\) and \(y, y' \in \Ecs\), we have \(\|h^\varphi_y(z) - h^\varphi_{y'}(z)\| \leq \frac{\lambda+2\varepsilon}{\mu}\|y-y'\|\).

Proof. With the same beginning as in the proof of Lemma 1, we readily find: \[\begin{align*} \mu \|h^\varphi_y(z) - h^\varphi_{y'}(z)\| % & \leq \big\| \varphi(\pcs(g(y, z))) - \pu((g-T)(y, z)) \\ % & \qquad - \varphi(\pcs(g(y', z))) + \pu((g-T)(y', z)) \big\| \\ & \leq \Lip(\varphi) \|\pcs(g(y, z)) - \pcs(g(y', z))\| \\ & \qquad + \Lip(\pu \circ (g-T)) \| (y, z) - (y', z) \|. \end{align*}\] The second term is at most \(\varepsilon \|y - y'\|\). For the first term, with the usual split \(g = (g-T) + T\), \[\begin{align*} \|\pcs(g(y, z)) - \pcs(g(y', z))\| & \leq \| \pcs((g-T)(y, z)) - \pcs((g-T)(y', z)) \| \\ & \qquad + \| \pcs(T(y, z)) - \pcs(T(y', z)) \|. \end{align*}\] The first term is at most \(\varepsilon \|y - y'\|\). For the second term, use \(\pcs \circ T = T \circ \pcs\) and our assumption on \(T|_{\Ecs}\) to obtain \[ \| \pcs(T(y, z)) - \pcs(T(y', z)) \| = \| T|_{\Ecs}(y - y') \| \leq \lambda\|y - y'\|. \] Overall, we have shown \[ \mu \|h^\varphi_y(z) - h^\varphi_{y'}(z)\| \leq (\lambda+2\varepsilon) \|y - y'\|, \] as announced.

The map \(\Gamma\) is a contraction

The two lemmas above regarding \(h^\varphi_y\) combine neatly with the uniform contraction principle (see Lemma 5 below) to allow us to define the map \(\Gamma\) announced earlier. First off, given any function \(\varphi \colon \Ecs \to \Eu\), define the following quantity: \[ \|\varphi\| = \sup_{y \neq 0} \frac{\|\varphi(y)\|}{\|y\|}. \tag{6}\] Now define a linear space of functions as follows: \[ \calF = \{ \varphi \colon \Ecs \to \Eu \, | \, \varphi(0) = 0 \textrm{ and } \|\varphi\| < \infty \}. \] On this linear space, Eq. 6 defines a norm. With the distance induced by that norm, \(\calF\) is a complete metric space (i.e., \(\calF\) is a Banach space).

It is with respect to this norm that we show \(\Gamma\) is a contraction on a closed (thus complete) subset of \(\calF\).

Lemma 3 Consider the following closed subset of \(\calF\): \[ \calF_1 = \{ \varphi \in \calF \, | \, \Lip(\varphi) \leq 1 \}. \] Define the map \(\Gamma \colon \calF_1 \to \calF_1\) as follows: for a given \(\varphi \in \calF_1\) and each \(y \in \Ecs\), \[ (\Gamma \varphi)(y) = \textrm{the unique fixed point of } h^\varphi_y. \] The map \(\Gamma\) is well defined.

Proof. By Lemma 1, \(h^\varphi_y\) is a contraction, so \((\Gamma \varphi)(y)\) is well defined for each \(y\). We still need to show \(\Gamma \varphi\) is in \(\calF_1\). To see that \(\Gamma \varphi\) is 1-Lipschitz, plug both Lemma 1 and Lemma 2 into the uniform contraction principle (Lemma 5): this reveals that \(\Gamma \varphi\) is in fact \(\frac{\lambda+2\varepsilon}{\mu-2\varepsilon}\)-Lipschitz (that constant is strictly less than 1 since \(\mu > \lambda+4\varepsilon\)). To confirm that \((\Gamma \varphi)(0) = 0\), check that \(z = 0\) is the fixed point of \(h_y^\varphi\) when \(y = 0\), because \(h_0^\varphi(0) = 0\) (see Eq. 5; this is the only explicit use of the assumption \(g(0) = 0\)).

Lemma 4 (\(\Gamma\) is a contraction) The map \(\Gamma\) defined in Lemma 3 is a contraction for the norm in Eq. 6.

Proof. Let \(\varphi_1, \varphi_2\) in \(\calF_1\) be arbitrary. Fix any \(y \in \Ecs\), and let \(z_1 := (\Gamma \varphi_1)(y)\) (likewise for \(z_2\)). By definition, \(z_1\) is the fixed point of \(h^{\varphi_1}_y\), so that \(z_1 = h^{\varphi_1}_y(z_1)\) (likewise for \(z_2\)). It follows that \[\begin{align*} \|(\Gamma \varphi_1)(y) - (\Gamma \varphi_2)(y)\| & = \|z_1 - z_2\| \\ & = \|h^{\varphi_1}_y(z_1) - h^{\varphi_2}_y(z_2)\|. \end{align*}\] Using the definition of \(h^\varphi_y(z)\) (Eq. 5) and our assumption on \(T|_{\Eu}\) in the usual manner, it follows that \[\begin{align*} \mu \|z_1 - z_2\| & \leq \big\| \varphi_1(\pcs(g(y, z_1))) - \pu((g-T)(y, z_1)) \\ & \qquad - \varphi_2(\pcs(g(y, z_2))) + \pu((g-T)(y, z_2)) \big\| \end{align*}\] The terms involving \(\pu \circ (g-T)\) are easily handled after a triangle inequality as \[\begin{align*} \|\pu((g-T)(y, z_1)) - \pu((g-T)(y, z_2))\| & \leq \Lip(\pu \circ (g-T)) \|(y, z_1) - (y, z_2)\| \\ & \leq \varepsilon \|z_1 - z_2\|. \end{align*}\] This can be moved to the left-hand side of the previous inequality, so that \[\begin{align*} (\mu - \varepsilon) \|z_1 - z_2\| & \leq \| \varphi_1(\pcs(g(y, z_1))) - \varphi_2(\pcs(g(y, z_2))) \| \\ & \leq \| \varphi_1(\pcs(g(y, z_1))) - \varphi_1(\pcs(g(y, z_2))) \| \\ & \qquad + \| \varphi_1(\pcs(g(y, z_2))) - \varphi_2(\pcs(g(y, z_2))) \|. \end{align*}\] The first term is bounded as \[\begin{align*} \| \varphi_1(\pcs(g(y, z_1))) - \varphi_1(\pcs(g(y, z_2))) \| & \leq \Lip(\varphi_1) \|\pcs(g(y, z_1)) - \pcs(g(y, z_2))\| \\ & \leq \|\pcs((g-T)(y, z_1)) - \pcs((g-T)(y, z_2))\| \\ & \qquad + \|\pcs(T(y, z_1)) - \pcs(T(y, z_2))\| \\ & \leq \varepsilon \|z_1 - z_2\| + 0. \end{align*}\] (As often, we split \(g = (g-T) + T\), used \(\Lip(g-T) \leq \varepsilon\), and commuted \(\pcs \circ T = T \circ \pcs\) followed by \(\pcs(y, z_1) = y = \pcs(y, z_2)\).) This term too can be moved to the left-hand side, yielding: \[\begin{align*} (\mu - 2\varepsilon) \|z_1 - z_2\| & \leq \| \varphi_1(\pcs(g(y, z_2))) - \varphi_2(\pcs(g(y, z_2))) \|. \end{align*}\] By definition of the norm (Eq. 6), we have \(\|\varphi_1(u) - \varphi_2(u)\| \leq \|\varphi_1 - \varphi_2\| \|u\|\) for all \(u \in \Ecs\), so: \[\begin{align*} (\mu - 2\varepsilon) \|z_1 - z_2\| & \leq \|\varphi_1 - \varphi_2\| \| \pcs(g(y, z_2)) \|. \end{align*} \tag{7}\] Focusing on that last factor, by the usual arguments, \[\begin{align*} \| \pcs(g(y, z_2)) \| & \leq \| \pcs((g-T)(y, z_2)) \| + \| \pcs(T(y, z_2)) \| \\ & = \| \pcs((g-T)(y, z_2)) - \pcs((g-T)(0, 0)) \| + \| T|_{\Ecs} y \| \\ & \leq \Lip(\pcs \circ (g-T)) \max(\|y\|, \|z_2\|) + \lambda \|y\|. \end{align*}\] To reach the last inequality, we used the norm on \(\Rd\) (recall Eq. 1) and our assumption on \(T|_{\Ecs}\). Now recall that \(z_2 = (\Gamma \varphi_2)(y)\). Since \(\Gamma \varphi_2\) belongs to \(\calF_1\), we know that \((\Gamma \varphi_2)(0) = 0\) and \(\Lip(\Gamma \varphi_2) \leq 1\), so that \(\|z_2\| = \|(\Gamma \varphi_2)(y) - (\Gamma \varphi_2)(0)\| \leq \|y\|\). Therefore, \[ \| \pcs(g(y, z_2)) \| \leq (\lambda+\varepsilon) \|y\|. \] Back in the beginning, we had \(\|z_1 - z_2\| = \|(\Gamma \varphi_1)(y) - (\Gamma \varphi_2)(y)\|\). Overall, returning to Eq. 7, we have shown for all \(y \neq 0\) that the following inequality holds: \[ \frac{\|(\Gamma \varphi_1)(y) - (\Gamma \varphi_2)(y)\|}{\|y\|} \leq \frac{\lambda + \varepsilon}{\mu - 2\varepsilon} \|\varphi_1 - \varphi_2\|. \] Take the supremum over \(y \neq 0\) on the left-hand side and it follows from our choice of norm (Eq. 6) that \[ \|\Gamma \varphi_1 - \Gamma \varphi_2\| \leq \frac{\lambda + \varepsilon}{\mu - 2\varepsilon} \|\varphi_1 - \varphi_2\|. \] This concludes the proof because the coefficient is strictly less than 1 (due to \(\mu > \lambda + 4\varepsilon\)).

Finishing the proof of Theorem 2

From Lemma 4 and the Banach fixed-point theorem, we deduce that there exists a (unique) fixed point \(\varphi \in \calF_1\) for \(\Gamma\), which proves Theorem 2. For good measure, let’s be explicit. By definition, for all \(y\) we have \[ \varphi(y) = (\Gamma \varphi)(y) = \textrm{the unique fixed point of } h^\varphi_y. \] Since \(\varphi(y)\) is the fixed point of \(h^\varphi_y\), it follows that \[ T|_{\Eu} \varphi(y) = T|_{\Eu} h^\varphi_y(\varphi(y)) = \varphi(\pcs(g(y, \varphi(y)))) - \pu((g-T)(y, \varphi(y))). \] Cancel \(T|_{\Eu} \varphi(y)\) on both sides (using \(\pu \circ T = T \circ \pu\) again) and it follows that \[ \varphi(\pcs(g(y, \varphi(y)))) = \pu(g(y, \varphi(y))). \] This indeed states that \(g(y, \varphi(y))\) is on the graph of \(\varphi\), as desired.

From global to local

Above, we have assumed \(\Lip(g - T) \leq \varepsilon\) with some \(\varepsilon > 0\) that may be very small in order to satisfy the condition \(\mu > \lambda + 4\varepsilon\). This assumption effectively forces \(g\) to be globally almost linear, which is bad. On the plus side, the conclusions are also global: they apply to sequences throughout \(\Rd\). As a fruitful compromise, we now reduce the claims to hold only for sequences which eventually stay in a small ball, with the benefit that \(g\) needs to be nearly linear only in that ball.

As argued in Setup and assumptions, if the iteration map \(g\) is \(C^1\) and \(T = \D g(0)\), then for all \(\varepsilon > 0\) there exists a radius \(r > 0\) such that \(g-T\) is \(\frac{\varepsilon}{5}\)-Lipschitz continuous in the closed ball of radius \(r\) around the fixed point \(0\). (This is all with respect to the norm we have chosen for \(\Rd\) in Eq. 1.)

Now it remains to build a new iteration map \(\tilde g\), so as to fulfill these requirements:

  1. For all \(x \in \Rd\) with \(\|x\| \leq r/2\), we set \(\tilde g(x) = g(x)\).
  2. For all \(x \in \Rd\) with \(\|x\| \geq r\), we set \(\tilde g(x) = Tx\).
  3. For all \(x\) “in between”, smoothly connect the two parts with a transition function: pick a \(C^\infty\) sigmoid function \(s \colon \reals \to \reals\) which is identically 0 up to \(r/2\) and identically 1 starting at \(r\), and let \(\tilde g(x) = (1 - s(\|x\|)) g(x) + s(\|x\|) Tx\). This can be done with \(|s'|\) not exceeding \(4/r\).

This yields an iteration map \(\tilde g\) on \(\Rd\) which does indeed satisfy \(\Lip(\tilde g - T) \leq \varepsilon\) globally. Thus, we may apply the results above to \(\tilde g\). Since it coincides with \(g\) on a small ball \(B\) around the fixed point, whenever we consider a sequence generated by \(g\), if that sequence (or the trailing part of it) is entirely contained in \(B\), then it may as well have been generated by \(\tilde g\). We can then apply the global conclusions about \(\tilde g\) and transfer them locally to \(g\).

Remark. The norm on \(\Rd\) may not be smooth (even away from 0), so that \(\tilde g\) as built above may not retain the differentiability of \(g\) (because of \(s(\|x\|)\)). This is fine for us. If need be, one could replace the norm here with an equivalent smooth norm (e.g., Euclidean).

How this provides saddle avoidance for gradient descent

CSMT is one of the main ingredients for showing that gradient descent on a \(C^2\) cost function \(f \colon \Rd \to \reals\) does not converge to a strict saddle from almost any initial point. By definition, \(x^\ast\) is a strict saddle if \(\nabla^2 f(x^\ast)\) has at least one negative eigenvalue, so the linearization \(\D g(x^\ast) = I_d - \alpha \nabla^2 f(x^\ast)\) has at least one expanding direction, satisfying the assumption of Theorem 1.

Let’s now study the set of all initial points from which we could converge to some strict saddle \(x^\ast\). Applying Theorem 1 at \(x^\ast\), we obtain a compact neighbourhood \(B\) of \(x^\ast\) and a zero measure set \(\Wcs\) in \(\Rd\) such that if \(x_t\) is in \(B\) for all \(t \geq \bar{t}\), then \(x_t\) is also in \(\Wcs\) for all \(t \geq \bar{t}\). Consider some point \(x_0\) such that \(\lim_{t \to \infty} g^t(x_0) = x^\ast\). In particular, there exists some \(\bar{t}\) such that all subsequent iterates \((x_t)_{t \geq \bar{t}}\) remain in \(B\). Therefore, the iterate \(x_{\bar{t}}\) belongs to \(\Wcs\) (a set of measure zero), and so \(x_0\) certainly belongs to \(\bigcup_{t \geq 0} g^{-t} (\Wcs)\).

To obtain a global saddle avoidance result, we need to argue that this last set also has zero measure. This can be done (under some conditions) by showing that the iteration map satisfies the Luzin \(N^{-1}\) property—for some details, see the section “Reduction to local concerns” in our previous blog post.

Of course, this sketch only shows that we avoid one saddle, \(x^\ast\). A final saddle avoidance result is obtained by applying Theorem 1 at each strict saddle of \(f\). Since there could be uncountably many, simply taking the union of all corresponding sets \(\Wcs(x^\ast)\) does not work, as we would potentially obtain an uncountable union of sets of zero measure, which is not of zero measure in general. This can be resolved entirely using Lindelöf’s lemma—to the best of our knowledge, that argument first appeared in (Panageas and Piliouras 2016) (see Thm. 10 and comments thereafter for details).

Overall, one then obtains the following clean statement:

Theorem 4 (Saddle avoidance) Let \(f \colon \Rd \to \reals\) be \(C^2\) and have \(L\)-Lipschitz continuous gradient. Consider gradient descent \(g(x) = x - \alpha\nabla f(x)\) with step-size \(0 < \alpha < 1/L\). There exists a zero-measure set \(Z \subset \Rd\) such that the following is true: If GD initialized at \(x_0\) generates a sequence that converges to a strict saddle point of \(f\), then \(x_0\) belongs to \(Z\).

A few comments:

  1. The Lipschitz gradient condition and the restriction on the step-size are there only to ensure the Luzin \(N^{-1}\) property: this can also be achieved in other ways.
  2. The theorem allows to exclude convergence to a strict saddle point, but it does not rule out accumulation at such points (e.g., if GD’s trajectory approaches a limit cycle of sorts).
  3. The theorem also does not guarantee convergence anywhere else.
  4. To resolve the two previous points, one would typically invoke another result to ensure convergence first; e.g., based on a Łojasiewicz inequality.

Appendix: contractions

We use the following standard lemma about Lipschitz contractions (see also proof wiki).

Lemma 5 (Uniform contraction principle) Let \(Y, Z\) be two non-empty metric spaces. Assume \(Z\) is complete. Let \(h \colon Y \times Z \to Z\) satisfy the following for some \(0 \leq \ell < 1\) and \(L \geq 0\), uniformly over all \(y, y' \in Y\) and \(z, z' \in Z\): \[\begin{align*} \dist(h(y, z), h(y, z')) & \leq \ell \, \dist(z, z'), \textrm{ and } \\ \dist(h(y, z), h(y', z)) & \leq L \, \dist(y, y'). \end{align*}\] Define \(\zeta \colon Y \to Z\) such that \(\zeta(y)\) is the unique fixed point of the contraction \(h(y, \cdot) \colon Z \to Z\). Then, \(\zeta\) is well defined and it is \(\frac{L}{1-\ell}\)-Lipschitz continuous.

Proof. The fact that \(\zeta\) is well defined follows from the Banach fixed-point theorem. By definition, \(h(y, \zeta(y)) = \zeta(y)\) for all \(y\). To see that \(\zeta\) is Lipschitz continuous, let’s compute: \[\begin{align*} \dist(\zeta(y), \zeta(y')) & = \dist(h(y, \zeta(y)), h(y', \zeta(y'))) \\ & \leq \dist(h(y, \zeta(y)), h(y, \zeta(y'))) + \dist(h(y, \zeta(y')), h(y', \zeta(y'))) \\ & \leq \ell \dist(\zeta(y), \zeta(y')) + L \dist(y, y'). \end{align*}\] Reorganizing, we find that \((1-\ell) \dist(\zeta(y), \zeta(y')) \leq L \dist(y, y')\) for all \(y, y'\), as announced.

Appendix: renorming \(\Rd\)

Holmes (1968) proposed the statements and proofs below (and articulated them in the more general setting of Banach spaces). This is useful to verify the assumptions of the center stable manifod theorem when \(T = \D g(x^*)\) is not symmetric.

Lemma 6 (Norm and spectral radius) Let \(T \colon \Rd \to \Rd\) be a linear map, and let \(r(T) = \max_{i=1,\ldots,d} |\lambda_i(T)|\) denote its spectral radius. For all \(s > r(T)\), there exists a norm \(\|\cdot\|_\square\) on \(\Rd\) such that \(\|T\|_\square \leq s\), where \(\|T\|_\square = \sup_{\|x\|_\square = 1} \|Tx\|_\square\) is the associated operator norm.

Proof. Let \(U = \frac{1}{s} T\), so that \(r(U) < 1\). Choose any matrix norm \(\|\cdot\|\) (for example, the usual operator norm with respect to the Euclidean norm on \(\Rd\)). By Gelfand’s formula, \(r(U) = \lim_{k \to \infty} \|U_k\|^{1/k}\). Since \(r(U) < 1\), we can choose \(\varepsilon = (1-r(U))/2 > 0\) and \(K\) such that \(\|U_k\|^{1/k} \leq 1-\varepsilon < 1\) for all \(k \geq K\). Thus, the norms \(\|U^k\| \leq (1-\varepsilon)^k\) decay geometrically to zero for \(k \geq K\). In particular, \(\sum_{k=0}^\infty \|U^k\|\) is finite. It then follows easily that \[ \|x\|_\square \triangleq \sum_{k = 0}^\infty \|U^k x\| \] defines a norm on \(\Rd\). By definition, for all \(x \in \Rd\) we find that \[ \|Ux\|_\square = \sum_{k = 0}^\infty \|U^{k+1} x\| = \|x\|_\square - \|x\| \leq \|x\|_\square, \] that is, \(\|U\|_\square \leq 1\), and so \(\|T\|_\square \leq s\).

Lemma 7 (Split and renorm) Let \(T \colon \Rd \to \Rd\) be a linear map. If \(\rho > 0\) is such that no eigenvalue of \(T\) has modulus \(\rho\), then \(\Rd\) can be split into a direct sum \(\Rd = E_a \oplus E_b\) such that \(T(E_a) \subseteq E_a\) and \(T(E_b) \subseteq E_b\). Moreover, there exist \(\lambda < \rho < \mu\) and norms \(\|\cdot\|_a\) and \(\|\cdot\|_b\) for \(E_a\) and \(E_b\) such that

  1. \(\|Ty\|_a \leq \lambda \|y\|_a\) for all \(y \in E_a\), and
  2. \(\|Tz\|_b \geq \mu \|z\|_b\) for all \(z \in E_b\).

Proof. Write \(T\) as a matrix in (real) Jordan normal form: \(T = PJP^{-1}\). Collect the columns of \(P\) corresponding to the diagonal blocks of \(J\) with eigenvalues of magnitude strictly less than \(\rho\): these form the basis for \(E_a\). The remaining columns of \(P\) are associated to eigenvalues of magnitude strictly more than \(\rho\): they form the basis for \(E_b\). It is clear that \(\Rd = E_a \oplus E_b\); these subspaces are indeed invariant under \(T\) since \(J\) is block diagonal.

The restriction \(T_a \triangleq T|_{E_a} \colon E_a \to E_a\) is a linear map whose eigenvalues all have magnitude strictly less than \(\rho\). Thus, the spectral radius satisfies \(r(T_a) < \rho\). Pick a \(\lambda\) in the interval \(( r(T_a), \rho )\). Then it follows from Lemma 6 that we can choose a norm \(\|\cdot\|_a\) on \(E_a\) such that \(\|T_a\|_a \leq \lambda\).

For \(T_b \triangleq T|_{E_b} \colon E_b \to E_b\), notice that its eigenvalues all have magnitude strictly larger than \(\rho\). In particular, \(T_b\) is invertible, and the spectral radius of its inverse satisfies \(0 < r(T_b^{-1}) < 1/\rho\). Select any \(\mu\) such that \(1/\mu\) is in the interval \((r(T_b^{-1}), 1/\rho)\), so that in particular \(\mu > \rho\). Invoke Lemma 6 again to choose a norm on \(E_b\) such that \(\|T_b^{-1}\|_b \leq 1/\mu\). It follows that for all \(z \in E_b\) we have \(\mu \|z\|_b = \mu \|T_b^{-1} T_b z\|_b \leq \|T_b z\|_b\), as announced.

(Hirsch, Pugh, and Shub (1977) call an operator \(T\) as in Lemma 7\(\rho\)-pseudohyperbolic”.)

References

Hirsch, M. W., C. C. Pugh, and M. Shub. 1977. Invariant Manifolds. Springer Berlin, Heidelberg. https://doi.org/10.1007/bfb0092042.
Holmes, R. B. 1968. “A Formula for the Spectral Radius of an Operator.” The American Mathematical Monthly 75 (2): 163–66. https://doi.org/10.2307/2315890.
Lee, J. D., I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht. 2019. “First-Order Methods Almost Always Avoid Strict Saddle Points.” Mathematical Programming 176 (1–2): 311–37. https://doi.org/10.1007/s10107-019-01374-3.
Lee, J. D., M. Simchowitz, M. I. Jordan, and B. Recht. 2016. “Gradient Descent Only Converges to Minimizers.” In Conference on Learning Theory, 1246–57. PMLR.
Panageas, I., and G. Piliouras. 2016. “Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions.” arXiv Preprint arXiv:1605.00405.
Shub, M. 1987. Global Stability of Dynamical Systems. Springer New York. https://doi.org/10.1007/978-1-4757-1947-5.

Citation

BibTeX citation:
@online{musat2024,
  author = {Musat, Andreea and Boumal, Nicolas},
  title = {Saddle Avoidance and Center-Stable Manifold: A Proof from
    First Principles for a General Setup},
  date = {2024-09-16},
  url = {racetothebottom.xyz/posts/saddle-avoidance-general/},
  langid = {en},
  abstract = {The center-stable manifold theorem is old. It has become
    important in non-convex optimization, but the proof is rarely
    spelled out. We reconstruct one here, at a level of generality that
    is useful to optimizers.}
}