Saddle avoidance and stable manifold: a proof from first principles for a simple setup – Race to the bottom

Abstract

Gradient descent with fixed step-size avoids strict saddle points almost surely. The now standard proof of this fact relies on the Center Stable Manifold theorem, which may feel like a black-box. This post provides an explicit proof, back to first principles, for a simplified setting.

Let \(g \colon \Rd \to \Rd\) be a \(C^1\) iteration map, used to generate sequences as follows: given \(x_0 \in \Rd\), let \[ x_{t+1} = g(x_t) \qquad \textrm{ for } \qquad t = 0, 1, 2, \ldots \] If the sequence converges to a point \(x^\ast\), then the latter is a fixed point of \(g\), that is, \(g(x^\ast) = x^\ast\). We might converge to any fixed point of \(g\) (for trivial reasons: just initialize \(x_0\) there). It is natural to ask:

If we initialize \(x_0\) randomly, are there fixed points we are unlikely to converge to?

For example, if we are aiming to minimize a \(C^2\) cost function \(f \colon \Rd \to \reals\) with gradient descent and a fixed step-size \(\alpha > 0\), the iteration map is \[ g(x) = x - \alpha \nabla f(x). \] The fixed points of \(g\) are the critical points of \(f\). Among those, some are desirable (local minima of \(f\)), and some are not (saddle points of \(f\)).

Fortunately, it has been long observed and later proved that gradient descent avoids strict saddles from almost any initial point (Lee et al. 2016, 2019; Panageas and Piliouras 2016). (Saddles may still slow it down though.) The argument relies on a rather sophisticated theorem from dynamical systems, namely, the Center Stable Manifold theorem (Shub 1987, Thm. III.7). As a result, it may seem opaque.

In this post, we give an explicit proof in a simplified setting, namely, assuming the Hessian of \(f\) at the saddle points is invertible. In particular, this forces the saddle points to be isolated, and the iteration map \(g\) to be locally invertible around them.

(These restrictions can be relaxed: see our detailed follow-up post.)

The core of the story is a transparent proof of the Center Stable Manifold theorem, or rather, of the parts we need from the Stable Manifold theorem. This proof uses a contraction argument in a space of sequences, adapted from a paper by Antonakopoulos et al. (2022). (The latter aims for a more general setting where the iteration map \(g\) may not be the same at each iteration; however, we were not able to track their argument for that level of generality.)

Edit on Nov. 19, 2024: The resulting proof parallels the one by Irwin (1970), with more words (we were not aware of that one at the time of writing). Also, Irwin uses the implicit function theorem rather than a contraction argument (the two are closely related of course): more on that later. Extensions of that proof to handle center-stable manifolds are notably discussed in de la Llave and Wayne (1995).

Reduction to local concerns

In our context, a strict saddle is a point where the gradient of \(f\) is zero and the Hessian has at least one (strictly) negative eigenvalue. We do not want to converge to such a point, so let us entertain how this might happen anyway.

Assume the sequence \((x_t)_{t \geq 0}\) converges to some strict saddle \(x^\ast\). Then, it must eventually enter (and never exit again) an arbitrarily small ball around \(x^\ast\).

The strategy is to first focus on what happens in that neighborhood of the saddle point. We shall show that if the sequence (eventually) stays in that small ball forever, then it must actually be confined to a zero-measure subset \(Z\) of that ball. Stated differently: locally, there is only a set of measure zero from which GD could converge to \(x^\ast\).

So if we were to initialize randomly in that small ball, we would almost surely avoid that saddle point. That’s the part which uses (a version of) the Center Stable Manifold theorem, as made explicit in the next several section.

Once this local result is secured, the key to showing the global result is to assume the Luzin \(N^{-1}\) property:

For all subsets \(S \subset \Rd\), if \(S\) has measure zero then its pre-image \(g^{-1}(S)\) also has measure zero.

This ensures that repeated pre-images of the local, zero-measure set \(Z\) of “bad points” also remains of zero measure. With this, we easily globalize the local saddle avoidance result. Indeed, the only way our sequence could enter \(Z\) at (say) time \(\bar{t}\) is if \(x_0\) belongs to \(g^{-\bar{t}}(Z)\), which has measure zero. This remains of measure zero after taking the union over all possible values of \(\bar t = 0, 1, 2, \ldots\).

Importantly, for \(g \in C^1\), the Luzin \(N^{-1}\) property holds if and only if \(\D g(x)\) has full rank for almost all \(x\) in \(\Rd\) (Ponomarev 1987, Thm. 1).

For GD, under the standard assumptions that \(\nabla f\) is \(L\)-Lipschitz and that \(\alpha < 1/L\), it is easy to check that \(\D g(x)\) is non-singular for all \(x\), so we are fine. Indeed, in this case, \(\D g(x) = I_d - \alpha \nabla^2 f(x)\). Since \(\nabla f\) is Lipschitz with constant \(L\), the eigenvalues of \(\nabla^2 f(x)\) are bounded between \(-L\) and \(L\). With the condition on the step size, we obtain that the eigenvalues of \(\D g(x)\) are always strictly positive.

Linearization of the iteration, and a first assumption

Without loss of generality, let us assume that \(x^\ast = 0\) (the origin in \(\Rd\)) is a fixed point of \(g\): \[ g(0) = 0. \] (For GD, this means the origin is a critical point of \(f\).)

Let \(G = \D g(0)\) be the Jacobian of \(g\) at that fixed point. (For GD, if \(f\) is \(C^2\) we have \(G = I_d - \alpha \nabla^2 f(0)\).) The eigenvalues of \(G\) control the local behavior of our sequences. Indeed, consider the following Taylor expansion of \(g\) around \(0\): \[ g(x) = g(0) + \D g(0)[x] + e(x) = Gx + e(x), \] where \(e \colon \Rd \to \Rd\) is the error term in the first-order approximation. Of course, \(e(0) = 0\). If \(g\) is sufficiently regular, then the error term is small for \(x\) near \(0\) and we see that \(x_{t+1}\) is roughly equal to \(Gx_t\): \[ x_{t+1} = g(x_t) = Gx_t + e(x_t). \] Therefore, eigenvalues of \(G\) larger than 1 in magnitude will amplify certain components of \(x_t\) (“unstable”), whereas the eigenvalues smaller than 1 will drive those other components to zero (“stable”). Eigenvalues with magnitude equal to 1 are important… but we assume them away in this post.

We make the following assumption:

The Jacobian \(G \in \Rdd\) has \(u\) eigenvalues of magnitude strictly larger than 1, and \(s\) eigenvalues of magnitude strictly smaller than 1, so that \(u + s = d\). We let \(\lambda_u\) and \(\lambda_s\) denote the eigenvalues of \(G\) such that \(|\lambda_u| > 1\) and \(|\lambda_s| < 1\) are closest to 1.

For GD, the eigenvalues of \(G\) are equal to \(1-\alpha\mu_i\) where \(\mu_1, \ldots, \mu_d\) are the eigenvalues of \(\nabla^2 f(0)\). Thus, our assumption is that \(\nabla^2 f(0)\) has \(s\) eigenvalues strictly in \((0, \frac{2}{\alpha})\), and \(u\) eigenvalues strictly outside of \([0, \frac{2}{\alpha}]\). For sufficiently small \(\alpha\), this separates the positive and the negative eigenvalues of \(\nabla^2 f(0)\).

By assumption, the iteration map \(g\) is \(C^1\) and therefore the error function \(e(x) = g(x) - Gx\) is also \(C^1\). It follows that \(\D e(x) = \D g(x) - G\) is continuous, and \(\D e(0) = 0\). This means that the norm of \(\D e(x)\) is as small as we want provided \(x\) stays in a small ball around the origin. Stated differently: the error term \(e\) is Lipschitz continuous with an arbitrarily small constant \(\rho\), as long as we restrict our attention to a ball of sufficiently small radius \(\delta\) around the fixed point. Explicitly:

Since \(g\) is \(C^1\), there exist \(\delta > 0\) and \(\rho > 0\) such that:

For all \(x, y\) satisfying \(\|x\|, \|y\| \leq \delta\), we have \[\|e(x) - e(y)\| \leq \rho\|x - y\|\] (in particular, \(\|e(x)\| \leq \rho\|x\|\) since \(e(0) = 0\)), and

The constant \(\rho\) is sufficiently small so that \(\rho \frac{1}{|\lambda_u|-1} \leq \frac{1}{3}\) and also \(\rho \frac{1}{1-|\lambda_s|} \leq \frac{1}{3}\).

For illustration with GD, say \(f\) is \(C^2\) around the fixed point. Then the error term is \[ e(x) = g(x) - Gx = x - \alpha \nabla f(x) - (I_d - \alpha \nabla^2 f(0))x = \alpha( \nabla^2 f(0) x - \nabla f(x) ). \] Its Jacobian is \(\D e(x) = \alpha ( \nabla^2 f(0) - \nabla^2 f(x) )\). Since \(\nabla^2 f\) is continuous, for any \(\rho\) we want (in particular, a \(\rho\) which satisfies the second condition above), we can choose \(\delta\) small enough so that \(\|\D e(x)\| \leq \rho\) for all \(x\) in the ball \(\|x\| \leq \delta\). As a result, \(e\) is \(\rho\)-Lipschitz continuous in that ball, which secures the first condtion.

Unrolling the recurrence, and a second assumption

We can now express the \((t+1)\)st iterate by unrolling the recurrence with its linearization \(g(x) = Gx + e(x)\): \[\begin{align*} x_{t+1} & = g(x_t) \\ & = Gx_t + e(x_t) \\ & = Gg(x_{t-1}) + e(x_t) \\ & = G(Gx_{t-1} + e(x_{t-1})) + e(x_t) \\ & = G^2 x_{t-1} + Ge(x_{t-1}) + e(x_t) \\ & \ \ \vdots \\ & = G^{t+1} x_0 + \sum_{i = 0}^t G^i e(x_{t-i}). \end{align*}\] For later convenience, replace the counter \(i\) in the sum with \(j = t-i\) (also ranging from 0 to \(t\)) so as to get: \[ x_{t+1} = G^{t+1} x_0 + \sum_{j = 0}^t G^{t-j} e(x_j). \tag{1}\]

Let us make another simplifying assumption. For GD, that assumption holds because \(\nabla^2 f(0)\) is symmetric.

Assume \(G\) is diagonalizable, and therefore go ahead and assume \(G\) is diagonal.

Indeed, as long as we can diagonalize \(G\) as \(G = VDV^{-1}\), let \(y_t = V^{-1}x_t\). Then, \(x_{t+1} = g(x_t)\) becomes \[ y_{t+1} = V^{-1}g(Vy_t) = V^{-1}g(0) + V^{-1}\D g(0)[Vy_t] + \tilde e(y_t) = Dy_t + \tilde e(y_t). \] We can then study sequences \((y_t)\) instead of \((x_t)\), with \(G\) and \(e\) above replaced by \(D\) and \(\tilde e\).

Now that \(G\) is diagonal, we split it in a 2-by-2 block structure. This is to separate the stable and unstable parts into the first \(u\) coordinates and the remaining \(s\) coordinates: \[ G = \begin{bmatrix} G_u & 0 \\ 0 & G_s \end{bmatrix}. \] The eigenvalues of \(G_u \in \reals^{u \times u}\) are all larger than \(|\lambda_u| > 1\) in magnitude, and the eigenvalues of \(G_s \in \reals^{s \times s}\) are all smaller than \(|\lambda_s| < 1\) in magnitude. Notice that \(G_u\) is invertible.

The diagonality assumption is useful later on as it allows us to bound \(\|Gx\|\) in terms of eigenvalues of \(G\).

From here, we split any vector \(z\) in \(\Rd\) in two parts \(z^u\) and \(z^s\), so that \(z = [z^u, z^s]^\top\). Applied to Eq. 1, we get two expressions: \[ x_{t+1}^s = G_s^{t+1} x_0^s + \sum_{j = 0}^t G_s^{t-j} e(x_j)^s \tag{2}\] and \[ x_{t+1}^u = G_u^{t+1} x_0^u + \sum_{j = 0}^t G_u^{t-j} e(x_j)^u. \tag{3}\] Since \(G_u\) is invertible, we can multiply the latter by the inverse of \(G_u^{t+1}\) on both sides to reveal an expression for \(x_0^u\): \[ x_0^u = G_u^{-(t+1)} x_{t+1}^u - \sum_{j = 0}^t G_u^{-(j+1)} e(x_j)^u. \] The latter holds for all \(t\), so we can take the limit \(t \to \infty\) (noting that \(G_u^{-(t+1)} \to 0\)) and deduce \[ x_0^u = - \sum_{j = 0}^\infty G_u^{-(j+1)} e(x_j)^u. \tag{4}\]

The three identities above hold true for all bounded sequences \((x_t)\) generated by the iteration map \(g\) starting at some given \(x_0\). What we are about to do now is this:

Given some bounded sequence \((x_t)\) (not necessarily generated by iterating \(g\)), we construct a new sequence \((y_t)\). We do this in such a way that if \((x_t)\) was generated by \(g\), then there is no change (\(y_t = x_t\) for all \(t\)).

Time to be precise.

A complete metric space of sequences, \(S\)

With \(\delta\) as selected in our assumptions above, define the following set of sequences: \[ S = \{ (x_t) : \|x_t\| \leq \delta \textrm{ for all } t \}. \tag{5}\] That is, a sequence \((x_t)\) (indexed by \(t = 0, 1, 2, \ldots\)) is in \(S\) if it stays in the ball of radius \(\delta\) around the fixed origin forever.

We define a distance over the set \(S\) as follows: \[ \dist((x_t), (y_t)) = \sup_{t \geq 0} \|x_t - y_t\|. \] Equipped with this distance, \(S\) is a complete metric space. This puts us in a position to use the contraction mapping theorem.

Transforming a sequence into another: the map \(T_b\)

Fix a vector \(b \in \reals^s\) (it will be the stable part of a point). Momentarily, we will restrict \(b\) to a ball, as: \(\|b\| \leq \frac{1}{3} \delta\).

Now we build a map \(T_b\) which takes as input a sequence in \(S\) and produces as output a new sequence (which we shall show is also in \(S\)). Given the sequence \((x_t)\) in \(S\), the sequence \((y_t) := T_b(x_t)\) obtained by applying \(T_b\) to \((x_t)\) is defined as follows. The essence is this:

For \(y_0\), we fix the stable part to \(b\), and we set the unstable part with Eq. 4.
For the subsequent \(y_t\) with \(t = 1, 2, \ldots\), we use Eq. 2 and Eq. 3, but we replace the \(x_0\) in the leading term with the \(y_0\) we just constructed.

Now precisely:

For \(y_0\), we let \(y_0^s = b\) and \(y_0^u = - \sum_{j = 0}^\infty G_u^{-(j+1)} e(x_j)^u\).
For \(y_{t+1}\) with \(t \geq 0\), we let
1. \(y_{t+1}^s = G_s^{t+1} y_0^s + \sum_{j = 0}^t G_s^{t-j} e(x_j)^s\), and
2. \(y_{t+1}^u = G_u^{t+1} y_0^u + \sum_{j = 0}^t G_u^{t-j} e(x_j)^u\).

By design, if \((x_t)\) was actually generated by iterating \(g\) and if \(b = x_0^s\), then \(y_0 = x_0\), and therefore also \(y_t = x_t\) for all \(t\). Stated differently:

If \((x_t)\) was generated by \(x_{t+1} = g(x_t)\) and \(b = x_0^s\), then \(T_b(x_t) = (x_t)\), that is, \((x_t)\) is a fixed point of \(T_b\).

Our next goal is to show that \(T_b\) is a contraction on \(S\). For starters, we will have to show that \(T_b(x_t)\) is indeed in \(S\) when \((x_t)\) is so.

As a technical side note (for now), take a closer look at \(y_{t+1}^u\). We can plug the expression for \(y_0^u\) back in, and find: \[\begin{align*} y_{t+1}^u & = - G_u^{t+1} \sum_{j = 0}^\infty G_u^{-(j+1)} e(x_j)^u + \sum_{j = 0}^t G_u^{t-j} e(x_j)^u \\ & = - \sum_{j = 0}^\infty G_u^{t-j} e(x_j)^u + \sum_{j = 0}^t G_u^{t-j} e(x_j)^u \\ & = - \sum_{j = t+1}^\infty G_u^{t-j} e(x_j)^u. \end{align*} \tag{6}\]

\(T_b\) maps from \(S\) into \(S\)

Let \((x_t)\) be a sequence in \(S\), as defined in Eq. 5. We must make sure that \((y_t) := T_b(x_t)\) is also in \(S\). That is, we must check that each \(y_t\) is in the ball of radius \(\delta\) around the origin. Brute force will do.

First, let us look at \(y_0\). Of course, \(\|y_0\| \leq \|y_0^s\| + \|y_0^u\|\). Also, \(\|y_0^s\| = \|b\|\). For the unstable part, recall our assumptions about \(e\): this error function is \(\rho\)-Lipschitz continuous in the ball of radius \(\delta\), with \(\delta\) and \(\rho\) selected so that the last inequality below holds: \[\begin{align*} \|y_0^u\| & = \left\| \sum_{j = 0}^\infty G_u^{-(j+1)} e(x_j)^u \right\| \\ & \leq \sum_{j = 0}^\infty \| G_u^{-(j+1)} e(x_j)^u \| \\ & \leq \sum_{j = 0}^\infty \| G_u^{-1} \|^{j+1} \| e(x_j)^u \| \\ & \leq \sum_{j = 0}^\infty (1/|\lambda_u|)^{j+1} \rho \|x_j\| \\ & \leq \rho \delta \sum_{j = 0}^\infty (1/|\lambda_u|)^{j+1} \\ & = \rho \delta \frac{1}{|\lambda_u| - 1} \\ & \leq \frac{1}{3} \delta. \end{align*}\] In particular:

Restrict \(b\) to \(\|b\| \leq \frac{1}{3} \delta\). Then, \(\|y_0\| \leq \|y_0^s\| + \|y_0^u\| \leq \frac{2}{3} \delta \leq \delta\), as desired.

Now, let us look at all the other elements in the sequence. Decompose again as \(\|y_{t+1}\| \leq \|y_{t+1}^s\| + \|y_{t+1}^u\|\). Let us start with the unstable part. For \(t \geq 0\), starting from Eq. 6, the same reasoning as for \(\|y_0^u\|\) yields: \[\begin{align*} \|y_{t+1}^u\| & = \left\| \sum_{j = t+1}^\infty G_u^{t-j} e(x_j)^u \right\| \\ & \leq \sum_{j = t+1}^\infty (1/|\lambda_u|)^{j-t} \rho \delta \\ & = \rho \delta \frac{1}{|\lambda_u| - 1} \\ & \leq \frac{1}{3} \delta. \end{align*}\] Now for the stable part, we start from the definition of \(y_{t+1}^s\) and recall that \(\|G_s\| = |\lambda_s| < 1\) (using that \(G\) is symmetric) to find: \[\begin{align*} \|y_{t+1}^s\| & = \left\| G_s^{t+1} y_0^s + \sum_{j = 0}^t G_s^{t-j} e(x_j)^s \right\| \\ & \leq |\lambda_s|^{t+1} \|b\| + \sum_{j = 0}^t |\lambda_s|^{t-j} \rho \delta \\ & \leq \|b\| + \rho \delta \frac{1}{1 - |\lambda_s|} \\ & \leq \frac{2}{3} \delta. \end{align*}\] Overall, it follows that \[ \|y_{t+1}\| \leq \|y_{t+1}^s\| + \|y_{t+1}^u\| \leq \delta, \] as desired.

\(T_b\) is a contraction on \(S\)

Given two sequences \((x_t)\) and \((\tilde x_t)\) (both of them in \(S\)), let \((y_t) := T_b(x_t)\) and \((\tilde y_t) := T_b(\tilde x_t)\). We want to show that \(T\) contracts distances in \(S\), that is, we want to show \(\dist((y_t), (\tilde y_t)) < \dist((x_t), (\tilde x_t))\). Again, brute force will do.

As usual, we can split the stable and unstable parts: \[\begin{align*} \dist((y_t), (\tilde y_t)) & = \sup_{t \geq 0} \| y_t - \tilde y_t \| \\ & \leq \sup_{t \geq 0} \| y_t^s - \tilde y_t^s \| + \sup_{t \geq 0} \| y_t^u - \tilde y_t^u \|. \end{align*}\] Recall the definitions of \((y_t)\) (and similarly for \((\tilde y_t)\)): \[\begin{align*} y_{t+1}^s & = G_s^{t+1} y_0^s + \sum_{j = 0}^t G_s^{t-j} e(x_j)^s, \\ y_{t+1}^u & = - \sum_{j = t+1}^\infty G_u^{t-j} e(x_j)^u. \end{align*}\]

Starting with the stable part, recall \(y_0^s = \tilde y_0^s = b\). Then, we find for all \(t \geq 0\): \[\begin{align*} \|y_{t+1}^s - \tilde y_{t+1}^s\| & = \left\| \sum_{j = 0}^t G_s^{t-j} (e(x_j)^s - e(\tilde x_j)^s) \right\| \\ & \leq \sum_{j = 0}^t |\lambda_s|^{t-j} \rho \|x_j - \tilde x_j\| \\ & \leq \frac{1}{1 - |\lambda_s|} \rho \sup_{j \geq 0} \|x_j - \tilde x_j\| \\ & \leq \frac{1}{3} \dist((x_t), (\tilde x_t)). \end{align*}\] Of course, \(\|y_0^s - \tilde y_0^s\| = \|b - b\| = 0\), so we have appropriate control for all elements of the sequences.

Now looking at the unstable part, for all \(t \geq 0\) we have: \[\begin{align*} \|y_{t+1}^u - \tilde y_{t+1}^u\| & = \left\| \sum_{j = t+1}^\infty G_u^{-(j-t)} (e(x_j)^u - e(\tilde x_j)^u) \right\| \\ & \leq \sum_{j = t+1}^\infty (1/|\lambda_u|)^{j-t} \rho \|x_j - \tilde x_j\| \\ & \leq \frac{1}{|\lambda_u| - 1} \rho \sup_{j \geq 0} \|x_j - \tilde x_j\| \\ & \leq \frac{1}{3} \dist((x_t), (\tilde x_t)). \end{align*}\] And again, this also holds for \(y_0^u - \tilde y_0^u\) as well because the expression for \(y_{t+1}^u\) reduces to the definition of \(y_0^u\) upon setting \(t + 1 = 0\).

Thus, \[\begin{align*} \dist((y_t), (\tilde y_t)) & \leq \sup_{t \geq 0} \| y_t^s - \tilde y_t^s \| + \sup_{t \geq 0} \| y_t^u - \tilde y_t^u \| \\ & \leq \frac{2}{3} \dist((x_t), (\tilde x_t)), \end{align*}\] confirming that \(T_b\) is a contraction.

In fact, we have shows a bit more than that:

The contraction factor \(\frac{2}{3}\) is independent of \(b\), so that \(T\) is a uniform contraction with respect to the parameter \(b\). (This is still assuming \(\|b\| \leq \frac{1}{3} \delta\).)

What does the contraction argument buy us?

We have shown that, for all \(b \in \reals^s\) satisfying \(\|b\| \leq \frac{1}{3} \delta\), the map \(T_b\) is a contraction on \(S\). Fix one such \(b\). By the contraction mapping theorem, \(T_b\) has a unique fixed point in \(S\). In other words, there exists a unique sequence \(x_0, x_1, x_2, \ldots\) in \(S\) for which \(T_b(x_t) = (x_t)\). Since the whole sequence exists and is unique, in particular, \(x_0\) is well defined. And it is merely a function of \(b\) (the parameter that determines \(T_b\)). By design, the stable part is \(x_0^s = b\). What we just discovered is that the unstable part \(x_0^u\) is a well-defined function of \(b\). Thus, we have shown the following (where \(B_r(\Rd)\) denotes the closed ball of radius \(r\) around the origin in \(\Rd\)):

There exists a function \(\varphi \colon B_{\frac{1}{3} \delta}(\reals^s) \to B_{\delta}(\reals^u)\) with the following property. If \((x_t)\) is a sequence in \(S\) that is fixed for \(T_b\) with \(b := x_0^s\) satisfying \(\|b\| \leq \frac{1}{3} \delta\), then \(x_0^u = \varphi(b)\).

In fact, we have shown a bit more than that. Indeed, \(T_b\) is a uniform contraction, so that we can apply the uniform contraction mapping theorem. This further guarantees the following:

The function \(\varphi\) is continuous.

We use that next.

How does it help us understand fixed-point avoidance?

Let’s go back to our goal: we want to understand what it takes for a sequence generated by the iteration map \(g\) to converge to the fixed point at the origin.

Accordingly, let \(x_0 \in \Rd\) be arbitrary, and generate the sequence as \(x_{t+1} = g(x_t)\) for all \(t \geq 0\). Assume this sequence converges to the origin. Then, it must eventually enter (and never again leave) the ball of radius \(\frac{1}{3} \delta\) around the origin. Let \(\bar t\) be the index when that happens, so that: \[ x_{\bar t}, x_{\bar t+1}, x_{\bar t+2}, \ldots \] each have norm at most \(\frac{1}{3} \delta\). In particular, the sequence \((x_t)_{t\geq \bar t}\) is in \(S\) (Eq. 5).

Moreover, the norm of \(b := x_{\bar t}^s\) satisfies \(\|b\| \leq \|x_{\bar t}\| \leq \frac{1}{3} \delta\).

Thus, by design, \((x_t)_{t\geq \bar t}\) is a fixed point of \(T_b\) because it is generated by \(g\) and \(x_{\bar t}^s = b\).

But \(T_b\) only has one fixed point in \(S\). That must be the one!

It follows that \(x_{\bar t}\) has its unstable part entirely determined by its stable part: \[ x_{\bar t}^u = \varphi(x_{\bar t}^s). \] And it is clear that, if we drop \(x_{\bar t}\) from the sequence above, then the rest of the sequence, namely, \((x_t)_{t\geq \bar t+1}\), still satisfies all of the above (only with a different \(b\)). Iterating this thinking, we find that \[ x_{t}^u = \varphi(x_{t}^s) \quad \textrm{ for all } \quad t \geq \bar t. \] In other words, these iterates in \(\Rd\) live on the graph of \(\varphi\), namely, \(Z = \{ (z, y) \in \Rd : z = \varphi(y) \}\).

This set has measure zero, owing to the fact that \(\varphi\) is continuous. (If it wasn’t continuous, we would have to be more careful.)

Thus, if a sequence generated by \(g\) converges to the origin, then there exists some \(\bar{t}\) such that \(x_{\bar{t}}\) is in \(Z\). Accordingly, \(x_0\) must belong to \(g^{-\bar t}(Z)\): the set of all points in \(\Rd\) which map into \(Z\) after \(\bar{t}\) iterations of \(g\) (this is well defined even if \(g\) is not globally invertible). Recall the Luzin \(N^{-1}\) property from the beginning of this post: we know \(g^{-1}(Z)\) has measure zero because \(Z\) has measure zero. Therefore, \(\bigcup_{\bar{t} \geq 0} g^{-\bar{t}}(Z)\) also has measure zero. In other words:

There exists a zero-measure set such that, if the sequence generated by \(x_{t+1} = g(x_t)\) converges to the fixed point at the origin, then \(x_0\) belongs to that set.

For continuously random \(x_0\), that almost surely does not happen. We are done.

Where did the manifold go?

The Center Stable Manifold theorem tells us the zero-measure set of “bad” points is a manifold with a certain level of regularity. In the argument above, we only found that this set is the graph of the continuous function \(\varphi\) in some ball of \(\Rd\). That graph has zero measure (which is all we care about for the purpose of securing saddle avoidance), but so far we have said nothing about smoothness.

As one would imagine, since the map \(T_b\) depends nicely on the parameter \(b\), the fixed point also depends rather nicely on \(b\). This can be used to argue that \(\varphi\) inherits some of the regularity of \(g\), e.g., \(C^k\) differentiability or analyticity. See for example Theorem 3 in these lecture notes.

Edit on Nov. 19, 2024: Following the argument of Irwin (1970), one can now use the implicit function theorem rather directly in order to deduce the regularity of the stable manifold.

Relaxing the assumptions

Our assumptions make it so that the undesirable fixed points of \(g\) are isolated and \(g\) is locally invertible around them. This is not necessary: see Appendix III (including the exercises) in the book by Shub (1987) for refined theorem statements, and also (Hirsch et al. 1977, Thm. 5.1).

We write about this in a detailed follow-up post.

References

Antonakopoulos, K., P. Mertikopoulos, G. Piliouras, and X. Wang. 2022. “AdaGrad Avoids Saddle Points.” In Proceedings of the 39th International Conference on Machine Learning, edited by K. Chaudhuri, S. Jegelka, Le Song, C. Szepesvari, G. Niu, and S. Sabato, vol. 162. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v162/antonakopoulos22a.html.

de la Llave, R., and C. E. Wayne. 1995. “On Irwin’s Proof of the Pseudostable Manifold Theorem.” Mathematische Zeitschrift 219 (1): 301–21. https://doi.org/10.1007/bf02572367.

Hirsch, Morris W., Charles C. Pugh, and Michael Shub. 1977. Invariant Manifolds. In Lecture Notes in Mathematics. Springer Berlin Heidelberg. https://doi.org/10.1007/bfb0092042.

Irwin, M. C. 1970. “On the Stable Manifold Theorem.” Bulletin of the London Mathematical Society 2 (2): 196–98. https://doi.org/10.1112/blms/2.2.196.

Lee, J. D., I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht. 2019. “First-Order Methods Almost Always Avoid Strict Saddle Points.” Mathematical Programming 176 (1–2): 311–37. https://doi.org/10.1007/s10107-019-01374-3.

Lee, J. D., M. Simchowitz, M. I. Jordan, and B. Recht. 2016. “Gradient Descent Only Converges to Minimizers.” Conference on Learning Theory, 1246–57. https://proceedings.mlr.press/v49/lee16.html.

Panageas, I., and G. Piliouras. 2016. “Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions.” arXiv Preprint arXiv:1605.00405.

Ponomarev, S. P. 1987. “Submersions and Preimages of Sets of Measure Zero.” Siberian Mathematical Journal 28 (1): 153–63.

Shub, M. 1987. Global Stability of Dynamical Systems. Springer New York. https://doi.org/10.1007/978-1-4757-1947-5.

Citation

BibTeX citation:

@online{boumal2024,
  author = {Boumal, Nicolas and Musat, Andreea},
  title = {Saddle Avoidance and Stable Manifold: A Proof from First
    Principles for a Simple Setup},
  date = {2024-08-21},
  url = {www.racetothebottom.xyz/posts/saddle-avoidance-simple/},
  langid = {en},
  abstract = {Gradient descent with fixed step-size avoids strict saddle
    points almost surely. The now standard proof of this fact relies on
    the Center Stable Manifold theorem, which may feel like a black-box.
    This post provides an explicit proof, back to first principles, for
    a simplified setting.}
}