Randomization is necessary to escape saddle points in high dimension – Race to the bottom

Abstract

It is folklore that deterministic optimization algorithms can have a hard time escaping saddle points in high-dimensional settings. We propose one way of making this precise, with a resisting oracle argument.

Say we want to minimize \(f \colon \reals^d \to \reals\) in a high-dimensional space (large \(d\)). A priori, it may have saddle points, that is, critical points that are not even local minimizers. Though optimization algorithms are unlikely to converge to saddles, the mere presence of these saddles can dramatically slow down algorithms (Du et al. 2017).

It is well known that having some randomness in the algorithm can help in this regard. For example, with a healthy dose of assumptions, Jin et al. (2017) showed that perturbed gradient descent (which essentially consists in adding well-calibrated jitter to standard gradient descent) is able to escape saddle points using a number of function and gradient evaluations whose dependence on the dimension \(d\) is only polylogarithmic. This is to be understood as: for all cost functions in some nice class, the algorithm has a good probability of succeeding this fast.

For contrast, deterministic algorithms need at least \(\Omega(d)\) oracle calls for some cost functions. With more words:

In order to escape the saddle points of \(f\) efficiently, it is necessary to use a randomized algorithm, that is, one that includes some form of jitter. Without randomization, some cost functions may force the algorithm to query as many as (roughly) \(d\) pieces of information about \(f\).

This is folklore. One can get the essence of the formalization from (Simchowitz, El Alaoui, and Recht 2017, Ex. 2.1) in a different (though related) context.

The high-level idea there is that (strict) saddle points can be distinguished from local minimizers by detecting a negative eigenvalue in the Hessian; yet, using only Hessian-vector products, an adversary could arrange for the Hessian to have a single negative eigenvalue, and then for our first \(d-1\) query vectors to be orthogonal to the corresponding eigenspace.

However, this reduction is not entirely satisfactory because, in optimization, we are not forced to query the Hessian repeatedly at the same point. One may wonder whether clever tricks and perhaps some nice assumptions on \(f\) would make the issue go away.

In this post, we formalize the idea that: no, the issues won’t go away. The proof is a simple construction based on a resisting oracle argument, à la Nemirovski and Yudin.

Figure 1: The function \(f_q\) (Eq. 1) looks like a pure quadratic in all but one of the dimensions. It is only once we discover the special direction (spanned by \(q\)) that we find out the apparent minimizer of that (restricted) quadratic is nothing but a saddle point. The resisting oracle chooses \(q\) adversarially in order to delay the algorithm’s epiphany for as long as possible.

Context

The proof below is extracted from our paper about a randomized trust-region method with Xiaowen Jiang and Bonan Sun (Dragomir et al. 2026). We built that algorithm to combine three features that are seldom found together:

The algorithm is practical (just one extra parameter).
It escapes saddle points efficiently (which requires jitter).
Yet it is still able to converge to local minimizers (which requires steadiness).

Our starting point was to add the randomness in the initialization of the trust-region subproblem solver. We show that this randomness is automatically amplified if we are near a saddle point, or absorbed if we are near a minimizer.

The lower bound

For our purposes, a deterministic algorithm is any procedure which starts without any knowledge of \(f\) and iterates the following:

Select \(x \in \Rd\) and \(u \in \Rd\) (based on all that’s known so far, deterministically).
Query \(f(x)\), \(\nabla f(x)\) and \(\nabla^2 f(x)[u]\) (this is one unit of work).

This generates a sequence of pairs \((x_1, u_1), (x_2, u_2), (x_3, u_3), \ldots\)

Two comments: (1) the algorithm is in charge of choosing \(x_1\) (for example, we think of gradient descent with two different initial guesses as two different algorithms); and (2) if we do not need a Hessian-vector product at \(x_k\), we can let \(u_k = 0\) (the point is that the lower bound applies even if we are allowed to use the Hessian).

Theorem 1 For every \(K\) and every deterministic algorithm, there exists a nice function \(f \colon \reals^{2K+1} \to \reals\) (more on this later) with the following properties:

The origin is a (strict) saddle point.
\(f\) takes the same value at all of its saddle points, namely, \(\bar{f} := f(0)\).
\(f\) takes the same value at all of its local minimizers, namely, \(\min f < \bar f\).
Running the algorithm on \(f\) generates points \(x_1, x_2, \ldots\) such that \(f(x_1), \ldots, f(x_K) \geq \bar f\).

Taking \(d = 2K+1\), this gives a lower bound of \((d+1)/2\) oracle calls (that is, \(K+1\)) for the algorithm to descend below the level of any saddle point of \(f\), despite the fact that all local minimizers are below that level.

Proof. The resisting oracle entertains a collection of cost functions. Explicitly, for each unit-norm \(q \in \reals^{2K+1}\), let \[ f_q(x) = \frac{1}{2} \|(I - qq^\top)x\|^2 + \cos(q^\top x). \tag{1}\] It attains its minimal value when \(x\) is parallel to \(q\) (so that the first term vanishes) and also \(q^\top x\) is an odd multiple of \(\pi\) (so that the cosine term is \(-1\)). For example, \(f_q(\pi q) = -1\) is minimal.

The gradient and Hessian are: \[\begin{align*} \nabla f_q(x) & = (I - qq^\top)x - \sin(q^\top x) q, \\ \nabla^2 f_q(x)[u] & = (I - qq^\top)u - \cos(q^\top x) (q^\top u) q. \end{align*}\] Notice \(f_q(0) = 1\), \(\nabla f_q(0) = 0\) and \(\nabla^2 f_q(0)[q] = -q\). Thus, the origin is a saddle point, with one negative eigenvalue equal to -1.

Moreover, all critical points are of the form \(x = n\pi q\) for some integer \(n\). The saddle points correspond to \(n\) even, with \(f_q\) taking the value \(1\) at all of them. The minimizers correspond to \(n\) odd, with \(f_q\) taking the value \(-1\) at all of them.

Further notice that for all \(x, u\) orthogonal to \(q\), we have: \[\begin{align*} f_q(x) = \frac{1}{2} \|x\|^2 + 1, && \nabla f_q(x) = x, && \textrm{ and } && \nabla^2 f_q(x)[u] = u. \end{align*}\] The right-hand sides are independent of \(q\)—this is key!

Indeed, this allows the resisting oracle to do the following:

In the beginning, do not choose a specific cost function \(f\).
For the first \(K\) queries \((x_1, u_1), \ldots, (x_K, u_K)\), reply \(f(x_k) = \frac{1}{2} \|x_k\|^2 + 1\), \(\nabla f(x_k) = x_k\) and \(\nabla^2 f(x_k)[u_k] = u_k\).
Only then, choose a unit \(q \in \reals^{2K+1}\) orthogonal to all query vectors \(x_1, \ldots, x_K\) and \(u_1, \ldots, u_K\). Such a \(q\) exists because at most \(2K\) vectors have been queried in a space of dimension \(2K+1\).
Pretend \(f\) was \(f_q\) all along.

This “after the fact” choice of \(f\) is compatible with all our previous replies. And indeed, on this function \(f = f_q\), the given algorithm’s first \(K\) iterates are unable to descend below the value of the strict saddles (\(f(0) = 1\)), well above the minimal value (\(\min f = -1\)).

(This type of reasoning is typical in optimization, pioneered by Nemirovski and Yudin.)

These are indeed nice functions

The usefulness of Theorem 1 stems from the fact that (a) it applies to a large class of algorithms, and (b) the functions \(f_q\) used by the resisting oracle satisfy all sorts of assumptions found in the optimization literature.

Explicitly, here are some of the qualities of \(f_q\) (Eq. 1):

It is real analytic (in particular, \(C^\infty\)).
Its gradient is 1-Lipschitz continuous.
Its Hessian is 1-Lipschitz continuous.
At each critical point, the eigenvalues of the Hessian lie outside of the interval \((-1, 1)\) (in particular, it is a Morse function).
All of its local minimizers are global minimizers.
The distance from any \(x\) to the closest critical point is at most \(\frac{\pi}{2} \|\nabla f(x)\|\) (that is, if the gradient is small, then there is a critical point nearby; the other way around: if we are far from all critical points, then the gradient is somewhat large).

Therefore, the obstruction is fairly general. To make it go away, we would have to make unusually strong assumptions on \(f\) that somehow exclude these nice functions.

References

Dragomir, R.-A., X. Jiang, B. Sun, and N. Boumal. 2026. “A Practical Randomized Trust-Region Method to Escape Saddle Points in High Dimension.” arXiv 2603.15494.

Du, S., C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos. 2017. “Gradient Descent Can Take Exponential Time to Escape Saddle Points.” Advances in Neural Information Processing Systems 30.

Jin, Chi, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. 2017. “How to Escape Saddle Points Efficiently.” In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 1724–32. ICML’17. Sydney, NSW, Australia: JMLR.org.

Simchowitz, M., A. El Alaoui, and B. Recht. 2017. “On the Gap Between Strict-Saddles and True Convexity: An \(\Omega(\log d)\) Lower Bound for Eigenvector Approximation.” arXiv Preprint arXiv:1704.04548. https://arxiv.org/abs/1704.04548.

Citation

BibTeX citation:

@online{dragomir2026,
  author = {Dragomir, Radu and Boumal, Nicolas},
  title = {Randomization Is Necessary to Escape Saddle Points in High
    Dimension},
  date = {2026-07-02},
  url = {www.racetothebottom.xyz/posts/randomized-escape/},
  langid = {en},
  abstract = {It is folklore that deterministic optimization algorithms
    can have a hard time escaping saddle points in high-dimensional
    settings. We propose one way of making this precise, with a
    resisting oracle argument.}
}