When the Hessian is a proxy for its Gauss–Newton approximation (and not the other way around) – **Race to the bottom** --- the OPTIM@EPFL blog

Abstract

Classically, we think of the Gauss–Newton map as a convenient (though imperfect) approximation of the Hessian. Here, I argue that (in the overparameterized regime), some kind of Gauss–Newton “approximation” is actually what we want—and the Hessian may be a way to access it.

The classical story around Gauss–Newton’s method goes something like this:

Consider a nonlinear least-squares problem of the form \[\begin{align*} \min_{x \in \Rd} f(x) && \textrm{ where } && f(x) = \frac{1}{2} \|F(x)\|^2 \end{align*}\] for some smooth function \(F \colon \Rd \to \Rm\). This encodes \(m\) equations on \(d\) variables which we would like to solve, as we aim to find \(x\) such that \(F(x) \approx 0\). In order to run an optimization algorithm on \(f\), we figure out its gradient and Hessian: \[\begin{align*} \nabla f(x) & = \D F(x)^\top F(x), \\ \nabla^2 f(x) & = \D F(x)^\top \D F(x) + \sum_{i=1}^m F_i(x) \nabla^2 F_i(x), \end{align*}\] where \(\D F(x) \in \reals^{m \times d}\) is the Jacobian of \(F\) at \(x\) and \(F_1, \ldots, F_m\) are the components of \(F\).

We take a look at the Hessian; we notice there is an annoying term that involves second-order derivatives of \(F\); but we also notice that if \(F(x) \approx 0\) (which hopefully holds near minimizers), then this term is expected to be small. Accordingly, we decide to ignore that term and use \[ H(x) := \D F(x)^\top \D F(x) \] as a proxy for the Hessian.

This is the Gauss–Newton approximation. It is motivated by convenience, and by the fact that it is close to the true Hessian when \(F(x) \approx 0\). But…

Should we want to use something close to the true Hessian to begin with?

I will argue that, in some settings, the Gauss–Newton “approximation” is actually what we should be using, though not necessarily with \(F\) itself. Then, I will argue that filtering the Hessian may be seen as a way to get a better approximation of that Gauss–Newton map.

This is especially relevant in the overparameterized regime, where we expect the set of minimizers \(S\) to form a manifold of positive dimension \(d - m > 0\), leading to degenerate Hessians.

Quick literature note

Such takes have (of course) already appeared, though with different technical ingredients.

The angle here is hopefully sufficiently different to be interesting, specifically because I rely on recent insights that Chris, Quentin and I obtained about Polyak–Łojasiewicz functions and how they relate to nonlinear least-squares.

A few pointers:

A 2026 paper by McKay et al. in the context of machine learning, beyond pure least-squares.
A 2025 paper by Arbel et al., also in ML, with a focus on implicit bias of Gauss–Newton-type methods.
A 2011 paper by Pei Chen which zooms in on the importance of using models that preserve the dimension of the set of minimizers.
A 2010 paper by Buhmiler et al. which focuses on solving \(F(x) = 0\) and also advocates for filtering the Jacobian eigenvalues. Such ideas also appear in a 2023 paper by Zhonggang Zeng.

Gauss–Newton flow down submersions

Consider flowing along the negative gradient of \(f\), modified by some square matrix \(G\) that depends on \(x\): \[ \dot{x} = -G(x) \nabla f(x). \] Three options for \(G\) come to mind in our context:

Gradient flow corresponds to \(G(x) = I\). This is notoriously slow in case of ill-conditioning.
Newton flow corresponds to \(G(x) = \nabla^2 f(x)^{-1}\). This is ill-advised in the overparameterized regime because \(\nabla^2 f(x)\) is degenerate near minimizers.
Gauss–Newton flow corresponds to \(G(x) = H(x)^{\dagger}\), where the dagger denotes the Moore–Penrose pseudoinverse. It is necessary to use the pseudoinverse here because \(H(x)\) is not invertible in the overparameterized regime: \(\rank H(x) \leq m < d\).

What happens to \(F(x)\) along these flows? The chain rule gives \[\begin{align*} \ddt F(x) & = \D F(x) \dot{x} \\ & = -\D F(x) G(x) \nabla f(x) \\ & = -\D F(x) G(x) \D F(x)^\top F(x). \end{align*}\] The Gauss–Newton flow has a visible advantage here, as it leads to \[ \dot{x} = - H(x)^{\dagger} \D F(x)^\top F(x) = - \D F(x)^\dagger F(x) \] and \[ \ddt F(x) = \D F(x) \dot{x} = - \Proj_{\im \D F(x)} F(x), \] where \(\Proj_{\im \D F(x)}\) is the orthogonal projection to \(\im \D F(x)\).

If \(F\) is a submersion, that is, if \(\D F(x)\) has rank \(m\) for all \(x\), then \(\im \D F(x) = \Rm\). (That’s a big if—hold on to that.)

In that case, \(\ddt F(x) = -F(x)\). This converges to \(0\) along a straight line: \[ F(x(t)) = e^{-t} F(x(0)). \] Accordingly, \(f(x(t)) = \frac{1}{2} \|F(x(t))\|^2 = e^{-2t} f(x(0))\) also converges to \(0\) (the optimal value).

These are strong reasons to like the Gauss–Newton flow:

If \(F\) is a submersion, then \(F(x(t))\) goes to zero without detours, and \(x(t)\) approaches the set of minimizers \(S = \{x : F(x) = 0\}\) along a direction \(\dot x(t)\) that becomes normal to \(S\).

Importantly,

This narrative in favor of Gauss–Newton has nothing to do with the fact that \(H\) is close to \(\nabla^2 f\). They are, but \(H\) is relevant in and of itself, regardless of how close it is to \(\nabla^2 f\).

By extension, for discrete-time algorithms, it makes sense to entertain the update direction \(-H(x)^\dagger \nabla f(x)\), or equivalently, \(- \D F(x)^\dagger F(x)\) (computable by linear least-squares). We’ll have a numerical experiment below.

Notice how the assumption that \(F\) is a submersion implies \(\min f = 0\). And indeed, the traditional motivation for Gauss–Newton has us assume \(F(x) = 0\) at minimizers. Still, this is restrictive. It’s time to discuss these assumptions some more.

When might we expect a submersion structure?

In short: if \(f\) has the Morse–Bott property.

We say \(f\) has the Morse–Bott property (MB) if the set of minimizers \(S\) is a smooth manifold and the Hessian \(\nabla^2 f(x)\) is positive definite on the normal space to \(S\) at every \(x \in S\).

This is equivalent to the assumption that \(f\) satisfies the Polyak–Łojasiewicz condition (PŁ) locally around \(S\) (Rebjock and Boumal 2024b). In turn, we know from this post (based on (Boumal, Rebjock, and Criscitiello 2026)) that this implies \[ f(x) = f^* + \frac{1}{2} \|\varphi(x)\|^2 \] locally around \(S\), with \(f^* = \min f\) and some smooth submersion \(\varphi \colon \Rd \to \Rk\) such that \(k = d - \dim S\).

It is not entirely uncommon for a cost function \(f\) (even beyond least-squares) to have the MB/PŁ property around \(S\) (Rebjock and Boumal 2024b). And this can hold even for a least-squares cost \(f(x) = \frac{1}{2} \|F(x)\|^2\) that has nonzero residuals (\(F(x) \neq 0\) at minimizers), because \(\varphi\) can be different from \(F\).

How this flips the script

Then, based on the discussion above, we might want to apply a Gauss–Newton-type method to \(f\), specifically in the form \(f(x) = f^* + \frac{1}{2} \|\varphi(x)\|^2\). But \(\varphi\) may not be available. How can we access what we need from it?

Well, if \(f(x) = f^* + \frac{1}{2} \|\varphi(x)\|^2\), then (exactly as in the computations at the very beginning of this post) \(\nabla f(x) = \D \varphi(x)^\top \varphi(x)\) and \[ \nabla^2 f(x) = \D \varphi(x)^\top \D \varphi(x) + \sum_{i=1}^k \varphi_i(x) \nabla^2 \varphi_i(x). \] Here is the twist:

In this context, \(f\) is available to us (so we know how to compute \(\nabla^2 f\)), but \(\varphi\) is not available to us. We want the Gauss–Newton direction for \(f\), and that involves \(\D \varphi(x)^\top \D \varphi(x)\). What to do?

Then, looking at the expression for the Hessian, we can observe:

Close to minimizers, \(\varphi(x) \approx 0\) and hence \(\nabla^2 f(x) \approx \D \varphi(x)^\top \D \varphi(x)\).

So there we have it: in this narrative, the Hessian \(\nabla^2 f(x)\) is a proxy for the Gauss–Newton map \(H(x) = \D \varphi(x)^\top \D \varphi(x)\), and not the other way around.

Filtering the Hessian

One step further: we want access to \(H(x)^\dagger\), the pseudoinverse of \(H(x) = \D \varphi(x)^\top \D \varphi(x) \approx \nabla^2 f(x)\).

This is not well approximated by \(\nabla^2 f(x)^{\dagger}\), because generally \(\nabla^2 f(x)\) is invertible near \(S\), with a number of very small eigenvalues. These stem from the fact that \(\ker \nabla^2 f(x) = \T_x S\) for \(x \in S\), that is, the Hessian at a minimizer has a zero eigenvalue with multiplicity \(\dim S\). In turn, these yield as many small (though usually nonzero) eigenvalues near \(S\). Then, \(\nabla^2 f(x)^{\dagger} = \nabla^2 f(x)^{-1}\) explodes near \(S\). This is what causes Newton’s method to misbehave in the overparameterized regime.

Since we are using the Hessian only as a surrogate for \(H(x)\), it is clear what we should do: the rank of \(H(x)\) is \(k = d - \dim S\), so we should keep the \(k\) largest eigenvalues of \(\nabla^2 f(x)\) and zero out the \(\dim S\) smallest ones. Then, we can work with the pseudoinverse of this filtered Hessian, which is a much better proxy for \(H(x)^\dagger\) than \(\nabla^2 f(x)^{\dagger}\).

How exactly to implement this filtering takes some thinking of course, especially since \(k\) is typically unknown. Let me just note that the truncated CG algorithm (a standard method used to solve the trust-region subproblem) actually does something like this automatically: see (Rebjock and Boumal 2024a).

We can get to these conclusions through simpler heuristics of course. The particular angle in this post has the advantage that it could also serve as the premise for a proof.

Numerical experiment

Consider the following least-squares cost function in 2-D, as studied in (Rebjock and Boumal 2024b, sec. 4): \[\begin{align*} f(x) = \frac{1}{2} \|F(x)\|^2, && \textrm{ where } && F(x) = x_2 \sqrt{1 + x_1^2}. \end{align*}\] Notice \(F\) is smooth. Also, \[ \D F(x) = \begin{bmatrix} \frac{x_1 x_2}{\sqrt{1 + x_1^2}} & \sqrt{1 + x_1^2} \end{bmatrix} \in \reals^{1 \times 2} \] is nonzero for all \(x\), so \(F\) is a submersion. (In fact, \(f\) is globally PŁ because the smallest (and only) singular value of \(\D F(x)\) is \(\|\D F(x)\| \geq 1\) for all \(x\).)

Newton’s method on \(f\) would have us run this iteration: \[ x_{t+1} = x_t - \nabla^2 f(x_t)^{-1} \nabla f(x_t). \] In contrast, Gauss–Newton on \(f\) corresponds to \[ x_{t+1} = x_t - \D F(x_t)^\dagger F(x_t). \] The set of minimizers of \(f\) is the line \(S = \{x : x_2 = 0\}\). For a simple experiment, we initialize close to \(S\) and run both methods for a few iterations.

This is easily implemented in Matlab, as follows:

clear; clc;

F  = @(x) sqrt(1+x(1)^2) * x(2);
DF = @(x) [x(1)*x(2)/sqrt(1+x(1)^2), sqrt(1+x(1)^2)];

f = @(x) .5*F(x)^2;
gradf = @(x) DF(x)'*F(x);
Hessf = @(x) [x(2)^2,       2*x(1)*x(2);
              2*x(1)*x(2),  1+x(1)^2];

% Fixed initial guess
x0 = [1/2, 1/2]';

% You can also try a random guess, it's not very different
% x0 = randn(2, 1);

niter = 12;
x_N  = zeros(2, niter);  % Newton iterates
x_GN = zeros(2, niter);  % Gauss--Newton iterates

x_N(:, 1) = x0;
for iter = 2 : niter
    x = x_N(:, iter-1);
    x_N(:, iter) = x - Hessf(x)\gradf(x);
end
disp(x_N);

x_GN(:, 1) = x0;
for iter = 2 : niter
    x = x_GN(:, iter-1);
    x_GN(:, iter) = x - DF(x)\F(x);
end
disp(x_GN);

Starting from \(x_0 = [0.5, 0.5]^\top\), the Newton iterates first make a big jump away from \(S\), then slowly converge to it, though they do not seem to converge to any particular point on \(S\):

    0.5000    3.0000    1.8462    0.9639   -0.0764   -0.1546   -0.3252   -0.8517    0.3978    1.2747    0.4112    1.3866
    0.5000   -1.0000   -0.6923   -0.5116   -0.5318    0.0063   -0.0003    0.0001    0.0001   -0.0001   -0.0001    0.0000

In contrast, the Gauss–Newton iterates converge in one step to the closest point on \(S\):

    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000
    0.5000         0         0         0         0         0         0         0         0         0         0         0

Here is also a run with a random initial guess (Newton shown first, then Gauss–Newton):

    1.4090    0.5603   13.2025    8.7680    5.7944    3.7855    2.4034    1.4060    0.5572   11.1834    7.4157    4.8835
    1.4172    1.1354  -12.2421   -8.1770   -5.4751   -3.6867   -2.5163   -1.7803   -1.4275   12.8993    8.6225    5.7834

    1.4090    1.4090    1.4090    1.4090    1.4090    1.4090    1.4090    1.4090    1.4090    1.4090    1.4090    1.4090
    1.4172         0         0         0         0         0         0         0         0         0         0         0

Of course, one would typically also use a line-search method, or safeguard more comprehensively with a trust-region method for example. This may further change the picture.

Aside: the strongly convex case

Consider the special case where \(f \colon \Rd \to \reals\) is smooth and \(\mu\)-strongly convex. Then, \(f\) has a unique minimizer \(x^*\) and \(\nabla^2 f(x) \succeq \mu I\) for all \(x\). This has led to a number of papers considering the Hessian metric on \(\Rd\), namely, \[ \inner{u}{v}_x := u^\top \nabla^2 f(x) v, \] because:

Riemannian gradient flow in the Hessian metric coincides with Newton flow in the Euclidean metric.

I would argue there is another interesting metric to consider. Specifically, it is a standard fact that \(\mu\)-strong convexity implies \(\mu\)-PŁ globally (Karimi, Nutini, and Schmidt 2016). Then, still from our work on PŁ functions, it follows that there exists a diffeomorphism \(\varphi \colon \Rd \to \Rd\) such that \[ f(x) = f^* + \frac{1}{2} \|\varphi(x)\|^2. \] (This map \(\varphi\) is not unique: just pick one.)

Stated differently: \[ (f \circ \varphi^{-1})(y) = f^* + \frac{1}{2} \|y\|^2. \] The right-hand side is a trivial cost function to minimize on \(\Rd\). The two are related by the change of coordinates \(\varphi\).

Now, pullback the Euclidean metric on \(\Rd\) by \(\varphi\). This yields a new Riemannian metric on \(\Rd\) (the pullback metric), defined by \[ \inner{u}{v}_x := \inner{\D \varphi(x) u}{\D \varphi(x) v}_{\Rd} = u^\top \D \varphi(x)^\top \D \varphi(x) v. \] This metric is defined by the Gauss–Newton map \(H(x) = \D \varphi(x)^\top \D \varphi(x)\) rather than by the Hessian map \(\nabla^2 f(x)\).

Call that one the Gauss–Newton metric. By design,

Riemannian gradient flow in the Gauss–Newton metric coincides with Gauss–Newton flow in the Euclidean metric.

This flow is particularly easy to describe because, up to the change of coordinates, it corresponds to gradient flow on \(y \mapsto f^* + \frac{1}{2} \|y\|^2\), that is, \(\dot y = -y\). This traces out a straight line in \(\Rd\), namely, \(y(t) = e^{-t} y(0)\). From here, we recover the flow in the original coordinates, as a geodesic (up to reparametrization) in the Gauss–Newton metric: \[\begin{align*} x(t) & = \varphi^{-1}(y(t)) \\ & = \varphi^{-1}(e^{-t} y(0)) \\ & = \varphi^{-1}(e^{-t} \varphi(x(0))). \end{align*}\] Thus, \[ \varphi(x(t)) = e^{-t} \varphi(x(0)) \] and \(f(x(t)) = f^* + e^{-2t}(f(x(0)) - f^*)\).

(For contrast, with Newton flow it is the gradient that behaves nicely along the trajectory: \(\ddt \nabla f(x(t)) = -\nabla f(x(t))\) so that \(\nabla f(x(t)) = e^{-t} \nabla f(x(0))\). This has funny implications.)

Of course, \(\varphi\) is not available to us, so this is more of a theoretical gimmick.

References

Boumal, N., Q. Rebjock, and C. Criscitiello. 2026. “Smooth, Globally Polyak-Łojasiewicz Functions Are Nonlinear Least-Squares.” arXiv 2604.07972.

Karimi, H., J. Nutini, and M. Schmidt. 2016. “Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition.” In Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 795–811. Springer International Publishing. https://doi.org/10.1007/978-3-319-46128-1_50.

Rebjock, Q., and N. Boumal. 2024a. “Fast Convergence of Trust-Regions for Non-Isolated Minima via Analysis of CG on Indefinite Matrices.” Mathematical Programming. https://doi.org/10.1007/s10107-024-02140-w.

———. 2024b. “Fast Convergence to Non-Isolated Minima: Four Equivalent Conditions for \(C^2\) Functions.” Mathematical Programming. https://doi.org/10.1007/s10107-024-02136-6.

Citation

BibTeX citation:

@online{boumal2026,
  author = {Boumal, Nicolas},
  title = {When the {Hessian} Is a Proxy for Its {Gauss-\/-Newton}
    Approximation (and Not the Other Way Around)},
  date = {2026-05-29},
  url = {www.racetothebottom.xyz/posts/Gauss-Newton/},
  langid = {en},
  abstract = {Classically, we think of the Gauss-\/-Newton map as a
    convenient (though imperfect) approximation of the Hessian. Here, I
    argue that (in the overparameterized regime), some kind of
    Gauss-\/-Newton “approximation” is actually what we want-\/-\/-and
    the Hessian may be a way to access it.}
}