Gradient Descent Converges to Minimizers

Jason D. Lee, Max Simchowitz, Michael I. Jordan, Benjamin Recht

Introduction

Saddle points have long been regarded as a tremendous obstacle for continuous optimization. There are many well known examples when worst case initialization of gradient descent provably converge to saddle points [20, Section 1.2.3], and hardness results which show that finding even a local minimizer of non-convex functions is NP-Hard in the worst case . However, such worst-case analyses have not daunted practitioners, and high quality solutions of continuous optimization problems are readily found by a variety of simple algorithms. Building on tools from the theory of dynamical systems, this paper demonstrates that, under very mild regularity conditions, saddle points are indeed of little concern for the gradient method.

We call xx a critical point of ff if f(x)=0\nabla f(x)=0, and say that ff satisfies the strict saddle property if each critical point xx of ff is either a local minimizer, or a “strict saddle”, i.e, 2f(x)\nabla^{2}f(x) has at least one strictly negative eigenvalue. We prove:

Here, by sufficiently small, we simply mean less than the inverse of the Lipschitz constant of the gradient. As we discuss below, such step sizes are standard for the gradient method. We remark that the strict saddle assumption is necessary in the worst case, due to hardness results regarding testing the local optimality of functions whose Hessians are highly degenerate at critical points (e.g, quartic polynomials) .

Prior work has show that first-order descent methods can circumvent strict saddle points, provided that they are augmented with unbiased noise whose variance is sufficiently large along each direction. For example, establishes convergence of the Robbins-Monro stochastic approximation to local minimizers for strict saddle functions. More recently, give quantitative rates on the convergence of noise-added stochastic gradient descent to local minimizers, for strict saddle functions. The condition that the noise have large variance along all directions is often not satisfied by the randomness which arises in sample-wise or coordinate-wise stochastic updates. In fact, it generally requires that additional, near-isotropic noise be added at each iteration, which yields convergence rates that depend heavily on problem parameters like dimension. In contrast, our results hold for the simplest implementation of gradient descent and thus do not suffer from the slow convergence associated with adding high-variance noise to each iterate.

But is this strict saddle property reasonable? Many works have answered in the affirmative by demonstrating that many objectives of interest do in fact satisfy the “strict saddle” property: PCA, a fourth-order tensor factorization , formulations of dictionary learning and phase retrieval .

To obtain provable guarantees, the authors of and adopt trust-region methods which leverage Hessian information in order to circumvent saddle points. This approach joins a long line of related strategies, including: a modified Newton’s method with curvilinear line search , the modified Cholesky method , trust-region methods , and the related cubic regularized Newton’s method , to name a few. Specialized to deep learning applications, have introduced a saddle-free Newton method.

Unfortunately, such curvature-based optimization algorithms have a per-iteration computational complexity which scales quadratically or even cubically in the dimension dd, rendering them unsuitable for optimization of high dimensional functions. In contrast, the complexity of an iteration of gradient descent is linear in dimension. We also remark that the authors of empirically observe gradient descent with 100100 random initializations on the phase retrieval problem reliably converges to a local minimizer, and one whose quality matches that of the solution found using more costly trust-region techniques.

More broadly, many recent works have shown that gradient descent plus smart initialization provably converges to the global minimum for a variety of non-convex problems: such settings include matrix factorization , phase retrieval , dictionary learning , and latent-variable models . While our results only guarantee convergence to local minimizers, they eschew the need for complex and often computationally prohibitive initialization procedures.

Finally, some preliminary results have shown that there are settings in which if an algorithm converges to a saddle point it necessarily has a small objective value. For example, studies the loss surface of a particular Gaussian random field as a proxy for understanding the objective landscape of deep neural nets. The results leverage the Kac-Rice Theorem , and establish that that critical points with more positive eigenvalues have lower expected function value, often close to that of the global minimizer. We remark that functions drawn from this Gaussian random field model share the strict saddle property defined above, and so our results apply in this setting. On the other hand, our results are considerably more general, as they do not place stringent generative assumptions on the objective function ff.

2 Organization

The rest of the paper is organized as follows. Section 2 introduces the notation and definitions used throughout the paper. Section 3 provides an intuitive explanation for why it is unlikely that gradient descent converges to a saddle point, by studying a non-convex quadratic and emphasizing the analogy with power iteration. Section 4 states our main results which guarantee gradient descent converges to only local minimizers, and also establish rates of convergence depending on the local geometry of the minimizer. The primary tool we use is the local stable manifold theorem, accompanied by inversion of gradient descent via the proximal point algorithm. Finally, we conclude in Section 5 by suggesting several directions of future work.

Preliminaries

Throughout the paper, we will use ff to denote a real-valued function in C2C^{2}, the space of twice-continuously differentiable functions, and gg to denote the corresponding gradient map with step size α\alpha,

The Jacobian of gg is given by Dg(x)ij=gixj(x)Dg(x)_{ij}=\frac{\partial g_{i}}{\partial x_{j}}(x), or Dg(x)=Iα2f(x)Dg(x)=I-\alpha\nabla^{2}f(x). In addition to being C2C^{2}, our main regularity assumption on ff is that it has a Lipschitz gradient:

The kk-fold composition of the gradient map gk(x)g^{k}(x) corresponds to performing kk steps of gradient descent initialized at xx. The iterates of gradient descent will be denoted xk:=gk(x0)x_{k}:=g^{k}(x_{0}). All the probability statements are with respect to ν\nu, the distribution of x0x_{0}, which we assume is absolutely continuous with respect to Lebesgue measure.

A fixed point of the gradient map gg is a critical point of the function ff. Critical points can be saddle points, local minima, or local maxima. In this paper, we will study the critical points of ff via the fixed points of gg, and then apply dynamical systems theory to gg.

A point xx^{*} is a critical point of ff if it is a fixed point of the gradient map g(x)=xg(x^{*})=x^{*}, or equivalently f(x)=0\nabla f(x^{*})=0.

A critical point xx^{*} is isolated if there is a neighborhood UU around xx^{*}, and xx^{*} is the only critical point in UU.

A critical point is a local minimum if there is a neighborhood UU around xx^{*} such that f(x)f(x)f(x^{*})\leq f(x) for all xUx\in U, and a local maximum if f(x)f(x)f(x^{*})\geq f(x).

A critical point is a saddle point if for all neighborhoods UU around xx^{*}, there are x,yUx,y\in U such that f(x)f(x)f(y)f(x)\leq f(x^{*})\leq f(y).

As mentioned in the introduction, we will be focused on saddle points that have directions of strictly negative curvature. This notion is made precise by the following definition.

A critical point xx^{*} of ff is a strict saddle if λmin(2f(x))<0\lambda_{\min}(\nabla^{2}f(x^{*}))<0.

Since we are interested in the attraction region of a critical point, we define the stable set.

The global stable set Ws(x)W^{s}(x^{*}) of a critical point xx^{*} is the set of initial conditions of gradient descent that converge to xx^{*}:

Intuition

To illustrate why gradient descent does not converge to saddle points, consider the case of a non-convex quadratic, f(x)=12xTHxf(x)=\frac{1}{2}x^{T}Hx. Without loss of generality, assume H=diag(λ1,...,λn)H=\mathop{\mathbf{diag}}(\lambda_{1},...,\lambda_{n}) with λ1,...,λk>0\lambda_{1},...,\lambda_{k}>0 and λk+1,,λn<0\lambda_{k+1},\dots,\lambda_{n}<0. x=0x^{*}=0 is the unique critical point of this function and the Hessian at xx^{*} is HH. Note that gradient descent initialized from x0x_{0} has iterates

where eie_{i} denote the standard basis vectors. This iteration resembles power iteration with the matrix IαHI-\alpha H.

The gradient method is guaranteed to converge with a constant step size provided 0<α<2L0<\alpha<\frac{2}{L} . For this quadratic ff, LL is equal to maxλi\max|\lambda_{i}|. Suppose α<1/L\alpha<1/L, a slightly stronger condition. Then we will have (1αλi)<1(1-\alpha\lambda_{i})<1 for iki\leq k and (1αλi)>1(1-\alpha\lambda_{i})>1 for i>ki>k. If x0Es:=span(e1,,ek)x_{0}\in E_{s}:=\operatorname{span}(e_{1},\ldots,e_{k}), then xkx_{k} converges to the saddle point at since (1αλi)k+10(1-\alpha\lambda_{i})^{k+1}\to 0. However, if x0x_{0} has a component outside EsE_{s} then gradient descent diverges to \infty. For this simple quadratic function, we see that the global stable set (attractive set) of is the subspace EsE_{s}. Now, if we choose our initial point at random, the probability of that point landing in EsE_{s} is zero.

As an example of this phenomena for a non-quadratic function, consider the following example from [20, Section 1.2.3]. Letting f(x,y)=12x2+14y412y2f(x,y)=\frac{1}{2}x^{2}+\frac{1}{4}y^{4}-\frac{1}{2}y^{2}, the corresponding gradient mapping is

The points z2z_{2} and z3z_{3} are isolated local minima, and z1z_{1} is a saddle point.

Gradient descent initialized from any point of the form [x0]\begin{bmatrix}x\\ 0\end{bmatrix} converges to the saddle point z1z_{1}. Any other initial point either diverges, or converges to a local minimum, so the stable set of z1z_{1} is the xx-axis, which is a zero measure set in R2\mathbf{R}^{2}. By computing the Hessian,

we find that 2f(z1)\nabla^{2}f(z_{1}) has one positive eigenvalue with eigenvector that spans the xx-axis, thus agreeing with our above characterization of the stable set. If the initial point is chosen randomly, there is zero probability of initializing on the xx-axis and thus zero probability of converging to the saddle point z1z_{1}.

In the general case, the local stable set Wlocs(x)W^{s}_{loc}(x^{*}) of a critical point xx^{*} is well-approximated by the span of the eigenvectors corresponding to positive eigenvalues. By an application of Taylor’s theorem, one can see that if the initial point x0x_{0} is uniformly random in a small neighborhood around xx^{*}, then the probability of initializing in the span of these eigenvectors is zero whenever there is a negative eigenvalue. Thus, gradient descent initialized at x0x_{0} will leave the neighborhood. The primary difficulty is that x0x_{0} is randomly distributed over the entire domain, not a small neighborhood around xx^{*}, and Taylor’s theorem does not provide any global guarantees.

However, the global stable set can be found by inverting the gradient map via g1g^{-1}. Indeed, the global stable set is precisely k=0gk(Wlocs(x))\cup_{k=0}^{\infty}g^{-k}(W^{s}_{loc}(x^{*})). This follows because if a point xx converges to xx^{*}, then for some sufficiently large kk it must enter the local stable set. That is, xx converges to xx^{*} if and only if gk(x)Wlocsg^{k}(x)\in W^{s}_{loc} for sufficiently large kk. If Wlocs(x)W^{s}_{loc}(x^{*}) is of measure zero, then gk(Wlocs(x))g^{-k}(W^{s}_{loc}(x^{*})) is also of measure zero, and hence the global stable set is of measure zero. Thus, gradient descent will never converge to xx^{*} from a random initialization.

In Section 4, we formalize the above arguments by showing the existence of an inverse gradient map. The case of degenerate critical points, critical points with zero eigenvalues, is more delicate; the geometry of the global stable set is no longer characterized by only the number of positive eigenvectors. However in Section 4, we show that if a critical point has at least one negative eigenvalue, then the global stable set is of measure zero.

Main Results

We now state and prove our main theorem, making our intuition rigorous.

Let ff be a C2C^{2} function and xx^{*} be a strict saddle. Assume that 0<α<1L0<\alpha<\frac{1}{L}, then

That is, the gradient method never converges to saddle points, provided the step size is not chosen aggressively. Greedy methods that use precise line search may still get stuck at stationary points. However, a short-step gradient method will only converge to minimizers.

Note that even for the convex functions method, a constant step size slightly less than 1/L1/L is a nearly optimal choice. Indeed, for θ<1\theta<1, if one runs the gradient method with step size of θ/L\theta/L on a convex function a convergence rate of O(1θT)O(\frac{1}{\theta T}) is attained.

When limkxk\lim_{k}x_{k} does not exist, the above theorem is trivially true.

To prove Theorem 4.1, our primary tool will be the theory of Invariant Manifolds. Specifically, we will use Stable-Center Manifold theorem developed in , which allows for a local characterization of the stable set. Recall that a map g:XYg:X\to Y is a diffeomorphism if gg is a bijection, and gg and g1g^{-1} are continuously differentiable.

Let be a fixed point for the CrC^{r} local diffeomorphism ϕ:UE\phi:U\to E, where UU is a neighborhood of in the Banach space EE. Suppose that E=EsEuE=E_{s}\oplus E_{u}, where EsE_{s} is the span of the eigenvectors corresponding to eigenvalues less than or equal to 11 of Dϕ(0)D\phi(0), and EuE_{u} is the span of the eigenvectors corresponding to eigenvalues greater than 11 of Dϕ(0)D\phi(0). Then there exists a CrC^{r} embedded disk WloccsW^{cs}_{loc} that is tangent to EsE_{s} at called the local stable center manifold. Moreover, there exists a neighborhood BB of , such that ϕ(Wloccs)BWloccs\phi(W^{cs}_{loc})\cap B\subset W^{cs}_{loc}, and k=0ϕk(B)Wloccs\cap_{k=0}^{\infty}\phi^{-k}(B)\subset W^{cs}_{loc}.

To unpack all of this terminology, what the stable manifold theorem says is that if there is a map that diffeomorphically deforms a neighborhood of a critical point, then this implies the existence of a local stable center manifold WloccsW^{cs}_{loc} containing the critical point. This manifold has dimension equal to the number of eigenvalues of the Jacobian of the critical point that are less than 11. WlocscW^{sc}_{loc} contains all points that are locally forward non-escaping meaning, in a smaller neighborhood BB, a point converges to xx^{*} after iterating ϕ\phi only if it is in WloccsBW^{cs}_{loc}\cap B.

Relating this back to the gradient method, replace ϕ\phi with our gradient map gg and let xx^{*} be a strict saddle point. We first record a very useful fact:

The gradient mapping gg with step size α<1L\alpha<\frac{1}{L} is a diffeomorphism.

We will prove this proposition below. But let us first continue to apply the stable manifold theorem. Note that Dg(x)=Iα2f(x)Dg(x)=I-\alpha\nabla^{2}f(x). Thus, the set WloccsW^{cs}_{loc} is a manifold of dimension equal to the number of non-negative eigenvalues of the 2f(x)\nabla^{2}f(x). Note that by the strict saddle assumption, this manifold has strictly positive codimension and hence has measure zero.

Let BB be the neighborhood of xx^{*} promised by the Stable Manifold Theorem. If xx converges to xx^{*} under the gradient map, then there exists a TT such that gt(x)Bg^{t}(x)\in B for all tTt\geq T. This means that gt(x)k=0gk(B)g^{t}(x)\in\cap_{k=0}^{\infty}g^{-k}(B), and hence, gt(x)Wloccsg^{t}(x)\in W^{cs}_{loc}. That is, we have shown that

Since diffeomorphisms map sets of measure zero to sets of measure zero, and countable unions of measure zero sets have measure zero, we conclude that WsW^{s} has measure zero. That is, we have proven Theorem 4.1.

We first check that gg is injective from RnRn\mathbf{R}^{n}\to\mathbf{R}^{n} for α<1L\alpha<\frac{1}{L}. Suppose that there exist xx and yy such that g(x)=g(y)g(x)=g(y). Then we would have xy=α(f(x)f(y))x-y=\alpha(\nabla f(x)-\nabla f(y)) and hence

To show the gradient map is surjective, we will construct an explicit inverse function. The inverse of the gradient mapping is given by performing the proximal point algorithm on the function f-f. The proximal point mapping of f-f centered at yy is given by

For α<1L\alpha<\frac{1}{L}, the function above is strongly convex with respect to xx, so there is a unique minimizer. Let xyx_{y} be the unique minimizer, then by KKT conditions,

Hence, xyx_{y} is mapped to yy by the gradient map.

We have already shown that gg is a bijection, and continuously differentiable. Since Dg(x)=Iα2f(x)Dg(x)=I-\alpha\nabla^{2}f(x) is invertible for α<1L\alpha<\frac{1}{L}, the inverse function theorem guarantees g1g^{-1} is continuously differentiable, completing the proof that gg is a diffeomorphism.

2 Further consequences of Theorem 4.1

Let CC be the set of saddle points and assume they are all strict. If CC has at most countably infinite cardinality, then

By applying Corollary 4.1 to each point xCx^{*}\in C, we have that Pr(limkxk=x)=0\Pr(\lim_{k}x_{k}=x^{*})=0. Since the critical points are countable, the conclusion follows since countable union of null sets is a null set. ∎

If the saddle points are isolated points, then the set of saddle points is at most countably infinite.

Assume the same conditions as Theorem 4.6 and limkxk\lim_{k}x_{k} exists, thien Pr(limkxk=x)=1\Pr(\lim_{k}x_{k}=x^{\star})=1, where xx^{\star} is a local minimizer.

Using the previous theorem, Pr(limkxkC)=0\Pr(\lim_{k}x_{k}\in C)=0. Since limkxk\lim_{k}x_{k} exists and there is zero probability of converging to a saddle, then Pr(limkxk=x)=1\Pr(\lim_{k}x_{k}=x^{*})=1, where xx^{*} is a local minimizer. ∎

We now discuss two sufficient conditions for limkxk\lim_{k}x_{k} to exist. The following proposition prevents xkx_{k} from escaping to \infty, by enforcing that ff has compact sublevel sets, {x:f(x)c}\{x:f(x)\leq c\}. This is true for any coercive function, limxf(x)=\lim_{\left\|x\right\|\to\infty}f(x)=\infty, which holds in most machine learning applications since ff is usually a loss function.

Assume that ff is continuously differentiable, has isolated critical points, and compact sublevel sets, then limkxk\lim_{k}x_{k} exists and that limit is a critical point of ff.

The second sufficient condition for limkxk\lim_{k}x_{k} to exist is based on the Lojasiewicz gradient inequality, which characterizes the steepness of the gradient near a critical point. The Lojasiewicz inequality ensures that the length traveled by the iterates of gradient descent is finite. This will also allow us to derive rates of convergence to a local minimum.

A critical point xx^{*} is satisfies the Lojasiewicz gradient inequality if there exists a neighborhood UU, m,ϵ>0m,\epsilon>0, and 0a<10\leq a<1 such that

for all x in {xU:f(x)<f(x)<f(x)+ϵ}\{x\in U:f(x^{*})<f(x)<f(x^{*})+\epsilon\}.

The Lojasiewicz inequality is very general as discussed in . In fact every analytic function satisfies the Lojasiewicz inequality. Also if the solution is μ\mu-strongly convex in a neighborhood, then the Lojasiewicz inequality is satisfied with parameters a=12a=\frac{1}{2}, and m=2μm=\sqrt{2\mu}.

Assume the same conditions as Theorem 4.6, and the iterates do not escape to \infty, meaning {xk}\{x_{k}\} is a bounded sequence. Then limkxk\lim_{k}x_{k} exists and limkxk=x\lim_{k}x_{k}=x^{*} for a local minimum xx^{*}.

Furthermore if xx^{*} satisfies the Lojasiewicz gradient inequality for 0<a120<a\leq\frac{1}{2}, then for some CC and b<1b<1 independent of kk,

The first part of the theorem follows from , which shows that limkxk\lim_{k}x_{k} exists. By Theorem 4.8, limkxk\lim_{k}x_{k} is a local minimizer xx^{*}. Without loss of generality, we may assume that f(x)=0f(x^{*})=0 by shifting the function.

Define ek=j=kxj+1xje_{k}=\sum_{j=k}^{\infty}\left\|x_{j+1}-x_{j}\right\|, and since ekxkxe_{k}\geq\left\|x_{k}-x^{*}\right\| it suffices to upper bound eke_{k}.

Since we have established that xkx_{k} converges, for kk large enough we can use the gradient inequality and f(xk)=xkxk+1α\nabla f(x_{k})=\frac{x_{k}-x_{k+1}}{\alpha}:

Define β=2(mα)1/a(1a)\beta=\frac{2}{(m\alpha)^{1/a}(1-a)} and d=a1ad=\frac{a}{1-a}. First consider the case 0a120\leq a\leq\frac{1}{2}, then d1d\leq 1. Thus,

where the last inequality uses ek<1e_{k}<1 and d1d\leq 1.

For 12<a<1\frac{1}{2}<a<1, we have established ek+1ek1βdekde_{k+1}\leq e_{k}-\frac{1}{\beta^{d}}e_{k}^{d}. We show by induction that ek+1C(k+1)(1a)/(2a1)e_{k+1}\leq\frac{C}{(k+1)^{(1-a)/(2a-1)}}. The inductive hypothesis guarantees us ekCk(1a)/(2a1)e_{k}\leq\frac{C}{k^{(1-a)/(2a-1)}}, so

For Cd1>βdC^{d-1}>\beta^{d},we have shown ek+1C(k+1)(1a)/(2a1)e_{k+1}\leq\frac{C}{(k+1)^{(1-a)/(2a-1)}}. ∎

Conclusion

We have shown that gradient descent with random initialization and appropriate constant step size does not converge to a saddle point. Our analysis relies on a characterization of the local stable set from the theory of invariant manifolds. The geometric characterization is not specific to the gradient descent algorithm. To use Theorem 4.1, we simply need the update step of the algorithm to be a diffeomorphism. For example if gg is the mapping induced by the proximal point algorithm, then gg is a diffeomorphism with inverse given by gradient ascent on f-f. Thus the results in Section 4 also apply to the proximal point algorithm. That is, the proximal point algorithm does not converge to saddles. We expect that similar arguments can be used to show ADMM, mirror descent and coordinate descent do not converge to saddle points under appropriate choices of step size. Indeed, convergence to minimizers has been empirically observed for the ADMM algorithm .

It is not clear if the step size restriction (α<1/L\alpha<1/L) is necessary to avoid saddle points. Most of the constructions where the gradient method converges to saddle points require fragile initial conditions as discussed in Section 3. It remains a possibility that methods that choose step sizes greedily, by Wolfe Line Search or backtracking, may still avoid saddle points provided the initial point is chosen at random. We leave such investigations for future work.

Another important piece of future work would be relaxing the conditions on isolated saddle points. It is possible that for the structured problems that arise in machine learning, whether in matrix factorization or convolutional neural networks, that saddle points are isolated after taking a quotient with respect to the associated symmetry group of the problem. Techniques from dynamical systems on manifolds may be applicable to understand the behavior of optimization algorithms on problems with a high degree of symmetry.

It is also important to understand how stringent the strict saddle assumption is. Will a perturbation of a function always satisfy the strict saddle property? provide very general sufficient conditions for a random function to be Morse, meaning the eigenvalues at critical points are non-zero, which implies the strict saddle condition. These conditions rely on checking the density of 2f(x)\nabla^{2}f(x) has full support conditioned on the event that f(x)=0\nabla f(x)=0. This can be explicitly verified for functions ff that arise from learning problems.

However, we note that there are very difficult unconstrained optimization problems where the strict saddle condition fails. Perhaps the simplest is optimization of quartic polynomials. Indeed, checking if is a local minimizer of the quartic

is equivalent to checking whether the matrix Q=[qij]Q=[q_{ij}] is co-positive, a co-NP complete problem. For this ff, the Hessian at x=0x=0 is zero. Interestingly, the strict saddle property failing is analogous in dynamical systems to the existence of a slow manifold where complex dynamics may emerge. Slow manifolds give rise to metastability, bifurcation, and other chaotic dynamics, and it would be intriguing to see how the analysis of chaotic systems could be applied to understand the behavior of optimization algorithms around these difficult critical points.

Acknowledgements

The authors would like to thank Chi Jin, Tengyu Ma, Robert Nishihara, Mahdi Soltanolkotabi, Yuekai Sun, Jonathan Taylor, and Yuchen Zhang for their insightful feedback. MS is generously supported by an NSF Graduate Research Fellowship. BR is generously supported by ONR awards N00014-14-1-0024, N00014-15-1-2620, and N00014-13-1-0129, and NSF awards CCF-1148243 and CCF-1217058. MIJ is generously supported by ONR award N00014-11-1-0688 and by the ARL and the ARO under grant number W911NF-11-1-0391. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, DOE Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple Inc., Blue Goji, Bosch, Cisco, Cray, Cloudera, Ericsson, Facebook, Fujitsu, Guavus, HP, Huawei, Intel, Microsoft, Pivotal, Samsung, Schlumberger, Splunk, State Farm, Virdata and VMware.

References