Gradient Descent Converges to Minimizers
Jason D. Lee, Max Simchowitz, Michael I. Jordan, Benjamin Recht
Introduction
Saddle points have long been regarded as a tremendous obstacle for continuous optimization. There are many well known examples when worst case initialization of gradient descent provably converge to saddle points [20, Section 1.2.3], and hardness results which show that finding even a local minimizer of non-convex functions is NP-Hard in the worst case . However, such worst-case analyses have not daunted practitioners, and high quality solutions of continuous optimization problems are readily found by a variety of simple algorithms. Building on tools from the theory of dynamical systems, this paper demonstrates that, under very mild regularity conditions, saddle points are indeed of little concern for the gradient method.
We call a critical point of if , and say that satisfies the strict saddle property if each critical point of is either a local minimizer, or a “strict saddle”, i.e, has at least one strictly negative eigenvalue. We prove:
Here, by sufficiently small, we simply mean less than the inverse of the Lipschitz constant of the gradient. As we discuss below, such step sizes are standard for the gradient method. We remark that the strict saddle assumption is necessary in the worst case, due to hardness results regarding testing the local optimality of functions whose Hessians are highly degenerate at critical points (e.g, quartic polynomials) .
Prior work has show that first-order descent methods can circumvent strict saddle points, provided that they are augmented with unbiased noise whose variance is sufficiently large along each direction. For example, establishes convergence of the Robbins-Monro stochastic approximation to local minimizers for strict saddle functions. More recently, give quantitative rates on the convergence of noise-added stochastic gradient descent to local minimizers, for strict saddle functions. The condition that the noise have large variance along all directions is often not satisfied by the randomness which arises in sample-wise or coordinate-wise stochastic updates. In fact, it generally requires that additional, near-isotropic noise be added at each iteration, which yields convergence rates that depend heavily on problem parameters like dimension. In contrast, our results hold for the simplest implementation of gradient descent and thus do not suffer from the slow convergence associated with adding high-variance noise to each iterate.
But is this strict saddle property reasonable? Many works have answered in the affirmative by demonstrating that many objectives of interest do in fact satisfy the “strict saddle” property: PCA, a fourth-order tensor factorization , formulations of dictionary learning and phase retrieval .
To obtain provable guarantees, the authors of and adopt trust-region methods which leverage Hessian information in order to circumvent saddle points. This approach joins a long line of related strategies, including: a modified Newton’s method with curvilinear line search , the modified Cholesky method , trust-region methods , and the related cubic regularized Newton’s method , to name a few. Specialized to deep learning applications, have introduced a saddle-free Newton method.
Unfortunately, such curvature-based optimization algorithms have a per-iteration computational complexity which scales quadratically or even cubically in the dimension , rendering them unsuitable for optimization of high dimensional functions. In contrast, the complexity of an iteration of gradient descent is linear in dimension. We also remark that the authors of empirically observe gradient descent with random initializations on the phase retrieval problem reliably converges to a local minimizer, and one whose quality matches that of the solution found using more costly trust-region techniques.
More broadly, many recent works have shown that gradient descent plus smart initialization provably converges to the global minimum for a variety of non-convex problems: such settings include matrix factorization , phase retrieval , dictionary learning , and latent-variable models . While our results only guarantee convergence to local minimizers, they eschew the need for complex and often computationally prohibitive initialization procedures.
Finally, some preliminary results have shown that there are settings in which if an algorithm converges to a saddle point it necessarily has a small objective value. For example, studies the loss surface of a particular Gaussian random field as a proxy for understanding the objective landscape of deep neural nets. The results leverage the Kac-Rice Theorem , and establish that that critical points with more positive eigenvalues have lower expected function value, often close to that of the global minimizer. We remark that functions drawn from this Gaussian random field model share the strict saddle property defined above, and so our results apply in this setting. On the other hand, our results are considerably more general, as they do not place stringent generative assumptions on the objective function .
2 Organization
The rest of the paper is organized as follows. Section 2 introduces the notation and definitions used throughout the paper. Section 3 provides an intuitive explanation for why it is unlikely that gradient descent converges to a saddle point, by studying a non-convex quadratic and emphasizing the analogy with power iteration. Section 4 states our main results which guarantee gradient descent converges to only local minimizers, and also establish rates of convergence depending on the local geometry of the minimizer. The primary tool we use is the local stable manifold theorem, accompanied by inversion of gradient descent via the proximal point algorithm. Finally, we conclude in Section 5 by suggesting several directions of future work.
Preliminaries
Throughout the paper, we will use to denote a real-valued function in , the space of twice-continuously differentiable functions, and to denote the corresponding gradient map with step size ,
The Jacobian of is given by , or . In addition to being , our main regularity assumption on is that it has a Lipschitz gradient:
The -fold composition of the gradient map corresponds to performing steps of gradient descent initialized at . The iterates of gradient descent will be denoted . All the probability statements are with respect to , the distribution of , which we assume is absolutely continuous with respect to Lebesgue measure.
A fixed point of the gradient map is a critical point of the function . Critical points can be saddle points, local minima, or local maxima. In this paper, we will study the critical points of via the fixed points of , and then apply dynamical systems theory to .
A point is a critical point of if it is a fixed point of the gradient map , or equivalently .
A critical point is isolated if there is a neighborhood around , and is the only critical point in .
A critical point is a local minimum if there is a neighborhood around such that for all , and a local maximum if .
A critical point is a saddle point if for all neighborhoods around , there are such that .
As mentioned in the introduction, we will be focused on saddle points that have directions of strictly negative curvature. This notion is made precise by the following definition.
A critical point of is a strict saddle if .
Since we are interested in the attraction region of a critical point, we define the stable set.
The global stable set of a critical point is the set of initial conditions of gradient descent that converge to :
Intuition
To illustrate why gradient descent does not converge to saddle points, consider the case of a non-convex quadratic, . Without loss of generality, assume with and . is the unique critical point of this function and the Hessian at is . Note that gradient descent initialized from has iterates
where denote the standard basis vectors. This iteration resembles power iteration with the matrix .
The gradient method is guaranteed to converge with a constant step size provided . For this quadratic , is equal to . Suppose , a slightly stronger condition. Then we will have for and for . If , then converges to the saddle point at since . However, if has a component outside then gradient descent diverges to . For this simple quadratic function, we see that the global stable set (attractive set) of is the subspace . Now, if we choose our initial point at random, the probability of that point landing in is zero.
As an example of this phenomena for a non-quadratic function, consider the following example from [20, Section 1.2.3]. Letting , the corresponding gradient mapping is
The points and are isolated local minima, and is a saddle point.
Gradient descent initialized from any point of the form converges to the saddle point . Any other initial point either diverges, or converges to a local minimum, so the stable set of is the -axis, which is a zero measure set in . By computing the Hessian,
we find that has one positive eigenvalue with eigenvector that spans the -axis, thus agreeing with our above characterization of the stable set. If the initial point is chosen randomly, there is zero probability of initializing on the -axis and thus zero probability of converging to the saddle point .
In the general case, the local stable set of a critical point is well-approximated by the span of the eigenvectors corresponding to positive eigenvalues. By an application of Taylor’s theorem, one can see that if the initial point is uniformly random in a small neighborhood around , then the probability of initializing in the span of these eigenvectors is zero whenever there is a negative eigenvalue. Thus, gradient descent initialized at will leave the neighborhood. The primary difficulty is that is randomly distributed over the entire domain, not a small neighborhood around , and Taylor’s theorem does not provide any global guarantees.
However, the global stable set can be found by inverting the gradient map via . Indeed, the global stable set is precisely . This follows because if a point converges to , then for some sufficiently large it must enter the local stable set. That is, converges to if and only if for sufficiently large . If is of measure zero, then is also of measure zero, and hence the global stable set is of measure zero. Thus, gradient descent will never converge to from a random initialization.
In Section 4, we formalize the above arguments by showing the existence of an inverse gradient map. The case of degenerate critical points, critical points with zero eigenvalues, is more delicate; the geometry of the global stable set is no longer characterized by only the number of positive eigenvectors. However in Section 4, we show that if a critical point has at least one negative eigenvalue, then the global stable set is of measure zero.
Main Results
We now state and prove our main theorem, making our intuition rigorous.
Let be a function and be a strict saddle. Assume that , then
That is, the gradient method never converges to saddle points, provided the step size is not chosen aggressively. Greedy methods that use precise line search may still get stuck at stationary points. However, a short-step gradient method will only converge to minimizers.
Note that even for the convex functions method, a constant step size slightly less than is a nearly optimal choice. Indeed, for , if one runs the gradient method with step size of on a convex function a convergence rate of is attained.
When does not exist, the above theorem is trivially true.
To prove Theorem 4.1, our primary tool will be the theory of Invariant Manifolds. Specifically, we will use Stable-Center Manifold theorem developed in , which allows for a local characterization of the stable set. Recall that a map is a diffeomorphism if is a bijection, and and are continuously differentiable.
Let be a fixed point for the local diffeomorphism , where is a neighborhood of in the Banach space . Suppose that , where is the span of the eigenvectors corresponding to eigenvalues less than or equal to of , and is the span of the eigenvectors corresponding to eigenvalues greater than of . Then there exists a embedded disk that is tangent to at called the local stable center manifold. Moreover, there exists a neighborhood of , such that , and .
To unpack all of this terminology, what the stable manifold theorem says is that if there is a map that diffeomorphically deforms a neighborhood of a critical point, then this implies the existence of a local stable center manifold containing the critical point. This manifold has dimension equal to the number of eigenvalues of the Jacobian of the critical point that are less than . contains all points that are locally forward non-escaping meaning, in a smaller neighborhood , a point converges to after iterating only if it is in .
Relating this back to the gradient method, replace with our gradient map and let be a strict saddle point. We first record a very useful fact:
The gradient mapping with step size is a diffeomorphism.
We will prove this proposition below. But let us first continue to apply the stable manifold theorem. Note that . Thus, the set is a manifold of dimension equal to the number of non-negative eigenvalues of the . Note that by the strict saddle assumption, this manifold has strictly positive codimension and hence has measure zero.
Let be the neighborhood of promised by the Stable Manifold Theorem. If converges to under the gradient map, then there exists a such that for all . This means that , and hence, . That is, we have shown that
Since diffeomorphisms map sets of measure zero to sets of measure zero, and countable unions of measure zero sets have measure zero, we conclude that has measure zero. That is, we have proven Theorem 4.1.
We first check that is injective from for . Suppose that there exist and such that . Then we would have and hence
To show the gradient map is surjective, we will construct an explicit inverse function. The inverse of the gradient mapping is given by performing the proximal point algorithm on the function . The proximal point mapping of centered at is given by
For , the function above is strongly convex with respect to , so there is a unique minimizer. Let be the unique minimizer, then by KKT conditions,
Hence, is mapped to by the gradient map.
We have already shown that is a bijection, and continuously differentiable. Since is invertible for , the inverse function theorem guarantees is continuously differentiable, completing the proof that is a diffeomorphism.
2 Further consequences of Theorem 4.1
Let be the set of saddle points and assume they are all strict. If has at most countably infinite cardinality, then
By applying Corollary 4.1 to each point , we have that . Since the critical points are countable, the conclusion follows since countable union of null sets is a null set. ∎
If the saddle points are isolated points, then the set of saddle points is at most countably infinite.
Assume the same conditions as Theorem 4.6 and exists, thien , where is a local minimizer.
Using the previous theorem, . Since exists and there is zero probability of converging to a saddle, then , where is a local minimizer. ∎
We now discuss two sufficient conditions for to exist. The following proposition prevents from escaping to , by enforcing that has compact sublevel sets, . This is true for any coercive function, , which holds in most machine learning applications since is usually a loss function.
Assume that is continuously differentiable, has isolated critical points, and compact sublevel sets, then exists and that limit is a critical point of .
The second sufficient condition for to exist is based on the Lojasiewicz gradient inequality, which characterizes the steepness of the gradient near a critical point. The Lojasiewicz inequality ensures that the length traveled by the iterates of gradient descent is finite. This will also allow us to derive rates of convergence to a local minimum.
A critical point is satisfies the Lojasiewicz gradient inequality if there exists a neighborhood , , and such that
for all x in .
The Lojasiewicz inequality is very general as discussed in . In fact every analytic function satisfies the Lojasiewicz inequality. Also if the solution is -strongly convex in a neighborhood, then the Lojasiewicz inequality is satisfied with parameters , and .
Assume the same conditions as Theorem 4.6, and the iterates do not escape to , meaning is a bounded sequence. Then exists and for a local minimum .
Furthermore if satisfies the Lojasiewicz gradient inequality for , then for some and independent of ,
The first part of the theorem follows from , which shows that exists. By Theorem 4.8, is a local minimizer . Without loss of generality, we may assume that by shifting the function.
Define , and since it suffices to upper bound .
Since we have established that converges, for large enough we can use the gradient inequality and :
Define and . First consider the case , then . Thus,
where the last inequality uses and .
For , we have established . We show by induction that . The inductive hypothesis guarantees us , so
For ,we have shown . ∎
Conclusion
We have shown that gradient descent with random initialization and appropriate constant step size does not converge to a saddle point. Our analysis relies on a characterization of the local stable set from the theory of invariant manifolds. The geometric characterization is not specific to the gradient descent algorithm. To use Theorem 4.1, we simply need the update step of the algorithm to be a diffeomorphism. For example if is the mapping induced by the proximal point algorithm, then is a diffeomorphism with inverse given by gradient ascent on . Thus the results in Section 4 also apply to the proximal point algorithm. That is, the proximal point algorithm does not converge to saddles. We expect that similar arguments can be used to show ADMM, mirror descent and coordinate descent do not converge to saddle points under appropriate choices of step size. Indeed, convergence to minimizers has been empirically observed for the ADMM algorithm .
It is not clear if the step size restriction () is necessary to avoid saddle points. Most of the constructions where the gradient method converges to saddle points require fragile initial conditions as discussed in Section 3. It remains a possibility that methods that choose step sizes greedily, by Wolfe Line Search or backtracking, may still avoid saddle points provided the initial point is chosen at random. We leave such investigations for future work.
Another important piece of future work would be relaxing the conditions on isolated saddle points. It is possible that for the structured problems that arise in machine learning, whether in matrix factorization or convolutional neural networks, that saddle points are isolated after taking a quotient with respect to the associated symmetry group of the problem. Techniques from dynamical systems on manifolds may be applicable to understand the behavior of optimization algorithms on problems with a high degree of symmetry.
It is also important to understand how stringent the strict saddle assumption is. Will a perturbation of a function always satisfy the strict saddle property? provide very general sufficient conditions for a random function to be Morse, meaning the eigenvalues at critical points are non-zero, which implies the strict saddle condition. These conditions rely on checking the density of has full support conditioned on the event that . This can be explicitly verified for functions that arise from learning problems.
However, we note that there are very difficult unconstrained optimization problems where the strict saddle condition fails. Perhaps the simplest is optimization of quartic polynomials. Indeed, checking if is a local minimizer of the quartic
is equivalent to checking whether the matrix is co-positive, a co-NP complete problem. For this , the Hessian at is zero. Interestingly, the strict saddle property failing is analogous in dynamical systems to the existence of a slow manifold where complex dynamics may emerge. Slow manifolds give rise to metastability, bifurcation, and other chaotic dynamics, and it would be intriguing to see how the analysis of chaotic systems could be applied to understand the behavior of optimization algorithms around these difficult critical points.
Acknowledgements
The authors would like to thank Chi Jin, Tengyu Ma, Robert Nishihara, Mahdi Soltanolkotabi, Yuekai Sun, Jonathan Taylor, and Yuchen Zhang for their insightful feedback. MS is generously supported by an NSF Graduate Research Fellowship. BR is generously supported by ONR awards N00014-14-1-0024, N00014-15-1-2620, and N00014-13-1-0129, and NSF awards CCF-1148243 and CCF-1217058. MIJ is generously supported by ONR award N00014-11-1-0688 and by the ARL and the ARO under grant number W911NF-11-1-0391. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, DOE Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple Inc., Blue Goji, Bosch, Cisco, Cray, Cloudera, Ericsson, Facebook, Fujitsu, Guavus, HP, Huawei, Intel, Microsoft, Pivotal, Samsung, Schlumberger, Splunk, State Farm, Virdata and VMware.