How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD
Zeyuan Allen-Zhu
Introduction
In convex optimization and machine learning, the classical goal is to design algorithms to decrease objective values, that is, to find points with . In contrast, the rate of convergence for the gradients, that is,
the number of iterations needed to find a point with ,
is a harder problem and sometimes needs new algorithmic ideas [Nesterov2012make]. For instance, in the full-gradient setting, accelerated gradient descent alone is suboptimal for this new goal, and one needs additional tricks to get the fastest rate [Nesterov2012make]. We review these tricks in Section 1.1.
In the convex (online) stochastic optimization, to the best of our knowledge, tight bounds are not yet known for finding points with small gradients. The best recorded rate was [GhadimiLan2015], and it was raised as an open question [OpenProblem2017Simons] regarding how to improve it.
In this paper, we design two new algorithms, SGD2 which gives rate using Nesterov’s tricks, and SGD3 which gives an even better rate which is optimal up to log factors.
We also apply our techniques to design SGD4 and SGD5 for non-convex optimization tasks.
Motivation. Studying the rate of convergence for the minimizing gradients can be important at least for the following two reasons.
In many situations, points with small gradients fit better our final goals.
Designing algorithms to find points with small gradients can help us understand non-convex optimization better and design faster non-convex machine learning algorithms.
Without strong assumptions, non-convex optimization theory is always in terms of finding points with small gradients (i.e., approximate stationary points or local minima). Therefore, to understand non-convex stochastic optimization better, perhaps we should first figure out the best rate for convex stochastic optimization. In addition, if new algorithmic ideas are needed, can we also apply them to the non-convex world? We find positive answers to this question, and also obtain better rates for standard non-convex optimization tasks.
For convex optimization, Nesterov2012make discussed the difference between convergence for objective values vs. for gradients, and introduced two algorithms. We review his results as follows.
Suppose is a Lipschitz smooth convex function with smoothness parameter . Then, it is well-known that accelerated gradient descent (AGD) [Nesterov2004, Nesterov2005] finds a point satisfying using gradient computations of . To turn this into a gradient guarantee, we can apply the smoothness property of which gives . This means
to get a point with , AGD converges in rate .
Nesterov2012make proposed two different tricks to improve upon such rate.
Nesterov’s First Trick: GD After AGD. Recall that starting from a point , if we perform steps of gradient descent (GD) , then it satisfies (see for instance [Nemirovski2004, AO-survey-nesterov]). In addition, if this is already the output of AGD for another iterations, then it satisfies f(x_{0})-f(x^{*})\leq O\big{(}\frac{L}{T^{2}}\big{)}. Putting the two inequalities together, we have \min_{t=0}^{T-1}\big{\{}\|\nabla f(x_{t})\|^{2}\big{\}}\leq O\big{(}\frac{L^{2}}{T^{3}}\big{)}. We call this method “GD after AGD,” and it satisfies
to get a point with , “GD after AGD” converges in rate .
Nesterov’s Second Trick: AGD After Regularization. Alternatively, we can also regularize by defining . This new function is -strongly convex, so AGD converges linearly, meaning that using gradients we can find a point satisfying . If we choose , then this implies . We call this method “AGD after regularization,” and it satisfies
to get a point with , “AGD after regularization” converges in rate .
Nesterov’s Lower Bound. Recall that Nesterov constructed hard-instance functions so that, when dimension is sufficiently high, first-order methods require at least computations of to produce a point satisfying (see his textbook [Nesterov2004]). Since , this also implies a lower bound to find a point with . In other words,
to get a point with , “AGD after regularization” is optimal (up to a log factor).
2 Our Results: Stochastic Convex Optimization
Both rates are asymptotically optimal in terms of decreasing objective, and is an absolute bound on the variance of the stochastic gradients. Using the same argument as before, SGD finds a point with in
These rates are not optimal. We investigate three approaches to improve such rates.
New Approach 1: SGD after SGD. Recall in Nesterov’s first trick, he replaced the use of the inequality by steps of gradient descent. In the stochastic setting, can we replace this inequality with steps of SGD? We call this algorithm SGD1 and prove that
For convex stochastic optimization, SGD1 finds with in
We prove Theorem LABEL:thm:sgd1 in the general language of composite function minimization. This allows us to support an additional “proximal” term and minimize . For instance, if if and if for some convex , then Theorem LABEL:thm:sgd1 is to minimize over .
The rate , in the special case of , was first recorded by GhadimiLan2015. Their algorithm is more involved because they also attempted to tighten the lower order terms using acceleration. To the best of our knowledge, our rate in Theorem LABEL:thm:sgd1 is new.
New Approach 2: SGD after regularization. Recall that in Nesterov’s second trick, he defined as a regularized version of , and applied the strongly-convex version of AGD to minimize . Can we apply this trick to the stochastic setting?
Note the parameter has to be on the magnitude of because and we wish to make sure . Therefore, if we apply SGD1 to minimize to find a point , the convergence rate is . We call this algorithm SGD2.
For convex stochastic optimization, SGD2 finds with in
We prove Theorem LABEL:thm:sgd2 also in the general proximal language. This rate improves the best known result of , but is still far from the lower bound .
New Approach 3: SGD and recursive regularization. In the second approach above, the sub-optimality gap is due to the choice of which ensures .
Intuitively, if were sufficiently close to (and thus were also close to the approximate minimizer ), then we could choose so that still holds. In other words, an appropriate warm start could help us break the barrier and get a better convergence rate. However, how to find such ? We find it by constructing a “less warm” starting point and so on. This process is summarized by the following algorithm which recursively finds the warm starts.
For convex stochastic optimization, SGD3 finds with in
3 Our Applications: Stochastic Non-Convex Optimization
One natural question to ask is whether our techniques for convex stochastic optimization translate to non-convex performance guarantees? We design two SGD variants to tackle this question.
Such recursive regularization techniques for non-convex optimization have appeared in prior works [CarmonDHS2016, Allenzhu2017-natasha2]. However, different from them, we only use simple SGD variants to minimize each and then use SGD3 to get small gradient. We call this algorithm SGD4 and prove that
For non-convex stochastic optimization with -bounded nonconvexity, SGD4 finds with in
Perhaps surprisingly, this simple SGD variant already outperforms previous results in the regime of . We closely compare SGD4 to them in Figure 1(a) and Table 1.
New Approach 5: SGD for local minima. In the second application, we tackle the more ambitious goal of finding a point with both and , known as an -approximate local minimum. For this harder task, one needs the following two standard assumptions: each is -smooth and is -second-order smooth. (The later means for every .)
Motivated by the “swing by saddle point” framework of [Allenzhu2017-natasha2], we combine SGD variants with Oja’s algorithm of [AL2017-MMWU] to design a new algorithm SGD5.Oja’s algorithm [oja1982simplified] is itself an SGD variant of power method to find approximate eigenvectors. We rely on the recent work [AL2017-MMWU] which gives the optimal rate for Oja’s algorithm. We prove that
For non-convex stochastic optimization, SGD5 finds with and in (ignoring the dependency on for simplicity)
We compare SGD5 to known results in Figure 1(b). Perhaps surprisingly, our SGD5, being a simple SGD variant, performs no worse than cubic regularized Newton’s method with T=\widetilde{O}\big{(}\frac{1}{\varepsilon^{3.5}}+\frac{1}{\delta^{6}}+\frac{1}{\varepsilon^{2}\delta^{3}}\big{)} [TripuraneniSJRJ2017] or the best known SGD variant with T=\widetilde{O}\big{(}\frac{1}{\varepsilon^{4}}+\frac{1}{\delta^{5}}+\frac{1}{\varepsilon^{2}\delta^{3}}\big{)} [AllenLi2017-neon2]. Only when , SGD5 is outperformed by variance-reduction based methods + [AllenLi2017-neon2] and Natasha2 [Allenzhu2017-natasha2].
Existing SGD variants to find approximate local minima are all based on the “escape saddle points” approach. In contrast, SGD5 is based on the alternative “swing by saddle point” approach. For the difference between the two, we refer interested readers to [AllenLi2017-neon2, Allenzhu2017-natasha2].
4 Roadmap
We introduce notions in Section 2 and formalize the convex problem in Section 3. We review classical (convex) SGD theorems with objective decrease in Section 4. We give an auxiliary lemma in Section LABEL:sec:aux show our SGD3 results in Section LABEL:sec:sgd-grad3. We apply our techniques to non-convex optimization and give algorithms SGD4 and SGD5 in Section LABEL:sec:sgd4+5. We discuss more related work in Appendix LABEL:sec:related, and show our results on SGD1 and SGD2 respectively in Appendix LABEL:sec:sgd-grad1 and Appendix LABEL:sec:sgd-grad2.
Preliminaries
We denote by the spectral norm of matrix . For symmetric matrices and , we write to indicate that is positive semidefinite (PSD). Therefore, if and only if all eigenvalues of are no less than . We denote by and the minimum and maximum eigenvalue of a symmetric matrix .
Recall some definitions on strong convexity and smoothness (and they have other equivalent definitions, see textbook [Nesterov2004]).
For composite function where is proper convex, given a parameter , the gradient mapping of at point is
In particular, if , then .
Recall the following property about gradient mapping—see for instance [XiaoZhang2014-ProximalSVRG, Lemma 3.7])
The following definition and properties of Fenchel dual for convex functions is classical, and can be found for instance in the textbook [Shalev-Shwartz2011].
.
If is -strongly convex, then is -smooth.
Problem Formalization
Throughout this paper (except our nonconvex application Section LABEL:sec:sgd4+5), we minimize the following convex stochastic composite objective:
is proper convex (a.k.a. the proximal term),
is differentiable for every ,
is -smooth and -strongly convex for some that could be zero,
the stochastic gradients have a bounded variance (over the domain of ), that is
We emphasize that the above assumptions are all classical.
In the rest of the paper, we define , the gradient complexity, as the number of computations of . We search for points so that the gradient mapping for any . Recall from Definition 2.2 that if there is no proximal term (i.e., ), then for any . We want to study the best tradeoff between the gradient complexity and the error .
Review: SGD with Objective Value Convergence
Recall that stochastic gradient descent (SGD) repeatedly performs proximal updates of the form
where is some learning rate, and is chosen in uniformly at random per iteration. Note that if then . For completeness’ sake, we summarize it in Algorithm 1. If is also known to be strongly convex, to get the tightest convergence rate, one can repeatedly apply SGD with decreasing learning rate [HazanKale2014]. We summarize this algorithm as SGDsc in Algorithm 2.
The following theorem describes the rates of convergence in objective values for SGD and