How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Zeyuan Allen-Zhu

Introduction

In convex optimization and machine learning, the classical goal is to design algorithms to decrease objective values, that is, to find points xx with f(x)f(x)εf(x)-f(x^{*})\leq\varepsilon. In contrast, the rate of convergence for the gradients, that is,

the number of iterations TT needed to find a point xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon,

is a harder problem and sometimes needs new algorithmic ideas [Nesterov2012make]. For instance, in the full-gradient setting, accelerated gradient descent alone is suboptimal for this new goal, and one needs additional tricks to get the fastest rate [Nesterov2012make]. We review these tricks in Section 1.1.

In the convex (online) stochastic optimization, to the best of our knowledge, tight bounds are not yet known for finding points with small gradients. The best recorded rate was Tε8/3T\propto\varepsilon^{-8/3} [GhadimiLan2015], and it was raised as an open question [OpenProblem2017Simons] regarding how to improve it.

In this paper, we design two new algorithms, SGD2 which gives rate Tε5/2T\propto\varepsilon^{-5/2} using Nesterov’s tricks, and SGD3 which gives an even better rate Tε2log31εT\propto\varepsilon^{-2}\log^{3}\frac{1}{\varepsilon} which is optimal up to log factors.

We also apply our techniques to design SGD4 and SGD5 for non-convex optimization tasks.

Motivation. Studying the rate of convergence for the minimizing gradients can be important at least for the following two reasons.

In many situations, points with small gradients fit better our final goals.

Designing algorithms to find points with small gradients can help us understand non-convex optimization better and design faster non-convex machine learning algorithms.

Without strong assumptions, non-convex optimization theory is always in terms of finding points with small gradients (i.e., approximate stationary points or local minima). Therefore, to understand non-convex stochastic optimization better, perhaps we should first figure out the best rate for convex stochastic optimization. In addition, if new algorithmic ideas are needed, can we also apply them to the non-convex world? We find positive answers to this question, and also obtain better rates for standard non-convex optimization tasks.

For convex optimization, Nesterov2012make discussed the difference between convergence for objective values vs. for gradients, and introduced two algorithms. We review his results as follows.

Suppose f(x)f(x) is a Lipschitz smooth convex function with smoothness parameter LL. Then, it is well-known that accelerated gradient descent (AGD) [Nesterov2004, Nesterov2005] finds a point xx satisfying f(x)f(x)δf(x)-f(x^{*})\leq\delta using T=O(Lδ)T=O(\frac{\sqrt{L}}{\sqrt{\delta}}) gradient computations of f(x)\nabla f(x). To turn this into a gradient guarantee, we can apply the smoothness property of f(x)f(x) which gives f(x)2L(f(x)f(x))\|\nabla f(x)\|^{2}\leq L(f(x)-f(x^{*})). This means

to get a point xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon, AGD converges in rate TLεT\propto\frac{L}{\varepsilon}.

Nesterov2012make proposed two different tricks to improve upon such rate.

Nesterov’s First Trick: GD After AGD. Recall that starting from a point x0x_{0}, if we perform TT steps of gradient descent (GD) xt+1=xt1Lf(xt)x_{t+1}=x_{t}-\frac{1}{L}\nabla f(x_{t}), then it satisfies t=0T1f(xt)2L(f(x0)f(x))\sum_{t=0}^{T-1}\|\nabla f(x_{t})\|^{2}\leq L(f(x_{0})-f(x^{*})) (see for instance [Nemirovski2004, AO-survey-nesterov]). In addition, if this x0x_{0} is already the output of AGD for another TT iterations, then it satisfies f(x_{0})-f(x^{*})\leq O\big{(}\frac{L}{T^{2}}\big{)}. Putting the two inequalities together, we have \min_{t=0}^{T-1}\big{\{}\|\nabla f(x_{t})\|^{2}\big{\}}\leq O\big{(}\frac{L^{2}}{T^{3}}\big{)}. We call this method “GD after AGD,” and it satisfies

to get a point xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon, “GD after AGD” converges in rate TL2/3ε2/3T\propto\frac{L^{2/3}}{\varepsilon^{2/3}}.

Nesterov’s Second Trick: AGD After Regularization. Alternatively, we can also regularize f(x)f(x) by defining g(x)=f(x)+σ2xx02g(x)=f(x)+\frac{\sigma}{2}\|x-x_{0}\|^{2}. This new function g(x)g(x) is σ\sigma-strongly convex, so AGD converges linearly, meaning that using TLσlogLεT\propto\frac{\sqrt{L}}{\sqrt{\sigma}}\log\frac{L}{\varepsilon} gradients we can find a point xx satisfying g(x)2L(g(x)g(x))ε2\|\nabla g(x)\|^{2}\leq L(g(x)-g(x^{*}))\leq\varepsilon^{2}. If we choose σε\sigma\propto\varepsilon, then this implies f(x)g(x)+ε2ε\|\nabla f(x)\|\leq\|\nabla g(x)\|+\varepsilon\leq 2\varepsilon. We call this method “AGD after regularization,” and it satisfies

to get a point xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon, “AGD after regularization” converges in rate TL1/2ε1/2logLεT\propto\frac{L^{1/2}}{\varepsilon^{1/2}}\log\frac{L}{\varepsilon}.

Nesterov’s Lower Bound. Recall that Nesterov constructed hard-instance functions f(x)f(x) so that, when dimension is sufficiently high, first-order methods require at least T=Ω(L/δ)T=\Omega(\sqrt{L/\delta}) computations of f(x)\nabla f(x) to produce a point xx satisfying f(x)f(x)δf(x)-f(x^{*})\leq\delta (see his textbook [Nesterov2004]). Since f(x)f(x)f(x),xxf(x)xxf(x)-f(x^{*})\leq\langle\nabla f(x),x-x^{*}\rangle\leq\|\nabla f(x)\|\cdot\|x-x^{*}\|, this also implies a lower bound T=Ω(L/ε)T=\Omega(\sqrt{L/\varepsilon}) to find a point xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon. In other words,

to get a point xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon, “AGD after regularization” is optimal (up to a log factor).

2 Our Results: Stochastic Convex Optimization

Both rates are asymptotically optimal in terms of decreasing objective, and V\mathcal{V} is an absolute bound on the variance of the stochastic gradients. Using the same argument f(x)2L(f(x)f(x))\|\nabla f(x)\|^{2}\leq L(f(x)-f(x^{*})) as before, SGD finds a point xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon in

These rates are not optimal. We investigate three approaches to improve such rates.

New Approach 1: SGD after SGD. Recall in Nesterov’s first trick, he replaced the use of the inequality f(x)2L(f(x)f(x))\|\nabla f(x)\|^{2}\leq L(f(x)-f(x^{*})) by TT steps of gradient descent. In the stochastic setting, can we replace this inequality with TT steps of SGD? We call this algorithm SGD1 and prove that

For convex stochastic optimization, SGD1 finds xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon in

We prove Theorem LABEL:thm:sgd1 in the general language of composite function minimization. This allows us to support an additional “proximal” term ψ(x)\psi(x) and minimize ψ(x)+f(x)\psi(x)+f(x). For instance, if ψ(x)=0\psi(x)=0 if xQx\in Q and ψ(x)=+\psi(x)=+\infty if x∉Qx\not\in Q for some convex QQ, then Theorem LABEL:thm:sgd1 is to minimize f(x)f(x) over QQ.

The rate Tε8/3T\propto\varepsilon^{-8/3}, in the special case of ψ(x)0\psi(x)\equiv 0, was first recorded by GhadimiLan2015. Their algorithm is more involved because they also attempted to tighten the lower order terms using acceleration. To the best of our knowledge, our rate T1σ1/2ε2T\propto\frac{1}{\sigma^{1/2}\varepsilon^{2}} in Theorem LABEL:thm:sgd1 is new.

New Approach 2: SGD after regularization. Recall that in Nesterov’s second trick, he defined g(x)=f(x)+σ2xx02g(x)=f(x)+\frac{\sigma}{2}\|x-x_{0}\|^{2} as a regularized version of f(x)f(x), and applied the strongly-convex version of AGD to minimize g(x)g(x). Can we apply this trick to the stochastic setting?

Note the parameter σ\sigma has to be on the magnitude of ε\varepsilon because g(x)=f(x)+σ(xx0)\nabla g(x)=\nabla f(x)+\sigma(x-x_{0}) and we wish to make sure f(x)=g(x)±ε\|\nabla f(x)\|=\|\nabla g(x)\|\pm\varepsilon. Therefore, if we apply SGD1 to minimize g(x)g(x) to find a point g(x)ε\|\nabla g(x)\|\leq\varepsilon, the convergence rate is T1σ1/2ε2=1ε2.5T\propto\frac{1}{\sigma^{1/2}\varepsilon^{2}}=\frac{1}{\varepsilon^{2.5}}. We call this algorithm SGD2.

For convex stochastic optimization, SGD2 finds xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon in

We prove Theorem LABEL:thm:sgd2 also in the general proximal language. This Tε5/2T\propto\varepsilon^{-5/2} rate improves the best known result of Tε8/3T\propto\varepsilon^{-8/3}, but is still far from the lower bound Ω(V/ε2)\Omega(\mathcal{V}/\varepsilon^{2}).

New Approach 3: SGD and recursive regularization. In the second approach above, the ε0.5\varepsilon^{0.5} sub-optimality gap is due to the choice of σε\sigma\propto\varepsilon which ensures σ(xx0)ε\|\sigma(x-x_{0})\|\leq\varepsilon.

Intuitively, if x0x_{0} were sufficiently close to xx^{*} (and thus were also close to the approximate minimizer xx), then we could choose σε\sigma\gg\varepsilon so that σ(xx0)ε\|\sigma(x-x_{0})\|\leq\varepsilon still holds. In other words, an appropriate warm start x0x_{0} could help us break the ε2.5\varepsilon^{-2.5} barrier and get a better convergence rate. However, how to find such x0x_{0}? We find it by constructing a “less warm” starting point and so on. This process is summarized by the following algorithm which recursively finds the warm starts.

For convex stochastic optimization, SGD3 finds xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon in

3 Our Applications: Stochastic Non-Convex Optimization

One natural question to ask is whether our techniques for convex stochastic optimization translate to non-convex performance guarantees? We design two SGD variants to tackle this question.

Such recursive regularization techniques for non-convex optimization have appeared in prior works [CarmonDHS2016, Allenzhu2017-natasha2]. However, different from them, we only use simple SGD variants to minimize each g(x)g(x) and then use SGD3 to get small gradient. We call this algorithm SGD4 and prove that

For non-convex stochastic optimization with σ\sigma-bounded nonconvexity, SGD4 finds xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon in

Perhaps surprisingly, this simple SGD variant already outperforms previous results in the regime of σεL\sigma\leq\varepsilon L. We closely compare SGD4 to them in Figure 1(a) and Table 1.

New Approach 5: SGD for local minima. In the second application, we tackle the more ambitious goal of finding a point xx with both f(x)ε\|\nabla f(x)\|\leq\varepsilon and 2f(x)δI\nabla^{2}f(x)\succeq-\delta\mathbf{I}, known as an (ε,δ)(\varepsilon,\delta)-approximate local minimum. For this harder task, one needs the following two standard assumptions: each fi(x)f_{i}(x) is LL-smooth and f(x)f(x) is L2L_{2}-second-order smooth. (The later means 2f(x)2f(y)2L2xy\|\nabla^{2}f(x)-\nabla^{2}f(y)\|_{2}\leq L_{2}\|x-y\| for every x,yx,y.)

Motivated by the “swing by saddle point” framework of [Allenzhu2017-natasha2], we combine SGD variants with Oja’s algorithm of [AL2017-MMWU] to design a new algorithm SGD5.Oja’s algorithm [oja1982simplified] is itself an SGD variant of power method to find approximate eigenvectors. We rely on the recent work [AL2017-MMWU] which gives the optimal rate for Oja’s algorithm. We prove that

For non-convex stochastic optimization, SGD5 finds xx with f(x)ε\|\nabla f(x)\|\leq\varepsilon and 2f(x)δI\nabla^{2}f(x)\succeq-\delta\mathbf{I} in (ignoring the dependency on L,L2,V,f(x0)f(x)L,L_{2},\mathcal{V},f(x_{0})-f(x^{*}) for simplicity)

We compare SGD5 to known results in Figure 1(b). Perhaps surprisingly, our SGD5, being a simple SGD variant, performs no worse than cubic regularized Newton’s method with T=\widetilde{O}\big{(}\frac{1}{\varepsilon^{3.5}}+\frac{1}{\delta^{6}}+\frac{1}{\varepsilon^{2}\delta^{3}}\big{)} [TripuraneniSJRJ2017] or the best known SGD variant with T=\widetilde{O}\big{(}\frac{1}{\varepsilon^{4}}+\frac{1}{\delta^{5}}+\frac{1}{\varepsilon^{2}\delta^{3}}\big{)} [AllenLi2017-neon2]. Only when σ>ε\sigma>\sqrt{\varepsilon}, SGD5 is outperformed by variance-reduction based methods Neon2\mathtt{Neon2} +SCSG\mathtt{SCSG} [AllenLi2017-neon2] and Natasha2 [Allenzhu2017-natasha2].

Existing SGD variants to find approximate local minima are all based on the “escape saddle points” approach. In contrast, SGD5 is based on the alternative “swing by saddle point” approach. For the difference between the two, we refer interested readers to [AllenLi2017-neon2, Allenzhu2017-natasha2].

4 Roadmap

We introduce notions in Section 2 and formalize the convex problem in Section 3. We review classical (convex) SGD theorems with objective decrease in Section 4. We give an auxiliary lemma in Section LABEL:sec:aux show our SGD3 results in Section LABEL:sec:sgd-grad3. We apply our techniques to non-convex optimization and give algorithms SGD4 and SGD5 in Section LABEL:sec:sgd4+5. We discuss more related work in Appendix LABEL:sec:related, and show our results on SGD1 and SGD2 respectively in Appendix LABEL:sec:sgd-grad1 and Appendix LABEL:sec:sgd-grad2.

Preliminaries

We denote by A2\|\mathbf{A}\|_{2} the spectral norm of matrix A\mathbf{A}. For symmetric matrices A\mathbf{A} and B\mathbf{B}, we write AB\mathbf{A}\succeq\mathbf{B} to indicate that AB\mathbf{A}-\mathbf{B} is positive semidefinite (PSD). Therefore, AσI\mathbf{A}\succeq-\sigma\mathbf{I} if and only if all eigenvalues of A\mathbf{A} are no less than σ-\sigma. We denote by λmin(A)\lambda_{\min}(\mathbf{A}) and λmax(A)\lambda_{\max}(\mathbf{A}) the minimum and maximum eigenvalue of a symmetric matrix A\mathbf{A}.

Recall some definitions on strong convexity and smoothness (and they have other equivalent definitions, see textbook [Nesterov2004]).

For composite function F(x)=ψ(x)+f(x)F(x)=\psi(x)+f(x) where ψ(x)\psi(x) is proper convex, given a parameter η>0\eta>0, the gradient mapping of F()F(\cdot) at point xx is

In particular, if ψ()0\psi(\cdot)\equiv 0, then GF,η(x)f(x)\mathcal{G}_{F,\eta}(x)\equiv\nabla f(x).

Recall the following property about gradient mapping—see for instance [XiaoZhang2014-ProximalSVRG, Lemma 3.7])

The following definition and properties of Fenchel dual for convex functions is classical, and can be found for instance in the textbook [Shalev-Shwartz2011].

h(β)=arg maxy{yβh(y)}\nabla h^{*}(\beta)=\operatornamewithlimits{arg\,max}_{y}\{y^{\top}\beta-h(y)\}.

If h()h(\cdot) is σ\sigma-strongly convex, then h()h^{*}(\cdot) is 1σ\frac{1}{\sigma}-smooth.

Problem Formalization

Throughout this paper (except our nonconvex application Section LABEL:sec:sgd4+5), we minimize the following convex stochastic composite objective:

ψ(x)\psi(x) is proper convex (a.k.a. the proximal term),

fi(x)f_{i}(x) is differentiable for every i[n]i\in[n],

f(x)f(x) is LL-smooth and σ\sigma-strongly convex for some σ[0,L]\sigma\in[0,L] that could be zero,

the stochastic gradients fi(x)\nabla f_{i}(x) have a bounded variance (over the domain of ψ()\psi(\cdot)), that is

We emphasize that the above assumptions are all classical.

In the rest of the paper, we define TT, the gradient complexity, as the number of computations of fi(x)\nabla f_{i}(x). We search for points xx so that the gradient mapping GF,η(x)ε\|\mathcal{G}_{F,\eta}(x)\|\leq\varepsilon for any η1L\eta\approx\frac{1}{L}. Recall from Definition 2.2 that if there is no proximal term (i.e., ψ(x)0\psi(x)\equiv 0), then GF,η(x)=f(x)\mathcal{G}_{F,\eta}(x)=\nabla f(x) for any η>0\eta>0. We want to study the best tradeoff between the gradient complexity TT and the error ε\varepsilon.

Review: SGD with Objective Value Convergence

Recall that stochastic gradient descent (SGD) repeatedly performs proximal updates of the form

where α>0\alpha>0 is some learning rate, and ii is chosen in 1,2,,n1,2,\dots,n uniformly at random per iteration. Note that if ψ(y)0\psi(y)\equiv 0 then xt+1=xtαfi(xt)x_{t+1}=x_{t}-\alpha\nabla f_{i}(x_{t}). For completeness’ sake, we summarize it in Algorithm 1. If f(x)f(x) is also known to be strongly convex, to get the tightest convergence rate, one can repeatedly apply SGD with decreasing learning rate α\alpha [HazanKale2014]. We summarize this algorithm as SGDsc in Algorithm 2.

The following theorem describes the rates of convergence in objective values for SGD and