How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Zeyuan Allen-Zhu

Introduction

In convex optimization and machine learning, the classical goal is to design algorithms to decrease objective values, that is, to find points $x$ with $f(x)-f(x^{*})\leq\varepsilon$ . In contrast, the rate of convergence for the gradients, that is,

the number of iterations $T$ needed to find a point $x$ with $\|\nabla f(x)\|\leq\varepsilon$ ,

is a harder problem and sometimes needs new algorithmic ideas [Nesterov2012make]. For instance, in the full-gradient setting, accelerated gradient descent alone is suboptimal for this new goal, and one needs additional tricks to get the fastest rate [Nesterov2012make]. We review these tricks in Section 1.1.

In convex (online) stochastic optimization, to the best of our knowledge, tight bounds are not yet known for finding points with small gradients. The best recorded rate was $T\propto\varepsilon^{-8/3}$ [GhadimiLan2015], and how to improve it was raised as an open question [OpenProblem2017Simons].

In this paper, we design two new algorithms: SGD2, which gives rate $T\propto\varepsilon^{-5/2}$ using Nesterov’s tricks, and SGD3, which gives an even better rate $T\propto\varepsilon^{-2}\log^{3}\frac{1}{\varepsilon}$ that is optimal up to log factors.

We also apply our techniques to design SGD4 and SGD5 for non-convex optimization tasks.

Motivation. Studying the rate of convergence for minimizing gradients can be important for at least the following two reasons.

In many situations, points with small gradients fit our final goals better.

Designing algorithms to find points with small gradients can help us understand non-convex optimization better and design faster non-convex machine learning algorithms.

Without strong assumptions, non-convex optimization theory is always in terms of finding points with small gradients (i.e., approximate stationary points or local minima). Therefore, to understand non-convex stochastic optimization better, perhaps we should first figure out the best rate for convex stochastic optimization. In addition, if new algorithmic ideas are needed, can we also apply them to the non-convex world? We find positive answers to this question, and also obtain better rates for standard non-convex optimization tasks.

For convex optimization, Nesterov2012make discussed the difference between convergence for objective values vs. for gradients, and introduced two algorithms. We review his results as follows.

Suppose $f(x)$ is a Lipschitz smooth convex function with smoothness parameter $L$ . Then, it is well-known that accelerated gradient descent (AGD) [Nesterov2004, Nesterov2005] finds a point $x$ satisfying $f(x)-f(x^{*})\leq\delta$ using $T=O(\frac{\sqrt{L}}{\sqrt{\delta}})$ gradient computations of $\nabla f(x)$ . To turn this into a gradient guarantee, we can apply the smoothness property of $f(x)$ which gives $\|\nabla f(x)\|^{2}\leq L(f(x)-f(x^{*}))$ . This means

to get a point $x$ with $\|\nabla f(x)\|\leq\varepsilon$ , AGD converges in rate $T\propto\frac{L}{\varepsilon}$ .
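The arithmetic behind this rate can be spelled out in a few lines. The sketch below suppresses constants and the dependence on $\|x_{0}-x^{*}\|$, as the surrounding text does; the choice $\delta=\varepsilon^{2}/L$ is the natural one suggested by the smoothness inequality and is stated here as an assumption rather than a quoted step.

```latex
\begin{align*}
f(x)-f(x^{*})\leq\delta
  \;&\Longrightarrow\;
  \|\nabla f(x)\|^{2}\leq L\big(f(x)-f(x^{*})\big)\leq L\delta, \\
\text{choose } \delta=\frac{\varepsilon^{2}}{L}
  \;&\Longrightarrow\;
  \|\nabla f(x)\|\leq\varepsilon, \\
T=O\Big(\sqrt{L/\delta}\Big)
  &= O\Big(\sqrt{L\cdot L/\varepsilon^{2}}\Big)
  = O\Big(\frac{L}{\varepsilon}\Big).
\end{align*}
```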

Nesterov2012make proposed two different tricks to improve upon this rate.

Nesterov’s First Trick: GD After AGD. Recall that starting from a point $x_{0}$ , if we perform $T$ steps of gradient descent (GD) $x_{t+1}=x_{t}-\frac{1}{L}\nabla f(x_{t})$ , then it satisfies $\sum_{t=0}^{T-1}\|\nabla f(x_{t})\|^{2}\leq L(f(x_{0})-f(x^{*}))$ (see for instance [Nemirovski2004, AO-survey-nesterov]). In addition, if this $x_{0}$ is already the output of AGD for another $T$ iterations, then it satisfies $f(x_{0})-f(x^{*})\leq O\big{(}\frac{L}{T^{2}}\big{)}$ . Putting the two inequalities together, we have $\min_{t=0}^{T-1}\big{\{}\|\nabla f(x_{t})\|^{2}\big{\}}\leq O\big{(}\frac{L^{2}}{T^{3}}\big{)}$ . We call this method “GD after AGD,” and it satisfies

to get a point $x$ with $\|\nabla f(x)\|\leq\varepsilon$ , “GD after AGD” converges in rate $T\propto\frac{L^{2/3}}{\varepsilon^{2/3}}$ .
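As an illustration, "GD after AGD" might be sketched as follows. This is a minimal sketch, not the reference implementation: the momentum schedule is one standard (FISTA-style) variant of AGD, the diagonal quadratic test function is a hypothetical example, and the names `agd` and `gd_min_grad` are invented for this snippet.

```python
import numpy as np

def agd(grad, x0, L, T):
    """Accelerated gradient descent (FISTA-style momentum) for an L-smooth convex f."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(T):
        x_next = y - grad(y) / L                          # gradient step from the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum extrapolation
        x, t = x_next, t_next
    return x

def gd_min_grad(grad, x0, L, T):
    """Run T steps of GD with step 1/L; return the iterate x_t (t < T) with the smallest gradient norm."""
    x = x0.copy()
    best_x, best_norm = None, np.inf
    for _ in range(T):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm < best_norm:
            best_x, best_norm = x.copy(), norm
        x = x - g / L
    return best_x, best_norm

# Hypothetical test problem: f(x) = 0.5 * sum(eigs * x^2), so L = max(eigs) = 1 and f(x*) = 0.
rng = np.random.default_rng(0)
eigs = np.linspace(1e-3, 1.0, 50)
L = 1.0
f = lambda x: 0.5 * np.sum(eigs * x * x)
grad = lambda x: eigs * x

T = 100
x_agd = agd(grad, rng.standard_normal(50), L, T)   # phase 1: T steps of AGD
x_out, g_norm = gd_min_grad(grad, x_agd, L, T)     # phase 2: T steps of GD, keep the best gradient
print("gradient norm after AGD:", np.linalg.norm(grad(x_agd)))
print("best gradient norm after GD:", g_norm)
```

The second phase only needs the telescoping bound on $\sum_t\|\nabla f(x_t)\|^2$, which is why the sketch tracks the best iterate rather than the last one.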

Nesterov’s Second Trick: AGD After Regularization. Alternatively, we can also regularize $f(x)$ by defining $g(x)=f(x)+\frac{\sigma}{2}\|x-x_{0}\|^{2}$ . This new function $g(x)$ is $\sigma$ -strongly convex, so AGD converges linearly, meaning that using $T\propto\frac{\sqrt{L}}{\sqrt{\sigma}}\log\frac{L}{\varepsilon}$ gradients we can find a point $x$ satisfying $\|\nabla g(x)\|^{2}\leq L(g(x)-g(x^{*}))\leq\varepsilon^{2}$ . If we choose $\sigma\propto\varepsilon$ , then this implies $\|\nabla f(x)\|\leq\|\nabla g(x)\|+\varepsilon\leq 2\varepsilon$ . We call this method “AGD after regularization,” and it satisfies

to get a point $x$ with $\|\nabla f(x)\|\leq\varepsilon$ , “AGD after regularization” converges in rate $T\propto\frac{L^{1/2}}{\varepsilon^{1/2}}\log\frac{L}{\varepsilon}$ .
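One way to see where the choice $\sigma\propto\varepsilon$ comes from is the short derivation below. It assumes, for illustration only, that the point $x$ stays within distance $D$ of $x_{0}$ (the symbol $D$ is hypothetical and does not appear in the text), and it suppresses constants.

```latex
\begin{align*}
\nabla g(x) &= \nabla f(x)+\sigma(x-x_{0}), \\
\|\nabla f(x)\| &\leq \|\nabla g(x)\|+\sigma\|x-x_{0}\|
  \leq \|\nabla g(x)\|+\sigma D, \\
\sigma=\frac{\varepsilon}{D}
  \;&\Longrightarrow\;
  \|\nabla f(x)\|\leq\|\nabla g(x)\|+\varepsilon\leq 2\varepsilon
  \quad\text{once } \|\nabla g(x)\|\leq\varepsilon, \\
T &\propto \sqrt{\frac{L}{\sigma}}\log\frac{L}{\varepsilon}
  = \sqrt{\frac{LD}{\varepsilon}}\log\frac{L}{\varepsilon}.
\end{align*}
```

Treating $D$ as a constant recovers the $\frac{L^{1/2}}{\varepsilon^{1/2}}\log\frac{L}{\varepsilon}$ rate stated above.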

Nesterov’s Lower Bound. Recall that Nesterov constructed hard-instance functions $f(x)$ so that, when dimension is sufficiently high, first-order methods require at least $T=\Omega(\sqrt{L/\delta})$ computations of $\nabla f(x)$ to produce a point $x$ satisfying $f(x)-f(x^{*})\leq\delta$ (see his textbook [Nesterov2004]). Since $f(x)-f(x^{*})\leq\langle\nabla f(x),x-x^{*}\rangle\leq\|\nabla f(x)\|\cdot\|x-x^{*}\|$ , this also implies a lower bound $T=\Omega(\sqrt{L/\varepsilon})$ to find a point $x$ with $\|\nabla f(x)\|\leq\varepsilon$ . In other words,

to get a point $x$ with $\|\nabla f(x)\|\leq\varepsilon$ , “AGD after regularization” is optimal (up to a log factor).

1.2 Our Results: Stochastic Convex Optimization

Both rates are asymptotically optimal in terms of decreasing objective, and $\mathcal{V}$ is an absolute bound on the variance of the stochastic gradients. Using the same argument $\|\nabla f(x)\|^{2}\leq L(f(x)-f(x^{*}))$ as before, SGD finds a point $x$ with $\|\nabla f(x)\|\leq\varepsilon$ in

These rates are not optimal. We investigate three approaches to improve such rates.

New Approach 1: SGD after SGD. Recall in Nesterov’s first trick, he replaced the use of the inequality $\|\nabla f(x)\|^{2}\leq L(f(x)-f(x^{*}))$ by $T$ steps of gradient descent. In the stochastic setting, can we replace this inequality with $T$ steps of SGD? We call this algorithm SGD1 and prove that

For convex stochastic optimization, SGD1 finds $x$ with $\|\nabla f(x)\|\leq\varepsilon$ in

We prove Theorem LABEL:thm:sgd1 in the general language of composite function minimization. This allows us to support an additional “proximal” term $\psi(x)$ and minimize $\psi(x)+f(x)$ . For instance, if $\psi(x)=0$ for $x\in Q$ and $\psi(x)=+\infty$ for $x\not\in Q$ , where $Q$ is some convex set, then Theorem LABEL:thm:sgd1 applies to minimizing $f(x)$ over $Q$ .

The rate $T\propto\varepsilon^{-8/3}$ , in the special case of $\psi(x)\equiv 0$ , was first recorded by GhadimiLan2015. Their algorithm is more involved because they also attempted to tighten the lower order terms using acceleration. To the best of our knowledge, our rate $T\propto\frac{1}{\sigma^{1/2}\varepsilon^{2}}$ in Theorem LABEL:thm:sgd1 is new.

New Approach 2: SGD after regularization. Recall that in Nesterov’s second trick, he defined $g(x)=f(x)+\frac{\sigma}{2}\|x-x_{0}\|^{2}$ as a regularized version of $f(x)$ , and applied the strongly-convex version of AGD to minimize $g(x)$ . Can we apply this trick to the stochastic setting?

Note that the parameter $\sigma$ has to be on the order of $\varepsilon$ because $\nabla g(x)=\nabla f(x)+\sigma(x-x_{0})$ and we wish to ensure $\|\nabla f(x)\|=\|\nabla g(x)\|\pm\varepsilon$ . Therefore, if we apply SGD1 to minimize $g(x)$ and find a point $x$ with $\|\nabla g(x)\|\leq\varepsilon$ , the convergence rate is $T\propto\frac{1}{\sigma^{1/2}\varepsilon^{2}}=\frac{1}{\varepsilon^{2.5}}$ . We call this algorithm SGD2.

For convex stochastic optimization, SGD2 finds $x$ with $\|\nabla f(x)\|\leq\varepsilon$ in

We prove Theorem LABEL:thm:sgd2 also in the general proximal language. This $T\propto\varepsilon^{-5/2}$ rate improves the best known result of $T\propto\varepsilon^{-8/3}$ , but is still far from the lower bound $\Omega(\mathcal{V}/\varepsilon^{2})$ .

New Approach 3: SGD and recursive regularization. In the second approach above, the $\varepsilon^{0.5}$ sub-optimality gap is due to the choice of $\sigma\propto\varepsilon$ which ensures $\|\sigma(x-x_{0})\|\leq\varepsilon$ .

Intuitively, if $x_{0}$ were sufficiently close to $x^{*}$ (and thus also close to the approximate minimizer $x$ ), then we could choose $\sigma\gg\varepsilon$ so that $\|\sigma(x-x_{0})\|\leq\varepsilon$ still holds. In other words, an appropriate warm start $x_{0}$ could help us break the $\varepsilon^{-2.5}$ barrier and get a better convergence rate. But how do we find such an $x_{0}$ ? We find it by constructing a “less warm” starting point, and so on recursively. This process is summarized by the algorithm SGD3, which recursively finds the warm starts.
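The recursive warm-start idea can be illustrated with a short sketch. It is not the exact SGD3 procedure: the number of rounds, the doubling schedule for the regularization weights, the simplified strongly convex SGD routine `sgd_sc`, and the noisy quadratic test problem are all assumptions made for this example.

```python
import numpy as np

def sgd_sc(stoch_grad, x0, mu, smooth, T):
    """Simplified SGD for a mu-strongly convex objective: step min(1/smooth, 1/(mu*(t+1)))."""
    x = x0.copy()
    for t in range(T):
        step = min(1.0 / smooth, 1.0 / (mu * (t + 1)))
        x = x - step * stoch_grad(x)
    return x

def recursive_regularization(stoch_grad_f, x0, sigma, L, T, rng):
    """Minimize f, then repeatedly add a quadratic centered at the latest output and re-minimize."""
    S = max(1, int(np.floor(np.log2(L / sigma))))  # assumed number of rounds
    centers, weights = [], []
    x_hat, sigma_s = x0.copy(), sigma
    for _ in range(S):
        def stoch_grad_g(x):
            # stochastic gradient of f plus the gradients of all regularizers added so far
            g = stoch_grad_f(x, rng)
            for c, w in zip(centers, weights):
                g = g + w * (x - c)
            return g
        x_hat = sgd_sc(stoch_grad_g, x_hat, sigma_s, L + sum(weights), T // S)
        sigma_s *= 2.0                   # assumed schedule: double the weight each round
        centers.append(x_hat.copy())     # the next regularizer is centered at the new warm start
        weights.append(sigma_s)
    return x_hat

# Hypothetical problem: f(x) = 0.5 * sum(eigs * x^2) with Gaussian gradient noise.
rng = np.random.default_rng(0)
eigs = np.linspace(0.1, 1.0, 10)         # f is 0.1-strongly convex and 1-smooth

def stoch_grad_f(x, rng):
    return eigs * x + 0.1 * rng.standard_normal(x.shape)

x0 = 5.0 * np.ones(10)
x_out = recursive_regularization(stoch_grad_f, x0, sigma=0.1, L=1.0, T=3000, rng=rng)
grad_norm_start = np.linalg.norm(eigs * x0)
grad_norm_end = np.linalg.norm(eigs * x_out)
print("true gradient norm: start", grad_norm_start, "end", grad_norm_end)
```

Each round starts from the previous output, so later rounds can afford larger regularization weights without moving the minimizer far from the warm start.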

For convex stochastic optimization, SGD3 finds $x$ with $\|\nabla f(x)\|\leq\varepsilon$ in

1.3 Our Applications: Stochastic Non-Convex Optimization

One natural question is whether our techniques for convex stochastic optimization translate to non-convex performance guarantees. We design two SGD variants to tackle this question.

Such recursive regularization techniques for non-convex optimization have appeared in prior works [CarmonDHS2016, Allenzhu2017-natasha2]. However, different from them, we only use simple SGD variants to minimize each $g(x)$ and then use SGD3 to get a small gradient. We call this algorithm SGD4 and prove that

For non-convex stochastic optimization with $\sigma$ -bounded nonconvexity, SGD4 finds $x$ with $\|\nabla f(x)\|\leq\varepsilon$ in

Perhaps surprisingly, this simple SGD variant already outperforms previous results in the regime of $\sigma\leq\varepsilon L$ . We closely compare SGD4 to them in Figure 1(a) and Table 1.

New Approach 5: SGD for local minima. In the second application, we tackle the more ambitious goal of finding a point $x$ with both $\|\nabla f(x)\|\leq\varepsilon$ and $\nabla^{2}f(x)\succeq-\delta\mathbf{I}$ , known as an $(\varepsilon,\delta)$ -approximate local minimum. For this harder task, one needs the following two standard assumptions: each $f_{i}(x)$ is $L$ -smooth and $f(x)$ is $L_{2}$ -second-order smooth. (The latter means $\|\nabla^{2}f(x)-\nabla^{2}f(y)\|_{2}\leq L_{2}\|x-y\|$ for every $x,y$ .)
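To make the $(\varepsilon,\delta)$ condition concrete, the following sketch checks it for a given gradient vector and Hessian matrix. The helper name `is_approx_local_min` and the two toy Hessians are hypothetical; in practice the exact Hessian is rarely formed, so this is only a definition check, not part of any algorithm.

```python
import numpy as np

def is_approx_local_min(grad, hess, eps, delta):
    """Return True if ||grad|| <= eps and hess >= -delta * I in the PSD order."""
    grad_ok = np.linalg.norm(grad) <= eps
    lam_min = np.linalg.eigvalsh(hess)[0]   # eigvalsh returns eigenvalues in ascending order
    return bool(grad_ok and lam_min >= -delta)

zero_grad = np.zeros(2)
saddle_hess = np.diag([2.0, -2.0])   # Hessian of x^2 - y^2 at the origin (a saddle point)
bowl_hess = np.diag([2.0, 2.0])      # Hessian of x^2 + y^2 at the origin (a minimum)

at_saddle = is_approx_local_min(zero_grad, saddle_hess, eps=1e-3, delta=0.1)
at_bowl = is_approx_local_min(zero_grad, bowl_hess, eps=1e-3, delta=0.1)
print(at_saddle, at_bowl)
```

The saddle has a zero gradient but fails the second-order condition, which is why a small gradient alone does not certify an approximate local minimum.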

Motivated by the “swing by saddle point” framework of [Allenzhu2017-natasha2], we combine SGD variants with Oja’s algorithm of [AL2017-MMWU] to design a new algorithm SGD5. (Oja’s algorithm [oja1982simplified] is itself an SGD variant of the power method for finding approximate eigenvectors; we rely on the recent work [AL2017-MMWU], which gives the optimal rate for Oja’s algorithm.) We prove that

For non-convex stochastic optimization, SGD5 finds $x$ with $\|\nabla f(x)\|\leq\varepsilon$ and $\nabla^{2}f(x)\succeq-\delta\mathbf{I}$ in (ignoring the dependency on $L,L_{2},\mathcal{V},f(x_{0})-f(x^{*})$ for simplicity)

We compare SGD5 to known results in Figure 1(b). Perhaps surprisingly, our SGD5, being a simple SGD variant, performs no worse than cubic regularized Newton’s method with $T=\widetilde{O}\big{(}\frac{1}{\varepsilon^{3.5}}+\frac{1}{\delta^{6}}+\frac{1}{\varepsilon^{2}\delta^{3}}\big{)}$ [TripuraneniSJRJ2017] or the best known SGD variant with $T=\widetilde{O}\big{(}\frac{1}{\varepsilon^{4}}+\frac{1}{\delta^{5}}+\frac{1}{\varepsilon^{2}\delta^{3}}\big{)}$ [AllenLi2017-neon2]. SGD5 is outperformed by the variance-reduction based methods $\mathtt{Neon2}$ + $\mathtt{SCSG}$ [AllenLi2017-neon2] and Natasha2 [Allenzhu2017-natasha2] only when $\sigma>\sqrt{\varepsilon}$ .

Existing SGD variants to find approximate local minima are all based on the “escape saddle points” approach. In contrast, SGD5 is based on the alternative “swing by saddle point” approach. For the difference between the two, we refer interested readers to [AllenLi2017-neon2, Allenzhu2017-natasha2].

1.4 Roadmap

We introduce notions in Section 2 and formalize the convex problem in Section 3. We review classical (convex) SGD theorems with objective decrease in Section 4. We give an auxiliary lemma in Section LABEL:sec:aux and show our SGD3 results in Section LABEL:sec:sgd-grad3. We apply our techniques to non-convex optimization and give algorithms SGD4 and SGD5 in Section LABEL:sec:sgd4+5. We discuss more related work in Appendix LABEL:sec:related, and show our results on SGD1 and SGD2 respectively in Appendix LABEL:sec:sgd-grad1 and Appendix LABEL:sec:sgd-grad2.

Preliminaries

We denote by $\|\mathbf{A}\|_{2}$ the spectral norm of matrix $\mathbf{A}$ . For symmetric matrices $\mathbf{A}$ and $\mathbf{B}$ , we write $\mathbf{A}\succeq\mathbf{B}$ to indicate that $\mathbf{A}-\mathbf{B}$ is positive semidefinite (PSD). Therefore, $\mathbf{A}\succeq-\sigma\mathbf{I}$ if and only if all eigenvalues of $\mathbf{A}$ are no less than $-\sigma$ . We denote by $\lambda_{\min}(\mathbf{A})$ and $\lambda_{\max}(\mathbf{A})$ the minimum and maximum eigenvalue of a symmetric matrix $\mathbf{A}$ .

Recall some definitions of strong convexity and smoothness (they have other equivalent definitions; see the textbook [Nesterov2004]).

For composite function $F(x)=\psi(x)+f(x)$ where $\psi(x)$ is proper convex, given a parameter $\eta>0$ , the gradient mapping of $F(\cdot)$ at point $x$ is

In particular, if $\psi(\cdot)\equiv 0$ , then $\mathcal{G}_{F,\eta}(x)\equiv\nabla f(x)$ .
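The gradient mapping can be illustrated numerically. The sketch below assumes the standard definition $\mathcal{G}_{F,\eta}(x)=\frac{1}{\eta}(x-x^{+})$ with $x^{+}=\operatornamewithlimits{arg\,min}_{y}\{\psi(y)+\langle\nabla f(x),y\rangle+\frac{1}{2\eta}\|y-x\|^{2}\}$ , which equals a proximal step from $x-\eta\nabla f(x)$ . The function names, the box constraint, and the quadratic $f$ are hypothetical choices for the example.

```python
import numpy as np

def gradient_mapping(grad_f, prox, x, eta):
    """Gradient mapping (x - x_plus) / eta, where x_plus is one proximal gradient step from x."""
    x_plus = prox(x - eta * grad_f(x), eta)
    return (x - x_plus) / eta

# Hypothetical smooth part: f(x) = 0.5 * ||x - c||^2, so grad f(x) = x - c.
c = np.array([2.0, 0.5])
grad_f = lambda x: x - c

identity_prox = lambda z, eta: z                  # psi == 0
box_prox = lambda z, eta: np.clip(z, 0.0, 1.0)    # psi = indicator of the box [0, 1]^2

x = np.array([1.0, 0.2])
g_free = gradient_mapping(grad_f, identity_prox, x, eta=0.5)  # reduces to grad f(x)
g_box = gradient_mapping(grad_f, box_prox, x, eta=0.5)        # first coordinate is blocked by the box
print(g_free, g_box)
```

With the box constraint, the first coordinate of the mapping vanishes because the gradient step would leave $Q$ and is projected back onto the boundary.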

Recall the following property of the gradient mapping (see for instance [XiaoZhang2014-ProximalSVRG, Lemma 3.7]).

The following definition and properties of the Fenchel dual for convex functions are classical, and can be found for instance in the textbook [Shalev-Shwartz2011].

$\nabla h^{*}(\beta)=\operatornamewithlimits{arg\,max}_{y}\{y^{\top}\beta-h(y)\}$ .

If $h(\cdot)$ is $\sigma$ -strongly convex, then $h^{*}(\cdot)$ is $\frac{1}{\sigma}$ -smooth.
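Both properties can be sanity-checked on a simple example. The sketch below uses the hypothetical one-dimensional choice $h(y)=\frac{\sigma}{2}y^{2}$ with $\sigma=4$ , for which $h^{*}(\beta)=\frac{\beta^{2}}{2\sigma}$ and $\nabla h^{*}(\beta)=\beta/\sigma$ , and finds the maximizer by brute-force grid search rather than by any closed form.

```python
import numpy as np

sigma = 4.0
h = lambda y: 0.5 * sigma * y ** 2          # sigma-strongly convex in one dimension

ys = np.linspace(-10.0, 10.0, 200001)       # grid with spacing 1e-4

def conj_argmax(beta):
    """Grid-search maximizer of y * beta - h(y), i.e. a numerical estimate of grad h*(beta)."""
    return ys[np.argmax(ys * beta - h(ys))]

b1, b2 = 2.0, -3.0
g1, g2 = conj_argmax(b1), conj_argmax(b2)

# Property 1: grad h*(beta) = argmax_y { y * beta - h(y) } = beta / sigma for this h.
print(g1, b1 / sigma)
# Property 2: h* is (1/sigma)-smooth, so |grad h*(b1) - grad h*(b2)| <= |b1 - b2| / sigma.
print(abs(g1 - g2), abs(b1 - b2) / sigma)
```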

Problem Formalization

Throughout this paper (except our nonconvex application Section LABEL:sec:sgd4+5), we minimize the following convex stochastic composite objective:

$\psi(x)$ is proper convex (a.k.a. the proximal term),

$f_{i}(x)$ is differentiable for every $i\in[n]$ ,

$f(x)$ is $L$ -smooth and $\sigma$ -strongly convex for some $\sigma\in[0,L]$ that could be zero,

the stochastic gradients $\nabla f_{i}(x)$ have a bounded variance (over the domain of $\psi(\cdot)$ ), that is

We emphasize that the above assumptions are all classical.

In the rest of the paper, we define $T$ , the gradient complexity, as the number of computations of $\nabla f_{i}(x)$ . We search for points $x$ so that the gradient mapping $\|\mathcal{G}_{F,\eta}(x)\|\leq\varepsilon$ for any $\eta\approx\frac{1}{L}$ . Recall from Definition 2.2 that if there is no proximal term (i.e., $\psi(x)\equiv 0$ ), then $\mathcal{G}_{F,\eta}(x)=\nabla f(x)$ for any $\eta>0$ . We want to study the best tradeoff between the gradient complexity $T$ and the error $\varepsilon$ .

Review: SGD with Objective Value Convergence

Recall that stochastic gradient descent (SGD) repeatedly performs proximal updates of the form

where $\alpha>0$ is some learning rate, and $i$ is chosen from $\{1,2,\dots,n\}$ uniformly at random in each iteration. Note that if $\psi(y)\equiv 0$ then $x_{t+1}=x_{t}-\alpha\nabla f_{i}(x_{t})$ . For completeness’ sake, we summarize it in Algorithm 1. If $f(x)$ is also known to be strongly convex, to get the tightest convergence rate, one can repeatedly apply SGD with decreasing learning rate $\alpha$ [HazanKale2014]. We summarize this algorithm as SGDsc in Algorithm 2.
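A proximal SGD loop of this kind might look as follows. The sketch assumes the update $x_{t+1}=\operatornamewithlimits{arg\,min}_{y}\{\psi(y)+\frac{1}{2\alpha}\|y-x_{t}\|^{2}+\langle\nabla f_{i}(x_{t}),y\rangle\}$ , which is a proximal step from $x_{t}-\alpha\nabla f_{i}(x_{t})$ and reduces to the plain update when $\psi\equiv 0$ . It returns the last iterate for simplicity, which may differ from what Algorithm 1 outputs; the least-squares problem, box constraint, and all names are hypothetical.

```python
import numpy as np

def prox_sgd(grad_fi, prox, x0, alpha, n, T, rng):
    """Proximal SGD: sample i uniformly, take a stochastic gradient step, then apply the prox."""
    x = x0.copy()
    for _ in range(T):
        i = rng.integers(n)                          # uniform over {0, ..., n-1}
        x = prox(x - alpha * grad_fi(i, x), alpha)   # equals x - alpha * grad when psi == 0
    return x

# Hypothetical problem: f_i(x) = 0.5 * (a_i . x - b_i)^2, psi = indicator of the box [0, 1]^d.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
x_true = rng.uniform(0.0, 1.0, d)    # lies inside the box, so the constrained optimum has f = 0
b = A @ x_true

grad_fi = lambda i, x: (A[i] @ x - b[i]) * A[i]
project_box = lambda z, alpha: np.clip(z, 0.0, 1.0)
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

alpha = 1.0 / np.max(np.sum(A * A, axis=1))   # 1 / (largest per-sample smoothness constant)
x0 = np.zeros(d)
x_out = prox_sgd(grad_fi, project_box, x0, alpha, n, 5000, rng)
print("objective: start", f(x0), "end", f(x_out))
```

Swapping `project_box` for another proximal operator (for example soft-thresholding for an $\ell_1$ term) changes $\psi$ without touching the loop.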

The following theorem describes the rates of convergence in objective values for SGD and