AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

Rachel Ward, Xiaoxia Wu, Leon Bottou

Introduction

For non-convex but smooth loss functions FF, (noiseless) gradient descent (GD) with constant stepsize converges to a stationary point of FF at rate O(1/N)\mathcal{O}\left({1}/{N}\right) with the number of iterations NN (Nesterov, 1998). In the same setting, and under the general assumption of bounded gradient noise variance, SGD with constant or decreasing stepsize ηj=O(1/j)\eta_{j}=\mathcal{O}\left({1}/{\sqrt{j}}\right) has been proven to converge to a stationary point of FF at rate O(1/N)\mathcal{O}\left({1}/{\sqrt{N}}\right) (Ghadimi and Lan, 2013; Bottou et al., 2018). The O(1/N)\mathcal{O}\left({1}/{N}\right) rate for GD is the best possible worst-case dimension-free rate of convergence for any algorithm (Carmon et al., 2019); faster convergence rates in the noiseless setting are available under the mild assumption of additional smoothness (Agarwal et al., 2017; Carmon et al., 2017, 2018). In the noisy setting, faster rates than O(1/N)\mathcal{O}\left({1}/{\sqrt{N}}\right) are also possible using accelerated SGD methods (Ghadimi and Lan, 2016; Allen-Zhu and Yang, 2016; Reddi et al., 2016; Allen-Zhu, 2017; Xu et al., 2018; Zhou et al., 2018; Fang et al., 2018). For instance, Zhou et al. (2018) and Fang et al. (2018) obtain the rate O(1/N2/3)\mathcal{O}\left({1}/{{N}^{2/3}}\right) without requiring finite-sum structure but with an additional assumptions about Lipschitz continuity of the stochastic gradients, which they exploit to reduce variance.

Instead of focusing on faster convergence rates for SGD, this paper focuses on adaptive stepsizes (Cutkosky and Boahen, 2017; Levy, 2017) that make the optimization algorithm more robust to (generally unknown) parameters of the optimization problem, such as the noise level of the stochastic gradient and the Lipschitz smoothness constant LL of the loss function defined as the smallest number L>0L>0 such that F(x)F(y)Lxy\|\nabla F(x)-\nabla F(y)\|\leq L\|x-y\| for all x,yx,y. In particular, the O(1/N)\mathcal{O}\left({1}/{N}\right) convergence of GD with fixed stepsize is guaranteed only if the fixed stepsize η>0\eta>0 is carefully chosen such that η1/L\eta\leq 1/L – choosing a larger stepsize η\eta, even just by a factor of 2, can result in oscillation or divergence of the algorithm (Nesterov, 1998). Because of this sensitivity, GD with fixed stepsize is rarely used in practice; instead, one adaptively chooses the stepsize ηj>0\eta_{j}>0 at each iteration to approximately maximize a decrease of the loss function in the current direction of F(xj)-\nabla F(x_{j}) via either line search (Wright and Nocedal, 2006), or according to the Barzilai-Borwein rule (Barzilai and Borwein, 1988) combined with line search.

However, these bounds do not tell us much about how to select a good stepsize schedule in practice, where algorithms are run for finite iterations and the constants in the rate of convergence matter.

The question of how to choose the stepsize η>0\eta>0 or stepsize or learning rate schedule {ηj}\{\eta_{j}\} for SGD is by no means resolved; in practice, a preferred schedule is chosen manually by testing many different schedules in advance and choosing the one leading to smallest training or generalization error. This process can take days or weeks, and can become prohibitively expensive in terms of time and computational resources incurred.

Theoretically rigorous convergence results for AdaGrad-Norm were provided in the convex setting recently (Levy, 2017). Moreover, it is possible to obtain convergence rates in the offline setting by online-batch conversion. However, making such observations rigorous for nonconvex functions is difficult because bjb_{j} is itself a random variable which is correlated with the current and all previous noisy gradients; thus, the standard proofs in SGD do not straightforwardly extend to the proofs of AdaGrad-Norm. This paper provides such a proof for AdaGrad-Norm.

2 Main contributions

Our results make rigorous and precise the observed phenomenon that the convergence behavior of AdaGrad-Norm is highly adaptable to the unknown Lipschitz smoothness constant and level of stochastic noise on the gradient: when there is noise, AdaGrad-Norm converges at the rate of O(log(N)/N)O(\log(N)/\sqrt{N}), and when there is no noise, the same algorithm converges at the optimal O(1/N)O(1/N) rate like well-tuned batch gradient descent. Moreover, our analysis shows that AdaGrad-Norm converges at these rates for any choices of the algorithm hyperparameters b0>0b_{0}>0 and η>0\eta>0, in contrast to GD or SGD with fixed stepsize where if the stepsize is set above a hard upper threshold governed by the (generally unknown) smoothness constant LL, the algorithm might not converge at all. Finally, we note that the constants in the rates of convergence we provide are explicit in terms of their dependence on the hyperparameters b0b_{0} and η\eta. We list our two main theorems (informally) in the following:

For a differentiable non-convex function FF with LL-Lipschitz gradient and F=infxF(x)>F^{*}=\inf_{x}F(x)>-\infty, Theorem 2.1 implies that AdaGrad-Norm converges to an ε\varepsilon-approximate stationary point with high probability It is becoming common to define an ε\varepsilon-approximate stationary point as F(x)ε\|\nabla F(x)\|\leq\varepsilon (Agarwal et al., 2017; Carmon et al., 2018, 2019; Fang et al., 2018; Zhou et al., 2018; Allen-Zhu, 2018), but we use the convention F(x)2ε\|F(x)\|^{2}\leq\varepsilon (Lei et al., 2017; Bottou et al., 2018) to most easily compare our results to those from Ghadimi and Lan (2013); Li and Orabona (2019). at the rate

If the optimal value of the loss function FF^{*} is known and one sets η=F(x0)F\eta=F(x_{0})-F^{*} accordingly, then the constant in our rate is close to the best-known constant σL(F(x0)F)\sigma L(F(x_{0})-F^{*}) achievable for SGD with fixed stepsize η=η1==ηN=min{1L,1σN}\eta=\eta_{1}=\dots=\eta_{N}=\min\{\frac{1}{L},\frac{1}{\sigma\sqrt{N}}\} carefully tuned to knowledge of LL and σ\sigma, as given in Ghadimi and Lan (2013). However, our result requires bounded gradient F(x)2γ2\|\nabla F(x)\|^{2}\leq\gamma^{2} and our rate constant scales with γσ\gamma\sigma instead of linearly in σ\sigma. Nevertheless, our result suggests a good strategy for setting hyperparameters in implementing AdaGrad-Norm practically: given knowledge of FF^{*}, set η=F(x0)F\eta=F(x_{0})-F^{*} and simply initialize b0>0b_{0}>0 to be very small.

When there is no noise σ=0\sigma=0, we can improve this rate to an O(1/N)\mathcal{O}\left(1/N\right) rate of convergence. In Theorem 2.2, we show that minj[N]F(xj)2ε\min_{j\in[N]}\|\nabla F(x_{j})\|^{2}\leq\varepsilon after (1) N=O(1ε(((F(x0)F)/η)2+b0(F(x0)F)/η))N=\mathcal{O}\left(\frac{1}{\varepsilon}\left(\left({(F(x_{0})-F^{*})}/{\eta}\right)^{2}+b_{0}\left(F(x_{0})-F^{*}\right)/\eta\right)\right) if b0ηL,b_{0}\geq\eta L, (2) N=O(1ε(L(F(x0)F)+((F(x0)F)/η)2)+(ηL)2εlog(ηLb0))N=\mathcal{O}\left(\frac{1}{\varepsilon}\left(L\left(F(x_{0})-F^{*}\right)+\left({(F(x_{0})-F^{*})}/{\eta}\right)^{2}\right)+\frac{(\eta L)^{2}}{\varepsilon}\log\left(\frac{\eta L}{b_{0}}\right)\right) if b0<ηL.{b_{0}}<{\eta}L. Note that the constant (ηL)2(\eta L)^{2} in the second case when b0<ηL{b_{0}}<\eta L is not optimal compared to the known best rate constant ηL\eta L obtainable by gradient descent with fixed stepsize η=1/L\eta=1/L (Carmon et al., 2019); on the other hand, given knowledge of LL and F(x0)FF(x_{0})-F^{*}, the rate constant of AdaGrad-norm reproduces the optimal constant ηL\eta L by setting η=F(x0)F\eta=F(x_{0})-F^{*} and b0=ηLb_{0}=\eta L.

Practically, our results imply a good strategy for setting the hyperparameters when implementing AdaGrad-norm in practice: set η=(F(x0)F)\eta=(F(x_{0})-F^{*}) (assuming FF^{*} is known) and set b0>0b_{0}>0 to be a very small value. If FF^{*} is unknown, then setting η=1\eta=1 should work well for a wide range of values of LL, and in the noisy case with σ2\sigma^{2} strictly greater than zero.

3 Previous work

Theoretical guarantees of convergence for AdaGrad were provided in Duchi et al. (2011) in the setting of online convex optimization, where the loss function may change from iteration to iteration and be chosen adversarially. AdaGrad was subsequently observed to be effective for accelerating convergence in the nonconvex setting, and has become a popular algorithm for optimization in deep learning problems. Many modifications of AdaGrad with or without momentum have been proposed, namely, RMSprop (Srivastava and Swersky, 2012), AdaDelta (Zeiler, 2012), Adam (Kingma and Ba, 2015), AdaFTRL(Orabona and Pal, 2015), SGD-BB(Tan et al., 2016), AdaBatch (Defossez and Bach, 2017), SC-Adagrad (Mukkamala and Hein, 2017), AMSGRAD (Reddi et al., 2018), Padam (Chen and Gu, 2018), etc. Extending our convergence analysis to these popular alternative adaptive gradient methods remains an interesting problem for future research.

Regarding the convergence guarantees for the norm version of adaptive gradient methods in the offline setting, the recent work by Levy (2017) introduces a family of adaptive gradient methods inspired by AdaGrad, and proves convergence rates in the setting of (strongly) convex loss functions without knowing the smoothness parameter LL in advance. Yet, that analysis still requires the a priori knowledge of a convex set K{\cal K} with known diameter DD in which the global minimizer resides. More recently, Wu et al. (2018) provids convergence guarantees in the non-convex setting for a different adaptive gradient algorithm, WNGrad, which is closely related to AdaGrad-Norm and inspired by weight normalization (Salimans and Kingma, 2016). In fact, the WNGrad stepsize update is similar to AdaGrad-Norm’s:

4 Future work

5 Notation

AdaGrad-Norm convergence

To be clear about the adaptive algorithm, we first state in Algorithm 1 the norm version of AdaGrad we consider throughout in the analysis.

The random vectors ξk\xi_{k}, k=0,1,2,,k=0,1,2,\dots, are independent of each other and also of xkx_{k};

F(x)2γ2\|\nabla F(x)\|^{2}\leq\gamma^{2} uniformly.

The first two assumptions are standard (see e.g. Nemirovski and Yudin (1983); Nemirovski et al. (2009); Bottou et al. (2018)). The third assumption is somewhat restrictive as it rules out strongly convex objectives, but is not an unreasonable assumption for AdaGrad-Norm, where the adaptive learning rate is a cumulative sum of all previous observed gradient norms.

Because of the variance in gradient, the AdaGrad-Norm stepsize ηbk\frac{\eta}{b_{k}} decreases to zero roughly at a rate between 12(γ2+σ2)k\frac{1}{\sqrt{2(\gamma^{2}+\sigma^{2})k}} and 1σk\frac{1}{\sigma\sqrt{k}}. It is known that AdaGrad-Norm stepsize decreases at this rate (Levy, 2017), and that this rate is optimal in kk in terms of the resulting convergence theorems in the setting of smooth but not necessarily convex FF, or convex but not necessarily strongly convex or smooth FF. Still, standard convergence theorems for SGD do not extend straightforwardly to AdaGrad-Norm because the stepsize 1/bk1/b_{k} is a random variable and dependent on all previous points visited along the way, i.e., {F(xj)}j=0k\{\|\nabla F(x_{j})\|\}_{j=0}^{k} and {G(xj,ξj)}j=0k\{\|\nabla G(x_{j},\xi_{j})\|\}_{j=0}^{k}. From this point on, we use the shorthand Gk=G(xk,ξk)G_{k}=G(x_{k},\xi_{k}) and Fk=F(xk)F_{k}=\nabla F(x_{k}) for simplicity of notation. The following theorem gives the convergence guarantee to Algorithm 1. We give detailed proof in Section 3.

This result implies that AdaGrad-Norm converges for any η>0\eta>0 and starting from any value of b0>0b_{0}>0. To put this result in context, we can compare to Corollary 2.2 of Ghadimi and Lan (2013) giving the best-known convergence rate for SGD with fixed step-size in the same setting (albeit not requiring Assumption (3) of uniformly bounded gradient): if the Lipschitz smoothness constant LL and the variance σ2\sigma^{2} are known a priori, and the fixed stepsize in SGD is set to

We match the O(1/N)O(1/\sqrt{N}) rate of Ghadimi and Lan (2013), but without a priori knowledge of LL and σ\sigma, and with a worse constant in the rate of convergence. In particular, the constant in our bound scales according to σ2\sigma^{2} or γσ\gamma\sigma (up to logarithmic factors in σ\sigma) while the result for SGD with well-tuned fixed step-size scales linearly with σ\sigma. The additional logarithmic factor (by Lemma 3.2) results from the AdaGrad-Norm update using the square norm of the gradient (see inequality (11) for details). The extra constant 1δ\frac{1}{\sqrt{\delta}} results from the correlation between the stepsize bjb_{j} and the gradient F(xj)\|\nabla F(x_{j})\|. We note that the recent work Li and Orabona (2019) derives an O(1/N)O(1/\sqrt{N}) rate for a variation of AdaGrad-Norm without the assumption of uniformly bounded gradient, but at the same time requires a priori knowledge of the smoothness constant L>0L>0 in setting the step-size in order to establish convergence, similar to SGD with fixed stepsize. Finally, we note that recent works (Allen-Zhu, 2017; Lei et al., 2017; Fang et al., 2018; Zhou et al., 2018) provide modified SGD algorithms with convergence rates faster than O(1/N)O(1/\sqrt{N}), albeit again requiring priori knowledge of both LL and σ\sigma to establish convergence.

We reiterate however that the main emphasis in Theorem 2.1 is on the robustness of the AdaGrad-Norm convergence to its hyperparameters η\eta and b0b_{0}, compared to plain SGD’s dependence on its parameters η\eta and σ\sigma. Although the constant in the rate of our theorem is not as good as the best-known constant for stochastic gradient descent with well-tuned fixed stepsize, our result suggests that implementing AdaGrad-Norm allows one to vastly reduce the need to perform laborious experiments to find a stepsize schedule with reasonable convergence when implementing SGD in practice.

We note that for the second bound in 2.1, in the limit as σ0\sigma\rightarrow 0 we recover an O(log(N)/N)O\left(\log(N)/N\right) rate of convergence for noiseless gradient descent. We can establish a stronger result in the noiseless setting using a different method of proof, removing the additional log factor and Assumption 3 of uniformly bounded gradient. We state the theorem below and defer our proof to Section 4.

Then minj[N]F(xj)2ε\min_{j\in[N]}\|\nabla F(x_{j})\|^{2}\leq\varepsilon after

N=1+1ε(4(F(x0)F)2η2+2b0(F(x0)F)η)N=1+\lceil{\frac{1}{\varepsilon}\left(\frac{4\left(F(x_{0})-F^{*}\right)^{2}}{\eta^{2}}+\frac{2b_{0}\left(F(x_{0})-F^{*}\right)}{\eta}\right)\rceil} if b0ηL,{b_{0}}\geq\eta L,

N=1+1ε(2L(F(x0)F)+(2(F(x0)F)η+ηLCb0)2+(ηL)2(1+Cb0)b02)N=1+\lceil{\frac{1}{\varepsilon}\left(2L\left(F(x_{0})-F^{*}\right)+\left(\frac{2\left(F(x_{0})-F^{*}\right)}{\eta}+\eta LC_{b_{0}}\right)^{2}+(\eta L)^{2}(1+C_{b_{0}})-b_{0}^{2}\right)\rceil} if b0<ηL.\quad\quad\text{if }{b_{0}}<\eta L. Here Cb0=1+2log(ηLb0)C_{b_{0}}=1+2\log\left(\frac{\eta L}{b_{0}}\right).

The convergence bound shows that, unlike gradient descent with constant stepsize η\eta which can diverge if the stepsize η2/L\eta\geq 2/L, AdaGrad-Norm convergence holds for any choice of parameters b0b_{0} and η\eta. The critical observation is that if the initial stepsize ηb0>1L\frac{\eta}{b_{0}}>\frac{1}{L} is too large, the algorithm has the freedom to diverge initially, until bjb_{j} grows to a critical point (not too much larger than LηL\eta) at which point ηbj\frac{\eta}{b_{j}} is sufficiently small that the smoothness of FF forces bjb_{j} to converge to a finite number on the order of LL, so that the algorithm converges at an O(1/N)O(1/N) rate. To describe the result in Theorem 2.2, let us first review a classical result (see, for example Nesterov (1998), (1.2.13)(1.2.13)) on the convergence rate for gradient descent with fixed stepsize.

Alternatively, if bL2b\leq\frac{L}{2}, then convergence is not guaranteed at all – gradient descent can oscillate or diverge.

Compared to the convergence rate of gradient descent with fixed stepsize, AdaGrad-Norm in the case b=b0ηLb=b_{0}\geq\eta L gives a larger constant in the rate. But in case b=b0<ηLb=b_{0}<\eta L, gradient descent can fail to converge as soon as bηL/2b\leq\eta L/2, while AdaGrad-Norm converges for any b0>0b_{0}>0, and is extremely robust to the choice of b0<ηLb_{0}<\eta L in the sense that the resulting convergence rate remains close to the optimal rate of gradient descent with fixed stepsize 1/b=1/L1/b=1/L, paying a factor of log(ηLb0)\log(\frac{\eta L}{b_{0}}) and (ηL)2(\eta L)^{2} in the constant. Here, the constant (ηL)2(\eta L)^{2} results from the worst-cast analysis using Lemma 4.1, which assumes that the gradient F(xj)2ε\|\nabla F(x_{j})\|^{2}\approx\varepsilon for all j=0,1,j=0,1,\ldots, when in reality the gradient should be much larger at first. We believe the number of iterations can be improved by a refined analysis, or by considering the setting where x0x_{0} is drawn from an appropriate random distribution.

Proof of Theorem 2.1

We first introduce some important lemmas in subsection 3.1 and give the main proof of Theorem 2.1 in Subsection 3.2.

We first introduce several lemmas that are used in the proof for Theorem 2.1. We repeatedly appeal to the following classical Descent Lemma, which is also the main ingredient in Ghadimi and Lan (2013), and can be proved by considering the Taylor expansion of FF around yy.

We will also use the following lemmas concerning sums of non-negative sequences.

For any non-negative a1,,aTa_{1},\cdots,a_{T}, and a11a_{1}\geq 1, we have

2 Main proof

Proof For simplicity, we write Fj=F(xj)F_{j}=F(x_{j}) and Fj=F(xj)\nabla F_{j}=\nabla F(x_{j}). By Lemma 3.1, for j0,j\geq 0,

At this point, we cannot apply the standard method of proof for SGD, since bj+1b_{j+1} and GjG_{j} are correlated random variables and thus, in particular, for the conditional expectation

By applying the inequality ab12λb2+λ2a2ab\leq\frac{1}{2\lambda}b^{2}+\frac{\lambda}{2}a^{2} with λ=2σ2bj2+Fj2+σ2\lambda=\frac{2\sigma^{2}}{\sqrt{b_{j}^{2}+\|\nabla F_{j}\|^{2}+\sigma^{2}}}, a=Gjbj+1a=\frac{\|G_{j}\|}{b_{j+1}}, and b=GjFjFjbj2+Fj2+σ2b=\frac{\left|\|G_{j}\|-\|\nabla F_{j}\|\right|\|\nabla F_{j}\|}{\sqrt{b_{j}^{2}+\|\nabla F_{j}\|^{2}+\sigma^{2}}}, the first term in (7) can be bounded as

where the first term in the last inequality is due to the fact that

Similarly, applying the inequality abλ2a2+12λb2ab\leq\frac{\lambda}{2}a^{2}+\frac{1}{2\lambda}b^{2} with λ=2bj2+Fj2+σ2\lambda=\frac{2}{\sqrt{b_{j}^{2}+\|\nabla F_{j}\|^{2}+\sigma^{2}}}, a=σGjbj+1a=\frac{\sigma\|G_{j}\|}{b_{j+1}}, and b=Fjbj2+Fj2+σ2b=\frac{\|\nabla F_{j}\|}{\sqrt{b_{j}^{2}+\|\nabla F_{j}\|^{2}+\sigma^{2}}}, the second term of the right hand side in equation (7) is bounded by

Thus, putting inequalities (8) and (9) back into (7) gives

Applying the law of total expectation, we take the expectation of each side with respect to ξj1,ξj2,,ξ1\xi_{j-1},\xi_{j-2},\dots,\xi_{1}, and arrive at the recursion

Taking j=Nj=N and summing up from k=0k=0 to k=N1k=N-1,

where the second inequality we apply Lemma (3.2) and then Jensen’s inequality to bound the summation:

For the term on left hand side in equation (10), we apply Hölder’s inequality,

where the last inequality is due to inequality (12). Thus (10) arrives at the inequality

Multiplying by 2b0+22N(γ+σ)N\frac{2b_{0}+2\sqrt{2N}(\gamma+\sigma)}{N}, the above inequality gives

Finally, the bound is obtained by Markov’s Inequality:

where in the second step Jensen’s inequality is applied to the concave function ϕ(x)=minkhk(x)\phi(x)=\min_{k}h_{k}(x).

2.2 Finishing the proof of the second bound in Theorem 2.1

First, observe with probability 1δ1-\delta^{\prime} that

For the denominator on the left hand side of the inequality 10, we let Z=k=0N1Fk2Z=\sum_{k=0}^{N-1}\|\nabla F_{k}\|^{2} and so

Thus, we further simplify inequality (10),

we have with probability 1δ^δ1-\hat{\delta}-\delta^{\prime} that

That is equivalent to solve the following quadratic equation

Let δ^=δ=δ2\hat{\delta}=\delta^{\prime}=\frac{\delta}{2}. Replacing ZZ with k=0N1Fk2\sum_{k=0}^{N-1}\|\nabla F_{k}\|^{2} and dividing both side with NN we have with probability 1δ1-\delta

Proof of Theorem 2.2

We will use the following lemma to argue that after an initial number of steps N=(ηL)2b02ε+1N=\lceil{\frac{(\eta L)^{2}-b_{0}^{2}}{\varepsilon}\rceil}+1, either we have already reached a point xkx_{k} such that F(xk)2ε\|\nabla F(x_{k})\|^{2}\leq\varepsilon, or else bNηLb_{N}\geq\eta L.

Fix ε(0,1]\varepsilon\in(0,1] and C>0C>0. For any non-negative a0,a1,,a_{0},a_{1},\dots, the dynamical system

has the property that after N=C2b02ε+1N=\lceil{\frac{C^{2}-b_{0}^{2}}{\varepsilon}\rceil}+1 iterations, either mink=0:N1akε\min_{k=0:N-1}a_{k}\leq\varepsilon, or bNηLb_{N}\geq\eta L.

Proof If b0ηCb_{0}\geq\eta C, we are done. Else b0<Cb_{0}<C. Let NN be the smallest integer such that NC2b02εN\geq\frac{C^{2}-b_{0}^{2}}{\varepsilon}. Suppose bN<Cb_{N}<C. Then

Hence, for NC2b02εN\geq\frac{C^{2}-b_{0}^{2}}{\varepsilon}, mink[N1]akε.\min_{k\in[N-1]}a_{k}\leq\varepsilon. Suppose mink[N1]ak>ϵ\min_{k\in[N-1]}a_{k}>\epsilon, then from above inequalities we have bN>Cb_{N}>C. The following Lemma shows that {F(xk)}k=0\{F(x_{k})\}_{k=0}^{\infty} is a bounded sequence for any value of b0>0b_{0}>0.

Suppose FCL1F\in C_{L}^{1} and F=infxF(x)>F^{*}=\inf_{x}F(x)>-\infty. Denote by k01k_{0}\geq 1 the first index such that bk0ηLb_{k_{0}}\geq\eta L. Then for all bk<ηL,k=0,1,,k01b_{k}<\eta L,k=0,1,\ldots,k_{0}-1,

Proof Suppose k01k_{0}\geq 1 is the first index such that bk0ηLb_{k_{0}}\geq\eta L. By Lemma 3.1, for jk01,j\leq k_{0}-1,

2 Main proof

Proof By Lemma 4.1, if mink[N1]F(xk)2ε\min_{k\in[N-1]}\|\nabla F(x_{k})\|^{2}\leq\varepsilon is not satisfied after N=(ηL)2b02ε+1N=\lceil{\frac{(\eta L)^{2}-b_{0}^{2}}{\varepsilon}\rceil}+1 steps, then there exits a first index 1k0N1\leq k_{0}\leq N such that bk0η>L\frac{b_{k_{0}}}{\eta}>L. By Lemma 3.1, for j0,j\geq 0,

Let Z=k=k01M1Fk2Z=\sum_{k=k_{0}-1}^{M-1}\|\nabla F_{k}\|^{2}, it follows that

Solving the quadratic inequality for ZZ,

If k0=1k_{0}=1, the stated result holds by multiplying both side by 1M\frac{1}{M}. Otherwise, k0>1.k_{0}>1. From Lemma 4.2, we have

Replacing Fk01FF_{k_{0}-1}-F^{*} in (15) by above bound, we have

where NL2b02εN\leq\frac{L^{2}-b_{0}^{2}}{\varepsilon} and M=CMεM=\frac{C_{M}}{\varepsilon} .

Numerical experiments

With guaranteed convergence of AdaGrad-Norm and its robustness to the parameters η\eta and b0b_{0}, we perform experiments on several data sets ranging from simple linear regression over Gaussian data to neural network architectures on state-of-the-art (SOTA) image data sets including ImageNet. These experiments are not about outperforming the strong baseline of well-tuned SGD, but to further strengthen the theoretical finding that the convergence of AdaGrad-norm requires less hyper-parameter tuning while maintaining a comparable performance as the well-tuned SGD methods.

In this section, we consider linear regression to corroborate our analysis, i.e.,

We can see in Figure 1 how AdaGrad-Norm and AdaGrad-Coordinate auto-tune the learning rate adaptively to a certain level to match the unknown Lipschitz smoothness constant and the stochastic noise so that the gradient norm converges for a significantly wider range of b0b_{0} than for either SGD method. In particular, when b0b_{0} is initialized too small, AdaGrad-Norm and AdaGrad-Coordinate still converge with good speed while SGD-Constant and SGD-DecaySqrt diverge. When b0b_{0} is initialized too large (stepsize too small), surprisingly AdaGrad-Norm and AdaGrad-Coordinate converge at the same speed as SGD-Constant. This possibly can be explained by Theorem 2.2 because this is somewhat like the deterministic setting (the stepsize controls the variance σ\sigma and a smaller learning rate implies smaller variance). Comparing AdaGrad-Coordinate and AdaGrad-Norm, AdaGrad-Norm is more robust to the initialization b0b_{0} but is not better than AdaGrad-Coordinate when the initialization b0b_{0} is close to the optimal value of LL.

Figure 2 explores the batch gradient descent setting, when there is no variance σ=0\sigma=0 (i.e., using the whole data sample for one iteration). The experimental setup in Figure 2 is the same as Figure 1 except for the sample size mm of each iteration. Since the line-search method (GD-LineSearch) is one of the most important algorithms in deterministic gradient descent for adaptively choosing the step-size at each iteration, we also compare to this method – see Algorithm 2 in the appendix for our particular implementation of Line-Search. We see that the behavior of the four methods, AdaGrad-Norm, AdaGrad-Coordinate, GD-Constant, and GD-DecaySqrt, are very similar to the stochastic setting, albeit AdaGrad-Coordinate here is worse than in the stochastic setting. Among the five methods in the plot, GD-LineSearch performs the best but with significantly longer computational time, which is not practical in large-scale machine learning problems.

2 Image data

In this section, we extend our numerical analysis to the setting of deep learning and show that the robustness of AdaGrad-Norm does not come at the price of worse generalization – an important observation that is not explained by our current theory. The experiments are done in PyTorch (Paszke et al., 2017) and parameters are by default if no specification is provided.The code we used is originally from https://github.com/pytorch/examples/tree/master/imagenet We did not find it practical to compute the norm of the gradient for the entire neural network during back-propagation. Instead, we adapt a stepsize for each neuron or each convolutional channel by updating bjb_{j} with the gradient of the neuron or channel. Hence, our experiments depart slightly from a strict AdaGrad-Norm method and include a limited adaptive metric component. Details in implementing AdaGrad-Norm in a neural network are explained in the appendix and the code is also provided.https://github.com/xwuShirley/pytorch/blob/master/torch/optim/adagradnorm.py

We test on three data sets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009), see Table 1 in the appendix for detailed descriptions. For MNIST, our models are a logistic regression (LogReg), a multilayer network with two fully connected layers (FulConn2) with 100 hidden units and ReLU activations, and a convolutional neural network (see Table 2 in the appendix for details). For CIFAR10, our model is ResNet-18 (He et al., 2016). For both data sets, we use 256 images per iteration (2 GPUs with 128 images/GPU, 234 iterations per epoch for MNIST and 196 iterations per epoch for CIFAR10). For ImagetNet, we use ResNet-50 and 256 images for one iteration (8 GPUs with 32 images/GPU, 5004 iterations per epoch). Note that we do not use accelerated methods such as adding momentum in the training.

We pick these models for the following reasons: (1) LR with MNIST represents the smooth loss function; (2) FC with MNIST represents the non-smooth loss function; (3) CNN with MNIST belongs to a class of simple shallow network architectures; (4) ResNet-18 in CIFAR10 represents a complicated network architecture involving many other added features achieving SOTA performance; (5) ResNet-50 in ImageNet represents large-scale data and a deep network architecture.

In order to make the setting match our assumptions, we make several changes, which are not practically meaningful scenarios but serve only for corroborating our theorems.

For the experiment in MNIST, we do not use bias, regularization (zero weight decay), dropout, momentum, batch normalization (Ioffe and Szegedy, 2015), or any other added features that help achieving SOTA performance (see Figure 3 and Figure 4). However, the architecture of ResNet by default is built with the celebrated batch normalization (Batch-Norm) method as important layers. Batch-Norm accomplishes the auto-tuning property by normalizing the means and variances of mini-batches in a particular way during the forward-propagation, and in return is back-propagated with projection steps. This projection phenomenon is highlighted in weight normalization (Salimans and Kingma, 2016; Wu et al., 2018). Thus, in the ResNet-18 experiment on CIFAR10, we are particularly interested in how Batch-Norm interacts with the auto-tuning property of AdaGrad-Norm. We disable the learnable scale and shift parameters in the Batch-Norm layers Set nn.BatchNorm2d(planes,affine=False) and compare the default setup in ResNet (Goyal et al., 2017). The resulted plots are located in Figure 4 (bottom left and bottom right). In the ResNet-50 experiment on ImageNet, we also depart from the standard set-up by initializing the weights of the last fully connected layer with i.i.d. Gaussian samples with mean zero and variance 0.031250.03125. Note that the default initialization for the last fully-connected layer of ResNet50 is an i.i.d. Gaussian distribution with mean zero and variance of 0.010.01. Instead, we use variance 0.031250.03125 in that the AdaGrad-Norm algorithm takes the norm of the gradient. The initialization of Gaussian distribution with higher variance results in larger accumulative gradient norms, which is likely to make AdaGrad-Norm robust to small initialization of b0b_{0}. To some extent, AdaGrad-Norm could be sensitive to the model’s initialization. But how much sensitive the AdaGrad-Norm, or more generally the adaptive gradient methods, to the initialization of the model could be a potential future direction.

For all experiments, same initialized vector x0x_{0} is used for the same model so as to eliminate the effect of random initialization in weight vectors. We set η=1\eta=1 in all AdaGrad implementations, noting that in all these problems we know that F=0F^{*}=0 and we measure that F(x0)F(x_{0}) is between 1 and 10. Indeed, we approximate the loss using a sample of 256 images to be 1256i=1256fi(x0)\frac{1}{256}\sum_{i=1}^{256}f_{i}(x_{0}): 2.41292.4129 for logistic regression, 2.3052.305 for two-layer fully connected model, 2.3012.301 for convolution neural network, 2.38482.3848 for ResNet-18 with disable learnable parameter in Batch-Norm, 2.34592.3459 for ResNet-18 with default Batch-Norm, and 7.7047.704 for ResNet-50. We vary the initialization b0b_{0} while fixing all other parameters and plot the training accuracy and testing accuracy after different numbers of epochs. We compare AdaGrad-Norm with initial parameter b0b_{0} to (a) SGD-Constant: fixed stepsize 1b0\frac{1}{b_{0}}, (b) SGD-DecaySqrt: decaying stepsize ηj=1b0j\eta_{j}=\frac{1}{b_{0}\sqrt{j}} (c) AdaGrad-Coordinate: a vector of per-coefficient stepsizes. We use torch.optim.adagrad

In all experiments shown in Figures 3, 4, and 5, we fix b0b_{0} and compare the accuracy for the four algorithms; the convergence of AdaGrad-Norm is much better even for small initial values b0b_{0}, and shows much stronger robustness than the alternatives. In particular, Figures 3 and 4 show that the AdaGrad-Norm’s accuracy is extremely robust (as good as the best performance) to the choice of b0b_{0}. At the same time, the SGD methods and AdaGrad-Coordinate are highly sensitive. For Figure 5, the range of parameters b0b_{0} for which AdaGrad-Norm attains its best performance is also larger than the corresponding range for SGD-Constant and AdaGrad-Coordinate but sub-optimal for small values of b0b_{0}. It is likely to indicate that for ImageNet training, AdaGrad-Norm does not remove the need to tune b0b_{0} but makes the hyper-parameter search for b0b_{0} easier. Note that the best test accuracy in Figure 5 is substantially lower than numbers in the literature, where optimizers for ResNet-50 on ImageNet attain test accuracy around 76%76\% (Goyal et al., 2017), about 10%10\% better than the best result in Figure 5. This is mainly because (a) we do not apply momentum methods, and perhaps more critically (b) both SGD and AdaGrad-Norm do not use the default decaying scheduler for η\eta as in Goyal et al. (2017). Instead, we use a constant rate η=1\eta=1. Our purpose is not to achieve the comparable state-of-the-art results but mainly to verify that AdaGrad-Norm is less sensitive to hyper-parameter and requires less hyper-parameter tuning.

Similar to the Synthetic Data, when b0b_{0} is initialized in the range of well-tuned stepsizes, AdaGrad-Norm gives almost the same accuracy as SGD with constant stepsize; when b0b_{0} is initialized too small, AdaGrad-Norm still converges with good speed (except for CNN in MNIST), while SGDs do not. The divergence of AdaGrad-Norm with small b0b_{0} for CNN in MNIST (Figure 4, top right) can be possibly explained by the unboundedness of gradient norm in the four-layer CNN model. In contrast, the 18-layer or 50-layer ResNet model is very robust to all range of b0b_{0} in experiments (Figure 4, bottom), which is due to Batch-Norm that we further discuss in the next paragraph.

We are interested in the experiments of Batch-Norm by default and Batch-Norm without learnable parameters because we want to understand how AdaGrad-Norm interacts with models that already have the built-in feature of auto-tuning stepsize such as Batch-Norm. First, comparing the outcomes of Batch-Norm with the default setting (Figure 4, bottom right) and without learnable parameters (Figure 4, bottom left), we see the learnable parameters (scales and shifts) in Batch-Norm can be very helpful in accelerating the training. Surprisingly, the best stepsize in Batch-Norm with default for SGD-Constant is at b0=0.1b_{0}=0.1 (i.e., η=10\eta=10). While the learnable parameters are more beneficial to AdaGrad-Coordinate, AdaGrad-Norm seems to be affected less. Overall, combining the two auto-tuning methods (AdaGrad-Norm and Batch-Norm) give good performance.

At last, we add momentum to the stochastic gradient descent methods as empirical evidence to showcase the robustness of adaptive methods with momentum shown in Figure 6. Since SGD with 0.90.9 momentum is commonly used, we also set 0.90.9 momentum for our implementation of AdaGrad-Norm. See Algorithm 3 in the appendix for details. The results (Figure 6) show that AdaGrad-Norm with momentum is highly robust to initialization while SGD with momentum is not. SGD with momentum does better than AdaGrad-Norm when the initialization b0b_{0} is greater than the Lipschitz smoothness constant. When b0b_{0} is smaller than the Lipschitz smoothness constant, AdaGrad-Norm performs as well as SGD with the best stepsize (0.10.1).

Acknowledgments

Special thanks to Kfir Levy for pointing us to his work, to Francesco Orabona for reading a previous version and pointing out a mistake, and to Krishna Pillutla for discussion on the unit mismatch in AdaGrad. We thank Arthur Szlam and Mark Tygert for constructive suggestions. We also thank Francis Bach, Alexandre Defossez, Ben Recht, Stephen Wright, and Adam Oberman. We appreciate the help with the experiments from Priya Goyal, Soumith Chintala, Sam Gross, Shubho Sengupta, Teng Li, Ailing Zhang, Zeming Lin, and Timothee Lacroix. Finally, we owe particular gratitude to the reviewers and the editor for their suggestions and comments that significantly improved the paper.

References

A Tables

B Implementing Algorithm 1 in a neural network

In this section, we give the details for implementing our algorithm in a neural network. In the standard neural network architecture, the computation of each neuron consists of an elementwise nonlinearity of a linear transform of input features or output of previous layer:

where ww is the dd-dimensional weight vector, bb is a scalar bias term, xx,yy are respectively a dd-dimensional vector of input features (or output of previous layer) and the output of current neuron, ϕ()\phi(\cdot) denotes an element-wise nonlinearity.

For fully connected layer, the stochastic gradient GG in Algorithm 1 represents the gradient of the current neuron (see the green curve, Figure 7). Thus, when implementing our algorithm in PyTorch, AdaGrad Norm is one learning rate associated to one neuron for fully connected layer, while SGD has one learning rate for all neurons.

For convolution layer, the stochastic gradient GG in Algorithms 1 represents the gradient of each channel in the neuron. For instance, there are 6 learning rates for the first layer in the LeNet architecture (Table 1). Thus, AdaGrad-Norm is one learning rate associated to one channel.