AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

Rachel Ward, Xiaoxia Wu, Leon Bottou

Introduction

For non-convex but smooth loss functions $F$ , (noiseless) gradient descent (GD) with constant stepsize converges to a stationary point of $F$ at rate $\mathcal{O}\left({1}/{N}\right)$ with the number of iterations $N$ (Nesterov, 1998). In the same setting, and under the general assumption of bounded gradient noise variance, SGD with constant or decreasing stepsize $\eta_{j}=\mathcal{O}\left({1}/{\sqrt{j}}\right)$ has been proven to converge to a stationary point of $F$ at rate $\mathcal{O}\left({1}/{\sqrt{N}}\right)$ (Ghadimi and Lan, 2013; Bottou et al., 2018). The $\mathcal{O}\left({1}/{N}\right)$ rate for GD is the best possible worst-case dimension-free rate of convergence for any algorithm (Carmon et al., 2019); faster convergence rates in the noiseless setting are available under the mild assumption of additional smoothness (Agarwal et al., 2017; Carmon et al., 2017, 2018). In the noisy setting, faster rates than $\mathcal{O}\left({1}/{\sqrt{N}}\right)$ are also possible using accelerated SGD methods (Ghadimi and Lan, 2016; Allen-Zhu and Yang, 2016; Reddi et al., 2016; Allen-Zhu, 2017; Xu et al., 2018; Zhou et al., 2018; Fang et al., 2018). For instance, Zhou et al. (2018) and Fang et al. (2018) obtain the rate $\mathcal{O}\left({1}/{{N}^{2/3}}\right)$ without requiring finite-sum structure but with an additional assumptions about Lipschitz continuity of the stochastic gradients, which they exploit to reduce variance.

Instead of focusing on faster convergence rates for SGD, this paper focuses on adaptive stepsizes (Cutkosky and Boahen, 2017; Levy, 2017) that make the optimization algorithm more robust to (generally unknown) parameters of the optimization problem, such as the noise level of the stochastic gradient and the Lipschitz smoothness constant $L$ of the loss function defined as the smallest number $L>0$ such that $\|\nabla F(x)-\nabla F(y)\|\leq L\|x-y\|$ for all $x,y$ . In particular, the $\mathcal{O}\left({1}/{N}\right)$ convergence of GD with fixed stepsize is guaranteed only if the fixed stepsize $\eta>0$ is carefully chosen such that $\eta\leq 1/L$ – choosing a larger stepsize $\eta$ , even just by a factor of 2, can result in oscillation or divergence of the algorithm (Nesterov, 1998). Because of this sensitivity, GD with fixed stepsize is rarely used in practice; instead, one adaptively chooses the stepsize $\eta_{j}>0$ at each iteration to approximately maximize a decrease of the loss function in the current direction of $-\nabla F(x_{j})$ via either line search (Wright and Nocedal, 2006), or according to the Barzilai-Borwein rule (Barzilai and Borwein, 1988) combined with line search.

However, these bounds do not tell us much about how to select a good stepsize schedule in practice, where algorithms are run for finite iterations and the constants in the rate of convergence matter.

The question of how to choose the stepsize $\eta>0$ or stepsize or learning rate schedule $\{\eta_{j}\}$ for SGD is by no means resolved; in practice, a preferred schedule is chosen manually by testing many different schedules in advance and choosing the one leading to smallest training or generalization error. This process can take days or weeks, and can become prohibitively expensive in terms of time and computational resources incurred.

Theoretically rigorous convergence results for AdaGrad-Norm were provided in the convex setting recently (Levy, 2017). Moreover, it is possible to obtain convergence rates in the offline setting by online-batch conversion. However, making such observations rigorous for nonconvex functions is difficult because $b_{j}$ is itself a random variable which is correlated with the current and all previous noisy gradients; thus, the standard proofs in SGD do not straightforwardly extend to the proofs of AdaGrad-Norm. This paper provides such a proof for AdaGrad-Norm.

2 Main contributions

Our results make rigorous and precise the observed phenomenon that the convergence behavior of AdaGrad-Norm is highly adaptable to the unknown Lipschitz smoothness constant and level of stochastic noise on the gradient: when there is noise, AdaGrad-Norm converges at the rate of $O(\log(N)/\sqrt{N})$ , and when there is no noise, the same algorithm converges at the optimal $O(1/N)$ rate like well-tuned batch gradient descent. Moreover, our analysis shows that AdaGrad-Norm converges at these rates for any choices of the algorithm hyperparameters $b_{0}>0$ and $\eta>0$ , in contrast to GD or SGD with fixed stepsize where if the stepsize is set above a hard upper threshold governed by the (generally unknown) smoothness constant $L$ , the algorithm might not converge at all. Finally, we note that the constants in the rates of convergence we provide are explicit in terms of their dependence on the hyperparameters $b_{0}$ and $\eta$ . We list our two main theorems (informally) in the following:

For a differentiable non-convex function $F$ with $L$ -Lipschitz gradient and $F^{*}=\inf_{x}F(x)>-\infty$ , Theorem 2.1 implies that AdaGrad-Norm converges to an $\varepsilon$ -approximate stationary point with high probability It is becoming common to define an $\varepsilon$ -approximate stationary point as $\|\nabla F(x)\|\leq\varepsilon$ (Agarwal et al., 2017; Carmon et al., 2018, 2019; Fang et al., 2018; Zhou et al., 2018; Allen-Zhu, 2018), but we use the convention $\|F(x)\|^{2}\leq\varepsilon$ (Lei et al., 2017; Bottou et al., 2018) to most easily compare our results to those from Ghadimi and Lan (2013); Li and Orabona (2019). at the rate

If the optimal value of the loss function $F^{*}$ is known and one sets $\eta=F(x_{0})-F^{*}$ accordingly, then the constant in our rate is close to the best-known constant $\sigma L(F(x_{0})-F^{*})$ achievable for SGD with fixed stepsize $\eta=\eta_{1}=\dots=\eta_{N}=\min\{\frac{1}{L},\frac{1}{\sigma\sqrt{N}}\}$ carefully tuned to knowledge of $L$ and $\sigma$ , as given in Ghadimi and Lan (2013). However, our result requires bounded gradient $\|\nabla F(x)\|^{2}\leq\gamma^{2}$ and our rate constant scales with $\gamma\sigma$ instead of linearly in $\sigma$ . Nevertheless, our result suggests a good strategy for setting hyperparameters in implementing AdaGrad-Norm practically: given knowledge of $F^{*}$ , set $\eta=F(x_{0})-F^{*}$ and simply initialize $b_{0}>0$ to be very small.

When there is no noise $\sigma=0$ , we can improve this rate to an $\mathcal{O}\left(1/N\right)$ rate of convergence. In Theorem 2.2, we show that $\min_{j\in[N]}\|\nabla F(x_{j})\|^{2}\leq\varepsilon$ after (1) $N=\mathcal{O}\left(\frac{1}{\varepsilon}\left(\left({(F(x_{0})-F^{*})}/{\eta}\right)^{2}+b_{0}\left(F(x_{0})-F^{*}\right)/\eta\right)\right)$ if $b_{0}\geq\eta L,$ (2) $N=\mathcal{O}\left(\frac{1}{\varepsilon}\left(L\left(F(x_{0})-F^{*}\right)+\left({(F(x_{0})-F^{*})}/{\eta}\right)^{2}\right)+\frac{(\eta L)^{2}}{\varepsilon}\log\left(\frac{\eta L}{b_{0}}\right)\right)$ if ${b_{0}}<{\eta}L.$ Note that the constant $(\eta L)^{2}$ in the second case when ${b_{0}}<\eta L$ is not optimal compared to the known best rate constant $\eta L$ obtainable by gradient descent with fixed stepsize $\eta=1/L$ (Carmon et al., 2019); on the other hand, given knowledge of $L$ and $F(x_{0})-F^{*}$ , the rate constant of AdaGrad-norm reproduces the optimal constant $\eta L$ by setting $\eta=F(x_{0})-F^{*}$ and $b_{0}=\eta L$ .

Practically, our results imply a good strategy for setting the hyperparameters when implementing AdaGrad-norm in practice: set $\eta=(F(x_{0})-F^{*})$ (assuming $F^{*}$ is known) and set $b_{0}>0$ to be a very small value. If $F^{*}$ is unknown, then setting $\eta=1$ should work well for a wide range of values of $L$ , and in the noisy case with $\sigma^{2}$ strictly greater than zero.

3 Previous work

Theoretical guarantees of convergence for AdaGrad were provided in Duchi et al. (2011) in the setting of online convex optimization, where the loss function may change from iteration to iteration and be chosen adversarially. AdaGrad was subsequently observed to be effective for accelerating convergence in the nonconvex setting, and has become a popular algorithm for optimization in deep learning problems. Many modifications of AdaGrad with or without momentum have been proposed, namely, RMSprop (Srivastava and Swersky, 2012), AdaDelta (Zeiler, 2012), Adam (Kingma and Ba, 2015), AdaFTRL(Orabona and Pal, 2015), SGD-BB(Tan et al., 2016), AdaBatch (Defossez and Bach, 2017), SC-Adagrad (Mukkamala and Hein, 2017), AMSGRAD (Reddi et al., 2018), Padam (Chen and Gu, 2018), etc. Extending our convergence analysis to these popular alternative adaptive gradient methods remains an interesting problem for future research.

Regarding the convergence guarantees for the norm version of adaptive gradient methods in the offline setting, the recent work by Levy (2017) introduces a family of adaptive gradient methods inspired by AdaGrad, and proves convergence rates in the setting of (strongly) convex loss functions without knowing the smoothness parameter $L$ in advance. Yet, that analysis still requires the a priori knowledge of a convex set ${\cal K}$ with known diameter $D$ in which the global minimizer resides. More recently, Wu et al. (2018) provids convergence guarantees in the non-convex setting for a different adaptive gradient algorithm, WNGrad, which is closely related to AdaGrad-Norm and inspired by weight normalization (Salimans and Kingma, 2016). In fact, the WNGrad stepsize update is similar to AdaGrad-Norm’s:

4 Future work

5 Notation

AdaGrad-Norm convergence

To be clear about the adaptive algorithm, we first state in Algorithm 1 the norm version of AdaGrad we consider throughout in the analysis.

The random vectors $\xi_{k}$ , $k=0,1,2,\dots,$ are independent of each other and also of $x_{k}$ ;

$\|\nabla F(x)\|^{2}\leq\gamma^{2}$ uniformly.

The first two assumptions are standard (see e.g. Nemirovski and Yudin (1983); Nemirovski et al. (2009); Bottou et al. (2018)). The third assumption is somewhat restrictive as it rules out strongly convex objectives, but is not an unreasonable assumption for AdaGrad-Norm, where the adaptive learning rate is a cumulative sum of all previous observed gradient norms.

Because of the variance in gradient, the AdaGrad-Norm stepsize $\frac{\eta}{b_{k}}$ decreases to zero roughly at a rate between $\frac{1}{\sqrt{2(\gamma^{2}+\sigma^{2})k}}$ and $\frac{1}{\sigma\sqrt{k}}$ . It is known that AdaGrad-Norm stepsize decreases at this rate (Levy, 2017), and that this rate is optimal in $k$ in terms of the resulting convergence theorems in the setting of smooth but not necessarily convex $F$ , or convex but not necessarily strongly convex or smooth $F$ . Still, standard convergence theorems for SGD do not extend straightforwardly to AdaGrad-Norm because the stepsize $1/b_{k}$ is a random variable and dependent on all previous points visited along the way, i.e., $\{\|\nabla F(x_{j})\|\}_{j=0}^{k}$ and $\{\|\nabla G(x_{j},\xi_{j})\|\}_{j=0}^{k}$ . From this point on, we use the shorthand $G_{k}=G(x_{k},\xi_{k})$ and $F_{k}=\nabla F(x_{k})$ for simplicity of notation. The following theorem gives the convergence guarantee to Algorithm 1. We give detailed proof in Section 3.

This result implies that AdaGrad-Norm converges for any $\eta>0$ and starting from any value of $b_{0}>0$ . To put this result in context, we can compare to Corollary 2.2 of Ghadimi and Lan (2013) giving the best-known convergence rate for SGD with fixed step-size in the same setting (albeit not requiring Assumption (3) of uniformly bounded gradient): if the Lipschitz smoothness constant $L$ and the variance $\sigma^{2}$ are known a priori, and the fixed stepsize in SGD is set to

We match the $O(1/\sqrt{N})$ rate of Ghadimi and Lan (2013), but without a priori knowledge of $L$ and $\sigma$ , and with a worse constant in the rate of convergence. In particular, the constant in our bound scales according to $\sigma^{2}$ or $\gamma\sigma$ (up to logarithmic factors in $\sigma$ ) while the result for SGD with well-tuned fixed step-size scales linearly with $\sigma$ . The additional logarithmic factor (by Lemma 3.2) results from the AdaGrad-Norm update using the square norm of the gradient (see inequality (11) for details). The extra constant $\frac{1}{\sqrt{\delta}}$ results from the correlation between the stepsize $b_{j}$ and the gradient $\|\nabla F(x_{j})\|$ . We note that the recent work Li and Orabona (2019) derives an $O(1/\sqrt{N})$ rate for a variation of AdaGrad-Norm without the assumption of uniformly bounded gradient, but at the same time requires a priori knowledge of the smoothness constant $L>0$ in setting the step-size in order to establish convergence, similar to SGD with fixed stepsize. Finally, we note that recent works (Allen-Zhu, 2017; Lei et al., 2017; Fang et al., 2018; Zhou et al., 2018) provide modified SGD algorithms with convergence rates faster than $O(1/\sqrt{N})$ , albeit again requiring priori knowledge of both $L$ and $\sigma$ to establish convergence.

We reiterate however that the main emphasis in Theorem 2.1 is on the robustness of the AdaGrad-Norm convergence to its hyperparameters $\eta$ and $b_{0}$ , compared to plain SGD’s dependence on its parameters $\eta$ and $\sigma$ . Although the constant in the rate of our theorem is not as good as the best-known constant for stochastic gradient descent with well-tuned fixed stepsize, our result suggests that implementing AdaGrad-Norm allows one to vastly reduce the need to perform laborious experiments to find a stepsize schedule with reasonable convergence when implementing SGD in practice.

We note that for the second bound in 2.1, in the limit as $\sigma\rightarrow 0$ we recover an $O\left(\log(N)/N\right)$ rate of convergence for noiseless gradient descent. We can establish a stronger result in the noiseless setting using a different method of proof, removing the additional log factor and Assumption 3 of uniformly bounded gradient. We state the theorem below and defer our proof to Section 4.

Then $\min_{j\in[N]}\|\nabla F(x_{j})\|^{2}\leq\varepsilon$ after

$N=1+\lceil{\frac{1}{\varepsilon}\left(\frac{4\left(F(x_{0})-F^{*}\right)^{2}}{\eta^{2}}+\frac{2b_{0}\left(F(x_{0})-F^{*}\right)}{\eta}\right)\rceil}$ if ${b_{0}}\geq\eta L,$

$N=1+\lceil{\frac{1}{\varepsilon}\left(2L\left(F(x_{0})-F^{*}\right)+\left(\frac{2\left(F(x_{0})-F^{*}\right)}{\eta}+\eta LC_{b_{0}}\right)^{2}+(\eta L)^{2}(1+C_{b_{0}})-b_{0}^{2}\right)\rceil}$ $\quad\quad\text{if }{b_{0}}<\eta L.$ Here $C_{b_{0}}=1+2\log\left(\frac{\eta L}{b_{0}}\right)$ .

The convergence bound shows that, unlike gradient descent with constant stepsize $\eta$ which can diverge if the stepsize $\eta\geq 2/L$ , AdaGrad-Norm convergence holds for any choice of parameters $b_{0}$ and $\eta$ . The critical observation is that if the initial stepsize $\frac{\eta}{b_{0}}>\frac{1}{L}$ is too large, the algorithm has the freedom to diverge initially, until $b_{j}$ grows to a critical point (not too much larger than $L\eta$ ) at which point $\frac{\eta}{b_{j}}$ is sufficiently small that the smoothness of $F$ forces $b_{j}$ to converge to a finite number on the order of $L$ , so that the algorithm converges at an $O(1/N)$ rate. To describe the result in Theorem 2.2, let us first review a classical result (see, for example Nesterov (1998), $(1.2.13)$ ) on the convergence rate for gradient descent with fixed stepsize.

Alternatively, if $b\leq\frac{L}{2}$ , then convergence is not guaranteed at all – gradient descent can oscillate or diverge.

Compared to the convergence rate of gradient descent with fixed stepsize, AdaGrad-Norm in the case $b=b_{0}\geq\eta L$ gives a larger constant in the rate. But in case $b=b_{0}<\eta L$ , gradient descent can fail to converge as soon as $b\leq\eta L/2$ , while AdaGrad-Norm converges for any $b_{0}>0$ , and is extremely robust to the choice of $b_{0}<\eta L$ in the sense that the resulting convergence rate remains close to the optimal rate of gradient descent with fixed stepsize $1/b=1/L$ , paying a factor of $\log(\frac{\eta L}{b_{0}})$ and $(\eta L)^{2}$ in the constant. Here, the constant $(\eta L)^{2}$ results from the worst-cast analysis using Lemma 4.1, which assumes that the gradient $\|\nabla F(x_{j})\|^{2}\approx\varepsilon$ for all $j=0,1,\ldots$ , when in reality the gradient should be much larger at first. We believe the number of iterations can be improved by a refined analysis, or by considering the setting where $x_{0}$ is drawn from an appropriate random distribution.

Proof of Theorem 2.1

We first introduce some important lemmas in subsection 3.1 and give the main proof of Theorem 2.1 in Subsection 3.2.

We first introduce several lemmas that are used in the proof for Theorem 2.1. We repeatedly appeal to the following classical Descent Lemma, which is also the main ingredient in Ghadimi and Lan (2013), and can be proved by considering the Taylor expansion of $F$ around $y$ .

We will also use the following lemmas concerning sums of non-negative sequences.

For any non-negative $a_{1},\cdots,a_{T}$ , and $a_{1}\geq 1$ , we have

2 Main proof

Proof For simplicity, we write $F_{j}=F(x_{j})$ and $\nabla F_{j}=\nabla F(x_{j})$ . By Lemma 3.1, for $j\geq 0,$

At this point, we cannot apply the standard method of proof for SGD, since $b_{j+1}$ and $G_{j}$ are correlated random variables and thus, in particular, for the conditional expectation

By applying the inequality $ab\leq\frac{1}{2\lambda}b^{2}+\frac{\lambda}{2}a^{2}$ with $\lambda=\frac{2\sigma^{2}}{\sqrt{b_{j}^{2}+\|\nabla F_{j}\|^{2}+\sigma^{2}}}$ , $a=\frac{\|G_{j}\|}{b_{j+1}}$ , and $b=\frac{\left|\|G_{j}\|-\|\nabla F_{j}\|\right|\|\nabla F_{j}\|}{\sqrt{b_{j}^{2}+\|\nabla F_{j}\|^{2}+\sigma^{2}}}$ , the first term in (7) can be bounded as

where the first term in the last inequality is due to the fact that

Similarly, applying the inequality $ab\leq\frac{\lambda}{2}a^{2}+\frac{1}{2\lambda}b^{2}$ with $\lambda=\frac{2}{\sqrt{b_{j}^{2}+\|\nabla F_{j}\|^{2}+\sigma^{2}}}$ , $a=\frac{\sigma\|G_{j}\|}{b_{j+1}}$ , and $b=\frac{\|\nabla F_{j}\|}{\sqrt{b_{j}^{2}+\|\nabla F_{j}\|^{2}+\sigma^{2}}}$ , the second term of the right hand side in equation (7) is bounded by

Thus, putting inequalities (8) and (9) back into (7) gives

Applying the law of total expectation, we take the expectation of each side with respect to $\xi_{j-1},\xi_{j-2},\dots,\xi_{1}$ , and arrive at the recursion

Taking $j=N$ and summing up from $k=0$ to $k=N-1$ ,

where the second inequality we apply Lemma (3.2) and then Jensen’s inequality to bound the summation:

For the term on left hand side in equation (10), we apply Hölder’s inequality,

where the last inequality is due to inequality (12). Thus (10) arrives at the inequality

Multiplying by $\frac{2b_{0}+2\sqrt{2N}(\gamma+\sigma)}{N}$ , the above inequality gives

Finally, the bound is obtained by Markov’s Inequality:

where in the second step Jensen’s inequality is applied to the concave function $\phi(x)=\min_{k}h_{k}(x)$ .

2.2 Finishing the proof of the second bound in Theorem 2.1

First, observe with probability $1-\delta^{\prime}$ that

For the denominator on the left hand side of the inequality 10, we let $Z=\sum_{k=0}^{N-1}\|\nabla F_{k}\|^{2}$ and so

Thus, we further simplify inequality (10),

we have with probability $1-\hat{\delta}-\delta^{\prime}$ that

That is equivalent to solve the following quadratic equation

Let $\hat{\delta}=\delta^{\prime}=\frac{\delta}{2}$ . Replacing $Z$ with $\sum_{k=0}^{N-1}\|\nabla F_{k}\|^{2}$ and dividing both side with $N$ we have with probability $1-\delta$

Proof of Theorem 2.2

We will use the following lemma to argue that after an initial number of steps $N=\lceil{\frac{(\eta L)^{2}-b_{0}^{2}}{\varepsilon}\rceil}+1$ , either we have already reached a point $x_{k}$ such that $\|\nabla F(x_{k})\|^{2}\leq\varepsilon$ , or else $b_{N}\geq\eta L$ .

Fix $\varepsilon\in(0,1]$ and $C>0$ . For any non-negative $a_{0},a_{1},\dots,$ the dynamical system

has the property that after $N=\lceil{\frac{C^{2}-b_{0}^{2}}{\varepsilon}\rceil}+1$ iterations, either $\min_{k=0:N-1}a_{k}\leq\varepsilon$ , or $b_{N}\geq\eta L$ .

Proof If $b_{0}\geq\eta C$ , we are done. Else $b_{0}<C$ . Let $N$ be the smallest integer such that $N\geq\frac{C^{2}-b_{0}^{2}}{\varepsilon}$ . Suppose $b_{N}<C$ . Then

Hence, for $N\geq\frac{C^{2}-b_{0}^{2}}{\varepsilon}$ , $\min_{k\in[N-1]}a_{k}\leq\varepsilon.$ Suppose $\min_{k\in[N-1]}a_{k}>\epsilon$ , then from above inequalities we have $b_{N}>C$ . The following Lemma shows that $\{F(x_{k})\}_{k=0}^{\infty}$ is a bounded sequence for any value of $b_{0}>0$ .

Suppose $F\in C_{L}^{1}$ and $F^{*}=\inf_{x}F(x)>-\infty$ . Denote by $k_{0}\geq 1$ the first index such that $b_{k_{0}}\geq\eta L$ . Then for all $b_{k}<\eta L,k=0,1,\ldots,k_{0}-1$ ,

Proof Suppose $k_{0}\geq 1$ is the first index such that $b_{k_{0}}\geq\eta L$ . By Lemma 3.1, for $j\leq k_{0}-1,$

2 Main proof

Proof By Lemma 4.1, if $\min_{k\in[N-1]}\|\nabla F(x_{k})\|^{2}\leq\varepsilon$ is not satisfied after $N=\lceil{\frac{(\eta L)^{2}-b_{0}^{2}}{\varepsilon}\rceil}+1$ steps, then there exits a first index $1\leq k_{0}\leq N$ such that $\frac{b_{k_{0}}}{\eta}>L$ . By Lemma 3.1, for $j\geq 0,$

Let $Z=\sum_{k=k_{0}-1}^{M-1}\|\nabla F_{k}\|^{2}$ , it follows that

Solving the quadratic inequality for $Z$ ,

If $k_{0}=1$ , the stated result holds by multiplying both side by $\frac{1}{M}$ . Otherwise, $k_{0}>1.$ From Lemma 4.2, we have

Replacing $F_{k_{0}-1}-F^{*}$ in (15) by above bound, we have

where $N\leq\frac{L^{2}-b_{0}^{2}}{\varepsilon}$ and $M=\frac{C_{M}}{\varepsilon}$ .

Numerical experiments

With guaranteed convergence of AdaGrad-Norm and its robustness to the parameters $\eta$ and $b_{0}$ , we perform experiments on several data sets ranging from simple linear regression over Gaussian data to neural network architectures on state-of-the-art (SOTA) image data sets including ImageNet. These experiments are not about outperforming the strong baseline of well-tuned SGD, but to further strengthen the theoretical finding that the convergence of AdaGrad-norm requires less hyper-parameter tuning while maintaining a comparable performance as the well-tuned SGD methods.

In this section, we consider linear regression to corroborate our analysis, i.e.,

We can see in Figure 1 how AdaGrad-Norm and AdaGrad-Coordinate auto-tune the learning rate adaptively to a certain level to match the unknown Lipschitz smoothness constant and the stochastic noise so that the gradient norm converges for a significantly wider range of $b_{0}$ than for either SGD method. In particular, when $b_{0}$ is initialized too small, AdaGrad-Norm and AdaGrad-Coordinate still converge with good speed while SGD-Constant and SGD-DecaySqrt diverge. When $b_{0}$ is initialized too large (stepsize too small), surprisingly AdaGrad-Norm and AdaGrad-Coordinate converge at the same speed as SGD-Constant. This possibly can be explained by Theorem 2.2 because this is somewhat like the deterministic setting (the stepsize controls the variance $\sigma$ and a smaller learning rate implies smaller variance). Comparing AdaGrad-Coordinate and AdaGrad-Norm, AdaGrad-Norm is more robust to the initialization $b_{0}$ but is not better than AdaGrad-Coordinate when the initialization $b_{0}$ is close to the optimal value of $L$ .

Figure 2 explores the batch gradient descent setting, when there is no variance $\sigma=0$ (i.e., using the whole data sample for one iteration). The experimental setup in Figure 2 is the same as Figure 1 except for the sample size $m$ of each iteration. Since the line-search method (GD-LineSearch) is one of the most important algorithms in deterministic gradient descent for adaptively choosing the step-size at each iteration, we also compare to this method – see Algorithm 2 in the appendix for our particular implementation of Line-Search. We see that the behavior of the four methods, AdaGrad-Norm, AdaGrad-Coordinate, GD-Constant, and GD-DecaySqrt, are very similar to the stochastic setting, albeit AdaGrad-Coordinate here is worse than in the stochastic setting. Among the five methods in the plot, GD-LineSearch performs the best but with significantly longer computational time, which is not practical in large-scale machine learning problems.

2 Image data

In this section, we extend our numerical analysis to the setting of deep learning and show that the robustness of AdaGrad-Norm does not come at the price of worse generalization – an important observation that is not explained by our current theory. The experiments are done in PyTorch (Paszke et al., 2017) and parameters are by default if no specification is provided.The code we used is originally from https://github.com/pytorch/examples/tree/master/imagenet We did not find it practical to compute the norm of the gradient for the entire neural network during back-propagation. Instead, we adapt a stepsize for each neuron or each convolutional channel by updating $b_{j}$ with the gradient of the neuron or channel. Hence, our experiments depart slightly from a strict AdaGrad-Norm method and include a limited adaptive metric component. Details in implementing AdaGrad-Norm in a neural network are explained in the appendix and the code is also provided.https://github.com/xwuShirley/pytorch/blob/master/torch/optim/adagradnorm.py

We test on three data sets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009), see Table 1 in the appendix for detailed descriptions. For MNIST, our models are a logistic regression (LogReg), a multilayer network with two fully connected layers (FulConn2) with 100 hidden units and ReLU activations, and a convolutional neural network (see Table 2 in the appendix for details). For CIFAR10, our model is ResNet-18 (He et al., 2016). For both data sets, we use 256 images per iteration (2 GPUs with 128 images/GPU, 234 iterations per epoch for MNIST and 196 iterations per epoch for CIFAR10). For ImagetNet, we use ResNet-50 and 256 images for one iteration (8 GPUs with 32 images/GPU, 5004 iterations per epoch). Note that we do not use accelerated methods such as adding momentum in the training.

We pick these models for the following reasons: (1) LR with MNIST represents the smooth loss function; (2) FC with MNIST represents the non-smooth loss function; (3) CNN with MNIST belongs to a class of simple shallow network architectures; (4) ResNet-18 in CIFAR10 represents a complicated network architecture involving many other added features achieving SOTA performance; (5) ResNet-50 in ImageNet represents large-scale data and a deep network architecture.

In order to make the setting match our assumptions, we make several changes, which are not practically meaningful scenarios but serve only for corroborating our theorems.

For the experiment in MNIST, we do not use bias, regularization (zero weight decay), dropout, momentum, batch normalization (Ioffe and Szegedy, 2015), or any other added features that help achieving SOTA performance (see Figure 3 and Figure 4). However, the architecture of ResNet by default is built with the celebrated batch normalization (Batch-Norm) method as important layers. Batch-Norm accomplishes the auto-tuning property by normalizing the means and variances of mini-batches in a particular way during the forward-propagation, and in return is back-propagated with projection steps. This projection phenomenon is highlighted in weight normalization (Salimans and Kingma, 2016; Wu et al., 2018). Thus, in the ResNet-18 experiment on CIFAR10, we are particularly interested in how Batch-Norm interacts with the auto-tuning property of AdaGrad-Norm. We disable the learnable scale and shift parameters in the Batch-Norm layers Set nn.BatchNorm2d(planes,affine=False) and compare the default setup in ResNet (Goyal et al., 2017). The resulted plots are located in Figure 4 (bottom left and bottom right). In the ResNet-50 experiment on ImageNet, we also depart from the standard set-up by initializing the weights of the last fully connected layer with i.i.d. Gaussian samples with mean zero and variance $0.03125$ . Note that the default initialization for the last fully-connected layer of ResNet50 is an i.i.d. Gaussian distribution with mean zero and variance of $0.01$ . Instead, we use variance $0.03125$ in that the AdaGrad-Norm algorithm takes the norm of the gradient. The initialization of Gaussian distribution with higher variance results in larger accumulative gradient norms, which is likely to make AdaGrad-Norm robust to small initialization of $b_{0}$ . To some extent, AdaGrad-Norm could be sensitive to the model’s initialization. But how much sensitive the AdaGrad-Norm, or more generally the adaptive gradient methods, to the initialization of the model could be a potential future direction.

For all experiments, same initialized vector $x_{0}$ is used for the same model so as to eliminate the effect of random initialization in weight vectors. We set $\eta=1$ in all AdaGrad implementations, noting that in all these problems we know that $F^{*}=0$ and we measure that $F(x_{0})$ is between 1 and 10. Indeed, we approximate the loss using a sample of 256 images to be $\frac{1}{256}\sum_{i=1}^{256}f_{i}(x_{0})$ : $2.4129$ for logistic regression, $2.305$ for two-layer fully connected model, $2.301$ for convolution neural network, $2.3848$ for ResNet-18 with disable learnable parameter in Batch-Norm, $2.3459$ for ResNet-18 with default Batch-Norm, and $7.704$ for ResNet-50. We vary the initialization $b_{0}$ while fixing all other parameters and plot the training accuracy and testing accuracy after different numbers of epochs. We compare AdaGrad-Norm with initial parameter $b_{0}$ to (a) SGD-Constant: fixed stepsize $\frac{1}{b_{0}}$ , (b) SGD-DecaySqrt: decaying stepsize $\eta_{j}=\frac{1}{b_{0}\sqrt{j}}$ (c) AdaGrad-Coordinate: a vector of per-coefficient stepsizes. We use torch.optim.adagrad

In all experiments shown in Figures 3, 4, and 5, we fix $b_{0}$ and compare the accuracy for the four algorithms; the convergence of AdaGrad-Norm is much better even for small initial values $b_{0}$ , and shows much stronger robustness than the alternatives. In particular, Figures 3 and 4 show that the AdaGrad-Norm’s accuracy is extremely robust (as good as the best performance) to the choice of $b_{0}$ . At the same time, the SGD methods and AdaGrad-Coordinate are highly sensitive. For Figure 5, the range of parameters $b_{0}$ for which AdaGrad-Norm attains its best performance is also larger than the corresponding range for SGD-Constant and AdaGrad-Coordinate but sub-optimal for small values of $b_{0}$ . It is likely to indicate that for ImageNet training, AdaGrad-Norm does not remove the need to tune $b_{0}$ but makes the hyper-parameter search for $b_{0}$ easier. Note that the best test accuracy in Figure 5 is substantially lower than numbers in the literature, where optimizers for ResNet-50 on ImageNet attain test accuracy around $76\%$ (Goyal et al., 2017), about $10\%$ better than the best result in Figure 5. This is mainly because (a) we do not apply momentum methods, and perhaps more critically (b) both SGD and AdaGrad-Norm do not use the default decaying scheduler for $\eta$ as in Goyal et al. (2017). Instead, we use a constant rate $\eta=1$ . Our purpose is not to achieve the comparable state-of-the-art results but mainly to verify that AdaGrad-Norm is less sensitive to hyper-parameter and requires less hyper-parameter tuning.

Similar to the Synthetic Data, when $b_{0}$ is initialized in the range of well-tuned stepsizes, AdaGrad-Norm gives almost the same accuracy as SGD with constant stepsize; when $b_{0}$ is initialized too small, AdaGrad-Norm still converges with good speed (except for CNN in MNIST), while SGDs do not. The divergence of AdaGrad-Norm with small $b_{0}$ for CNN in MNIST (Figure 4, top right) can be possibly explained by the unboundedness of gradient norm in the four-layer CNN model. In contrast, the 18-layer or 50-layer ResNet model is very robust to all range of $b_{0}$ in experiments (Figure 4, bottom), which is due to Batch-Norm that we further discuss in the next paragraph.

We are interested in the experiments of Batch-Norm by default and Batch-Norm without learnable parameters because we want to understand how AdaGrad-Norm interacts with models that already have the built-in feature of auto-tuning stepsize such as Batch-Norm. First, comparing the outcomes of Batch-Norm with the default setting (Figure 4, bottom right) and without learnable parameters (Figure 4, bottom left), we see the learnable parameters (scales and shifts) in Batch-Norm can be very helpful in accelerating the training. Surprisingly, the best stepsize in Batch-Norm with default for SGD-Constant is at $b_{0}=0.1$ (i.e., $\eta=10$ ). While the learnable parameters are more beneficial to AdaGrad-Coordinate, AdaGrad-Norm seems to be affected less. Overall, combining the two auto-tuning methods (AdaGrad-Norm and Batch-Norm) give good performance.

At last, we add momentum to the stochastic gradient descent methods as empirical evidence to showcase the robustness of adaptive methods with momentum shown in Figure 6. Since SGD with $0.9$ momentum is commonly used, we also set $0.9$ momentum for our implementation of AdaGrad-Norm. See Algorithm 3 in the appendix for details. The results (Figure 6) show that AdaGrad-Norm with momentum is highly robust to initialization while SGD with momentum is not. SGD with momentum does better than AdaGrad-Norm when the initialization $b_{0}$ is greater than the Lipschitz smoothness constant. When $b_{0}$ is smaller than the Lipschitz smoothness constant, AdaGrad-Norm performs as well as SGD with the best stepsize ( $0.1$ ).

Acknowledgments

Special thanks to Kfir Levy for pointing us to his work, to Francesco Orabona for reading a previous version and pointing out a mistake, and to Krishna Pillutla for discussion on the unit mismatch in AdaGrad. We thank Arthur Szlam and Mark Tygert for constructive suggestions. We also thank Francis Bach, Alexandre Defossez, Ben Recht, Stephen Wright, and Adam Oberman. We appreciate the help with the experiments from Priya Goyal, Soumith Chintala, Sam Gross, Shubho Sengupta, Teng Li, Ailing Zhang, Zeming Lin, and Timothee Lacroix. Finally, we owe particular gratitude to the reviewers and the editor for their suggestions and comments that significantly improved the paper.

References

A Tables

B Implementing Algorithm 1 in a neural network

In this section, we give the details for implementing our algorithm in a neural network. In the standard neural network architecture, the computation of each neuron consists of an elementwise nonlinearity of a linear transform of input features or output of previous layer:

where $w$ is the $d$ -dimensional weight vector, $b$ is a scalar bias term, $x$ , $y$ are respectively a $d$ -dimensional vector of input features (or output of previous layer) and the output of current neuron, $\phi(\cdot)$ denotes an element-wise nonlinearity.

For fully connected layer, the stochastic gradient $G$ in Algorithm 1 represents the gradient of the current neuron (see the green curve, Figure 7). Thus, when implementing our algorithm in PyTorch, AdaGrad Norm is one learning rate associated to one neuron for fully connected layer, while SGD has one learning rate for all neurons.

For convolution layer, the stochastic gradient $G$ in Algorithms 1 represents the gradient of each channel in the neuron. For instance, there are 6 learning rates for the first layer in the LeNet architecture (Table 1). Thus, AdaGrad-Norm is one learning rate associated to one channel.