Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate

Haiwen Huang, Chang Wang, Bin Dong

Introduction

Along with the rise of deep learning, various first-order stochastic optimization methods emerged. Among them, the most fundamental one is the stochastic gradient descent, and the Nesterov’s Accelerated Gradient method NESTEROV (1983) is also a well-known acceleration algorithm. Recently, many adaptive stochastic optimization methods have been proposed, such as AdaGrad Duchi et al. (2010), RMSProp Tieleman and Hinton (2012), AdaDelta Zeiler (2012) and Adam Kingma and Ba (2014). These algorithms can be written in the following general form:

where gig_{i} is the gradient obtained in the ii-th time step, αt/ψ(g1,...,gt)\alpha_{t}/\psi(g_{1},...,g_{t}) the adaptive learning rate, and ϕ(g1,...,gt)\phi(g_{1},...,g_{t}) the gradient estimation. There have been extensive studies on the design of gradient estimations which can be traced back to classical momentum methods Polyak (1964) and NAG NESTEROV (1983). In this paper, however, we focus more on how to understand and improve the adaptive learning rate.

Adam Kingma and Ba (2014) is perhaps the most widely used adaptive stochastic optimization method which uses an exponential moving average (EMA) to estimate the square of the gradient scale, so that the learning rate can be adjusted adaptively. More specifically, Adam takes the form of (1) with

We shall call vtv_{t} the re-scaling term of the Adam and its variants, since it serves as a coordinate-wise re-scaling of the gradients. Despite its fast convergence and easiness in implementation, Adam is also known for its non-convergence and poor generalization in some cases Reddi et al. (2018)Wilson et al. (2017). More recently, Balles and Hennig (2018) both theoretically and empirically pointed out that generalization is mainly determined by the sign effect rather than the adaptive learning rate, and the sign effect is problem-dependent. In this paper, we are mainly dealing with the non-convergence issue and will only empirically compare generalization ability among different Adam variants.

As for the non-convergence issue, Reddi et al. (2018) suggested that the EMA of vtv_{t} of Adam is the cause. The main problem lies in the following quantity:

which essentially measures the change in the inverse of learning rate with respect to time. Algorithms that use EMA to estimate the scale of the gradients cannot guarantee the positive semi-definiteness of Γt\Gamma_{t}, and that causes the non-convergence of Adam. To fix this issue, Reddi et al. (2018) proposed AMSGrad, which added one more step v^t=max{v^t1,vt}\widehat{v}_{t}=\max\{\widehat{v}_{t-1},v_{t}\} in (2). AMSGrad is claimed by its authors to have a “long-term memory” of past gradients.

Another explanation on the cause of non-convergence was recently proposed by Zhou et al. (2018). The authors observed that Adam may diverge because a small gradient may have a large step size which leads to a large update. Therefore, if the small gtg_{t} with large step size is often in the wrong direction, it could lead to divergence. Thus, they proposed a modification to Adam called AdaShift by replacing gt2g_{t}^{2} with gtn2g_{t-n}^{2} for some manually chosen nn when calculating vtv_{t}.

Both AdaShift and AMSGrad suggest that we should not fully trust the gradient information we acquire at current step, and the past gradients are useful when gtg_{t} is not reliable. In this paper, we take this idea one step further by suggesting that we may weight more of the past gradients than the present ones. We call our algorithm Nostalgic Adam (NosAdam). We will show that the design of the algorithm is inspired by our mathematical analysis on the convergence, and NosAdam has the fastest known convergence rate. Furthermore, we will discuss why “nostalgia” is important, and empirically investigate how different designs of vtv_{t} can lead to different performances from a loss landscape perspective. Finally, we examine the empirical performance of NosAdam on some common machine learning tasks. The experiments show us that NosAdam is a promising alternative to Adam and its variants.

Related Work

Adam is widely used in both academia and industry. However, it is also one of the least well-understood algorithms. In recent years, some remarkable works provided us with better understanding of the algorithm, and proposed different variants of it. Most of works focused on how to interpret or modify the re-scaling term vtv_{t} of (2).

As mentioned above, Reddi et al. (2018), Zhou et al. (2018) focused on the non-convergence issue of Adam, and proposed their own modified algorithms. Wilson et al. (2017) pointed out the generalization issue of adaptive optimization algorithms. Based on the assumption that vtv_{t} is the estimate of the second moment estimate of gtg_{t}, Balles and Hennig (2018) dissected Adam into sign-based direction and variance adaption magnitude. They also pointed out that the sign-based direction part is the decisive factor of generalization performance, and that is problem-dependent. This in a way addressed the generalization issue raised in Wilson et al. (2017).

However, the interpretation of vtv_{t} as an estimate of the second moment assumption may not be correct, since Chen and Gu (2019) showed that vt1/2v_{t}^{1/2} in the Adam update (2) can be replaced by vtpv_{t}^{p} for any p(0,12]p\in(0,\frac{1}{2}]. The modified algorithm is called Padam. In our supplementary material, we also proved that a convergence theorem of a “p-norm” form of NosAdam, where the re-scaling term vtv_{t} can be essentially viewed as a “p-moment” of gtg_{t}. These discoveries cast doubts on the the second moment assumption, since both the convergence analysis and empirical performance seemed not so dependent on this assumption.

The true role of vtv_{t}, however, remains a mystery. In AdaGrad Duchi et al. (2010), which is a special case of NosAdam, the authors mentioned an metaphor that “the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features.” They were suggesting that vtv_{t} is to some extent balancing the update speeds of different features according to their abundance in the data set. This understanding might be supported by a previous work called SWATS (Switching from Adam to SGD) Keskar and Socher (2017), which uses Adam for earlier epochs and then fix the re-scaling term vtv_{t} for later epochs. This suggests that there may be some sort of “optimal” re-scaling term, and we can keep using it after we obtain a good enough estimate.

Despite all the previous efforts, our understanding of the re-scaling term vtv_{t} is still very limited. In this paper, we investigate the issue from a loss landscape approach, and this provides us with some deeper understanding of when and how different Adam-like algorithms can perform well or poorly.

Nostalgic Adam Algorithm

In this section, we introduce the Nostalgic Adam (NosAdam) algorithm, followed by a discussion on its convergence. Let us first consider a general situation where we allow the parameter β2\beta_{2} in Equation (2) change in time tt. Without loss of generality, we may let β2,t=Bt1Bt\beta_{2,t}=\frac{B_{t-1}}{B_{t}}. Then, the NosAdam algorithm reads as in Algorithm 1. Like Adam and its variants, the condition Γt0\Gamma_{t}\succeq 0 is crucial in ensuring convergence. We will also see that to ensure positive semi-definiteness of Γt\Gamma_{t}, the algorithm naturally requires to weight more of the past gradients than the more recent ones when calculating vtv_{t}. To see this, we first present the following lemma.

The positive semi-definiteness of Vtαt2Vt1αt12\frac{V_{t}}{\alpha^{2}_{t}}-\frac{V_{t-1}}{\alpha^{2}_{t-1}} is tightly satisfied if Btt\frac{B_{t}}{t} is non-increasing.

Here the “tightly satified” in the lemma means the conclusion cannot be strengthened, in that if Btt\frac{B_{t}}{t} is increasing, then Vtαt2Vt1αt12\frac{V_{t}}{\alpha^{2}_{t}}-\frac{V_{t-1}}{\alpha^{2}_{t-1}} will be very easily violated since gjg_{j} can be infinitesimal.

Again, without loss of generality, we can write BtB_{t} as j=1tbj\sum_{j=1}^{t}b_{j}. Then, it is not hard to see that Btt\frac{B_{t}}{t} is non-increasing if and only if bjb_{j} is non-increasing. Noting that vt=k=1tgk2bkBtv_{t}=\sum_{k=1}^{t}g_{k}^{2}\frac{b_{k}}{B_{t}}, we can see that the sufficient condition for positive semi-definiteness of Γt\Gamma_{t} is that in the weighted average vtv_{t}, the weights of gradients should be non-increasing w.r.t. tt. In other words, we should weight more of the past gradients than the more recent ones.

From Algorithm 1, we can see that vtv_{t} can either decrease or increase based on the relationship between vt1v_{t-1} and gt2g_{t}^{2}, which is the reason why NosAdam circumvents the flaw of AMSGrad (Figure 4). Convergence of NosAdam is also guaranteed as stated by the following theorem.

Let BtB_{t} and bkb_{k} be the sequences defined in Algorithm 1, αt=α/t\alpha_{t}=\alpha/\sqrt{t}, β1,1=β1,β1,tβ1\beta_{1,1}=\beta_{1},\beta_{1,t}\leq\beta_{1} for all t. Assume that F\mathscr{F} has bounded diameter DD_{\infty} and ft(x)G||\nabla f_{t}(x)||_{\infty}\leq G_{\infty} for all t and xFx\in\mathscr{F}. Furthermore, let β2,t\beta_{2,t} be such that the following conditions are satisfied:

Then for {xt}\{x_{t}\} generated using NosAdam, we have the following bound on the regret

One notable characteristic of NosAdam, which makes it rather different from the analysis by Reddi et al. (2018), is that the conditions on BtB_{t} and btb_{t} are data-independent and are very easy to check. In particular, if we choose BtB_{t} as a hyperharmonic series, i.e. Bt=k=1tkγB_{t}=\sum_{k=1}^{t}k^{-\gamma}, then the convergence criteria are automatically satisfied. We shall call this special case NosAdam-HH, and its convergence result is summarized in the following corollary.

Supposeβ1,t=β1λt1\beta_{1,t}=\beta_{1}\lambda^{t-1}, bk=kγ,γ0b_{k}=k^{-\gamma},\gamma\geq 0 , thus Bt=k=1tkγB_{t}=\sum_{k=1}^{t}k^{-\gamma}, and β2,t=Bt1/Bt<1\beta_{2,t}=B_{t-1}/B_{t}<1 in Algorithm 1. Then BtB_{t} and btb_{t} satisfy the constraints in Therorem A.1, and we have

Our theory shows that the proposed NosAdam achieves convergence rate of O(1/T)O(1/\sqrt{T}), which is so far the best known convergence rate.

Why Nostalgic?

In this section, we investigate more about the mechanism behind Adam and AMSGrad, and analyze the pros and cons of being “nostalgic”.

As mentioned in Section 1, Reddi et al. (2018) proved that if Γt\Gamma_{t} is positive semi-definite, Adam converges. Otherwise, it may diverge. An example of divergence made by Reddi et al. (2018) is

where CC is slightly larger than 2. The correct optimization direction should be -1, while Adam would go towards 1. To fix this, they proposed AMSGrad, which ensures Γt0\Gamma_{t}\succeq 0 by updating vtv_{t} as follows

where v^t\hat{v}_{t} is used in the update step.

However, this example is not representative of real situations. Also, the explanation of “long-term memory” by Reddi et al. (2018) is not very illustrative. In the remaining part of this section, we aim to discuss some more realistic senarios and try to understand the pros and cons of different algorithms.

We start from analyzing the different weighting strategies when calculating vtv_{t}. For Adam,

and the weight (1β2)β2tk(1-\beta_{2})\beta_{2}^{t-k} increases exponentially. For NosAdam,

and for NosAdam-HH, bk=kγb_{k}=k^{-\gamma} is the kk-th term of a hyperharmonic series. For AMSGrad, vt(AMSGrad)v_{t}^{(\text{AMSGrad})} is data-dependent and therefore cannot be explicitly expressed. However, vt(AMSGrad)v_{t}^{(\text{AMSGrad})} is chosen to be the largest in {vs(Adam):0st}\{v_{s}^{(\text{Adam})}:0\leq s\leq t\}. Therefore, it can be seen as a shifted version of vt(Adam)v_{t}^{(\text{Adam})}, i.e. vs(Adam)=vtn(Adam)v_{s}^{(\text{Adam})}=v_{t-n}^{(\text{Adam})}, where nn depends on the data. This is similar as AdaShift, where nn is instead a hyperparameter. Figure 1 plots the first 100 weights of Adam, NosAdam and AMSGrad, where β2\beta_{2}, γ\gamma, nn, is chosen as 0.9, 0.1 and 20, respectively.

From the above analysis, we can see that vtv_{t} of Adam is mainly determined by its most current gradients. Therefore, when gtg_{t} keeps being small, the adaptive learning rate could be large, which may lead to oscillation of the sequence, and increasing chance of being trapped in local minimum. On the other hand, NosAdam adopts a more stable calculation of vtv_{t}, since it relies on all the past gradients.

We support the above discussion with an example of an objective function with a bowl-shaped landscape where the global minima is at the bottom of the bowl with lots of local minimum surrounding it. The explicit formula of the objective function is

Figure 2a shows one slice of the function for z=2.34z=2.34. In the function, aa and bb determine the depth and width of the global minima, and cc, rr, β\beta determine depth, location and width of the local minimums. In this example, aa, bb, cc, rr, β\beta are set to 30, 0.007, 0.25, 1, 20, respectively.

Figure 2b shows different trajectories of Adam and NosAdam when they are initiated at the same point on the side of the bowl. As expected, the trajectory of Adam (yellow) passes the global minima and ends up trapped in valley AA, while NosAdam (red) gradually converges to the global minima, i.e. valley BB.

With the operation of taking max, AMSGrad does not have the same non-convergence issue as discussed above. However, taking max may be problematic as well since v^t\hat{v}_{t} can never decrease. If a very large gradient appears at an iteration, then the adaptive learning rate for all later steps will keep being small. For example, if a large gradient (e.g. 100 times of the original gradient) appears at the 10610^{6} step in the example (4), we can see that AMSGrad will converge very slowly. This, however, will not be a problem for NosAdam which has the ability of both increasing and decreasing its vtv_{t}. See Figure 3 for a demonstration. Another example with sharper local minima by setting b=2b=2, c=4c=4, r=1.3r=1.3 is given in Figure 4; and the algorithms are initialized at location A. One can see that the sequence generated by AMSGrad is trapped in a sharp local minima, whereas NosAdam still converges to the global minimum. From these examples we can see that the operation of taking max of AMSGrad has some intrinsic flaws though it promises convergence. The way of computing vtv_{t} in NosAdam seems superior.

There are also situations in which NosAdam can work poorly. Just because NosAdam is nostalgic, it requires a relatively good initial point to achieve good performances though this is commonly required by most optimization algorithms. However, Adam can be less affected by bad initializations sometime due to its specific way of calculating vtv_{t}. This gives it a chance of jumping out of the local minimum (and a chance of jumping out of the global minima as well as shown in Figure 2). To demonstrate this, we let both Adam and NosAdma initialize in the valley A (see Figure 5). We can see that the trajectory of Adam manages to jump out of the valley, while it is more difficult for NosAdam to do so.

We note that although NosAdam requires good initialization, it does not necessarily mean initializing near the global minima. Since the algorithm is nostalgic, as long as the initial gradients are pointing towards the right direction, the algorithm may still converge to the global minima even though the initialization is far away from the global minima. As we can see from Figure 4 that NosAdam converges because all of the gradients are good ones at the beginning of the algorithm, which generates enough momentum to help the sequence dashes through the region with sharp local minimum.

Like any Adam-like algorithm, the convergence of NosAdam depends on the loss landscape and initialization. However, if the landscape is as shown in the above figures, then NosAdam has a better chance to converge than Adam and AMSGrad. In practice, it is therefore helpful to first examine the loss landscape before selecting an algorithm. However, it is time consuming to do in general. Nonetheless, earlier studies showed that neural networks with skip connections like ResNet and DenseNet lead to coercive loss functions similar to the one shown in the above figures Li et al. (2018).

Experiments

In this section, we conduct some experiments to compare NosAdam with Adam and its variant AMSGrad. We consider the task of multi-class classification using logistic regression, multi-layer fully connected neural networks and deep convolutional neural networks on MNIST LECUN and CIFAR-10 Krizhevsky et al. . The results generally indicate that NosAdam is a promising algorithm that works well in practice.

Throughout our experiments, we fixed β1\beta_{1} to be 0.9, β2\beta_{2} to be 0.999 for Adam and AMSGrad, and search γ\gamma in {1e-1,1e-2,1e-3,1e-4}\{\text{1e-1},\text{1e-2},\text{1e-3},\text{1e-4}\} for NosAdam. The initial learning rate is chosen from {1e-3,2e-3,...,9e-3,1e-2,2e-2,...,9e-2,1e-1,2e-1,...,9e-1}\{\text{1e-3},\text{2e-3},...,\text{9e-3},\text{1e-2},\text{2e-2},...,\text{9e-2},\text{1e-1},\text{2e-1},...,\text{9e-1}\} and the results are reported using the best set of hyper-parameters. All the experiments are done using Pytorch0.4.

Logistic Regression To investigate the performance of the algorithms on convex problems, we evaluate Adam, AMSGrad and NosAdam on multi-class logistic regression problem using the MNIST dataset. To be consistent with the theory, we set the step size αt=α/t\alpha_{t}=\alpha/\sqrt{t}. We set the minibatch size to be 128. According to Figure 6a, the three algorithms have very similar performance.

Multilayer Fully Connected Neural Networks

We first train a simple fully connected neural network with 1 hidden layer (with 100 neurons and ReLU as the activation function) for the multi-class classification problem on MNIST. We use constant step size αt=α\alpha_{t}=\alpha throughout the experiments for this set of experiments. The results are shown in Figure 6b. We can see that NosAdam slightly outperforms AMSGrad, while Adam is much worse than both NosAdam and AMSGrad and oscillates a lot. This is due to the difference of the definition of vtv_{t} for each algorithm: vtv_{t} in AMSGrad and NosAdam gradually becomes stationary and stays at a good re-scaling value; while vtv_{t} in Adam does not have such property.

Deep Convolutional Neural Networks Finally, we train a deep convolutional neural network on CIFAR-10. Wide Residual Network Zagoruyko and Komodakis (2016) is known to be able to achieve high accuracy with much less layers than ResNet He et al. (2015). In our experiment, we choose Wide ResNet28. The model is trained on 4 GPUs with the minibatch size 100. The initial learning rate is decayed at epoch 50 and epoch 100 by multiplying 0.1. In our experiments, the optimal performances are usually achieved when the learning rate is around 0.02 for all the three algorithms. For reproducibility, an anonymous link of code will be provided in the supplementary material.

Our results are shown in Figure 7. We observe that NosAdam works slightly better than AMSGrad and Adam in terms of both convergence speed and generalization. This indicates that NosAdam is a promising alternative to Adam and its variants.

Discussion

In this paper, we suggested that we should weight more of the past gradients when designing the adaptive learning rate. In fact, our original intuition came from mathematical analysis of the convergence of Adam-like algorithms. Based on such observation, we then proposed a new algorithm called Nostalgic Adam (NosAdam), and provided a convergence analysis. We also discussed the pros and cons of NosAdam comparing to Adam and AMSGrad using a simple example, which gave us a better idea when NosAdam could be effective.

For future works, we believe that loss landscape analysis and the design of a strategy to choose different algorithms adaptively based on the loss landscape would be worth pursuing. Hopefully, we can design an optimization algorithm that can adaptively adjust its re-scaling term in order to fully exploit the local geometry of the loss landscape.

Acknowledgments

This work would not have existed without the support of BICMR and School of Mathematical Sciences, Peking University. Bin Dong is supported in part by Beijing Natural Science Foundation (Z180001).

Appendix A Convergence of p-NosAdam

In this appendix, we use the same notations as in the paper “Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate”. We are going to prove a more general convergence theorem. In the original paper, we propose NosAdam, as shown in Algorithm 1. But in fact, NosAdam can be considered as a particular case of a more general algorithm, in which we replaces gt2g_{t}^{2} in the calculation of vtv_{t} by gpg_{p}, and vt1/2v_{t}^{1/2} in the update equation by vt1/pv_{t}^{1/p}. We call this algorithm p-NosAdam, as shown in Algorithm 2. NosAdam is p-NosAdam when p=2p=2.

In the remaining part of this appendix, we are going to prove the convergence theorem of p-NosAdam when p>1p>1. From Theorem A.1, we can see that the regret bound is O(Tmax(1p,p1p))O(T^{\text{max}(\frac{1}{p},\frac{p-1}{p})}).

Let BtB_{t} and bkb_{k} be the sequences defined in p-NosAdam, αt=α/t1/p,p>1\alpha_{t}=\alpha/t^{1/p},p>1, β1,1=β1,β1,tβ1\beta_{1,1}=\beta_{1},\beta_{1,t}\leq\beta_{1} for all t. Assume that F\mathscr{F} has bounded diameter DD_{\infty} and ft(x)G||\nabla f_{t}(x)||_{\infty}\leq G_{\infty} for all t and xFx\in\mathscr{F}. Furthermore, let β2,t\beta_{2,t} be such that the following conditions are satisfied:

Then for {xt}\{x_{t}\} generated using p-NosAdam, we have the following bound on the regret

Let x=argminxFt=1Tft(x)x^{*}=\text{argmin}_{x\in\mathcal{F}}\sum_{t=1}^{T}f_{t}(x). Therefore RT=t=1Tft(xt)ft(x)R_{T}=\sum_{t=1}^{T}f_{t}(x_{t})-f_{t}(x^{*}).

To prove this theorem, we will use the following lemmas.

Using Lemma 4 in Reddi et al. (2018) with u1=xt+1u_{1}=x_{t+1} and u2=xu_{2}=x^{*}, we have the following:

Rearranging the above inequality, we have

The second inequality follows from simple application of Cauchy-Schwarz and Young’s inequality. We now use the standard approach of bounding the regret at each step using convexity of the function ftf_{t} in the following manner:

Base on this Lemma, we are going to find the corresponding upper bound for each term in the above regret bound inequality.

For the first term t=1T[12αt(1β1t)(Vt1/2p(xtx)2Vt1/2p(xt+1x)2)\sum_{t=1}^{T}[\frac{1}{2\alpha_{t}(1-\beta_{1t})}(||V_{t}^{1/2p}(x_{t}-x^{*})||^{2}-||V_{t}^{1/2p}(x_{t+1}-x^{*})||^{2}), we have Lemma A.3.

When Bt/tB_{t}/t is non-increasing, then Vt/αt2Vt1/αt12V_{t}/\alpha^{2}_{t}-V_{t-1}/\alpha^{2}_{t-1} is semi-positive, and

Proof of Lemma A.3:

which means Vt/αt2Vt1/αt12V_{t}/\alpha^{2}_{t}-V_{t-1}/\alpha^{2}_{t-1} is semi-positive.

The third inequation use the knowledge that Vt/αt2Vt1/αt120V_{t}/\alpha^{2}_{t}-V_{t-1}/\alpha^{2}_{t-1}\geq 0.

For the second and the third terms in Lemma A.2, we have Lemma A.4.

Proof of Lemma A.4

The first inequality comes from k=1jbkgkpk=1tbkgkp\sum_{k=1}^{j}b_{k}g_{k}^{p}\leq\sum_{k=1}^{t}b_{k}g_{k}^{p}. The second inequality comes from that Bt/tB_{t}/t is non-increasing with respect to tt. The last inequality follows from then inequality t=jTβ1tj1/(1β1)\sum_{t=j}^{T}\beta_{1}^{t-j}\leq 1/(1-\beta_{1}).

Let Sj=k=1jbkgkpS_{j}=\sum_{k=1}^{j}b_{k}g_{k}^{p},

The first inequality comes from Lemma A.5 when p>1p>1. The last inequality comes from the second constraint, which tells us that Bj/(jbjp)B_{j}/(jb_{j}^{p}) is non-decreasing with respect to jj. This completes the proof of the lemma.

This finally completes the proof of Lemma A.4.

To complete Lemma A.4, we need to finally prove the next Lemma.

Proof of Lemma A.5

Finally, back to our theorem, using the inequalities in Lemma A.4 and Lemma A.3 to substitute the first three terms in Lemma A.2 completes the proof of our theorem.

References