Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate

Haiwen Huang, Chang Wang, Bin Dong

Introduction

Along with the rise of deep learning, various first-order stochastic optimization methods emerged. Among them, the most fundamental one is the stochastic gradient descent, and the Nesterov’s Accelerated Gradient method NESTEROV (1983) is also a well-known acceleration algorithm. Recently, many adaptive stochastic optimization methods have been proposed, such as AdaGrad Duchi et al. (2010), RMSProp Tieleman and Hinton (2012), AdaDelta Zeiler (2012) and Adam Kingma and Ba (2014). These algorithms can be written in the following general form:

where $g_{i}$ is the gradient obtained in the $i$ -th time step, $\alpha_{t}/\psi(g_{1},...,g_{t})$ the adaptive learning rate, and $\phi(g_{1},...,g_{t})$ the gradient estimation. There have been extensive studies on the design of gradient estimations which can be traced back to classical momentum methods Polyak (1964) and NAG NESTEROV (1983). In this paper, however, we focus more on how to understand and improve the adaptive learning rate.

Adam Kingma and Ba (2014) is perhaps the most widely used adaptive stochastic optimization method which uses an exponential moving average (EMA) to estimate the square of the gradient scale, so that the learning rate can be adjusted adaptively. More specifically, Adam takes the form of (1) with

We shall call $v_{t}$ the re-scaling term of the Adam and its variants, since it serves as a coordinate-wise re-scaling of the gradients. Despite its fast convergence and easiness in implementation, Adam is also known for its non-convergence and poor generalization in some cases Reddi et al. (2018)Wilson et al. (2017). More recently, Balles and Hennig (2018) both theoretically and empirically pointed out that generalization is mainly determined by the sign effect rather than the adaptive learning rate, and the sign effect is problem-dependent. In this paper, we are mainly dealing with the non-convergence issue and will only empirically compare generalization ability among different Adam variants.

As for the non-convergence issue, Reddi et al. (2018) suggested that the EMA of $v_{t}$ of Adam is the cause. The main problem lies in the following quantity:

which essentially measures the change in the inverse of learning rate with respect to time. Algorithms that use EMA to estimate the scale of the gradients cannot guarantee the positive semi-definiteness of $\Gamma_{t}$ , and that causes the non-convergence of Adam. To fix this issue, Reddi et al. (2018) proposed AMSGrad, which added one more step $\widehat{v}_{t}=\max\{\widehat{v}_{t-1},v_{t}\}$ in (2). AMSGrad is claimed by its authors to have a “long-term memory” of past gradients.

Another explanation on the cause of non-convergence was recently proposed by Zhou et al. (2018). The authors observed that Adam may diverge because a small gradient may have a large step size which leads to a large update. Therefore, if the small $g_{t}$ with large step size is often in the wrong direction, it could lead to divergence. Thus, they proposed a modification to Adam called AdaShift by replacing $g_{t}^{2}$ with $g_{t-n}^{2}$ for some manually chosen $n$ when calculating $v_{t}$ .

Both AdaShift and AMSGrad suggest that we should not fully trust the gradient information we acquire at current step, and the past gradients are useful when $g_{t}$ is not reliable. In this paper, we take this idea one step further by suggesting that we may weight more of the past gradients than the present ones. We call our algorithm Nostalgic Adam (NosAdam). We will show that the design of the algorithm is inspired by our mathematical analysis on the convergence, and NosAdam has the fastest known convergence rate. Furthermore, we will discuss why “nostalgia” is important, and empirically investigate how different designs of $v_{t}$ can lead to different performances from a loss landscape perspective. Finally, we examine the empirical performance of NosAdam on some common machine learning tasks. The experiments show us that NosAdam is a promising alternative to Adam and its variants.

Related Work

Adam is widely used in both academia and industry. However, it is also one of the least well-understood algorithms. In recent years, some remarkable works provided us with better understanding of the algorithm, and proposed different variants of it. Most of works focused on how to interpret or modify the re-scaling term $v_{t}$ of (2).

As mentioned above, Reddi et al. (2018), Zhou et al. (2018) focused on the non-convergence issue of Adam, and proposed their own modified algorithms. Wilson et al. (2017) pointed out the generalization issue of adaptive optimization algorithms. Based on the assumption that $v_{t}$ is the estimate of the second moment estimate of $g_{t}$ , Balles and Hennig (2018) dissected Adam into sign-based direction and variance adaption magnitude. They also pointed out that the sign-based direction part is the decisive factor of generalization performance, and that is problem-dependent. This in a way addressed the generalization issue raised in Wilson et al. (2017).

However, the interpretation of $v_{t}$ as an estimate of the second moment assumption may not be correct, since Chen and Gu (2019) showed that $v_{t}^{1/2}$ in the Adam update (2) can be replaced by $v_{t}^{p}$ for any $p\in(0,\frac{1}{2}]$ . The modified algorithm is called Padam. In our supplementary material, we also proved that a convergence theorem of a “p-norm” form of NosAdam, where the re-scaling term $v_{t}$ can be essentially viewed as a “p-moment” of $g_{t}$ . These discoveries cast doubts on the the second moment assumption, since both the convergence analysis and empirical performance seemed not so dependent on this assumption.

The true role of $v_{t}$ , however, remains a mystery. In AdaGrad Duchi et al. (2010), which is a special case of NosAdam, the authors mentioned an metaphor that “the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features.” They were suggesting that $v_{t}$ is to some extent balancing the update speeds of different features according to their abundance in the data set. This understanding might be supported by a previous work called SWATS (Switching from Adam to SGD) Keskar and Socher (2017), which uses Adam for earlier epochs and then fix the re-scaling term $v_{t}$ for later epochs. This suggests that there may be some sort of “optimal” re-scaling term, and we can keep using it after we obtain a good enough estimate.

Despite all the previous efforts, our understanding of the re-scaling term $v_{t}$ is still very limited. In this paper, we investigate the issue from a loss landscape approach, and this provides us with some deeper understanding of when and how different Adam-like algorithms can perform well or poorly.

Nostalgic Adam Algorithm

In this section, we introduce the Nostalgic Adam (NosAdam) algorithm, followed by a discussion on its convergence. Let us first consider a general situation where we allow the parameter $\beta_{2}$ in Equation (2) change in time $t$ . Without loss of generality, we may let $\beta_{2,t}=\frac{B_{t-1}}{B_{t}}$ . Then, the NosAdam algorithm reads as in Algorithm 1. Like Adam and its variants, the condition $\Gamma_{t}\succeq 0$ is crucial in ensuring convergence. We will also see that to ensure positive semi-definiteness of $\Gamma_{t}$ , the algorithm naturally requires to weight more of the past gradients than the more recent ones when calculating $v_{t}$ . To see this, we first present the following lemma.

The positive semi-definiteness of $\frac{V_{t}}{\alpha^{2}_{t}}-\frac{V_{t-1}}{\alpha^{2}_{t-1}}$ is tightly satisfied if $\frac{B_{t}}{t}$ is non-increasing.

Here the “tightly satified” in the lemma means the conclusion cannot be strengthened, in that if $\frac{B_{t}}{t}$ is increasing, then $\frac{V_{t}}{\alpha^{2}_{t}}-\frac{V_{t-1}}{\alpha^{2}_{t-1}}$ will be very easily violated since $g_{j}$ can be infinitesimal.

Again, without loss of generality, we can write $B_{t}$ as $\sum_{j=1}^{t}b_{j}$ . Then, it is not hard to see that $\frac{B_{t}}{t}$ is non-increasing if and only if $b_{j}$ is non-increasing. Noting that $v_{t}=\sum_{k=1}^{t}g_{k}^{2}\frac{b_{k}}{B_{t}}$ , we can see that the sufficient condition for positive semi-definiteness of $\Gamma_{t}$ is that in the weighted average $v_{t}$ , the weights of gradients should be non-increasing w.r.t. $t$ . In other words, we should weight more of the past gradients than the more recent ones.

From Algorithm 1, we can see that $v_{t}$ can either decrease or increase based on the relationship between $v_{t-1}$ and $g_{t}^{2}$ , which is the reason why NosAdam circumvents the flaw of AMSGrad (Figure 4). Convergence of NosAdam is also guaranteed as stated by the following theorem.

Let $B_{t}$ and $b_{k}$ be the sequences defined in Algorithm 1, $\alpha_{t}=\alpha/\sqrt{t}$ , $\beta_{1,1}=\beta_{1},\beta_{1,t}\leq\beta_{1}$ for all t. Assume that $\mathscr{F}$ has bounded diameter $D_{\infty}$ and $||\nabla f_{t}(x)||_{\infty}\leq G_{\infty}$ for all t and $x\in\mathscr{F}$ . Furthermore, let $\beta_{2,t}$ be such that the following conditions are satisfied:

Then for $\{x_{t}\}$ generated using NosAdam, we have the following bound on the regret

One notable characteristic of NosAdam, which makes it rather different from the analysis by Reddi et al. (2018), is that the conditions on $B_{t}$ and $b_{t}$ are data-independent and are very easy to check. In particular, if we choose $B_{t}$ as a hyperharmonic series, i.e. $B_{t}=\sum_{k=1}^{t}k^{-\gamma}$ , then the convergence criteria are automatically satisfied. We shall call this special case NosAdam-HH, and its convergence result is summarized in the following corollary.

Suppose $\beta_{1,t}=\beta_{1}\lambda^{t-1}$ , $b_{k}=k^{-\gamma},\gamma\geq 0$ , thus $B_{t}=\sum_{k=1}^{t}k^{-\gamma}$ , and $\beta_{2,t}=B_{t-1}/B_{t}<1$ in Algorithm 1. Then $B_{t}$ and $b_{t}$ satisfy the constraints in Therorem A.1, and we have

Our theory shows that the proposed NosAdam achieves convergence rate of $O(1/\sqrt{T})$ , which is so far the best known convergence rate.

Why Nostalgic?

In this section, we investigate more about the mechanism behind Adam and AMSGrad, and analyze the pros and cons of being “nostalgic”.

As mentioned in Section 1, Reddi et al. (2018) proved that if $\Gamma_{t}$ is positive semi-definite, Adam converges. Otherwise, it may diverge. An example of divergence made by Reddi et al. (2018) is

where $C$ is slightly larger than 2. The correct optimization direction should be -1, while Adam would go towards 1. To fix this, they proposed AMSGrad, which ensures $\Gamma_{t}\succeq 0$ by updating $v_{t}$ as follows

where $\hat{v}_{t}$ is used in the update step.

However, this example is not representative of real situations. Also, the explanation of “long-term memory” by Reddi et al. (2018) is not very illustrative. In the remaining part of this section, we aim to discuss some more realistic senarios and try to understand the pros and cons of different algorithms.

We start from analyzing the different weighting strategies when calculating $v_{t}$ . For Adam,

and the weight $(1-\beta_{2})\beta_{2}^{t-k}$ increases exponentially. For NosAdam,

and for NosAdam-HH, $b_{k}=k^{-\gamma}$ is the $k$ -th term of a hyperharmonic series. For AMSGrad, $v_{t}^{(\text{AMSGrad})}$ is data-dependent and therefore cannot be explicitly expressed. However, $v_{t}^{(\text{AMSGrad})}$ is chosen to be the largest in $\{v_{s}^{(\text{Adam})}:0\leq s\leq t\}$ . Therefore, it can be seen as a shifted version of $v_{t}^{(\text{Adam})}$ , i.e. $v_{s}^{(\text{Adam})}=v_{t-n}^{(\text{Adam})}$ , where $n$ depends on the data. This is similar as AdaShift, where $n$ is instead a hyperparameter. Figure 1 plots the first 100 weights of Adam, NosAdam and AMSGrad, where $\beta_{2}$ , $\gamma$ , $n$ , is chosen as 0.9, 0.1 and 20, respectively.

From the above analysis, we can see that $v_{t}$ of Adam is mainly determined by its most current gradients. Therefore, when $g_{t}$ keeps being small, the adaptive learning rate could be large, which may lead to oscillation of the sequence, and increasing chance of being trapped in local minimum. On the other hand, NosAdam adopts a more stable calculation of $v_{t}$ , since it relies on all the past gradients.

We support the above discussion with an example of an objective function with a bowl-shaped landscape where the global minima is at the bottom of the bowl with lots of local minimum surrounding it. The explicit formula of the objective function is

Figure 2a shows one slice of the function for $z=2.34$ . In the function, $a$ and $b$ determine the depth and width of the global minima, and $c$ , $r$ , $\beta$ determine depth, location and width of the local minimums. In this example, $a$ , $b$ , $c$ , $r$ , $\beta$ are set to 30, 0.007, 0.25, 1, 20, respectively.

Figure 2b shows different trajectories of Adam and NosAdam when they are initiated at the same point on the side of the bowl. As expected, the trajectory of Adam (yellow) passes the global minima and ends up trapped in valley $A$ , while NosAdam (red) gradually converges to the global minima, i.e. valley $B$ .

With the operation of taking max, AMSGrad does not have the same non-convergence issue as discussed above. However, taking max may be problematic as well since $\hat{v}_{t}$ can never decrease. If a very large gradient appears at an iteration, then the adaptive learning rate for all later steps will keep being small. For example, if a large gradient (e.g. 100 times of the original gradient) appears at the $10^{6}$ step in the example (4), we can see that AMSGrad will converge very slowly. This, however, will not be a problem for NosAdam which has the ability of both increasing and decreasing its $v_{t}$ . See Figure 3 for a demonstration. Another example with sharper local minima by setting $b=2$ , $c=4$ , $r=1.3$ is given in Figure 4; and the algorithms are initialized at location A. One can see that the sequence generated by AMSGrad is trapped in a sharp local minima, whereas NosAdam still converges to the global minimum. From these examples we can see that the operation of taking max of AMSGrad has some intrinsic flaws though it promises convergence. The way of computing $v_{t}$ in NosAdam seems superior.

There are also situations in which NosAdam can work poorly. Just because NosAdam is nostalgic, it requires a relatively good initial point to achieve good performances though this is commonly required by most optimization algorithms. However, Adam can be less affected by bad initializations sometime due to its specific way of calculating $v_{t}$ . This gives it a chance of jumping out of the local minimum (and a chance of jumping out of the global minima as well as shown in Figure 2). To demonstrate this, we let both Adam and NosAdma initialize in the valley A (see Figure 5). We can see that the trajectory of Adam manages to jump out of the valley, while it is more difficult for NosAdam to do so.

We note that although NosAdam requires good initialization, it does not necessarily mean initializing near the global minima. Since the algorithm is nostalgic, as long as the initial gradients are pointing towards the right direction, the algorithm may still converge to the global minima even though the initialization is far away from the global minima. As we can see from Figure 4 that NosAdam converges because all of the gradients are good ones at the beginning of the algorithm, which generates enough momentum to help the sequence dashes through the region with sharp local minimum.

Like any Adam-like algorithm, the convergence of NosAdam depends on the loss landscape and initialization. However, if the landscape is as shown in the above figures, then NosAdam has a better chance to converge than Adam and AMSGrad. In practice, it is therefore helpful to first examine the loss landscape before selecting an algorithm. However, it is time consuming to do in general. Nonetheless, earlier studies showed that neural networks with skip connections like ResNet and DenseNet lead to coercive loss functions similar to the one shown in the above figures Li et al. (2018).

Experiments

In this section, we conduct some experiments to compare NosAdam with Adam and its variant AMSGrad. We consider the task of multi-class classification using logistic regression, multi-layer fully connected neural networks and deep convolutional neural networks on MNIST LECUN and CIFAR-10 Krizhevsky et al. . The results generally indicate that NosAdam is a promising algorithm that works well in practice.

Throughout our experiments, we fixed $\beta_{1}$ to be 0.9, $\beta_{2}$ to be 0.999 for Adam and AMSGrad, and search $\gamma$ in $\{\text{1e-1},\text{1e-2},\text{1e-3},\text{1e-4}\}$ for NosAdam. The initial learning rate is chosen from $\{\text{1e-3},\text{2e-3},...,\text{9e-3},\text{1e-2},\text{2e-2},...,\text{9e-2},\text{1e-1},\text{2e-1},...,\text{9e-1}\}$ and the results are reported using the best set of hyper-parameters. All the experiments are done using Pytorch0.4.

Logistic Regression To investigate the performance of the algorithms on convex problems, we evaluate Adam, AMSGrad and NosAdam on multi-class logistic regression problem using the MNIST dataset. To be consistent with the theory, we set the step size $\alpha_{t}=\alpha/\sqrt{t}$ . We set the minibatch size to be 128. According to Figure 6a, the three algorithms have very similar performance.

Multilayer Fully Connected Neural Networks

We first train a simple fully connected neural network with 1 hidden layer (with 100 neurons and ReLU as the activation function) for the multi-class classification problem on MNIST. We use constant step size $\alpha_{t}=\alpha$ throughout the experiments for this set of experiments. The results are shown in Figure 6b. We can see that NosAdam slightly outperforms AMSGrad, while Adam is much worse than both NosAdam and AMSGrad and oscillates a lot. This is due to the difference of the definition of $v_{t}$ for each algorithm: $v_{t}$ in AMSGrad and NosAdam gradually becomes stationary and stays at a good re-scaling value; while $v_{t}$ in Adam does not have such property.

Deep Convolutional Neural Networks Finally, we train a deep convolutional neural network on CIFAR-10. Wide Residual Network Zagoruyko and Komodakis (2016) is known to be able to achieve high accuracy with much less layers than ResNet He et al. (2015). In our experiment, we choose Wide ResNet28. The model is trained on 4 GPUs with the minibatch size 100. The initial learning rate is decayed at epoch 50 and epoch 100 by multiplying 0.1. In our experiments, the optimal performances are usually achieved when the learning rate is around 0.02 for all the three algorithms. For reproducibility, an anonymous link of code will be provided in the supplementary material.

Our results are shown in Figure 7. We observe that NosAdam works slightly better than AMSGrad and Adam in terms of both convergence speed and generalization. This indicates that NosAdam is a promising alternative to Adam and its variants.

Discussion

In this paper, we suggested that we should weight more of the past gradients when designing the adaptive learning rate. In fact, our original intuition came from mathematical analysis of the convergence of Adam-like algorithms. Based on such observation, we then proposed a new algorithm called Nostalgic Adam (NosAdam), and provided a convergence analysis. We also discussed the pros and cons of NosAdam comparing to Adam and AMSGrad using a simple example, which gave us a better idea when NosAdam could be effective.

For future works, we believe that loss landscape analysis and the design of a strategy to choose different algorithms adaptively based on the loss landscape would be worth pursuing. Hopefully, we can design an optimization algorithm that can adaptively adjust its re-scaling term in order to fully exploit the local geometry of the loss landscape.

Acknowledgments

This work would not have existed without the support of BICMR and School of Mathematical Sciences, Peking University. Bin Dong is supported in part by Beijing Natural Science Foundation (Z180001).

Appendix A Convergence of p-NosAdam

In this appendix, we use the same notations as in the paper “Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate”. We are going to prove a more general convergence theorem. In the original paper, we propose NosAdam, as shown in Algorithm 1. But in fact, NosAdam can be considered as a particular case of a more general algorithm, in which we replaces $g_{t}^{2}$ in the calculation of $v_{t}$ by $g_{p}$ , and $v_{t}^{1/2}$ in the update equation by $v_{t}^{1/p}$ . We call this algorithm p-NosAdam, as shown in Algorithm 2. NosAdam is p-NosAdam when $p=2$ .

In the remaining part of this appendix, we are going to prove the convergence theorem of p-NosAdam when $p>1$ . From Theorem A.1, we can see that the regret bound is $O(T^{\text{max}(\frac{1}{p},\frac{p-1}{p})})$ .

Let $B_{t}$ and $b_{k}$ be the sequences defined in p-NosAdam, $\alpha_{t}=\alpha/t^{1/p},p>1$ , $\beta_{1,1}=\beta_{1},\beta_{1,t}\leq\beta_{1}$ for all t. Assume that $\mathscr{F}$ has bounded diameter $D_{\infty}$ and $||\nabla f_{t}(x)||_{\infty}\leq G_{\infty}$ for all t and $x\in\mathscr{F}$ . Furthermore, let $\beta_{2,t}$ be such that the following conditions are satisfied:

Then for $\{x_{t}\}$ generated using p-NosAdam, we have the following bound on the regret

Let $x^{*}=\text{argmin}_{x\in\mathcal{F}}\sum_{t=1}^{T}f_{t}(x)$ . Therefore $R_{T}=\sum_{t=1}^{T}f_{t}(x_{t})-f_{t}(x^{*})$ .

To prove this theorem, we will use the following lemmas.

Using Lemma 4 in Reddi et al. (2018) with $u_{1}=x_{t+1}$ and $u_{2}=x^{*}$ , we have the following:

Rearranging the above inequality, we have

The second inequality follows from simple application of Cauchy-Schwarz and Young’s inequality. We now use the standard approach of bounding the regret at each step using convexity of the function $f_{t}$ in the following manner:

Base on this Lemma, we are going to find the corresponding upper bound for each term in the above regret bound inequality.

For the first term $\sum_{t=1}^{T}[\frac{1}{2\alpha_{t}(1-\beta_{1t})}(||V_{t}^{1/2p}(x_{t}-x^{*})||^{2}-||V_{t}^{1/2p}(x_{t+1}-x^{*})||^{2})$ , we have Lemma A.3.

When $B_{t}/t$ is non-increasing, then $V_{t}/\alpha^{2}_{t}-V_{t-1}/\alpha^{2}_{t-1}$ is semi-positive, and

Proof of Lemma A.3:

which means $V_{t}/\alpha^{2}_{t}-V_{t-1}/\alpha^{2}_{t-1}$ is semi-positive.

The third inequation use the knowledge that $V_{t}/\alpha^{2}_{t}-V_{t-1}/\alpha^{2}_{t-1}\geq 0$ .

For the second and the third terms in Lemma A.2, we have Lemma A.4.

Proof of Lemma A.4

The first inequality comes from $\sum_{k=1}^{j}b_{k}g_{k}^{p}\leq\sum_{k=1}^{t}b_{k}g_{k}^{p}$ . The second inequality comes from that $B_{t}/t$ is non-increasing with respect to $t$ . The last inequality follows from then inequality $\sum_{t=j}^{T}\beta_{1}^{t-j}\leq 1/(1-\beta_{1})$ .

Let $S_{j}=\sum_{k=1}^{j}b_{k}g_{k}^{p}$ ,

The first inequality comes from Lemma A.5 when $p>1$ . The last inequality comes from the second constraint, which tells us that $B_{j}/(jb_{j}^{p})$ is non-decreasing with respect to $j$ . This completes the proof of the lemma.

This finally completes the proof of Lemma A.4.

To complete Lemma A.4, we need to finally prove the next Lemma.

Proof of Lemma A.5

Finally, back to our theorem, using the inequalities in Lemma A.4 and Lemma A.3 to substitute the first three terms in Lemma A.2 completes the proof of our theorem.