AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

Zhiming Zhou, Qingru Zhang, Guansong Lu, Hongwei Wang, Weinan Zhang, Yong Yu

Introduction

One common choice of $\phi(g_{1},\dots,g_{t})$ is the exponential moving average of the gradients used in Momentum (Qian, 1999) and Adam (Kingma & Ba, 2014), which helps alleviate gradient oscillations. The commonly-used $\psi(g_{1},\dots,g_{t})$ in deep learning community is the exponential moving average of squared gradients, such as Adadelta (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014) and Nadam (Dozat, 2016).

Adam (Kingma & Ba, 2014) is a typical adaptive learning rate method, which assembles the idea of using exponential moving average of first and second moments and bias correction. In general, Adam is robust and efficient in both dense and sparse gradient cases, and is popular in deep learning research. However, Adam is shown not being able to converge to optimal solution in certain cases. Reddi et al. (2018) point out that the key issue in the convergence proof of Adam lies in the quantity

which is assumed to be positive, but unfortunately, such an assumption does not always hold in Adam. They provide a set of counterexamples and demonstrate that the violation of positiveness of $\Gamma_{t}$ will lead to undesirable convergence behavior in Adam.

Reddi et al. (2018) then propose two variants, AMSGrad and AdamNC, to address the issue by keeping $\Gamma_{t}$ positive. Specifically, AMSGrad defines $\hat{v_{t}}$ as the historical maximum of $v_{t}$ , i.e., $\hat{v_{t}}=\max~{}\{v_{i}\}^{t}_{i=1}$ , and replaces $v_{t}$ with $\hat{v_{t}}$ to keep $v_{t}$ non-decreasing and therefore forces $\Gamma_{t}$ to be positive; while AdamNC forces $v_{t}$ to have “long-term memory” of past gradients and calculates $v_{t}$ as their average to make it stable. Though these two algorithms solve the non-convergence problem of Adam to a certain extent, they turn out to be inefficient in practice: they have to maintain a very large $v_{t}$ once a large gradient appears, and a large $v_{t}$ decreases the adaptive learning rate $\frac{~{}\alpha_{t}}{\sqrt{v_{t}}}$ and slows down the training process.

In this paper, we provide a new insight into adaptive learning rate methods, which brings a new perspective on solving the non-convergence issue of Adam. Specifically, in Section 3, we study the non-convergence of Adam via analyzing the accumulated step size of each gradient $g_{t}$ . We observe that in the common adaptive learning rate methods, a large gradient tends to have a relatively small step size, while a small gradient is likely to have a relatively large step size. We show that such biased step sizes stem from the inappropriate positive correlation between $v_{t}$ and $g_{t}$ , and we argue that this is the fundamental cause of the non-convergence issue of Adam.

In Section 4, we further prove that decorrelating $v_{t}$ and $g_{t}$ leads to equal and unbiased expected step size for each gradient, thus solving the non-convergence issue of Adam. We subsequently propose AdaShift, a decorrelated variant of adaptive learning rate methods, which achieves decorrelation between $v_{t}$ and $g_{t}$ by calculating $v_{t}$ using temporally shifted gradients. Finally, in Section 5, we study the performance of our proposed AdaShift, and demonstrate that it solves the non-convergence issue of Adam, while still maintaining a decent performance compared with Adam in terms of both training speed and generalization.

Preliminaries

In Adam, $m_{t}$ and $v_{t}$ are defined as the exponential moving average of $g_{t}$ and $g^{2}_{t}$ :

where $\beta_{1}\in[0,1)$ and $\beta_{2}\in[0,1)$ are the exponential decay rates for $m_{t}$ and $v_{t}$ , respectively, with $m_{0}=0$ and $v_{0}=0$ . They can also be written as:

To avoid the bias in the estimation of the expected value at the initial timesteps, Kingma & Ba (2014) propose to apply bias correction to $m_{t}$ and $v_{t}$ . Using $m_{t}$ as instance, it works as follows:

Online optimization problem.

An online optimization problem consists of a sequence of cost functions $f_{1}(\theta),\dots,f_{t}(\theta),\dots,f_{T}(\theta)$ , where the optimizer predicts the parameter $\theta_{t}$ at each time-step $t$ and evaluate it on an unknown cost function $f_{t}(\theta)$ . The performance of the optimizer is usually evaluated by regret $R(T)\triangleq\sum_{t=1}^{T}[f_{t}(\theta_{t})-f_{t}(\theta^{*})]$ , which is the sum of the difference between the online prediction $f_{t}(\theta_{t})$ and the best fixed-point parameter prediction $f_{t}(\theta^{*})$ for all the previous steps, where $\theta^{*}=\operatorname*{arg\,min}_{\theta\in\vartheta}\sum_{t=1}^{T}f_{t}(\theta)$ is the best fixed-point parameter from a feasible set $\vartheta$ .

Counterexamples.

Reddi et al. (2018) highlight that for any fixed $\beta_{1}$ and $\beta_{2}$ , there exists an online optimization problem where Adam has non-zero average regret, i.e., Adam does not converge to optimal solution . The counterexamples in the sequential version are given as follows:

where $C$ is a relatively large constant and $d$ is the length of an epoch. In Equation 6, most gradients of $f_{t}(\theta)$ with respect to $\theta$ are $-1$ , but the large positive gradient $C$ at the beginning of each epoch makes the overall gradient of each epoch positive, which means that one should decrease $\theta_{t}$ to minimize the loss. However, according to (Reddi et al., 2018), the accumulated update of $\theta$ in Adam under some circumstance is opposite (i.e., $\theta_{t}$ is increased), thus Adam cannot converge in such case. Reddi et al. (2018) argue that the reason of the non-convergence of Adam lies in that the positive assumption of $\Gamma_{t}\triangleq(\sqrt{v_{t}}/{\alpha_{t}}-\sqrt{v_{t-1}}/{\alpha_{t-1}})$ does not always hold in Adam.

The counterexamples are also extended to stochastic cases in (Reddi et al., 2018), where a finite set of cost functions appear in a stochastic order. Compared with sequential online optimization counterexample, the stochastic version is more general and closer to the practical situation. For the simplest one dimensional case, at each timestep $t$ , the function $f_{t}(\theta)$ is chosen as i.i.d.:

where $\delta$ is a small positive constant that is smaller than $C$ . The expected cost function of the above problem is $F(\theta)=\frac{1+\delta}{C+1}C\theta-\frac{C-\delta}{C+1}\theta=\delta\theta$ , therefore, one should decrease $\theta$ to minimize the loss. Reddi et al. (2018) prove that when $C$ is large enough, the expectation of accumulated parameter update in Adam is positive and results in increasing $\theta$ .

Basic Solutions

Reddi et al. (2018) propose maintaining the strict positiveness of $\Gamma_{t}$ as solution, for example, keeping $v_{t}$ non-decreasing or using increasing $\beta_{2}$ . In fact, keeping $\Gamma_{t}$ positive is not the only way to guarantee the convergence of Adam. Another important observation is that for any fixed sequential online optimization problem with infinitely repeating epochs (e.g., Equation 6), Adam will converge as long as $\beta_{1}$ is large enough. Formally, we have the following theorem:

The intuition behind Theorem 1 is that, if $\beta_{1}\rightarrow 1$ , then $m_{t}\rightarrow{\sum_{i=1}^{d}g_{i}}/{d}$ , i.e., $m_{t}$ approaches the average gradient of an epoch, according to Equation 5. Therefore, no matter what the adaptive learning rate ${\alpha_{t}}/{\sqrt{v_{t}}}$ is, Adam will always converge along the correct direction.

The cause of non-convergence: biased step size

In this section, we study the non-convergence issue by analyzing the counterexamples provided by Reddi et al. (2018). We show that the fundamental problem of common adaptive learning rate methods is that: $v_{t}$ is positively correlated to the scale of gradient $g_{t}$ , which results in a small step size $\alpha_{t}/\sqrt{v_{t}}$ for a large gradient, and a large step size for a small gradient. We argue that such an biased step size is the cause of non-convergence.

We will first define net update factor for the analysis of the accumulated influence of each gradient $g_{t}$ , then apply the net update factor to study the behaviors of Adam using Equation 6 as an example. The argument will be extended to the stochastic online optimization problem and general cases.

When $\beta_{1}\neq 0$ , due to the exponential moving effect of $m_{t}$ , the influence of $g_{t}$ exists in all of its following timesteps. For timestep $i$ ( $i\geq t$ ), the weight of $g_{t}$ is $(1-\beta_{1})\beta_{1}^{i-t}$ . We accordingly define a new tool for our analysis: the net update $net(g_{t})$ of each gradient $g_{t}$ , which is its accumulated influence on the entire optimization process:

and we call $k(g_{t})$ the net update factor of $g_{t}$ , which is the equivalent accumulated step size for gradient $g_{t}$ . Note that $k(g_{t})$ depends on $\{v_{i}\}^{\infty}_{i=t}$ , and in Adam, if $\beta_{1}\neq 0$ , then all elements in $\{v_{i}\}^{\infty}_{i=t}$ are related to $g_{t}$ . Therefore, $k(g_{t})$ is a function of $g_{t}$ .

It is worth noticing that in Momentum method, $v_{t}$ is equivalently set as $1$ . Therefore, we have $k(g_{t})=\alpha_{t}$ and $net(g_{t})=\alpha_{t}g_{t}$ , which means that the accumulated influence of each gradient $g_{t}$ in Momentum is the same as vanilla SGD (Stochastic Gradient Decent). Hence, the convergence of Momentum is similar to vanilla SGD. However, in adaptive learning rate methods, $v_{t}$ is function over the past gradients, which makes its convergence nontrivial.

2 Analysis on online optimization counterexamples

Note that $v_{t}$ exists in the definition of net update factor (Equation 8). Before further analyzing the convergence of Adam using the net update factor, we first study the pattern of $v_{t}$ in the sequential online optimization problem in Equation 6. Since Equation 6 is deterministic, we can derive the formula of $v_{t}$ as follows:

Given the formula of $v_{t}$ in Equation 9, we now study the net update factor of each gradient. We start with a simple case where $\beta_{1}=0$ . In this case we have

Since the limit of $v_{nd+i}$ in each epoch monotonically decreases with the increase of index $i$ according to Equation 9, the limit of $k(g_{nd+i})$ monotonically increases in each epoch. Specifically, the first gradient $g_{nd+1}=C$ in epoch $n$ represents the correct updating direction, but its influence is the smallest in this epoch. In contrast, the net update factor of the subsequent gradients $-1$ are relatively larger, though they indicate a wrong updating direction.

We further consider the general case where $\beta_{1}\neq 0$ . The result is presented in the following lemma:

In the sequential online optimization problem in Equation 6, when $n\rightarrow\infty$ , the limit of net update factor $k(g_{nd+i})$ of epoch $n$ satisfies: $\exists$ $1\leq j\leq d$ such that

where $k(C)$ denotes the net update factor for gradient $g_{i}=C$ .

Lemma 3 tells us that, in sequential online optimization problem in Equation 6, the net update factors are biased. Specifically, the net update factor for the large gradient $C$ is the smallest in the entire epoch, while all gradients $-1$ have larger net update factors. Such biased net update factors will possibly lead Adam to a wrong accumulated update direction.

Similar conclusion also holds in the stochastic online optimization problem in Equation 7. We derive the expectation of the net update factor for each gradient in the following lemma:

In the stochastic online optimization problem in Equation 7, assuming $\alpha_{t}=1$ , it holds that $k(C)<k(-1)$ , where $k(C)$ denote the expectation net update factor for $g_{i}=C$ and $k(-1)$ denote the expectation net update factor for $g_{i}=-1$ .

Though the formulas of net update factors in the stochastic case are more complicated than those in deterministic case, the analysis is actually more easier: the gradients with the same scale share the same expected net update factor, so we only need to analyze $k(C)$ and $k(-1)$ . From Lemma 4, we can see that in terms of the expectation net update factor, $k(C)$ is smaller than $k(-1)$ , which means the accumulated influence of gradient $C$ is smaller than gradient $-1$ .

3 Analysis on non-convergence of Adam

As we have observed in the previous section, a common characteristic of these counterexamples is that the net update factor for the gradient with large magnitude is smaller than these with small magnitude. The above observation can also be interpreted as a direct consequence of inappropriate correlation between $v_{t}$ and $g_{t}$ . Recall that $v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g^{2}_{t}$ . Assuming $v_{t-1}$ is independent of $g_{t}$ , then: when a new gradient $g_{t}$ arrives, if $g_{t}$ is large, $v_{t}$ is likely to be larger; and if $g_{t}$ is small, $v_{t}$ is also likely to be smaller. If $\beta_{1}=0$ , then $k(g_{t})={\alpha_{t}}/{\sqrt{v_{t}}}$ . As a result, a large gradient is likely to have a small net update factor, while a small gradient is likely to have a large net update factor in Adam.

When it comes to the scenario where $\beta_{1}>0$ , the arguments are actually quite similar. Given $v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g^{2}_{t}$ . Assuming $v_{t-1}$ and $\{g_{t+i}\}^{\infty}_{i=1}$ are independent from $g_{t}$ , then: not only does $v_{t}$ positively correlate with the magnitude of $g_{t}$ , but also the entire infinite sequence $\{v_{i}\}^{\infty}_{i=t}\,$ positively correlates with the magnitude of $g_{t}$ . Since the net update factor $k(g_{t})=\sum_{i=t}^{\infty}{~{}\alpha_{i}}/{\sqrt{v_{i}}}(1-\beta_{1})\beta_{1}^{i-t}$ negatively correlates with each $v_{i}$ in $\{v_{i}\}^{\infty}_{i=t}$ , it is thus negatively correlated with the magnitude of $g_{t}$ . That is, $k(g_{t})$ for a large gradient is likely to be smaller, while $k(g_{t})$ for a small gradient is likely to be larger.

The biased net update factors cause the non-convergence problem of Adam as well as all other adaptive learning rate methods where $v_{t}$ correlates with $g_{t}$ . To construct a counterexample, the same pattern is that: the large gradient is along the “correct” direction, while the small gradient is along the opposite direction. Due to the fact that the accumulated influence of a large gradient is small while the accumulated influence of a small gradient is large, Adam may update parameters along the wrong direction.

Finally, we would like to emphasize that even if Adam updates parameters along the right direction in general, the biased net update factors are still unfavorable since they slow down the convergence.

The proposed method: decorrelation via temporal shifting

According to the previous discussion, we conclude that the main cause of the non-convergence of Adam is the inappropriate correlation between $v_{t}$ and $g_{t}$ . Currently we have two possible solutions: (1) making $v_{t}$ act like a constant, which declines the correlation, e.g., using a large $\beta_{2}$ or keep $v_{t}$ non-decreasing (Reddi et al., 2018); (2) using a large $\beta_{1}$ (Theorem 1), where the aggressive momentum term helps to mitigate the impact of biased net update factors. However, neither of them solves the problem fundamentally.

The dilemma caused by $v_{t}$ enforces us to rethink its role. In adaptive learning rate methods, $v_{t}$ plays the role of estimating the second moments of gradients, which reflects the scale of gradient on average. With the adaptive learning rate $\alpha_{t}/\sqrt{v_{t}}$ , the update step of $g_{t}$ is scaled down by $\sqrt{v_{t}}$ and achieves rescaling invariance with respect to the scale of $g_{t}$ , which is practically useful to make the training process easy to control and the training system robust. However, the current scheme of $v_{t}$ , i.e., $v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g^{2}_{t}$ , brings a positive correlation between $v_{t}$ and $g_{t}$ , which results in reducing the effect of large gradients and increasing the effect of small gradients, and finally causes the non-convergence problem. Therefore, the key is to let $v_{t}$ be a quantity that reflects the scale of the gradients, while at the same time, be decorrelated with current gradient $g_{t}$ . Formally, we have the following theorem:

For any fixed online optimization problem with infinitely repeating of a finite set of cost functions $\{f_{1}(\theta),\dots,f_{t}(\theta),\dots f_{n}(\theta)\}$ , assuming $\beta_{1}=0$ and $\alpha_{t}$ is fixed, we have, if $v_{t}$ follows a fixed distribution and is independent of the current gradient $g_{t}$ , then the expected net update factor for each gradient is identical.

Let $P_{v}$ denote the distribution of $v_{t}$ . In the infinitely repeating online optimization scheme, the expectation of net update factor for each gradient $g_{t}$ is

Momentum (Qian, 1999) can be viewed as setting $v_{t}$ as a constant, which makes $v_{t}$ and $g_{t}$ independent. Furthermore, in our view, using an increasing $\beta_{2}$ (AdamNC) or keeping $\hat{v_{t}}$ as the largest $v_{t}$ (AMSGrad) is also to make $v_{t}$ almost fixed. However, fixing $v_{t}$ is not a desirable solution, because it damages the adaptability of Adam with respect to the adapting of step size.

We next introduce the proposed solution to make $v_{t}$ independent of $g_{t}$ , which is based on temporal independent assumption among gradients. We first introduce the idea of temporal decorrelation, then extend our solution to make use of the spatial information of gradients. Finally, we incorporate first moment estimation. The pseudo code of the proposed algorithm is presented as follows.

In practical setting, $f_{t}(\theta)$ usually involves different mini-batches $x_{t}$ , i.e., $f_{t}(\theta)=f(\theta;x_{t})$ . Given the randomness of mini-batch, we assume that the mini-batch $x_{t}$ is independent of each other and further assume that $f(\theta;x)$ keeps unchanged over time, then the gradient $g_{t}=\nabla f(\theta;x_{t})$ of each mini-batch is independent of each other.

Therefore, we could change the update rule for $v_{t}$ to involve $g_{t-n}$ instead of $g_{t}$ , which makes $v_{t}$ and $g_{t}$ temporally shifted and hence decorrelated:

Note that in the sequential online optimization problem, the assumption “ $g_{t}$ is independent of each other” does not hold. However, in the stochastic online optimization problem and practical neural network settings, our assumption generally holds.

2 Making use of the spatial elements of previous timesteps

Most optimization schemes involve a great many parameters. The dimension of $\theta$ is high, thus $g_{t}$ and $v_{t}$ are also of high dimension. However, $v_{t}$ is element-wisely computed in Equation 14. Specifically, we only use the $i$ -th dimension of $g_{t-n}$ to calculate the $i$ -th dimension of $v_{t}$ . In other words, it only makes use of the independence between $g_{t-n}[i]$ and $g_{t}[i]$ , where $g_{t}[i]$ denotes the $i$ -th element of $g_{t}$ . Actually, in the case of high-dimensional $g_{t}$ and $v_{t}$ , we can further assume that all elements of gradient $g_{t-n}$ at previous timesteps are independent with the $i$ -th dimension of $g_{t}$ . Therefore, all elements in $g_{t-n}$ can be used to compute $v_{t}$ without introducing correlation. To this end, we propose introducing a function $\phi$ over all elements of $g^{2}_{t-n}$ , i.e.,

For easy reference, we name the elements of $g_{t-n}$ other than $g_{t-n}[i]$ as the spatial elements of $g_{t-n}$ and name $\phi$ the spatial function or spatial operation. There is no restriction on the choice of $\phi$ , and we use $\phi(x)=\max_{i}x[i]$ for most of our experiments, which is shown to be a good choice. The $\max_{i}x[i]$ operation has a side effect that turns the adaptive learning rate $v_{t}$ into a shared scalar.

An important thing here is that, we no longer interpret $v_{t}$ as the second moment of $g_{t}$ . It is merely a random variable that is independent of $g_{t}$ , while at the same time, reflects the overall gradient scale. We leave further investigations on $\phi$ as future work.

3 Block-wise adaptive learning rate SGD

In practical setting, e.g., deep neural network, $\theta$ usually consists of many parameter blocks, e.g., the weight and bias for each layer. In deep neural network, the gradient scales (i.e., the variance) for different layers tend to be different (Glorot & Bengio, 2010; He et al., 2015). Different gradient scales make it hard to find a learning rate that is suitable for all layers, when using SGD and Momentum methods. In traditional adaptive learning rate methods, they apply element-wise rescaling for each gradient dimension, which achieves rescaling-invariance and somehow solves the above problem. However, Adam sometimes does not generalize better than SGD (Wilson et al., 2017; Keskar & Socher, 2017), which might relate to the excessive learning rate adaptation in Adam.

In our temporal decorrelation with spatial operation scheme, we can solve the “different gradient scales” issue more naturally, by applying $\phi$ block-wisely and outputs a shared adaptive learning rate scalar $v_{t}[i]$ for each block:

It makes the algorithm work like an adaptive learning rate SGD, where each block has an adaptive learning rate $\alpha_{t}/\sqrt{v_{t}[i]}$ while the relative gradient scale among in-block elements keep unchanged. As illustrated in Algorithm 1, the parameters $\theta_{t}$ including the related $g_{t}$ and $v_{t}$ are divided into $M$ blocks. Every block contains the parameters of the same type or same layer in neural network.

4 Incorporating first moment estimation: moving averaging windows

First moment estimation, i.e., defining $m_{t}$ as a moving average of $g_{t}$ , is an important technique of modern first order optimization algorithms, which alleviates mini-batch oscillations. In this section, we extend our algorithm to incorporate first moment estimation.

We have argued that $v_{t}$ needs to be decorrelated with $g_{t}$ . Analogously, when introducing the first moment estimation, we need to make $v_{t}$ and $m_{t}$ independent to make the expected net update factor unbiased. Based on our assumption of temporal independence, we further keep out the latest $n$ gradients $\{g_{t-i}\}^{n-1}_{i=0}$ , and update $v_{t}$ and $m_{t}$ via

In Equation 17, $\beta_{1}\in$ plays the role of decay rate for temporal elements. It can be viewed as a truncated version of exponential moving average that only applied to the latest few elements. Since we use truncating, it is feasible to use large $\beta_{1}$ without taking the risk of using too old gradients. In the extreme case where $\beta_{1}=1$ , it becomes vanilla averaging.

The key difference between Adam and the proposed method is that the latter temporally shifts the gradient $g_{t}$ for $n$ -step, i.e., using $g_{t-n}$ for calculating $v_{t}$ and using the kept-out $n$ gradients to evaluate $m_{t}$ (Equation 17), which makes $v_{t}$ and $m_{t}$ decorrelated and consequently solves the non-convergence issue. In addition, based on our new perspective on adaptive learning rate methods, $v_{t}$ is not necessarily the second moment and it is valid to further involve the calculation of $v_{t}$ with the spatial elements of previous gradients. We thus proposed to introduce the spatial operation $\phi$ that outputs a shared scalar for each block. The resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale in each block is maintained. We name the proposed method that makes use of temporal-shifting to decorrelated $v_{t}$ and $m_{t}$ AdaShift, which means “ADAptive learning rate method with temporal SHIFTing”.

Experiments

In this section, we empirically study the proposed method and compare them with Adam, AMSGrad and SGD, on various tasks in terms of training performance and generalization. Without additional declaration, the reported result for each algorithm is the best we have found via parameter grid search. The anonymous code is provided at http://bit.ly/2NDXX6x.

Firstly, we verify our analysis on the stochastic online optimization problem in Equation 7, where we set $C=101$ and $\delta=0.02$ . We compare Adam, AMSGrad and AdaShift in this experiment. For fair comparison, we set $\alpha=0.001$ , $\beta_{1}=0$ and $\beta_{2}=0.999$ for all these methods. The results are shown in Figure 1(a). We can see that Adam tends to increase $\theta$ , that is, the accumulate update of $\theta$ in Adam is along the wrong direction, while AMSGrad and AdaShift update $\theta$ in the correct direction. Furthermore, given the same learning rate, AdaShift decreases $\theta$ faster than AMSGrad, which validates our argument that AMSGrad has a relatively higher $v_{t}$ that slows down the training.

In this experiment, we also verify Theorem 1. As shown in Figure 1(b), Adam is also able to converge to the correct direction with a sufficiently large $\beta_{1}$ and $\beta_{2}$ . Note that (1) AdaShift still converges with the fastest speed; (2) a small $\beta_{1}$ (e.g., $\beta_{1}=0.9$ , the light-blue line in Figure 1(b)) does not make Adam converge to the correct direction. We do not conduct the experiments on the sequential online optimization problem in Equation 6, because it does not fit our temporal independence assumption. To make it converge, one can use a large $\beta_{1}$ or $\beta_{2}$ , or set $v_{t}$ as a constant.

2 Logistic Regression and Multilayer Perceptron on MNIST

We further compare the proposed method with Adam, AMSGrad and SGD by using Logistic Regression and Multilayer Perceptron on MNIST, where the Multilayer Perceptron has two hidden layers and each has $256$ hidden units with no internal activation. The results are shown in Figure 2 and Figure 3, respectively. We find that in Logistic Regression, these learning algorithms achieve very similar final results in terms of both training speed and generalization. In Multilayer Perceptron, we compare Adam, AMSGrad and AdaShift with reduce-max spatial operation (max-AdaShift) and without spatial operation (non-AdaShift). We observe that max-AdaShift achieves the lowest training loss, while non-AdaShift has mild training loss oscillation and at the same time achieves better generalization. The worse generalization of max-AdaShift may be due to overfitting in this task, and the better generalization of non-AdaShift may stem from the regularization effect of its relatively unstable step size.

3 DenseNet and ResNet on CIFAR-10

ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) are two typical modern neural networks, which are efficient and widely-used. We test our algorithm with ResNet and DenseNet on CIFAR-10 datasets. We use a $18$ -layer ResNet and $100$ -layer DenseNet in our experiments. We plot the best results of Adam, AMSGrad and AdaShift in Figure 4 and Figure 5 for ResNet and DenseNet, respectively. We can see that AMSGrad is relatively worse in terms of both training speed and generalization. Adam and AdaShift share competitive results, while AdaShift is generally slightly better, especially the test accuracy of ResNet and the training loss of DenseNet.

4 DenseNet with Tiny-ImageNet

We further increase the complexity of dataset, switching from CIFAR-10 to Tiny-ImageNet, and compare the performance of Adam, AMSGrad and AdaShift with DenseNet. The results are shown in Figure 6, from which we can see that the training curves of Adam and AdaShift are basically overlapped, but AdaShift achieves higher test accuracy than Adam. AMSGrad has relatively higher training loss, and its test accuracy is relatively lower at the initial stage.

5 Generative model and Recurrent model

We also test our algorithm on the training of generative model and recurrent model. We choose WGAN-GP (Gulrajani et al., 2017) that involves Lipschitz continuity condition (which is hard to optimize), and Neural Machine Translation (NMT) (Luong et al., 2017) that involves typical recurrent unit LSTM, respectively. In Figure 7(a), we compare the performance of Adam, AMSGrad and AdaShift in the training of WGAN-GP discriminator, given a fixed generator. We notice that AdaShift is significantly better than Adam, while the performance of AMSGrad is relatively unsatisfactory. The test performance in terms of BLEU of NMT is shown in Figure 7(b), where AdaShift achieves a higher BLEU than Adam and AMSGrad.

Conclusion

In this paper, we study the non-convergence issue of adaptive learning rate methods from the perspective of the equivalent accumulated step size of each gradient, i.e., the net update factor defined in this paper. We show that there exists an inappropriate correlation between $v_{t}$ and $g_{t}$ , which leads to biased net update factor for each gradient. We demonstrate that such biased step sizes are the fundamental cause of non-convergence of Adam, and we further prove that decorrelating $v_{t}$ and $g_{t}$ will lead to unbiased expected step size for each gradient, thus solving the non-convergence problem of Adam. Finally, we propose AdaShift, a novel adaptive learning rate method that decorrelates $v_{t}$ and $g_{t}$ via calculating $v_{t}$ using temporally shifted gradient $g_{t-n}$ .

In addition, based on our new perspective on adaptive learning rate methods, $v_{t}$ is no longer necessarily the second moment of $g_{t}$ , but a random variable that is independent of $g_{t}$ and reflects the overall gradient scale. Thus, it is valid to calculate $v_{t}$ with the spatial elements of previous gradients. We further found that when the spatial operation $\phi$ outputs a shared scalar for each block, the resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale in each block is maintained. The experiment results demonstrate that AdaShift is able to solve the non-convergence issue of Adam. In the meantime, AdaShift achieves competitive and even better training and testing performance when compared with Adam.

References

To provide an intuitive impression on the relation among $C,d,\beta_{1},\beta_{2}$ and the convergence of Adam, we let $C=d=6$ , initialize $\theta_{1}=0$ , vary $\beta_{1}$ and $\beta_{2}$ among $[0,1)$ and let Adam go through 2000 timesteps (iterations). The final result of $\theta$ is shown in Figure 8(a). It suggests that for a fixed sequential online optimization problem, both of $\beta_{1}$ and $\beta_{2}$ determine the direction and speed of Adam optimization process. Furthermore, we also study the threshold point of $C$ and $d$ , under which Adam will change to the incorrect direction, for each fixed $\beta_{1}$ and $\beta_{2}$ that vary among $[0,1)$ . To simplify the experiments, we keep $d=C$ such that the overall gradient of each epoch being $+1$ . The result is shown in Figure 8(b), which suggests, at the condition of larger $\beta_{1}$ or larger $\beta_{2}$ , it needs a larger $C$ to make Adam stride on the opposite direction. In other words, large $\beta_{1}$ and $\beta_{2}$ will make the non-convergence rare to happen.

We also conduct the experiment in the stochastic problem to analyze the relation among $C$ , $\beta_{1}$ , $\beta_{2}$ and the convergence behavior of Adam. Results are shown in the Figure 8(c) and Figure 8(d) and the observations are similar to the previous: larger $C$ will cause non-convergence more easily and a larger $\beta_{1}$ or $\beta_{2}$ somehow help to resolve non-convergence issue. In this experiment, we set $\delta=1$ .

In the sequential online optimization problem Equation 6, let $\alpha_{t}$ being fixed, define $\mathcal{S}(\beta_{1},\beta_{2},C,d)$ to be the sum of the limits of step updates in a $d$ -step epoch:

Let $\mathcal{S}(\beta_{1},\beta_{2},C)=0$ , assuming $\beta_{2}$ and $C$ are large enough such that $v_{t}\gg 1$ , we get the equation:

Equation 19, though being quite complex, tells that both $\beta_{1}$ and $\beta_{2}$ are closely related to the counterexamples, and there exists a critical condition among these parameters.

Appendix B The AdaShift Pseudo Code

We provided the anonymous code where a Tensorflow implementation of this algorithm is available.

In order to verify the correlation between $g_{t}$ and $v_{t}$ in Adam and AdaShift, we conduct experiments to calculate the correlation coefficient between $g_{t}$ and $v_{t}$ . We train the Multilayer Perceptron on MNIST until converge and gather the gradient of the second hidden layer of each step. Based on these data, we calculate $v_{t}$ and the correlation coefficient between $g_{t}[i]$ and $g_{t-n}[i]$ , between $g_{t}[i]$ and $g_{t-n}[j]$ and between $g_{t}[i]$ and $v_{t}[i]$ of the last $10$ epochs using the Pearson correlation coefficient, which is formulated as follows:

To verify the temporal correlation between $g_{t}[i]$ and $g_{t-n}[i]$ , we range $n$ from $1$ to $10$ and calculate the average temporal correlation coefficient of all variables $i$ . Results are shown in Table 1.

To verify the spatial correlation between $g_{t}[i]$ and $g_{t-n}[j]$ , we again range $n$ from $1$ to $10$ and randomly sample some pairs of $i$ and $j$ and calculate the average spatial correlation coefficient of all the selected pairs. Results are shown in Table 2.

To verify the correlation between $g_{t}[i]$ and $v_{t}[i]$ within Adam, we calculate $v_{t}$ and the average correlation coefficient between $g_{t}^{2}$ and $v_{t}$ of all variables $i$ . The result is 0.435885276.

To verify the correlation between $g_{t-n}[i]$ and $v_{t}[i]$ within non-AdaShift and between $g_{t-n}[i]$ and $v_{t}$ within max-AdaShift, we range the keep number $n$ from $1$ to $10$ to calculate $v_{t}$ and the average correlation coefficient of all variables $i$ . The result is shown in Table 3 and Table 4.

Appendix D Proof of Theorem 1

With bias correction, the formulation of $m_{t}$ is written as follows

According to L’Hospital‘s rule, we can draw the following:

According to the definition of limitation, let $g^{*}=\frac{\sum_{i=1}^{t}g_{i}}{t}$ , we have, $\forall\epsilon>0$ , $\exists\beta_{1}\in(0,1)$ , such that

We set $\epsilon$ to be $|\frac{g^{*}}{2}|$ , then for each dimension of $m_{t}$ , i.e. $m_{t}[i]$ ,

So, $m_{t}$ shares the same sign with $g^{*}$ in every dimension.

Given it is a convex optimization problem, let the optimal parameter be $\theta^{*}$ , and the maximum step size is $\frac{\alpha_{t}}{\sqrt{v_{t}}}G$ that holds $\epsilon_{1}/G<\frac{\alpha_{t}}{\sqrt{v_{t}}}G<\epsilon_{2}/G$ , we have,

Given $\lVert\nabla f_{t}(\theta)\rVert_{\infty}\leq G$ , we have $f_{t}(\theta)-f_{t}(\theta^{*})<\epsilon_{2}$ , which implies the average regret

Appendix E Proof of Lemma 2

For a fixed $d$ , as $n$ approach infinity, we get the limit of $m_{nd+i}$ as:

For a fixed $d$ , as $n$ approach infinity, we get the limit of $v_{nd+i}$ as:

Appendix F Proof of Lemma 3

Thus, we can get the forward difference of $k(g_{nd+i})$ as:

Hence, we can draw the conclusion: $\exists 1\leq j\leq d$ , such that

where $K(C)$ is the net update factor for gradient $g_{i}=C$ . ∎

Appendix G Proof of Lemma 4

For a bounded random variable $X$ and a differentiable function $f(x)$ , the expectation of $f(X)$ is as follows:

where $D(X)$ is variance of $X$ , and $R_{3}$ is as follows:

$F(x)$ is the distribution function of $X$ . $R_{3}$ is a small quantity under some condition. And $c$ is large enough, such that: for any $\epsilon>0$ ,

(Proof of Lemma 4 ) In the stochastic online optimization problem equation 7, the gradient subjects the distribution as:

Then we can get the expectation of $g_{i}$ :

Meanwhile, under the assumption that gradients are i.i.d., the expectation and variance of $v_{i}$ are as following when $nd\to\infty$ :

Then, for the gradient $g_{i}$ , the net update factor is as follows:

It should to be clarified that we define $\sum_{j=1}^{t}\beta_{2}^{t-j}g_{i+j}^{2}$ equal to zero when $t=0$ . Then we define $X_{t}$ as:

For the function $f(x)=\frac{1}{\sqrt{x}}$ :

According to lemma 1, we can the expectation of $f(X_{t})$ as follows:

where $D_{t}=D[X_{k}]$ . Then for gradient $C$ and $-1$ , the net update factor is as follows:

We can see that each term in the infinite series of $k(C)$ is smaller than the corresponding one in $k(-1)$ . Thus, $k(C)<k(-1)$ .

Appendix H Proof of Lemma 6

We sum up all updates in an epoch, and define the summation as $\mathcal{S}(\beta_{1},\beta_{2},C)$ .

Assume $\beta_{2}$ and $C$ are large enough such that $v_{t}\gg 1$ , we get the approximation of limit of $v_{nd+i}$ as:

Then we can draw the expression of $\mathcal{S}(\beta_{1},\beta_{2},C)$ as:

Let $\mathcal{S}(\beta_{1},\beta_{2},C)=0$ , we get the equation about critical condition:

Appendix I Hyper-parameters Investigation

Here, we list all hyper-parameter setting of all above experiments.

In this section, we discuss the learning rate $\alpha_{t}$ sensitivity of AdaShift. We set $\alpha_{t}\in\{0.1,0.01,0.001\}$ and let $n=10$ , $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . The results are shown in Figure 9 and Figure 10. Empirically, we found that when using the $max$ spatial operation, the best learning rate for AdaShift is around ten times of Adam.

In this section, we discuss the $\beta_{1}$ and $\beta_{2}$ sensitivity of AdaShift. We set $\alpha=0.01$ , $n=10$ and let $\beta_{1}\in\{0,0.9\}$ and $\beta_{2}\in\{0.9,0.99,0.999\}$ . The results are shown in Figure 11 and Figure 12. According to the results, AdaShift holds a low sensitivity to $\beta_{1}$ and $\beta_{2}$ . In some tasks, using the first moment estimation (with $\beta_{1}=0.9$ and $n=10$ ) or using a large $\beta_{2}$ , e.g., $0.999$ can attain better performance. The suggested parameters setting is $n=10,\beta_{1}=0.9,\beta_{2}=0.999$ .

I.4 n𝑛n and m𝑚m Sensitivity

In this section, we discuss the $n$ sensitivity of AdaShift. Here we also test a extended version of first moment estimation where it only uses the latest $m$ gradients ( $m\leq n$ ):

We set $\beta_{1}=0.9$ , $\beta_{2}=0.999$ . The results are shown in Figure 13, Figure 14 and Figure 15. In these experiments, AdaShift is fairly stable when changing $n$ and $m$ . We have not find a clear pattern on the performance change with respect to $n$ and $m$ .

Appendix J Temporal-only and Spatial-only

In our proposed algorithm, we apply a spatial operation on the temporally shifted gradient $g_{t-n}$ to update $v_{t}$ : $v_{t}[i]=\beta_{2}v_{t-1}[i]+(1-\beta_{2})\phi(g^{2}_{t-n}[i])$ . It is based on the temporal independent assumption, i.e., $g_{t-n}$ is independent of $g_{t}$ . And according to our argument in Section 4.2, one can further assume every element in $g_{t-n}$ is independent of the $i$ -th dimension of $g_{t}$ .

We purposely avoid involving the spatial elements of the current gradient $g_{t}$ , where the independence might not holds: when a sample which is rare and has a large gradient appear in the mini-batch $x_{t}$ , the overall scale of gradient $g_{t}$ might increase. However, for the temporally already decorrelation $g_{t-i}$ , further taking the advantage of the spatial irrelevance will not suffer from this problem.

We here provide extended experiments on two variants of AdaShift: (i) AdaShift (temporal-only), which only uses the vanilla temporal independent assumption and evaluate $v_{t}$ with: $v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g_{t-n}^{2}$ ; (ii) AdaShift (spatial-only), which directly uses the spatial elements without temporal shifting.

According to our experiments, AdaShift (temporal-only), i.e., without the spatial operation, is less stable than AdaShift. In some tasks, AdaShift (temporal-only) works just fine; while in some other cases, AdaShift (temporal-only) suffers from explosive gradient and requires a relatively small learning rate. The performance of AdaShift (spatial-only) is close to Adam. More experiments for AdaShift (spatial-only) are included in the next section.

Appendix K Extended Experiments: Nadam and AdaShift(Space Only)

In this section, we extend the experiments and add the comparisons with Nadam and AdaShift (spatial-only). The results are shown in Figure 18, Figure19 and Figure20. According to these experiments, Nadam and AdaShift (spatial-only) share similar performence as Adam.

Appendix L Extension experiments: ill-conditioned quadratic problem

Rahimi & Recht raise the point, at “test of time” talk at NIPS 2017, that it is suspicious that gradient descent (aka back-propagation) is ultimate solution for optimization. A ill-conditioned quadratic problem with Two Layer Linear Net is showed to be challenging for gradient descent based methods, while alternative solutions, e.g., Levenberg-Marquardt, may converge faster and better. The problem is defined as follows:

where $A$ is some known badly conditioned matrix ( $k=10^{20}$ or $10^{5}$ ), and $W_{1}$ and $W_{2}$ are the trainable parameters.

We test SGD, Adam and AdaShift with this problem, the results are shown in Figure 21, Figure 24. It turns out as long as the training goes enough long, SGD, Adam, AdaShift all basically converge in this problem. Though SGD is significantly better than Adam and AdaShift.

We would tend to believe this is a general issue of adaptive learning rate method when comparing with vanilla SGD. Because these adaptive learning rate methods generally are scale-invariance, i.e., the step-size in terms of $g_{t}/sqrt(v_{t})$ is basically around one, which makes it hard to converge very well in such a ill-conditioning quadratic problem. SGD, in contrast, has a step-size $g_{t}$ ; as the training converges SGD would have a decreasing step-size, makes it much easier to converge better. The above analysis is confirmed with Figure 22 and Figure 23, with a decreasing learning rate, Adam and AdaShfit both converge very good.