AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

Zhiming Zhou, Qingru Zhang, Guansong Lu, Hongwei Wang, Weinan Zhang, Yong Yu

Introduction

One common choice of ϕ(g1,,gt)\phi(g_{1},\dots,g_{t}) is the exponential moving average of the gradients used in Momentum (Qian, 1999) and Adam (Kingma & Ba, 2014), which helps alleviate gradient oscillations. The commonly-used ψ(g1,,gt)\psi(g_{1},\dots,g_{t}) in deep learning community is the exponential moving average of squared gradients, such as Adadelta (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014) and Nadam (Dozat, 2016).

Adam (Kingma & Ba, 2014) is a typical adaptive learning rate method, which assembles the idea of using exponential moving average of first and second moments and bias correction. In general, Adam is robust and efficient in both dense and sparse gradient cases, and is popular in deep learning research. However, Adam is shown not being able to converge to optimal solution in certain cases. Reddi et al. (2018) point out that the key issue in the convergence proof of Adam lies in the quantity

which is assumed to be positive, but unfortunately, such an assumption does not always hold in Adam. They provide a set of counterexamples and demonstrate that the violation of positiveness of Γt\Gamma_{t} will lead to undesirable convergence behavior in Adam.

Reddi et al. (2018) then propose two variants, AMSGrad and AdamNC, to address the issue by keeping Γt\Gamma_{t} positive. Specifically, AMSGrad defines vt^\hat{v_{t}} as the historical maximum of vtv_{t}, i.e., vt^=max {vi}i=1t\hat{v_{t}}=\max~{}\{v_{i}\}^{t}_{i=1}, and replaces vtv_{t} with vt^\hat{v_{t}} to keep vtv_{t} non-decreasing and therefore forces Γt\Gamma_{t} to be positive; while AdamNC forces vtv_{t} to have “long-term memory” of past gradients and calculates vtv_{t} as their average to make it stable. Though these two algorithms solve the non-convergence problem of Adam to a certain extent, they turn out to be inefficient in practice: they have to maintain a very large vtv_{t} once a large gradient appears, and a large vtv_{t} decreases the adaptive learning rate  αtvt\frac{~{}\alpha_{t}}{\sqrt{v_{t}}} and slows down the training process.

In this paper, we provide a new insight into adaptive learning rate methods, which brings a new perspective on solving the non-convergence issue of Adam. Specifically, in Section 3, we study the non-convergence of Adam via analyzing the accumulated step size of each gradient gtg_{t}. We observe that in the common adaptive learning rate methods, a large gradient tends to have a relatively small step size, while a small gradient is likely to have a relatively large step size. We show that such biased step sizes stem from the inappropriate positive correlation between vtv_{t} and gtg_{t}, and we argue that this is the fundamental cause of the non-convergence issue of Adam.

In Section 4, we further prove that decorrelating vtv_{t} and gtg_{t} leads to equal and unbiased expected step size for each gradient, thus solving the non-convergence issue of Adam. We subsequently propose AdaShift, a decorrelated variant of adaptive learning rate methods, which achieves decorrelation between vtv_{t} and gtg_{t} by calculating vtv_{t} using temporally shifted gradients. Finally, in Section 5, we study the performance of our proposed AdaShift, and demonstrate that it solves the non-convergence issue of Adam, while still maintaining a decent performance compared with Adam in terms of both training speed and generalization.

Preliminaries

In Adam, mtm_{t} and vtv_{t} are defined as the exponential moving average of gtg_{t} and gt2g^{2}_{t}:

where β1[0,1)\beta_{1}\in[0,1) and β2[0,1)\beta_{2}\in[0,1) are the exponential decay rates for mtm_{t} and vtv_{t}, respectively, with m0=0m_{0}=0 and v0=0v_{0}=0. They can also be written as:

To avoid the bias in the estimation of the expected value at the initial timesteps, Kingma & Ba (2014) propose to apply bias correction to mtm_{t} and vtv_{t}. Using mtm_{t} as instance, it works as follows:

Online optimization problem.

An online optimization problem consists of a sequence of cost functions f1(θ),,ft(θ),,fT(θ)f_{1}(\theta),\dots,f_{t}(\theta),\dots,f_{T}(\theta), where the optimizer predicts the parameter θt\theta_{t} at each time-step tt and evaluate it on an unknown cost function ft(θ)f_{t}(\theta). The performance of the optimizer is usually evaluated by regret R(T)t=1T[ft(θt)ft(θ)]R(T)\triangleq\sum_{t=1}^{T}[f_{t}(\theta_{t})-f_{t}(\theta^{*})], which is the sum of the difference between the online prediction ft(θt)f_{t}(\theta_{t}) and the best fixed-point parameter prediction ft(θ)f_{t}(\theta^{*}) for all the previous steps, where θ=arg minθϑt=1Tft(θ)\theta^{*}=\operatorname*{arg\,min}_{\theta\in\vartheta}\sum_{t=1}^{T}f_{t}(\theta) is the best fixed-point parameter from a feasible set ϑ\vartheta.

Counterexamples.

Reddi et al. (2018) highlight that for any fixed β1\beta_{1} and β2\beta_{2}, there exists an online optimization problem where Adam has non-zero average regret, i.e., Adam does not converge to optimal solution . The counterexamples in the sequential version are given as follows:

where CC is a relatively large constant and dd is the length of an epoch. In Equation 6, most gradients of ft(θ)f_{t}(\theta) with respect to θ\theta are 1-1, but the large positive gradient CC at the beginning of each epoch makes the overall gradient of each epoch positive, which means that one should decrease θt\theta_{t} to minimize the loss. However, according to (Reddi et al., 2018), the accumulated update of θ\theta in Adam under some circumstance is opposite (i.e., θt\theta_{t} is increased), thus Adam cannot converge in such case. Reddi et al. (2018) argue that the reason of the non-convergence of Adam lies in that the positive assumption of Γt(vt/αtvt1/αt1)\Gamma_{t}\triangleq(\sqrt{v_{t}}/{\alpha_{t}}-\sqrt{v_{t-1}}/{\alpha_{t-1}}) does not always hold in Adam.

The counterexamples are also extended to stochastic cases in (Reddi et al., 2018), where a finite set of cost functions appear in a stochastic order. Compared with sequential online optimization counterexample, the stochastic version is more general and closer to the practical situation. For the simplest one dimensional case, at each timestep tt, the function ft(θ)f_{t}(\theta) is chosen as i.i.d.:

where δ\delta is a small positive constant that is smaller than CC. The expected cost function of the above problem is F(θ)=1+δC+1CθCδC+1θ=δθF(\theta)=\frac{1+\delta}{C+1}C\theta-\frac{C-\delta}{C+1}\theta=\delta\theta, therefore, one should decrease θ\theta to minimize the loss. Reddi et al. (2018) prove that when CC is large enough, the expectation of accumulated parameter update in Adam is positive and results in increasing θ\theta.

Basic Solutions

Reddi et al. (2018) propose maintaining the strict positiveness of Γt\Gamma_{t} as solution, for example, keeping vtv_{t} non-decreasing or using increasing β2\beta_{2}. In fact, keeping Γt\Gamma_{t} positive is not the only way to guarantee the convergence of Adam. Another important observation is that for any fixed sequential online optimization problem with infinitely repeating epochs (e.g., Equation 6), Adam will converge as long as β1\beta_{1} is large enough. Formally, we have the following theorem:

The intuition behind Theorem 1 is that, if β11\beta_{1}\rightarrow 1, then mti=1dgi/dm_{t}\rightarrow{\sum_{i=1}^{d}g_{i}}/{d}, i.e., mtm_{t} approaches the average gradient of an epoch, according to Equation 5. Therefore, no matter what the adaptive learning rate αt/vt{\alpha_{t}}/{\sqrt{v_{t}}} is, Adam will always converge along the correct direction.

The cause of non-convergence: biased step size

In this section, we study the non-convergence issue by analyzing the counterexamples provided by Reddi et al. (2018). We show that the fundamental problem of common adaptive learning rate methods is that: vtv_{t} is positively correlated to the scale of gradient gtg_{t}, which results in a small step size αt/vt\alpha_{t}/\sqrt{v_{t}} for a large gradient, and a large step size for a small gradient. We argue that such an biased step size is the cause of non-convergence.

We will first define net update factor for the analysis of the accumulated influence of each gradient gtg_{t}, then apply the net update factor to study the behaviors of Adam using Equation 6 as an example. The argument will be extended to the stochastic online optimization problem and general cases.

When β10\beta_{1}\neq 0, due to the exponential moving effect of mtm_{t}, the influence of gtg_{t} exists in all of its following timesteps. For timestep ii (iti\geq t), the weight of gtg_{t} is (1β1)β1it(1-\beta_{1})\beta_{1}^{i-t}. We accordingly define a new tool for our analysis: the net update net(gt)net(g_{t}) of each gradient gtg_{t}, which is its accumulated influence on the entire optimization process:

and we call k(gt)k(g_{t}) the net update factor of gtg_{t}, which is the equivalent accumulated step size for gradient gtg_{t}. Note that k(gt)k(g_{t}) depends on {vi}i=t\{v_{i}\}^{\infty}_{i=t}, and in Adam, if β10\beta_{1}\neq 0, then all elements in {vi}i=t\{v_{i}\}^{\infty}_{i=t} are related to gtg_{t}. Therefore, k(gt)k(g_{t}) is a function of gtg_{t}.

It is worth noticing that in Momentum method, vtv_{t} is equivalently set as 11. Therefore, we have k(gt)=αtk(g_{t})=\alpha_{t} and net(gt)=αtgtnet(g_{t})=\alpha_{t}g_{t}, which means that the accumulated influence of each gradient gtg_{t} in Momentum is the same as vanilla SGD (Stochastic Gradient Decent). Hence, the convergence of Momentum is similar to vanilla SGD. However, in adaptive learning rate methods, vtv_{t} is function over the past gradients, which makes its convergence nontrivial.

2 Analysis on online optimization counterexamples

Note that vtv_{t} exists in the definition of net update factor (Equation 8). Before further analyzing the convergence of Adam using the net update factor, we first study the pattern of vtv_{t} in the sequential online optimization problem in Equation 6. Since Equation 6 is deterministic, we can derive the formula of vtv_{t} as follows:

Given the formula of vtv_{t} in Equation 9, we now study the net update factor of each gradient. We start with a simple case where β1=0\beta_{1}=0. In this case we have

Since the limit of vnd+iv_{nd+i} in each epoch monotonically decreases with the increase of index ii according to Equation 9, the limit of k(gnd+i)k(g_{nd+i}) monotonically increases in each epoch. Specifically, the first gradient gnd+1=Cg_{nd+1}=C in epoch nn represents the correct updating direction, but its influence is the smallest in this epoch. In contrast, the net update factor of the subsequent gradients 1-1 are relatively larger, though they indicate a wrong updating direction.

We further consider the general case where β10\beta_{1}\neq 0. The result is presented in the following lemma:

In the sequential online optimization problem in Equation 6, when nn\rightarrow\infty, the limit of net update factor k(gnd+i)k(g_{nd+i}) of epoch nn satisfies: \exists 1jd1\leq j\leq d such that

where k(C)k(C) denotes the net update factor for gradient gi=Cg_{i}=C.

Lemma 3 tells us that, in sequential online optimization problem in Equation 6, the net update factors are biased. Specifically, the net update factor for the large gradient CC is the smallest in the entire epoch, while all gradients 1-1 have larger net update factors. Such biased net update factors will possibly lead Adam to a wrong accumulated update direction.

Similar conclusion also holds in the stochastic online optimization problem in Equation 7. We derive the expectation of the net update factor for each gradient in the following lemma:

In the stochastic online optimization problem in Equation 7, assuming αt=1\alpha_{t}=1, it holds that k(C)<k(1)k(C)<k(-1), where k(C)k(C) denote the expectation net update factor for gi=Cg_{i}=C and k(1)k(-1) denote the expectation net update factor for gi=1g_{i}=-1.

Though the formulas of net update factors in the stochastic case are more complicated than those in deterministic case, the analysis is actually more easier: the gradients with the same scale share the same expected net update factor, so we only need to analyze k(C)k(C) and k(1)k(-1). From Lemma 4, we can see that in terms of the expectation net update factor, k(C)k(C) is smaller than k(1)k(-1), which means the accumulated influence of gradient CC is smaller than gradient 1-1.

3 Analysis on non-convergence of Adam

As we have observed in the previous section, a common characteristic of these counterexamples is that the net update factor for the gradient with large magnitude is smaller than these with small magnitude. The above observation can also be interpreted as a direct consequence of inappropriate correlation between vtv_{t} and gtg_{t}. Recall that vt=β2vt1+(1β2)gt2v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g^{2}_{t}. Assuming vt1v_{t-1} is independent of gtg_{t}, then: when a new gradient gtg_{t} arrives, if gtg_{t} is large, vtv_{t} is likely to be larger; and if gtg_{t} is small, vtv_{t} is also likely to be smaller. If β1=0\beta_{1}=0, then k(gt)=αt/vtk(g_{t})={\alpha_{t}}/{\sqrt{v_{t}}}. As a result, a large gradient is likely to have a small net update factor, while a small gradient is likely to have a large net update factor in Adam.

When it comes to the scenario where β1>0\beta_{1}>0, the arguments are actually quite similar. Given vt=β2vt1+(1β2)gt2v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g^{2}_{t}. Assuming vt1v_{t-1} and {gt+i}i=1\{g_{t+i}\}^{\infty}_{i=1} are independent from gtg_{t}, then: not only does vtv_{t} positively correlate with the magnitude of gtg_{t}, but also the entire infinite sequence {vi}i=t\{v_{i}\}^{\infty}_{i=t}\, positively correlates with the magnitude of gtg_{t}. Since the net update factor k(gt)=i=t αi/vi(1β1)β1itk(g_{t})=\sum_{i=t}^{\infty}{~{}\alpha_{i}}/{\sqrt{v_{i}}}(1-\beta_{1})\beta_{1}^{i-t} negatively correlates with each viv_{i} in {vi}i=t\{v_{i}\}^{\infty}_{i=t}, it is thus negatively correlated with the magnitude of gtg_{t}. That is, k(gt)k(g_{t}) for a large gradient is likely to be smaller, while k(gt)k(g_{t}) for a small gradient is likely to be larger.

The biased net update factors cause the non-convergence problem of Adam as well as all other adaptive learning rate methods where vtv_{t} correlates with gtg_{t}. To construct a counterexample, the same pattern is that: the large gradient is along the “correct” direction, while the small gradient is along the opposite direction. Due to the fact that the accumulated influence of a large gradient is small while the accumulated influence of a small gradient is large, Adam may update parameters along the wrong direction.

Finally, we would like to emphasize that even if Adam updates parameters along the right direction in general, the biased net update factors are still unfavorable since they slow down the convergence.

The proposed method: decorrelation via temporal shifting

According to the previous discussion, we conclude that the main cause of the non-convergence of Adam is the inappropriate correlation between vtv_{t} and gtg_{t}. Currently we have two possible solutions: (1) making vtv_{t} act like a constant, which declines the correlation, e.g., using a large β2\beta_{2} or keep vtv_{t} non-decreasing (Reddi et al., 2018); (2) using a large β1\beta_{1} (Theorem 1), where the aggressive momentum term helps to mitigate the impact of biased net update factors. However, neither of them solves the problem fundamentally.

The dilemma caused by vtv_{t} enforces us to rethink its role. In adaptive learning rate methods, vtv_{t} plays the role of estimating the second moments of gradients, which reflects the scale of gradient on average. With the adaptive learning rate αt/vt\alpha_{t}/\sqrt{v_{t}}, the update step of gtg_{t} is scaled down by vt\sqrt{v_{t}} and achieves rescaling invariance with respect to the scale of gtg_{t}, which is practically useful to make the training process easy to control and the training system robust. However, the current scheme of vtv_{t}, i.e., vt=β2vt1+(1β2)gt2v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g^{2}_{t}, brings a positive correlation between vtv_{t} and gtg_{t}, which results in reducing the effect of large gradients and increasing the effect of small gradients, and finally causes the non-convergence problem. Therefore, the key is to let vtv_{t} be a quantity that reflects the scale of the gradients, while at the same time, be decorrelated with current gradient gtg_{t}. Formally, we have the following theorem:

For any fixed online optimization problem with infinitely repeating of a finite set of cost functions {f1(θ),,ft(θ),fn(θ)}\{f_{1}(\theta),\dots,f_{t}(\theta),\dots f_{n}(\theta)\}, assuming β1=0\beta_{1}=0 and αt\alpha_{t} is fixed, we have, if vtv_{t} follows a fixed distribution and is independent of the current gradient gtg_{t}, then the expected net update factor for each gradient is identical.

Let PvP_{v} denote the distribution of vtv_{t}. In the infinitely repeating online optimization scheme, the expectation of net update factor for each gradient gtg_{t} is

Momentum (Qian, 1999) can be viewed as setting vtv_{t} as a constant, which makes vtv_{t} and gtg_{t} independent. Furthermore, in our view, using an increasing β2\beta_{2} (AdamNC) or keeping vt^\hat{v_{t}} as the largest vtv_{t} (AMSGrad) is also to make vtv_{t} almost fixed. However, fixing vtv_{t} is not a desirable solution, because it damages the adaptability of Adam with respect to the adapting of step size.

We next introduce the proposed solution to make vtv_{t} independent of gtg_{t}, which is based on temporal independent assumption among gradients. We first introduce the idea of temporal decorrelation, then extend our solution to make use of the spatial information of gradients. Finally, we incorporate first moment estimation. The pseudo code of the proposed algorithm is presented as follows.

In practical setting, ft(θ)f_{t}(\theta) usually involves different mini-batches xtx_{t}, i.e., ft(θ)=f(θ;xt)f_{t}(\theta)=f(\theta;x_{t}). Given the randomness of mini-batch, we assume that the mini-batch xtx_{t} is independent of each other and further assume that f(θ;x)f(\theta;x) keeps unchanged over time, then the gradient gt=f(θ;xt)g_{t}=\nabla f(\theta;x_{t}) of each mini-batch is independent of each other.

Therefore, we could change the update rule for vtv_{t} to involve gtng_{t-n} instead of gtg_{t}, which makes vtv_{t} and gtg_{t} temporally shifted and hence decorrelated:

Note that in the sequential online optimization problem, the assumption “gtg_{t} is independent of each other” does not hold. However, in the stochastic online optimization problem and practical neural network settings, our assumption generally holds.

2 Making use of the spatial elements of previous timesteps

Most optimization schemes involve a great many parameters. The dimension of θ\theta is high, thus gtg_{t} and vtv_{t} are also of high dimension. However, vtv_{t} is element-wisely computed in Equation 14. Specifically, we only use the ii-th dimension of gtng_{t-n} to calculate the ii-th dimension of vtv_{t}. In other words, it only makes use of the independence between gtn[i]g_{t-n}[i] and gt[i]g_{t}[i], where gt[i]g_{t}[i] denotes the ii-th element of gtg_{t}. Actually, in the case of high-dimensional gtg_{t} and vtv_{t}, we can further assume that all elements of gradient gtng_{t-n} at previous timesteps are independent with the ii-th dimension of gtg_{t}. Therefore, all elements in gtng_{t-n} can be used to compute vtv_{t} without introducing correlation. To this end, we propose introducing a function ϕ\phi over all elements of gtn2g^{2}_{t-n}, i.e.,

For easy reference, we name the elements of gtng_{t-n} other than gtn[i]g_{t-n}[i] as the spatial elements of gtng_{t-n} and name ϕ\phi the spatial function or spatial operation. There is no restriction on the choice of ϕ\phi, and we use ϕ(x)=maxix[i]\phi(x)=\max_{i}x[i] for most of our experiments, which is shown to be a good choice. The maxix[i]\max_{i}x[i] operation has a side effect that turns the adaptive learning rate vtv_{t} into a shared scalar.

An important thing here is that, we no longer interpret vtv_{t} as the second moment of gtg_{t}. It is merely a random variable that is independent of gtg_{t}, while at the same time, reflects the overall gradient scale. We leave further investigations on ϕ\phi as future work.

3 Block-wise adaptive learning rate SGD

In practical setting, e.g., deep neural network, θ\theta usually consists of many parameter blocks, e.g., the weight and bias for each layer. In deep neural network, the gradient scales (i.e., the variance) for different layers tend to be different (Glorot & Bengio, 2010; He et al., 2015). Different gradient scales make it hard to find a learning rate that is suitable for all layers, when using SGD and Momentum methods. In traditional adaptive learning rate methods, they apply element-wise rescaling for each gradient dimension, which achieves rescaling-invariance and somehow solves the above problem. However, Adam sometimes does not generalize better than SGD (Wilson et al., 2017; Keskar & Socher, 2017), which might relate to the excessive learning rate adaptation in Adam.

In our temporal decorrelation with spatial operation scheme, we can solve the “different gradient scales” issue more naturally, by applying ϕ\phi block-wisely and outputs a shared adaptive learning rate scalar vt[i]v_{t}[i] for each block:

It makes the algorithm work like an adaptive learning rate SGD, where each block has an adaptive learning rate αt/vt[i]\alpha_{t}/\sqrt{v_{t}[i]} while the relative gradient scale among in-block elements keep unchanged. As illustrated in Algorithm 1, the parameters θt\theta_{t} including the related gtg_{t} and vtv_{t} are divided into MM blocks. Every block contains the parameters of the same type or same layer in neural network.

4 Incorporating first moment estimation: moving averaging windows

First moment estimation, i.e., defining mtm_{t} as a moving average of gtg_{t}, is an important technique of modern first order optimization algorithms, which alleviates mini-batch oscillations. In this section, we extend our algorithm to incorporate first moment estimation.

We have argued that vtv_{t} needs to be decorrelated with gtg_{t}. Analogously, when introducing the first moment estimation, we need to make vtv_{t} and mtm_{t} independent to make the expected net update factor unbiased. Based on our assumption of temporal independence, we further keep out the latest nn gradients {gti}i=0n1\{g_{t-i}\}^{n-1}_{i=0}, and update vtv_{t} and mtm_{t} via

In Equation 17, β1\beta_{1}\in plays the role of decay rate for temporal elements. It can be viewed as a truncated version of exponential moving average that only applied to the latest few elements. Since we use truncating, it is feasible to use large β1\beta_{1} without taking the risk of using too old gradients. In the extreme case where β1=1\beta_{1}=1, it becomes vanilla averaging.

The key difference between Adam and the proposed method is that the latter temporally shifts the gradient gtg_{t} for nn-step, i.e., using gtng_{t-n} for calculating vtv_{t} and using the kept-out nn gradients to evaluate mtm_{t} (Equation 17), which makes vtv_{t} and mtm_{t} decorrelated and consequently solves the non-convergence issue. In addition, based on our new perspective on adaptive learning rate methods, vtv_{t} is not necessarily the second moment and it is valid to further involve the calculation of vtv_{t} with the spatial elements of previous gradients. We thus proposed to introduce the spatial operation ϕ\phi that outputs a shared scalar for each block. The resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale in each block is maintained. We name the proposed method that makes use of temporal-shifting to decorrelated vtv_{t} and mtm_{t} AdaShift, which means “ADAptive learning rate method with temporal SHIFTing”.

Experiments

In this section, we empirically study the proposed method and compare them with Adam, AMSGrad and SGD, on various tasks in terms of training performance and generalization. Without additional declaration, the reported result for each algorithm is the best we have found via parameter grid search. The anonymous code is provided at http://bit.ly/2NDXX6x.

Firstly, we verify our analysis on the stochastic online optimization problem in Equation 7, where we set C=101C=101 and δ=0.02\delta=0.02. We compare Adam, AMSGrad and AdaShift in this experiment. For fair comparison, we set α=0.001\alpha=0.001, β1=0\beta_{1}=0 and β2=0.999\beta_{2}=0.999 for all these methods. The results are shown in Figure 1(a). We can see that Adam tends to increase θ\theta, that is, the accumulate update of θ\theta in Adam is along the wrong direction, while AMSGrad and AdaShift update θ\theta in the correct direction. Furthermore, given the same learning rate, AdaShift decreases θ\theta faster than AMSGrad, which validates our argument that AMSGrad has a relatively higher vtv_{t} that slows down the training.

In this experiment, we also verify Theorem 1. As shown in Figure 1(b), Adam is also able to converge to the correct direction with a sufficiently large β1\beta_{1} and β2\beta_{2}. Note that (1) AdaShift still converges with the fastest speed; (2) a small β1\beta_{1} (e.g., β1=0.9\beta_{1}=0.9, the light-blue line in Figure 1(b)) does not make Adam converge to the correct direction. We do not conduct the experiments on the sequential online optimization problem in Equation 6, because it does not fit our temporal independence assumption. To make it converge, one can use a large β1\beta_{1} or β2\beta_{2}, or set vtv_{t} as a constant.

2 Logistic Regression and Multilayer Perceptron on MNIST

We further compare the proposed method with Adam, AMSGrad and SGD by using Logistic Regression and Multilayer Perceptron on MNIST, where the Multilayer Perceptron has two hidden layers and each has 256256 hidden units with no internal activation. The results are shown in Figure 2 and Figure 3, respectively. We find that in Logistic Regression, these learning algorithms achieve very similar final results in terms of both training speed and generalization. In Multilayer Perceptron, we compare Adam, AMSGrad and AdaShift with reduce-max spatial operation (max-AdaShift) and without spatial operation (non-AdaShift). We observe that max-AdaShift achieves the lowest training loss, while non-AdaShift has mild training loss oscillation and at the same time achieves better generalization. The worse generalization of max-AdaShift may be due to overfitting in this task, and the better generalization of non-AdaShift may stem from the regularization effect of its relatively unstable step size.

3 DenseNet and ResNet on CIFAR-10

ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) are two typical modern neural networks, which are efficient and widely-used. We test our algorithm with ResNet and DenseNet on CIFAR-10 datasets. We use a 1818-layer ResNet and 100100-layer DenseNet in our experiments. We plot the best results of Adam, AMSGrad and AdaShift in Figure 4 and Figure 5 for ResNet and DenseNet, respectively. We can see that AMSGrad is relatively worse in terms of both training speed and generalization. Adam and AdaShift share competitive results, while AdaShift is generally slightly better, especially the test accuracy of ResNet and the training loss of DenseNet.

4 DenseNet with Tiny-ImageNet

We further increase the complexity of dataset, switching from CIFAR-10 to Tiny-ImageNet, and compare the performance of Adam, AMSGrad and AdaShift with DenseNet. The results are shown in Figure 6, from which we can see that the training curves of Adam and AdaShift are basically overlapped, but AdaShift achieves higher test accuracy than Adam. AMSGrad has relatively higher training loss, and its test accuracy is relatively lower at the initial stage.

5 Generative model and Recurrent model

We also test our algorithm on the training of generative model and recurrent model. We choose WGAN-GP (Gulrajani et al., 2017) that involves Lipschitz continuity condition (which is hard to optimize), and Neural Machine Translation (NMT) (Luong et al., 2017) that involves typical recurrent unit LSTM, respectively. In Figure 7(a), we compare the performance of Adam, AMSGrad and AdaShift in the training of WGAN-GP discriminator, given a fixed generator. We notice that AdaShift is significantly better than Adam, while the performance of AMSGrad is relatively unsatisfactory. The test performance in terms of BLEU of NMT is shown in Figure 7(b), where AdaShift achieves a higher BLEU than Adam and AMSGrad.

Conclusion

In this paper, we study the non-convergence issue of adaptive learning rate methods from the perspective of the equivalent accumulated step size of each gradient, i.e., the net update factor defined in this paper. We show that there exists an inappropriate correlation between vtv_{t} and gtg_{t}, which leads to biased net update factor for each gradient. We demonstrate that such biased step sizes are the fundamental cause of non-convergence of Adam, and we further prove that decorrelating vtv_{t} and gtg_{t} will lead to unbiased expected step size for each gradient, thus solving the non-convergence problem of Adam. Finally, we propose AdaShift, a novel adaptive learning rate method that decorrelates vtv_{t} and gtg_{t} via calculating vtv_{t} using temporally shifted gradient gtng_{t-n}.

In addition, based on our new perspective on adaptive learning rate methods, vtv_{t} is no longer necessarily the second moment of gtg_{t}, but a random variable that is independent of gtg_{t} and reflects the overall gradient scale. Thus, it is valid to calculate vtv_{t} with the spatial elements of previous gradients. We further found that when the spatial operation ϕ\phi outputs a shared scalar for each block, the resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale in each block is maintained. The experiment results demonstrate that AdaShift is able to solve the non-convergence issue of Adam. In the meantime, AdaShift achieves competitive and even better training and testing performance when compared with Adam.

References

To provide an intuitive impression on the relation among C,d,β1,β2C,d,\beta_{1},\beta_{2} and the convergence of Adam, we let C=d=6C=d=6, initialize θ1=0\theta_{1}=0, vary β1\beta_{1} and β2\beta_{2} among [0,1)[0,1) and let Adam go through 2000 timesteps (iterations). The final result of θ\theta is shown in Figure 8(a). It suggests that for a fixed sequential online optimization problem, both of β1\beta_{1} and β2\beta_{2} determine the direction and speed of Adam optimization process. Furthermore, we also study the threshold point of CC and dd, under which Adam will change to the incorrect direction, for each fixed β1\beta_{1} and β2\beta_{2} that vary among [0,1)[0,1). To simplify the experiments, we keep d=Cd=C such that the overall gradient of each epoch being +1+1. The result is shown in Figure 8(b), which suggests, at the condition of larger β1\beta_{1} or larger β2\beta_{2}, it needs a larger CC to make Adam stride on the opposite direction. In other words, large β1\beta_{1} and β2\beta_{2} will make the non-convergence rare to happen.

We also conduct the experiment in the stochastic problem to analyze the relation among CC, β1\beta_{1}, β2\beta_{2} and the convergence behavior of Adam. Results are shown in the Figure 8(c) and Figure 8(d) and the observations are similar to the previous: larger CC will cause non-convergence more easily and a larger β1\beta_{1} or β2\beta_{2} somehow help to resolve non-convergence issue. In this experiment, we set δ=1\delta=1.

In the sequential online optimization problem Equation 6, let αt\alpha_{t} being fixed, define S(β1,β2,C,d)\mathcal{S}(\beta_{1},\beta_{2},C,d) to be the sum of the limits of step updates in a dd-step epoch:

Let S(β1,β2,C)=0\mathcal{S}(\beta_{1},\beta_{2},C)=0, assuming β2\beta_{2} and CC are large enough such that vt1v_{t}\gg 1, we get the equation:

Equation 19, though being quite complex, tells that both β1\beta_{1} and β2\beta_{2} are closely related to the counterexamples, and there exists a critical condition among these parameters.

Appendix B The AdaShift Pseudo Code

We provided the anonymous code where a Tensorflow implementation of this algorithm is available.

In order to verify the correlation between gtg_{t} and vtv_{t} in Adam and AdaShift, we conduct experiments to calculate the correlation coefficient between gtg_{t} and vtv_{t}. We train the Multilayer Perceptron on MNIST until converge and gather the gradient of the second hidden layer of each step. Based on these data, we calculate vtv_{t} and the correlation coefficient between gt[i]g_{t}[i] and gtn[i]g_{t-n}[i], between gt[i]g_{t}[i] and gtn[j]g_{t-n}[j] and between gt[i]g_{t}[i] and vt[i]v_{t}[i] of the last 1010 epochs using the Pearson correlation coefficient, which is formulated as follows:

To verify the temporal correlation between gt[i]g_{t}[i] and gtn[i]g_{t-n}[i], we range nn from 11 to 1010 and calculate the average temporal correlation coefficient of all variables ii. Results are shown in Table 1.

To verify the spatial correlation between gt[i]g_{t}[i] and gtn[j]g_{t-n}[j], we again range nn from 11 to 1010 and randomly sample some pairs of ii and jj and calculate the average spatial correlation coefficient of all the selected pairs. Results are shown in Table 2.

To verify the correlation between gt[i]g_{t}[i] and vt[i]v_{t}[i] within Adam, we calculate vtv_{t} and the average correlation coefficient between gt2g_{t}^{2} and vtv_{t} of all variables ii. The result is 0.435885276.

To verify the correlation between gtn[i]g_{t-n}[i] and vt[i]v_{t}[i] within non-AdaShift and between gtn[i]g_{t-n}[i] and vtv_{t} within max-AdaShift, we range the keep number nn from 11 to 1010 to calculate vtv_{t} and the average correlation coefficient of all variables ii. The result is shown in Table 3 and Table 4.

Appendix D Proof of Theorem 1

With bias correction, the formulation of mtm_{t} is written as follows

According to L’Hospital‘s rule, we can draw the following:

According to the definition of limitation, let g=i=1tgitg^{*}=\frac{\sum_{i=1}^{t}g_{i}}{t}, we have, ϵ>0\forall\epsilon>0, β1(0,1)\exists\beta_{1}\in(0,1), such that

We set ϵ\epsilon to be g2|\frac{g^{*}}{2}|, then for each dimension of mtm_{t}, i.e. mt[i]m_{t}[i],

So, mtm_{t} shares the same sign with gg^{*} in every dimension.

Given it is a convex optimization problem, let the optimal parameter be θ\theta^{*}, and the maximum step size is αtvtG\frac{\alpha_{t}}{\sqrt{v_{t}}}G that holds ϵ1/G<αtvtG<ϵ2/G\epsilon_{1}/G<\frac{\alpha_{t}}{\sqrt{v_{t}}}G<\epsilon_{2}/G, we have,

Given ft(θ)G\lVert\nabla f_{t}(\theta)\rVert_{\infty}\leq G, we have ft(θ)ft(θ)<ϵ2f_{t}(\theta)-f_{t}(\theta^{*})<\epsilon_{2}, which implies the average regret

Appendix E Proof of Lemma 2

For a fixed dd, as nn approach infinity, we get the limit of mnd+im_{nd+i} as:

For a fixed dd, as nn approach infinity, we get the limit of vnd+iv_{nd+i} as:

Appendix F Proof of Lemma 3

Thus, we can get the forward difference of k(gnd+i)k(g_{nd+i}) as:

Hence, we can draw the conclusion: 1jd\exists 1\leq j\leq d, such that

where K(C)K(C) is the net update factor for gradient gi=Cg_{i}=C. ∎

Appendix G Proof of Lemma 4

For a bounded random variable XX and a differentiable function f(x)f(x), the expectation of f(X)f(X) is as follows:

where D(X)D(X) is variance of XX, and R3R_{3} is as follows:

F(x)F(x) is the distribution function of XX. R3R_{3} is a small quantity under some condition. And cc is large enough, such that: for any ϵ>0\epsilon>0,

(Proof of Lemma 4 ) In the stochastic online optimization problem equation 7, the gradient subjects the distribution as:

Then we can get the expectation of gig_{i} :

Meanwhile, under the assumption that gradients are i.i.d., the expectation and variance of viv_{i} are as following when ndnd\to\infty:

Then, for the gradient gig_{i}, the net update factor is as follows:

It should to be clarified that we define j=1tβ2tjgi+j2\sum_{j=1}^{t}\beta_{2}^{t-j}g_{i+j}^{2} equal to zero when t=0t=0. Then we define XtX_{t} as:

For the function f(x)=1xf(x)=\frac{1}{\sqrt{x}}:

According to lemma 1, we can the expectation of f(Xt)f(X_{t}) as follows:

where Dt=D[Xk]D_{t}=D[X_{k}]. Then for gradient CC and 1-1, the net update factor is as follows:

We can see that each term in the infinite series of k(C)k(C) is smaller than the corresponding one in k(1)k(-1). Thus, k(C)<k(1)k(C)<k(-1).

Appendix H Proof of Lemma 6

We sum up all updates in an epoch, and define the summation as S(β1,β2,C)\mathcal{S}(\beta_{1},\beta_{2},C).

Assume β2\beta_{2} and CC are large enough such that vt1v_{t}\gg 1, we get the approximation of limit of vnd+iv_{nd+i} as:

Then we can draw the expression of S(β1,β2,C)\mathcal{S}(\beta_{1},\beta_{2},C) as:

Let S(β1,β2,C)=0\mathcal{S}(\beta_{1},\beta_{2},C)=0, we get the equation about critical condition:

Appendix I Hyper-parameters Investigation

Here, we list all hyper-parameter setting of all above experiments.

In this section, we discuss the learning rate αt\alpha_{t} sensitivity of AdaShift. We set αt{0.1,0.01,0.001}\alpha_{t}\in\{0.1,0.01,0.001\} and let n=10n=10, β1=0.9\beta_{1}=0.9 and β2=0.999\beta_{2}=0.999. The results are shown in Figure 9 and Figure 10. Empirically, we found that when using the maxmax spatial operation, the best learning rate for AdaShift is around ten times of Adam.

In this section, we discuss the β1\beta_{1} and β2\beta_{2} sensitivity of AdaShift. We set α=0.01\alpha=0.01, n=10n=10 and let β1{0,0.9}\beta_{1}\in\{0,0.9\} and β2{0.9,0.99,0.999}\beta_{2}\in\{0.9,0.99,0.999\}. The results are shown in Figure 11 and Figure 12. According to the results, AdaShift holds a low sensitivity to β1\beta_{1} and β2\beta_{2}. In some tasks, using the first moment estimation (with β1=0.9\beta_{1}=0.9 and n=10n=10) or using a large β2\beta_{2}, e.g., 0.9990.999 can attain better performance. The suggested parameters setting is n=10,β1=0.9,β2=0.999n=10,\beta_{1}=0.9,\beta_{2}=0.999.

I.4 n𝑛n and m𝑚m Sensitivity

In this section, we discuss the nn sensitivity of AdaShift. Here we also test a extended version of first moment estimation where it only uses the latest mm gradients (mnm\leq n):

We set β1=0.9\beta_{1}=0.9, β2=0.999\beta_{2}=0.999. The results are shown in Figure 13, Figure 14 and Figure 15. In these experiments, AdaShift is fairly stable when changing nn and mm. We have not find a clear pattern on the performance change with respect to nn and mm.

Appendix J Temporal-only and Spatial-only

In our proposed algorithm, we apply a spatial operation on the temporally shifted gradient gtng_{t-n} to update vtv_{t}: vt[i]=β2vt1[i]+(1β2)ϕ(gtn2[i])v_{t}[i]=\beta_{2}v_{t-1}[i]+(1-\beta_{2})\phi(g^{2}_{t-n}[i]). It is based on the temporal independent assumption, i.e., gtng_{t-n} is independent of gtg_{t}. And according to our argument in Section 4.2, one can further assume every element in gtng_{t-n} is independent of the ii-th dimension of gtg_{t}.

We purposely avoid involving the spatial elements of the current gradient gtg_{t}, where the independence might not holds: when a sample which is rare and has a large gradient appear in the mini-batch xtx_{t}, the overall scale of gradient gtg_{t} might increase. However, for the temporally already decorrelation gtig_{t-i}, further taking the advantage of the spatial irrelevance will not suffer from this problem.

We here provide extended experiments on two variants of AdaShift: (i) AdaShift (temporal-only), which only uses the vanilla temporal independent assumption and evaluate vtv_{t} with: vt=β2vt1+(1β2)gtn2v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g_{t-n}^{2}; (ii) AdaShift (spatial-only), which directly uses the spatial elements without temporal shifting.

According to our experiments, AdaShift (temporal-only), i.e., without the spatial operation, is less stable than AdaShift. In some tasks, AdaShift (temporal-only) works just fine; while in some other cases, AdaShift (temporal-only) suffers from explosive gradient and requires a relatively small learning rate. The performance of AdaShift (spatial-only) is close to Adam. More experiments for AdaShift (spatial-only) are included in the next section.

Appendix K Extended Experiments: Nadam and AdaShift(Space Only)

In this section, we extend the experiments and add the comparisons with Nadam and AdaShift (spatial-only). The results are shown in Figure 18, Figure19 and Figure20. According to these experiments, Nadam and AdaShift (spatial-only) share similar performence as Adam.

Appendix L Extension experiments: ill-conditioned quadratic problem

Rahimi & Recht raise the point, at “test of time” talk at NIPS 2017, that it is suspicious that gradient descent (aka back-propagation) is ultimate solution for optimization. A ill-conditioned quadratic problem with Two Layer Linear Net is showed to be challenging for gradient descent based methods, while alternative solutions, e.g., Levenberg-Marquardt, may converge faster and better. The problem is defined as follows:

where AA is some known badly conditioned matrix (k=1020k=10^{20} or 10510^{5}), and W1W_{1} and W2W_{2} are the trainable parameters.

We test SGD, Adam and AdaShift with this problem, the results are shown in Figure 21, Figure 24. It turns out as long as the training goes enough long, SGD, Adam, AdaShift all basically converge in this problem. Though SGD is significantly better than Adam and AdaShift.

We would tend to believe this is a general issue of adaptive learning rate method when comparing with vanilla SGD. Because these adaptive learning rate methods generally are scale-invariance, i.e., the step-size in terms of gt/sqrt(vt)g_{t}/sqrt(v_{t}) is basically around one, which makes it hard to converge very well in such a ill-conditioning quadratic problem. SGD, in contrast, has a step-size gtg_{t}; as the training converges SGD would have a decreasing step-size, makes it much easier to converge better. The above analysis is confirmed with Figure 22 and Figure 23, with a decreasing learning rate, Adam and AdaShfit both converge very good.