AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods
Zhiming Zhou, Qingru Zhang, Guansong Lu, Hongwei Wang, Weinan Zhang, Yong Yu
Introduction
One common choice of is the exponential moving average of the gradients used in Momentum (Qian, 1999) and Adam (Kingma & Ba, 2014), which helps alleviate gradient oscillations. The commonly-used in deep learning community is the exponential moving average of squared gradients, such as Adadelta (Zeiler, 2012), RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014) and Nadam (Dozat, 2016).
Adam (Kingma & Ba, 2014) is a typical adaptive learning rate method, which assembles the idea of using exponential moving average of first and second moments and bias correction. In general, Adam is robust and efficient in both dense and sparse gradient cases, and is popular in deep learning research. However, Adam is shown not being able to converge to optimal solution in certain cases. Reddi et al. (2018) point out that the key issue in the convergence proof of Adam lies in the quantity
which is assumed to be positive, but unfortunately, such an assumption does not always hold in Adam. They provide a set of counterexamples and demonstrate that the violation of positiveness of will lead to undesirable convergence behavior in Adam.
Reddi et al. (2018) then propose two variants, AMSGrad and AdamNC, to address the issue by keeping positive. Specifically, AMSGrad defines as the historical maximum of , i.e., , and replaces with to keep non-decreasing and therefore forces to be positive; while AdamNC forces to have “long-term memory” of past gradients and calculates as their average to make it stable. Though these two algorithms solve the non-convergence problem of Adam to a certain extent, they turn out to be inefficient in practice: they have to maintain a very large once a large gradient appears, and a large decreases the adaptive learning rate and slows down the training process.
In this paper, we provide a new insight into adaptive learning rate methods, which brings a new perspective on solving the non-convergence issue of Adam. Specifically, in Section 3, we study the non-convergence of Adam via analyzing the accumulated step size of each gradient . We observe that in the common adaptive learning rate methods, a large gradient tends to have a relatively small step size, while a small gradient is likely to have a relatively large step size. We show that such biased step sizes stem from the inappropriate positive correlation between and , and we argue that this is the fundamental cause of the non-convergence issue of Adam.
In Section 4, we further prove that decorrelating and leads to equal and unbiased expected step size for each gradient, thus solving the non-convergence issue of Adam. We subsequently propose AdaShift, a decorrelated variant of adaptive learning rate methods, which achieves decorrelation between and by calculating using temporally shifted gradients. Finally, in Section 5, we study the performance of our proposed AdaShift, and demonstrate that it solves the non-convergence issue of Adam, while still maintaining a decent performance compared with Adam in terms of both training speed and generalization.
Preliminaries
In Adam, and are defined as the exponential moving average of and :
where and are the exponential decay rates for and , respectively, with and . They can also be written as:
To avoid the bias in the estimation of the expected value at the initial timesteps, Kingma & Ba (2014) propose to apply bias correction to and . Using as instance, it works as follows:
Online optimization problem.
An online optimization problem consists of a sequence of cost functions , where the optimizer predicts the parameter at each time-step and evaluate it on an unknown cost function . The performance of the optimizer is usually evaluated by regret , which is the sum of the difference between the online prediction and the best fixed-point parameter prediction for all the previous steps, where is the best fixed-point parameter from a feasible set .
Counterexamples.
Reddi et al. (2018) highlight that for any fixed and , there exists an online optimization problem where Adam has non-zero average regret, i.e., Adam does not converge to optimal solution . The counterexamples in the sequential version are given as follows:
where is a relatively large constant and is the length of an epoch. In Equation 6, most gradients of with respect to are , but the large positive gradient at the beginning of each epoch makes the overall gradient of each epoch positive, which means that one should decrease to minimize the loss. However, according to (Reddi et al., 2018), the accumulated update of in Adam under some circumstance is opposite (i.e., is increased), thus Adam cannot converge in such case. Reddi et al. (2018) argue that the reason of the non-convergence of Adam lies in that the positive assumption of does not always hold in Adam.
The counterexamples are also extended to stochastic cases in (Reddi et al., 2018), where a finite set of cost functions appear in a stochastic order. Compared with sequential online optimization counterexample, the stochastic version is more general and closer to the practical situation. For the simplest one dimensional case, at each timestep , the function is chosen as i.i.d.:
where is a small positive constant that is smaller than . The expected cost function of the above problem is , therefore, one should decrease to minimize the loss. Reddi et al. (2018) prove that when is large enough, the expectation of accumulated parameter update in Adam is positive and results in increasing .
Basic Solutions
Reddi et al. (2018) propose maintaining the strict positiveness of as solution, for example, keeping non-decreasing or using increasing . In fact, keeping positive is not the only way to guarantee the convergence of Adam. Another important observation is that for any fixed sequential online optimization problem with infinitely repeating epochs (e.g., Equation 6), Adam will converge as long as is large enough. Formally, we have the following theorem:
The intuition behind Theorem 1 is that, if , then , i.e., approaches the average gradient of an epoch, according to Equation 5. Therefore, no matter what the adaptive learning rate is, Adam will always converge along the correct direction.
The cause of non-convergence: biased step size
In this section, we study the non-convergence issue by analyzing the counterexamples provided by Reddi et al. (2018). We show that the fundamental problem of common adaptive learning rate methods is that: is positively correlated to the scale of gradient , which results in a small step size for a large gradient, and a large step size for a small gradient. We argue that such an biased step size is the cause of non-convergence.
We will first define net update factor for the analysis of the accumulated influence of each gradient , then apply the net update factor to study the behaviors of Adam using Equation 6 as an example. The argument will be extended to the stochastic online optimization problem and general cases.
When , due to the exponential moving effect of , the influence of exists in all of its following timesteps. For timestep (), the weight of is . We accordingly define a new tool for our analysis: the net update of each gradient , which is its accumulated influence on the entire optimization process:
and we call the net update factor of , which is the equivalent accumulated step size for gradient . Note that depends on , and in Adam, if , then all elements in are related to . Therefore, is a function of .
It is worth noticing that in Momentum method, is equivalently set as . Therefore, we have and , which means that the accumulated influence of each gradient in Momentum is the same as vanilla SGD (Stochastic Gradient Decent). Hence, the convergence of Momentum is similar to vanilla SGD. However, in adaptive learning rate methods, is function over the past gradients, which makes its convergence nontrivial.
2 Analysis on online optimization counterexamples
Note that exists in the definition of net update factor (Equation 8). Before further analyzing the convergence of Adam using the net update factor, we first study the pattern of in the sequential online optimization problem in Equation 6. Since Equation 6 is deterministic, we can derive the formula of as follows:
Given the formula of in Equation 9, we now study the net update factor of each gradient. We start with a simple case where . In this case we have
Since the limit of in each epoch monotonically decreases with the increase of index according to Equation 9, the limit of monotonically increases in each epoch. Specifically, the first gradient in epoch represents the correct updating direction, but its influence is the smallest in this epoch. In contrast, the net update factor of the subsequent gradients are relatively larger, though they indicate a wrong updating direction.
We further consider the general case where . The result is presented in the following lemma:
In the sequential online optimization problem in Equation 6, when , the limit of net update factor of epoch satisfies: such that
where denotes the net update factor for gradient .
Lemma 3 tells us that, in sequential online optimization problem in Equation 6, the net update factors are biased. Specifically, the net update factor for the large gradient is the smallest in the entire epoch, while all gradients have larger net update factors. Such biased net update factors will possibly lead Adam to a wrong accumulated update direction.
Similar conclusion also holds in the stochastic online optimization problem in Equation 7. We derive the expectation of the net update factor for each gradient in the following lemma:
In the stochastic online optimization problem in Equation 7, assuming , it holds that , where denote the expectation net update factor for and denote the expectation net update factor for .
Though the formulas of net update factors in the stochastic case are more complicated than those in deterministic case, the analysis is actually more easier: the gradients with the same scale share the same expected net update factor, so we only need to analyze and . From Lemma 4, we can see that in terms of the expectation net update factor, is smaller than , which means the accumulated influence of gradient is smaller than gradient .
3 Analysis on non-convergence of Adam
As we have observed in the previous section, a common characteristic of these counterexamples is that the net update factor for the gradient with large magnitude is smaller than these with small magnitude. The above observation can also be interpreted as a direct consequence of inappropriate correlation between and . Recall that . Assuming is independent of , then: when a new gradient arrives, if is large, is likely to be larger; and if is small, is also likely to be smaller. If , then . As a result, a large gradient is likely to have a small net update factor, while a small gradient is likely to have a large net update factor in Adam.
When it comes to the scenario where , the arguments are actually quite similar. Given . Assuming and are independent from , then: not only does positively correlate with the magnitude of , but also the entire infinite sequence positively correlates with the magnitude of . Since the net update factor negatively correlates with each in , it is thus negatively correlated with the magnitude of . That is, for a large gradient is likely to be smaller, while for a small gradient is likely to be larger.
The biased net update factors cause the non-convergence problem of Adam as well as all other adaptive learning rate methods where correlates with . To construct a counterexample, the same pattern is that: the large gradient is along the “correct” direction, while the small gradient is along the opposite direction. Due to the fact that the accumulated influence of a large gradient is small while the accumulated influence of a small gradient is large, Adam may update parameters along the wrong direction.
Finally, we would like to emphasize that even if Adam updates parameters along the right direction in general, the biased net update factors are still unfavorable since they slow down the convergence.
The proposed method: decorrelation via temporal shifting
According to the previous discussion, we conclude that the main cause of the non-convergence of Adam is the inappropriate correlation between and . Currently we have two possible solutions: (1) making act like a constant, which declines the correlation, e.g., using a large or keep non-decreasing (Reddi et al., 2018); (2) using a large (Theorem 1), where the aggressive momentum term helps to mitigate the impact of biased net update factors. However, neither of them solves the problem fundamentally.
The dilemma caused by enforces us to rethink its role. In adaptive learning rate methods, plays the role of estimating the second moments of gradients, which reflects the scale of gradient on average. With the adaptive learning rate , the update step of is scaled down by and achieves rescaling invariance with respect to the scale of , which is practically useful to make the training process easy to control and the training system robust. However, the current scheme of , i.e., , brings a positive correlation between and , which results in reducing the effect of large gradients and increasing the effect of small gradients, and finally causes the non-convergence problem. Therefore, the key is to let be a quantity that reflects the scale of the gradients, while at the same time, be decorrelated with current gradient . Formally, we have the following theorem:
For any fixed online optimization problem with infinitely repeating of a finite set of cost functions , assuming and is fixed, we have, if follows a fixed distribution and is independent of the current gradient , then the expected net update factor for each gradient is identical.
Let denote the distribution of . In the infinitely repeating online optimization scheme, the expectation of net update factor for each gradient is
Momentum (Qian, 1999) can be viewed as setting as a constant, which makes and independent. Furthermore, in our view, using an increasing (AdamNC) or keeping as the largest (AMSGrad) is also to make almost fixed. However, fixing is not a desirable solution, because it damages the adaptability of Adam with respect to the adapting of step size.
We next introduce the proposed solution to make independent of , which is based on temporal independent assumption among gradients. We first introduce the idea of temporal decorrelation, then extend our solution to make use of the spatial information of gradients. Finally, we incorporate first moment estimation. The pseudo code of the proposed algorithm is presented as follows.
In practical setting, usually involves different mini-batches , i.e., . Given the randomness of mini-batch, we assume that the mini-batch is independent of each other and further assume that keeps unchanged over time, then the gradient of each mini-batch is independent of each other.
Therefore, we could change the update rule for to involve instead of , which makes and temporally shifted and hence decorrelated:
Note that in the sequential online optimization problem, the assumption “ is independent of each other” does not hold. However, in the stochastic online optimization problem and practical neural network settings, our assumption generally holds.
2 Making use of the spatial elements of previous timesteps
Most optimization schemes involve a great many parameters. The dimension of is high, thus and are also of high dimension. However, is element-wisely computed in Equation 14. Specifically, we only use the -th dimension of to calculate the -th dimension of . In other words, it only makes use of the independence between and , where denotes the -th element of . Actually, in the case of high-dimensional and , we can further assume that all elements of gradient at previous timesteps are independent with the -th dimension of . Therefore, all elements in can be used to compute without introducing correlation. To this end, we propose introducing a function over all elements of , i.e.,
For easy reference, we name the elements of other than as the spatial elements of and name the spatial function or spatial operation. There is no restriction on the choice of , and we use for most of our experiments, which is shown to be a good choice. The operation has a side effect that turns the adaptive learning rate into a shared scalar.
An important thing here is that, we no longer interpret as the second moment of . It is merely a random variable that is independent of , while at the same time, reflects the overall gradient scale. We leave further investigations on as future work.
3 Block-wise adaptive learning rate SGD
In practical setting, e.g., deep neural network, usually consists of many parameter blocks, e.g., the weight and bias for each layer. In deep neural network, the gradient scales (i.e., the variance) for different layers tend to be different (Glorot & Bengio, 2010; He et al., 2015). Different gradient scales make it hard to find a learning rate that is suitable for all layers, when using SGD and Momentum methods. In traditional adaptive learning rate methods, they apply element-wise rescaling for each gradient dimension, which achieves rescaling-invariance and somehow solves the above problem. However, Adam sometimes does not generalize better than SGD (Wilson et al., 2017; Keskar & Socher, 2017), which might relate to the excessive learning rate adaptation in Adam.
In our temporal decorrelation with spatial operation scheme, we can solve the “different gradient scales” issue more naturally, by applying block-wisely and outputs a shared adaptive learning rate scalar for each block:
It makes the algorithm work like an adaptive learning rate SGD, where each block has an adaptive learning rate while the relative gradient scale among in-block elements keep unchanged. As illustrated in Algorithm 1, the parameters including the related and are divided into blocks. Every block contains the parameters of the same type or same layer in neural network.
4 Incorporating first moment estimation: moving averaging windows
First moment estimation, i.e., defining as a moving average of , is an important technique of modern first order optimization algorithms, which alleviates mini-batch oscillations. In this section, we extend our algorithm to incorporate first moment estimation.
We have argued that needs to be decorrelated with . Analogously, when introducing the first moment estimation, we need to make and independent to make the expected net update factor unbiased. Based on our assumption of temporal independence, we further keep out the latest gradients , and update and via
In Equation 17, plays the role of decay rate for temporal elements. It can be viewed as a truncated version of exponential moving average that only applied to the latest few elements. Since we use truncating, it is feasible to use large without taking the risk of using too old gradients. In the extreme case where , it becomes vanilla averaging.
The key difference between Adam and the proposed method is that the latter temporally shifts the gradient for -step, i.e., using for calculating and using the kept-out gradients to evaluate (Equation 17), which makes and decorrelated and consequently solves the non-convergence issue. In addition, based on our new perspective on adaptive learning rate methods, is not necessarily the second moment and it is valid to further involve the calculation of with the spatial elements of previous gradients. We thus proposed to introduce the spatial operation that outputs a shared scalar for each block. The resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale in each block is maintained. We name the proposed method that makes use of temporal-shifting to decorrelated and AdaShift, which means “ADAptive learning rate method with temporal SHIFTing”.
Experiments
In this section, we empirically study the proposed method and compare them with Adam, AMSGrad and SGD, on various tasks in terms of training performance and generalization. Without additional declaration, the reported result for each algorithm is the best we have found via parameter grid search. The anonymous code is provided at http://bit.ly/2NDXX6x.
Firstly, we verify our analysis on the stochastic online optimization problem in Equation 7, where we set and . We compare Adam, AMSGrad and AdaShift in this experiment. For fair comparison, we set , and for all these methods. The results are shown in Figure 1(a). We can see that Adam tends to increase , that is, the accumulate update of in Adam is along the wrong direction, while AMSGrad and AdaShift update in the correct direction. Furthermore, given the same learning rate, AdaShift decreases faster than AMSGrad, which validates our argument that AMSGrad has a relatively higher that slows down the training.
In this experiment, we also verify Theorem 1. As shown in Figure 1(b), Adam is also able to converge to the correct direction with a sufficiently large and . Note that (1) AdaShift still converges with the fastest speed; (2) a small (e.g., , the light-blue line in Figure 1(b)) does not make Adam converge to the correct direction. We do not conduct the experiments on the sequential online optimization problem in Equation 6, because it does not fit our temporal independence assumption. To make it converge, one can use a large or , or set as a constant.
2 Logistic Regression and Multilayer Perceptron on MNIST
We further compare the proposed method with Adam, AMSGrad and SGD by using Logistic Regression and Multilayer Perceptron on MNIST, where the Multilayer Perceptron has two hidden layers and each has hidden units with no internal activation. The results are shown in Figure 2 and Figure 3, respectively. We find that in Logistic Regression, these learning algorithms achieve very similar final results in terms of both training speed and generalization. In Multilayer Perceptron, we compare Adam, AMSGrad and AdaShift with reduce-max spatial operation (max-AdaShift) and without spatial operation (non-AdaShift). We observe that max-AdaShift achieves the lowest training loss, while non-AdaShift has mild training loss oscillation and at the same time achieves better generalization. The worse generalization of max-AdaShift may be due to overfitting in this task, and the better generalization of non-AdaShift may stem from the regularization effect of its relatively unstable step size.
3 DenseNet and ResNet on CIFAR-10
ResNet (He et al., 2016) and DenseNet (Huang et al., 2017) are two typical modern neural networks, which are efficient and widely-used. We test our algorithm with ResNet and DenseNet on CIFAR-10 datasets. We use a -layer ResNet and -layer DenseNet in our experiments. We plot the best results of Adam, AMSGrad and AdaShift in Figure 4 and Figure 5 for ResNet and DenseNet, respectively. We can see that AMSGrad is relatively worse in terms of both training speed and generalization. Adam and AdaShift share competitive results, while AdaShift is generally slightly better, especially the test accuracy of ResNet and the training loss of DenseNet.
4 DenseNet with Tiny-ImageNet
We further increase the complexity of dataset, switching from CIFAR-10 to Tiny-ImageNet, and compare the performance of Adam, AMSGrad and AdaShift with DenseNet. The results are shown in Figure 6, from which we can see that the training curves of Adam and AdaShift are basically overlapped, but AdaShift achieves higher test accuracy than Adam. AMSGrad has relatively higher training loss, and its test accuracy is relatively lower at the initial stage.
5 Generative model and Recurrent model
We also test our algorithm on the training of generative model and recurrent model. We choose WGAN-GP (Gulrajani et al., 2017) that involves Lipschitz continuity condition (which is hard to optimize), and Neural Machine Translation (NMT) (Luong et al., 2017) that involves typical recurrent unit LSTM, respectively. In Figure 7(a), we compare the performance of Adam, AMSGrad and AdaShift in the training of WGAN-GP discriminator, given a fixed generator. We notice that AdaShift is significantly better than Adam, while the performance of AMSGrad is relatively unsatisfactory. The test performance in terms of BLEU of NMT is shown in Figure 7(b), where AdaShift achieves a higher BLEU than Adam and AMSGrad.
Conclusion
In this paper, we study the non-convergence issue of adaptive learning rate methods from the perspective of the equivalent accumulated step size of each gradient, i.e., the net update factor defined in this paper. We show that there exists an inappropriate correlation between and , which leads to biased net update factor for each gradient. We demonstrate that such biased step sizes are the fundamental cause of non-convergence of Adam, and we further prove that decorrelating and will lead to unbiased expected step size for each gradient, thus solving the non-convergence problem of Adam. Finally, we propose AdaShift, a novel adaptive learning rate method that decorrelates and via calculating using temporally shifted gradient .
In addition, based on our new perspective on adaptive learning rate methods, is no longer necessarily the second moment of , but a random variable that is independent of and reflects the overall gradient scale. Thus, it is valid to calculate with the spatial elements of previous gradients. We further found that when the spatial operation outputs a shared scalar for each block, the resulting algorithm turns out to be closely related to SGD, where each block has an overall adaptive learning rate and the relative gradient scale in each block is maintained. The experiment results demonstrate that AdaShift is able to solve the non-convergence issue of Adam. In the meantime, AdaShift achieves competitive and even better training and testing performance when compared with Adam.
References
To provide an intuitive impression on the relation among and the convergence of Adam, we let , initialize , vary and among and let Adam go through 2000 timesteps (iterations). The final result of is shown in Figure 8(a). It suggests that for a fixed sequential online optimization problem, both of and determine the direction and speed of Adam optimization process. Furthermore, we also study the threshold point of and , under which Adam will change to the incorrect direction, for each fixed and that vary among . To simplify the experiments, we keep such that the overall gradient of each epoch being . The result is shown in Figure 8(b), which suggests, at the condition of larger or larger , it needs a larger to make Adam stride on the opposite direction. In other words, large and will make the non-convergence rare to happen.
We also conduct the experiment in the stochastic problem to analyze the relation among , , and the convergence behavior of Adam. Results are shown in the Figure 8(c) and Figure 8(d) and the observations are similar to the previous: larger will cause non-convergence more easily and a larger or somehow help to resolve non-convergence issue. In this experiment, we set .
In the sequential online optimization problem Equation 6, let being fixed, define to be the sum of the limits of step updates in a -step epoch:
Let , assuming and are large enough such that , we get the equation:
Equation 19, though being quite complex, tells that both and are closely related to the counterexamples, and there exists a critical condition among these parameters.
Appendix B The AdaShift Pseudo Code
We provided the anonymous code where a Tensorflow implementation of this algorithm is available.
In order to verify the correlation between and in Adam and AdaShift, we conduct experiments to calculate the correlation coefficient between and . We train the Multilayer Perceptron on MNIST until converge and gather the gradient of the second hidden layer of each step. Based on these data, we calculate and the correlation coefficient between and , between and and between and of the last epochs using the Pearson correlation coefficient, which is formulated as follows:
To verify the temporal correlation between and , we range from to and calculate the average temporal correlation coefficient of all variables . Results are shown in Table 1.
To verify the spatial correlation between and , we again range from to and randomly sample some pairs of and and calculate the average spatial correlation coefficient of all the selected pairs. Results are shown in Table 2.
To verify the correlation between and within Adam, we calculate and the average correlation coefficient between and of all variables . The result is 0.435885276.
To verify the correlation between and within non-AdaShift and between and within max-AdaShift, we range the keep number from to to calculate and the average correlation coefficient of all variables . The result is shown in Table 3 and Table 4.
Appendix D Proof of Theorem 1
With bias correction, the formulation of is written as follows
According to L’Hospital‘s rule, we can draw the following:
According to the definition of limitation, let , we have, , , such that
We set to be , then for each dimension of , i.e. ,
So, shares the same sign with in every dimension.
Given it is a convex optimization problem, let the optimal parameter be , and the maximum step size is that holds , we have,
Given , we have , which implies the average regret
Appendix E Proof of Lemma 2
For a fixed , as approach infinity, we get the limit of as:
For a fixed , as approach infinity, we get the limit of as:
Appendix F Proof of Lemma 3
Thus, we can get the forward difference of as:
Hence, we can draw the conclusion: , such that
where is the net update factor for gradient . ∎
Appendix G Proof of Lemma 4
For a bounded random variable and a differentiable function , the expectation of is as follows:
where is variance of , and is as follows:
is the distribution function of . is a small quantity under some condition. And is large enough, such that: for any ,
(Proof of Lemma 4 ) In the stochastic online optimization problem equation 7, the gradient subjects the distribution as:
Then we can get the expectation of :
Meanwhile, under the assumption that gradients are i.i.d., the expectation and variance of are as following when :
Then, for the gradient , the net update factor is as follows:
It should to be clarified that we define equal to zero when . Then we define as:
For the function :
According to lemma 1, we can the expectation of as follows:
where . Then for gradient and , the net update factor is as follows:
We can see that each term in the infinite series of is smaller than the corresponding one in . Thus, .
Appendix H Proof of Lemma 6
We sum up all updates in an epoch, and define the summation as .
Assume and are large enough such that , we get the approximation of limit of as:
Then we can draw the expression of as:
Let , we get the equation about critical condition:
Appendix I Hyper-parameters Investigation
Here, we list all hyper-parameter setting of all above experiments.
In this section, we discuss the learning rate sensitivity of AdaShift. We set and let , and . The results are shown in Figure 9 and Figure 10. Empirically, we found that when using the spatial operation, the best learning rate for AdaShift is around ten times of Adam.
In this section, we discuss the and sensitivity of AdaShift. We set , and let and . The results are shown in Figure 11 and Figure 12. According to the results, AdaShift holds a low sensitivity to and . In some tasks, using the first moment estimation (with and ) or using a large , e.g., can attain better performance. The suggested parameters setting is .
I.4 n𝑛n and m𝑚m Sensitivity
In this section, we discuss the sensitivity of AdaShift. Here we also test a extended version of first moment estimation where it only uses the latest gradients ():
We set , . The results are shown in Figure 13, Figure 14 and Figure 15. In these experiments, AdaShift is fairly stable when changing and . We have not find a clear pattern on the performance change with respect to and .
Appendix J Temporal-only and Spatial-only
In our proposed algorithm, we apply a spatial operation on the temporally shifted gradient to update : . It is based on the temporal independent assumption, i.e., is independent of . And according to our argument in Section 4.2, one can further assume every element in is independent of the -th dimension of .
We purposely avoid involving the spatial elements of the current gradient , where the independence might not holds: when a sample which is rare and has a large gradient appear in the mini-batch , the overall scale of gradient might increase. However, for the temporally already decorrelation , further taking the advantage of the spatial irrelevance will not suffer from this problem.
We here provide extended experiments on two variants of AdaShift: (i) AdaShift (temporal-only), which only uses the vanilla temporal independent assumption and evaluate with: ; (ii) AdaShift (spatial-only), which directly uses the spatial elements without temporal shifting.
According to our experiments, AdaShift (temporal-only), i.e., without the spatial operation, is less stable than AdaShift. In some tasks, AdaShift (temporal-only) works just fine; while in some other cases, AdaShift (temporal-only) suffers from explosive gradient and requires a relatively small learning rate. The performance of AdaShift (spatial-only) is close to Adam. More experiments for AdaShift (spatial-only) are included in the next section.
Appendix K Extended Experiments: Nadam and AdaShift(Space Only)
In this section, we extend the experiments and add the comparisons with Nadam and AdaShift (spatial-only). The results are shown in Figure 18, Figure19 and Figure20. According to these experiments, Nadam and AdaShift (spatial-only) share similar performence as Adam.
Appendix L Extension experiments: ill-conditioned quadratic problem
Rahimi & Recht raise the point, at “test of time” talk at NIPS 2017, that it is suspicious that gradient descent (aka back-propagation) is ultimate solution for optimization. A ill-conditioned quadratic problem with Two Layer Linear Net is showed to be challenging for gradient descent based methods, while alternative solutions, e.g., Levenberg-Marquardt, may converge faster and better. The problem is defined as follows:
where is some known badly conditioned matrix ( or ), and and are the trainable parameters.
We test SGD, Adam and AdaShift with this problem, the results are shown in Figure 21, Figure 24. It turns out as long as the training goes enough long, SGD, Adam, AdaShift all basically converge in this problem. Though SGD is significantly better than Adam and AdaShift.
We would tend to believe this is a general issue of adaptive learning rate method when comparing with vanilla SGD. Because these adaptive learning rate methods generally are scale-invariance, i.e., the step-size in terms of is basically around one, which makes it hard to converge very well in such a ill-conditioning quadratic problem. SGD, in contrast, has a step-size ; as the training converges SGD would have a decreasing step-size, makes it much easier to converge better. The above analysis is confirmed with Figure 22 and Figure 23, with a decreasing learning rate, Adam and AdaShfit both converge very good.