A Sufficient Condition for Convergences of Adam and RMSProp
Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu
Introduction
Large-scale non-convex stochastic optimization , covering a slew of applications in statistics and machine learning such as learning a latent variable from massive data whose probability density distribution is unknown, takes the following generic formulation:
for , where is the learning rate of the -th component of stochastic gradient at the -th iteration. A sufficient condition to ensure the global convergence of vanilla SGD (2) is to require to meet
Although the vanilla SGD algorithm with learning rate satisfying condition (3) does converge, its empirical performance could be still stagnating, since it is difficult to tune an effective learning rate via condition (3).
To further improve the empirical performance of SGD, a large variety of adaptive SGD algorithms, including AdaGrad , RMSProp , Adam , Nadam , etc., have been proposed to automatically tune the learning rate by using second-order moments of historical stochastic gradients. Let and be the linear combinations of the historical second-order moments and stochastic gradient estimates , respectively. Then, the generic iteration scheme of these adaptive SGD algorithms is summarized as
for , where is called base learning rate and it is independent of stochastic gradient estimates for all . Although RMSProp, Adam, and Nadam work well for solving large-scale convex and non-convex optimization problems such as training deep neural networks, they have been pointed out to be divergent in some scenarios via convex counterexamples . This finding thoroughly destroys the fluke of a direct use of these algorithms without any further assumptions or corrections. Recently, developing sufficient conditions to guarantee global convergences of Adam and RMSProp-type algorithms has attracted much attention from both machine learning and optimization communities. The existing successful attempts can be divided into four categories:
(C1) Decreasing a learning rate. Reddi et al. have declared that the core cause of divergences of Adam and RMSProp is largely controlled by the difference between the two adjacent learning rates, i.e.,
Once positive definiteness of is violated, Adam and RMSProp may suffer from divergence . Based on this observation, two variants of Adam called AMSGrad and AdamNC have been proposed with convergence guarantees in both the convex and non-convex stochastic settings by requiring . In addition, Padam extended from AMSGrad has been proposed to contract the generalization gap in training deep neural networks, whose convergence has been ensured by requiring . In the strongly convex stochastic setting, by using the long-term memory technique developed in , Huang et al. have proposed NosAdam by attaching more weights on historical second-order moments to ensure its convergence. Prior to that, the convergence rate of RMSProp has already been established in the convex stochastic setting by employing similar parameters to those of AdamNC .
(C2) Adopting a big batch size. Basu et al. , for the first time, showed that deterministic Adam and RMSProp with original iteration schemes are actually convergent by using full-batch gradient. On the other hand, both Adam and RMSProp can be reshaped as specific signSGD-type algorithms whose convergence rates have been provided in the non-convex stochastic setting by setting batch size as large as the number of maximum iterations . Recently, Zaheer et al. have established convergence rate of original Adam directly in the non-convex stochastic setting by requiring the batch size to be the same order as the number of maximum iterations. We comment that this type of requirement is impractical when Adam and RMSProp are applied to tackle large-scale problems like (1), since these approaches cost a huge number of computations to estimate big-batch stochastic gradients in each iteration.
(C3) Incorporating a temporal decorrelation. By exploring the structure of the convex counterexample in , Zhou et al. have pointed out that the divergence of RMSProp is fundamentally caused by the unbalanced learning rate rather than the absence of . Based on this viewpoint, Zhou et al. have proposed AdaShift by incorporating a temporal decorrelation technique to eliminate the inappropriate correlation between and the current second-order moment , in which the adaptive learning rate is required to be independent of . However, convergence of AdaShift in was merely restricted to RMSProp for solving the convex counterexample in .
(C4) Seeking an analogous surrogate. Due to the divergences of Adam and RMSProp , Zou et al. recently proposed a class of new surrogates called AdaUSM to approximate Adam and RMSProp by integrating weighted AdaGrad with unified heavy ball and Nesterov accelerated gradient momentums. Its convergence rate has also been provided in the non-convex stochastic setting by requiring a non-decreasing weighted sequence. Besides, many other adaptive stochastic algorithms without combining momentums, such as AdaGrad and stagewise AdaGrad , have been guaranteed to be convergent and work well in the non-convex stochastic setting.
In contrast with the above four types of modifications and restrictions, we introduce an alternative easy-to-check sufficient condition (abbreviated as (SC)) to guarantee the global convergences of original Adam and RMSProp. The proposed (SC) merely depends on the parameters in estimating and base learning rate . (SC) neither requires the positive definiteness of like (C1) nor needs the batch size as large as the same order as the number of maximum iterations like (C2) in both the convex and non-convex stochastic settings. Thus, it is easier to verify and more practical compared with (C1)-(C3). On the other hand, (SC) is partially overlapped with (C1) since the proposed (SC) can cover AdamNC , AdaGrad with exponential moving average (AdaEMA) momentum , and RMSProp as instances whose convergences are all originally motivated by requiring the positive definiteness of . While, based on (SC), we can directly derive their global convergences in the non-convex stochastic setting as byproducts without checking the positive definiteness of step by step. Besides, (SC) can serve as an alternative explanation on divergences of original Adam and RMSProp, which are possibly due to incorrect parameter settings for accumulating the historical second-order moments rather than the unbalanced learning rate caused by the inappropriate correlation between and like (C3). In addition, AdamNC and AdaEMA are convergent under (SC), but violate (C3) in each iteration.
Moreover, by carefully reshaping the iteration scheme of Adam, we obtain a specific weighted AdaGrad with exponential moving average momentum, which extends the weighted AdaGrad with heavy ball momentum and Nesterov accelerated gradient momentum in two aspects: the new momentum mechanism and the new base learning rate setting provide a new perspective for understanding Adam and RMSProp. At last, we experimentally verify (SC) by applying Adam and RMSProp with different parameter settings to solve the counterexample and train deep neural networks including LeNet and ResNet . In summary, the contributions of this work are five-fold:
We introduce an easy-to-check sufficient condition to ensure the global convergences of original Adam and RMSProp in the non-convex stochastic setting. Moreover, this sufficient condition is distinctive from the existing conditions (C1)-(C4) and is easier to verify.
We reshape Adam as weighted AdaGrad with exponential moving average momentum, which provides a new perspective for understanding Adam and RMSProp and also complements AdaUSM in .
We provide a new explanation on the divergences of original Adam and RMSProp, which are possibly due to an incorrect parameter setting of the combinations of historical second-order moments based on (SC).
We find that the sufficient condition extends the restrictions of RMSProp and covers many convergent variants of Adam, e.g., AdamNC, AdaGrad with momentum, etc. Thus, their convergences in the non-convex stochastic setting naturally hold.
We conduct experiments to validate the sufficient condition for the convergences of Adam/RMSProp. The experimental results match our theoretical results.
Generic Adam
Generic Adam covers RMSProp by setting . Moreover, it covers Adam with a bias correction as follows:
The original Adam with the bias correction takes constant parameters and . The iteration scheme is written as , with and . Let . Then, the above can be rewritten as . Thus, it is equivalent to taking constant , constant , and new base learning rate in Generic Adam.
Now we show that Generic Adam can be reformulated as a new type of weighted AdaGrad algorithms with exponential moving average momentum (Weighted AdaEMA).
The Weighted AdaEMA is a natural generalization of the AdaGrad algorithm. The classical AdaGrad is to take the weights , the momentum factors , and the parameters for constant .
The following proposition states the equivalence between Generic Adam and Weighted AdaEMA.
Algorithm 1 and Algorithm 2 are equivalent.
The divergence issue of Adam/RMSProp. When is taken constant, i.e., , Reddi et al. have pointed out that Adam and RMSProp () can be divergent even in the convex setting. They conjectured that the divergence is possibly due to the uncertainty of positive definiteness of in Eq. (5). This idea has motivated many new convergent variants of Adam by forcing . Recently, Zhou et al. further argued that the nature of divergences of Adam and RMSProp is possibly due to the unbalanced learning rate caused by the inappropriate correlation between and by studying the counterexample in . However, this explanation can be violated by many existing convergent Adam-type algorithms such as AdamNC, NosAdam , etc. So far, there is no satisfactory explanation for the core reason of the divergence issue. We will provide more insights in Section 4 based on our theoretical analysis.
Main Results
In this section, we characterize the upper-bound of gradient residual of problem (1) as a function of parameters . Then the convergence rate of Generic Adam is derived directly by specifying appropriate parameters . Below, we state the necessary assumptions that are commonly used for analyzing the convergence of a stochastic algorithm for non-convex problems:
To establish the upper-bound, we also suppose that the parameters , , and satisfy the restrictions:
The parameters satisfy for all for some constant ;
The parameters satisfy and is non-decreasing in with ;
The parameters satisfy that is “almost” non-increasing in , by which we mean that there exist a non-increasing sequence and a positive constant such that .
The restriction (R3) indeed says that is the product between some non-increasing sequence and some bounded sequence. This is a slight generalization of itself being non-decreasing. If itself is non-increasing, we can then take and . For most of the well-known Adam-type methods, is indeed non-decreasing, for instance, for AdaGrad with EMA momentum we have and , so is constant; for Adam with constant and non-increasing (say or ), is non-increasing. The motivation, instead of being decreasing, is that it allows us to deal with the bias correction steps in Adam .
We fix a positive constant In the special case that is constant, we can directly set . such that . Let and
where is the maximum of the indices with . The finiteness of is guaranteed by the fact that . When there are no such indices, i.e., , we take by convention. In general, . Our main results on estimating gradient residual state as follows:
Let be a sequence generated by Generic Adam for initial values , , and . Assume that and stochastic gradients satisfy assumptions (A1)-(A4). Let be randomly chosen from with equal probabilities . We have
where C^{\prime}\!=\!{2C_{0}^{2}C_{3}d\sqrt{G^{2}\!+\!\epsilon d}}{\big{/}}{[(1\!-\!\beta)\theta_{1}]} and
where and are defined as and C_{3}=\frac{C_{0}}{\sqrt{C_{1}}(1-\sqrt{\gamma})}\big{[}\frac{C_{0}^{2}\chi_{1}L}{C_{1}(1-\sqrt{\gamma})^{2}}+2\big{(}\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\big{)}^{2}G\big{]}.
Suppose the same setting and hypothesis as Theorem 4. Let be randomly chosen from with equal probabilities . Then for any , the following bound holds with probability at least :
where and are defined as those in Theorem 4.
Take with . Suppose . Then the in Theorem 5 is bounded from below by constants
In particular, when , we have the following more subtle estimate on lower and upper-bounds for
(i) Corollary 7 shows that if , the bound in Theorem 5 is only , hence not guaranteeing convergence. This result is not surprising as Adam with constant has already shown to be divergent . Hence, is its best convergence rate we can expect. We will discuss this case in more details in Section 4. (ii) Corollary 7 also indicates that in order to guarantee convergence, the parameter has to satisfy . Although we do not assume this in our restrictions (R1)-(R3), it turns out to be the consequence from our analysis. Note that if in (R1) and , then the restriction is automatically satisfied in (R2).
We are now ready to give the Sufficient Condition (SC) for convergence of Generic Adam based on Theorem 5.
Generic Adam is convergent if the parameters , , and satisfy
and is non-decreasing in ;
is “almost” non-increasing;
\big{(}{\sum_{t=1}^{T}\alpha_{t}\sqrt{1-\theta_{t}}}\big{)}{\big{/}}\big{(}{T\alpha_{T}}\big{)}=o(1).
We now provide the convergence rate of Generic Adam with a specific class of parameters , i.e.,
for positive constants , where is taken such that . Note that can be taken bigger than 1. When , we can take and then . To guarantee (R3), we require . For such a family of parameters we have the following corollary.
Generic Adam with the above family of parameters converges as long as , and its non-asymptotic convergence rate is given by
Corollary 10 recovers and extends the results of some well-known algorithms below:
AdaGrad with exponential moving average (EMA). When , , and , Generic Adam is exactly AdaGrad with EMA momentum (AdaEMA) . In particular, if , this is the vanilla coordinate-wise AdaGrad. It corresponds to taking and in Corollary 10. Hence, AdaEMA has convergence rate .
AdamNC. Taking , , and in Generic Adam, where is the decay factor for the momentum factors , we recover AdamNC . Its convergence rate can be directly derived via Corollary 10.
RMSProp. Mukkamala and Hein have reached the same convergence rate for RMSprop with , when and under the convex assumption. Since RMSProp is essentially Generic Adam with all momentum factors , we recover Mukkamala and Hein’s results by taking and in Corollary 10. Moreover, our result generalizes to the non-convex stochastic setting, and it holds for all rather than only .
As Weighted AdaEMA is equivalent to Generic Adam, we present its convergence rate with specific polynomial growth weights in the following corollary.
Suppose in Weighted AdaEMA the weights for , and . Then Weighted AdaEMA has the non-asymptotic convergence rate.
Zou et al. proposed weighted AdaGrad with a unified momentum form which incorporates Heavy Ball (HB) momentum and Nesterov Accelerated Gradients (NAG) momentum. The same convergence rate was established for weights with polynomial growth. Our result complements by showing that the same convergence rate also holds for exponential moving average momentum.
(i) Huang et al. proposed Nostalgic Adam (NosAdam) which corresponds to taking the learning rate and with for , and We directly use and along with the notations of NosAdam . in Generic Adam. The idea of NosAdam is to guarantee by laying more weights on the historical second-order moments. A special case of NosAdam is NosAdam-HH which takes for as the hyper-harmonic series. Its convergence rate is established in the strongly convex stochastic setting. NosAdam-HH can be viewed as the Weighted AdaEMA taking and for .
(ii) Corollary 12 differs from the motivation of NosAdam as the weights we consider are for . Note that in both cases when , this is the AdaGrad algorithm, which corresponds to assigning equal weights to the past squares of gradients. Hence, we are actually in the opposite direction of NosAdam. We are more interested in the case of assigning more weights to the recent stochastic gradients. This can actually be viewed as a situation between AdaGrad and the original Adam with constant ’s.
Comparison between (SC) and (C1). Most of the convergent modifications of original Adam, such as AMSGrad, AdamNC, and NosAdam, all require in Eq. (5), which is equivalent to decreasing the adaptive learning rate step by step. Since the term (or adaptive learning rate ) involves the past stochastic gradients (hence not deterministic), the modification to guarantee either needs to change the iteration scheme of Adam (like AMSGrad) or needs to impose some strong restrictions on the base learning rate and (like AdamNC). Our sufficient condition provides an easy-to-check criterion for the convergence of Generic Adam in Corollary 9. It is not necessary to require . Moreover, we use exactly the same iteration scheme as original Adam without any modifications. Our work shows that the positive definiteness of may not be the essential issue for divergence of original Adam. It is probably due to that the parameters are not set correctly.
The currently most popular RMSProp and Adam’s parameter setting takes constant , i.e., . The motivation behind is to use the exponential moving average of squares of past stochastic gradients. In practice, parameter is recommended to be set very close to 1. For instance, a commonly adopted is taken as 0.999.
Although great performance in practice has been observed, such a constant parameter setting has the serious flaw that there is no convergence guarantee even for convex optimization, as proved by the counterexamples in . Ever since much work has been done to analyze the divergence issue of Adam and to propose modifications with convergence guarantees, as summarized in the introduction section. However, there is still not a satisfactory explanation that touches the fundamental reason of the divergence. Below, we try to provide more insights for the divergence issue of Adam/RMSProp with constant parameter , based on our analysis of the sufficient condition for convergence.
From the sufficient condition perspective. Let for and . According to Corollary 7, in Theorem 5 has the following estimate:
The bounds tell us some points on Adam with constant :
, so the convergence is not guaranteed. This result coincides with the divergence issue demonstrated in . Indeed, since in this case Adam is not convergent, this is the best bound we can have.
Consider the dependence on parameter . The bound is decreasing in . The best bound in this case is when , i.e., the base learning rate is taken constant. This explains why in practice taking a more aggressive constant base learning rate often leads to even better performance, comparing with taking a decaying one.
Consider the dependence on parameter . Note that the constants and depend on instead of the whole sequence . We can always set for while fix , by which we can take and independent of constant . Then the principal term of is linear in , so decreases to zero as . This explains why setting close to 1 in practice often results in better performance in practice.
Moreover, Corollary 10 shows us how the convergence rate continuously changes when we continuously verify parameters . Let us fix and consider the following continuous family of parameters with :
Note that when , then , this is the AdaEMA, which has the convergence rate ; when , then , this is the original Adam with constant , which only has the bound; when , by Corollary 10, the algorithm has the convergence rate. Along this continuous family of parameters, we observe that the theoretical convergence rate continuously deteriorates as the real parameter decreases from 1 to 0, namely, as we gradually shift from AdaEMA to Adam with constant . In the limiting case, the latter is not guaranteed with convergence any more. This phenomenon is empirically verified by the Synthetic Counterexample in Section 5.
From the Weighted AdaEMA perspective. Since Generic Adam is equivalent to Weighted AdaEMA, we can examine Adam with in terms of Weighted AdaEMA. In this case, we find that the associated sequence of weights is growing in an exponential order. Corollary 12 shows that as long as the weights are in polynomial growth, Weighted AdaEMA is convergent and its convergence rate is . This indicates that the exponential-moving-average technique in the estimate of second-order moments may assign a too aggressive weight to the current gradient, which leads to the divergence.
Experiments
In this section, we experimentally validate the proposed sufficient condition by applying Generic Adam and RMSProp to solve the counterexample and to train LeNet on the MNIST dataset and ResNet on the CIFAR-100 dataset , respectively.
In this experiment, we verify the phenomenon described in Section 4 that how the convergence rate of Generic Adam gradually changes along a continuous path of families of parameters on the one-dimensional counterexample in :
where is the number of maximum iterations, with probability 0.01, and with probability 0.99.
Sensitivity of parameter . We set , , , and as with , respectively. Note that when , Generic Adam reduces to the originally divergent Adam with . When , Generic Adam reduces to the AdaEMA with .
The experimental results are shown in Figures 1(a) and 1(d). We can see that for , and , Generic Adam is convergent. Moreover, the convergence becomes slower when decreases, which exactly matches Corollary 10. On the other hand, for and , Figure 1(d) shows that they do not converge. It seems that the divergence for contradicts our theory. However, this is because when is very small, the convergence rate is so slow that we may not see a convergent trend in even iterations. Indeed, for , we actually have
which is not sufficiently close to 1. As a complementary experiment, we fix the numerator and only change when is small. We take and as the same, while for and , respectively. The result is shown in Figures 1(b) and 1(e). We can see that Generic Adam with is indeed convergent in this situation.
Sensitivity of parameter . Now, we show the sensitivity of of the sufficient condition (SC) by fixing and selecting from the collection . Figures 1(c) and 1(e) illustrate the sensitivity of parameter when Generic Adam is applied to solve the counterexample (9). The performance shows that when is fixed, smaller can lead to a faster and better convergence speed, which also coincides with the convergence results in Corollary 10.
2 LeNet on MNIST and ResNet-18 on CIFAR-100
In this subsection, we apply Generic Adam to train LeNet on the MNIST dataset and ResNet-18 on the CIFAR-100 dataset, respectively, in order to validate the convergence rates in Corollary 10. Meanwhile, the comparisons between Generic Adam and AMSGrad are also provided to distinguish their differences in training deep neural networks. We illustrate the performance profiles in three aspects: training loss vs. epochs, test loss vs. epochs, and test accuracy vs. epochs, respectively. Besides, the architectures of LeNet and ResNet-18, and the statistics of the MNIST and CIFAR-100 datasets are described in the supplementary material.
In the experiments, for Generic Adam, we set with and , respectively; for RMSProp, we set and along with the parameter settings in . For fairness, the base learning rates in Generic Adam, RMSProp, and AMSGrad are all set as . Figures 3 and 3 illustrate the results of Generic Adam with different , RMSProp, and AMSGrad for training LeNet on MNIST and training ResNet-18 on CIFAR-100, respectively. We can see that AMSGrad and Adam (Generic Adam with ) decrease the training loss slowest and show the worst test accuracy among the compared optimizers. One possible reason is due to the use of constant in AMSGrad and original Adam. By Figures 3 and 3, we can observe that the convergences of Generic Adam are extremely sensitive to the choice of parameter . Larger can contribute to a faster convergence rate of Generic Adam, which corroborates the theoretical result in Corollary 10. Additionally, the test accuracies in Figures 3(b) and 3(b) indicate that a smaller training loss can contribute to a higher test accuracy for Generic Adam.
Conclusions
In this work, we delved into the convergences of Adam and RMSProp, and presented an easy-to-check sufficient condition to guarantee their convergences in the non-convex stochastic setting. This sufficient condition merely depends on the base learning rate and the linear combination parameter of second-order moments. Relying on this sufficient condition, we found that the divergences of Adam and RMSProp are possibly due to the incorrect parameter settings of and . In addition, we reformulated Adam as weighted AdaGrad with exponential moving average momentum, which provides a novel perspective for understanding Adam and RMSProp. At last, the correctness of theoretical results was also verified via the counterexample in and training deep neural networks on real-world datasets.
References
Notations
Appendix A Key Lemmas
In this section we provide the necessary lemmas for the proofs of Theorems 4 and 5.
Given and a non-negative sequence , let for . Then the following estimate holds
The finite sum can be interpreted as a Riemann sum Since is decreasing on the interval , we have
Let and be two non-negative sequences. Let for . Then
Let and satisfy the restrictions (R2) and (R3). For any , we have
For any , since the sequence is non-increasing, we have . Hence,
which proves the first inequality. On the other hand, since is non-decreasing, it holds
Let for and by convention.
Fix a constant with . Let be as given as Eq. (6) in the main paper. For any , we have
For any , since for , and for , we have
We take the constant , where is the maximum of the indices for which . The proof is completed. ∎
If is a constant, we have . In this case we can take and .
Let . We have the following estimate
Let for and by convention. By the iteration formula and , we have
Similarly, by and , we have
Note that is non-decreasing by (R2), and by (R1). By Lemma 18, we have
Combining Eq. (18) and Eq. (19), we obtain the desired Eq. (16). The proof is completed. ∎
where .
To estimate (I), by the Schwartz inequality and the Lipschitz continuity of the gradient of , we have
Note that . Therefore,
where . The last inequality holds due to as . Therefore, we have
Combining Eq. (27), Eq. (33), and Eq. (34), we obtain
The term (IV) is estimated similarly as term (III). First, we have
where is the constant defined above. We have
Combining Eq. (22), Eq. (23), Eq. (25), Eq. (26), Eq. (35), and Eq. (37), we obtain
Let denote the constant . Then
Next, we estimate Eq. (21). When , we have
The same as what we did for term (I) in Lemma 21, we have
Then the similar argument as Eq. (33) implies that
Note that , so we have . By Lemma 18, this follows that for all . On the other hand,
Since for , it follows that
It is straightforward to acquire by induction that
By Lemma 17, it holds for any . By Lemma 18, . In addition, . Hence,
Combining Eq. (51) and Eq. (52), we then obtain the desired estimate Eq. (47). The proof is completed. ∎
Let and . Let . We therefore have
Note that and , so it holds that and Then, It follows that
Writing the norm in terms of coordinates, we obtain
The second inequality is due to the convex inequality . Indeed, we have the more general convex inequality that
for any positive random variable . Taking to be in the right hand side of Eq. (57), we obtain that
The last inequality is due to the following trivial inequality:
for any non-negative parameters and . It then follows that
For simplicity of notations, let , and . Note that . Hence,
Let . By Lemma 25, we have
Since is a non-increasing sequence, we have . By Eq. (63), we have
Note that . Combining Eq. (62), Eq. (63), and Eq. (65), we have
Note that for all . It follows that
Note that . By Eq. (62) and Eq. (65), we have
Let be randomly chosen from with equal probabilities . We have the following estimate
For any two random variables and , by the Hölder’s inequality, we have
Let , , and let , . By Eq. (69), we have
Since , and all entries are non-negative, we have
By Eq. (70), Eq. (71), and Eq. (72), we obtain
By Lemma 17, for any , so . Then, we obtain
Appendix B Proofs of the main results
In this section, we provide the detailed proofs of the propositions, theorems, and corollaries in the main body.
Algorithm 1 and Algorithm 2 are equivalent.
It suffices to show that Algorithm 1 can be realized as Algorithm 2 with a particular choice of parameters, and vice versa. Note that for Algorithm 1, it holds
Hence, given the parameters in Algorithm 1, we take . Then it holds
It follows that Eq. (77) becomes Eq. (76). Conversely, given the parameters of Algorithm 2, we take . Then Eq. (76) becomes Eq. (77). The proof is completed. ∎
B.2 Proof of Theorem 4
Let be a sequence generated by Generic Adam for initial values , , and . Assume that and stochastic gradients satisfy assumptions (A1)-(A4). Let be randomly chosen from with equal probabilities . We have the following estimate
where the constants and are given by
By the -Lipschitz continuity of the gradient of and the descent lemma, we have
where is the constant given in Lemma 24. By applying the estimates in Lemma 25 and Lemma 27 for the second and third terms in the right hand side of Eq. (81), and appropriately rearranging the terms, we obtain
B.3 Proof of Theorem 5
Let be a sequence generated by Generic Adam for initial values , , and . Assume that and stochastic gradients satisfy assumptions (A1)-(A4). Let be randomly chosen from with equal probabilities . Then for any , the following bound holds with probability at least :
where the constants and are given by
in which the constant is given by
Namely, . Therefore, . This completes the proof. ∎
B.4 Proof of Corollary 7
Take with . Suppose .Then in Theorem 5 is bounded from below by constants
In particular, when , we have the following more subtle estimate on lower and upper-bounds for :
Since , and is non-decreasing, we have . Hence, by Theorem 5, it holds
If, in particular, , by Theorem 5 we have
Combining Eqs. (87)-(88), we obtain the desired result. ∎
B.5 Proof of Corollary 10
Generic Adam with the above family of parameters converges as long as , and its non-asymptotic convergence rate is given by
It is not hard to verify that the following equalities hold:
In this case, . Therefore, by Theorem 5 the non-asymptotic convergence rate is given by
To guarantee convergence, then . ∎
B.6 Proof of Corollary 12
Suppose that in Weighted AdaEMA the weights for , and . Then Weighted AdaEMA has the non-asymptotic convergence rate.
By the proof procedures of Theorem 3, the equivalent Generic Adam has the parameters , where . Hence, it holds
We have that and is increasing. In addition, we have that is bounded, and hence “almost” non-increasing (by taking in (R3)). The restrictions (R1)-(R3) are all satisfied. Hence, we can apply Theorem 5 in this case. It follows that its convergence rate is given by
Appendix C Experimental Implementations
In this section, we describe the statistics of the training and validation datasets of MNISThttp://yann.lecun.com/exdb/mnist/ and CIFAR-100https://www.cs.toronto.edu/ kriz/cifar.html, the architectures of LeNet and ResNet-18, and detailed implementations.
MNIST is composed of ten classes of digits among , which includes 60,000 training examples and 10,000 validation examples. The dimension of each example is .
CIFAR-100 is composed of 100 classes of color images. Each class includes 6,000 images. In addition, these images are devided into 50,000 training examples and 10,000 validation examples.
C.2 Architectures of Neural Networks
C.3 Additional Experiments of ResNet-18 on CIFAR-100
We further illustrate Generic Adam with different , RMSProp, and AMSGrad with an alternative base learning rate on ResNet-18. We do cut-off by taking if . Note that is still non-increasing. The motivation is that at the very beginning the learning rate could be large which would deteriorate the performance. The performance profiles are also exactly in accordance with the analysis in theory, i.e., larger leads to a faster training process.