A Sufficient Condition for Convergences of Adam and RMSProp

Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu

Introduction

Large-scale non-convex stochastic optimization , covering a slew of applications in statistics and machine learning such as learning a latent variable from massive data whose probability density distribution is unknown, takes the following generic formulation:

for $k=1,2,\ldots,d$ , where $\eta_{t,k}\geq 0$ is the learning rate of the $k$ -th component of stochastic gradient $\bm{g}(\bm{x}_{t},\xi_{t})$ at the $t$ -th iteration. A sufficient condition to ensure the global convergence of vanilla SGD (2) is to require $\eta_{t}$ to meet

Although the vanilla SGD algorithm with learning rate $\eta_{t}$ satisfying condition (3) does converge, its empirical performance could be still stagnating, since it is difficult to tune an effective learning rate $\eta_{t}$ via condition (3).

To further improve the empirical performance of SGD, a large variety of adaptive SGD algorithms, including AdaGrad , RMSProp , Adam , Nadam , etc., have been proposed to automatically tune the learning rate $\eta_{t}$ by using second-order moments of historical stochastic gradients. Let $\bm{v}_{t,k}$ and $\bm{m}_{t,k}$ be the linear combinations of the historical second-order moments $(\bm{g}^{2}_{1,k},\bm{g}^{2}_{2,k},\cdots,\bm{g}^{2}_{t,k})$ and stochastic gradient estimates $(\bm{g}_{1,k},\bm{g}_{2,k},\cdots,\bm{g}_{t,k})$ , respectively. Then, the generic iteration scheme of these adaptive SGD algorithms is summarized as

for $k=1,2,\ldots,d$ , where $\alpha_{t}>0$ is called base learning rate and it is independent of stochastic gradient estimates $(\bm{g}_{1,k},\bm{g}_{2,k},\cdots,\bm{g}_{t,k})$ for all $t\geq 1$ . Although RMSProp, Adam, and Nadam work well for solving large-scale convex and non-convex optimization problems such as training deep neural networks, they have been pointed out to be divergent in some scenarios via convex counterexamples . This finding thoroughly destroys the fluke of a direct use of these algorithms without any further assumptions or corrections. Recently, developing sufficient conditions to guarantee global convergences of Adam and RMSProp-type algorithms has attracted much attention from both machine learning and optimization communities. The existing successful attempts can be divided into four categories:

(C1) Decreasing a learning rate. Reddi et al. have declared that the core cause of divergences of Adam and RMSProp is largely controlled by the difference between the two adjacent learning rates, i.e.,

Once positive definiteness of $\Gamma_{t}$ is violated, Adam and RMSProp may suffer from divergence . Based on this observation, two variants of Adam called AMSGrad and AdamNC have been proposed with convergence guarantees in both the convex and non-convex stochastic settings by requiring $\Gamma_{t}\succ 0$ . In addition, Padam extended from AMSGrad has been proposed to contract the generalization gap in training deep neural networks, whose convergence has been ensured by requiring $\Gamma_{t}\succ 0$ . In the strongly convex stochastic setting, by using the long-term memory technique developed in , Huang et al. have proposed NosAdam by attaching more weights on historical second-order moments to ensure its convergence. Prior to that, the convergence rate of RMSProp has already been established in the convex stochastic setting by employing similar parameters to those of AdamNC .

(C2) Adopting a big batch size. Basu et al. , for the first time, showed that deterministic Adam and RMSProp with original iteration schemes are actually convergent by using full-batch gradient. On the other hand, both Adam and RMSProp can be reshaped as specific signSGD-type algorithms whose $\mathcal{O}(1/\sqrt{T})$ convergence rates have been provided in the non-convex stochastic setting by setting batch size as large as the number of maximum iterations . Recently, Zaheer et al. have established $\mathcal{O}(1/\sqrt{T})$ convergence rate of original Adam directly in the non-convex stochastic setting by requiring the batch size to be the same order as the number of maximum iterations. We comment that this type of requirement is impractical when Adam and RMSProp are applied to tackle large-scale problems like (1), since these approaches cost a huge number of computations to estimate big-batch stochastic gradients in each iteration.

(C3) Incorporating a temporal decorrelation. By exploring the structure of the convex counterexample in , Zhou et al. have pointed out that the divergence of RMSProp is fundamentally caused by the unbalanced learning rate rather than the absence of $\Gamma_{t}\succ 0$ . Based on this viewpoint, Zhou et al. have proposed AdaShift by incorporating a temporal decorrelation technique to eliminate the inappropriate correlation between $\bm{v}_{t,k}$ and the current second-order moment $\bm{g}_{t,k}^{2}$ , in which the adaptive learning rate $\eta_{t,k}$ is required to be independent of $\bm{g}^{2}_{t,k}$ . However, convergence of AdaShift in was merely restricted to RMSProp for solving the convex counterexample in .

(C4) Seeking an analogous surrogate. Due to the divergences of Adam and RMSProp , Zou et al. recently proposed a class of new surrogates called AdaUSM to approximate Adam and RMSProp by integrating weighted AdaGrad with unified heavy ball and Nesterov accelerated gradient momentums. Its $\mathcal{O}(\log{(T)}/\sqrt{T})$ convergence rate has also been provided in the non-convex stochastic setting by requiring a non-decreasing weighted sequence. Besides, many other adaptive stochastic algorithms without combining momentums, such as AdaGrad and stagewise AdaGrad , have been guaranteed to be convergent and work well in the non-convex stochastic setting.

In contrast with the above four types of modifications and restrictions, we introduce an alternative easy-to-check sufficient condition (abbreviated as (SC)) to guarantee the global convergences of original Adam and RMSProp. The proposed (SC) merely depends on the parameters in estimating $\bm{v}_{t,k}$ and base learning rate $\alpha_{t}$ . (SC) neither requires the positive definiteness of $\Gamma_{t}$ like (C1) nor needs the batch size as large as the same order as the number of maximum iterations like (C2) in both the convex and non-convex stochastic settings. Thus, it is easier to verify and more practical compared with (C1)-(C3). On the other hand, (SC) is partially overlapped with (C1) since the proposed (SC) can cover AdamNC , AdaGrad with exponential moving average (AdaEMA) momentum , and RMSProp as instances whose convergences are all originally motivated by requiring the positive definiteness of $\Gamma_{t}$ . While, based on (SC), we can directly derive their global convergences in the non-convex stochastic setting as byproducts without checking the positive definiteness of $\Gamma_{t}$ step by step. Besides, (SC) can serve as an alternative explanation on divergences of original Adam and RMSProp, which are possibly due to incorrect parameter settings for accumulating the historical second-order moments rather than the unbalanced learning rate caused by the inappropriate correlation between $\bm{v}_{t,k}$ and $\bm{g}^{2}_{t,k}$ like (C3). In addition, AdamNC and AdaEMA are convergent under (SC), but violate (C3) in each iteration.

Moreover, by carefully reshaping the iteration scheme of Adam, we obtain a specific weighted AdaGrad with exponential moving average momentum, which extends the weighted AdaGrad with heavy ball momentum and Nesterov accelerated gradient momentum in two aspects: the new momentum mechanism and the new base learning rate setting provide a new perspective for understanding Adam and RMSProp. At last, we experimentally verify (SC) by applying Adam and RMSProp with different parameter settings to solve the counterexample and train deep neural networks including LeNet and ResNet . In summary, the contributions of this work are five-fold:

We introduce an easy-to-check sufficient condition to ensure the global convergences of original Adam and RMSProp in the non-convex stochastic setting. Moreover, this sufficient condition is distinctive from the existing conditions (C1)-(C4) and is easier to verify.

We reshape Adam as weighted AdaGrad with exponential moving average momentum, which provides a new perspective for understanding Adam and RMSProp and also complements AdaUSM in .

We provide a new explanation on the divergences of original Adam and RMSProp, which are possibly due to an incorrect parameter setting of the combinations of historical second-order moments based on (SC).

We find that the sufficient condition extends the restrictions of RMSProp and covers many convergent variants of Adam, e.g., AdamNC, AdaGrad with momentum, etc. Thus, their convergences in the non-convex stochastic setting naturally hold.

We conduct experiments to validate the sufficient condition for the convergences of Adam/RMSProp. The experimental results match our theoretical results.

Generic Adam

Generic Adam covers RMSProp by setting $\beta_{t}=0$ . Moreover, it covers Adam with a bias correction as follows:

The original Adam with the bias correction takes constant parameters $\beta_{t}=\beta$ and $\theta_{t}=\theta$ . The iteration scheme is written as $\bm{x}_{t+1}\!=\!\bm{x}_{t}\!-\!\widehat{\alpha}_{t}\frac{\widehat{\bm{m}}_{t}}{\sqrt{\widehat{\bm{v}}_{t}}}$ , with $\widehat{\bm{m}}_{t}\!=\!\frac{\bm{m}_{t}}{1-\beta^{t}}$ and $\widehat{\bm{v}}_{t}\!=\!\frac{\bm{v}_{t}}{1-\theta^{t}}$ . Let $\alpha_{t}=\widehat{\alpha}_{t}\frac{\sqrt{1-\theta^{t}}}{1-\beta^{t}}$ . Then, the above can be rewritten as $\bm{x}_{t+1}=\bm{x}_{t}-{\alpha_{t}\bm{m}_{t}}/\sqrt{\bm{v}_{t}}$ . Thus, it is equivalent to taking constant $\beta_{t}$ , constant $\theta_{t}$ , and new base learning rate $\alpha_{t}$ in Generic Adam.

Now we show that Generic Adam can be reformulated as a new type of weighted AdaGrad algorithms with exponential moving average momentum (Weighted AdaEMA).

The Weighted AdaEMA is a natural generalization of the AdaGrad algorithm. The classical AdaGrad is to take the weights $w_{t}=1$ , the momentum factors $\beta_{t}=0$ , and the parameters $\alpha_{t}=\eta/\sqrt{t+1}$ for constant $\eta$ .

The following proposition states the equivalence between Generic Adam and Weighted AdaEMA.

Algorithm 1 and Algorithm 2 are equivalent.

The divergence issue of Adam/RMSProp. When $\theta_{t}$ is taken constant, i.e., $\theta_{t}=\theta$ , Reddi et al. have pointed out that Adam and RMSProp ( $\beta_{t}=0$ ) can be divergent even in the convex setting. They conjectured that the divergence is possibly due to the uncertainty of positive definiteness of $\Gamma_{t}$ in Eq. (5). This idea has motivated many new convergent variants of Adam by forcing $\Gamma_{t}\succ 0$ . Recently, Zhou et al. further argued that the nature of divergences of Adam and RMSProp is possibly due to the unbalanced learning rate $\eta_{t}$ caused by the inappropriate correlation between $v_{t,k}$ and $g^{2}_{t,k}$ by studying the counterexample in . However, this explanation can be violated by many existing convergent Adam-type algorithms such as AdamNC, NosAdam , etc. So far, there is no satisfactory explanation for the core reason of the divergence issue. We will provide more insights in Section 4 based on our theoretical analysis.

Main Results

In this section, we characterize the upper-bound of gradient residual of problem (1) as a function of parameters $(\theta_{t},\alpha_{t})$ . Then the convergence rate of Generic Adam is derived directly by specifying appropriate parameters $(\theta_{t},\alpha_{t})$ . Below, we state the necessary assumptions that are commonly used for analyzing the convergence of a stochastic algorithm for non-convex problems:

To establish the upper-bound, we also suppose that the parameters $\{\beta_{t}\}$ , $\{\theta_{t}\}$ , and $\{\alpha_{t}\}$ satisfy the restrictions:

The parameters $\{\beta_{t}\}$ satisfy $0\leq\beta_{t}\leq\beta<1$ for all $t$ for some constant $\beta$ ;

The parameters $\{\theta_{t}\}$ satisfy $0<\theta_{t}<1$ and $\theta_{t}$ is non-decreasing in $t$ with $\theta:=\lim_{t\to\infty}\theta_{t}>\beta^{2}$ ;

The parameters $\{\alpha_{t}\}$ satisfy that $\chi_{t}:=\frac{\alpha_{t}}{\sqrt{1-\theta_{t}}}$ is “almost” non-increasing in $t$ , by which we mean that there exist a non-increasing sequence $\{a_{t}\}$ and a positive constant $C_{0}$ such that $a_{t}\leq\chi_{t}\leq C_{0}a_{t}$ .

The restriction (R3) indeed says that $\chi_{t}$ is the product between some non-increasing sequence $\{a_{t}\}$ and some bounded sequence. This is a slight generalization of $\chi_{t}$ itself being non-decreasing. If $\chi_{t}$ itself is non-increasing, we can then take $a_{t}\!=\!\chi_{t}$ and $C_{0}\!=\!1$ . For most of the well-known Adam-type methods, $\chi_{t}$ is indeed non-decreasing, for instance, for AdaGrad with EMA momentum we have $\alpha_{t}=\eta/\sqrt{t}$ and $\theta_{t}\!=\!1-1/t$ , so $\chi_{t}=\eta$ is constant; for Adam with constant $\theta_{t}\!=\!\theta$ and non-increasing $\alpha_{t}$ (say $\alpha_{t}\!=\!\eta/\sqrt{t}$ or $\alpha_{t}\!=\!\eta$ ), $\chi_{t}\!=\!\alpha_{t}/\sqrt{1-\theta}$ is non-increasing. The motivation, instead of $\chi_{t}$ being decreasing, is that it allows us to deal with the bias correction steps in Adam .

We fix a positive constant $\theta^{\prime}>0$ In the special case that $\theta_{t}=\theta$ is constant, we can directly set $\theta^{\prime}=\theta$ . such that $\beta^{2}<\theta^{\prime}<\theta$ . Let $\gamma:={\beta^{2}}/{\theta^{\prime}}<1$ and

where $N$ is the maximum of the indices $j$ with $\theta_{j}<\theta^{\prime}$ . The finiteness of $N$ is guaranteed by the fact that $\lim_{t\to\infty}\theta_{t}=\theta>\theta^{\prime}$ . When there are no such indices, i.e., $\theta_{1}\geq\theta^{\prime}$ , we take $C_{1}=1$ by convention. In general, $C_{1}\leq 1$ . Our main results on estimating gradient residual state as follows:

Let $\{\bm{x}_{t}\}$ be a sequence generated by Generic Adam for initial values $\bm{x}_{1}$ , $\bm{m}_{0}=\bm{0}$ , and $\bm{v}_{0}=\bm{\epsilon}$ . Assume that $f$ and stochastic gradients $\bm{g}_{t}$ satisfy assumptions (A1)-(A4). Let $\tau$ be randomly chosen from $\{1,2,\ldots,T\}$ with equal probabilities $p_{\tau}=1/T$ . We have

where $C^{\prime}\!=\!{2C_{0}^{2}C_{3}d\sqrt{G^{2}\!+\!\epsilon d}}{\big{/}}{[(1\!-\!\beta)\theta_{1}]}$ and

where $C_{4}$ and $C_{3}$ are defined as $C_{4}=f(x_{1})-f^{*}$ and $C_{3}=\frac{C_{0}}{\sqrt{C_{1}}(1-\sqrt{\gamma})}\big{[}\frac{C_{0}^{2}\chi_{1}L}{C_{1}(1-\sqrt{\gamma})^{2}}+2\big{(}\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\big{)}^{2}G\big{]}$ .

Suppose the same setting and hypothesis as Theorem 4. Let $\tau$ be randomly chosen from $\{1,2,\ldots,T\}$ with equal probabilities $p_{\tau}=1/T$ . Then for any $\delta>0$ , the following bound holds with probability at least $1-\delta^{2/3}$ :

where $C$ and $C^{\prime}$ are defined as those in Theorem 4.

Take $\alpha_{t}=\eta/t^{s}$ with $0\leq s<1$ . Suppose $\lim_{t\to\infty}\theta_{t}=\theta<1$ . Then the $Bound(T)$ in Theorem 5 is bounded from below by constants

In particular, when $\theta_{t}=\theta<1$ , we have the following more subtle estimate on lower and upper-bounds for $Bound(T)$

(i) Corollary 7 shows that if $\lim_{t\to\infty}\theta_{t}=\theta<1$ , the bound in Theorem 5 is only $\mathcal{O}(1)$ , hence not guaranteeing convergence. This result is not surprising as Adam with constant $\theta_{t}$ has already shown to be divergent . Hence, $\mathcal{O}(1)$ is its best convergence rate we can expect. We will discuss this case in more details in Section 4. (ii) Corollary 7 also indicates that in order to guarantee convergence, the parameter has to satisfy $\lim_{t\to\infty}\theta_{t}=1$ . Although we do not assume this in our restrictions (R1)-(R3), it turns out to be the consequence from our analysis. Note that if $\beta<1$ in (R1) and $\lim_{t\to\infty}\theta_{t}=1$ , then the restriction $\lim_{t\to\infty}\theta_{t}>\beta^{2}$ is automatically satisfied in (R2).

We are now ready to give the Sufficient Condition (SC) for convergence of Generic Adam based on Theorem 5.

Generic Adam is convergent if the parameters $\{\alpha_{t}\}$ , $\{\beta_{t}\}$ , and $\{\theta_{t}\}$ satisfy

$0<\theta_{t}<1$ and $\theta_{t}$ is non-decreasing in $t$ ;

$\chi_{t}:=\alpha_{t}/\sqrt{1-\theta_{t}}$ is “almost” non-increasing;

$\big{(}{\sum_{t=1}^{T}\alpha_{t}\sqrt{1-\theta_{t}}}\big{)}{\big{/}}\big{(}{T\alpha_{T}}\big{)}=o(1)$ .

We now provide the convergence rate of Generic Adam with a specific class of parameters $\{(\theta_{t},\alpha_{t})\}$ , i.e.,

for positive constants $\alpha,\eta,K$ , where $K$ is taken such that $\alpha/K^{r}<1$ . Note that $\alpha$ can be taken bigger than 1. When $\alpha<1$ , we can take $K=1$ and then $\theta_{t}=1-\alpha/t^{r},t\geq 1$ . To guarantee (R3), we require $r\leq 2s$ . For such a family of parameters we have the following corollary.

Generic Adam with the above family of parameters converges as long as $0<r\leq 2s<2$ , and its non-asymptotic convergence rate is given by

Corollary 10 recovers and extends the results of some well-known algorithms below:

AdaGrad with exponential moving average (EMA). When $\theta_{t}=1-1/t$ , $\alpha_{t}=\eta/\sqrt{t}$ , and $\beta_{t}=\beta<1$ , Generic Adam is exactly AdaGrad with EMA momentum (AdaEMA) . In particular, if $\beta=0$ , this is the vanilla coordinate-wise AdaGrad. It corresponds to taking $r=1$ and $s=1/2$ in Corollary 10. Hence, AdaEMA has convergence rate $\log(T)/\sqrt{T}$ .

AdamNC. Taking $\theta_{t}=1-1/t$ , $\alpha_{t}=\eta/\sqrt{t}$ , and $\beta_{t}=\beta\lambda^{t}$ in Generic Adam, where $\lambda<1$ is the decay factor for the momentum factors $\beta_{t}$ , we recover AdamNC . Its $\mathcal{O}(\log{(T)}/\sqrt{T})$ convergence rate can be directly derived via Corollary 10.

RMSProp. Mukkamala and Hein have reached the same $\mathcal{O}(\log{(T)}/\sqrt{T})$ convergence rate for RMSprop with $\theta_{t}=1-\alpha/t$ , when $0<\alpha\leq 1$ and $\alpha_{t}=\eta/\sqrt{t}$ under the convex assumption. Since RMSProp is essentially Generic Adam with all momentum factors $\beta_{t}=0$ , we recover Mukkamala and Hein’s results by taking $r=1$ and $s=1/2$ in Corollary 10. Moreover, our result generalizes to the non-convex stochastic setting, and it holds for all $\alpha\!>\!0$ rather than only $0\!<\!\alpha\!\leq\!1$ .

As Weighted AdaEMA is equivalent to Generic Adam, we present its convergence rate with specific polynomial growth weights in the following corollary.

Suppose in Weighted AdaEMA the weights $w_{t}=t^{r}$ for $r\!\geq\!0$ , and $\alpha_{t}\!=\!\eta/\sqrt{t}$ . Then Weighted AdaEMA has the $\mathcal{O}(\log(T)/\sqrt{T})$ non-asymptotic convergence rate.

Zou et al. proposed weighted AdaGrad with a unified momentum form which incorporates Heavy Ball (HB) momentum and Nesterov Accelerated Gradients (NAG) momentum. The same convergence rate was established for weights with polynomial growth. Our result complements by showing that the same convergence rate also holds for exponential moving average momentum.

(i) Huang et al. proposed Nostalgic Adam (NosAdam) which corresponds to taking the learning rate $\alpha_{t}=\eta/\sqrt{t}$ and $\theta_{t}=B_{t-1}/B_{t}$ with $B_{t}=\sum_{i=1}^{t}b_{i}$ for $b_{i}>0,\ i\geq 0$ , and $B_{0}>0$ We directly use $B_{t}$ and $b_{i}$ along with the notations of NosAdam . in Generic Adam. The idea of NosAdam is to guarantee $\Gamma_{t}\succ 0$ by laying more weights on the historical second-order moments. A special case of NosAdam is NosAdam-HH which takes $B_{t}=\sum_{i=1}^{t}i^{-r}$ for $r\geq 0$ as the hyper-harmonic series. Its $\mathcal{O}(1/\sqrt{T})$ convergence rate is established in the strongly convex stochastic setting. NosAdam-HH can be viewed as the Weighted AdaEMA taking $\alpha_{t}=\eta/\sqrt{t}$ and $w_{t}=t^{-r}$ for $r\geq 0$ .

(ii) Corollary 12 differs from the motivation of NosAdam as the weights we consider are $w_{t}=t^{r}$ for $r\geq 0$ . Note that in both cases when $r=0$ , this is the AdaGrad algorithm, which corresponds to assigning equal weights to the past squares of gradients. Hence, we are actually in the opposite direction of NosAdam. We are more interested in the case of assigning more weights to the recent stochastic gradients. This can actually be viewed as a situation between AdaGrad and the original Adam with constant $\theta_{t}$ ’s.

Comparison between (SC) and (C1). Most of the convergent modifications of original Adam, such as AMSGrad, AdamNC, and NosAdam, all require $\Gamma_{t}\succ 0$ in Eq. (5), which is equivalent to decreasing the adaptive learning rate $\eta_{t}$ step by step. Since the term $\Gamma_{t}$ (or adaptive learning rate $\eta_{t}$ ) involves the past stochastic gradients (hence not deterministic), the modification to guarantee $\Gamma_{t}\succ 0$ either needs to change the iteration scheme of Adam (like AMSGrad) or needs to impose some strong restrictions on the base learning rate $\alpha_{t}$ and $\theta_{t}$ (like AdamNC). Our sufficient condition provides an easy-to-check criterion for the convergence of Generic Adam in Corollary 9. It is not necessary to require $\Gamma_{t}\succ 0$ . Moreover, we use exactly the same iteration scheme as original Adam without any modifications. Our work shows that the positive definiteness of $\Gamma_{t}$ may not be the essential issue for divergence of original Adam. It is probably due to that the parameters are not set correctly.

The currently most popular RMSProp and Adam’s parameter setting takes constant $\theta_{t}$ , i.e., $\theta_{t}=\theta<1$ . The motivation behind is to use the exponential moving average of squares of past stochastic gradients. In practice, parameter $\theta$ is recommended to be set very close to 1. For instance, a commonly adopted $\theta$ is taken as 0.999.

Although great performance in practice has been observed, such a constant parameter setting has the serious flaw that there is no convergence guarantee even for convex optimization, as proved by the counterexamples in . Ever since much work has been done to analyze the divergence issue of Adam and to propose modifications with convergence guarantees, as summarized in the introduction section. However, there is still not a satisfactory explanation that touches the fundamental reason of the divergence. Below, we try to provide more insights for the divergence issue of Adam/RMSProp with constant parameter $\theta_{t}$ , based on our analysis of the sufficient condition for convergence.

From the sufficient condition perspective. Let $\alpha_{t}\!=\!\eta/t^{s}$ for $0\leq s\!<\!1$ and $\theta_{t}\!=\!\theta\!<\!1$ . According to Corollary 7, $Bound(T)$ in Theorem 5 has the following estimate:

The bounds tell us some points on Adam with constant $\theta_{t}$ :

$Bound(T)\!=\!\mathcal{O}(1)$ , so the convergence is not guaranteed. This result coincides with the divergence issue demonstrated in . Indeed, since in this case Adam is not convergent, this is the best bound we can have.

Consider the dependence on parameter $s$ . The bound is decreasing in $s$ . The best bound in this case is when $s=0$ , i.e., the base learning rate is taken constant. This explains why in practice taking a more aggressive constant base learning rate often leads to even better performance, comparing with taking a decaying one.

Consider the dependence on parameter $\theta$ . Note that the constants $C$ and $C^{\prime}$ depend on $\theta_{1}$ instead of the whole sequence $\theta_{t}$ . We can always set $\theta_{t}=\theta$ for $t\geq 2$ while fix $\theta_{1}<\theta$ , by which we can take $C$ and $C^{\prime}$ independent of constant $\theta$ . Then the principal term of $Bound(T)$ is linear in $\sqrt{1-\theta}$ , so decreases to zero as $\theta\to 1$ . This explains why setting $\theta$ close to 1 in practice often results in better performance in practice.

Moreover, Corollary 10 shows us how the convergence rate continuously changes when we continuously verify parameters $\theta_{t}$ . Let us fix $\alpha_{t}\!=\!1/\sqrt{t}$ and consider the following continuous family of parameters $\{\theta_{t}^{(r)}\}$ with $r\in$ :

Note that when $r=1$ , then $\theta_{t}=1-1/t$ , this is the AdaEMA, which has the convergence rate $\mathcal{O}(\log T/\sqrt{T})$ ; when $r=0$ , then $\theta_{t}=\bar{\theta}<1$ , this is the original Adam with constant $\theta_{t}$ , which only has the $\mathcal{O}(1)$ bound; when $0<r<1$ , by Corollary 10, the algorithm has the $\mathcal{O}(T^{-r/2})$ convergence rate. Along this continuous family of parameters, we observe that the theoretical convergence rate continuously deteriorates as the real parameter $r$ decreases from 1 to 0, namely, as we gradually shift from AdaEMA to Adam with constant $\theta_{t}$ . In the limiting case, the latter is not guaranteed with convergence any more. This phenomenon is empirically verified by the Synthetic Counterexample in Section 5.

From the Weighted AdaEMA perspective. Since Generic Adam is equivalent to Weighted AdaEMA, we can examine Adam with $\theta_{t}=\theta<1$ in terms of Weighted AdaEMA. In this case, we find that the associated sequence of weights $w_{t}=(1-\theta)\theta^{-t}$ is growing in an exponential order. Corollary 12 shows that as long as the weights are in polynomial growth, Weighted AdaEMA is convergent and its convergence rate is $\mathcal{O}(\log T/\sqrt{T})$ . This indicates that the exponential-moving-average technique in the estimate of second-order moments may assign a too aggressive weight to the current gradient, which leads to the divergence.

Experiments

In this section, we experimentally validate the proposed sufficient condition by applying Generic Adam and RMSProp to solve the counterexample and to train LeNet on the MNIST dataset and ResNet on the CIFAR-100 dataset , respectively.

In this experiment, we verify the phenomenon described in Section 4 that how the convergence rate of Generic Adam gradually changes along a continuous path of families of parameters on the one-dimensional counterexample in :

where $T$ is the number of maximum iterations, $f_{t}(x)\!=\!1010x$ with probability 0.01, and $f_{t}(x)\!=\!10x$ with probability 0.99.

Sensitivity of parameter $r$ . We set $T=10^{7}$ , $\alpha_{t}=0.5/\sqrt{t}$ , $\beta=0.9$ , and $\theta_{t}$ as $\theta_{t}^{(r)}=1-(0.01+0.99r)/{t^{r}}$ with $r\in\{0,\ 0.25,\ 0.5,\ 0.75,\ 1.0\}$ , respectively. Note that when $r=0$ , Generic Adam reduces to the originally divergent Adam with $(\beta,\bar{\theta})=(0.9,0.99)$ . When $r=1$ , Generic Adam reduces to the AdaEMA with $\beta=0.9$ .

The experimental results are shown in Figures 1(a) and 1(d). We can see that for $r=1.0,0.75$ , and $0.5$ , Generic Adam is convergent. Moreover, the convergence becomes slower when $r$ decreases, which exactly matches Corollary 10. On the other hand, for $r=0$ and $0.25$ , Figure 1(d) shows that they do not converge. It seems that the divergence for $r=0.25$ contradicts our theory. However, this is because when $r$ is very small, the $\mathcal{O}(T^{-r/2})$ convergence rate is so slow that we may not see a convergent trend in even $10^{7}$ iterations. Indeed, for $r=0.25$ , we actually have

which is not sufficiently close to 1. As a complementary experiment, we fix the numerator and only change $r$ when $r$ is small. We take $\alpha_{t}$ and $\beta_{t}$ as the same, while $\theta_{t}^{(r)}=1-\frac{0.01}{t^{r}}$ for $r=0$ and $0.25$ , respectively. The result is shown in Figures 1(b) and 1(e). We can see that Generic Adam with $r=0.25$ is indeed convergent in this situation.

Sensitivity of parameter $s$ . Now, we show the sensitivity of $s$ of the sufficient condition (SC) by fixing $r\!=\!0.8$ and selecting $s$ from the collection $s=\{0.4,0.6,0.8\}$ . Figures 1(c) and 1(e) illustrate the sensitivity of parameter $s$ when Generic Adam is applied to solve the counterexample (9). The performance shows that when $s$ is fixed, smaller $r$ can lead to a faster and better convergence speed, which also coincides with the convergence results in Corollary 10.

2 LeNet on MNIST and ResNet-18 on CIFAR-100

In this subsection, we apply Generic Adam to train LeNet on the MNIST dataset and ResNet-18 on the CIFAR-100 dataset, respectively, in order to validate the convergence rates in Corollary 10. Meanwhile, the comparisons between Generic Adam and AMSGrad are also provided to distinguish their differences in training deep neural networks. We illustrate the performance profiles in three aspects: training loss vs. epochs, test loss vs. epochs, and test accuracy vs. epochs, respectively. Besides, the architectures of LeNet and ResNet-18, and the statistics of the MNIST and CIFAR-100 datasets are described in the supplementary material.

In the experiments, for Generic Adam, we set $\theta_{t}^{(r)}=1-(0.001+0.999r)/t^{r}$ with $r\in\{0,0.25,0.5,0.75,1\}$ and $\beta_{t}=0.9$ , respectively; for RMSProp, we set $\beta_{t}=0$ and $\theta_{t}=1-\frac{1}{t}$ along with the parameter settings in . For fairness, the base learning rates $\alpha_{t}$ in Generic Adam, RMSProp, and AMSGrad are all set as $0.001/\sqrt{t}$ . Figures 3 and 3 illustrate the results of Generic Adam with different $r$ , RMSProp, and AMSGrad for training LeNet on MNIST and training ResNet-18 on CIFAR-100, respectively. We can see that AMSGrad and Adam (Generic Adam with $r=0$ ) decrease the training loss slowest and show the worst test accuracy among the compared optimizers. One possible reason is due to the use of constant $\theta$ in AMSGrad and original Adam. By Figures 3 and 3, we can observe that the convergences of Generic Adam are extremely sensitive to the choice of parameter $\theta_{t}$ . Larger $r$ can contribute to a faster convergence rate of Generic Adam, which corroborates the theoretical result in Corollary 10. Additionally, the test accuracies in Figures 3(b) and 3(b) indicate that a smaller training loss can contribute to a higher test accuracy for Generic Adam.

Conclusions

In this work, we delved into the convergences of Adam and RMSProp, and presented an easy-to-check sufficient condition to guarantee their convergences in the non-convex stochastic setting. This sufficient condition merely depends on the base learning rate $\alpha_{t}$ and the linear combination parameter $\theta_{t}$ of second-order moments. Relying on this sufficient condition, we found that the divergences of Adam and RMSProp are possibly due to the incorrect parameter settings of $\alpha_{t}$ and $\theta_{t}$ . In addition, we reformulated Adam as weighted AdaGrad with exponential moving average momentum, which provides a novel perspective for understanding Adam and RMSProp. At last, the correctness of theoretical results was also verified via the counterexample in and training deep neural networks on real-world datasets.

References

Notations

Appendix A Key Lemmas

In this section we provide the necessary lemmas for the proofs of Theorems 4 and 5.

Given $S_{0}>0$ and a non-negative sequence $\{s_{t}\}$ , let $S_{t}=S_{0}+\sum_{i=1}^{t}s_{i}$ for $t\geq 1$ . Then the following estimate holds

The finite sum $\sum_{t=1}^{T}{s_{t}}/{S_{t}}$ can be interpreted as a Riemann sum $\sum_{t=1}^{T}(S_{t}-S_{t-1})/S_{t}.$ Since $1/x$ is decreasing on the interval $(0,\infty)$ , we have

Let $\{u_{t}\}$ and $\{s_{t}\}$ be two non-negative sequences. Let $S_{t}=\sum_{i=1}^{t}s_{i}$ for $t\geq 1$ . Then

Let $\{\theta_{t}\}$ and $\{\alpha_{t}\}$ satisfy the restrictions (R2) and (R3). For any $i\leq t$ , we have

For any $i\leq t$ , since the sequence $\{a_{t}\}$ is non-increasing, we have $a_{t}\leq a_{i}$ . Hence,

which proves the first inequality. On the other hand, since $\{\theta_{t}\}$ is non-decreasing, it holds

Let $\Theta_{(t,i)}=\prod_{j=i+1}^{t}\theta_{j}$ for $i<t$ and $\Theta_{(t,t)}=1$ by convention.

Fix a constant $\theta^{\prime}$ with $\beta^{2}<\theta^{\prime}<\theta$ . Let $C_{1}$ be as given as Eq. (6) in the main paper. For any $i\leq t$ , we have

For any $i\leq t$ , since $\theta_{j}\geq\theta^{\prime}$ for $j\geq N$ , and $\theta_{j}<\theta^{\prime}$ for $j<N$ , we have

We take the constant $C_{1}=\prod_{j=1}^{N}(\theta_{j}/\theta^{\prime})$ , where $N$ is the maximum of the indices for which $\theta_{j}<\theta^{\prime}$ . The proof is completed. ∎

If $\theta_{t}=\theta$ is a constant, we have $\Theta_{(t,i)}=\theta^{t-i}$ . In this case we can take $\theta^{\prime}=\theta$ and $C_{1}=1$ .

Let $\gamma:=\beta^{2}/{\theta^{\prime}}$ . We have the following estimate

Let $B_{(t,i)}=\prod_{j=i+1}^{t}\beta_{j}$ for $i<t$ and $B_{(t,t)}=1$ by convention. By the iteration formula $\bm{m}_{t}=\beta_{t}\bm{m}_{t-1}+(1-\beta_{t})\bm{g}_{t}$ and $\bm{m}_{0}=\bm{0}$ , we have

Similarly, by $\bm{v}_{t}=\theta_{t}\bm{v}_{t-1}+(1-\theta_{t})\bm{g}_{t}^{2}$ and $\bm{v}_{0}=\bm{\epsilon}$ , we have

Note that $\{\theta_{t}\}$ is non-decreasing by (R2), and $B_{(t,i)}\leq\beta^{t-i}$ by (R1). By Lemma 18, we have

Combining Eq. (18) and Eq. (19), we obtain the desired Eq. (16). The proof is completed. ∎

where $C_{2}=2\left(\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\right)^{2}$ .

To estimate (I), by the Schwartz inequality and the Lipschitz continuity of the gradient of $f$ , we have

Note that $\bm{\sigma}_{t}\leq G$ . Therefore,

where $C_{2}^{\prime}=\left(\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\right)$ . The last inequality holds due to $\beta_{t}/(1-\beta_{t})\leq\beta/(1-\beta)$ as $\beta_{t}\leq\beta$ . Therefore, we have

Combining Eq. (27), Eq. (33), and Eq. (34), we obtain

The term (IV) is estimated similarly as term (III). First, we have

where $C_{2}^{\prime}$ is the constant defined above. We have

Combining Eq. (22), Eq. (23), Eq. (25), Eq. (26), Eq. (35), and Eq. (37), we obtain

Let $C_{2}$ denote the constant ${2(C_{2}^{\prime})^{2}}$ . Then

Next, we estimate Eq. (21). When $t=1$ , we have

The same as what we did for term (I) in Lemma 21, we have

Then the similar argument as Eq. (33) implies that

Note that $\bm{v}_{t}\geq\theta_{t}\bm{v}_{t-1}$ , so we have $\bm{v}_{t}\geq\left(\prod_{j=i+1}^{t}\theta_{j}\right)\bm{v}_{i}=\Theta_{(t,i)}\bm{v}_{i}$ . By Lemma 18, this follows that $\bm{v}_{t}\geq C_{1}(\theta^{\prime})^{t-i}\bm{v}_{i}$ for all $i\leq t$ . On the other hand,

Since $\alpha_{t}=\chi_{t}\sqrt{1-\theta_{t}}\leq\chi_{t}\sqrt{1-\theta_{i}}$ for $i\leq t$ , it follows that

It is straightforward to acquire by induction that

By Lemma 17, it holds $\alpha_{t}\leq C_{0}\alpha_{i}$ for any $i\leq t$ . By Lemma 18, $\Theta_{(t,i)}\geq C_{1}(\theta^{\prime})^{t-i}$ . In addition, $B_{(t,i)}\leq\beta^{t-i}$ . Hence,

Combining Eq. (51) and Eq. (52), we then obtain the desired estimate Eq. (47). The proof is completed. ∎

Let $W_{0}=1$ and $W_{t}=\prod_{i=1}^{T}\theta_{i}^{-1}$ . Let $w_{t}=W_{t}-W_{t-1}=(1-\theta_{t})\prod_{i=1}^{t}\theta_{i}^{-1}=(1-\theta_{t})W_{t}$ . We therefore have

Note that $\bm{v}_{0}=\bm{\epsilon}$ and $\bm{v}_{t}=\theta_{t}\bm{v}_{t-1}+(1-\theta_{t})\bm{g}_{t}$ , so it holds that $W_{0}\bm{v}_{0}=\bm{\epsilon}$ and $W_{t}\bm{v}_{t}=W_{t-1}\bm{v}_{t-1}+w_{t}\bm{g}_{t}^{2}.$ Then, $W_{t}\bm{v}_{t}=W_{0}\bm{v}_{0}+\sum_{i=1}^{t}w_{i}\bm{g}_{i}^{2}=\bm{\epsilon}+\sum_{i=1}^{t}w_{i}\bm{g}_{i}^{2}.$ It follows that

Writing the norm in terms of coordinates, we obtain

The second inequality is due to the convex inequality $\frac{1}{d}\sum_{k=1}^{d}\log\left(z_{i}\right)\leq\log\left(\frac{1}{d}\sum_{k=1}^{d}z_{i}\right)$ . Indeed, we have the more general convex inequality that

for any positive random variable $X$ . Taking $X$ to be $1+\frac{1}{\epsilon d}\sum_{i=1}^{t}w_{i}\left\|\bm{g}_{i}\right\|^{2}$ in the right hand side of Eq. (57), we obtain that

The last inequality is due to the following trivial inequality:

for any non-negative parameters $a$ and $b$ . It then follows that

For simplicity of notations, let $\omega_{t}:=\left\|\frac{\sqrt{1-\theta_{t}}\bm{g}_{t}}{\sqrt{\bm{v}_{t}}}\right\|^{2}$ , and $\Omega_{t}:=\sum_{i=1}^{t}\omega_{i}$ . Note that $\chi_{t}\leq C_{0}a_{t}$ . Hence,

Let $S_{t}:=\log\left(1+\frac{G^{2}}{\epsilon d}\right)+\sum_{i=1}^{t}\log(\theta_{i}^{-1})$ . By Lemma 25, we have

Since $\{a_{t}\}$ is a non-increasing sequence, we have $a_{t}-a_{t+1}\geq 0$ . By Eq. (63), we have

Note that $a_{t}\leq\chi_{t}$ . Combining Eq. (62), Eq. (63), and Eq. (65), we have

Note that $\log(1+x)\leq x$ for all $x>-1$ . It follows that

Note that $\chi_{t}=\alpha_{t}/\sqrt{1-\theta_{t}}$ . By Eq. (62) and Eq. (65), we have

Let $\tau$ be randomly chosen from $\{1,2,\ldots,T\}$ with equal probabilities $p_{\tau}=1/T$ . We have the following estimate

For any two random variables $X$ and $Y$ , by the Hölder’s inequality, we have

Let $X=\left(\frac{\left\|\bm{\nabla}f(\bm{x}_{t})\right\|^{2}}{\sqrt{\left\|\bm{\hat{v}}_{t}\right\|_{1}}}\right)^{2/3}$ , $Y=\left\|\bm{\hat{v}}_{t}\right\|_{1}^{1/3}$ , and let $p=3/2$ , $q=3$ . By Eq. (69), we have

Since $\bm{\hat{v}}_{t}=\theta_{t}\bm{{v}}_{t-1}+(1-\theta_{t})\bm{\sigma}_{t}^{2}$ , and all entries are non-negative, we have

By Eq. (70), Eq. (71), and Eq. (72), we obtain

By Lemma 17, $\alpha_{T}\leq C_{0}\alpha_{t}$ for any $t\leq T$ , so $\alpha_{t}^{-1}\leq C_{0}\alpha_{T}^{-1}$ . Then, we obtain

Appendix B Proofs of the main results

In this section, we provide the detailed proofs of the propositions, theorems, and corollaries in the main body.

Algorithm 1 and Algorithm 2 are equivalent.

It suffices to show that Algorithm 1 can be realized as Algorithm 2 with a particular choice of parameters, and vice versa. Note that for Algorithm 1, it holds

Hence, given the parameters $\theta_{t}$ in Algorithm 1, we take $w_{t}=(1-\theta_{t})\prod_{i=1}^{t}\theta_{i}^{-1}$ . Then it holds

It follows that Eq. (77) becomes Eq. (76). Conversely, given the parameters $w_{t}$ of Algorithm 2, we take $\theta_{t}=W_{t-1}/W_{t}$ . Then Eq. (76) becomes Eq. (77). The proof is completed. ∎

B.2 Proof of Theorem 4

where the constants $C$ and $C^{\prime}$ are given by

By the $L$ -Lipschitz continuity of the gradient of $f$ and the descent lemma, we have

where $C_{3}$ is the constant given in Lemma 24. By applying the estimates in Lemma 25 and Lemma 27 for the second and third terms in the right hand side of Eq. (81), and appropriately rearranging the terms, we obtain

B.3 Proof of Theorem 5

where the constants $C$ and $C^{\prime}$ are given by

in which the constant $C_{3}$ is given by

Namely, $\mathcal{P}\left(|\zeta|>\frac{C(T)}{\delta}\right)\leq\delta^{2/3}$ . Therefore, $\mathcal{P}\left(|\zeta|\leq\frac{C(T)}{\delta}\right)\geq 1-\delta^{2/3}$ . This completes the proof. ∎

B.4 Proof of Corollary 7

Take $\alpha_{t}=\eta/t^{s}$ with $0\leq s<1$ . Suppose $\lim_{t\to\infty}\theta_{t}=\theta<1$ .Then $Bound(T)$ in Theorem 5 is bounded from below by constants

In particular, when $\theta_{t}=\theta<1$ , we have the following more subtle estimate on lower and upper-bounds for $Bound(T)$ :

Since $\lim_{t\to\infty}\theta_{t}=\theta$ , and $\theta_{t}$ is non-decreasing, we have $(1-\theta_{t})\geq 1-\theta$ . Hence, by Theorem 5, it holds

If, in particular, $\theta_{t}=\theta<1$ , by Theorem 5 we have

Combining Eqs. (87)-(88), we obtain the desired result. ∎

B.5 Proof of Corollary 10

Generic Adam with the above family of parameters converges as long as $0<r\leq 2s<2$ , and its non-asymptotic convergence rate is given by

It is not hard to verify that the following equalities hold:

In this case, $T\alpha_{T}=\eta T^{1-s}$ . Therefore, by Theorem 5 the non-asymptotic convergence rate is given by

To guarantee convergence, then $0<r\leq 2s<2$ . ∎

B.6 Proof of Corollary 12

Suppose that in Weighted AdaEMA the weights $w_{t}=t^{r}$ for $r\!\geq\!0$ , and $\alpha_{t}\!=\!\eta/\sqrt{t}$ . Then Weighted AdaEMA has the $\mathcal{O}(\log(T)/\sqrt{T})$ non-asymptotic convergence rate.

By the proof procedures of Theorem 3, the equivalent Generic Adam has the parameters $\theta_{t}=W_{t-1}/W_{t}$ , where $W_{t}=1+\sum_{i=1}^{t}w_{i}$ . Hence, it holds

We have that $\lim_{t\to\infty}\theta_{t}=1>\beta$ and $\theta_{t}$ is increasing. In addition, we have that $\chi_{t}=\alpha_{t}/\sqrt{1-\theta_{t}}$ is bounded, and hence “almost” non-increasing (by taking $a_{t}=1$ in (R3)). The restrictions (R1)-(R3) are all satisfied. Hence, we can apply Theorem 5 in this case. It follows that its convergence rate is given by

Appendix C Experimental Implementations

In this section, we describe the statistics of the training and validation datasets of MNISThttp://yann.lecun.com/exdb/mnist/ and CIFAR-100https://www.cs.toronto.edu/ kriz/cifar.html, the architectures of LeNet and ResNet-18, and detailed implementations.

MNIST is composed of ten classes of digits among $\{0,1,2,\ldots,9\}$ , which includes 60,000 training examples and 10,000 validation examples. The dimension of each example is $28\times 28$ .

CIFAR-100 is composed of 100 classes of $32\times 32$ color images. Each class includes 6,000 images. In addition, these images are devided into 50,000 training examples and 10,000 validation examples.

C.2 Architectures of Neural Networks

C.3 Additional Experiments of ResNet-18 on CIFAR-100

We further illustrate Generic Adam with different $r=\{0,0.25,0.5,0.75,1\}$ , RMSProp, and AMSGrad with an alternative base learning rate $\alpha=0.01$ on ResNet-18. We do cut-off by taking $\alpha_{t}=0.001$ if $t<2500$ . Note that $\alpha_{t}$ is still non-increasing. The motivation is that at the very beginning the learning rate $\alpha_{t}=\frac{0.01}{\sqrt{t}}$ could be large which would deteriorate the performance. The performance profiles are also exactly in accordance with the analysis in theory, i.e., larger $r$ leads to a faster training process.