A Sufficient Condition for Convergences of Adam and RMSProp

Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu

Introduction

Large-scale non-convex stochastic optimization , covering a slew of applications in statistics and machine learning such as learning a latent variable from massive data whose probability density distribution is unknown, takes the following generic formulation:

for k=1,2,,dk=1,2,\ldots,d, where ηt,k0\eta_{t,k}\geq 0 is the learning rate of the kk-th component of stochastic gradient g(xt,ξt)\bm{g}(\bm{x}_{t},\xi_{t}) at the tt-th iteration. A sufficient condition to ensure the global convergence of vanilla SGD (2) is to require ηt\eta_{t} to meet

Although the vanilla SGD algorithm with learning rate ηt\eta_{t} satisfying condition (3) does converge, its empirical performance could be still stagnating, since it is difficult to tune an effective learning rate ηt\eta_{t} via condition (3).

To further improve the empirical performance of SGD, a large variety of adaptive SGD algorithms, including AdaGrad , RMSProp , Adam , Nadam , etc., have been proposed to automatically tune the learning rate ηt\eta_{t} by using second-order moments of historical stochastic gradients. Let vt,k\bm{v}_{t,k} and mt,k\bm{m}_{t,k} be the linear combinations of the historical second-order moments (g1,k2,g2,k2,,gt,k2)(\bm{g}^{2}_{1,k},\bm{g}^{2}_{2,k},\cdots,\bm{g}^{2}_{t,k}) and stochastic gradient estimates (g1,k,g2,k,,gt,k)(\bm{g}_{1,k},\bm{g}_{2,k},\cdots,\bm{g}_{t,k}), respectively. Then, the generic iteration scheme of these adaptive SGD algorithms is summarized as

for k=1,2,,dk=1,2,\ldots,d, where αt>0\alpha_{t}>0 is called base learning rate and it is independent of stochastic gradient estimates (g1,k,g2,k,,gt,k)(\bm{g}_{1,k},\bm{g}_{2,k},\cdots,\bm{g}_{t,k}) for all t1t\geq 1. Although RMSProp, Adam, and Nadam work well for solving large-scale convex and non-convex optimization problems such as training deep neural networks, they have been pointed out to be divergent in some scenarios via convex counterexamples . This finding thoroughly destroys the fluke of a direct use of these algorithms without any further assumptions or corrections. Recently, developing sufficient conditions to guarantee global convergences of Adam and RMSProp-type algorithms has attracted much attention from both machine learning and optimization communities. The existing successful attempts can be divided into four categories:

(C1) Decreasing a learning rate. Reddi et al. have declared that the core cause of divergences of Adam and RMSProp is largely controlled by the difference between the two adjacent learning rates, i.e.,

Once positive definiteness of Γt\Gamma_{t} is violated, Adam and RMSProp may suffer from divergence . Based on this observation, two variants of Adam called AMSGrad and AdamNC have been proposed with convergence guarantees in both the convex and non-convex stochastic settings by requiring Γt0\Gamma_{t}\succ 0. In addition, Padam extended from AMSGrad has been proposed to contract the generalization gap in training deep neural networks, whose convergence has been ensured by requiring Γt0\Gamma_{t}\succ 0. In the strongly convex stochastic setting, by using the long-term memory technique developed in , Huang et al. have proposed NosAdam by attaching more weights on historical second-order moments to ensure its convergence. Prior to that, the convergence rate of RMSProp has already been established in the convex stochastic setting by employing similar parameters to those of AdamNC .

(C2) Adopting a big batch size. Basu et al. , for the first time, showed that deterministic Adam and RMSProp with original iteration schemes are actually convergent by using full-batch gradient. On the other hand, both Adam and RMSProp can be reshaped as specific signSGD-type algorithms whose O(1/T)\mathcal{O}(1/\sqrt{T}) convergence rates have been provided in the non-convex stochastic setting by setting batch size as large as the number of maximum iterations . Recently, Zaheer et al. have established O(1/T)\mathcal{O}(1/\sqrt{T}) convergence rate of original Adam directly in the non-convex stochastic setting by requiring the batch size to be the same order as the number of maximum iterations. We comment that this type of requirement is impractical when Adam and RMSProp are applied to tackle large-scale problems like (1), since these approaches cost a huge number of computations to estimate big-batch stochastic gradients in each iteration.

(C3) Incorporating a temporal decorrelation. By exploring the structure of the convex counterexample in , Zhou et al. have pointed out that the divergence of RMSProp is fundamentally caused by the unbalanced learning rate rather than the absence of Γt0\Gamma_{t}\succ 0. Based on this viewpoint, Zhou et al. have proposed AdaShift by incorporating a temporal decorrelation technique to eliminate the inappropriate correlation between vt,k\bm{v}_{t,k} and the current second-order moment gt,k2\bm{g}_{t,k}^{2}, in which the adaptive learning rate ηt,k\eta_{t,k} is required to be independent of gt,k2\bm{g}^{2}_{t,k}. However, convergence of AdaShift in was merely restricted to RMSProp for solving the convex counterexample in .

(C4) Seeking an analogous surrogate. Due to the divergences of Adam and RMSProp , Zou et al. recently proposed a class of new surrogates called AdaUSM to approximate Adam and RMSProp by integrating weighted AdaGrad with unified heavy ball and Nesterov accelerated gradient momentums. Its O(log(T)/T)\mathcal{O}(\log{(T)}/\sqrt{T}) convergence rate has also been provided in the non-convex stochastic setting by requiring a non-decreasing weighted sequence. Besides, many other adaptive stochastic algorithms without combining momentums, such as AdaGrad and stagewise AdaGrad , have been guaranteed to be convergent and work well in the non-convex stochastic setting.

In contrast with the above four types of modifications and restrictions, we introduce an alternative easy-to-check sufficient condition (abbreviated as (SC)) to guarantee the global convergences of original Adam and RMSProp. The proposed (SC) merely depends on the parameters in estimating vt,k\bm{v}_{t,k} and base learning rate αt\alpha_{t}. (SC) neither requires the positive definiteness of Γt\Gamma_{t} like (C1) nor needs the batch size as large as the same order as the number of maximum iterations like (C2) in both the convex and non-convex stochastic settings. Thus, it is easier to verify and more practical compared with (C1)-(C3). On the other hand, (SC) is partially overlapped with (C1) since the proposed (SC) can cover AdamNC , AdaGrad with exponential moving average (AdaEMA) momentum , and RMSProp as instances whose convergences are all originally motivated by requiring the positive definiteness of Γt\Gamma_{t}. While, based on (SC), we can directly derive their global convergences in the non-convex stochastic setting as byproducts without checking the positive definiteness of Γt\Gamma_{t} step by step. Besides, (SC) can serve as an alternative explanation on divergences of original Adam and RMSProp, which are possibly due to incorrect parameter settings for accumulating the historical second-order moments rather than the unbalanced learning rate caused by the inappropriate correlation between vt,k\bm{v}_{t,k} and gt,k2\bm{g}^{2}_{t,k} like (C3). In addition, AdamNC and AdaEMA are convergent under (SC), but violate (C3) in each iteration.

Moreover, by carefully reshaping the iteration scheme of Adam, we obtain a specific weighted AdaGrad with exponential moving average momentum, which extends the weighted AdaGrad with heavy ball momentum and Nesterov accelerated gradient momentum in two aspects: the new momentum mechanism and the new base learning rate setting provide a new perspective for understanding Adam and RMSProp. At last, we experimentally verify (SC) by applying Adam and RMSProp with different parameter settings to solve the counterexample and train deep neural networks including LeNet and ResNet . In summary, the contributions of this work are five-fold:

We introduce an easy-to-check sufficient condition to ensure the global convergences of original Adam and RMSProp in the non-convex stochastic setting. Moreover, this sufficient condition is distinctive from the existing conditions (C1)-(C4) and is easier to verify.

We reshape Adam as weighted AdaGrad with exponential moving average momentum, which provides a new perspective for understanding Adam and RMSProp and also complements AdaUSM in .

We provide a new explanation on the divergences of original Adam and RMSProp, which are possibly due to an incorrect parameter setting of the combinations of historical second-order moments based on (SC).

We find that the sufficient condition extends the restrictions of RMSProp and covers many convergent variants of Adam, e.g., AdamNC, AdaGrad with momentum, etc. Thus, their convergences in the non-convex stochastic setting naturally hold.

We conduct experiments to validate the sufficient condition for the convergences of Adam/RMSProp. The experimental results match our theoretical results.

Generic Adam

Generic Adam covers RMSProp by setting βt=0\beta_{t}=0. Moreover, it covers Adam with a bias correction as follows:

The original Adam with the bias correction takes constant parameters βt=β\beta_{t}=\beta and θt=θ\theta_{t}=\theta. The iteration scheme is written as xt+1 ⁣= ⁣xt ⁣ ⁣α^tm^tv^t\bm{x}_{t+1}\!=\!\bm{x}_{t}\!-\!\widehat{\alpha}_{t}\frac{\widehat{\bm{m}}_{t}}{\sqrt{\widehat{\bm{v}}_{t}}}, with m^t ⁣= ⁣mt1βt\widehat{\bm{m}}_{t}\!=\!\frac{\bm{m}_{t}}{1-\beta^{t}} and v^t ⁣= ⁣vt1θt\widehat{\bm{v}}_{t}\!=\!\frac{\bm{v}_{t}}{1-\theta^{t}}. Let αt=α^t1θt1βt\alpha_{t}=\widehat{\alpha}_{t}\frac{\sqrt{1-\theta^{t}}}{1-\beta^{t}}. Then, the above can be rewritten as xt+1=xtαtmt/vt\bm{x}_{t+1}=\bm{x}_{t}-{\alpha_{t}\bm{m}_{t}}/\sqrt{\bm{v}_{t}}. Thus, it is equivalent to taking constant βt\beta_{t}, constant θt\theta_{t}, and new base learning rate αt\alpha_{t} in Generic Adam.

Now we show that Generic Adam can be reformulated as a new type of weighted AdaGrad algorithms with exponential moving average momentum (Weighted AdaEMA).

The Weighted AdaEMA is a natural generalization of the AdaGrad algorithm. The classical AdaGrad is to take the weights wt=1w_{t}=1, the momentum factors βt=0\beta_{t}=0, and the parameters αt=η/t+1\alpha_{t}=\eta/\sqrt{t+1} for constant η\eta.

The following proposition states the equivalence between Generic Adam and Weighted AdaEMA.

Algorithm 1 and Algorithm 2 are equivalent.

The divergence issue of Adam/RMSProp. When θt\theta_{t} is taken constant, i.e., θt=θ\theta_{t}=\theta, Reddi et al. have pointed out that Adam and RMSProp (βt=0\beta_{t}=0) can be divergent even in the convex setting. They conjectured that the divergence is possibly due to the uncertainty of positive definiteness of Γt\Gamma_{t} in Eq. (5). This idea has motivated many new convergent variants of Adam by forcing Γt0\Gamma_{t}\succ 0. Recently, Zhou et al. further argued that the nature of divergences of Adam and RMSProp is possibly due to the unbalanced learning rate ηt\eta_{t} caused by the inappropriate correlation between vt,kv_{t,k} and gt,k2g^{2}_{t,k} by studying the counterexample in . However, this explanation can be violated by many existing convergent Adam-type algorithms such as AdamNC, NosAdam , etc. So far, there is no satisfactory explanation for the core reason of the divergence issue. We will provide more insights in Section 4 based on our theoretical analysis.

Main Results

In this section, we characterize the upper-bound of gradient residual of problem (1) as a function of parameters (θt,αt)(\theta_{t},\alpha_{t}). Then the convergence rate of Generic Adam is derived directly by specifying appropriate parameters (θt,αt)(\theta_{t},\alpha_{t}). Below, we state the necessary assumptions that are commonly used for analyzing the convergence of a stochastic algorithm for non-convex problems:

To establish the upper-bound, we also suppose that the parameters {βt}\{\beta_{t}\}, {θt}\{\theta_{t}\}, and {αt}\{\alpha_{t}\} satisfy the restrictions:

The parameters {βt}\{\beta_{t}\} satisfy 0βtβ<10\leq\beta_{t}\leq\beta<1 for all tt for some constant β\beta;

The parameters {θt}\{\theta_{t}\} satisfy 0<θt<10<\theta_{t}<1 and θt\theta_{t} is non-decreasing in tt with θ:=limtθt>β2\theta:=\lim_{t\to\infty}\theta_{t}>\beta^{2};

The parameters {αt}\{\alpha_{t}\} satisfy that χt:=αt1θt\chi_{t}:=\frac{\alpha_{t}}{\sqrt{1-\theta_{t}}} is “almost” non-increasing in tt, by which we mean that there exist a non-increasing sequence {at}\{a_{t}\} and a positive constant C0C_{0} such that atχtC0ata_{t}\leq\chi_{t}\leq C_{0}a_{t}.

The restriction (R3) indeed says that χt\chi_{t} is the product between some non-increasing sequence {at}\{a_{t}\} and some bounded sequence. This is a slight generalization of χt\chi_{t} itself being non-decreasing. If χt\chi_{t} itself is non-increasing, we can then take at ⁣= ⁣χta_{t}\!=\!\chi_{t} and C0 ⁣= ⁣1C_{0}\!=\!1. For most of the well-known Adam-type methods, χt\chi_{t} is indeed non-decreasing, for instance, for AdaGrad with EMA momentum we have αt=η/t\alpha_{t}=\eta/\sqrt{t} and θt ⁣= ⁣11/t\theta_{t}\!=\!1-1/t, so χt=η\chi_{t}=\eta is constant; for Adam with constant θt ⁣= ⁣θ\theta_{t}\!=\!\theta and non-increasing αt\alpha_{t} (say αt ⁣= ⁣η/t\alpha_{t}\!=\!\eta/\sqrt{t} or αt ⁣= ⁣η\alpha_{t}\!=\!\eta), χt ⁣= ⁣αt/1θ\chi_{t}\!=\!\alpha_{t}/\sqrt{1-\theta} is non-increasing. The motivation, instead of χt\chi_{t} being decreasing, is that it allows us to deal with the bias correction steps in Adam .

We fix a positive constant θ>0\theta^{\prime}>0In the special case that θt=θ\theta_{t}=\theta is constant, we can directly set θ=θ\theta^{\prime}=\theta. such that β2<θ<θ\beta^{2}<\theta^{\prime}<\theta. Let γ:=β2/θ<1\gamma:={\beta^{2}}/{\theta^{\prime}}<1 and

where NN is the maximum of the indices jj with θj<θ\theta_{j}<\theta^{\prime}. The finiteness of NN is guaranteed by the fact that limtθt=θ>θ\lim_{t\to\infty}\theta_{t}=\theta>\theta^{\prime}. When there are no such indices, i.e., θ1θ\theta_{1}\geq\theta^{\prime}, we take C1=1C_{1}=1 by convention. In general, C11C_{1}\leq 1. Our main results on estimating gradient residual state as follows:

Let {xt}\{\bm{x}_{t}\} be a sequence generated by Generic Adam for initial values x1\bm{x}_{1}, m0=0\bm{m}_{0}=\bm{0}, and v0=ϵ\bm{v}_{0}=\bm{\epsilon}. Assume that ff and stochastic gradients gt\bm{g}_{t} satisfy assumptions (A1)-(A4). Let τ\tau be randomly chosen from {1,2,,T}\{1,2,\ldots,T\} with equal probabilities pτ=1/Tp_{\tau}=1/T. We have

where C^{\prime}\!=\!{2C_{0}^{2}C_{3}d\sqrt{G^{2}\!+\!\epsilon d}}{\big{/}}{[(1\!-\!\beta)\theta_{1}]} and

where C4C_{4} and C3C_{3} are defined as C4=f(x1)fC_{4}=f(x_{1})-f^{*} and C_{3}=\frac{C_{0}}{\sqrt{C_{1}}(1-\sqrt{\gamma})}\big{[}\frac{C_{0}^{2}\chi_{1}L}{C_{1}(1-\sqrt{\gamma})^{2}}+2\big{(}\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\big{)}^{2}G\big{]}.

Suppose the same setting and hypothesis as Theorem 4. Let τ\tau be randomly chosen from {1,2,,T}\{1,2,\ldots,T\} with equal probabilities pτ=1/Tp_{\tau}=1/T. Then for any δ>0\delta>0, the following bound holds with probability at least 1δ2/31-\delta^{2/3}:

where CC and CC^{\prime} are defined as those in Theorem 4.

Take αt=η/ts\alpha_{t}=\eta/t^{s} with 0s<10\leq s<1. Suppose limtθt=θ<1\lim_{t\to\infty}\theta_{t}=\theta<1. Then the Bound(T)Bound(T) in Theorem 5 is bounded from below by constants

In particular, when θt=θ<1\theta_{t}=\theta<1, we have the following more subtle estimate on lower and upper-bounds for Bound(T)Bound(T)

(i) Corollary 7 shows that if limtθt=θ<1\lim_{t\to\infty}\theta_{t}=\theta<1, the bound in Theorem 5 is only O(1)\mathcal{O}(1), hence not guaranteeing convergence. This result is not surprising as Adam with constant θt\theta_{t} has already shown to be divergent . Hence, O(1)\mathcal{O}(1) is its best convergence rate we can expect. We will discuss this case in more details in Section 4. (ii) Corollary 7 also indicates that in order to guarantee convergence, the parameter has to satisfy limtθt=1\lim_{t\to\infty}\theta_{t}=1. Although we do not assume this in our restrictions (R1)-(R3), it turns out to be the consequence from our analysis. Note that if β<1\beta<1 in (R1) and limtθt=1\lim_{t\to\infty}\theta_{t}=1, then the restriction limtθt>β2\lim_{t\to\infty}\theta_{t}>\beta^{2} is automatically satisfied in (R2).

We are now ready to give the Sufficient Condition (SC) for convergence of Generic Adam based on Theorem 5.

Generic Adam is convergent if the parameters {αt}\{\alpha_{t}\}, {βt}\{\beta_{t}\}, and {θt}\{\theta_{t}\} satisfy

0<θt<10<\theta_{t}<1 and θt\theta_{t} is non-decreasing in tt;

χt:=αt/1θt\chi_{t}:=\alpha_{t}/\sqrt{1-\theta_{t}} is “almost” non-increasing;

\big{(}{\sum_{t=1}^{T}\alpha_{t}\sqrt{1-\theta_{t}}}\big{)}{\big{/}}\big{(}{T\alpha_{T}}\big{)}=o(1).

We now provide the convergence rate of Generic Adam with a specific class of parameters {(θt,αt)}\{(\theta_{t},\alpha_{t})\}, i.e.,

for positive constants α,η,K\alpha,\eta,K, where KK is taken such that α/Kr<1\alpha/K^{r}<1. Note that α\alpha can be taken bigger than 1. When α<1\alpha<1, we can take K=1K=1 and then θt=1α/tr,t1\theta_{t}=1-\alpha/t^{r},t\geq 1. To guarantee (R3), we require r2sr\leq 2s. For such a family of parameters we have the following corollary.

Generic Adam with the above family of parameters converges as long as 0<r2s<20<r\leq 2s<2, and its non-asymptotic convergence rate is given by

Corollary 10 recovers and extends the results of some well-known algorithms below:

AdaGrad with exponential moving average (EMA). When θt=11/t\theta_{t}=1-1/t, αt=η/t\alpha_{t}=\eta/\sqrt{t}, and βt=β<1\beta_{t}=\beta<1, Generic Adam is exactly AdaGrad with EMA momentum (AdaEMA) . In particular, if β=0\beta=0, this is the vanilla coordinate-wise AdaGrad. It corresponds to taking r=1r=1 and s=1/2s=1/2 in Corollary 10. Hence, AdaEMA has convergence rate log(T)/T\log(T)/\sqrt{T}.

AdamNC. Taking θt=11/t\theta_{t}=1-1/t, αt=η/t\alpha_{t}=\eta/\sqrt{t}, and βt=βλt\beta_{t}=\beta\lambda^{t} in Generic Adam, where λ<1\lambda<1 is the decay factor for the momentum factors βt\beta_{t}, we recover AdamNC . Its O(log(T)/T)\mathcal{O}(\log{(T)}/\sqrt{T}) convergence rate can be directly derived via Corollary 10.

RMSProp. Mukkamala and Hein have reached the same O(log(T)/T)\mathcal{O}(\log{(T)}/\sqrt{T}) convergence rate for RMSprop with θt=1α/t\theta_{t}=1-\alpha/t, when 0<α10<\alpha\leq 1 and αt=η/t\alpha_{t}=\eta/\sqrt{t} under the convex assumption. Since RMSProp is essentially Generic Adam with all momentum factors βt=0\beta_{t}=0, we recover Mukkamala and Hein’s results by taking r=1r=1 and s=1/2s=1/2 in Corollary 10. Moreover, our result generalizes to the non-convex stochastic setting, and it holds for all α ⁣> ⁣0\alpha\!>\!0 rather than only 0 ⁣< ⁣α ⁣ ⁣10\!<\!\alpha\!\leq\!1.

As Weighted AdaEMA is equivalent to Generic Adam, we present its convergence rate with specific polynomial growth weights in the following corollary.

Suppose in Weighted AdaEMA the weights wt=trw_{t}=t^{r} for r ⁣ ⁣0r\!\geq\!0, and αt ⁣= ⁣η/t\alpha_{t}\!=\!\eta/\sqrt{t}. Then Weighted AdaEMA has the O(log(T)/T)\mathcal{O}(\log(T)/\sqrt{T}) non-asymptotic convergence rate.

Zou et al. proposed weighted AdaGrad with a unified momentum form which incorporates Heavy Ball (HB) momentum and Nesterov Accelerated Gradients (NAG) momentum. The same convergence rate was established for weights with polynomial growth. Our result complements by showing that the same convergence rate also holds for exponential moving average momentum.

(i) Huang et al. proposed Nostalgic Adam (NosAdam) which corresponds to taking the learning rate αt=η/t\alpha_{t}=\eta/\sqrt{t} and θt=Bt1/Bt\theta_{t}=B_{t-1}/B_{t} with Bt=i=1tbiB_{t}=\sum_{i=1}^{t}b_{i} for bi>0, i0b_{i}>0,\ i\geq 0, and B0>0B_{0}>0We directly use BtB_{t} and bib_{i} along with the notations of NosAdam . in Generic Adam. The idea of NosAdam is to guarantee Γt0\Gamma_{t}\succ 0 by laying more weights on the historical second-order moments. A special case of NosAdam is NosAdam-HH which takes Bt=i=1tirB_{t}=\sum_{i=1}^{t}i^{-r} for r0r\geq 0 as the hyper-harmonic series. Its O(1/T)\mathcal{O}(1/\sqrt{T}) convergence rate is established in the strongly convex stochastic setting. NosAdam-HH can be viewed as the Weighted AdaEMA taking αt=η/t\alpha_{t}=\eta/\sqrt{t} and wt=trw_{t}=t^{-r} for r0r\geq 0.

(ii) Corollary 12 differs from the motivation of NosAdam as the weights we consider are wt=trw_{t}=t^{r} for r0r\geq 0. Note that in both cases when r=0r=0, this is the AdaGrad algorithm, which corresponds to assigning equal weights to the past squares of gradients. Hence, we are actually in the opposite direction of NosAdam. We are more interested in the case of assigning more weights to the recent stochastic gradients. This can actually be viewed as a situation between AdaGrad and the original Adam with constant θt\theta_{t}’s.

Comparison between (SC) and (C1). Most of the convergent modifications of original Adam, such as AMSGrad, AdamNC, and NosAdam, all require Γt0\Gamma_{t}\succ 0 in Eq. (5), which is equivalent to decreasing the adaptive learning rate ηt\eta_{t} step by step. Since the term Γt\Gamma_{t} (or adaptive learning rate ηt\eta_{t}) involves the past stochastic gradients (hence not deterministic), the modification to guarantee Γt0\Gamma_{t}\succ 0 either needs to change the iteration scheme of Adam (like AMSGrad) or needs to impose some strong restrictions on the base learning rate αt\alpha_{t} and θt\theta_{t} (like AdamNC). Our sufficient condition provides an easy-to-check criterion for the convergence of Generic Adam in Corollary 9. It is not necessary to require Γt0\Gamma_{t}\succ 0. Moreover, we use exactly the same iteration scheme as original Adam without any modifications. Our work shows that the positive definiteness of Γt\Gamma_{t} may not be the essential issue for divergence of original Adam. It is probably due to that the parameters are not set correctly.

The currently most popular RMSProp and Adam’s parameter setting takes constant θt\theta_{t}, i.e., θt=θ<1\theta_{t}=\theta<1. The motivation behind is to use the exponential moving average of squares of past stochastic gradients. In practice, parameter θ\theta is recommended to be set very close to 1. For instance, a commonly adopted θ\theta is taken as 0.999.

Although great performance in practice has been observed, such a constant parameter setting has the serious flaw that there is no convergence guarantee even for convex optimization, as proved by the counterexamples in . Ever since much work has been done to analyze the divergence issue of Adam and to propose modifications with convergence guarantees, as summarized in the introduction section. However, there is still not a satisfactory explanation that touches the fundamental reason of the divergence. Below, we try to provide more insights for the divergence issue of Adam/RMSProp with constant parameter θt\theta_{t}, based on our analysis of the sufficient condition for convergence.

From the sufficient condition perspective. Let αt ⁣= ⁣η/ts\alpha_{t}\!=\!\eta/t^{s} for 0s ⁣< ⁣10\leq s\!<\!1 and θt ⁣= ⁣θ ⁣< ⁣1\theta_{t}\!=\!\theta\!<\!1. According to Corollary 7, Bound(T)Bound(T) in Theorem 5 has the following estimate:

The bounds tell us some points on Adam with constant θt\theta_{t}:

Bound(T) ⁣= ⁣O(1)Bound(T)\!=\!\mathcal{O}(1), so the convergence is not guaranteed. This result coincides with the divergence issue demonstrated in . Indeed, since in this case Adam is not convergent, this is the best bound we can have.

Consider the dependence on parameter ss. The bound is decreasing in ss. The best bound in this case is when s=0s=0, i.e., the base learning rate is taken constant. This explains why in practice taking a more aggressive constant base learning rate often leads to even better performance, comparing with taking a decaying one.

Consider the dependence on parameter θ\theta. Note that the constants CC and CC^{\prime} depend on θ1\theta_{1} instead of the whole sequence θt\theta_{t}. We can always set θt=θ\theta_{t}=\theta for t2t\geq 2 while fix θ1<θ\theta_{1}<\theta, by which we can take CC and CC^{\prime} independent of constant θ\theta. Then the principal term of Bound(T)Bound(T) is linear in 1θ\sqrt{1-\theta}, so decreases to zero as θ1\theta\to 1. This explains why setting θ\theta close to 1 in practice often results in better performance in practice.

Moreover, Corollary 10 shows us how the convergence rate continuously changes when we continuously verify parameters θt\theta_{t}. Let us fix αt ⁣= ⁣1/t\alpha_{t}\!=\!1/\sqrt{t} and consider the following continuous family of parameters {θt(r)}\{\theta_{t}^{(r)}\} with rr\in:

Note that when r=1r=1, then θt=11/t\theta_{t}=1-1/t, this is the AdaEMA, which has the convergence rate O(logT/T)\mathcal{O}(\log T/\sqrt{T}); when r=0r=0, then θt=θˉ<1\theta_{t}=\bar{\theta}<1, this is the original Adam with constant θt\theta_{t}, which only has the O(1)\mathcal{O}(1) bound; when 0<r<10<r<1, by Corollary 10, the algorithm has the O(Tr/2)\mathcal{O}(T^{-r/2}) convergence rate. Along this continuous family of parameters, we observe that the theoretical convergence rate continuously deteriorates as the real parameter rr decreases from 1 to 0, namely, as we gradually shift from AdaEMA to Adam with constant θt\theta_{t}. In the limiting case, the latter is not guaranteed with convergence any more. This phenomenon is empirically verified by the Synthetic Counterexample in Section 5.

From the Weighted AdaEMA perspective. Since Generic Adam is equivalent to Weighted AdaEMA, we can examine Adam with θt=θ<1\theta_{t}=\theta<1 in terms of Weighted AdaEMA. In this case, we find that the associated sequence of weights wt=(1θ)θtw_{t}=(1-\theta)\theta^{-t} is growing in an exponential order. Corollary 12 shows that as long as the weights are in polynomial growth, Weighted AdaEMA is convergent and its convergence rate is O(logT/T)\mathcal{O}(\log T/\sqrt{T}). This indicates that the exponential-moving-average technique in the estimate of second-order moments may assign a too aggressive weight to the current gradient, which leads to the divergence.

Experiments

In this section, we experimentally validate the proposed sufficient condition by applying Generic Adam and RMSProp to solve the counterexample and to train LeNet on the MNIST dataset and ResNet on the CIFAR-100 dataset , respectively.

In this experiment, we verify the phenomenon described in Section 4 that how the convergence rate of Generic Adam gradually changes along a continuous path of families of parameters on the one-dimensional counterexample in :

where TT is the number of maximum iterations, ft(x) ⁣= ⁣1010xf_{t}(x)\!=\!1010x with probability 0.01, and ft(x) ⁣= ⁣10xf_{t}(x)\!=\!10x with probability 0.99.

Sensitivity of parameter rr. We set T=107T=10^{7}, αt=0.5/t\alpha_{t}=0.5/\sqrt{t}, β=0.9\beta=0.9, and θt\theta_{t} as θt(r)=1(0.01+0.99r)/tr\theta_{t}^{(r)}=1-(0.01+0.99r)/{t^{r}} with r{0, 0.25, 0.5, 0.75, 1.0}r\in\{0,\ 0.25,\ 0.5,\ 0.75,\ 1.0\}, respectively. Note that when r=0r=0, Generic Adam reduces to the originally divergent Adam with (β,θˉ)=(0.9,0.99)(\beta,\bar{\theta})=(0.9,0.99). When r=1r=1, Generic Adam reduces to the AdaEMA with β=0.9\beta=0.9.

The experimental results are shown in Figures 1(a) and 1(d). We can see that for r=1.0,0.75r=1.0,0.75, and 0.50.5, Generic Adam is convergent. Moreover, the convergence becomes slower when rr decreases, which exactly matches Corollary 10. On the other hand, for r=0r=0 and 0.250.25, Figure 1(d) shows that they do not converge. It seems that the divergence for r=0.25r=0.25 contradicts our theory. However, this is because when rr is very small, the O(Tr/2)\mathcal{O}(T^{-r/2}) convergence rate is so slow that we may not see a convergent trend in even 10710^{7} iterations. Indeed, for r=0.25r=0.25, we actually have

which is not sufficiently close to 1. As a complementary experiment, we fix the numerator and only change rr when rr is small. We take αt\alpha_{t} and βt\beta_{t} as the same, while θt(r)=10.01tr\theta_{t}^{(r)}=1-\frac{0.01}{t^{r}} for r=0r=0 and 0.250.25, respectively. The result is shown in Figures 1(b) and 1(e). We can see that Generic Adam with r=0.25r=0.25 is indeed convergent in this situation.

Sensitivity of parameter ss. Now, we show the sensitivity of ss of the sufficient condition (SC) by fixing r ⁣= ⁣0.8r\!=\!0.8 and selecting ss from the collection s={0.4,0.6,0.8}s=\{0.4,0.6,0.8\}. Figures 1(c) and 1(e) illustrate the sensitivity of parameter ss when Generic Adam is applied to solve the counterexample (9). The performance shows that when ss is fixed, smaller rr can lead to a faster and better convergence speed, which also coincides with the convergence results in Corollary 10.

2 LeNet on MNIST and ResNet-18 on CIFAR-100

In this subsection, we apply Generic Adam to train LeNet on the MNIST dataset and ResNet-18 on the CIFAR-100 dataset, respectively, in order to validate the convergence rates in Corollary 10. Meanwhile, the comparisons between Generic Adam and AMSGrad are also provided to distinguish their differences in training deep neural networks. We illustrate the performance profiles in three aspects: training loss vs. epochs, test loss vs. epochs, and test accuracy vs. epochs, respectively. Besides, the architectures of LeNet and ResNet-18, and the statistics of the MNIST and CIFAR-100 datasets are described in the supplementary material.

In the experiments, for Generic Adam, we set θt(r)=1(0.001+0.999r)/tr\theta_{t}^{(r)}=1-(0.001+0.999r)/t^{r} with r{0,0.25,0.5,0.75,1}r\in\{0,0.25,0.5,0.75,1\} and βt=0.9\beta_{t}=0.9, respectively; for RMSProp, we set βt=0\beta_{t}=0 and θt=11t\theta_{t}=1-\frac{1}{t} along with the parameter settings in . For fairness, the base learning rates αt\alpha_{t} in Generic Adam, RMSProp, and AMSGrad are all set as 0.001/t0.001/\sqrt{t}. Figures 3 and 3 illustrate the results of Generic Adam with different rr, RMSProp, and AMSGrad for training LeNet on MNIST and training ResNet-18 on CIFAR-100, respectively. We can see that AMSGrad and Adam (Generic Adam with r=0r=0) decrease the training loss slowest and show the worst test accuracy among the compared optimizers. One possible reason is due to the use of constant θ\theta in AMSGrad and original Adam. By Figures 3 and 3, we can observe that the convergences of Generic Adam are extremely sensitive to the choice of parameter θt\theta_{t}. Larger rr can contribute to a faster convergence rate of Generic Adam, which corroborates the theoretical result in Corollary 10. Additionally, the test accuracies in Figures 3(b) and 3(b) indicate that a smaller training loss can contribute to a higher test accuracy for Generic Adam.

Conclusions

In this work, we delved into the convergences of Adam and RMSProp, and presented an easy-to-check sufficient condition to guarantee their convergences in the non-convex stochastic setting. This sufficient condition merely depends on the base learning rate αt\alpha_{t} and the linear combination parameter θt\theta_{t} of second-order moments. Relying on this sufficient condition, we found that the divergences of Adam and RMSProp are possibly due to the incorrect parameter settings of αt\alpha_{t} and θt\theta_{t}. In addition, we reformulated Adam as weighted AdaGrad with exponential moving average momentum, which provides a novel perspective for understanding Adam and RMSProp. At last, the correctness of theoretical results was also verified via the counterexample in and training deep neural networks on real-world datasets.

References

Notations

Appendix A Key Lemmas

In this section we provide the necessary lemmas for the proofs of Theorems 4 and 5.

Given S0>0S_{0}>0 and a non-negative sequence {st}\{s_{t}\}, let St=S0+i=1tsiS_{t}=S_{0}+\sum_{i=1}^{t}s_{i} for t1t\geq 1. Then the following estimate holds

The finite sum t=1Tst/St\sum_{t=1}^{T}{s_{t}}/{S_{t}} can be interpreted as a Riemann sum t=1T(StSt1)/St.\sum_{t=1}^{T}(S_{t}-S_{t-1})/S_{t}. Since 1/x1/x is decreasing on the interval (0,)(0,\infty), we have

Let {ut}\{u_{t}\} and {st}\{s_{t}\} be two non-negative sequences. Let St=i=1tsiS_{t}=\sum_{i=1}^{t}s_{i} for t1t\geq 1. Then

Let {θt}\{\theta_{t}\} and {αt}\{\alpha_{t}\} satisfy the restrictions (R2) and (R3). For any iti\leq t, we have

For any iti\leq t, since the sequence {at}\{a_{t}\} is non-increasing, we have ataia_{t}\leq a_{i}. Hence,

which proves the first inequality. On the other hand, since {θt}\{\theta_{t}\} is non-decreasing, it holds

Let Θ(t,i)=j=i+1tθj\Theta_{(t,i)}=\prod_{j=i+1}^{t}\theta_{j} for i<ti<t and Θ(t,t)=1\Theta_{(t,t)}=1 by convention.

Fix a constant θ\theta^{\prime} with β2<θ<θ\beta^{2}<\theta^{\prime}<\theta. Let C1C_{1} be as given as Eq. (6) in the main paper. For any iti\leq t, we have

For any iti\leq t, since θjθ\theta_{j}\geq\theta^{\prime} for jNj\geq N, and θj<θ\theta_{j}<\theta^{\prime} for j<Nj<N, we have

We take the constant C1=j=1N(θj/θ)C_{1}=\prod_{j=1}^{N}(\theta_{j}/\theta^{\prime}), where NN is the maximum of the indices for which θj<θ\theta_{j}<\theta^{\prime}. The proof is completed. ∎

If θt=θ\theta_{t}=\theta is a constant, we have Θ(t,i)=θti\Theta_{(t,i)}=\theta^{t-i}. In this case we can take θ=θ\theta^{\prime}=\theta and C1=1C_{1}=1.

Let γ:=β2/θ\gamma:=\beta^{2}/{\theta^{\prime}}. We have the following estimate

Let B(t,i)=j=i+1tβjB_{(t,i)}=\prod_{j=i+1}^{t}\beta_{j} for i<ti<t and B(t,t)=1B_{(t,t)}=1 by convention. By the iteration formula mt=βtmt1+(1βt)gt\bm{m}_{t}=\beta_{t}\bm{m}_{t-1}+(1-\beta_{t})\bm{g}_{t} and m0=0\bm{m}_{0}=\bm{0}, we have

Similarly, by vt=θtvt1+(1θt)gt2\bm{v}_{t}=\theta_{t}\bm{v}_{t-1}+(1-\theta_{t})\bm{g}_{t}^{2} and v0=ϵ\bm{v}_{0}=\bm{\epsilon}, we have

Note that {θt}\{\theta_{t}\} is non-decreasing by (R2), and B(t,i)βtiB_{(t,i)}\leq\beta^{t-i} by (R1). By Lemma 18, we have

Combining Eq. (18) and Eq. (19), we obtain the desired Eq. (16). The proof is completed. ∎

where C2=2(β/(1β)C1(1γ)θ1+1)2C_{2}=2\left(\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\right)^{2}.

To estimate (I), by the Schwartz inequality and the Lipschitz continuity of the gradient of ff, we have

Note that σtG\bm{\sigma}_{t}\leq G. Therefore,

where C2=(β/(1β)C1(1γ)θ1+1)C_{2}^{\prime}=\left(\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\right). The last inequality holds due to βt/(1βt)β/(1β)\beta_{t}/(1-\beta_{t})\leq\beta/(1-\beta) as βtβ\beta_{t}\leq\beta. Therefore, we have

Combining Eq. (27), Eq. (33), and Eq. (34), we obtain

The term (IV) is estimated similarly as term (III). First, we have

where C2C_{2}^{\prime} is the constant defined above. We have

Combining Eq. (22), Eq. (23), Eq. (25), Eq. (26), Eq. (35), and Eq. (37), we obtain

Let C2C_{2} denote the constant 2(C2)2{2(C_{2}^{\prime})^{2}}. Then

Next, we estimate Eq. (21). When t=1t=1, we have

The same as what we did for term (I) in Lemma 21, we have

Then the similar argument as Eq. (33) implies that

Note that vtθtvt1\bm{v}_{t}\geq\theta_{t}\bm{v}_{t-1}, so we have vt(j=i+1tθj)vi=Θ(t,i)vi\bm{v}_{t}\geq\left(\prod_{j=i+1}^{t}\theta_{j}\right)\bm{v}_{i}=\Theta_{(t,i)}\bm{v}_{i}. By Lemma 18, this follows that vtC1(θ)tivi\bm{v}_{t}\geq C_{1}(\theta^{\prime})^{t-i}\bm{v}_{i} for all iti\leq t. On the other hand,

Since αt=χt1θtχt1θi\alpha_{t}=\chi_{t}\sqrt{1-\theta_{t}}\leq\chi_{t}\sqrt{1-\theta_{i}} for iti\leq t, it follows that

It is straightforward to acquire by induction that

By Lemma 17, it holds αtC0αi\alpha_{t}\leq C_{0}\alpha_{i} for any iti\leq t. By Lemma 18, Θ(t,i)C1(θ)ti\Theta_{(t,i)}\geq C_{1}(\theta^{\prime})^{t-i}. In addition, B(t,i)βtiB_{(t,i)}\leq\beta^{t-i}. Hence,

Combining Eq. (51) and Eq. (52), we then obtain the desired estimate Eq. (47). The proof is completed. ∎

Let W0=1W_{0}=1 and Wt=i=1Tθi1W_{t}=\prod_{i=1}^{T}\theta_{i}^{-1}. Let wt=WtWt1=(1θt)i=1tθi1=(1θt)Wtw_{t}=W_{t}-W_{t-1}=(1-\theta_{t})\prod_{i=1}^{t}\theta_{i}^{-1}=(1-\theta_{t})W_{t}. We therefore have

Note that v0=ϵ\bm{v}_{0}=\bm{\epsilon} and vt=θtvt1+(1θt)gt\bm{v}_{t}=\theta_{t}\bm{v}_{t-1}+(1-\theta_{t})\bm{g}_{t}, so it holds that W0v0=ϵW_{0}\bm{v}_{0}=\bm{\epsilon} and Wtvt=Wt1vt1+wtgt2.W_{t}\bm{v}_{t}=W_{t-1}\bm{v}_{t-1}+w_{t}\bm{g}_{t}^{2}. Then, Wtvt=W0v0+i=1twigi2=ϵ+i=1twigi2.W_{t}\bm{v}_{t}=W_{0}\bm{v}_{0}+\sum_{i=1}^{t}w_{i}\bm{g}_{i}^{2}=\bm{\epsilon}+\sum_{i=1}^{t}w_{i}\bm{g}_{i}^{2}. It follows that

Writing the norm in terms of coordinates, we obtain

The second inequality is due to the convex inequality 1dk=1dlog(zi)log(1dk=1dzi)\frac{1}{d}\sum_{k=1}^{d}\log\left(z_{i}\right)\leq\log\left(\frac{1}{d}\sum_{k=1}^{d}z_{i}\right). Indeed, we have the more general convex inequality that

for any positive random variable XX. Taking XX to be 1+1ϵdi=1twigi21+\frac{1}{\epsilon d}\sum_{i=1}^{t}w_{i}\left\|\bm{g}_{i}\right\|^{2} in the right hand side of Eq. (57), we obtain that

The last inequality is due to the following trivial inequality:

for any non-negative parameters aa and bb. It then follows that

For simplicity of notations, let ωt:=1θtgtvt2\omega_{t}:=\left\|\frac{\sqrt{1-\theta_{t}}\bm{g}_{t}}{\sqrt{\bm{v}_{t}}}\right\|^{2}, and Ωt:=i=1tωi\Omega_{t}:=\sum_{i=1}^{t}\omega_{i}. Note that χtC0at\chi_{t}\leq C_{0}a_{t}. Hence,

Let St:=log(1+G2ϵd)+i=1tlog(θi1)S_{t}:=\log\left(1+\frac{G^{2}}{\epsilon d}\right)+\sum_{i=1}^{t}\log(\theta_{i}^{-1}). By Lemma 25, we have

Since {at}\{a_{t}\} is a non-increasing sequence, we have atat+10a_{t}-a_{t+1}\geq 0. By Eq. (63), we have

Note that atχta_{t}\leq\chi_{t}. Combining Eq. (62), Eq. (63), and Eq. (65), we have

Note that log(1+x)x\log(1+x)\leq x for all x>1x>-1. It follows that

Note that χt=αt/1θt\chi_{t}=\alpha_{t}/\sqrt{1-\theta_{t}}. By Eq. (62) and Eq. (65), we have

Let τ\tau be randomly chosen from {1,2,,T}\{1,2,\ldots,T\} with equal probabilities pτ=1/Tp_{\tau}=1/T. We have the following estimate

For any two random variables XX and YY, by the Hölder’s inequality, we have

Let X=(f(xt)2v^t1)2/3X=\left(\frac{\left\|\bm{\nabla}f(\bm{x}_{t})\right\|^{2}}{\sqrt{\left\|\bm{\hat{v}}_{t}\right\|_{1}}}\right)^{2/3}, Y=v^t11/3Y=\left\|\bm{\hat{v}}_{t}\right\|_{1}^{1/3}, and let p=3/2p=3/2, q=3q=3. By Eq. (69), we have

Since v^t=θtvt1+(1θt)σt2\bm{\hat{v}}_{t}=\theta_{t}\bm{{v}}_{t-1}+(1-\theta_{t})\bm{\sigma}_{t}^{2}, and all entries are non-negative, we have

By Eq. (70), Eq. (71), and Eq. (72), we obtain

By Lemma 17, αTC0αt\alpha_{T}\leq C_{0}\alpha_{t} for any tTt\leq T, so αt1C0αT1\alpha_{t}^{-1}\leq C_{0}\alpha_{T}^{-1}. Then, we obtain

Appendix B Proofs of the main results

In this section, we provide the detailed proofs of the propositions, theorems, and corollaries in the main body.

Algorithm 1 and Algorithm 2 are equivalent.

It suffices to show that Algorithm 1 can be realized as Algorithm 2 with a particular choice of parameters, and vice versa. Note that for Algorithm 1, it holds

Hence, given the parameters θt\theta_{t} in Algorithm 1, we take wt=(1θt)i=1tθi1w_{t}=(1-\theta_{t})\prod_{i=1}^{t}\theta_{i}^{-1}. Then it holds

It follows that Eq. (77) becomes Eq. (76). Conversely, given the parameters wtw_{t} of Algorithm 2, we take θt=Wt1/Wt\theta_{t}=W_{t-1}/W_{t}. Then Eq. (76) becomes Eq. (77). The proof is completed. ∎

B.2 Proof of Theorem 4

Let {xt}\{\bm{x}_{t}\} be a sequence generated by Generic Adam for initial values x1\bm{x}_{1}, m0=0\bm{m}_{0}=\bm{0}, and v0=ϵ\bm{v}_{0}=\bm{\epsilon}. Assume that ff and stochastic gradients gt\bm{g}_{t} satisfy assumptions (A1)-(A4). Let τ\tau be randomly chosen from {1,2,,T}\{1,2,\ldots,T\} with equal probabilities pτ=1/Tp_{\tau}=1/T. We have the following estimate

where the constants CC and CC^{\prime} are given by

By the LL-Lipschitz continuity of the gradient of ff and the descent lemma, we have

where C3C_{3} is the constant given in Lemma 24. By applying the estimates in Lemma 25 and Lemma 27 for the second and third terms in the right hand side of Eq. (81), and appropriately rearranging the terms, we obtain

B.3 Proof of Theorem 5

Let {xt}\{\bm{x}_{t}\} be a sequence generated by Generic Adam for initial values x1\bm{x}_{1}, m0=0\bm{m}_{0}=\bm{0}, and v0=ϵ\bm{v}_{0}=\bm{\epsilon}. Assume that ff and stochastic gradients gt\bm{g}_{t} satisfy assumptions (A1)-(A4). Let τ\tau be randomly chosen from {1,2,,T}\{1,2,\ldots,T\} with equal probabilities pτ=1/Tp_{\tau}=1/T. Then for any δ>0\delta>0, the following bound holds with probability at least 1δ2/31-\delta^{2/3}:

where the constants CC and CC^{\prime} are given by

in which the constant C3C_{3} is given by

Namely, P(ζ>C(T)δ)δ2/3\mathcal{P}\left(|\zeta|>\frac{C(T)}{\delta}\right)\leq\delta^{2/3}. Therefore, P(ζC(T)δ)1δ2/3\mathcal{P}\left(|\zeta|\leq\frac{C(T)}{\delta}\right)\geq 1-\delta^{2/3}. This completes the proof. ∎

B.4 Proof of Corollary 7

Take αt=η/ts\alpha_{t}=\eta/t^{s} with 0s<10\leq s<1. Suppose limtθt=θ<1\lim_{t\to\infty}\theta_{t}=\theta<1.Then Bound(T)Bound(T) in Theorem 5 is bounded from below by constants

In particular, when θt=θ<1\theta_{t}=\theta<1, we have the following more subtle estimate on lower and upper-bounds for Bound(T)Bound(T):

Since limtθt=θ\lim_{t\to\infty}\theta_{t}=\theta, and θt\theta_{t} is non-decreasing, we have (1θt)1θ(1-\theta_{t})\geq 1-\theta. Hence, by Theorem 5, it holds

If, in particular, θt=θ<1\theta_{t}=\theta<1, by Theorem 5 we have

Combining Eqs. (87)-(88), we obtain the desired result. ∎

B.5 Proof of Corollary 10

Generic Adam with the above family of parameters converges as long as 0<r2s<20<r\leq 2s<2, and its non-asymptotic convergence rate is given by

It is not hard to verify that the following equalities hold:

In this case, TαT=ηT1sT\alpha_{T}=\eta T^{1-s}. Therefore, by Theorem 5 the non-asymptotic convergence rate is given by

To guarantee convergence, then 0<r2s<20<r\leq 2s<2. ∎

B.6 Proof of Corollary 12

Suppose that in Weighted AdaEMA the weights wt=trw_{t}=t^{r} for r ⁣ ⁣0r\!\geq\!0, and αt ⁣= ⁣η/t\alpha_{t}\!=\!\eta/\sqrt{t}. Then Weighted AdaEMA has the O(log(T)/T)\mathcal{O}(\log(T)/\sqrt{T}) non-asymptotic convergence rate.

By the proof procedures of Theorem 3, the equivalent Generic Adam has the parameters θt=Wt1/Wt\theta_{t}=W_{t-1}/W_{t}, where Wt=1+i=1twiW_{t}=1+\sum_{i=1}^{t}w_{i}. Hence, it holds

We have that limtθt=1>β\lim_{t\to\infty}\theta_{t}=1>\beta and θt\theta_{t} is increasing. In addition, we have that χt=αt/1θt\chi_{t}=\alpha_{t}/\sqrt{1-\theta_{t}} is bounded, and hence “almost” non-increasing (by taking at=1a_{t}=1 in (R3)). The restrictions (R1)-(R3) are all satisfied. Hence, we can apply Theorem 5 in this case. It follows that its convergence rate is given by

Appendix C Experimental Implementations

In this section, we describe the statistics of the training and validation datasets of MNISThttp://yann.lecun.com/exdb/mnist/ and CIFAR-100https://www.cs.toronto.edu/ kriz/cifar.html, the architectures of LeNet and ResNet-18, and detailed implementations.

MNIST is composed of ten classes of digits among {0,1,2,,9}\{0,1,2,\ldots,9\}, which includes 60,000 training examples and 10,000 validation examples. The dimension of each example is 28×2828\times 28.

CIFAR-100 is composed of 100 classes of 32×3232\times 32 color images. Each class includes 6,000 images. In addition, these images are devided into 50,000 training examples and 10,000 validation examples.

C.2 Architectures of Neural Networks

C.3 Additional Experiments of ResNet-18 on CIFAR-100

We further illustrate Generic Adam with different r={0,0.25,0.5,0.75,1}r=\{0,0.25,0.5,0.75,1\}, RMSProp, and AMSGrad with an alternative base learning rate α=0.01\alpha=0.01 on ResNet-18. We do cut-off by taking αt=0.001\alpha_{t}=0.001 if t<2500t<2500. Note that αt\alpha_{t} is still non-increasing. The motivation is that at the very beginning the learning rate αt=0.01t\alpha_{t}=\frac{0.01}{\sqrt{t}} could be large which would deteriorate the performance. The performance profiles are also exactly in accordance with the analysis in theory, i.e., larger rr leads to a faster training process.