Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

Congliang Chen, Li Shen, Fangyu Zou, Wei Liu

Introduction

Large-scale non-convex stochastic optimization (Bottou et al., 2018), covering a slew of applications in statistics and machine learning (Jain et al., 2017; Bottou et al., 2018) such as learning a latent variable from massive data whose probability density distribution is unknown, takes the following generic formulation:

Alternatively, a compromised approach to handle this difficulty is to use an unbiased stochastic estimate of  ⁣f(x)\bm{\nabla}\!{f}(\bm{x}), denoted as g(x,ξ)g(\bm{x},\xi), which leads to the stochastic gradient descent (SGD) algorithm (Robbins and Monro, 1985). Its coordinate-wise version is defined as follows:

for k=1,2,,dk=1,2,\ldots,d, where ηt,k0\eta_{t,k}\geq 0 is the learning rate of the kk-th component of stochastic gradient g(xt,ξt)\bm{g}(\bm{x}_{t},\xi_{t}) at the tt-th iteration. Under some mild assumptions (e.g., the optimal solution exists), a sufficient condition (Robbins and Monro, 1985) to ensure the global convergence of vanilla SGD in Eq. (2) is to require ηt\eta_{t} to meet the following diminishing condition:

Although the vanilla SGD algorithm with learning rate ηt\eta_{t} satisfying condition (3) does converge, its empirical performance could be still stagnating, since it is difficult to tune an effective learning rate ηt\eta_{t} via condition (3).

To further improve the empirical performance of SGD, a large variety of adaptive SGD algorithms, including AdaGrad (Duchi et al., 2011), RMSProp (Hinton et al., 2012), Adam (Kingma and Ba, 2014), Nadam (Dozat, 2016), AdaBound (Luo et al., 2019), etc., have been proposed to automatically tune the learning rate ηt\eta_{t} by using second-order moments of historical stochastic gradients {gt}\{\bm{g}_{t}\}. Let vt,k\bm{v}_{t,k} and mt,k\bm{m}_{t,k} be the exponential moving average of the historical second-order moments (g1,k2,g2,k2,,gt,k2)(\bm{g}^{2}_{1,k},\bm{g}^{2}_{2,k},\cdots,\bm{g}^{2}_{t,k}) and stochastic gradient estimates (g1,k,g2,k,,gt,k)(\bm{g}_{1,k},\bm{g}_{2,k},\cdots,\bm{g}_{t,k}), respectively. More specifically, two groups of hyperparameters (βt\beta_{t}, θt\theta_{t}) will be involved into the calculation of mt,k=βtmt1,k+(1βt)gt,km_{t,k}=\beta_{t}m_{t-1,k}+(1-\beta_{t})g_{t,k} and vt,k=θtvt1,k+(1θt)gt,k2v_{t,k}=\theta_{t}v_{t-1,k}+(1-\theta_{t})g^{2}_{t,k}. Then, the generic iteration scheme of these adaptive SGD algorithms (Reddi et al., 2018; Chen et al., 2018a) is summarized as

for k=1,2,,dk=1,2,\ldots,d, where αt>0\alpha_{t}>0 is called base learning rate and it is independent of stochastic gradient estimates (g1,k,g2,k,,gt,k)(\bm{g}_{1,k},\bm{g}_{2,k},\cdots,\bm{g}_{t,k}) for all t1t\geq 1. Although Adam works well for solving large scale convex and non-convex optimization problems such as training deep neural networks, it has been disclosed to be divergent in some scenarios via counterexamples (Reddi et al., 2018). Thus, without any further assumptions for corrections, Adam should not be directly used. Recently, developing sufficient conditions to guarantee global convergences of Adam -type algorithms has attracted much attention from both machine learning and optimization communities. The existing successful attempts can be divided into four categories: decreasing a learning rate, adopting a big batch size, incorporating a temporal decorrelation, and seeking an analogous surrogate. However, some of them are either hard to check or impractical. In this work, we will first introduce an alternative easy-to-check sufficient condition to guarantee the global convergences of the original Adam.

Meanwhile, in practice, stochastic Adam, where a single sample is used to estimate gradient, converges slowly to the optimal point. People usually use mini-batch Adam instead to get faster convergence performance. In SGD, although how the sample size will affect the convergence has been well studied (Li et al., 2014), few works give analysis on mini-batch adaptive gradient methods especially on Adam, since mini-batch size largely affects adaptive learning rate ηt,k\eta_{t,k} in Eq .(4), which makes the analysis difficult. In this work, we give the first complexity analysis for mini-batch Adam, which shows that mini-batch Adam can also be theoretically accelerated by using a larger mini-batch size.

On the other hand, as the data size goes larger in machine learning problems, it is hard to collect, store and process data in a single machine. Several machines are involved in the optimization process. Hence, distributed optimization methods are proposed, where distributed Adam is also popularly used. Different from mini-batch Adam, where only one machine is used for optimization, several machines are involved. In the distributed setting, machines are connected via a network graph. More specifically, there are two kinds of structures used in distributed Adam: parameter-server structure and decentralized structure. In the parameter-server structure, there is one special machine called as parameter server and the rest called workers. The parameter server connects to all workers, but workers don’t connect to each other. Therefore, workers can share information with the parameter server in each communication round but cannot share information with the other workers. However, in the decentralized structure, there is not a server involved in the structure. A pre-defined graph connects all machines. A machine can only share information with its direct neighbors in each communication round. Still, few works answer how the local batch size and number of machines will affect the convergence of distributed Adam. In this work, because the analysis of distributed Adam under the parameter-server model is similar to Mini-batch Adam, we answer this question and show that distributed Adam under a parameter-server model can also achieve a linear speedup property as distributed SGD (Yu et al., 2019).

In summary, the contributions of this work are five-fold:

We introduce an easy-to-check sufficient condition to ensure the global convergences (i.e., averaged expected gradient norm converges to 0) of generic Adam in the common smooth non-convex stochastic setting with mild assumptions. Moreover, this sufficient condition is distinctive from the existing conditions and is easier to verify.

We provide a new explanation on the divergences of original Adam and RMSProp, which are possibly due to an incorrect parameter setting of the combinations of historical second-order moments.

We find that the sufficient condition extends the restrictions of RMSProp (Mukkamala and Hein, 2017) and covers many convergent variants of Adam, e.g., AdamNC, AdaGrad with momentum, etc. Thus, their convergences in the non-convex stochastic setting naturally hold.

We theoretically show that mini-batch Adam can be further accelerated by adopting a larger mini-batch size, and that distributed Adam can achieve a linear speed up property in the parameter-server distributed system by using commonly used sufficient condition parameters.

We conduct experiments to validate the sufficient condition for the convergences of Adam and mini-batch Adam. The experimental results match our theoretical results.

The paper is organized as follows. In Section 2, we first give the formulation of generic Adam and then discuss several works related to Adam including several existing sufficient convergence conditions, analysis of mini-batch, and distributed stochastic gradient methods. In Section 3, we derive the sufficient condition for convergence of Adam and provide several insights for the divergence of vanilla Adam. In Section 4, we give the complexity analysis on practical Adam with a commonly used sufficient condition parameter, including mini-batch Adam and distributed Adam. At last, in Section 5, we conduct some experiments under both theoretical settings and practical settings to verify the established theory. In addition, by practical Adam, we mean that we give a thorough analysis for Adam, mini-batch Adam, and distributed Adam, which have been commonly used for training deep neural networks without theoretical guarantees.

Related work

It is not hard to check that Generic Adam covers RMSProp by setting βt=0\beta_{t}=0 directly. Moreover, it covers Adam with a bias correction (Kingma and Ba, 2014) as follows:

The vanilla Adam with the bias correction (Kingma and Ba, 2014) takes constant parameters βt=β\beta_{t}=\beta and θt=θ\theta_{t}=\theta. The iteration scheme is written as xt+1=xtα^tm^tv^t\bm{x}_{t+1}=\bm{x}_{t}-\widehat{\alpha}_{t}\frac{\widehat{\bm{m}}_{t}}{\sqrt{\widehat{\bm{v}}_{t}}}, with m^t=mt1βt\widehat{\bm{m}}_{t}=\frac{\bm{m}_{t}}{1-\beta^{t}} and v^t=vt1θt\widehat{\bm{v}}_{t}=\frac{\bm{v}_{t}}{1-\theta^{t}}. Let αt=α^t1θt1βt\alpha_{t}=\widehat{\alpha}_{t}\frac{\sqrt{1-\theta^{t}}}{1-\beta^{t}}. Then, the above can be rewritten as xt+1=xtαtmt/vt\bm{x}_{t+1}=\bm{x}_{t}-{\alpha_{t}\bm{m}_{t}}/\sqrt{\bm{v}_{t}}. Thus, it is equivalent to taking constant βt\beta_{t}, constant θt\theta_{t}, and new base learning rate αt\alpha_{t} in Generic Adam.

2 Convergence Conditions for Adam

First, because Reddi et al. (2018) gave counterexamples on divergence of origin Adam, several sufficient conditions have been proposed to guarantee global convergences of Adam that can be summarized into the following four categories:

(C1) Decreasing a learning rate. Reddi et al. (2018) have declared that the core cause of divergences of Adam and RMSProp is largely controlled by the difference between the two adjacent learning rates, i.e.,

Once positive definiteness of Γt\Gamma_{t} is violated, Adam and RMSProp may suffer from divergence (Reddi et al., 2018). Based on this observation, two variants of Adam called AMSGrad and AdamNC have been proposed with convergence guarantees in both the convex (Reddi et al., 2018) and non-convex (Chen et al., 2018a) stochastic settings by requiring Γt0\Gamma_{t}\succ 0. In addition, Padam (Zhou et al., 2018a) extended from AMSGrad has been proposed to contract the generalization gap in training deep neural networks, whose convergence has been ensured by requiring Γt0\Gamma_{t}\succ 0. As a relaxation of Γt0\Gamma_{t}\succ 0, Barakat and Bianchi (2020) showed that when αt/vtαt1/(cvt1)\alpha_{t}/\sqrt{v_{t}}\leq\alpha_{t-1}/(c\sqrt{v_{t-1}}) holds for all tt and some positive cc, the algorithm Adam can converge. In the strongly convex stochastic setting, by using the long-term memory technique developed in (Reddi et al., 2018), Huang et al. (2018) have proposed NosAdam by attaching more weights on historical second-order moments to ensure its convergence. Prior to that, the convergence rate of RMSProp (Mukkamala and Hein, 2017) has already been established in the convex stochastic setting by employing similar parameters to those of AdamNC (Reddi et al., 2018).

(C2) Adopting a big batch size. Basu et al. (2018), for the first time, showed that deterministic Adam and RMSProp with original iteration schemes are convergent by using a full-batch gradient. On the other hand, both Adam and RMSProp can be reshaped as specific signSGD-type algorithms (Balles and Hennig, 2018; Bernstein et al., 2018) whose O(1/T)\mathcal{O}(1/\sqrt{T}) convergence rates have been provided in the non-convex stochastic setting by setting batch size as large as the number of maximum iterations (Bernstein et al., 2018). Recently, Zaheer et al. (2018) have established O(1/T)\mathcal{O}(1/\sqrt{T}) convergence rate of original Adam directly in the non-convex stochastic setting by requiring the batch size to be the same order as the number of maximum iterations. We comment that this type of requirement is impractical when Adam and RMSProp are applied to tackle large-scale problems (1), since these approaches cost a huge number of computations to estimate big-batch stochastic gradients in each iteration.

(C3) Incorporating a temporal decorrelation. By exploring the structure of the convex counterexample in (Reddi et al., 2018), Zhou et al. (2018b) have pointed out that the divergence of RMSProp is fundamentally caused by the imbalanced learning rate rather than the absence of Γt0\Gamma_{t}\succ 0. Based on this viewpoint, Zhou et al. (2018b) have proposed AdaShift by incorporating a temporal decorrelation technique to eliminate the inappropriate correlation between vt,k\bm{v}_{t,k} and the current second-order moment gt,k2\bm{g}_{t,k}^{2}, in which the adaptive learning rate ηt,k\eta_{t,k} is required to be independent of gt,k2\bm{g}^{2}_{t,k}. However, the convergence of AdaShift in (Zhou et al., 2018b) was merely restricted to RMSProp for solving the convex counterexample in (Reddi et al., 2018).

(C4) Seeking an analogous surrogate. Due to the divergences of Adam and RMSProp (Reddi et al., 2018), Zou et al. (2018) proposed a class of new surrogates called AdaUSM to approximate Adam and RMSProp by integrating weighted AdaGrad with a unified heavy ball and Nesterov accelerated gradient momentums. Its O(log(T)/T)\mathcal{O}(\log{(T)}/\sqrt{T}) convergence rate has also been provided in the non-convex stochastic setting by requiring a non-decreasing weighted sequence. Besides, many other adaptive stochastic algorithms without combining momentums, such as AdaGrad (Ward et al., 2018; Li and Orabona, 2019) and stagewise AdaGrad (Chen et al., 2018b), have been guaranteed to be convergent and work well in the non-convex stochastic setting.

In contrast with the above four types of modifications and restrictions, we introduce an alternative easy-to-check sufficient condition (abbreviated as (SC)) to guarantee the global convergences of original Adam. The proposed (SC) merely depends on the parameters in estimating vt,k\bm{v}_{t,k} and base learning rate αt\alpha_{t}. (SC) neither requires the positive definiteness of Γt\Gamma_{t} like (C1) nor needs the batch size as large as the same order as the number of maximum iterations like (C2) in both the convex and non-convex stochastic settings. Thus, it is easier to verify and more practical compared with (C1)-(C3). On the other hand, (SC) is partially overlapped with (C1) since the proposed (SC) can cover AdamNC (Reddi et al., 2018), AdaGrad with exponential moving average (AdaEMA) momentum (Chen et al., 2018a), and RMSProp (Mukkamala and Hein, 2017) as instances whose convergences are all originally motivated by requiring the positive definiteness of Γt\Gamma_{t}. While, based on (SC), we can directly derive their global convergences in the non-convex stochastic setting as byproducts without checking the positive definiteness of Γt\Gamma_{t} step by step. Besides, (SC) can serve as an alternative explanation on divergences of original Adam and RMSProp, which are possibly due to incorrect parameter settings for accumulating the historical second-order moments rather than the imbalanced learning rate caused by the inappropriate correlation between vt,k\bm{v}_{t,k} and gt,k2\bm{g}^{2}_{t,k} like (C3). In addition, AdamNC and AdaEMA are convergent under (SC), but violate (C3) in each iteration. Meanwhile, there are lots of work improving upper bounds for the above algorithms, e.g., Défossez et al. (2020) improved the constants related to β\beta by introducing a novel average scheme in the analysis.

3 Mini-batch Stochastic Gradient Methods

In practice, people usually use mini-batch stochastic gradient methods instead of single sample stochastic gradient methods or full gradient methods for faster convergence. For mini-batch SGD algorithms, Li et al. (2014) have shown that mini-batch SGD boosts O(1T)\mathcal{O}(\frac{1}{\sqrt{T}}) convergence rate of SGD to O(1sT)\mathcal{O}(\frac{1}{\sqrt{sT}}) where ss is the mini-batch size. However, as it is much harder to show the convergence of adaptive gradient methods, few works analyze how sample size will affect the convergence of the adaptive gradient algorithms. Li and Orabona (2019) gave an analysis on Adagrad and showed the convergence rate is linear in the sample size. Zaheer et al. (2018) gave the analysis on Adam, showing that large batch size can help convergence, but the batch size should increase with iteration increasing, which may not be practical. In this work, we theoretically show that mini-batch Adam can be accelerated by adopting a larger mini-batch size as mini-batch SGD (Li et al., 2014) in the same order.

4 Distributed Stochastic Gradient Methods

Distributed stochastic gradient descent was first introduced in Agarwal and Duchi (2011) in the parameter-server setting. Further, in the decentralized setting, Lian et al. (2017) gave the analysis on the stochastic gradient descent. The analysis shows that the convergence speed will be linear in the number of workers in the parameter-server setting or will be linear to some constant related to the decentralized graph structure. For the adaptive gradient methods, in the parameter-server setting, Reddi et al. (2020) gave algorithms in the federated scenario called FedAdam, FedAdagrad, and FedYogi. Moreover, they showed that the convergence speed will be linear in the number of workers. However, instead divided by vt\sqrt{\bm{v}_{t}}, they divide the gradient with vt+ϵ\sqrt{\bm{v_{t}}+\bm{\epsilon}}. Meanwhile, in their assumptions, ϵ\epsilon in the algorithm should be in the order of O(GL)O(\frac{G}{L}), where GG is the upper bound of gradient norm, and LL is the Lipschitz constant of the objective function. However, in practice, ϵ\epsilon is always set to be a small value, much smaller than G/LG/L. On the other hand, the large ϵ\epsilon may dominate the adaptive term in their algorithms. Hence, their methods may degrade to stochastic gradient descent. Carnevale et al. (2020) shoed that Adam with gradient tracking method can be linearly accelerated with an increasing number of nodes in the decentralized and strongly convex setting. Still, it is unclear whether, in the nonconvex setting, this linear speedup will hold when Adam is used. Moreover, Chen et al. (2020) gave an analysis of Adagrad and showed the convergence speed will be linear in the number of workers. Meanwhile, Nazari et al. (2019) gave the analysis of Adagrad in the decentralized setting. Xie et al. (2019) also gave a variant on Adagrad algorithm called AdaAlter in the centralized setting and showed the convergence will linearly speed up by increasing the number of workers. Recently, Chen et al. (2021, 2022) extend Adam to the distributed quantized Adam with error compensation technique Stich et al. (2018). However, the linear speedup property in (Chen et al., 2021, 2022) does not hold. To the best of our knowledge, whether the distributed Adam can achieve a linear speedup is still open. This paper theoretically demonstrates that the distributed Adam in the parameter-server model can achieve a linear speedup concerning the number of workers.

Novel Sufficient Condition for Convergence of Adam

In this section, we characterize the upper-bound of gradient residual of problem (1) as a function of parameters (θt,αt)(\theta_{t},\alpha_{t}). Then the convergence rate of Generic Adam is derived directly by specifying appropriate parameters (θt,αt)(\theta_{t},\alpha_{t}). Below, we state the necessary assumptions that are commonly used for analyzing the convergence of a stochastic algorithm for non-convex problems:

In addition, we also suppose that the parameters {βt}\{\beta_{t}\}, {θt}\{\theta_{t}\}, and {αt}\{\alpha_{t}\} satisfy the following restrictions:

The parameters {βt}\{\beta_{t}\} satisfy 0βtβ<10\leq\beta_{t}\leq\beta<1 for all tt for some constant β\beta;

The parameters {θt}\{\theta_{t}\} satisfy 0<θt<10<\theta_{t}<1 and θt\theta_{t} is non-decreasing in tt with θ:=limtθt>β2\theta:=\lim\limits_{t\to\infty}\theta_{t}>\beta^{2};

The parameters {αt}\{\alpha_{t}\} satisfy that χt:=αt1θt\chi_{t}:=\frac{\alpha_{t}}{\sqrt{1-\theta_{t}}} is “almost” non-increasing in tt, by which we mean that there exist a non-increasing sequence {at}\{a_{t}\} and a positive constant C0C_{0} independent of tt such that atχtC0ata_{t}\leq\chi_{t}\leq C_{0}a_{t}.

The restriction (R3) indeed says that χt\chi_{t} is the product between some non-increasing sequence {at}\{a_{t}\} and some bounded sequence. This is a slight generalization of χt\chi_{t} itself being non-decreasing. If χt\chi_{t} itself is non-increasing, we can then take at=χta_{t}=\chi_{t} and C0=1C_{0}=1. For most of the well-known Adam-type methods, χt\chi_{t} is indeed non-decreasing. For instance, for AdaGrad with EMA momentum we have αt=η/t\alpha_{t}=\eta/\sqrt{t} and θt=11/t\theta_{t}=1-1/t, so χt=η\chi_{t}=\eta is constant; for Adam with constant θt=θ\theta_{t}=\theta and non-increasing αt\alpha_{t} (say αt=η/t\alpha_{t}=\eta/\sqrt{t} or αt=η\alpha_{t}=\eta), χt=αt/1θ\chi_{t}=\alpha_{t}/\sqrt{1-\theta} is non-increasing. The motivation, instead of χt\chi_{t} being decreasing, is that it allows us to deal with the bias correction steps in Adam (Kingma and Ba, 2014).

We fix a positive constant θ>0\theta^{\prime}>0In the special case that θt=θ\theta_{t}=\theta is constant, we can directly set θ=θ\theta^{\prime}=\theta. such that β2<θ<θ\beta^{2}<\theta^{\prime}<\theta. Let γ:=β2/θ<1\gamma:={\beta^{2}}/{\theta^{\prime}}<1 and

where NN is the maximum of the indices jj with θj<θ\theta_{j}<\theta^{\prime}. The finiteness of NN is guaranteed by the fact that limtθt=θ>θ\lim_{t\to\infty}\theta_{t}=\theta>\theta^{\prime}. When there are no such indices, i.e., θ1θ\theta_{1}\geq\theta^{\prime}, we take C1=1C_{1}=1 by convention. In general, C11C_{1}\leq 1. Our main results on estimating gradient residual state as follows:

Let {xt}\{\bm{x}_{t}\} be a sequence generated by Generic Adam for initial values x1\bm{x}_{1}, m0=0\bm{m}_{0}=\bm{0}, and v0=ϵ\bm{v}_{0}=\bm{\epsilon}. Assume that ff and stochastic gradients gt\bm{g}_{t} satisfy assumptions (A1)-(A4). Let τ\tau be randomly chosen from {1,2,,T}\{1,2,\ldots,T\} with equal probabilities pτ=1/Tp_{\tau}=1/T. Then, we have

where C^{\prime}\!=\!{2C_{0}^{2}C_{3}d\sqrt{G^{2}\!+\!\epsilon d}}{\big{/}}{[(1\!-\!\beta)\theta_{1}]} and

in which C4C_{4} and C3C_{3} are defined as C4=f(x1)fC_{4}=f(x_{1})-f^{*} and C_{3}=\frac{C_{0}}{\sqrt{C_{1}}(1-\sqrt{\gamma})}\big{[}\frac{C_{0}^{2}\chi_{1}L}{C_{1}(1-\sqrt{\gamma})^{2}}+2\big{(}\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\big{)}^{2}G\big{]}, respectively.

Let {xt}\{\bm{x}_{t}\} be the sequence generated by Algorithm 1. For T1T\geq 1, it holds that

(Lemma 33 in Appendix) Let {xt}\{\bm{x}_{t}\} be the sequence generated by Algorithm 1. For T1T\geq 1, it holds that

for some constants ζ0\zeta_{0} and ζ1\zeta_{1}.

(Lemma 36 in Appendix) Let τ\tau be an integer that is randomly chosen from {1,2,,T}\{1,2,\cdots,T\} with equal probabilities. We have the following estimate

Thus, using the above three lemmas, we can prove Theorem 2.

2 Discussion of Theorem 2

Take αt=η/ts\alpha_{t}=\eta/t^{s} with 0s<10\leq s<1. Suppose limtθt=θ<1\lim_{t\to\infty}\theta_{t}=\theta<1. Define Bound(T):=C+Ct=1Tαt1θtTαTBound(T):=\frac{C+C^{\prime}\sum_{t=1}^{T}\alpha_{t}\sqrt{1-\theta_{t}}}{T\alpha_{T}}. Then the Bound(T)Bound(T) in Theorem 2 is bounded from below by constants

In particular, when θt=θ<1\theta_{t}=\theta<1, we have the following more subtle estimate of lower and upper-bounds for Bound(T)Bound(T)

(i) Corollary 7 shows that if limtθt=θ<1\lim_{t\to\infty}\theta_{t}=\theta<1, the bound in Theorem 2 is only O(1)\mathcal{O}(1), hence not guaranteeing convergence. This result is not surprising as Adam with constant θt\theta_{t} has already shown to be divergent (Reddi et al., 2018). Hence, O(1)\mathcal{O}(1) is its best convergence rate we can expect. We will discuss this case in more details in Section 3.4. (ii) Corollary 7 also indicates that in order to guarantee convergence, the parameter has to satisfy limtθt=1\lim_{t\to\infty}\theta_{t}=1. Although we do not assume this in our restrictions (R1)-(R3), it turns out to be the consequence from our analysis. Note that if β<1\beta<1 in (R1) and limtθt=1\lim_{t\to\infty}\theta_{t}=1, then the restriction limtθt>β2\lim_{t\to\infty}\theta_{t}>\beta^{2} is automatically satisfied in (R2).

We are now ready to give the Sufficient Condition for convergence of Generic Adam.

Generic Adam is convergent if the parameters {αt}\{\alpha_{t}\}, {βt}\{\beta_{t}\}, and {θt}\{\theta_{t}\} satisfy the following four conditions:

0<θt<10<\theta_{t}<1 and θt\theta_{t} is non-decreasing in tt;

χt:=αt/1θt\chi_{t}:=\alpha_{t}/\sqrt{1-\theta_{t}} is “almost” non-increasing;

\big{(}{\sum_{t=1}^{T}\alpha_{t}\sqrt{1-\theta_{t}}}\big{)}{\big{/}}\big{(}{T\alpha_{T}}\big{)}=o(1).

3 Convergence Rate of Generic Adam

We now provide the convergence rate of Generic Adam with specific parameters {(θt,αt)}\{(\theta_{t},\alpha_{t})\}, i.e.,

for positive constants α,η,K\alpha,\eta,K, where KK is taken such that α/Kr<1\alpha/K^{r}<1. Note that α\alpha can be taken bigger than 1. When α<1\alpha<1, we can take K=1K=1 and then θt=1α/tr,t1\theta_{t}=1-\alpha/t^{r},t\geq 1. To guarantee (R3), we require r2sr\leq 2s. For such a family of parameters we have the following corollary.

Generic Adam with the above family of parameters (i.e. (8)) converges as long as 0<r2s<20<r\leq 2s<2, and its non-asymptotic convergence rate is given by

Corollary 10 recovers and extends the results of some well-known algorithms below:

AdaGrad with exponential moving average (EMA). When θt=11/t\theta_{t}=1-1/t, αt=η/t\alpha_{t}=\eta/\sqrt{t}, and βt=β<1\beta_{t}=\beta<1, Generic Adam is exactly AdaGrad with EMA momentum (AdaEMA) (Chen et al., 2018a). In particular, if β=0\beta=0, this is the vanilla coordinate-wise AdaGrad. It corresponds to taking r=1r=1 and s=1/2s=1/2 in Corollary 10. Hence, AdaEMA has convergence rate O(log(T)/T)\mathcal{O}\left(\sqrt{\log(T)/\sqrt{T}}\right).

AdamNC. Taking θt=11/t\theta_{t}=1-1/t, αt=η/t\alpha_{t}=\eta/\sqrt{t}, and βt=βλt\beta_{t}=\beta\lambda^{t} in Generic Adam, where λ<1\lambda<1 is the decay factor for the momentum factors βt\beta_{t}, we recover AdamNC (Reddi et al., 2018). Its O(log(T)/T)\mathcal{O}\left(\sqrt{\log{(T)}/\sqrt{T}}\right) convergence rate can be directly derived via Corollary 10.

RMSProp. Mukkamala and Hein (2017) have reached the same O(log(T)/T)\mathcal{O}\left(\sqrt{\log{(T)}/\sqrt{T}}\right) convergence rate for RMSprop with θt=1α/t\theta_{t}=1-\alpha/t, when 0<α10<\alpha\leq 1 and αt=η/t\alpha_{t}=\eta/\sqrt{t} under the convex assumption. Since RMSProp is essentially Generic Adam with all momentum factors βt=0\beta_{t}=0, we recover Mukkamala and Hein’s results by taking r=1r=1 and s=1/2s=1/2 in Corollary 10. Moreover, our result generalizes to the non-convex stochastic setting, and it holds for all α ⁣> ⁣0\alpha\!>\!0 rather than only 0 ⁣< ⁣α ⁣ ⁣10\!<\!\alpha\!\leq\!1.

The summarization of the above algorithms is provided in Table 1.

Comparison with Reddi et al. (2018) . Most of the convergent modifications of original Adam, such as AMSGrad, AdamNC, and NosAdam, all require Γt0\Gamma_{t}\succ 0 in Eq. (5), which is equivalent to decreasing the adaptive learning rate ηt\eta_{t} step by step. Since the term Γt\Gamma_{t} (or adaptive learning rate ηt\eta_{t}) involves the past stochastic gradients (hence not deterministic), the modification to guarantee Γt0\Gamma_{t}\succ 0 either needs to change the iteration scheme of Adam (like AMSGrad) or needs to impose some strong restrictions on the base learning rates αt\alpha_{t} and θt\theta_{t} (like AdamNC). Our sufficient condition provides an easy-to-check criterion for the convergence of Generic Adam in Corollary 9. It is not necessary to require Γt0\Gamma_{t}\succ 0. Moreover, we use exactly the same iteration scheme as the original Adam without any modifications. Our work shows that the positive definiteness of Γt\Gamma_{t} may not be an essential issue for the divergence of the original Adam. The divergence may be due to the incorrect setting of moving average parameters instead of non-positive definiteness of Γt\Gamma_{t}.

The currently most popular RMSProp and Adam’s parameter setting takes constant θt\theta_{t}, i.e., θt=θ<1\theta_{t}=\theta<1. The motivation behind is to use the exponential moving average of squares of past stochastic gradients. In practice, parameter θ\theta is recommended to be set very close to 1. For instance, a commonly adopted θ\theta is taken as 0.999.

Although great performance in practice has been observed, such a constant parameter setting has the serious flaw that there is no convergence guarantee even for convex optimization, as proved by the counterexamples in (Reddi et al., 2018). Ever since much work has been done to analyze the divergence issue of Adam and to propose modifications with convergence guarantees, as summarized in the introduction section. However, there is still not a satisfactory explanation that touches the fundamental reason for the divergence. In this section, we try to provide more insights for the divergence issue of Adam/RMSProp with constant parameter θt\theta_{t}, based on our analysis of the sufficient condition for convergence.

From the sufficient condition perspective. Let αt ⁣= ⁣η/ts\alpha_{t}\!=\!\eta/t^{s} for 0s ⁣< ⁣10\leq s\!<\!1 and θt ⁣= ⁣θ ⁣< ⁣1\theta_{t}\!=\!\theta\!<\!1. According to Corollary 7, Bound(T)Bound(T) in Theorem 2 has the following estimate:

The bounds tell us some points on Adam with constant θt\theta_{t}:

Bound(T) ⁣= ⁣O(1)Bound(T)\!=\!\mathcal{O}(1), so the convergence is not guaranteed. This result coincides with the divergence issue demonstrated in (Reddi et al., 2018). Indeed, since in this case Adam is not convergent, this is the best bound we can have.

Consider the dependence on parameter ss. The bound is decreasing in ss. The best bound in this case is when s=0s=0, i.e., the base learning rate is taken constant. This explains why in practice taking a more aggressive constant base learning rate often leads to even better performance, comparing with taking a decaying one.

Consider the dependence on parameter θ\theta. Note that the constants CC and CC^{\prime} depend on θ1\theta_{1} instead of the whole sequence θt\theta_{t}. We can always set θt=θ\theta_{t}=\theta for t2t\geq 2 while fix θ1<θ\theta_{1}<\theta, by which we can take CC and CC^{\prime} independent of constant θ\theta. Then the principal term of Bound(T)Bound(T) is linear in 1θ\sqrt{1-\theta}, so decreases to zero as θ1\theta\to 1. This explains why setting θ\theta close to 1 often results in better performance in practice.

Moreover, Corollary 10 shows us how the convergence rate continuously changes when we continuously vary parameters θt\theta_{t}. Let us fix αt ⁣= ⁣1/t\alpha_{t}\!=\!1/\sqrt{t} and consider the following continuous family of parameters {θt(r)}\{\theta_{t}^{(r)}\} with rr\in:

Note that when r=1r=1, then θt=11/t\theta_{t}=1-1/t, this is the AdaEMA, which has the convergence rate O(logT/T)\mathcal{O}\left(\sqrt{\log T/\sqrt{T}}\right); when r=0r=0, then θt=θˉ<1\theta_{t}=\bar{\theta}<1, this is the original Adam with constant θt\theta_{t}, which only has the O(1)\mathcal{O}(1) bound; when 0<r<10<r<1, by Corollary 10, the algorithm has the O(Tr/4)\mathcal{O}(T^{-r/4}) convergence rate. Along with this continuous family of parameters, we observe that the theoretical convergence rate continuously deteriorates as the real parameter rr decreases from 1 to 0, namely, as we gradually shift from AdaEMA to Adam with constant θt\theta_{t}. In the limiting case, the latter is not guaranteed with convergence anymore.

Complexity Analysis for Practical Adam: Mini-batch/Distributed Adam

Due to the limited time, limited computational resources, and noise in data collection and processing, it is almost impossible to achieve the accurate stationary point. Thus, instead of achieving the accurate stationary point, people get more attention to achieving some approximated stationary point in practice. The crucial question under this situation will become how much time is needed to achieve some approximated solution. This section will answer this question by answering how many iterations are needed to obtain an ε\varepsilon-stationary point. First, we define ε\varepsilon-stationary point as follows:

We define a random variable x\bm{x} as an ε\varepsilon-stationary point of problem (1), if

According to Definition 12 and Theorem 2, for generic Adam, we can directly give the following corollary:

For any ε>0\varepsilon>0, if we take TC52ε4T\geq C_{5}^{2}\varepsilon^{-4}, αt=αT, βt=β,θt=1θT\alpha_{t}=\frac{\alpha}{\sqrt{T}},\ \beta_{t}=\beta,\theta_{t}=1-\frac{\theta}{T}, which satisfy γ=β1θT<1\gamma=\frac{\beta}{1-\frac{\theta}{T}}<1 and θt14\theta_{t}\geq\frac{1}{4}, then by taking τ\tau uniformly from {1,2,,T}\{1,2,\cdots,T\}, it holds that

Difference from Corollary 7. Corollary 7 shows when limtθt<1lim_{t\rightarrow\infty}\theta_{t}<1 the algorithm cannot converge to a stationary point. However, because the goal of the algorithm switches to an ε\varepsilon-stationary point, the choice of αt\alpha_{t} and θt\theta_{t} is not contradicting Corollary 7. We will use the same choice in the following section.

With the parameter setting in Corollary 10, the number of iterations T should satisfy Tlog2T=Ω(ε4)\frac{T}{\log^{2}T}=\Omega(\varepsilon^{-4}), instead of T=Ω(ε4)T=\Omega(\varepsilon^{-4}), which gives a larger iteration number T. However, we can use the same parameter setting in Corollary 10 when T changes, while we need to change parameters in Corollary 13 for different T.

From Theorem 13, to achieve an ε\varepsilon-stationary point, only Ω(ε4)\Omega(\varepsilon^{-4}) iterations are needed. Comparing the result with SGD (Li et al., 2014), we have the same order of iterations to achieve an ε\varepsilon-stationary point.

In the following two sections, we will analyze two practical Adam variations, i.e., mini-batch and distributed Adam. Although they can use the same technique for analysis, we list two algorithms for readers in different communities (single machine learning algorithm (mini-batch Adam) v.s. multi-machine learning algorithm (distributed Adam)).

It has been shown that when ss samples are used to estimate the gradient in the stochastic gradient descent algorithm, the convergence speed can be accelerated ss times than the single sample algorithms Li et al. (2014). Meanwhile, in practice, the mini-batch technique is widely used to optimize problem (1) such as training a neural network with the Adam algorithm. In this section, we will give the analysis on mini-batch Adam. The Mini-batch Adam algorithm is defined in the following Algorithm 2. Different from Algorithm 1, in Algorithm 2 ss samples which are identically distributed and independent when the iterate xt\bm{x}_{t} is used to estimate the gradient f(xt)\bm{\nabla}f(\bm{x}_{t}). We average the ss estimates and use the averaged stochastic gradient, which should be a more accurate estimation to update xt\bm{x}_{t}.

To link sample size and convergence rate, we give a new assumption on the stochastic gradient and state it as follows:

We add (A5) to establish the relation between sample size and the convergence rate. Intuitively, with an increasing size of samples, the variance of the gradient estimator should reduce. Utilizing this reduction, we can obtain an ε\varepsilon-stationary point, with fewer iterations but a larger sample size. Also, this assumption is widely used in analysis such as Yan et al. (2018). The following results are given under assumptions (A1) to (A5), and the result of mini-batch Adam is given as follows:

For any ε>0\varepsilon>0, if we take αt=αT\alpha_{t}=\frac{\alpha}{\sqrt{T}}, βt=β\beta_{t}=\beta and θt=1θT\theta_{t}=1-\frac{\theta}{T}, which satisfy γ=βt2θt<1\gamma=\frac{\beta_{t}^{2}}{\theta_{t}}<1, θt14\theta_{t}\geq\frac{1}{4} and FT(T,s)εF_{T}(T,s)\leq\varepsilon, then there exists t{1,2,,T}t\in\{1,2,\cdots,T\} such that

where by taking ϵ=1sd\epsilon=\frac{1}{sd}, it holds that

Thus, to achieving an ε\varepsilon-stationary point, Ω(ε4s1)\Omega(\varepsilon^{-4}s^{-1}) iterations are needed.

Below, we give three comments on the above results: (i) From Theorem 17, to achieve an ε\varepsilon-stationary point, when we only consider the order with respect to ε\varepsilon, Ω(ε4)\Omega(\varepsilon^{-4}) iterations are needed. Besides, by jointly considering ε\varepsilon and batch size ss, we can accelerate the algorithm to achieve an ε\varepsilon-stationary point, where Ω(ε4s1)\Omega(\varepsilon^{-4}s^{-1}) iterations are needed, which indicates that Mini-batch Adam can be linearly accelerated with respect to the mini-batch size. The result is in the same order of mini-batch SGD in (Li et al., 2014). (ii) Deriving the linear speedup property of mini-batch Adam with respect to mini-batch size is much more difficult than the analysis techniques for mini-batch SGD (Li et al., 2014) since the adaptive learning rate in Algorithm 2 is highly coupled with mini-batch stochastic gradient estimates. In fact, the adaptive learning rate implicitly adjusts the magnitude of the learning rate with respect to mini-batch size, while the hand-crafted learning rate in mini-batch SGD has to be tuned carefully via a linear LR scaling technique (Krizhevsky, 2014; You et al., 2017) for a large mini-batch training. (iii) The dimension dependence of the above analysis is O(d)O(\sqrt{d}). Meanwhile, some analyses on (variants of) Adam (Chen et al. (2018a); Défossez et al. (2020); Zou et al. (2018)) achieve the same dimension dependence, while Zhou et al. (2018a) showed that in AMSGrad, RMSProp and Adagrad the dependence can be O(d1/4)O(d^{-1/4}).

2 Proof Sketch of Theorem 17

(Lemma 40 in Appendix) Let {xt}\{\bm{x}_{t}\} be the sequence generated by Algorithm 2 or 3. For T1T\geq 1, when αt=αT\alpha_{t}=\frac{\alpha}{\sqrt{T}} and θt=1θT\theta_{t}=1-\frac{\theta}{T}, we have

for some constants ζ3\zeta_{3}, ζ4\zeta_{4} and ζ5\zeta_{5}.

(Lemma 41 in Appendix) Let {xt}\{\bm{x}_{t}\} be the sequence generated by Algorithm 2 or 3. For T1T\geq 1, it holds that

for some constants ζ6\zeta_{6}, ζ7\zeta_{7} and ζ8\zeta_{8}.

3 Convergence Analysis for Distributed Adam

For large-scale problems such as training deep convolutional neural networks over the ImageNet dataset (Russakovsky et al., 2015), it is hard to optimize problem (1) on a single machine. In this section, we extend the mini-batch Adam to the distributed Adam like the distributed SGD method (Yu et al., 2019). The simplest structure is the parameter-server model in the distributed setting, where a parameter server and multiple workers are involved in the optimization process. As it is shown in Algorithm 3, in each iteration, a worker receives the iterate xt\bm{x}_{t} from the server, samples a stochastic gradient with respect to xt\bm{x}_{t}, and sends the gradient to the server. Meanwhile, the parameter server receives gradients from workers in each iteration, averages the gradients, and performs an Adam update.

Algorithm 3 with ss workers performs the same as Algorithm 2 with ss i.i.d. samples.

Below, we give two remarks on the above distributed Adam algorithm: (i) For distributed Adam, to achieve an ε\varepsilon-stationary point, Ω(ε4s1)\Omega(\varepsilon^{-4}s^{-1}) iterations are needed, which is a linear speedup with respect to the number of workers in the network, which is in the same order as that is in distributed SGD (Yu et al., 2019). (ii) Distributed Adam has been popularly used for training deep neural networks. In addition, there also exist several variants of the distributed Adam algorithm, such as PMD-LAMB (Wang et al., 2020), LAMB (You et al., 2019), LARS (You et al., 2017), etc, for training large-scale deep neural networks. However, all these works do not establish the linear speedup property for distributed adaptive methods.

Experimental Results

In this section, we experimentally validate the proposed sufficient condition by applying Generic Adam and RMSProp to solve the counterexample (Chen et al., 2018a) and to train LeNet (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 2010) and ResNet (He et al., 2016) on the CIFAR-100 dataset (Krizhevsky, 2009), respectively. Moreover, we use different batch sizes to train LeNet (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 2010) and ResNet (He et al., 2016) on the CIFAR-100 dataset (Krizhevsky, 2009), respectively, and validate theory of the mini-batch Adam algorithm.

Sensitivity of parameter rr. We set T=105T=10^{5}, αt=5/t\alpha_{t}=5/\sqrt{t}, β=0.9\beta=0.9, and θt\theta_{t} as θt(r)=1(0.01+0.99r2)/tr\theta_{t}^{(r)}=1-(0.01+0.99r^{2})/{t^{r}} with r{0, 0.25, 0.5, 0.75, 1.0}r\in\{0,\ 0.25,\ 0.5,\ 0.75,\ 1.0\}, respectively. Note that when r=0r=0, Generic Adam reduces to the originally divergent Adam (Kingma and Ba, 2014) with (β,θˉ)=(0.9,0.99)(\beta,\bar{\theta})=(0.9,0.99). When r=1r=1, Generic Adam reduces to AdaEMA (Chen et al., 2018a) with β=0.9\beta=0.9.

The experimental results are shown in the left figure of Figure 1. We can see that for r=1.0,0.75r=1.0,0.75 and 0.50.5, Generic Adam is convergent. Moreover, the convergence becomes slower when rr decreases, which exactly matches Corollary 10. On the other hand, for r=0r=0 and 0.250.25, Figure 1 shows that they do not converge. It seems that the divergence for r=0.25r=0.25 contradicts our theory. However, this is because when rr is very small, the O(Tr/4)\mathcal{O}(T^{-r/4}) convergence rate is so slow that we may not see a convergent trend in even 10510^{5} iterations. Indeed, for r=0.25r=0.25, we actually have

which is not sufficiently close to 1. As a complementary experiment, we fix the numerator and only change rr when rr is small. We take αt\alpha_{t} and βt\beta_{t} as the same, while θt(r)=10.01tr\theta_{t}^{(r)}=1-\frac{0.01}{t^{r}} for r=0r=0 and 0.250.25, respectively. The result is shown in the middle figure of Figures 1. We can see that Generic Adam with r=0.25r=0.25 is indeed convergent in this situation.

Sensitivity of parameter ss. Now, we show the sensitivity of ss of the sufficient condition (SC) by fixing r ⁣= ⁣0.8r\!=\!0.8 and selecting ss from the collection s={0.4,0.6,0.8}s=\{0.4,0.6,0.8\}. The right figure in Figure 1 illustrates the sensitivity of parameter ss when Generic Adam is applied to solve the counterexample (10). The performance shows that when ss is fixed, smaller rr can lead to a faster and better convergence speed, which also coincides with the convergence results in Corollary 10.

2 LeNet on MNIST and ResNet-18 on CIFAR-100

In the experiments, for Generic Adam, we set θt(r)=1(0.001+0.999r)/tr\theta_{t}^{(r)}=1-(0.001+0.999r)/t^{r} with r{0,0.25,0.5,0.75,1}r\in\{0,0.25,0.5,0.75,1\} and βt=0.9\beta_{t}=0.9, respectively; for RMSProp, we set βt=0\beta_{t}=0 and θt=11t\theta_{t}=1-\frac{1}{t} along with the parameter settings in Mukkamala and Hein (2017). For fairness, the base learning rates αt\alpha_{t} in Generic Adam, RMSProp, and AMSGrad are all set as 0.001/t0.001/\sqrt{t}. Figures 3 and 3 illustrate the results of Generic Adam with different rr, RMSProp, and AMSGrad for training LeNet on MNIST and training ResNet-18 on CIFAR-100, respectively. We can see that AMSGrad and Adam (Generic Adam with r=0r=0) decrease the training loss slowest and show the worst test accuracy among the compared optimizers. One possible reason is due to the use of constant θ\theta in AMSGrad and original Adam. By Figures 3 and 3, we can observe that the convergences of Generic Adam are extremely sensitive to the choice of parameter θt\theta_{t}. Larger rr can contribute to a faster convergence rate of Generic Adam, which corroborates the theoretical result in Corollary 10. Additionally, the test accuracies in Figures 3(b) and 3(b) indicate that a smaller training loss can contribute to a higher test accuracy for Generic Adam.

3 Experiments on Practical Adam

In this section, we apply mini-batch Adam algorithms and mini-batch SGD algorithms to the following quadratic minimization task:

3.2 Vision Tasks

3.3 Transformer XL on WikiText-103

Also, we applied mini-batch Adam to train a base model of Transformer XL (Dai et al., 2019) on the dataset WikiText-103 (Merity et al., 2016). The base model of Transformer XL contains 16 self-attention layers. In each self-attention layer, there are 10 heads, and the encoding dimension of each head is set to 41. The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified ‘Good’ and ‘Featured’ articles on Wikipedia. We adopt the same parameter settings provided by the authors but test on batch size {30,60,120}\{30,60,120\}. The results are shown in Figure 7. Also, as it is shown in the figure, a larger batch size can give a lower training loss in all experiments. Meanwhile, in the figure, AMSGrad and Adam achieve much better performance than SGD-M, which shows the benefit of using adaptive methods instead of SGD-based methods, just like what was mentioned in Zhang et al. (2019).

Conclusions

In this work, we delved into the convergences of Adam, and presented an easy-to-check sufficient condition to guarantee their convergences in the non-convex stochastic setting. This sufficient condition merely depends on the base learning rate and the linear combination parameter of second-order moments. Relying on this sufficient condition, we found that the divergences of Adam are possibly due to the incorrect parameter settings. Besides, when encountering the practice Adam, we theoretically showed that the number of samples will linearly speed up the convergence in both the mini-batch setting and distributed setting, which closes the gap between theory and practice. At last, the correctness of theoretical results has also been verified via the counterexample and training deep neural networks on real-world datasets.

Acknowledgement

This work is supported by the Major Science and Technology Innovation 2030 “Brain Science and Brain-like Research” key project (No. 2021ZD0201405).

Appendix A Key Lemma to prove Theorem 2 and Theorem 13

In this section we provide the necessary lemmas for the proofs of Theorem 2 and Theorem 13. First, we give some notations for simplifying the following proof.

Given S0>0S_{0}>0 and a non-negative sequence {st}\{s_{t}\}, let St=S0+i=1tsiS_{t}=S_{0}+\sum_{i=1}^{t}s_{i} for t1t\geq 1. Then the following estimate holds

Proof The finite sum t=1Tst/St\sum_{t=1}^{T}{s_{t}}/{S_{t}} can be interpreted as a Riemann sum t=1T(StSt1)/St.\sum_{t=1}^{T}(S_{t}-S_{t-1})/S_{t}. Since 1/x1/x is decreasing on the interval (0,)(0,\infty), we have

Let {ut}\{u_{t}\} and {st}\{s_{t}\} be two non-negative sequences. Let St=i=1tsiS_{t}=\sum_{i=1}^{t}s_{i} for t1t\geq 1. Then

Let {θt}\{\theta_{t}\} and {αt}\{\alpha_{t}\} satisfy the restrictions (R2) and (R3). For any iti\leq t, we have

Proof For any iti\leq t, since the sequence {at}\{a_{t}\} is non-increasing, we have ataia_{t}\leq a_{i}. Hence,

which proves the first inequality. On the other hand, since {θt}\{\theta_{t}\} is non-decreasing, it holds

Let Θ(t,i)=j=i+1tθj\Theta_{(t,i)}=\prod_{j=i+1}^{t}\theta_{j} for i<ti<t and Θ(t,t)=1\Theta_{(t,t)}=1 by convention.

Fix a constant θ\theta^{\prime} with β2<θ<θ\beta^{2}<\theta^{\prime}<\theta. Let C1C_{1} be as given as Eq. (6) in the main paper. For any iti\leq t, we have

Proof For any iti\leq t, since θjθ\theta_{j}\geq\theta^{\prime} for jNj\geq N, and θj<θ\theta_{j}<\theta^{\prime} for j<Nj<N, we have

We take the constant C1=j=1N(θj/θ)C_{1}=\prod_{j=1}^{N}(\theta_{j}/\theta^{\prime}), where NN is the maximum of the indices for which θj<θ\theta_{j}<\theta^{\prime}. The proof is completed.

If θt=θ\theta_{t}=\theta is a constant, we have Θ(t,i)=θti\Theta_{(t,i)}=\theta^{t-i}. In this case we can take θ=θ\theta^{\prime}=\theta and C1=1C_{1}=1.

Let γ:=β2/θ\gamma:=\beta^{2}/{\theta^{\prime}}. We have the following estimate

Proof Let B(t,i)=j=i+1tβjB_{(t,i)}=\prod_{j=i+1}^{t}\beta_{j} for i<ti<t and B(t,t)=1B_{(t,t)}=1 by convention. By the iteration formula mt=βtmt1+(1βt)gt\bm{m}_{t}=\beta_{t}\bm{m}_{t-1}+(1-\beta_{t})\bm{g}_{t} and m0=0\bm{m}_{0}=\bm{0}, we have

Similarly, by vt=θtvt1+(1θt)gt2\bm{v}_{t}=\theta_{t}\bm{v}_{t-1}+(1-\theta_{t})\bm{g}_{t}^{2} and v0=ϵ\bm{v}_{0}=\bm{\epsilon}, we have

Note that {θt}\{\theta_{t}\} is non-decreasing by (R2), and B(t,i)βtiB_{(t,i)}\leq\beta^{t-i} by (R1). By Lemma 27, we have

With the notations above, the following equality holds

Combining Eq. (19) and Eq. (20), we obtain the desired Eq. (17). The proof is completed.

where C2=2(β/(1β)C1(1γ)θ1+1)2C_{2}=2\left(\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\right)^{2}.

To estimate (I), by the Schwartz inequality and the Lipschitz continuity of the gradient of ff, we have

Note that δtG\bm{\delta}_{t}\leq G. Therefore,

where C2=(β/(1β)C1(1γ)θ1+1)C_{2}^{\prime}=\left(\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\right). The last inequality holds due to βt/(1βt)β/(1β)\beta_{t}/(1-\beta_{t})\leq\beta/(1-\beta) as βtβ\beta_{t}\leq\beta. Therefore, we have

Combining Eq. (28), Eq. (34), and Eq. (35), we obtain

The term (IV) is estimated similarly as term (III). First, we have

where C2C_{2}^{\prime} is the constant defined above. We have

Combining Eq. (23), Eq. (24), Eq. (26), Eq. (27), Eq. (36), and Eq. (38), we obtain

The same as what we did for term (I) in Lemma 30, we have

Then the similar argument as Eq. (34) implies that

Proof Note that vtθtvt1\bm{v}_{t}\geq\theta_{t}\bm{v}_{t-1}, so we have vt(j=i+1tθj)vi=Θ(t,i)vi\bm{v}_{t}\geq\left(\prod_{j=i+1}^{t}\theta_{j}\right)\bm{v}_{i}=\Theta_{(t,i)}\bm{v}_{i}. By Lemma 27, this follows that vtC1(θ)tivi\bm{v}_{t}\geq C_{1}(\theta^{\prime})^{t-i}\bm{v}_{i} for all iti\leq t. On the other hand,

Since αt=χt1θtχt1θi\alpha_{t}=\chi_{t}\sqrt{1-\theta_{t}}\leq\chi_{t}\sqrt{1-\theta_{i}} for iti\leq t, it follows that

By Lemma 26, χtC0χi,it\chi_{t}\leq C_{0}\chi_{i},\forall i\leq t. Hence,

It is straightforward to acquire by induction that

By Lemma 26, it holds αtC0αi\alpha_{t}\leq C_{0}\alpha_{i} for any iti\leq t. By Lemma 27, Θ(t,i)C1(θ)ti\Theta_{(t,i)}\geq C_{1}(\theta^{\prime})^{t-i}. In addition, B(t,i)βtiB_{(t,i)}\leq\beta^{t-i}. Hence,

Combining Eq. (52) and Eq. (53), we then obtain the desired estimate Eq. (48). The proof is completed.

Proof Let W0=1W_{0}=1 and Wt=i=1Tθi1W_{t}=\prod_{i=1}^{T}\theta_{i}^{-1}. Let wt=WtWt1=(1θt)i=1tθi1=(1θt)Wtw_{t}=W_{t}-W_{t-1}=(1-\theta_{t})\prod_{i=1}^{t}\theta_{i}^{-1}=(1-\theta_{t})W_{t}. We therefore have

Note that v0=ϵ\bm{v}_{0}=\bm{\epsilon} and vt=θtvt1+(1θt)gt\bm{v}_{t}=\theta_{t}\bm{v}_{t-1}+(1-\theta_{t})\bm{g}_{t}, so it holds that W0v0=ϵW_{0}\bm{v}_{0}=\bm{\epsilon} and Wtvt=Wt1vt1+wtgt2.W_{t}\bm{v}_{t}=W_{t-1}\bm{v}_{t-1}+w_{t}\bm{g}_{t}^{2}. Then, Wtvt=W0v0+i=1twigi2=ϵ+i=1twigi2.W_{t}\bm{v}_{t}=W_{0}\bm{v}_{0}+\sum_{i=1}^{t}w_{i}\bm{g}_{i}^{2}=\bm{\epsilon}+\sum_{i=1}^{t}w_{i}\bm{g}_{i}^{2}. It follows that

Writing the norm in terms of coordinates, we obtain

The last inequality is due to the following trivial inequality:

for any non-negative parameters aa and bb. It then follows that

Proof For simplicity of notations, let ωt:=1θtgtvt2\omega_{t}:=\left\|\frac{\sqrt{1-\theta_{t}}\bm{g}_{t}}{\sqrt{\bm{v}_{t}}}\right\|^{2}, and Ωt:=i=1tωi\Omega_{t}:=\sum_{i=1}^{t}\omega_{i}. Note that χtC0at\chi_{t}\leq C_{0}a_{t}. Hence,

Note that atχta_{t}\leq\chi_{t}. Combining Eq. (62), Eq. (63), and Eq. (64), we have

Note that log(1+x)x\log(1+x)\leq x for all x>1x>-1. It follows that

Note that χt=αt/1θt\chi_{t}=\alpha_{t}/\sqrt{1-\theta_{t}}. By Eq. (62) and Eq. (64), we have

(Lemma 6 in Section 3) Let τ\tau be randomly chosen from {1,2,,T}\{1,2,\ldots,T\} with equal probabilities pτ=1/Tp_{\tau}=1/T. We have the following estimate

Proof For any two random variables XX and YY, by the Hölder’s inequality, we have

Let X=(f(xt)2v^t1)1/2X=\left(\frac{\left\|\bm{\nabla}f(\bm{x}_{t})\right\|^{2}}{\sqrt{\left\|\bm{\hat{v}}_{t}\right\|_{1}}}\right)^{1/2}, Y=v^t11/4Y=\left\|\bm{\hat{v}}_{t}\right\|_{1}^{1/4}, and let p=2p=2, q=2q=2. By Eq. (68), we have

By Eq. (69), Eq. (70), and Eq. (71), we obtain

By Lemma 26, αTC0αt\alpha_{T}\leq C_{0}\alpha_{t} for any tTt\leq T, so αt1C0αT1\alpha_{t}^{-1}\leq C_{0}\alpha_{T}^{-1}. Then, we obtain

Appendix B Proof of Theorem 2

Let {xt}\{\bm{x}_{t}\} be a sequence generated by Generic Adam for initial values x1\bm{x}_{1}, m0=0\bm{m}_{0}=\bm{0}, and v0=ϵ\bm{v}_{0}=\bm{\epsilon}. Assume that ff and stochastic gradients gt\bm{g}_{t} satisfy assumptions (A1)-(A4). Let τ\tau be randomly chosen from {1,2,,T}\{1,2,\ldots,T\} with equal probabilities pτ=1/Tp_{\tau}=1/T. We have the following estimate

where the constants CC and CC^{\prime} are given by

Proof By the LL-Lipschitz continuity of the gradient of ff and the descent lemma, we have

where C3C_{3} is the constant given in Lemma 33. By applying the estimates in Lemma 34 and Lemma 36 for the second and third terms in the right hand side of Eq. (78), and appropriately rearranging the terms, we obtain

Appendix C Proof of Corollary 10

Generic Adam with the above family of parameters converges as long as 0<r2s<20<r\leq 2s<2, and its non-asymptotic convergence rate is given by

Proof It is not hard to verify that the following equalities hold:

In this case, TαT=ηT1sT\alpha_{T}=\eta T^{1-s}. Therefore, by Theorem 2 the non-asymptotic convergence rate is given by

To guarantee convergence, then 0<r2s<20<r\leq 2s<2.

Appendix D Proof of Theorem 13

For any T>0T>0, if we take αt=αT, βt=β,θt=1θT\alpha_{t}=\frac{\alpha}{\sqrt{T}},\ \beta_{t}=\beta,\theta_{t}=1-\frac{\theta}{T}, which satisfies γ=β1θT<1\gamma=\frac{\beta}{1-\frac{\theta}{T}}<1 and θt14\theta_{t}\geq\frac{1}{4}, then it holds that

Proof Based on Theorem 2, by plugging αt,βt\alpha_{t},\beta_{t} and θt\theta_{t} in the conclusion of Theorem 2, we can get the desired result.

Appendix E Key Lemma to prove Theorem 17

In this section, we provide the additional lemmas for the proofs of Theorem 17.

With the definitions in Algorithm 2, for any t=1,2,,Tt=1,2,\cdots,T we have the following estimation:

Proof With the similar proof in Lemma 34, it holds that

Thus, by taking expectation on both side, we can obtain

(Lemma 19 in Section 4.2) By the definition of MtM_{t}, it holds that

Proof Using Lemma 33, by plugging C0=C1=1C_{0}=C_{1}=1, χt=αθ\chi_{t}=\frac{\alpha}{\sqrt{\theta}} and θt14\theta_{t}\geq\frac{1}{4}, it holds that

(Lemma 20 in Section 4) The following estimation always holds:

With inequalities (83) and (84), we can obtain

Proof We discuss the solution of xx in 4 different situations. First, when BxABx\geq A and DxCD\sqrt{x}\geq C, we have

Secondly, when BxABx\geq A and DxCD\sqrt{x}\leq C, we have

Thirdly, when BxABx\leq A and DxCD\sqrt{x}\geq C, it holds that

Last, when BxABx\leq A and DxCD\sqrt{x}\leq C, it holds that

Therefore, combining four different conditions, we have x(4BD)2+4BC+(4AD)2/3+4ACx\leq(4BD)^{2}+4BC+(4AD)^{2/3}+\sqrt{4AC}.

Appendix F Proof of Theorem 17

For any T>0T>0, if we take αt=αT\alpha_{t}=\frac{\alpha}{\sqrt{T}}, βt=β\beta_{t}=\beta , θt=1θT\theta_{t}=1-\frac{\theta}{T}, which satisfies γ=βtθt<1\gamma=\frac{\beta_{t}}{\theta_{t}}<1 and θt14\theta_{t}\geq\frac{1}{4}, then there exists t{1,2,,T}t\in\{1,2,\cdots,T\} such that

In addition, by taking ϵ=1sd\epsilon=\frac{1}{sd}, it holds that

Proof First, according to the gradient Lipschitz condition of ff, it holds

Using Lemma 38, 39 and 40 with rearranging the corresponding terms, we have

Before using Lemma 41, we list the order of 4 terms in Lemma 41 as follows:

References