Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

Congliang Chen, Li Shen, Fangyu Zou, Wei Liu

Introduction

Large-scale non-convex stochastic optimization (Bottou et al., 2018), covering a slew of applications in statistics and machine learning (Jain et al., 2017; Bottou et al., 2018) such as learning a latent variable from massive data whose probability density distribution is unknown, takes the following generic formulation:

Alternatively, a compromise to handle this difficulty is to use an unbiased stochastic estimate of $\bm{\nabla}\!{f}(\bm{x})$, denoted as $\bm{g}(\bm{x},\xi)$, which leads to the stochastic gradient descent (SGD) algorithm (Robbins and Monro, 1985). Its coordinate-wise version is defined as follows:

$\bm{x}_{t+1,k}=\bm{x}_{t,k}-\eta_{t,k}\,\bm{g}_{k}(\bm{x}_{t},\xi_{t})\qquad(2)$

for $k=1,2,\ldots,d$, where $\eta_{t,k}\geq 0$ is the learning rate of the $k$-th component of the stochastic gradient $\bm{g}(\bm{x}_{t},\xi_{t})$ at the $t$-th iteration. Under some mild assumptions (e.g., the optimal solution exists), a sufficient condition (Robbins and Monro, 1985) to ensure the global convergence of vanilla SGD in Eq. (2) is to require $\eta_{t}$ to meet the following diminishing condition:

$\sum_{t=1}^{\infty}\|\eta_{t}\|=\infty\quad\text{and}\quad\sum_{t=1}^{\infty}\|\eta_{t}\|^{2}<\infty.\qquad(3)$

Although the vanilla SGD algorithm with learning rate $\eta_{t}$ satisfying condition (3) does converge, its empirical performance can still stagnate, since it is difficult to tune an effective learning rate $\eta_{t}$ via condition (3).
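As a rough illustration of the diminishing condition, the following sketch runs coordinate-wise SGD with $\eta_{t}=1/t$, whose sum diverges while the sum of its squares converges. The toy objective $f(\bm{x})=\frac{1}{2}\|\bm{x}\|^{2}$, the Gaussian noise model, and the function name `sgd_diminishing` are assumptions made for illustration only.

```python
import random

def sgd_diminishing(x0, num_steps, noise_std=1.0, seed=0):
    """Coordinate-wise SGD on the toy objective f(x) = 0.5 * ||x||^2.

    The stochastic gradient is the true gradient x plus zero-mean Gaussian
    noise, so it is an unbiased estimate. The step size eta_t = 1/t has a
    divergent sum and a convergent sum of squares.
    """
    rng = random.Random(seed)
    x = list(x0)
    for t in range(1, num_steps + 1):
        eta_t = 1.0 / t
        for k in range(len(x)):
            g_k = x[k] + rng.gauss(0.0, noise_std)  # unbiased estimate of df/dx_k
            x[k] -= eta_t * g_k                     # coordinate-wise SGD update
    return x

x_final = sgd_diminishing([5.0, -3.0], num_steps=20000)
sq_norm = sum(v * v for v in x_final)  # squared distance to the minimizer 0
```

On this toy problem the iterate ends up close to the minimizer, but the step size is fixed in advance and ignores the observed gradients, which is the tuning difficulty mentioned above.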

To further improve the empirical performance of SGD, a large variety of adaptive SGD algorithms, including AdaGrad (Duchi et al., 2011), RMSProp (Hinton et al., 2012), Adam (Kingma and Ba, 2014), Nadam (Dozat, 2016), AdaBound (Luo et al., 2019), etc., have been proposed to automatically tune the learning rate $\eta_{t}$ by using second-order moments of historical stochastic gradients $\{\bm{g}_{t}\}$. Let $\bm{v}_{t,k}$ and $\bm{m}_{t,k}$ be the exponential moving averages of the historical second-order moments $(\bm{g}^{2}_{1,k},\bm{g}^{2}_{2,k},\cdots,\bm{g}^{2}_{t,k})$ and stochastic gradient estimates $(\bm{g}_{1,k},\bm{g}_{2,k},\cdots,\bm{g}_{t,k})$, respectively. More specifically, two groups of hyperparameters ($\beta_{t}$, $\theta_{t}$) are involved in the computation of $\bm{m}_{t,k}=\beta_{t}\bm{m}_{t-1,k}+(1-\beta_{t})\bm{g}_{t,k}$ and $\bm{v}_{t,k}=\theta_{t}\bm{v}_{t-1,k}+(1-\theta_{t})\bm{g}^{2}_{t,k}$. Then, the generic iteration scheme of these adaptive SGD algorithms (Reddi et al., 2018; Chen et al., 2018a) is summarized as

$\bm{x}_{t+1,k}=\bm{x}_{t,k}-\eta_{t,k}\bm{m}_{t,k},\quad\text{with}\quad\eta_{t,k}=\alpha_{t}/\sqrt{\bm{v}_{t,k}},\qquad(4)$

for $k=1,2,\ldots,d$, where $\alpha_{t}>0$ is called the base learning rate and is independent of the stochastic gradient estimates $(\bm{g}_{1,k},\bm{g}_{2,k},\cdots,\bm{g}_{t,k})$ for all $t\geq 1$. Although Adam works well for solving large-scale convex and non-convex optimization problems such as training deep neural networks, it has been shown to diverge in some scenarios via counterexamples (Reddi et al., 2018). Thus, without further assumptions or corrections, Adam should not be used directly. Recently, developing sufficient conditions to guarantee the global convergence of Adam-type algorithms has attracted much attention from both the machine learning and optimization communities. The existing successful attempts can be divided into four categories: decreasing a learning rate, adopting a big batch size, incorporating a temporal decorrelation, and seeking an analogous surrogate. However, some of them are either hard to check or impractical. In this work, we first introduce an alternative easy-to-check sufficient condition to guarantee the global convergence of the original Adam.
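The generic adaptive iteration described above (moving averages $\bm{m}_{t,k}$ and $\bm{v}_{t,k}$, followed by a step of size $\alpha_{t}/\sqrt{\bm{v}_{t,k}}$ along $\bm{m}_{t,k}$) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `generic_adam`, the callable schedules, the toy quadratic objective, and the chosen schedule values are all assumptions.

```python
import math

def generic_adam(grad_fn, x0, num_steps, alpha, beta, theta, eps=1e-8):
    """Sketch of the generic Adam-type iteration.

    alpha, beta, theta are callables t -> value giving alpha_t, beta_t and
    theta_t. grad_fn(x, t) returns a (stochastic) gradient as a list.
    """
    d = len(x0)
    x = list(x0)
    m = [0.0] * d   # m_0 = 0
    v = [eps] * d   # v_0 = eps, a small positive initialization
    for t in range(1, num_steps + 1):
        g = grad_fn(x, t)
        a_t, b_t, th_t = alpha(t), beta(t), theta(t)
        for k in range(d):
            m[k] = b_t * m[k] + (1.0 - b_t) * g[k]         # first-moment average
            v[k] = th_t * v[k] + (1.0 - th_t) * g[k] ** 2  # second-moment average
            eta_tk = a_t / math.sqrt(v[k])                 # adaptive learning rate
            x[k] -= eta_tk * m[k]
    return x

# Toy usage (assumed): exact gradient of f(x) = 0.5 * ||x||^2 with the
# schedules alpha_t = 0.1 / sqrt(t), beta_t = 0.9, theta_t = 1 - 0.5 / t.
x0 = [1.0, -2.0]
x_final = generic_adam(
    grad_fn=lambda x, t: list(x),
    x0=x0,
    num_steps=2000,
    alpha=lambda t: 0.1 / math.sqrt(t),
    beta=lambda t: 0.9,
    theta=lambda t: 1.0 - 0.5 / t,
)
```

Passing the three parameter sequences as callables keeps the update rule fixed while allowing different schedules for $(\alpha_{t},\beta_{t},\theta_{t})$ to be compared.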

Meanwhile, in practice, stochastic Adam, where a single sample is used to estimate the gradient, converges slowly to the optimal point. Practitioners usually use mini-batch Adam instead to obtain faster convergence. For SGD, how the sample size affects convergence has been well studied (Li et al., 2014). However, few works analyze mini-batch adaptive gradient methods, especially Adam, since the mini-batch size largely affects the adaptive learning rate $\eta_{t,k}$ in Eq. (4), which makes the analysis difficult. In this work, we give the first complexity analysis for mini-batch Adam, which shows that mini-batch Adam can also be theoretically accelerated by using a larger mini-batch size.

On the other hand, as data sizes grow in machine learning problems, it becomes hard to collect, store, and process data on a single machine, so several machines are involved in the optimization process. Hence, distributed optimization methods have been proposed, among which distributed Adam is popular. Unlike mini-batch Adam, which runs on a single machine, distributed Adam involves several machines. In the distributed setting, machines are connected via a network graph. More specifically, two kinds of structures are used in distributed Adam: the parameter-server structure and the decentralized structure. In the parameter-server structure, one special machine is called the parameter server and the rest are called workers. The parameter server connects to all workers, but workers do not connect to each other. Therefore, workers can share information with the parameter server in each communication round but cannot share information with the other workers. In the decentralized structure, by contrast, no server is involved: a pre-defined graph connects all machines, and a machine can only share information with its direct neighbors in each communication round. Still, few works answer how the local batch size and the number of machines affect the convergence of distributed Adam. In this work, because the analysis of distributed Adam under the parameter-server model is similar to that of mini-batch Adam, we answer this question and show that distributed Adam under a parameter-server model can also achieve the linear speedup property of distributed SGD (Yu et al., 2019).
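To make the link between the parameter-server structure and mini-batch training concrete, the sketch below simulates one communication round in a single process. It assumes a toy scalar objective $f(x)=\frac{1}{2}x^{2}$ with Gaussian gradient noise and equal local batch sizes; the function name and all numeric values are hypothetical, and no real network communication is modeled.

```python
import random

def parameter_server_round(x, num_workers, local_batch, rng):
    """Simulate one round: workers send local averages, the server averages them."""
    worker_grads, all_samples = [], []
    for _ in range(num_workers):
        # Each worker draws its own samples and computes a local average gradient.
        samples = [x + rng.gauss(0.0, 1.0) for _ in range(local_batch)]
        worker_grads.append(sum(samples) / local_batch)
        all_samples.extend(samples)
    server_grad = sum(worker_grads) / num_workers      # aggregation on the server
    pooled_grad = sum(all_samples) / len(all_samples)  # one large mini-batch
    return server_grad, pooled_grad

server_grad, pooled_grad = parameter_server_round(
    2.0, num_workers=8, local_batch=16, rng=random.Random(0)
)
```

With equal local batch sizes, the server-side average coincides (up to rounding) with a single mini-batch gradient over all samples drawn in the round, which is one way to see why the two analyses are similar.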

In summary, the contributions of this work are five-fold:

We introduce an easy-to-check sufficient condition to ensure the global convergence (i.e., the averaged expected gradient norm converges to 0) of generic Adam in the common smooth non-convex stochastic setting with mild assumptions. Moreover, this sufficient condition is distinct from the existing conditions and is easier to verify.

We provide a new explanation on the divergences of original Adam and RMSProp, which are possibly due to an incorrect parameter setting of the combinations of historical second-order moments.

We find that the sufficient condition extends the restrictions of RMSProp (Mukkamala and Hein, 2017) and covers many convergent variants of Adam, e.g., AdamNC, AdaGrad with momentum, etc. Thus, their convergences in the non-convex stochastic setting naturally hold.

We theoretically show that mini-batch Adam can be further accelerated by adopting a larger mini-batch size, and that distributed Adam can achieve a linear speedup in the parameter-server distributed system, under commonly used parameter settings that satisfy the sufficient condition.

We conduct experiments to validate the sufficient condition for the convergences of Adam and mini-batch Adam. The experimental results match our theoretical results.

The paper is organized as follows. In Section 2, we first give the formulation of generic Adam and then discuss several works related to Adam, including existing sufficient convergence conditions, analyses of mini-batch methods, and distributed stochastic gradient methods. In Section 3, we derive the sufficient condition for the convergence of Adam and provide several insights into the divergence of vanilla Adam. In Section 4, we give the complexity analysis of practical Adam with a commonly used parameter setting satisfying the sufficient condition, covering mini-batch Adam and distributed Adam. Finally, in Section 5, we conduct experiments under both theoretical and practical settings to verify the established theory. By practical Adam, we mean Adam, mini-batch Adam, and distributed Adam, which have been commonly used for training deep neural networks without theoretical guarantees and for which we give a thorough analysis.

Related work

It is not hard to check that Generic Adam covers RMSProp by setting $\beta_{t}=0$ directly. Moreover, it covers Adam with a bias correction (Kingma and Ba, 2014) as follows:

The vanilla Adam with the bias correction (Kingma and Ba, 2014) takes constant parameters $\beta_{t}=\beta$ and $\theta_{t}=\theta$ . The iteration scheme is written as $\bm{x}_{t+1}=\bm{x}_{t}-\widehat{\alpha}_{t}\frac{\widehat{\bm{m}}_{t}}{\sqrt{\widehat{\bm{v}}_{t}}}$ , with $\widehat{\bm{m}}_{t}=\frac{\bm{m}_{t}}{1-\beta^{t}}$ and $\widehat{\bm{v}}_{t}=\frac{\bm{v}_{t}}{1-\theta^{t}}$ . Let $\alpha_{t}=\widehat{\alpha}_{t}\frac{\sqrt{1-\theta^{t}}}{1-\beta^{t}}$ . Then, the above can be rewritten as $\bm{x}_{t+1}=\bm{x}_{t}-{\alpha_{t}\bm{m}_{t}}/\sqrt{\bm{v}_{t}}$ . Thus, it is equivalent to taking constant $\beta_{t}$ , constant $\theta_{t}$ , and new base learning rate $\alpha_{t}$ in Generic Adam.
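The rescaling argument above can be checked numerically. The sketch below compares scalar Adam with bias correction against the same iteration without bias correction but with $\alpha_{t}=\widehat{\alpha}_{t}\sqrt{1-\theta^{t}}/(1-\beta^{t})$. It assumes $\bm{v}_{0}=0$ (so the identity is exact), a constant $\widehat{\alpha}_{t}$, and an arbitrary fixed gradient sequence; the function names are illustrative.

```python
import math

def adam_bias_corrected(grads, x0, alpha_hat, beta, theta):
    """Scalar Adam with bias correction, initialized with m_0 = v_0 = 0."""
    x, m, v = x0, 0.0, 0.0
    xs = []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g
        v = theta * v + (1 - theta) * g * g
        m_hat = m / (1 - beta ** t)
        v_hat = v / (1 - theta ** t)
        x -= alpha_hat * m_hat / math.sqrt(v_hat)
        xs.append(x)
    return xs

def adam_rescaled(grads, x0, alpha_hat, beta, theta):
    """Same iteration without bias correction, using the rescaled alpha_t."""
    x, m, v = x0, 0.0, 0.0
    xs = []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g
        v = theta * v + (1 - theta) * g * g
        alpha_t = alpha_hat * math.sqrt(1 - theta ** t) / (1 - beta ** t)
        x -= alpha_t * m / math.sqrt(v)
        xs.append(x)
    return xs

grads = [0.5, -1.2, 0.3, 2.0, -0.7, 0.1]  # arbitrary nonzero gradients
a = adam_bias_corrected(grads, 1.0, 0.01, 0.9, 0.999)
b = adam_rescaled(grads, 1.0, 0.01, 0.9, 0.999)
max_gap = max(abs(p - q) for p, q in zip(a, b))  # differs only by rounding
```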

Convergence Conditions for Adam

First, because Reddi et al. (2018) gave counterexamples on the divergence of the original Adam, several sufficient conditions have been proposed to guarantee the global convergence of Adam. They can be summarized into the following four categories:

(C1) Decreasing a learning rate. Reddi et al. (2018) have argued that the divergence of Adam and RMSProp is largely controlled by the difference between the two adjacent learning rates, i.e.,

$\Gamma_{t}=\frac{\sqrt{\bm{v}_{t}}}{\alpha_{t}}-\frac{\sqrt{\bm{v}_{t-1}}}{\alpha_{t-1}}.\qquad(5)$

Once positive definiteness of $\Gamma_{t}$ is violated, Adam and RMSProp may suffer from divergence (Reddi et al., 2018). Based on this observation, two variants of Adam called AMSGrad and AdamNC have been proposed with convergence guarantees in both the convex (Reddi et al., 2018) and non-convex (Chen et al., 2018a) stochastic settings by requiring $\Gamma_{t}\succ 0$ . In addition, Padam (Zhou et al., 2018a) extended from AMSGrad has been proposed to contract the generalization gap in training deep neural networks, whose convergence has been ensured by requiring $\Gamma_{t}\succ 0$ . As a relaxation of $\Gamma_{t}\succ 0$ , Barakat and Bianchi (2020) showed that when $\alpha_{t}/\sqrt{v_{t}}\leq\alpha_{t-1}/(c\sqrt{v_{t-1}})$ holds for all $t$ and some positive $c$ , the algorithm Adam can converge. In the strongly convex stochastic setting, by using the long-term memory technique developed in (Reddi et al., 2018), Huang et al. (2018) have proposed NosAdam by attaching more weights on historical second-order moments to ensure its convergence. Prior to that, the convergence rate of RMSProp (Mukkamala and Hein, 2017) has already been established in the convex stochastic setting by employing similar parameters to those of AdamNC (Reddi et al., 2018).

(C2) Adopting a big batch size. Basu et al. (2018), for the first time, showed that deterministic Adam and RMSProp with original iteration schemes are convergent by using a full-batch gradient. On the other hand, both Adam and RMSProp can be reshaped as specific signSGD-type algorithms (Balles and Hennig, 2018; Bernstein et al., 2018) whose $\mathcal{O}(1/\sqrt{T})$ convergence rates have been provided in the non-convex stochastic setting by setting batch size as large as the number of maximum iterations (Bernstein et al., 2018). Recently, Zaheer et al. (2018) have established $\mathcal{O}(1/\sqrt{T})$ convergence rate of original Adam directly in the non-convex stochastic setting by requiring the batch size to be the same order as the number of maximum iterations. We comment that this type of requirement is impractical when Adam and RMSProp are applied to tackle large-scale problems (1), since these approaches cost a huge number of computations to estimate big-batch stochastic gradients in each iteration.

(C3) Incorporating a temporal decorrelation. By exploring the structure of the convex counterexample in (Reddi et al., 2018), Zhou et al. (2018b) have pointed out that the divergence of RMSProp is fundamentally caused by the imbalanced learning rate rather than the absence of $\Gamma_{t}\succ 0$ . Based on this viewpoint, Zhou et al. (2018b) have proposed AdaShift by incorporating a temporal decorrelation technique to eliminate the inappropriate correlation between $\bm{v}_{t,k}$ and the current second-order moment $\bm{g}_{t,k}^{2}$ , in which the adaptive learning rate $\eta_{t,k}$ is required to be independent of $\bm{g}^{2}_{t,k}$ . However, the convergence of AdaShift in (Zhou et al., 2018b) was merely restricted to RMSProp for solving the convex counterexample in (Reddi et al., 2018).

(C4) Seeking an analogous surrogate. Due to the divergences of Adam and RMSProp (Reddi et al., 2018), Zou et al. (2018) proposed a class of new surrogates called AdaUSM to approximate Adam and RMSProp by integrating weighted AdaGrad with a unified heavy ball and Nesterov accelerated gradient momentums. Its $\mathcal{O}(\log{(T)}/\sqrt{T})$ convergence rate has also been provided in the non-convex stochastic setting by requiring a non-decreasing weighted sequence. Besides, many other adaptive stochastic algorithms without combining momentums, such as AdaGrad (Ward et al., 2018; Li and Orabona, 2019) and stagewise AdaGrad (Chen et al., 2018b), have been guaranteed to be convergent and work well in the non-convex stochastic setting.

In contrast with the above four types of modifications and restrictions, we introduce an alternative easy-to-check sufficient condition (abbreviated as (SC)) to guarantee the global convergence of the original Adam. The proposed (SC) merely depends on the parameters used in estimating $\bm{v}_{t,k}$ and on the base learning rate $\alpha_{t}$. (SC) neither requires the positive definiteness of $\Gamma_{t}$ like (C1) nor needs the batch size to be of the same order as the maximum number of iterations like (C2), in both the convex and non-convex stochastic settings. Thus, it is easier to verify and more practical compared with (C1)-(C3). On the other hand, (SC) partially overlaps with (C1), since the proposed (SC) covers AdamNC (Reddi et al., 2018), AdaGrad with exponential moving average (AdaEMA) momentum (Chen et al., 2018a), and RMSProp (Mukkamala and Hein, 2017) as instances whose convergence was originally established by requiring the positive definiteness of $\Gamma_{t}$. Based on (SC), however, we can directly derive their global convergence in the non-convex stochastic setting as byproducts, without checking the positive definiteness of $\Gamma_{t}$ step by step. Besides, (SC) can serve as an alternative explanation for the divergence of the original Adam and RMSProp, which is possibly due to incorrect parameter settings for accumulating the historical second-order moments rather than the imbalanced learning rate caused by the inappropriate correlation between $\bm{v}_{t,k}$ and $\bm{g}^{2}_{t,k}$ as in (C3). In addition, AdamNC and AdaEMA are convergent under (SC), but violate (C3) in each iteration. Meanwhile, many works improve the upper bounds for the above algorithms, e.g., Défossez et al. (2020) improved the constants related to $\beta$ by introducing a novel averaging scheme in the analysis.

Mini-batch Stochastic Gradient Methods

In practice, mini-batch stochastic gradient methods are usually preferred over single-sample stochastic gradient methods or full gradient methods for faster convergence. For mini-batch SGD algorithms, Li et al. (2014) have shown that mini-batch SGD improves the $\mathcal{O}(\frac{1}{\sqrt{T}})$ convergence rate of SGD to $\mathcal{O}(\frac{1}{\sqrt{sT}})$, where $s$ is the mini-batch size. However, as it is much harder to show the convergence of adaptive gradient methods, few works analyze how the sample size affects the convergence of adaptive gradient algorithms. Li and Orabona (2019) gave an analysis of AdaGrad and showed that the convergence rate is linear in the sample size. Zaheer et al. (2018) gave an analysis of Adam, showing that a large batch size can help convergence, but the batch size should increase with the number of iterations, which may not be practical. In this work, we theoretically show that mini-batch Adam can be accelerated by adopting a larger mini-batch size, in the same order as mini-batch SGD (Li et al., 2014).

Distributed Stochastic Gradient Methods

Distributed stochastic gradient descent was first introduced by Agarwal and Duchi (2011) in the parameter-server setting. Further, in the decentralized setting, Lian et al. (2017) gave an analysis of stochastic gradient descent. These analyses show that the convergence speed is linear in the number of workers in the parameter-server setting, or linear in some constant related to the graph structure in the decentralized setting. For adaptive gradient methods, in the parameter-server setting, Reddi et al. (2020) gave algorithms for the federated scenario called FedAdam, FedAdagrad, and FedYogi. Moreover, they showed that the convergence speed is linear in the number of workers. However, instead of dividing by $\sqrt{\bm{v}_{t}}$, they divide the gradient by $\sqrt{\bm{v}_{t}+\bm{\epsilon}}$. Meanwhile, in their assumptions, $\epsilon$ in the algorithm should be of order $O(\frac{G}{L})$, where $G$ is the upper bound of the gradient norm and $L$ is the Lipschitz constant of the objective function. In practice, however, $\epsilon$ is always set to a small value, much smaller than $G/L$. On the other hand, a large $\epsilon$ may dominate the adaptive term in their algorithms. Hence, their methods may degrade to stochastic gradient descent. Carnevale et al. (2020) showed that Adam with the gradient tracking method can be linearly accelerated with an increasing number of nodes in the decentralized and strongly convex setting. Still, it is unclear whether this linear speedup holds in the non-convex setting when Adam is used. Moreover, Chen et al. (2020) gave an analysis of AdaGrad and showed that the convergence speed is linear in the number of workers. Meanwhile, Nazari et al. (2019) gave an analysis of AdaGrad in the decentralized setting. Xie et al. (2019) also gave a variant of the AdaGrad algorithm called AdaAlter in the centralized setting and showed that convergence speeds up linearly with the number of workers. Recently, Chen et al. (2021, 2022) extended Adam to distributed quantized Adam with the error compensation technique (Stich et al., 2018). However, the linear speedup property does not hold in (Chen et al., 2021, 2022). To the best of our knowledge, whether distributed Adam can achieve a linear speedup is still open. This paper theoretically demonstrates that distributed Adam in the parameter-server model can achieve a linear speedup with respect to the number of workers.

Novel Sufficient Condition for Convergence of Adam

In this section, we characterize the upper-bound of gradient residual of problem (1) as a function of parameters $(\theta_{t},\alpha_{t})$ . Then the convergence rate of Generic Adam is derived directly by specifying appropriate parameters $(\theta_{t},\alpha_{t})$ . Below, we state the necessary assumptions that are commonly used for analyzing the convergence of a stochastic algorithm for non-convex problems:

In addition, we also suppose that the parameters $\{\beta_{t}\}$ , $\{\theta_{t}\}$ , and $\{\alpha_{t}\}$ satisfy the following restrictions:

(R1) The parameters $\{\beta_{t}\}$ satisfy $0\leq\beta_{t}\leq\beta<1$ for all $t$ for some constant $\beta$;

(R2) The parameters $\{\theta_{t}\}$ satisfy $0<\theta_{t}<1$ and $\theta_{t}$ is non-decreasing in $t$ with $\theta:=\lim\limits_{t\to\infty}\theta_{t}>\beta^{2}$;

(R3) The parameters $\{\alpha_{t}\}$ satisfy that $\chi_{t}:=\frac{\alpha_{t}}{\sqrt{1-\theta_{t}}}$ is “almost” non-increasing in $t$, by which we mean that there exist a non-increasing sequence $\{a_{t}\}$ and a positive constant $C_{0}$ independent of $t$ such that $a_{t}\leq\chi_{t}\leq C_{0}a_{t}$.

The restriction (R3) says that $\chi_{t}$ is the product of some non-increasing sequence $\{a_{t}\}$ and some bounded sequence. This is a slight generalization of $\chi_{t}$ itself being non-increasing: if $\chi_{t}$ itself is non-increasing, we can take $a_{t}=\chi_{t}$ and $C_{0}=1$. For most of the well-known Adam-type methods, $\chi_{t}$ is indeed non-increasing. For instance, for AdaGrad with EMA momentum we have $\alpha_{t}=\eta/\sqrt{t}$ and $\theta_{t}=1-1/t$, so $\chi_{t}=\eta$ is constant; for Adam with constant $\theta_{t}=\theta$ and non-increasing $\alpha_{t}$ (say $\alpha_{t}=\eta/\sqrt{t}$ or $\alpha_{t}=\eta$), $\chi_{t}=\alpha_{t}/\sqrt{1-\theta}$ is non-increasing. The motivation for allowing $\chi_{t}$ to be only “almost” non-increasing, instead of requiring it to be non-increasing, is that this allows us to deal with the bias correction steps in Adam (Kingma and Ba, 2014).
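The three cases mentioned in this remark can be evaluated numerically. The sketch below computes $\chi_{t}=\alpha_{t}/\sqrt{1-\theta_{t}}$ for AdaGrad with EMA momentum, for constant $\theta$ with $\alpha_{t}=\eta/\sqrt{t}$, and for the bias-corrected base learning rate $\alpha_{t}=\widehat{\alpha}\sqrt{1-\theta^{t}}/(1-\beta^{t})$ with constant $\widehat{\alpha}$. The numeric values of $\eta$, $\theta$, $\beta$, $\widehat{\alpha}$, and $T$ are assumptions chosen for illustration.

```python
import math

def chi(alpha_t, theta_t):
    """chi_t = alpha_t / sqrt(1 - theta_t)."""
    return alpha_t / math.sqrt(1.0 - theta_t)

eta, T = 0.1, 1000

# AdaGrad with EMA momentum: alpha_t = eta / sqrt(t), theta_t = 1 - 1/t.
chi_adaema = [chi(eta / math.sqrt(t), 1.0 - 1.0 / t) for t in range(1, T + 1)]

# Constant theta with a decaying base learning rate alpha_t = eta / sqrt(t).
theta = 0.999
chi_const = [chi(eta / math.sqrt(t), theta) for t in range(1, T + 1)]

# Bias-corrected Adam with constant alpha_hat: chi_t is not monotone, but it
# stays between a constant sequence a_t = a and C0 * a.
beta, alpha_hat = 0.9, 0.001
chi_bias = [
    chi(alpha_hat * math.sqrt(1 - theta ** t) / (1 - beta ** t), theta)
    for t in range(1, T + 1)
]
a = min(chi_bias)        # constant, hence non-increasing, lower envelope
C0 = max(chi_bias) / a   # finite constant with a <= chi_t <= C0 * a
```

In the third case $\chi_{t}$ first decreases and later increases, so it is not non-increasing, yet it is sandwiched between a constant sequence and a constant multiple of it over the horizon considered, which is the situation that the "almost" non-increasing requirement is designed to cover.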

We fix a positive constant $\theta^{\prime}>0$ such that $\beta^{2}<\theta^{\prime}<\theta$ (in the special case that $\theta_{t}=\theta$ is constant, we can directly set $\theta^{\prime}=\theta$). Let $\gamma:={\beta^{2}}/{\theta^{\prime}}<1$ and

where $N$ is the maximum of the indices $j$ with $\theta_{j}<\theta^{\prime}$. The finiteness of $N$ is guaranteed by the fact that $\lim_{t\to\infty}\theta_{t}=\theta>\theta^{\prime}$. When there are no such indices, i.e., $\theta_{1}\geq\theta^{\prime}$, we take $C_{1}=1$ by convention. In general, $C_{1}\leq 1$. Our main result on estimating the gradient residual is stated as follows:

Let $\{\bm{x}_{t}\}$ be a sequence generated by Generic Adam for initial values $\bm{x}_{1}$ , $\bm{m}_{0}=\bm{0}$ , and $\bm{v}_{0}=\bm{\epsilon}$ . Assume that $f$ and stochastic gradients $\bm{g}_{t}$ satisfy assumptions (A1)-(A4). Let $\tau$ be randomly chosen from $\{1,2,\ldots,T\}$ with equal probabilities $p_{\tau}=1/T$ . Then, we have

where $C^{\prime}\!=\!{2C_{0}^{2}C_{3}d\sqrt{G^{2}\!+\!\epsilon d}}{\big{/}}{[(1\!-\!\beta)\theta_{1}]}$ and

in which $C_{4}$ and $C_{3}$ are defined as $C_{4}=f(x_{1})-f^{*}$ and $C_{3}=\frac{C_{0}}{\sqrt{C_{1}}(1-\sqrt{\gamma})}\big{[}\frac{C_{0}^{2}\chi_{1}L}{C_{1}(1-\sqrt{\gamma})^{2}}+2\big{(}\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\big{)}^{2}G\big{]}$ , respectively.

Let $\{\bm{x}_{t}\}$ be the sequence generated by Algorithm 1. For $T\geq 1$ , it holds that

(Lemma 33 in Appendix) Let $\{\bm{x}_{t}\}$ be the sequence generated by Algorithm 1. For $T\geq 1$ , it holds that

for some constants $\zeta_{0}$ and $\zeta_{1}$ .

(Lemma 36 in Appendix) Let $\tau$ be an integer that is randomly chosen from $\{1,2,\cdots,T\}$ with equal probabilities. We have the following estimate

Thus, using the above three lemmas, we can prove Theorem 2.

Discussion of Theorem 2

Take $\alpha_{t}=\eta/t^{s}$ with $0\leq s<1$ . Suppose $\lim_{t\to\infty}\theta_{t}=\theta<1$ . Define $Bound(T):=\frac{C+C^{\prime}\sum_{t=1}^{T}\alpha_{t}\sqrt{1-\theta_{t}}}{T\alpha_{T}}$ . Then the $Bound(T)$ in Theorem 2 is bounded from below by constants

In particular, when $\theta_{t}=\theta<1$ , we have the following more subtle estimate of lower and upper-bounds for $Bound(T)$

(i) Corollary 7 shows that if $\lim_{t\to\infty}\theta_{t}=\theta<1$, the bound in Theorem 2 is only $\mathcal{O}(1)$, hence not guaranteeing convergence. This result is not surprising, as Adam with constant $\theta_{t}$ has already been shown to be divergent (Reddi et al., 2018). Hence, $\mathcal{O}(1)$ is the best bound we can expect. We will discuss this case in more detail in Section 3.4. (ii) Corollary 7 also indicates that in order to guarantee convergence, the parameters have to satisfy $\lim_{t\to\infty}\theta_{t}=1$. Although we do not assume this in our restrictions (R1)-(R3), it turns out to be a consequence of our analysis. Note that if $\beta<1$ in (R1) and $\lim_{t\to\infty}\theta_{t}=1$, then the restriction $\lim_{t\to\infty}\theta_{t}>\beta^{2}$ in (R2) is automatically satisfied.

We are now ready to give the Sufficient Condition for convergence of Generic Adam.

Generic Adam is convergent if the parameters $\{\alpha_{t}\}$ , $\{\beta_{t}\}$ , and $\{\theta_{t}\}$ satisfy the following four conditions:

$0\leq\beta_{t}\leq\beta<1$;

$0<\theta_{t}<1$ and $\theta_{t}$ is non-decreasing in $t$;

$\chi_{t}:=\alpha_{t}/\sqrt{1-\theta_{t}}$ is “almost” non-increasing;

$\big{(}{\sum_{t=1}^{T}\alpha_{t}\sqrt{1-\theta_{t}}}\big{)}{\big{/}}\big{(}{T\alpha_{T}}\big{)}=o(1)$ .
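As a hedged numerical illustration of the last condition, the sketch below evaluates the ratio $\big(\sum_{t=1}^{T}\alpha_{t}\sqrt{1-\theta_{t}}\big)/(T\alpha_{T})$ for two parameter choices with $\alpha_{t}=\eta/\sqrt{t}$: $\theta_{t}=1-1/t$ and constant $\theta_{t}=0.999$. The helper name `sc_ratio` and the numeric values are assumptions for illustration.

```python
import math

def sc_ratio(alpha, theta, T):
    """(sum_{t<=T} alpha_t * sqrt(1 - theta_t)) / (T * alpha_T)."""
    total = sum(alpha(t) * math.sqrt(1.0 - theta(t)) for t in range(1, T + 1))
    return total / (T * alpha(T))

eta = 0.1
alpha = lambda t: eta / math.sqrt(t)
horizons = (10**2, 10**4, 10**5)

# theta_t = 1 - 1/t: the ratio equals H_T / sqrt(T) and tends to 0.
adaema = [sc_ratio(alpha, lambda t: 1.0 - 1.0 / t, T) for T in horizons]

# Constant theta_t = 0.999: the ratio stays above sqrt(1 - theta) > 0.
const = [sc_ratio(alpha, lambda t: 0.999, T) for T in horizons]
```

For the first choice the ratio keeps shrinking as $T$ grows, whereas for constant $\theta$ it levels off at a positive value, so the $o(1)$ requirement fails in that case.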

Convergence Rate of Generic Adam

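We now provide the convergence rate of Generic Adam with specific parameters $\{(\theta_{t},\alpha_{t})\}$, i.e.,

$\alpha_{t}=\eta/t^{s}\quad\text{and}\quad\theta_{t}=\begin{cases}1-\alpha/K^{r},&t<K,\\ 1-\alpha/t^{r},&t\geq K,\end{cases}\qquad(8)$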

for positive constants $\alpha,\eta,K$ , where $K$ is taken such that $\alpha/K^{r}<1$ . Note that $\alpha$ can be taken bigger than 1. When $\alpha<1$ , we can take $K=1$ and then $\theta_{t}=1-\alpha/t^{r},t\geq 1$ . To guarantee (R3), we require $r\leq 2s$ . For such a family of parameters we have the following corollary.

Generic Adam with the above family of parameters (i.e. (8)) converges as long as $0<r\leq 2s<2$ , and its non-asymptotic convergence rate is given by

Corollary 10 recovers and extends the results of some well-known algorithms below:

AdaGrad with exponential moving average (EMA). When $\theta_{t}=1-1/t$ , $\alpha_{t}=\eta/\sqrt{t}$ , and $\beta_{t}=\beta<1$ , Generic Adam is exactly AdaGrad with EMA momentum (AdaEMA) (Chen et al., 2018a). In particular, if $\beta=0$ , this is the vanilla coordinate-wise AdaGrad. It corresponds to taking $r=1$ and $s=1/2$ in Corollary 10. Hence, AdaEMA has convergence rate $\mathcal{O}\left(\sqrt{\log(T)/\sqrt{T}}\right)$ .

AdamNC. Taking $\theta_{t}=1-1/t$ , $\alpha_{t}=\eta/\sqrt{t}$ , and $\beta_{t}=\beta\lambda^{t}$ in Generic Adam, where $\lambda<1$ is the decay factor for the momentum factors $\beta_{t}$ , we recover AdamNC (Reddi et al., 2018). Its $\mathcal{O}\left(\sqrt{\log{(T)}/\sqrt{T}}\right)$ convergence rate can be directly derived via Corollary 10.

RMSProp. Mukkamala and Hein (2017) have reached the same $\mathcal{O}\left(\sqrt{\log{(T)}/\sqrt{T}}\right)$ convergence rate for RMSProp with $\theta_{t}=1-\alpha/t$, where $0<\alpha\leq 1$, and $\alpha_{t}=\eta/\sqrt{t}$ under the convexity assumption. Since RMSProp is essentially Generic Adam with all momentum factors $\beta_{t}=0$, we recover Mukkamala and Hein’s results by taking $r=1$ and $s=1/2$ in Corollary 10. Moreover, our result generalizes to the non-convex stochastic setting, and it holds for all $\alpha>0$ rather than only $0<\alpha\leq 1$.

A summary of the above algorithms is provided in Table 1.
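The polynomial parameter family used in Corollary 10, together with the requirement $r\leq 2s$ that guarantees (R3), can be explored with the following sketch. It assumes $\alpha_{t}=\eta/t^{s}$ and $\theta_{t}=1-\alpha/t^{r}$ for $t\geq K$, with $\theta_{t}$ frozen at $1-\alpha/K^{r}$ for $t<K$; the function names and numeric values are illustrative.

```python
import math

def family_params(t, eta, alpha, r, s, K=1):
    """Return (alpha_t, theta_t) for the polynomial family.

    theta_t = 1 - alpha / t^r for t >= K and 1 - alpha / K^r for t < K,
    where K is chosen so that alpha / K^r < 1. alpha_t = eta / t^s.
    """
    tt = max(t, K)
    return eta / t ** s, 1.0 - alpha / tt ** r

def chi_sequence(eta, alpha, r, s, K, T):
    """chi_t = alpha_t / sqrt(1 - theta_t) for t = 1, ..., T."""
    out = []
    for t in range(1, T + 1):
        a_t, th_t = family_params(t, eta, alpha, r, s, K)
        out.append(a_t / math.sqrt(1.0 - th_t))
    return out

def is_non_increasing(seq, tol=1e-9):
    """Check monotonicity up to a small relative rounding tolerance."""
    return all(q <= p * (1 + tol) for p, q in zip(seq, seq[1:]))

# r = 1, s = 1/2 (r <= 2s): chi_t is proportional to t^(r/2 - s) = 1.
ok = chi_sequence(eta=0.1, alpha=0.5, r=1.0, s=0.5, K=1, T=500)
# r = 1, s = 1/4 (r > 2s): chi_t grows like t^(1/4), so (R3) fails.
bad = chi_sequence(eta=0.1, alpha=0.5, r=1.0, s=0.25, K=1, T=500)
```

The choice $r=1$, $s=1/2$ corresponds to the AdaEMA, AdamNC, and RMSProp instances listed above.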

Comparison with Reddi et al. (2018). Most of the convergent modifications of the original Adam, such as AMSGrad, AdamNC, and NosAdam, require $\Gamma_{t}\succ 0$ in Eq. (5), which is equivalent to decreasing the adaptive learning rate $\eta_{t}$ step by step. Since the term $\Gamma_{t}$ (or the adaptive learning rate $\eta_{t}$) involves the past stochastic gradients (hence is not deterministic), a modification that guarantees $\Gamma_{t}\succ 0$ either needs to change the iteration scheme of Adam (like AMSGrad) or needs to impose strong restrictions on the base learning rate $\alpha_{t}$ and the parameter $\theta_{t}$ (like AdamNC). Our sufficient condition in Corollary 9 provides an easy-to-check criterion for the convergence of Generic Adam. It does not require $\Gamma_{t}\succ 0$. Moreover, we use exactly the same iteration scheme as the original Adam without any modifications. Our work shows that the positive definiteness of $\Gamma_{t}$ may not be the essential issue behind the divergence of the original Adam. The divergence may instead be due to an incorrect setting of the moving average parameters rather than the non-positive definiteness of $\Gamma_{t}$.

The currently most popular parameter setting for RMSProp and Adam takes constant $\theta_{t}$, i.e., $\theta_{t}=\theta<1$. The motivation is to use the exponential moving average of squares of past stochastic gradients. In practice, the parameter $\theta$ is recommended to be set very close to 1. For instance, a commonly adopted value is $\theta=0.999$.

Although great performance in practice has been observed, such a constant parameter setting has the serious flaw that there is no convergence guarantee even for convex optimization, as proved by the counterexamples in (Reddi et al., 2018). Since then, much work has been done to analyze the divergence issue of Adam and to propose modifications with convergence guarantees, as summarized in the introduction section. However, there is still no satisfactory explanation that touches the fundamental reason for the divergence. In this section, we try to provide more insights into the divergence issue of Adam/RMSProp with constant parameter $\theta_{t}$, based on our analysis of the sufficient condition for convergence.

From the sufficient condition perspective. Let $\alpha_{t}\!=\!\eta/t^{s}$ for $0\leq s\!<\!1$ and $\theta_{t}\!=\!\theta\!<\!1$ . According to Corollary 7, $Bound(T)$ in Theorem 2 has the following estimate:

The bounds tell us some points on Adam with constant $\theta_{t}$ :

$Bound(T)\!=\!\mathcal{O}(1)$ , so the convergence is not guaranteed. This result coincides with the divergence issue demonstrated in (Reddi et al., 2018). Indeed, since in this case Adam is not convergent, this is the best bound we can have.

Consider the dependence on parameter $s$. The bound is increasing in $s$, so the best bound in this case is attained at $s=0$, i.e., when the base learning rate is taken constant. This explains why in practice taking a more aggressive constant base learning rate often leads to even better performance compared with taking a decaying one.

Consider the dependence on parameter $\theta$. Note that the constants $C$ and $C^{\prime}$ depend on $\theta_{1}$ instead of the whole sequence $\theta_{t}$. We can always set $\theta_{t}=\theta$ for $t\geq 2$ while fixing $\theta_{1}<\theta$, by which we can take $C$ and $C^{\prime}$ independent of the constant $\theta$. Then the principal term of $Bound(T)$ is linear in $\sqrt{1-\theta}$, so it decreases to zero as $\theta\to 1$. This explains why setting $\theta$ close to 1 often results in better performance in practice.
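The dependence on $s$ and $\theta$ discussed in these points can be illustrated numerically. The sketch below evaluates only the factor $\sum_{t=1}^{T}\alpha_{t}\sqrt{1-\theta}/(T\alpha_{T})$ of $Bound(T)$ with $\alpha_{t}=\eta/t^{s}$ and constant $\theta$; the constants $C$ and $C^{\prime}$ are left out, and the helper name and numeric values are assumptions.

```python
import math

def principal_term(s, theta, T, eta=0.1):
    """sum_t alpha_t * sqrt(1 - theta) / (T * alpha_T), alpha_t = eta / t^s."""
    total = sum(eta / t ** s for t in range(1, T + 1)) * math.sqrt(1.0 - theta)
    return total / (T * eta / T ** s)

T = 100_000

# Fixed theta, varying decay exponent s of the base learning rate.
by_s = {s: principal_term(s, 0.999, T) for s in (0.0, 0.25, 0.5, 0.75)}

# Constant base learning rate (s = 0), varying theta.
by_theta = {th: principal_term(0.0, th, T) for th in (0.9, 0.99, 0.999)}
```

For large $T$ the computed factor behaves roughly like $\sqrt{1-\theta}/(1-s)$: it is smallest at $s=0$ and shrinks as $\theta$ approaches 1.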

Moreover, Corollary 10 shows us how the convergence rate continuously changes when we continuously vary parameters $\theta_{t}$. Let us fix $\alpha_{t}\!=\!1/\sqrt{t}$ and consider the following continuous family of parameters $\{\theta_{t}^{(r)}\}$ with $r\in[0,1]$:

Note that when $r=1$, we have $\theta_{t}=1-1/t$, which is AdaEMA with the convergence rate $\mathcal{O}\left(\sqrt{\log T/\sqrt{T}}\right)$; when $r=0$, we have $\theta_{t}=\bar{\theta}<1$, which is the original Adam with constant $\theta_{t}$ and only has the $\mathcal{O}(1)$ bound; when $0<r<1$, by Corollary 10, the algorithm has the $\mathcal{O}(T^{-r/4})$ convergence rate. Along this continuous family of parameters, we observe that the theoretical convergence rate continuously deteriorates as the real parameter $r$ decreases from 1 to 0, namely, as we gradually shift from AdaEMA to Adam with constant $\theta_{t}$. In the limiting case, convergence of the latter is no longer guaranteed.

Complexity Analysis for Practical Adam: Mini-batch/Distributed Adam

Due to limited time, limited computational resources, and noise in data collection and processing, it is almost impossible to reach an exact stationary point. Thus, in practice, more attention is paid to reaching an approximate stationary point. The crucial question then becomes how much time is needed to reach such an approximate solution. This section answers this question by determining how many iterations are needed to obtain an $\varepsilon$-stationary point. First, we define an $\varepsilon$-stationary point as follows:

We define a random variable $\bm{x}$ as an $\varepsilon$ -stationary point of problem (1), if

According to Definition 12 and Theorem 2, for generic Adam, we can directly give the following corollary:

For any $\varepsilon>0$ , if we take $T\geq C_{5}^{2}\varepsilon^{-4}$ , $\alpha_{t}=\frac{\alpha}{\sqrt{T}},\ \beta_{t}=\beta,\theta_{t}=1-\frac{\theta}{T}$ , which satisfy $\gamma=\frac{\beta}{1-\frac{\theta}{T}}<1$ and $\theta_{t}\geq\frac{1}{4}$ , then by taking $\tau$ uniformly from $\{1,2,\cdots,T\}$ , it holds that

Difference from Corollary 7. Corollary 7 shows that when $\lim_{t\rightarrow\infty}\theta_{t}<1$ the algorithm cannot converge to a stationary point. However, because the goal of the algorithm switches to an $\varepsilon$ -stationary point, the choice of $\alpha_{t}$ and $\theta_{t}$ does not contradict Corollary 7. We will use the same choice in the following section.
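To make this parameter choice concrete, the following minimal sketch computes $\alpha_{t}=\alpha/\sqrt{T}$ , $\theta_{t}=1-\theta/T$ , and $\chi_{t}=\alpha_{t}/\sqrt{1-\theta_{t}}$ for several horizons $T$ . The function name `finite_horizon_params` and the values $\alpha=0.1$ , $\beta=0.9$ , $\theta=0.75$ are illustrative assumptions, not values taken from the paper. The sketch illustrates that, for any fixed $T$ , $\theta_{t}$ is a constant below 1, while it approaches 1 as $T$ grows.

```python
import math

def finite_horizon_params(T, alpha=0.1, beta=0.9, theta=0.75):
    """Horizon-dependent parameters (constant in t) for a run of T iterations.

    alpha, beta, theta are hypothetical constants chosen for illustration.
    """
    alpha_t = alpha / math.sqrt(T)              # alpha_t = alpha / sqrt(T)
    theta_t = 1.0 - theta / T                   # theta_t = 1 - theta / T
    chi_t = alpha_t / math.sqrt(1.0 - theta_t)  # chi_t = alpha_t / sqrt(1 - theta_t)
    return alpha_t, beta, theta_t, chi_t

horizons = (1, 10, 1000, 100000)
params = {T: finite_horizon_params(T) for T in horizons}
for T, (alpha_t, beta_t, theta_t, chi_t) in params.items():
    print(f"T={T:>6d}  alpha_t={alpha_t:.5f}  theta_t={theta_t:.6f}  chi_t={chi_t:.5f}")
```

With these choices $\chi_{t}$ equals $\alpha/\sqrt{\theta}$ for every $T$ , because the factors $1/\sqrt{T}$ in $\alpha_{t}$ and $\sqrt{1-\theta_{t}}$ cancel.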

With the parameter setting in Corollary 10, the number of iterations $T$ should satisfy $\frac{T}{\log^{2}T}=\Omega(\varepsilon^{-4})$ , instead of $T=\Omega(\varepsilon^{-4})$ , which requires a larger number of iterations. However, the parameter setting in Corollary 10 does not depend on $T$ , whereas the parameters in Corollary 13 must be changed for different $T$ .

From Theorem 13, to achieve an $\varepsilon$ -stationary point, only $\Omega(\varepsilon^{-4})$ iterations are needed. Comparing the result with SGD (Li et al., 2014), we have the same order of iterations to achieve an $\varepsilon$ -stationary point.

In the following two sections, we will analyze two practical Adam variants, i.e., mini-batch Adam and distributed Adam. Although both can be analyzed with the same technique, we present the two algorithms separately for readers from different communities (single-machine learning (mini-batch Adam) vs. multi-machine learning (distributed Adam)).

It has been shown that when $s$ samples are used to estimate the gradient in the stochastic gradient descent algorithm, convergence can be accelerated by a factor of $s$ compared with single-sample algorithms (Li et al., 2014). Meanwhile, in practice, the mini-batch technique is widely used to optimize problem (1), for example when training a neural network with the Adam algorithm. In this section, we analyze mini-batch Adam. The mini-batch Adam algorithm is defined in Algorithm 2. Different from Algorithm 1, in Algorithm 2, $s$ samples, which are independent and identically distributed given the iterate $\bm{x}_{t}$ , are used to estimate the gradient $\bm{\nabla}f(\bm{x}_{t})$ . We average the $s$ estimates and use the averaged stochastic gradient, which should be a more accurate estimate, to update $\bm{x}_{t}$ .
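The averaging step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a transcription of Algorithm 2: it assumes the update $\bm{x}_{t+1}=\bm{x}_{t}-\alpha_{t}\bm{m}_{t}/\sqrt{\bm{v}_{t}}$ with $\bm{m}_{0}=\bm{0}$ and $\bm{v}_{0}=\bm{\epsilon}$ , and the names `minibatch_adam` and `noisy_quadratic_grad`, the toy objective, and all numerical constants are hypothetical.

```python
import numpy as np

def minibatch_adam(grad_sample, x0, T, s, alpha=0.5, beta=0.9, theta=0.5,
                   eps=1e-8, seed=0):
    """Sketch of Adam driven by the average of s i.i.d. stochastic gradients."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)          # first-moment estimate, m_0 = 0
    v = np.full_like(x, eps)      # second-moment estimate, v_0 = eps
    alpha_t = alpha / np.sqrt(T)  # alpha_t = alpha / sqrt(T)
    theta_t = 1.0 - theta / T     # theta_t = 1 - theta / T
    for _ in range(T):
        # Draw s independent gradient samples at the same iterate and average them.
        g = np.mean([grad_sample(x, rng) for _ in range(s)], axis=0)
        m = beta * m + (1.0 - beta) * g
        v = theta_t * v + (1.0 - theta_t) * g * g
        x = x - alpha_t * m / np.sqrt(v)
    return x

def noisy_quadratic_grad(x, rng):
    """Toy oracle: gradient of 0.5 * ||x||^2 plus unit Gaussian noise."""
    return x + rng.standard_normal(x.shape)

x_final = minibatch_adam(noisy_quadratic_grad, x0=np.ones(5), T=2000, s=16)
print("final distance to the minimizer:", np.linalg.norm(x_final))
```

The only difference from a single-sample run is the line that averages `s` gradient samples; setting `s=1` recovers the single-sample update in this sketch.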

To link sample size and convergence rate, we give a new assumption on the stochastic gradient and state it as follows:

We add (A5) to establish the relation between the sample size and the convergence rate. Intuitively, as the sample size increases, the variance of the gradient estimator should decrease. Utilizing this reduction, we can obtain an $\varepsilon$ -stationary point with fewer iterations but a larger sample size. This assumption is also widely used in the literature, e.g., in Yan et al. (2018). The following results are given under assumptions (A1) to (A5), and the result for mini-batch Adam is given as follows:
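The intuition behind this variance reduction can be checked numerically. The sketch below assumes, purely for illustration, that each stochastic gradient equals the true gradient plus independent Gaussian noise with standard deviation `sigma` per coordinate; the exact form of (A5) may differ. Under this assumption the mean squared error of the averaged gradient is $d\sigma^{2}/s$ . The names `true_grad`, `sigma`, and `empirical_variance` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])  # hypothetical true gradient, d = 3
sigma = 2.0                             # hypothetical per-coordinate noise level

def empirical_variance(s, trials=20000):
    """Mean squared error of the average of s noisy gradient samples."""
    noise = sigma * rng.standard_normal((trials, s, true_grad.size))
    estimates = true_grad + noise.mean(axis=1)  # one mini-batch average per trial
    return float(np.mean(np.sum((estimates - true_grad) ** 2, axis=1)))

variances = {s: empirical_variance(s) for s in (1, 4, 16)}
for s, var in variances.items():
    print(f"s={s:>2d}  empirical variance={var:.3f}  s * variance={s * var:.3f}")
```

In this toy model the product `s * variance` stays close to $d\sigma^{2}$ across batch sizes, which is the kind of $1/s$ scaling that the analysis exploits.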

For any $\varepsilon>0$ , if we take $\alpha_{t}=\frac{\alpha}{\sqrt{T}}$ , $\beta_{t}=\beta$ and $\theta_{t}=1-\frac{\theta}{T}$ , which satisfy $\gamma=\frac{\beta_{t}^{2}}{\theta_{t}}<1$ , $\theta_{t}\geq\frac{1}{4}$ and $F_{T}(T,s)\leq\varepsilon$ , then there exists $t\in\{1,2,\cdots,T\}$ such that

where by taking $\epsilon=\frac{1}{sd}$ , it holds that

Thus, to achieve an $\varepsilon$ -stationary point, $\Omega(\varepsilon^{-4}s^{-1})$ iterations are needed.

Below, we give three comments on the above results: (i) From Theorem 17, to achieve an $\varepsilon$ -stationary point, when we only consider the order with respect to $\varepsilon$ , $\Omega(\varepsilon^{-4})$ iterations are needed. Moreover, by jointly considering $\varepsilon$ and the batch size $s$ , we can accelerate the algorithm so that $\Omega(\varepsilon^{-4}s^{-1})$ iterations are needed to achieve an $\varepsilon$ -stationary point, which indicates that mini-batch Adam can be linearly accelerated with respect to the mini-batch size. The result is of the same order as that of mini-batch SGD in (Li et al., 2014). (ii) Deriving the linear speedup property of mini-batch Adam with respect to the mini-batch size is much more difficult than for mini-batch SGD (Li et al., 2014), since the adaptive learning rate in Algorithm 2 is highly coupled with the mini-batch stochastic gradient estimates. In fact, the adaptive learning rate implicitly adjusts the magnitude of the learning rate with respect to the mini-batch size, while the hand-crafted learning rate in mini-batch SGD has to be tuned carefully via a linear LR scaling technique (Krizhevsky, 2014; You et al., 2017) for large mini-batch training. (iii) The dimension dependence of the above analysis is $O(\sqrt{d})$ . Meanwhile, some analyses of (variants of) Adam (Chen et al., 2018a; Défossez et al., 2020; Zou et al., 2018) achieve the same dimension dependence, while Zhou et al. (2018a) showed that in AMSGrad, RMSProp and Adagrad the dependence can be $O(d^{-1/4})$ .
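The scaling in comment (i) can be illustrated with a short calculation. The sketch below treats the iteration count as $c\,\varepsilon^{-4}s^{-1}$ with a hypothetical constant $c=1$ ; the function name `iterations_needed` and the constant are assumptions for illustration only, and the true constants depend on the problem.

```python
import math

def iterations_needed(eps, s, c=1.0):
    """Order-of-magnitude iteration count c * eps**-4 / s (c is hypothetical)."""
    return c * eps ** -4 / s

base = iterations_needed(0.1, 1)
for s in (1, 2, 4, 8):
    T = iterations_needed(0.1, s)
    # The total number of gradient samples s * T does not change with s.
    print(f"s={s}  iterations={T:.0f}  total samples={s * T:.0f}")

# Halving eps multiplies the iteration count by 2**4 = 16.
ratio = iterations_needed(0.05, 1) / base
print("ratio for eps 0.1 -> 0.05:", round(ratio, 6))
```

In this idealized model a larger batch reduces the number of iterations proportionally while leaving the total sample count unchanged.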

Proof Sketch of Theorem 17

(Lemma 40 in Appendix) Let $\{\bm{x}_{t}\}$ be the sequence generated by Algorithm 2 or 3. For $T\geq 1$ , when $\alpha_{t}=\frac{\alpha}{\sqrt{T}}$ and $\theta_{t}=1-\frac{\theta}{T}$ , we have

for some constants $\zeta_{3}$ , $\zeta_{4}$ and $\zeta_{5}$ .

(Lemma 41 in Appendix) Let $\{\bm{x}_{t}\}$ be the sequence generated by Algorithm 2 or 3. For $T\geq 1$ , it holds that

for some constants $\zeta_{6}$ , $\zeta_{7}$ and $\zeta_{8}$ .

Convergence Analysis for Distributed Adam

For large-scale problems such as training deep convolutional neural networks over the ImageNet dataset (Russakovsky et al., 2015), it is hard to optimize problem (1) on a single machine. In this section, we extend mini-batch Adam to distributed Adam, analogous to the distributed SGD method (Yu et al., 2019). The simplest structure in the distributed setting is the parameter-server model, where a parameter server and multiple workers are involved in the optimization process. As shown in Algorithm 3, in each iteration, a worker receives the iterate $\bm{x}_{t}$ from the server, samples a stochastic gradient with respect to $\bm{x}_{t}$ , and sends the gradient to the server. Meanwhile, the parameter server receives the gradients from the workers in each iteration, averages them, and performs an Adam update.

Algorithm 3 with $s$ workers performs the same as Algorithm 2 with $s$ i.i.d. samples.
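This equivalence can be illustrated with a small single-process simulation. The sketch below is an assumption-laden toy, not Algorithm 3 itself: workers are simulated by function calls that share one random generator, the update is assumed to be $\bm{x}_{t+1}=\bm{x}_{t}-\alpha_{t}\bm{m}_{t}/\sqrt{\bm{v}_{t}}$ , and the names `adam_update`, `worker_gradient`, and `run` are hypothetical.

```python
import numpy as np

def adam_update(x, m, v, g, alpha_t, beta, theta_t):
    """One Adam step performed by the server (or by the single machine)."""
    m = beta * m + (1.0 - beta) * g
    v = theta_t * v + (1.0 - theta_t) * g * g
    return x - alpha_t * m / np.sqrt(v), m, v

def worker_gradient(x, rng):
    """Toy stochastic gradient of 0.5 * ||x||^2 computed at the received iterate."""
    return x + rng.standard_normal(x.shape)

def run(s, T, seed, distributed):
    rng = np.random.default_rng(seed)
    x, m, v = np.ones(3), np.zeros(3), np.full(3, 1e-8)
    alpha_t, beta, theta_t = 0.5 / np.sqrt(T), 0.9, 1.0 - 0.5 / T
    for _ in range(T):
        if distributed:
            # Server broadcasts x; each of the s workers returns one gradient.
            received = [worker_gradient(x, rng) for _ in range(s)]
            g = sum(received) / s  # server-side averaging
        else:
            # Single machine: draw a mini-batch of s samples and average.
            batch = np.stack([worker_gradient(x, rng) for _ in range(s)])
            g = batch.mean(axis=0)
        x, m, v = adam_update(x, m, v, g, alpha_t, beta, theta_t)
    return x

x_dist = run(s=4, T=200, seed=0, distributed=True)
x_mini = run(s=4, T=200, seed=0, distributed=False)
print("max difference:", np.max(np.abs(x_dist - x_mini)))
```

With the same random samples the two variants produce the same iterates up to floating-point rounding, since both feed the same averaged gradient into the same update.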

Below, we give two remarks on the above distributed Adam algorithm: (i) For distributed Adam, to achieve an $\varepsilon$ -stationary point, $\Omega(\varepsilon^{-4}s^{-1})$ iterations are needed, which is a linear speedup with respect to the number of workers in the network and is of the same order as that of distributed SGD (Yu et al., 2019). (ii) Distributed Adam has been popularly used for training deep neural networks. In addition, there exist several variants of the distributed Adam algorithm, such as PMD-LAMB (Wang et al., 2020), LAMB (You et al., 2019), LARS (You et al., 2017), etc., for training large-scale deep neural networks. However, none of these works establishes the linear speedup property for distributed adaptive methods.

Experimental Results

In this section, we experimentally validate the proposed sufficient condition by applying Generic Adam and RMSProp to solve the counterexample (Chen et al., 2018a) and to train LeNet (LeCun et al., 1998) on the MNIST dataset (LeCun et al., 2010) and ResNet (He et al., 2016) on the CIFAR-100 dataset (Krizhevsky, 2009). Moreover, we train the same two models with different batch sizes to validate the theory of the mini-batch Adam algorithm.

Sensitivity of parameter $r$ . We set $T=10^{5}$ , $\alpha_{t}=5/\sqrt{t}$ , $\beta=0.9$ , and $\theta_{t}$ as $\theta_{t}^{(r)}=1-(0.01+0.99r^{2})/{t^{r}}$ with $r\in\{0,\ 0.25,\ 0.5,\ 0.75,\ 1.0\}$ , respectively. Note that when $r=0$ , Generic Adam reduces to the original, divergent Adam (Kingma and Ba, 2014) with $(\beta,\bar{\theta})=(0.9,0.99)$ . When $r=1$ , Generic Adam reduces to AdaEMA (Chen et al., 2018a) with $\beta=0.9$ .

The experimental results are shown in the left figure of Figure 1. We can see that for $r=1.0,0.75$ and $0.5$ , Generic Adam is convergent. Moreover, the convergence becomes slower when $r$ decreases, which exactly matches Corollary 10. On the other hand, for $r=0$ and $0.25$ , Figure 1 shows that they do not converge. The divergence for $r=0.25$ may seem to contradict our theory. However, when $r$ is very small, the $\mathcal{O}(T^{-r/4})$ convergence rate is so slow that a convergent trend may not be visible even within $10^{5}$ iterations. Indeed, for $r=0.25$ , we actually have

which is not sufficiently close to 1. As a complementary experiment, we fix the numerator and only change $r$ when $r$ is small. We keep $\alpha_{t}$ and $\beta_{t}$ unchanged, while taking $\theta_{t}^{(r)}=1-\frac{0.01}{t^{r}}$ for $r=0$ and $0.25$ , respectively. The result is shown in the middle figure of Figure 1. We can see that Generic Adam with $r=0.25$ is indeed convergent in this situation.
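The two schedules used in these experiments can be compared directly. The sketch below evaluates $\theta_{t}^{(r)}=1-(0.01+0.99r^{2})/t^{r}$ and the fixed-numerator variant $\theta_{t}^{(r)}=1-0.01/t^{r}$ ; the function names `theta_main` and `theta_fixed_numerator` are hypothetical, and the printed values are an illustration rather than a reproduction of the omitted display above.

```python
import math

def theta_main(t, r):
    """Schedule of the first experiment: 1 - (0.01 + 0.99 r^2) / t^r."""
    return 1.0 - (0.01 + 0.99 * r ** 2) / t ** r

def theta_fixed_numerator(t, r):
    """Schedule of the complementary experiment: 1 - 0.01 / t^r."""
    return 1.0 - 0.01 / t ** r

T = 10 ** 5
for r in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"r={r:<4}  theta_1={theta_main(1, r):.4f}  theta_T={theta_main(T, r):.6f}")

main_at_T = theta_main(T, 0.25)
fixed_at_T = theta_fixed_numerator(T, 0.25)
print("r=0.25 at t=T:", main_at_T, "(main) vs", fixed_at_T, "(fixed numerator)")
```

For $r=0.25$ the fixed-numerator schedule stays closer to 1 than the first schedule at every $t$ , because its numerator $0.01$ is smaller than $0.01+0.99r^{2}$ .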

Sensitivity of parameter $s$ . Now, we show the sensitivity to $s$ of the sufficient condition (SC) by fixing $r\!=\!0.8$ and selecting $s$ from the set $\{0.4,0.6,0.8\}$ . The right figure in Figure 1 illustrates the sensitivity to parameter $s$ when Generic Adam is applied to solve the counterexample (10). The performance shows that when $r$ is fixed, a smaller $s$ can lead to faster and better convergence, which also coincides with the convergence results in Corollary 10.

LeNet on MNIST and ResNet-18 on CIFAR-100

In the experiments, for Generic Adam, we set $\theta_{t}^{(r)}=1-(0.001+0.999r)/t^{r}$ with $r\in\{0,0.25,0.5,0.75,1\}$ and $\beta_{t}=0.9$ ; for RMSProp, we set $\beta_{t}=0$ and $\theta_{t}=1-\frac{1}{t}$ , following the parameter settings in Mukkamala and Hein (2017). For fairness, the base learning rates $\alpha_{t}$ in Generic Adam, RMSProp, and AMSGrad are all set to $0.001/\sqrt{t}$ . Figures 3 and 3 illustrate the results of Generic Adam with different $r$ , RMSProp, and AMSGrad for training LeNet on MNIST and training ResNet-18 on CIFAR-100, respectively. We can see that AMSGrad and Adam (Generic Adam with $r=0$ ) decrease the training loss most slowly and show the worst test accuracy among the compared optimizers. One possible reason is the use of a constant $\theta$ in AMSGrad and the original Adam. From Figures 3 and 3, we observe that the convergence of Generic Adam is extremely sensitive to the choice of the parameter $\theta_{t}$ . A larger $r$ can contribute to a faster convergence rate of Generic Adam, which corroborates the theoretical result in Corollary 10. Additionally, the test accuracies in Figures 3(b) and 3(b) indicate that a smaller training loss can contribute to a higher test accuracy for Generic Adam.

Experiments on Practical Adam

In this section, we apply mini-batch Adam algorithms and mini-batch SGD algorithms to the following quadratic minimization task:

Vision Tasks

Transformer XL on WikiText-103

We also applied mini-batch Adam to train a base model of Transformer XL (Dai et al., 2019) on the WikiText-103 dataset (Merity et al., 2016). The base model of Transformer XL contains 16 self-attention layers. In each self-attention layer, there are 10 heads, and the encoding dimension of each head is set to 41. The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified ‘Good’ and ‘Featured’ articles on Wikipedia. We adopt the same parameter settings provided by the authors but test batch sizes in $\{30,60,120\}$ . The results are shown in Figure 7. As shown in the figure, a larger batch size gives a lower training loss in all experiments. Meanwhile, AMSGrad and Adam achieve much better performance than SGD-M, which shows the benefit of using adaptive methods instead of SGD-based methods, as noted in Zhang et al. (2019).

Conclusions

In this work, we studied the convergence of Adam and presented an easy-to-check sufficient condition that guarantees its convergence in the non-convex stochastic setting. This sufficient condition depends only on the base learning rate and the linear combination parameter of the second-order moments. Relying on this sufficient condition, we found that the divergence of Adam is possibly due to incorrect parameter settings. Besides, for practical Adam, we theoretically showed that the number of samples linearly speeds up the convergence in both the mini-batch setting and the distributed setting, which closes the gap between theory and practice. Finally, the correctness of the theoretical results has been verified on the counterexample and by training deep neural networks on real-world datasets.

Acknowledgement

This work is supported by the Major Science and Technology Innovation 2030 “Brain Science and Brain-like Research” key project (No. 2021ZD0201405).

Appendix A Key Lemma to prove Theorem 2 and Theorem 13

In this section we provide the necessary lemmas for the proofs of Theorem 2 and Theorem 13. First, we give some notations for simplifying the following proof.

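Given $S_{0}>0$ and a non-negative sequence $\{s_{t}\}$ , let $S_{t}=S_{0}+\sum_{i=1}^{t}s_{i}$ for $t\geq 1$ . Then the following estimate holds

$$\sum_{t=1}^{T}\frac{s_{t}}{S_{t}}\leq\log(S_{T})-\log(S_{0}).$$

Proof The finite sum $\sum_{t=1}^{T}{s_{t}}/{S_{t}}$ can be interpreted as a Riemann sum $\sum_{t=1}^{T}(S_{t}-S_{t-1})/S_{t}.$ Since $1/x$ is decreasing on the interval $(0,\infty)$ , we have

$$\sum_{t=1}^{T}\frac{S_{t}-S_{t-1}}{S_{t}}\leq\int_{S_{0}}^{S_{T}}\frac{1}{x}\,\mathrm{d}x=\log(S_{T})-\log(S_{0}).$$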
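The Riemann-sum argument can be checked numerically. The sketch below assumes that the estimate in question is the logarithmic bound $\sum_{t=1}^{T}s_{t}/S_{t}\leq\log(S_{T})-\log(S_{0})$ , which is what comparing the sum with the integral of $1/x$ over $[S_{0},S_{T}]$ yields; the function name `log_sum_sides` and the random test sequences are hypothetical.

```python
import math
import random

def log_sum_sides(S0, s):
    """Return (sum_t s_t / S_t, log(S_T) - log(S_0)) for S_t = S_0 + s_1 + ... + s_t."""
    S, lhs = S0, 0.0
    for s_t in s:
        S += s_t          # S_t = S_{t-1} + s_t
        lhs += s_t / S    # right-endpoint Riemann term (S_t - S_{t-1}) / S_t
    return lhs, math.log(S) - math.log(S0)

random.seed(0)
results = []
for _ in range(100):
    S0 = random.uniform(1e-3, 10.0)
    s = [random.uniform(0.0, 5.0) for _ in range(200)]
    results.append(log_sum_sides(S0, s))

largest_violation = max(lhs - rhs for lhs, rhs in results)
print("largest value of lhs - rhs:", largest_violation)
```

Each term $s_{t}/S_{t}$ is at most the integral of $1/x$ over $[S_{t-1},S_{t}]$ because $1/x$ is decreasing, so `lhs - rhs` should never be positive beyond rounding error.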

Let $\{u_{t}\}$ and $\{s_{t}\}$ be two non-negative sequences. Let $S_{t}=\sum_{i=1}^{t}s_{i}$ for $t\geq 1$ . Then

Let $\{\theta_{t}\}$ and $\{\alpha_{t}\}$ satisfy the restrictions (R2) and (R3). For any $i\leq t$ , we have

Proof For any $i\leq t$ , since the sequence $\{a_{t}\}$ is non-increasing, we have $a_{t}\leq a_{i}$ . Hence,

which proves the first inequality. On the other hand, since $\{\theta_{t}\}$ is non-decreasing, it holds

Let $\Theta_{(t,i)}=\prod_{j=i+1}^{t}\theta_{j}$ for $i<t$ and $\Theta_{(t,t)}=1$ by convention.

Fix a constant $\theta^{\prime}$ with $\beta^{2}<\theta^{\prime}<\theta$ . Let $C_{1}$ be as given in Eq. (6) in the main paper. For any $i\leq t$ , we have

Proof For any $i\leq t$ , since $\theta_{j}\geq\theta^{\prime}$ for $j\geq N$ , and $\theta_{j}<\theta^{\prime}$ for $j<N$ , we have

We take the constant $C_{1}=\prod_{j=1}^{N}(\theta_{j}/\theta^{\prime})$ , where $N$ is the maximum of the indices for which $\theta_{j}<\theta^{\prime}$ . The proof is completed.

If $\theta_{t}=\theta$ is a constant, we have $\Theta_{(t,i)}=\theta^{t-i}$ . In this case we can take $\theta^{\prime}=\theta$ and $C_{1}=1$ .
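The lower bound $\Theta_{(t,i)}\geq C_{1}(\theta^{\prime})^{t-i}$ , with $C_{1}=\prod_{j=1}^{N}(\theta_{j}/\theta^{\prime})$ , can be checked numerically for a concrete non-decreasing sequence. The sketch below uses the hypothetical choice $\theta_{t}=1-0.5/\sqrt{t}$ and $\theta^{\prime}=0.9$ ; both values and the function names are assumptions made only for illustration.

```python
import math

def theta_seq(t):
    """Hypothetical non-decreasing sequence with limit 1."""
    return 1.0 - 0.5 / math.sqrt(t)

def check_product_bound(theta_prime=0.9, T=60):
    thetas = [theta_seq(t) for t in range(1, T + 1)]  # thetas[j - 1] = theta_j
    below = [j for j in range(1, T + 1) if thetas[j - 1] < theta_prime]
    N = max(below) if below else 0                    # last index with theta_j < theta'
    C1 = math.prod(thetas[j - 1] / theta_prime for j in range(1, N + 1))
    worst = float("inf")
    for t in range(1, T + 1):
        for i in range(0, t + 1):
            # Theta_(t,i) = theta_{i+1} * ... * theta_t, and Theta_(t,t) = 1.
            Theta = math.prod(thetas[j - 1] for j in range(i + 1, t + 1))
            worst = min(worst, Theta / (C1 * theta_prime ** (t - i)))
    return N, C1, worst

N, C1, worst_ratio = check_product_bound()
print(f"N={N}  C1={C1:.3e}  smallest ratio Theta / (C1 * theta'^(t-i)) = {worst_ratio:.6f}")
```

The smallest ratio stays at or above 1 up to rounding, as the bound predicts. When the sequence is constant and $\theta^{\prime}=\theta$ , the list `below` is empty, the product defining `C1` is empty, and the sketch returns $C_{1}=1$ , matching the remark above.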

Let $\gamma:=\beta^{2}/{\theta^{\prime}}$ . We have the following estimate

Proof Let $B_{(t,i)}=\prod_{j=i+1}^{t}\beta_{j}$ for $i<t$ and $B_{(t,t)}=1$ by convention. By the iteration formula $\bm{m}_{t}=\beta_{t}\bm{m}_{t-1}+(1-\beta_{t})\bm{g}_{t}$ and $\bm{m}_{0}=\bm{0}$ , we have

Similarly, by $\bm{v}_{t}=\theta_{t}\bm{v}_{t-1}+(1-\theta_{t})\bm{g}_{t}^{2}$ and $\bm{v}_{0}=\bm{\epsilon}$ , we have

Note that $\{\theta_{t}\}$ is non-decreasing by (R2), and $B_{(t,i)}\leq\beta^{t-i}$ by (R1). By Lemma 27, we have

With the notations above, the following equality holds

Combining Eq. (19) and Eq. (20), we obtain the desired Eq. (17). The proof is completed.

where $C_{2}=2\left(\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\right)^{2}$ .

To estimate (I), by the Schwarz inequality and the Lipschitz continuity of the gradient of $f$ , we have

Note that $\bm{\delta}_{t}\leq G$ . Therefore,

where $C_{2}^{\prime}=\left(\frac{\beta/(1-\beta)}{\sqrt{C_{1}(1-\gamma)\theta_{1}}}+1\right)$ . The last inequality holds due to $\beta_{t}/(1-\beta_{t})\leq\beta/(1-\beta)$ as $\beta_{t}\leq\beta$ . Therefore, we have

Combining Eq. (28), Eq. (34), and Eq. (35), we obtain

The term (IV) is estimated similarly as term (III). First, we have

where $C_{2}^{\prime}$ is the constant defined above. We have

Combining Eq. (23), Eq. (24), Eq. (26), Eq. (27), Eq. (36), and Eq. (38), we obtain

As for term (I) in Lemma 30, we have

Then an argument similar to that for Eq. (34) implies that

Proof Note that $\bm{v}_{t}\geq\theta_{t}\bm{v}_{t-1}$ , so we have $\bm{v}_{t}\geq\left(\prod_{j=i+1}^{t}\theta_{j}\right)\bm{v}_{i}=\Theta_{(t,i)}\bm{v}_{i}$ . By Lemma 27, it follows that $\bm{v}_{t}\geq C_{1}(\theta^{\prime})^{t-i}\bm{v}_{i}$ for all $i\leq t$ . On the other hand,

Since $\alpha_{t}=\chi_{t}\sqrt{1-\theta_{t}}\leq\chi_{t}\sqrt{1-\theta_{i}}$ for $i\leq t$ , it follows that

By Lemma 26, $\chi_{t}\leq C_{0}\chi_{i},\forall i\leq t$ . Hence,

It is straightforward to show by induction that

By Lemma 26, it holds $\alpha_{t}\leq C_{0}\alpha_{i}$ for any $i\leq t$ . By Lemma 27, $\Theta_{(t,i)}\geq C_{1}(\theta^{\prime})^{t-i}$ . In addition, $B_{(t,i)}\leq\beta^{t-i}$ . Hence,

Combining Eq. (52) and Eq. (53), we then obtain the desired estimate Eq. (48). The proof is completed.

Proof Let $W_{0}=1$ and $W_{t}=\prod_{i=1}^{t}\theta_{i}^{-1}$ . Let $w_{t}=W_{t}-W_{t-1}=(1-\theta_{t})\prod_{i=1}^{t}\theta_{i}^{-1}=(1-\theta_{t})W_{t}$ . We therefore have

Note that $\bm{v}_{0}=\bm{\epsilon}$ and $\bm{v}_{t}=\theta_{t}\bm{v}_{t-1}+(1-\theta_{t})\bm{g}_{t}^{2}$ , so it holds that $W_{0}\bm{v}_{0}=\bm{\epsilon}$ and $W_{t}\bm{v}_{t}=W_{t-1}\bm{v}_{t-1}+w_{t}\bm{g}_{t}^{2}.$ Then, $W_{t}\bm{v}_{t}=W_{0}\bm{v}_{0}+\sum_{i=1}^{t}w_{i}\bm{g}_{i}^{2}=\bm{\epsilon}+\sum_{i=1}^{t}w_{i}\bm{g}_{i}^{2}.$ It follows that
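The telescoping identity $W_{t}\bm{v}_{t}=\bm{\epsilon}+\sum_{i=1}^{t}w_{i}\bm{g}_{i}^{2}$ can be verified numerically for a single coordinate. The sketch below assumes the recursion $v_{t}=\theta_{t}v_{t-1}+(1-\theta_{t})g_{t}^{2}$ with $v_{0}=\epsilon$ , and uses the hypothetical choice $\theta_{t}=t/(t+1)$ with Gaussian gradients; the function name `check_weighted_identity` is also hypothetical.

```python
import numpy as np

def check_weighted_identity(T=50, eps=1e-8, seed=0):
    """Return the largest relative gap between W_t v_t and eps + sum_i w_i g_i^2."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    theta = t / (t + 1.0)        # hypothetical schedule, non-decreasing in (0, 1)
    g = rng.standard_normal(T)   # scalar stochastic gradients (one coordinate)
    v, W, acc = eps, 1.0, eps    # v_0 = eps, W_0 = 1, running value of eps + sum
    max_rel_err = 0.0
    for k in range(T):
        v = theta[k] * v + (1.0 - theta[k]) * g[k] ** 2
        W_prev, W = W, W / theta[k]                       # W_t = W_{t-1} / theta_t
        w = W - W_prev                                    # w_t = W_t - W_{t-1}
        assert np.isclose(w, (1.0 - theta[k]) * W)        # w_t = (1 - theta_t) W_t
        acc += w * g[k] ** 2
        max_rel_err = max(max_rel_err, abs(W * v - acc) / acc)
    return max_rel_err

max_gap = check_weighted_identity()
print("largest relative gap:", max_gap)
```

The gap is at the level of floating-point rounding, reflecting that multiplying the recursion for $v_{t}$ by $W_{t}$ and using $W_{t}\theta_{t}=W_{t-1}$ gives the stated telescoping sum.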

Writing the norm in terms of coordinates, we obtain

The last inequality is due to the following trivial inequality:

for any non-negative parameters $a$ and $b$ . It then follows that

Proof For simplicity of notations, let $\omega_{t}:=\left\|\frac{\sqrt{1-\theta_{t}}\bm{g}_{t}}{\sqrt{\bm{v}_{t}}}\right\|^{2}$ , and $\Omega_{t}:=\sum_{i=1}^{t}\omega_{i}$ . Note that $\chi_{t}\leq C_{0}a_{t}$ . Hence,

Note that $a_{t}\leq\chi_{t}$ . Combining Eq. (62), Eq. (63), and Eq. (64), we have

Note that $\log(1+x)\leq x$ for all $x>-1$ . It follows that

Note that $\chi_{t}=\alpha_{t}/\sqrt{1-\theta_{t}}$ . By Eq. (62) and Eq. (64), we have

(Lemma 6 in Section 3) Let $\tau$ be randomly chosen from $\{1,2,\ldots,T\}$ with equal probabilities $p_{\tau}=1/T$ . We have the following estimate

Proof For any two random variables $X$ and $Y$ , by Hölder’s inequality, we have

Let $X=\left(\frac{\left\|\bm{\nabla}f(\bm{x}_{t})\right\|^{2}}{\sqrt{\left\|\bm{\hat{v}}_{t}\right\|_{1}}}\right)^{1/2}$ , $Y=\left\|\bm{\hat{v}}_{t}\right\|_{1}^{1/4}$ , and let $p=2$ , $q=2$ . By Eq. (68), we have

By Eq. (69), Eq. (70), and Eq. (71), we obtain

By Lemma 26, $\alpha_{T}\leq C_{0}\alpha_{t}$ for any $t\leq T$ , so $\alpha_{t}^{-1}\leq C_{0}\alpha_{T}^{-1}$ . Then, we obtain

Appendix B Proof of Theorem 2

where the constants $C$ and $C^{\prime}$ are given by

Proof By the $L$ -Lipschitz continuity of the gradient of $f$ and the descent lemma, we have

where $C_{3}$ is the constant given in Lemma 33. By applying the estimates in Lemma 34 and Lemma 36 to the second and third terms on the right-hand side of Eq. (78), and appropriately rearranging the terms, we obtain

Appendix C Proof of Corollary 10

Generic Adam with the above family of parameters converges as long as $0<r\leq 2s<2$ , and its non-asymptotic convergence rate is given by

Proof It is not hard to verify that the following equalities hold:

In this case, $T\alpha_{T}=\eta T^{1-s}$ . Therefore, by Theorem 2 the non-asymptotic convergence rate is given by

To guarantee convergence, we need $0<r\leq 2s<2$ .

Appendix D Proof of Theorem 13

For any $T>0$ , if we take $\alpha_{t}=\frac{\alpha}{\sqrt{T}},\ \beta_{t}=\beta,\theta_{t}=1-\frac{\theta}{T}$ , which satisfy $\gamma=\frac{\beta}{1-\frac{\theta}{T}}<1$ and $\theta_{t}\geq\frac{1}{4}$ , then it holds that

Proof Based on Theorem 2, by plugging $\alpha_{t},\beta_{t}$ and $\theta_{t}$ in the conclusion of Theorem 2, we can get the desired result.

Appendix E Key Lemma to prove Theorem 17

In this section, we provide the additional lemmas for the proof of Theorem 17.

With the definitions in Algorithm 2, for any $t=1,2,\cdots,T$ we have the following estimation:

Proof By a proof similar to that of Lemma 34, it holds that

Thus, by taking expectations on both sides, we can obtain

(Lemma 19 in Section 4.2) By the definition of $M_{t}$ , it holds that

Proof Using Lemma 33, by plugging $C_{0}=C_{1}=1$ , $\chi_{t}=\frac{\alpha}{\sqrt{\theta}}$ and $\theta_{t}\geq\frac{1}{4}$ , it holds that

(Lemma 20 in Section 4) The following estimation always holds:

With inequalities (83) and (84), we can obtain

Proof We discuss the solution of $x$ in 4 different situations. First, when $Bx\geq A$ and $D\sqrt{x}\geq C$ , we have

Secondly, when $Bx\geq A$ and $D\sqrt{x}\leq C$ , we have

Thirdly, when $Bx\leq A$ and $D\sqrt{x}\geq C$ , it holds that

Last, when $Bx\leq A$ and $D\sqrt{x}\leq C$ , it holds that

Therefore, combining four different conditions, we have $x\leq(4BD)^{2}+4BC+(4AD)^{2/3}+\sqrt{4AC}$ .

Appendix F Proof of Theorem 17

For any $T>0$ , if we take $\alpha_{t}=\frac{\alpha}{\sqrt{T}}$ , $\beta_{t}=\beta$ , $\theta_{t}=1-\frac{\theta}{T}$ , which satisfy $\gamma=\frac{\beta_{t}}{\theta_{t}}<1$ and $\theta_{t}\geq\frac{1}{4}$ , then there exists $t\in\{1,2,\cdots,T\}$ such that

In addition, by taking $\epsilon=\frac{1}{sd}$ , it holds that

Proof First, according to the gradient Lipschitz condition of $f$ , it holds

Using Lemmas 38, 39, and 40 and rearranging the corresponding terms, we have

Before using Lemma 41, we list the order of 4 terms in Lemma 41 as follows: