A Simple Convergence Proof of Adam and Adagrad

Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier

Introduction

First-order methods with adaptive step sizes have proved useful in many fields of machine learning, be it for sparse optimization (Duchi et al., 2013), tensor factorization (Lacroix et al., 2018) or deep learning (Goodfellow et al., 2016). Duchi et al. (2011) introduced Adagrad, which rescales each coordinate by a sum of squared past gradient values. While Adagrad proved effective for sparse optimization (Duchi et al., 2013), experiments showed that it under-performed when applied to deep learning (Wilson et al., 2017). RMSProp (Tieleman & Hinton, 2012) proposed an exponential moving average instead of a cumulative sum to solve this. Kingma & Ba (2015) developed Adam, one of the most popular adaptive methods in deep learning, built upon RMSProp and added corrective terms at the beginning of training, together with heavy-ball style momentum.

In the online convex optimization setting, Duchi et al. (2011) showed that Adagrad achieves optimal regret for online convex optimization. Kingma & Ba (2015) provided a similar proof for Adam when using a decreasing overall step size, although this proof was later shown to be incorrect by Reddi et al. (2018), who introduced AMSGrad as a convergent alternative. Ward et al. (2019) proved that Adagrad also converges to a critical point for non convex objectives with a rate $O(\ln(N)/\sqrt{N})$ when using a scalar adaptive step-size, instead of diagonal. Zou et al. (2019b) extended this proof to the vector case, while Zou et al. (2019a) displayed a bound for Adam, showing convergence when the decay of the exponential moving average scales as $1-1/N$ and the learning rate as $1/\sqrt{N}$ .

In this paper, we present a simplified and unified proof of convergence to a critical point for Adagrad and Adam for stochastic non-convex smooth optimization. We assume that the objective function is lower bounded, smooth and the stochastic gradients are almost surely bounded. We recover the standard $O(\ln(N)/\sqrt{N})$ convergence rate for Adagrad for all step sizes, and the same rate with Adam with an appropriate choice of the step sizes and decay parameters, in particular, Adam can converge without using the AMSGrad variant. Compared to previous work, our bound significantly improves the dependency on the momentum parameter $\beta_{1}$ . The best known bounds for Adagrad and Adam are respectively in $O((1-\beta_{1})^{-3})$ and $O((1-\beta_{1})^{-5})$ (see Section 3), while our result is in $O((1-\beta_{1})^{-1})$ for both algorithms. This improvement is a step toward understanding the practical efficiency of heavy-ball momentum.

The precise setting and assumptions are stated in the next section, and previous work is then described in Section 3. The main theorems are presented in Section 4, followed by a full proof for the case without momentum in Section 5. The proof of the convergence with momentum is deferred to the supplementary material, Section A. Finally we compare our bounds with experimental results, both on toy and real life problems in Section 6.

Setup

2 Adaptive methods

The parameter $\beta_{1}$ is a heavy-ball style momentum parameter (Polyak, 1964), while $\beta_{2}$ controls the decay rate of the per-coordinate exponential moving average of the squared gradients. Taking $\beta_{1}=0$ , $\beta_{2}=1$ and $\alpha_{n}=\alpha$ gives Adagrad. While the original Adagrad algorithm did not include a heavy-ball-like momentum, our analysis also applies to the case $\beta_{1}>0$ .

The original Adam algorithm (Kingma & Ba, 2015) uses a weighed average, rather than a weighted sum for (1) and (2), i.e. it uses

We can achieve the same definition by taking $\alpha_{\textrm{adam}}=\alpha\cdot\frac{1-\beta_{1}}{\sqrt{1-\beta_{2}}}$ . The original Adam algorithm further includes two corrective terms to account for the fact that $m_{n}$ and $v_{n}$ are biased towards 0 for the first few iterations. Those corrective terms are equivalent to taking a step-size $\alpha_{n}$ of the form

Those corrective terms can be seen as the normalization factors for the weighted sum given by (1) and (2) Note that each term goes to its limit value within a few times $1/(1-\beta)$ updates (with $\beta\in\{\beta_{1},\beta_{2}\}$ ). which explains the $(1-\beta_{1})$ term in (4). In the present work, we propose to drop the corrective term for $m_{n}$ , and to keep only the one for $v_{n}$ , thus using the alternative step size

This simplification motivated by several observations:

By dropping either corrective terms, $\alpha_{n}$ becomes monotonic, which simplifies the proof.

For typical values of $\beta_{1}$ and $\beta_{2}$ (e.g. 0.9 and 0.999), the corrective term for $m_{n}$ converges to its limit value much faster than the one for $v_{n}$ .

Removing the corrective term for $m_{n}$ is equivalent to a learning-rate warmup, which is popular in deep learning, while removing the one for $v_{n}$ would lead to an increased step size during early training. For values of $\beta_{2}$ close to 1, this can lead to divergence in practice.

We experimentally verify in Section 6.3 that dropping the corrective term for $m_{n}$ has no observable effect on the training process, while dropping the corrective term for $v_{n}$ leads to observable perturbations. In the following, we thus consider the variation of Adam obtained by taking $\alpha_{n}$ provided by (5).

3 Assumptions

We make three assumptions. We first assume $F$ is bounded below by $F_{*}$ , that is,

We discuss the use of assumption (7) in Section 4.2.

Related work

Early work on adaptive methods (McMahan & Streeter, 2010; Duchi et al., 2011) showed that Adagrad achieves an optimal rate of convergence of $O(1/\sqrt{N})$ for convex optimization (Agarwal et al., 2009). Later, RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) were developed for training deep neural networks, using an exponential moving average of the past squared gradients.

Kingma & Ba (2015) offered a proof that Adam with a decreasing step size converges for convex objectives. However, the proof contained a mistake spotted by Reddi et al. (2018), who also gave examples of convex problems where Adam does not converge to an optimal solution. They proposed AMSGrad as a convergent variant, which consisted in retaining the maximum value of the exponential moving average. When $\alpha$ goes to zero, AMSGrad is shown to converge in the convex and non-convex setting (Fang & Klabjan, 2019; Zhou et al., 2018). Despite this apparent flaw in the Adam algorithm, it remains a widely popular optimizer, raising the question as to whether it converges. When $\beta_{2}$ goes to $1$ and $\alpha$ to 0, our results and previous work (Zou et al., 2019a) show that Adam does converge with the same rate as Adagrad. This is coherent with the counter examples of Reddi et al. (2018), because they uses a small exponential decay parameter $\beta_{2}<1/5$ .

The convergence of Adagrad for non-convex objectives was first tackled by Li & Orabona (2019), who proved its convergence, but under restrictive conditions (e.g., $\alpha\leq\sqrt{\epsilon}/L$ ). The proof technique was improved by Ward et al. (2019), who showed the convergence of “scalar” Adagrad, i.e., with a single learning rate, for any value of $\alpha$ with a rate of $O(\ln(N)/\sqrt{N})$ . Our approach builds on this work but we extend it to both Adagrad and Adam, in their coordinate-wise version, as used in practice, while also supporting heavy-ball momentum.

The coordinate-wise version of Adagrad was also tackled by Zou et al. (2019b), offering a convergence result for Adagrad with either heavy-ball or Nesterov style momentum. We obtain the same rate for heavy-ball momentum with respect to $N$ (i.e., $O(\ln(N)/\sqrt{N})$ ), but we improve the dependence on the momentum parameter $\beta_{1}$ from $O((1-\beta_{1})^{-3})$ to $O((1-\beta_{1})^{-1})$ . Chen et al. (2019) also provided a bound for Adagrad and Adam, but without convergence guarantees for Adam for any hyper-parameter choice, and with a worse dependency on $\beta_{1}$ . Zhou et al. (2018) also cover Adagrad in the stochastic setting, however their proof technique leads to a $\sqrt{1/\epsilon}$ term in their bound, typically with $\epsilon{=}10^{-8}$ . Finally, a convergence bound for Adam was introduced by Zou et al. (2019a). We recover the same scaling of the bound with respect to $\alpha$ and $\beta_{2}$ . However their bound has a dependency of $O((1-\beta_{1})^{-5})$ with respect to $\beta_{1}$ , while we get $O((1-\beta_{1})^{-1})$ , a significant improvement. Shi et al. (2020) obtain similar convergence results for RMSProp and Adam when considering the random shuffling setup. They use an affine growth condition (i.e. norm of the stochastic gradient is bounded by an affine function of the norm of the deterministic gradient) instead of the boundness of the gradient, but their bound decays with the number of total epochs, not stochastic updates leading to an overall $\sqrt{s}$ extra term with $s$ the size of the dataset. Finally, Faw et al. (2022) use the same affine growth assumption to derive high probability bounds for scalar Adagrad.

Non adaptive methods like SGD are also well studied in the non convex setting (Ghadimi & Lan, 2013), with a convergence rate of $O(1/\sqrt{N})$ for a smooth objective with bounded variance of the gradients. Unlike adaptive methods, SGD requires knowing the smoothness constant. When adding heavy-ball momentum, Yang et al. (2016) showed that the convergence bound degrades as $O((1-\beta_{1})^{-2})$ , assuming that the gradients are bounded. We apply our proof technique for momentum to SGD in the Appendix, Section B and improve this dependency to $O((1-\beta_{1})^{-1})$ . Recent work by Liu et al. (2020) achieves the same dependency with weaker assumptions. Defazio (2020) provided an in-depth analysis of SGD-M with a tight Liapunov analysis.

Main results

If $\beta_{1}=0$ , this is equivalent to sampling $\tau$ uniformly in $\{0,\ldots,N{-}1\}$ . If $\beta_{1}>0$ , the last few $\frac{1}{1-\beta_{1}}$ iterations are sampled rarely, and iterations older than a few times that number are sampled almost uniformly. Our results bound the expected squared norm of the gradient at iteration $\tau$ , which is standard for non convex stochastic optimization (Ghadimi & Lan, 2013).

For simplicity, we first give convergence results for $\beta_{1}=0$ , along with a complete proof in Section 5. We then provide the results with momentum, with their proofs in the Appendix, Section A.6. We also provide a bound on the convergence of SGD with a $O(1/(1-\beta_{1})$ dependency in the Appendix, Section B.2, along with its proof in Section B.4.

2 Analysis of the bounds

Diving into the technicalities of the proof to come, we will see in Section 5 that we apply Lemma 5.2 once per dimension. The contribution from each coordinate is mostly independent of the actual scale of its gradients (as it only appears in the log), so that the right hand side of the convergence bound will grow as $d$ . In contrast, the scalar version of Adagrad (Ward et al., 2019) has a single learning rate, so that Lemma 5.2 is only applied once, removing the dependency on $d$ . However, this variant is rarely used in practice.

It is also possible to replace the bound on the gradient with an affine growth condition, i.e. the norm of the stochastic gradient is bounded by an affine function of the norm of the expected gradient. A proof for scalar Adagrad is provided by Faw et al. (2022). Shi et al. (2020) do the same for RMSProp, however their convergence bound is decays as $O(\log(T)/\sqrt{T})$ with $T$ the number of epoch, not the number of updates, leading to a significantly less tight bound for large datasets.

Looking at Theorems 3 and 4, we see that increasing $\beta_{1}$ always deteriorates the bounds. Taking $\beta_{1}=0$ in those theorems gives us almost exactly the bound without heavy-ball momentum from Theorems 1 and 2, up to a factor 3 in the terms of the form $dR^{2}$ .

As discussed in Section 3, previous bounds for Adagrad in the non-convex setting deteriorates as $O((1-\beta_{1})^{-3})$ (Zou et al., 2019b), while bounds for Adam deteriorates as $O((1-\beta_{1})^{-5})$ (Zou et al., 2019a). Our unified proof for Adam and Adagrad achieves a dependency of $O((1-\beta_{1})^{-1})$ , a significant improvement. We refer the reader to the Appendix, Section A.3, for a detailed analysis. While our dependency still contradicts the benefits of using momentum observed in practice, see Section 6, our tighter analysis is a step in the right direction.

Note that in (9), we sample with a lower probability the latest iterations. This can be explained by the fact that the proof technique for stochastic optimization in the non-convex case is based on the idea that for every iteration $n$ , either $\nabla F(x_{n})$ is small, or $F(x_{n+1})$ will decrease by some amount. However, when introducing momentum, and especially when taking the limit $\beta_{1}\rightarrow 1$ , the latest gradient $\nabla F(x_{n})$ has almost no influence over $x_{n+1}$ , as the momentum term updates slowly. Momentum spreads the influence of the gradients over time, and thus, it will take a few updates for a gradient to have fully influenced the iterate $x_{n}$ and thus the value of the function $F(x_{n})$ . From a formal point of view, the sampling weights given by (9) naturally appear as part of the proof which is presented in Section A.6.

3 Optimal finite horizon Adam is Adagrad

with $E=4dR^{2}N^{b/2}+dRLN^{b-a}$ . Let us ignore the log terms for now, and use $\frac{N^{-b}}{1-N^{-b}}\sim N^{-b}$ for $N\rightarrow\infty$ , to get

Adding back the logarithmic term, the best rate we can obtain is $O(\ln(N)/\sqrt{N})$ , and it is only achieved for $a=1/2$ and $b=1$ , i.e., $\alpha=\alpha_{1}/\sqrt{N}$ and $\beta_{2}=1-1/N$ . We can see the resemblance between Adagrad on one side and Adam with a finite horizon and such parameters on the other. Indeed, an exponential moving average with a parameter $\beta_{2}=1-1/N$ as a typical averaging window length of size $N$ , while Adagrad would be an exact average of the past $N$ terms. In particular, the bound for Adam now becomes

which differ from (10) only by a $+N/(N-1)$ next to the log term.

Our analysis highlights an important fact: Adam is to Adagrad like constant step size SGD is to decaying step size SGD. While Adagrad is asymptotically optimal, it also leads to a slower decrease of the term proportional to $F(x_{0})-F_{*}$ , as $1/\sqrt{N}$ instead of $1/N$ for Adam. During the initial phase of training, it is likely that this term dominates the loss, which could explain the popularity of Adam for training deep neural networks rather than Adagrad. With its default parameters, Adam will not converge. It is however possible to choose $\alpha$ and $\beta_{2}$ to achieve an $\epsilon$ critical point for $\epsilon$ arbitrarily small and, for a known time horizon, they can be chosen to obtain the exact same bound as Adagrad.

Remember that we recover Adagrad when $\alpha_{n}=\alpha$ for $\alpha>0$ and $\beta_{2}=1$ , while Adam can be obtained taking $0<\beta_{2}<1$ , $\alpha>0$ ,

i.e., we replace the last gradient contribution by its expected value conditioned on $f_{1},\ldots,f_{n-1}$ .

A problem posed by the update (16) is the correlation between the numerator and denominator. This prevents us from easily computing the conditional expectation and as noted by Reddi et al. (2018), the expected direction of update can have a positive dot product with the objective gradient. It is however possible to control the deviation from the descent direction, following Ward et al. (2019) with this first lemma.

Now we need to control the size of the second term $A$ ,

Given that $\epsilon+v\geq g^{2}$ and taking the conditional expectation we obtain

which we simplify using the same argument as for (24) into

Injecting (28) and (21) into (20) finishes the proof. ∎

Anticipating on Section 5.2, the previous Lemma gives us a bound on the deviation from a descent direction. While for a specific iteration, this deviation can take us away from a descent direction, the next lemma tells us that the sum of those deviations cannot grow larger than a logarithmic term. This key insight introduced in Ward et al. (2019) is what makes the proof work.

The first term forms a telescoping series, while the second one is bounded by $-\ln(\beta_{2})$ . Summing over all $j\in[N]$ gives the desired result. ∎

2 Proof of Adam and Adagrad without momentum

As explained in Section 2.2, we have $\alpha_{n}=\alpha$ for $\alpha>0$ . Using the smoothness of $F$ (8), we have

Summing the previous inequality for all $n\in[N]$ , taking the complete expectation, and using that $\sqrt{n}\leq\sqrt{N}$ gives us,

From there, we can bound the last sum on the right hand side using Lemma 5.2 once for each dimension. Rearranging the terms, we obtain the result of Theorem 1.

As given by (5) in Section 2.2, we have $\alpha_{n}=\alpha\sqrt{\frac{1-\beta_{2}^{n}}{1-\beta_{2}}}$ for $\alpha>0$ . Using the smoothness of $F$ defined in (8), we have

Taking the conditional expectation with respect to $f_{1},\ldots,f_{n-1}$ we can apply the descent Lemma 5.1 and use (34) to obtain from (33),

Given that $\beta_{2}<1$ , we have $\alpha_{n}\leq\frac{\alpha}{\sqrt{1-\beta_{2}}}$ . Summing the previous inequality for all $n\in[N]$ and taking the complete expectation yields

Applying Lemma 5.2 for each dimension and rearranging the terms finishes the proof of Theorem 2.

Experiments

On Figure 1, we compare the effective dependency of the average squared norm of the gradient in the parameters $\alpha$ , $\beta_{1}$ and $\beta_{2}$ for Adam, when used on a toy task and CIFAR-10.

Intuitively, each coordinate is pointing most of the time towards 1, but exceptionally towards -1 with a weight of $1/\sqrt{p_{i}}$ . Those rare events happens less and less often as $i$ increase, but with an increasing weight. Those weights are chosen so that all the coordinates of the gradient have the same varianceWe deviate from the a.s. bounded gradient assumption for this experiment, see Section 4.2 for a discussion on a.s. bound vs bound in expectation.. It is necessary to take different probabilities for each coordinate. If we use the same $p$ for all, we observe a phase transition when $1-\beta_{2}\approx p$ , but not the continuous improvement we obtain on Figure 1(a).

We take $\alpha$ from a uniform range in log-space between $10^{-6}$ and $10^{-2}$ with 9 values, for $1-\beta_{1}$ the range is from $10^{-5}$ to $0.3$ with 9 values, and for $1-\beta_{2}$ , from $10^{-6}$ to $10^{-1}$ with 11 values. Unlike for the toy problem, we do not initialize close to the optimum, as even after 600 epochs, the norm of the gradients indicates that we are not at a critical point. All curves are averaged over three runs. Error bars are plotted but not visible in log-log scale, except for large values of $\alpha$ .

2 Analysis

Let us now turn to Figure 1(b). As we start from random weights for this problem, we observe that a large step size gives the best performance, although we observe a high variance for the largest $\alpha$ . This indicates that training becomes unstable for large $\alpha$ , which is not predicted by the theory. This is likely a consequence of the bounded gradient assumption (7) not being verified for deep neural networks. We observe a small improvement as $1-\beta_{2}$ decreases, although nowhere near what we observed on our toy problem. Finally, we observe a sweet spot for the momentum $\beta_{1}$ , not predicted by our theory. We conjecture that this is due to the variance reduction effect of momentum (averaging of the gradients over multiple mini-batches, while the weights have not moved so much as to invalidate past information).

3 Impact of the Adam corrective terms

Using the same experimental setup on CIFAR-10, we compare the impact of removing either of the corrective term of the original Adam algorithm (Kingma & Ba, 2015), as discussed in Section 2.2. We ran a cartesian product of training for 100 epochs, with $\beta_{1}\in\{0,0.5,0.8,0.9,0.99\}$ , $\beta_{2}\in\{0.9,0.99,0.999,0.9999\}$ , and $\alpha\in\{10^{-4},10^{-3}\}$ . We report both the training loss and norm of the expected gradient on Figure 2. We notice a limited difference when dropping the corrective term on $m_{n}$ , but dropping the term $v_{n}$ has an important impact on the training trajectories. This confirm our motivation for simplifying the proof by removing the corrective term on the momentum.

Conclusion

We provide a simple proof on the convergence of Adam and Adagrad without heavy-ball style momentum. Our analysis highlights a link between the two algorithms: with right the hyper-parameters, Adam converges like Adagrad. The extension to heavy-ball momentum is more complex, but we significantly improve the dependence on the momentum parameter for Adam, Adagrad, as well as SGD. We exhibit a toy problem where the dependency on $\alpha$ and $\beta_{2}$ experimentally matches our prediction. However, we do not predict the practical interest of momentum, so that improvements to the proof are needed for future work.

The present theoretical results on the optimization of non convex losses in a stochastic settings impact our understanding of the training of deep neural network. It might allow a deeper understanding of neural network training dynamics and thus reinforce any existing deep learning applications. There would be however no direct possible negative impact to society.

References

Supplementary material for A Simple Convergence Proof of Adam and Adagrad

Overview

In Section A, we detail the results for the convergence of Adam and Adagrad with heavy-ball momentum. For an overview of the contributions of our proof technique, see Section A.4.

Then in Section B, we show how our technique also applies to SGD and improves its dependency in $\beta_{1}$ compared with previous work by Yang et al. (2016), from $O((1-\beta_{1})^{-2})$ to $O(1-\beta_{1})^{-1}$ . The proof is simpler than for Adam/Adagrad, and show the generality of our technique.

Appendix A Convergence of adaptive methods with heavy-ball momentum

For Adagrad (potentially extended with heavy-ball momentum), we have $\beta_{2}=1$ and

Notice we include the factor $1-\beta_{1}$ in the step size rather than in (A.1), as this allows for a more elegant proof. The original Adam algorithm included compensation factors for both $\beta_{1}$ and $\beta_{2}$ (Kingma & Ba, 2015) to correct the initial scale of $m$ and $v$ which are initialized at 0. Adam would be exactly recovered by replacing (A.2) with

i.e. the contribution from the $k$ last gradients are replaced by their expected value for know values of $f_{1},\ldots,f_{n-k-1}$ . For $k=1$ , we recover the same definition as in (18).

A.2 Results

If $\beta_{1}=0$ , this is equivalent to sampling $\tau$ uniformly in $\{0,\ldots,N-1\}$ . If $\beta_{1}>0$ , the last few $\frac{1}{1-\beta_{1}}$ iterations are sampled rarely, and all iterations older than a few times that number are sampled almost uniformly. We bound the expected squared norm of the total gradient at iteration $\tau$ , which is standard for non convex stochastic optimization (Ghadimi & Lan, 2013).

Note that like in previous works, the bound worsen as $\beta_{1}$ increases, with a dependency of the form $O((1-\beta_{1})^{-1})$ . This is a significant improvement over the existing bound for Adagrad with heavy-ball momentum, which scales as $(1-\beta_{1})^{-3}$ (Zou et al., 2019b), or the best known bound for Adam which scales as $(1-\beta_{1})^{-5}$ (Zou et al., 2019a).

Technical lemmas to prove the following theorems are introduced in Section A.5, while the proof of Theorems 3 and 4 are provided in Section A.6.

A.3 Analysis of the results with momentum

First notice that taking $\beta_{1}\rightarrow 0$ in Theorems 3 and 4, we almost recover the same result as stated in 2 and 1, only losing on the term $4dR^{2}$ which becomes $12dR^{2}$ .

Assuming $N\gg\frac{\beta_{1}}{1-\beta_{1}}$ and $\beta_{1}/\beta_{2}\approx\beta_{1}$ , which is verified for typical values of $\beta_{1}$ and $\beta_{2}$ (Kingma & Ba, 2015), it is possible to simplify the bound for Adam (13) as

Similarly, if we assume $N\gg\frac{\beta_{1}}{1-\beta_{1}}$ , we can simplify the bound for Adagrad (LABEL:app:eq:main_bound_adagad_momentum) as

The term $(1-\beta_{2})^{3/2}$ in the denominator in (A.9) is indeed compensated by the $\alpha^{2}$ in the numerator and we again recover the proper $\ln(N)/\sqrt{N}$ convergence rate, which matches (A.10) up to a $+1$ term next to the log.

A.4 Overview of the proof, contributions and limitations

There is a number of steps to the proof. First we derive a Lemma similar in spirit to the descent Lemma 5.1. There are two differences: first, when computing the dot product between the current expected gradient and each past gradient contained in the momentum, we have to re-center the expected gradient to its values in the past, using the smoothness assumption. Besides, we now have to decorrelate more terms between the numerator and denominator, as the numerator contains not only the latest gradient but a decaying sum of the past ones. We similarly extend Lemma 5.2 to support momentum specific terms. The rest of the proof follows mostly as in Section 5, except with a few more manipulation to regroup the gradient terms coming from different iterations.

Compared with previous work (Zou et al., 2019b; a), the re-centering of past gradients in (A.14) is a key aspect to improve the dependency in $\beta_{1}$ , with a small price to pay using the smoothness of $F$ which is compensated by the introduction of extra $G_{n-k,i}^{2}$ in (A.1). Then, a tight handling of the different summations as well as the the introduction of a non uniform sampling of the iterates (A.8), which naturally arises when grouping the different terms in (A.49), allow to obtain the overall improved dependency in $O((1-\beta_{1})^{-1})$ .

The same technique can be applied to SGD, the proof becoming simpler as there is no correlation between the step size and the gradient estimate, see Section B. If you want to better understand the handling of momentum without the added complexity of adaptive methods, we recommend starting with this proof.

A limitation of the proof technique is that we do not show that heavy-ball momentum can lead to a variance reduction of the update. Either more powerful probabilistic results, or extra regularity assumptions could allow to further improve our worst case bounds of the variance of the update, which in turn might lead to a bound with an improvement when using heavy-ball momentum.

A.5 Technical lemmas

We first need an updated version of 5.1 that includes momentum.

We use multiple times (22) in this proof, which we repeat here for convenience,

Let us now take an index $0\leq k\leq n-1$ . We show that the contribution of past gradients $G_{n-k}$ and $g_{n-k}$ due to the heavy-ball momentum can be controlled thanks to the decay term $\beta_{1}^{k}$ . Let us first have a look at $B$ . Using (A.13) with

Notice first that for any dimension $i\in[d]$ , $\epsilon+v_{n,i}\geq\epsilon+\beta_{2}^{k}v_{n-k,i}\geq\beta_{2}^{k}(\epsilon+v_{n-k,i})$ , so that

Besides, using the L-smoothness of $F$ given by (8), we have

using Jensen inequality and the fact that $\alpha_{n}$ is non-decreasing. Injecting (A.16) and (A.17) into (A.15), we obtain

Now going back to the $A$ term in (A.14), we will study the main term of the summation, i.e. for $i\in[d]$ and $k<n$

Now turning to $\rho$ , we use (A.13) with

Notice that in A.23, we possibly divide by zero. It suffice to notice that if $r^{2}=0$ then $\delta^{2}=0$ a.s. so that $\rho=0$ and (A.24) is still verified. Summing (A.22) and (A.24), we get

Taking the complete expectation and using that by definition $\epsilon+v_{n,i}\geq\epsilon+\beta_{2}^{k}v_{n-k,i}\geq\beta_{2}^{k}(\epsilon+v_{n-k,i})$ we get

Injecting (A.28) and (A.18) into (A.14) finishes the proof.

Similarly, we will need an updated version of 5.2.

Given that for $l\in[j]$ , we have by definition $\epsilon+b_{j}\geq\epsilon+\beta_{2}^{j-l}b_{l}\geq\beta_{2}^{j-l}(\epsilon+b_{l})$ , we get

Thus, when summing over all $j\in[n]$ , we get

We also need two technical lemmas on the sum of series.

We first need to study the following integral:

where we used the classical integral of the standard Gaussian density function.

where we used (A.33). Given that $\sqrt{-\ln(a)}\geq\sqrt{1-a}$ we obtain (A.32). ∎

A.6 Proof of Adam and Adagrad with momentum

Taking the full expectation and using Lemma A.1,

First looking at $B$ , we have using Lemma A.2,

Then looking at $C$ and introducing the change of index $j=n-l$ ,

using Lemma A.4. Finally, using Lemma A.2, we get

Finally, introducing the same change of index $j=n-k$ for $D$ , we get

using Lemma A.3. Finally, using Lemma 5.2 or equivalently Lemma A.2 with $\beta_{1}=0$ , we get

This is as far as we can get without having to use the specific form of $\alpha_{N}$ given by either (A.2) for Adam or (A.3) for Adagrad. We will now split the proof for either algorithm.

For Adam, using (A.2), we have $\alpha_{n}=(1-\beta_{1})\Omega_{n}\alpha$ . Thus, we can simplify the $A$ term from (A.37), also using the usual change of index $j=n-k$ , to get

If we now introduce $\tau$ as in (A.8), we can first notice that

Further notice that for any coordinate $i\in[d]$ , we have $v_{N,i}\leq\frac{R^{2}}{1-\beta_{2}}$ , besides $\alpha_{N}\leq\alpha\frac{1-\beta_{1}}{\sqrt{1-\beta_{2}}}$ , so that putting together (A.37), (A.46), (A.38), (A.40) and (A.42) we get

For Adagrad, we have $\alpha_{n}=(1-\beta_{1})\alpha$ , $\beta_{2}=1$ and $\Omega_{n}\leq\sqrt{N}$ so that,

Reusing (A.44) and (A.45) from the Adam proof, and introducing $\tau$ as in (9), we immediately have

Further notice that for any coordinate $i\in[d]$ , we have $v_{N}\leq NR^{2}$ , besides $\alpha_{N}=(1-\beta_{1})\alpha$ , so that putting together (A.37), (A.50), (A.38), (A.40) and (A.42) with $\beta_{2}=1$ , we get

A.7 Proof variant using Hölder inequality

Following (Ward et al., 2019; Zou et al., 2019b), it is possible to get rid of the almost sure bound on the gradient given by (7), and replace it with a bound in expectation, i.e.

Looking at the actual proof, we use (6) in a single place: just after (A.35), in order to derive an upper bound for the denominator in the following term:

Appendix B Non convex SGD with heavy-ball momentum

We extend the existing proof of convergence for SGD in the non convex setting to use heavy-ball momentum (Ghadimi & Lan, 2013). Compared with previous work on momentum for non convex SGD byYang et al. (2016), we improve the dependency in $\beta_{1}$ from $O((1-\beta_{1})^{-2})$ to $O((1-\beta_{1})^{-1})$ . A recent work by Liu et al. (2020) achieve a similar dependency of $O(1/(1-\beta_{1}))$ , with weaker assumptions (without the bounded gradients assumptions).

We reuse the notations from Section 2.1. Note however that we use here different assumptions than in Section 2.3. We first assume $F$ is bounded below by $F_{*}$ , that is,

We then assume that the stochastic gradients have bounded variance, and that the gradients of $F$ are uniformly bounded, i.e. there exist $R$ and $\sigma$ so that

B.2 Result

B.3 Analysis

It is possible to achieve a rate of convergence of the form $O(1/\sqrt{N})$ , by taking for any $C>0$ ,

In comparison, Theorem 3 by Yang et al. (2016) would give us, assuming now that $\alpha=(1-\beta_{1})\min\left\{\frac{1}{L},\frac{C}{\sqrt{N}}\right\}$ ,

We observe an overall dependency in $\beta_{1}$ of the form $O((1-\beta_{1})^{-2})$ for Theorem 3 by Yang et al. (2016) , which we improve to $O((1-\beta_{1})^{-1})$ with our proof.

Liu et al. (2020) achieves a similar dependency in $(1-\beta_{1})$ as here, but with weaker assumptions. Indeed, in their Theorem 1, their result contains a term in $O(1/\alpha)$ with $\alpha\leq(1-\beta_{1})M$ for some problem dependent constant $M$ that does not depend on $\beta_{1}$ .

Notice that as the typical size of the update $m_{n}$ will increase with $\beta_{1}$ , by a factor $1/(1-\beta_{1})$ , it is convenient to scale down $\alpha$ by the same factor, as we did with (B.8) (without loss of generality, as $C$ can take any value). Taking $\alpha$ of this form has the advantage of keeping the first term on the right hand side in (B.6) independent of $\beta_{1}$ , allowing us to focus only on the second term.

B.4 Proof

Let $A_{i}=\sum_{q=i}^{Q}a^{q}q$ , we have

Finally, taking $i=0$ and $Q\rightarrow\infty$ gives us the upper bound. ∎

For simplicity, we note $G_{n}=\nabla F(x_{n-1})$ the expected gradient and $g_{n}=\nabla f_{n}(x_{n-1})$ the stochastisc gradient at iteration $n$ .

with $x=G_{n}-G_{n-k}$ , $y=g_{n-k}$ and $\displaystyle\lambda=\frac{1-\beta_{1}}{k\alpha L}$ to the second term in (B.14), and use (B.15) to get

Now let us take $k\in\{0,\ldots,n-1\}$ , first notice that

Now, using (B.12) from Lemma B.2, we obtain

Taking the expectation, and using Lemma B.3 and Lemma B.1, we get

rearranging, and summing over $n\in\{1,\ldots,N\}$ , we get

Let us focus on the $A$ term on the left-hand side first. Introducing the change of index $i=n-k$ , we get

We recognize the unnormalized probability given by the random iterate $\tau$ as defined by (B.5). The normalization constant is

which we can inject into (B.24) to obtain

Injecting (B.25) into (B.23), and using the fact that $F$ is bounded below by $F_{*}$ (B.1), we have

which concludes the proof of Theorem B.1.