On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

Xiangyi Chen, Sijia Liu, Ruoyu Sun, Mingyi Hong

Introduction

First-order optimization has witnessed tremendous progress in the last decade, especially to solve machine learning problems (Bottou et al., 2018). Almost every first-order method obeys the following generic form (Boyd & Vandenberghe, 2004), xt+1=xtαtΔt\mathbf{x}_{t+1}=\mathbf{x}_{t}-\alpha_{t}\boldsymbol{\Delta}_{t}, where xt\mathbf{x}_{t} denotes the solution updated at the ttth iteration for t=1,2,,Tt=1,2,\ldots,T, TT is the number of iterations, Δt\boldsymbol{\Delta}_{t} is a certain (approximate) descent direction, and αt>0\alpha_{t}>0 is some learning rate. The most well-known first-order algorithms are gradient descent (GD) for deterministic optimization (Nesterov, 2013, Cartis et al., 2010) and stochastic gradient descent (SGD) for stochastic optimization (Zinkevich, 2003, Ghadimi & Lan, 2013), where the former determines Δt\boldsymbol{\Delta}_{t} using the full (batch) gradient of an objective function, and the latter uses a simpler but more computationally-efficient stochastic (unbiased) gradient estimate.

Recent works have proposed a variety of accelerated versions of GD and SGD (Nesterov, 2013). These achievements fall into three categories: a) momentum methods (Nesterov, 1983, Polyak, 1964, Ghadimi et al., 2015) which carefully design the descent direction Δt\boldsymbol{\Delta}_{t}; b) adaptive learning rate methods (Becker et al., 1988, Duchi et al., 2011, Zeiler, 2012, Dauphin et al., 2015) which determine good learning rates αt\alpha_{t}, and c) adaptive gradient methods that enjoy dual advantages of a) and b). In particular, Adam (Kingma & Ba, 2014), belonging to the third type of methods, has become extremely popular to solve deep learning problems, e.g., to train deep neural networks. Despite its superior performance in practice, theoretical investigation of Adam-like methods for non-convex optimization is still missing.

Very recently, the work (Reddi et al., 2018) pointed out the convergence issues of Adam even in the convex setting, and proposed AMSGrad, a corrected version of Adam. Although AMSGrad has made a positive step towards understanding the theoretical behavior of adaptive gradient methods, the convergence analysis of (Reddi et al., 2018) was still very restrictive because it only works for convex problems, despite the fact that the most successful applications are for non-convex problems. Apparently, there still exists a large gap between theory and practice. To the best of our knowledge, the question that whether adaptive gradient methods such as Adam, AMSGrad, AdaGrad converge for non-convex problems is still open in theory.

After the non-convergence issue of Adam has been raised in (Reddi et al., 2018), there have been a few recent works on proposing new variants of Adam-type algorithms. In the convex setting, reference (Huang et al., 2018) proposed to stabilize the coordinate-wise weighting factor to ensure convergence. Reference (Chen & Gu, 2018) developed an algorithm that changes the coordinate-wise weighting factor to achieve better generalization performance. Concurrent with this work, several works are trying to understand performance of Adam in non-convex optimization problems. Reference (Basu et al., 2018) provided convergence rate of original Adam and RMSprop under full-batch (deterministic) setting, and (Ward et al., 2018) proved convergence rate of a modified version of AdaGrad where coordinate-wise weighting is removed. Furthermore, the work (Zhou et al., 2018) provided convergence results for AMSGrad that exhibit a tight dependency on problem dimension compared to (Reddi et al., 2018). The works (Zou & Shen, 2018) and (Li & Orabona, 2018) proved that both AdaGrad and its variant (AdaFom) converge to a stationary point with a high probability. The aforementioned works are independent of ours. In particular, our analysis is not only more comprehensive (it covers the analysis of a large family of algorithms in a single framework), but more importantly, it provides insights on how oscillation of stepsizes can affect the convergence rate.

Our work aims to build the theory to understand the behavior for a generic class of adaptive gradient methods for non-convex optimization. In particular, we provide mild sufficient conditions that guarantee the convergence for the Adam-type methods. We summarize our contribution as follows.

\bullet (Generality) We consider a class of generalized Adam, referred to as the “Adam-type”, and we show for the first time that under suitable conditions about the stepsizes and algorithm parameters, this class of algorithms all converge to first-order stationary solutions of the non-convex problem, with O(logT/T)O(\log T/\sqrt{T}) convergence rate. This class includes the recently proposed AMSGrad (Reddi et al., 2018), AdaGrad (Duchi et al., 2011), and stochastic heavy-ball methods as well as two new algorithms explained below.

(AdaFom) Adam adds momentum to both the first and the second moment estimate, but this leads to possible divergence (Reddi et al., 2018). We show that the divergence issue can actually be fixed by a simple variant which adds momentum to only the first moment estimate while using the same second moment estimate as that of AdaGrad, which we call AdaFom (AdaGrad with First Order Moment).

(Constant Momemtum) Our convergence analysis is applicable to the constant momentum parameter setting for AMSGrad and AdaFom. The divergence example of Adam given in (Reddi et al., 2018) is for constant momentum parameter, but the convergence analysis of AMSGrad in (Reddi et al., 2018) is for diminishing momentum parameter. This discrepancy leads to a question whether the convergence of AMSGrad is due to the algorithm form or due to the momentum parameter choice – we show that the constant-momentum version of AMSGrad indeed converges, thus excluding the latter possibility.

\bullet (Practicality) The sufficient conditions we derive are simple and easy to check in practice. They can be used to either certify the convergence of a given algorithm for a class of problem instances, or to track the progress and behavior of a particular realization of an algorithm.

\bullet (Tightness and Insight) We show the conditions are essential and “tight”, in the sense that violating them can make an algorithm diverge. Importantly, our conditions provide insights on how oscillation of a so-called “effective stepsize" (that we define later) can affect the convergence rate of the class of algorithms. We also provide interpretations of the convergence conditions to illustrate why under some circumstances, certain Adam-type algorithms can outperform SGD.

We use z=x/yz=x/y to denote element-wise division if xx and yy are both vectors of size dd; xyx\odot y is element-wise product, x2x^{2} is element-wise square if xx is a vector, x\sqrt{x} is element-wise square root if xx is a vector, (x)j(x)_{j} denotes jjth coordinate of xx, x\|x\| is x2\|x\|_{2} if not otherwise specified. We use [N][N] to denote the set {1,,N}\{1,\cdots,N\}, and use O()O(\cdot), o()o(\cdot), Ω()\Omega(\cdot), ω()\omega(\cdot) as standard asymptotic notations.

Preliminaries and Adam-Type Algorithms

Stochastic optimization is a popular framework for analyzing algorithms in machine learning due to the popularity of mini-batch gradient evaluation. We consider the following generic problem where we are minimizing a function ff, expressed in the expectation form as follows

where ξ\xi is a certain random variable representing randomly selected data sample or random noise.

In a generic first-order optimization algorithm, at a given time tt we have access to an unbiased noisy gradient gtg_{t} of f(x)f(x), evaluated at the current iterate xtx_{t}. The noisy gradient is assumed to be bounded and the noise on the gradient at different time tt is assumed to be independent. An important assumption that we will make throughout this paper is that the function f(x)f(x) is continuously differentiable and has Lipschitz continuous gradient, but could otherwise be a non-convex function. The non-convex assumption represents a major departure from the convexity that has been assumed in recent papers for analyzing Adam-type methods, such as (Kingma & Ba, 2014) and (Reddi et al., 2018).

Our work focuses on the generic form of exponentially weighted stochastic gradient descent method presented in Algorithm 1, for which we name as generalized Adam due to its resemblance to the original Adam algorithm and many of its variants.

Algorithm 1. Generalized Adam S0. Initialize m0=0m_{0}=0 and x1x_{1} For t=1,,Tt=1,\cdots,T, do S1. mt=β1,tmt1+(1β1,t)gtm_{t}=\beta_{1,t}m_{t-1}+(1-\beta_{1,t})g_{t} S2. {\mbox{\hat{v}}}_{t}={h_{t}}(g_{1},g_{2},...,g_{t}) S3. x_{t+1}=x_{t}-\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}} End

We highlight that Algorithm 1 includes many well-known algorithms as special cases. We summarize some popular variants of the generalized Adam algorithm in Table 1.

We present some interesting findings for the algorithms presented in Table 1.

Adam is often regarded as a “momentum version” of AdaGrad, but it is different from AdaFom which is also a momentum version of AdaGrad AdaGrad with first order momentum is also studied in (Zou & Shen, 2018) which appeared online after our first version. The difference lies in the form of {\mbox{\hat{v}}}_{t}. Intuitively, Adam adds momentum to both the first and second order moment estimate, while in AdaFom we only add momentum to the first moment estimate and use the same second moment estimate as AdaGrad. These two methods are related in the following way: if we let β2=11/t\beta_{2}=1-1/t in the expression of v^t\hat{v}_{t} in Adam, we obtain AdaFom. We can view AdaFom as a variant of Adam with an increasing sequence of β2\beta_{2}, or view Adam as a variant of AdaFom with exponentially decaying weights of gt2g_{t}^{2}. However, this small change has large impact on the convergence: we prove that AdaFom can always converge under standard assumptions (see Corollary 3.2) , while Adam is shown to possibly diverge (Reddi et al., 2018).

The convergence of AMSGrad using a fast diminishing β1,t\beta_{1,t} such that β1,tβ1,t1,β1,ttb,b=0\beta_{1,t}\leq\beta_{1,t-1},\beta_{1,t}\xrightarrow[t\to\infty]{}b,b=0 in convex optimization was studied in (Reddi et al., 2018). However, the convergence of the version with constant β1\beta_{1} or strictly positive bb and the version for non-convex setting are unexplored before our work. We notice that an independent work (Zhou et al., 2018) has also proved the convergence of AMSGrad with constant β1\beta_{1}.

It is also worth mentioning that Algorithm 1 can be applied to solve the popular “finite-sum” problems whose objective is a sum of nn individual cost functions. That is,

In the remainder of this paper, we will analyze Algorithm 1 and provide sufficient conditions under which the algorithm converges to first-order stationary solutions with sublinear rate. We will also discuss how our results can be applied to special cases of generalized Adam.

Convergence Analysis for Generalized Adam

The main technical challenge in analyzing the non-convex version of Adam-type algorithms is that the actually used update directions could no longer be unbiased estimates of the true gradients. Furthermore, an additional difficulty is introduced by the involved form of the adaptive learning rate. Therefore the biased gradients have to be carefully analyzed together with the use of the inverse of exponential moving average while adjusting the learning rate. The existing convex analysis (Reddi et al., 2018) does not apply to the non-convex scenario we study for at least two reasons: first, non-convex optimization requires a different convergence criterion, given by stationarity rather than the global optimality; second, we consider constant momentum controlling parameter.

In the following, we formalize the assumptions required in our convergence analysis.

A1: ff is differentiable and has LL-Lipschitz gradient, i.e. x,y\forall x,y, f(x)f(y)Lxy.\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|. It is also lower bounded, i.e. f(x)>f(x^{*})>-\infty where xx^{*} is an optimal solution.

A2: At time tt, the algorithm can access a bounded noisy gradient and the true gradient is bounded, i.e. f(xt)H,gtH,t>1.\|\nabla f(x_{t})\|\leq H,\quad\|g_{t}\|\leq H,\quad\forall t>1.

A3: The noisy gradient is unbiased and the noise is independent, i.e. gt=f(xt)+ζtg_{t}=\nabla f(x_{t})+\zeta_{t}, E[ζt]=0E[\zeta_{t}]=0 and ζi\zeta_{i} is independent of ζj\zeta_{j} if iji\neq j.

Reference (Reddi et al., 2018) uses a similar (but slightly different) assumption as A2, i.e., the bounded elements of the gradient gta\|g_{t}\|_{\infty}\leq a for some finite aa. The bounded norm of f(xt)\nabla f(x_{t}) in A2 is equivalent to Lipschitz continuity of ff (when ff is differentiable) which is a commonly used condition in convergence analysis. This assumption is often satisfied in practice, for example it holds for the finite sum problem (2) when each fif_{i} has bounded gradient, and gt=fi(xt)g_{t}=\nabla f_{i}(x_{t}) where ii is sampled randomly. A3 is also standard in stochastic optimization for analyzing convergence.

Our main result shows that if the coordinate-wise weighting term \sqrt{{\mbox{\hat{v}}}_{t}} in Algorithm 1 is properly chosen, we can ensure the global convergence as well as the sublinear convergence rate of the algorithm (to a first-order stationary solution). First, we characterize how the effective stepsize parameters αt\alpha_{t} and {\mbox{\hat{v}}}_{t} affect convergence of Adam-type algorithms.

Suppose that Assumptions A1-A3 are satisfied, β1\beta_{1} is chosen such that β1β1,t\beta_{1}\geq\beta_{1,t}, β1,t[0,1)\beta_{1,t}\in[0,1) is non-increasing, and for some constant G>0G>0, \left\|\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\right\|\leq G,\;\forall~{}t. Then Algorithm 1 yields

where C1,C2,C3C_{1},C_{2},C_{3} are constants independent of dd and TT, C4C_{4} is a constant independent of TT, the expectation is taken with respect to all the randomness corresponding to {gt}\{g_{t}\}.

Further, let \gamma_{t}:=\min_{j\in[d]}\min_{\{g_{i}\}_{i=1}^{t}}\alpha_{t}/(\sqrt{{\mbox{\hat{v}}}_{t}})_{j} denote the minimum possible value of effective stepsize at time tt over all possible coordinate and past gradients {gi}i=1t\{g_{i}\}_{i=1}^{t}. Then the convergence rate of Algorithm 1 is given by

where s1(T)s_{1}(T) is defined through the upper bound of RHS of (3), namely, O(s1(T))O(s_{1}(T)), and t=1Tγt=Ω(s2(T))\sum_{t=1}^{T}\gamma_{t}=\Omega(s_{2}(T)).

In Theorem 3.1, \|\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\|\leq G is a mild condition. Roughly speaking, it implies that the change of xtx_{t} at each each iteration should be finite. As will be evident later, with gtH\|g_{t}\|\leq H, the condition \|\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\|\leq G is automatically satisfied for both AdaGrad and AMSGrad. Besides, instead of bounding the minimum norm of f\nabla f in (4), we can also apply a probabilistic output (e.g., select an output xR\mathbf{x}_{R} with probability p(R=t)=γtt=1Tγtp(R=t)=\frac{\gamma_{t}}{\sum_{t=1}^{T}\gamma_{t}}) to bound E[f(xR)2]E[\|\nabla f(x_{R})\|^{2}] (Ghadimi & Lan, 2013, Lei et al., 2017). It is worth mentioning that a small number ϵ\epsilon could be added to v^t\hat{v}_{t} for ensuring the numerical stability. In this case, our Theorem 3.1 still holds given the fact the resulting algorithm is still a special case of Algorithm 1. Accordingly, our convergence results for AMSGrad and AdaFom that will be derived later also hold as \|\alpha_{t}m_{t}/(\sqrt{{\mbox{\hat{v}}}_{t}}+\epsilon)\|\leq\|\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\|\leq G when ϵ\epsilon is added to {\mbox{\hat{v}}}_{t}. We will provide a detailed explanation of Theorem 3.1 in Section 3.1.

Theorem 3.1 implies a sufficient condition that guarantees convergence of the Adam-type methods: s1(T)s_{1}(T) grows slower than s2(T)s_{2}(T). We will show in Section 3.2 that the rate s1(T)s_{1}(T) can be dominated by different terms in different cases, i.e. the non-constant quantities Term A and B below

From (4) in Theorem 3.1, it is evident that s1(T)=o(s2(T))s_{1}(T)=o(s_{2}(T)) can ensure proper convergence of the algorithm. This requirement has some important implications, which we discuss below.

\bullet (The Bounds for s1(T)s_{1}(T) and s2(T)s_{2}(T)) First, the requirement that s1(T)=o(s2(T))s_{1}(T)=o(s_{2}(T)) implies that E[\sum_{t=1}^{T}\|\alpha_{t}g_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\|^{2}]=o(\sum_{t=1}^{T}\gamma_{t}). This is a common condition generalized from SGD. Term A in (5) is a generalization of the term t=1Tαtgt2\sum_{t=1}^{T}\|\alpha_{t}g_{t}\|^{2} for SGD (where {αt}\{\alpha_{t}\} is the stepsize sequence for SGD), and it quantifies possible increase in the objective function brought by higher order curvature. The term t=1Tγt\sum_{t=1}^{T}\gamma_{t} is the lower bound on the summation of effective stepsizes, which reduces to t=1Tαt\sum_{t=1}^{T}\alpha_{t} when Algorithm 1 is simplified to SGD.

\bullet (Oscillation of Effective Stepsizes) Term B in (5) characterizes the oscillation of effective stepsizes \alpha_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}. In our analysis such an oscillation term upper bounds the expected possible ascent in objective induced by skewed update direction g_{t}/\sqrt{{\mbox{\hat{v}}}_{t}} (“skewed” in the sense that E[g_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}] is not parallel with f(xt)\nabla f(x_{t})), therefore it cannot be too large. Bounding this term is critical, and to demonstrate this fact, in Section 3.2.2 we show that large oscillation can result in non-convergence of Adam for even simple unconstrained non-convex problems.

\bullet (Advantage of Adaptive Gradient). One possible benefit of adaptive gradient methods can be seen from Term A. When this term dominates the convergence speed in Theorem 3.1, it is possible that proper design of {\mbox{\hat{v}}}_{t} can help reduce this quantity compared with SGD (An example is provided in Appendix 6.1.1 to further illustrate this fact.) in certain cases. Intuitively, adaptive gradient methods like AMSGrad can provide a flexible choice of stepsizes, since {\mbox{\hat{v}}}_{t} can have a normalization effect to reduce oscillation and overshoot introduced by large stepsizes. At the same time, flexibility of stepsizes makes the hyperparameter tuning of an algorithm easier in practice.

2 Tightness of the rate bound (4)

In the next, we show our bound (4) is tight in the sense that there exist problems satisfying Assumption 1 such that certain algorithms belonging to the class of Algorithm 1 can diverge due to the high growth rate of Term A or Term B.

We demonstrate the importance of Term A in this subsection. Consider a simple one-dimensional optimization problem minxf(x)\min_{x}f(x), with f(x)=100x2f(x)=100x^{2} if x<=b|x|<=b, and f(x)=200bx100b2f(x)=200b|x|-100b^{2} if x>b|x|>b, where b=10b=10. In Figure 1, we show the growth rate of different terms given in Theorem 3.1, where α00\alpha_{0}\triangleq 0, αt=0.01\alpha_{t}=0.01 for t1t\geq 1, and β1,t=0,β2,t=0.9\beta_{1,t}=0,\beta_{2,t}=0.9 for both Adam and AMSGrad. We observe that both SGD and Adam are not converging to a stationary solution (x=0x=0), which is because \sum_{t=1}^{T}\left\|\alpha_{t}g_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\right\|^{2} grows with the same rate as accumulation of effective stepsizes as shown in the figure. Actually, SGD only converges when αt<0.01\alpha_{t}<0.01 and our theory provides an perspective of why SGD diverges when αt0.01\alpha_{t}\geq 0.01. In the example, Adam is also not converging to 0 due to Term A. From our observation, Adam oscillates for any constant stepsize within [104,0.1][10^{-4},0.1] for this problem and Term A always ends up growing as fast as accumulation of effective stepsizes, which implies Adam only converges with diminishing stepsizes even in non-stochastic optimization. In contrast to SGD and Adam, AMSGrad converges in this case since both Term A and Term B grow slower than accumulation of effective stepsizes. For AMSGrad and Adam, {\mbox{\hat{v}}}_{t} has a strong normalization effect and it allows the algorithm to use a larger range of αt\alpha_{t}. The practical benefit of this flexible choice of stepsizes is easier hyperparameter tuning, which is consistent with the impression of practitioners about the original Adam. We present more experimental results in Appendix 6.1.1 accompanied with more detailed discussions.

2.2 Non-convergence of Adam due to effect of Term B

Next, we use an example to demonstrate the importance of the Term B for the convergence of Adam-type algorithms.

Consider optimization problem minxf(x)=i=111fi(x)\min_{x}f(x)=\sum_{i=1}^{11}f_{i}(x) where

3 Convergence of AMSGrad and AdaFom

Theorem 3.1 provides a general approach for the design of the weighting sequence \{{\mbox{\hat{v}}}_{t}\} and the convergence analysis of Adam-type algorithms. For example, SGD specified by Table 1 with stepsizes αt=1/t\alpha_{t}=1/\sqrt{t} yields O(logT/T)O(\log T/\sqrt{T}) convergence speed by Theorem 3.1. Moreover, the explanation on the non-convergence of Adam in (Reddi et al., 2018) is consistent with our analysis in Section 3.2. That is, Term B in (5) can grow as fast as s2(T)s_{2}(T) so that s1(T)/s2(T)s_{1}(T)/s_{2}(T) becomes a constant. Further, we notice that Term A in (5) can also make Adam diverge which is unnoticed before. Aside from checking convergence of an algorithm, Theorem 3.1 can also provide convergence rates of AdaGrad and AMSGrad, which will be given as corollaries later.

Our proposed convergence rate of AMSGrad matches the result in (Reddi et al., 2018) for stochastic convex optimization. However, the analysis of AMSGrad in (Reddi et al., 2018) is constrained to diminishing momentum controlling parameter β1,t\beta_{1,t}. Instead, our analysis is applicable to the more popular constant momentum parameter, leading to a more general non-increasing parameter setting.

In Corollary 3.1 and Corollary 3.2, we derive the convergence rates of AMSGrad (Algorithm 3 in Appendix 6.2.3) and AdaFom (Algorithm 4 in Appendix 6.2.4), respectively. Note that AdaFom is more general than AdaGrad since when β1,t=0\beta_{1,t}=0, AdaFom becomes AdaGrad.

Assume c>0\exists\,c>0 such that (g1)ic,i[d]|(g_{1})_{i}|\geq c,\forall i\in[d], for AMSGrad (Algorithm 3 in Appendix 6.2.3) with β1,tβ1[0,1)\beta_{1,t}\leq\beta_{1}\in[0,1) and β1,t\beta_{1,t} is non-increasing, αt=1/t\alpha_{t}=1/\sqrt{t}, we have for any TT,

where Q1Q_{1} and Q2Q_{2} are two constants independent of TT.

Proof: See Appendix 6.2.3. \blacksquare

Assume c>0\exists\,c>0 such that (g1)ic,i[d]|(g_{1})_{i}|\geq c,\forall i\in[d], for AdaFom (Algorithm 4 in Appendix 6.2.4) with β1,tβ1[0,1)\beta_{1,t}\leq\beta_{1}\in[0,1) and β1,t\beta_{1,t} is non-increasing, αt=1/t\alpha_{t}=1/\sqrt{t}, we have for any TT,

where Q1Q_{1}^{\prime} and Q2Q_{2}^{\prime} are two constants independent of TT.

Proof: See Appendix 6.2.4. \blacksquare

The assumption (g1)ic,i|(g_{1})_{i}|\geq c,\forall i is a mild assumption and it is used to ensure {\mbox{\hat{v}}}_{1}\geq r for some constant rr. It is also usually needed in practice for numerical stability (for AMSGrad and AdaGrad, if (g1)i=0(g_{1})_{i}=0 for some ii, division by 0 error may happen at the first iteration). In some implementations, to avoid numerical instability, the update rule of algorithms like Adam, AMSGrad, and AdaGrad take the form of x_{t+1}=x_{t}-\alpha_{t}m_{t}/(\sqrt{{\mbox{\hat{v}}}_{t}}+\epsilon) with ϵ\epsilon being a positive number. These modified algorithms still fall into the framework of Algorithm 1 since ϵ\epsilon can be incorporated into the definition of {\mbox{\hat{v}}}_{t}. Meanwhile, our convergence proof for Corollary 3.1 and Corollary 3.2 can go through without assuming (g1)ic,i|(g_{1})_{i}|\geq c,\forall i because \sqrt{{\mbox{\hat{v}}}_{t}}\geq\epsilon. In addition, ϵ\epsilon can affect the worst case convergence rate by a constant factor in the analysis.

We remark that the derived convergence rate of AMSGrad and AdaFom involves an additional logT\log T factor compared to the fastest rate of first order methods (1/T1/\sqrt{T}). However, such a slowdown can be mitigated by choosing an appropriate stepsize. To be specific, the logT\log T factor for AMSGrad would be eliminated when we adopt a constant rather than diminishing stepsize, e.g., αt=1/T\alpha_{t}=1/\sqrt{T}. It is also worth mentioning that our theoretical analysis focuses on the convergence rate of adaptive methods in the worst case for nonconvex optimization. Thus, a sharper convergence analysis that can quantify the benefits of adaptive methods still remains an open question in theory.

Empirical performance of Adam-type algorithms on MNIST

In this section, we compare the empirical performance of Adam-type algorithms, including AMSGrad, Adam, AdaFom and AdaGrad, on training two convolutional neural networks (CNNs). In the first example, we train a CNN of 33 convolutional layers and 22 fully-connected layers on MNIST. In the second example, we train a CIFARNET on CIFAR-10. We refer readers to Appendix 6.1.2 for more details on the network model and the parameter setting.

In Figure 3, we present the training loss and the classification accuracy of Adam-type algorithms versus the number of iterations. As we can see, AMSGrad performs quite similarly to Adam which confirms the result in (Reddi et al., 2018). The performance of AdaGrad is worse than other algorithms, because of the lack of momentum and/or the significantly different choice of {\mbox{\hat{v}}}_{t}. We also observe that the performance of AdaFom lies between AMSGrad/Adam and AdaGrad. This is not surprising, since AdaFom can be regarded as a momentum version of AdaGrad but uses a simpler adaptive learning rate (independent on β2\beta_{2}) compared to AMSGrad/Adam. In Figure 4, we consider to train a larger network (CIFARNET) on CIFAR-10. As we can see, Adam and AMSGrad perform similarly and yield the best accuracy. AdaFom outperforms AdaGrad in both training and testing, which agrees with the results obtained in the MNIST experiment.

Conclusion and Discussion

We provided some mild conditions to ensure convergence of a class of Adam-type algorithms, which includes Adam, AMSGrad, AdaGrad, AdaFom, SGD, SGD with momentum as special cases. Apart from providing general convergence guarantees for algorithms, our conditions can also be checked in practice to monitor empirical convergence. To the best of our knowledge, the convergence of Adam-type algorithm for non-convex problems was unknown before. We also provide insights on how oscillation of effective stepsizes can affect convergence rate for the class of algorithms which could be beneficial for the design of future algorithms. This paper focuses on unconstrained non-convex optimization problems, and one future direction is to study a more general setting of constrained non-convex optimization.

This work was supported by the MIT-IBM Watson AI Lab. Mingyi Hong and Xiangyi Chen are supported partly by an NSF grant CMMI-1727757,and by an AFOSR grant 15RT0767.

References

Appendix

Momentum methods take into account the history of first-order information (Nesterov, 2013; 1983, Nemirovskii et al., 1983, Ghadimi & Lan, 2016, Polyak, 1964, Ghadimi et al., 2015, Ochs et al., 2015, Yang et al., 2016, Johnson & Zhang, 2013, Reddi et al., 2016, Lei et al., 2017). A well-known method, called Nesterov’s accelerated gradient (NAG) originally designed for convex deterministic optimization (Nesterov, 2013; 1983, Nemirovskii et al., 1983), constructs the descent direction Δt\boldsymbol{\Delta}_{t} using the difference between the current iterate and the previous iterate. A recent work (Ghadimi & Lan, 2016) studied a generalization of NAG for non-convex stochastic programming. Similar in spirit to NAG, heavy-ball (HB) methods (Polyak, 1964, Ghadimi et al., 2015, Ochs et al., 2015, Yang et al., 2016) form the descent direction vector through a decaying sum of the previous gradient information. In addition to NAG and HB methods, stochastic variance reduced gradient (SVRG) methods integrate SGD with GD to acquire a hybrid descent direction of reduced variance (Johnson & Zhang, 2013, Reddi et al., 2016, Lei et al., 2017). Recently, certain accelerated version of perturbed gradient descent (PAGD) algorithm is also proposed in (Jin et al., 2017), which shows the fastest convergence rate among all Hessian free algorithms.

Adaptive learning rate methods accelerate ordinary SGD by using knowledge of the past gradients or second-order information into the current learning rate αt\alpha_{t} (Becker et al., 1988, Duchi et al., 2011, Zeiler, 2012, Dauphin et al., 2015). In (Becker et al., 1988), the diagonal elements of the Hessian matrix were used to penalize a constant learning rate. However, acquiring the second-order information is computationally prohibitive. More recently, an adaptive subgradient method (i.e., AdaGrad) penalized the current gradient by dividing the square root of averaging of the squared gradient coordinates in earlier iterations (Duchi et al., 2011). Although AdaGrad works well when gradients are sparse, its convergence is only analyzed in the convex world. Other adaptive learning rate methods include Adadelta (Zeiler, 2012) and ESGD (Dauphin et al., 2015), which lacked theoretical investigation although some convergence improvement was shown in practice.

Adaptive gradient methods update the descent direction and the learning rate simultaneously using knowledge in the past, and thus enjoy dual advantages of momentum and adaptive learning rate methods. Algorithms of this family include RMSProp (Tieleman & Hinton, 2012), Nadam (Dozat, 2016), and Adam (Kingma & Ba, 2014). Among these, Adam has become the most widely-used method to train deep neural networks (DNNs). Specifically, Adam adopts exponential moving averages (with decaying/forgetting factors) of the past gradients to update the descent direction. It also uses inverse of exponential moving average of squared past gradients to adjust the learning rate. The work (Kingma & Ba, 2014) showed Adam converges with at most O(1/T)O(1/\sqrt{T}) rate for convex problems. However, the recent work (Reddi et al., 2018) pointed out the convergence issues of Adam even in the convex setting, and proposed a modified version of Adam (i.e., AMSGrad), which utilizes a non-increasing quadratic normalization and avoids the pitfalls of Adam. Although AMSGrad has made a significant progress toward understanding the theoretical behavior of adaptive gradient methods, the convergence analysis of (Reddi et al., 2018) only works for convex problems.

In this section, we provide some additional experiments to demonstrate how specific Adam-type algorithms can perform better than SGD and how SGD can out perform Adam-type algorithms in different situations.

One possible benefit of adaptive gradient methods is the “sparse noise reduction” effect pointed out in Bernstein et al. (2018). Below we illustrate another possible practical advantage of adaptive gradient methods when applied to solve non-convex problems, which we refer to as flexibility of stepsizes.

To highlight ideas, let us take AMSGrad as an example, and compare it with SGD. First, in non-convex problems there can be multiple valleys with different curvatures. When using fixed stepsizes (or even a slowly diminishing stepsize), SGD can only converge to local optima in valleys with small curvature while AMSGrad and some other adaptive gradient algorithms can potentially converge to optima in valleys with relative high curvature (this may not be beneficial if one don’t want to converge to a sharp local minimum). Second, the flexible choice of stepsizes implies less hyperparameter tuning and this coincides with the popular impression about original Adam.

We empirically demonstrate the flexible stepsizes property of AMSGrad using a deterministic quadratic problem. Consider a toy optimization problem minxf(x),f(x)=100x2\min_{x}f(x),f(x)=100x^{2}, the gradient is given by 200x200x. For SGD (which reduces to gradient descent in this case) to converge, we must have αt<0.01\alpha_{t}<0.01; for AMSGrad, {\mbox{\hat{v}}}_{t} has a strong normalization effect and it allows the algorithm to use larger αt\alpha_{t}’s. We show the growth rate of different terms given in Theorem 3.1 for different stepsizes in Figure A1 to Figure A4 (where we choose β1,t=0,β2,t=0.9\beta_{1,t}=0,\beta_{2,t}=0.9 for both Adam and AMSGrad). In Figure A1, αt=0.1\alpha_{t}=0.1 and SGD diverges due to large αt\alpha_{t}, AMSGrad converges in this case, Adam is oscillating between two non-zero points. In Figure A2, stepsizes αt\alpha_{t} is set to 0.01, SGD and Adam are oscillating, AMSGrad converges to 0. For Figure A3, SGD converges to 0 and AMSGrad is converging slower than SGD due to its smaller effective stepsizes, Adam is oscillating. One may wonder how diminishing stepsizes affects performance of the algorithms, this is shown in Figure A4 where αt=0.1/t\alpha_{t}=0.1/\sqrt{t}, we can see SGD is diverging until stepsizes is small, AMSGrad is converging all the time, Adam appears to get stuck but it is actually converging very slowly due to diminishing stepsizes. This example shows AMSGrad can converge with a larger range of stepsizes compared with SGD.

From the figures, we can see that the term \sum_{t=1}^{T}\|\alpha_{t}g_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\|^{2} is the key quantity that limits the convergence speed of algorithms in this case. In Figure A1, Figure A2, and early stage of Figure A4, the quantity is obviously a good sign of convergence speed. In Figure A3, since the difference of quantity between AMSGrad and SGD is compensated by the larger effective stepsizes of SGD and some problem independent constant, SGD converges faster. In fact, Figure A3 provides a case where AMSGrad does not perform well. Note that the normalization factor \sqrt{{\mbox{\hat{v}}}_{t}} can be understood as imitating the largest Lipschitz constant along the way of optimization, so generally speaking dividing by this number makes the algorithm converge easier. However when the Lipschitz constant becomes smaller locally around a local optimal point, the stepsizes choice of AMSGrad dictates that \sqrt{{\mbox{\hat{v}}}_{t}} does not change, resulting a small effective stepsizes. This could be mitigated by AdaGrad and its momentum variants which allows {\mbox{\hat{v}}}_{t} to decrease when gtg_{t} keeps decreasing.

1.2 Details of experiments on MNIST and CIFAR-10

In the experiment on MNIST, we consider a convolutional neural network (CNN), which includes 33 convolutional layers and 22 fully-connected layers. In convolutional layers, we adopt filters of sizes 6×6×16\times 6\times 1 (with stride 11), 5×5×65\times 5\times 6 (with stride 22), and 6×6×126\times 6\times 12 (with stride 22), respectively. In both AMSGradWe customized our algorithms based on the open source code https://github.com/taki0112/AMSGrad-Tensorflow. and Adam, we set β1=0.9\beta_{1}=0.9 and β2=0.99\beta_{2}=0.99. In AdaFom, we set β1=0.9\beta_{1}=0.9. We choose 5050 as the mini-batch size and the stepsize is choose to be αt=0.0001+0.003et/2000\alpha_{t}=0.0001+0.003e^{-t/2000}.

The architecture of the CIFARNET that we are using is as below. The model starts with two convolutional layers with 32 and 64 kernels of size 3 x 3, followed by 2 x 2 max pooling and dropout with keep probability 0.25. The next layers are two convolutional layers with 128 kernels of size 3 x 3 and 2 x 2, respectively. Each of the two convolutional layers is followed by a 2 x 2 max pooling layer. The last layer is a fully connected layer with 1500 nodes. Dropout with keep probability 0.25 is added between the fully connected layer and the convolutional layer. All convolutional layers use ReLU activation and stride 1. The learning rate αt\alpha_{t} of Adam and AMSGrad starts with 0.001 and decrease 10 times every 20 epochs. The learning rate of AdaGrad and AdaFom starts with 0.05 and decreases to 0.001 after 20 epochs and to 0.0001 after 40 epochs. These learning rates are tuned so that each algorithm has its best performance.

2 Convergence proof for Generalized Adam (Algorithm 1)

In this section, we present the convergence proof of Algorithm 1. We will first give several lemmas prior to proving Theorem 3.1.

Let x0x1x_{0}\triangleq x_{1} in Algorithm 1, consider the sequence

Proof. [Proof of Lemma 6.1] By the update rules S1-S3 in Algorithm 1, we have when t>1t>1,

Since xt+1xt=(1β1,t)xt+1+β1,t(xt+1xt)(1β1,t)xtx_{t+1}-x_{t}=(1-\beta_{1,t})x_{t+1}+\beta_{1,t}(x_{t+1}-x_{t})-(1-\beta_{1,t})x_{t}, based on (10) we have

Divide both sides by 1β1,t1-\beta_{1,t}, we have

where the second equality is due to x_{t+1}-x_{t}=-\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}.

For t=1t=1, we have z1=x1z_{1}=x_{1} (due to x1=x0x_{1}=x_{0}), and

where the forth equality holds due to (S1) and (S3) of Algorithm 1.

Without loss of generality, we initialize Algorithm 1 as below to simplify our analysis in what follows,

Suppose that the conditions in Theorem 3.1 hold, then

Proof. [Proof of Lemma 6.2] By the Lipschitz smoothness of f\nabla f, we obtain

where dt=zt+1ztd_{t}=z_{t+1}-z_{t}, and Lemma 6.1 together with (12) yield

where {Ti}\{T_{i}\} have been defined in (14)-(19). Further, using inequality a+b+c23a2+3b2+3c2\|a+b+c\|^{2}\leq 3\|a\|^{2}+3\|b\|^{2}+3\|c\|^{2} and (6.2.1), we have

Substituting the above inequality into (6.2.1), we then obtain (13). Q.E.D.

The next series of lemmas separately bound the terms on RHS of (13).

Suppose that the conditions in Theorem 3.1 hold, T1T_{1} in (14) can be bounded as

Proof. [Proof of Lemma 6.3] Since gtH\|g_{t}\|\leq H, by the update rule of mtm_{t}, we have mtH\|m_{t}\|\leq H, this can be proved by induction as below.

Recall that mt=β1,tmt1+(1β1,t)gtm_{t}=\beta_{1,t}m_{t-1}+(1-\beta_{1,t})g_{t}, suppose mt1H\|m_{t-1}\|\leq H, we have

then since m0=0m_{0}=0, we have m0H\|m_{0}\|\leq H which completes the induction.

where the first equality holds due to (12), and the last inequality is due to β1β1,i\beta_{1}\geq\beta_{1,i}.

Suppose the conditions in Theorem 3.1 hold. For T3T_{3} in (16), we have

where the first inequality is due toa,b12(a2+b2)\langle a,b\rangle\leq\frac{1}{2}(\|a\|^{2}+\|b\|^{2}), the second inequality is using due to upper bound on f(xt)H\|\nabla f(x_{t})\|\leq H and \|\alpha_{i}m_{i}/\sqrt{{\mbox{\hat{v}}}_{i}}\|\leq G given by the assumptions in Theorem 3.1, the third equality is because β1,tβ1\beta_{1,t}\leq\beta_{1} and β1,t\beta_{1,t} is non-increasing, the last inequality is due to telescope sum.

Suppose the assumptions in Theorem 3.1 hold. For T4T_{4} in (17), we have

Proof. [Proof of Lemma 6.5] The proof is similar to the previous lemma.

where the first inequality is due to αtmt/vtG\|\alpha_{t}m_{t}/\sqrt{v_{t}}\|\leq G by our assumptions, the second inequality is due to non-decreasing property of β1,t\beta_{1,t} and β1β1,t\beta_{1}\geq\beta_{1,t}, the last inequality is due to telescoping sum.

Suppose the assumptions in Theorem 3.1 hold. For T5T_{5} in (18), we have

where the fist inequality is due to β1β1,t\beta_{1}\geq\beta_{1,t} and (12), the second inequality is due to mi<H\|m_{i}\|<H.

Suppose the assumptions in Theorem 3.1 hold. For T2T_{2} in (15), we have

Proof. [Proof of Lemma 6.7] Recall from the definition (9), we have

Further we have z1=x1z_{1}=x_{1} by definition of z1z_{1}. We have

The second term of (26) can be bounded as

where the first inequality is because a,b12(a2+b2)\langle a,b\rangle\leq\frac{1}{2}\left(\|a\|^{2}+\|b\|^{2}\right) and the fact that z1=x1z_{1}=x_{1}, the second inequality is because

We next bound the T7T_{7} in (28), by update rule mi=β1,imi1+(1β1,igi)m_{i}=\beta_{1,i}m_{i-1}+(1-\beta_{1,i}g_{i}), we have mi=k=1i[(l=k+1iβ1,l)(1β1,k)gk]m_{i}=\sum_{k=1}^{i}[(\prod_{l=k+1}^{i}\beta_{1,l})(1-\beta_{1,k})g_{k}]. Based on that, we obtain

where the first inequality is due to β1,tβ1\beta_{1,t}\leq\beta_{1}, the second equality is by substituting expression of mtm_{t}, the last inequality is because (a+b)22(a2+b2)(a+b)^{2}\leq 2(\|a\|^{2}+\|b\|^{2}), and we have introduced T8T_{8} and T9T_{9} for ease of notation.

In (6.2.1), we first bound T8T_{8} as below

where (i)(i) is due to ab<12(a2+b2)ab<\frac{1}{2}(a^{2}+b^{2}) and follows from β1,tβ1\beta_{1,t}\leq\beta_{1} and β1,t[0,1)\beta_{1,t}\in[0,1), (ii)(ii) is due to symmetry of pp and kk in the summation, (iii)(iii) is because of p=1i1(β1i1p)11β1\sum_{p=1}^{i-1}\left(\beta_{1}^{i-1-p}\right)\leq\frac{1}{1-\beta_{1}}, (iv)(iv) is exchanging order of summation, and the second-last inequality is due to the similar reason as (iii)(iii).

where the first inequality holds due to β1,k<1\beta_{1,k}<1 and (gk)jH|(g_{k})_{j}|\leq H, the second inequality holds due to β1,kβ1\beta_{1,k}\leq\beta_{1}, and the last inequality applied the triangle inequality. For RHS of (6.2.1), using Lemma 6.8 (that will be proved later) with a_{i}=\left|\frac{\alpha_{i}}{\sqrt{{\mbox{\hat{v}}}_{i}}}-\frac{\alpha_{i-1}}{\sqrt{{\mbox{\hat{v}}}_{i-1}}}\right|_{j}, we further have

Based on (6.2.1), (6.2.1), (6.2.1) and (6.2.1), we can then bound the second term of (26) as

Let us turn to the first term in (26). Reparameterize gtg_{t} as gt=f(xt)+δtg_{t}=\nabla f(x_{t})+\delta_{t} with E[δt]=0E[\delta_{t}]=0, we have

It can be seen that the first term in RHS of (34) is the desired descent quantity, the second term is a bias term to be bounded. For the second term in RHS of (34), we have

where the last equation is because given x_{i},{\mbox{\hat{v}}}_{i-1}, E\left[\delta_{i}\odot(1/\sqrt{{\mbox{\hat{v}}}_{i-1}})|x_{i},{\mbox{\hat{v}}}_{i-1}\right]=0 and δi2H\|\delta_{i}\|\leq 2H due to giH\|g_{i}\|\leq H and f(xi)H\|\nabla f(x_{i})\|\leq H based on Assumptions A2 and A3. Further, we have

Substituting (6.2.1) and (6.2.1) into (34), we then bound the first term of (26) as

We finally apply (6.2.1) and (6.2.1) to obtain (6.7). The proof is now complete. Q.E.D.{\bf Q.E.D.}

For ai0a_{i}\geq 0, β[0,1)\beta\in[0,1), and bi=k=1iβikl=k+1ialb_{i}=\sum_{k=1}^{i}\beta^{i-k}\sum_{l=k+1}^{i}a_{l}, we have

Proof. [Proof of Lemma 6.8] The result is proved by following

where (i)(i) is by changing order of summation, (ii)(ii) is due to k=1l1βl1k11β\sum_{k=1}^{l-1}\beta^{l-1-k}\leq\frac{1}{1-\beta}, (iii)(iii) is by the fact that ab12(a2+b2)ab\leq\frac{1}{2}(a^{2}+b^{2}), (iv)(iv) is due to symmetry of ala_{l} and ama_{m} in the summation, (v)(v) is because m=2iβim+1β1β\sum_{m=2}^{i}\beta^{i-m+1}\leq\frac{\beta}{1-\beta} and the last inequality is for similar reason.

2.2 Proof of Theorem 3.1

Proof. [Proof of Theorem 3.1] We combine Lemma 6.2, Lemma 6.3, Lemma 6.4, Lemma 6.5, Lemma 6.6, and Lemma 6.7 to bound the overall expected descent of the objective. First, from Lemma 6.2, we have

Then from above inequality and Lemma 6.3, Lemma 6.4, Lemma 6.5, Lemma 6.6, Lemma 6.7, we get

By merging similar terms in above inequality, we further have

and zz^{*} is an optimal of ff, i.e. zarg minzf(z)z^{*}\in\operatornamewithlimits{arg\,min}_{z}{f(z)}.

Using the fact that (\alpha_{i}/\sqrt{{\mbox{\hat{v}}}_{i}})_{j}\geq\gamma_{i},\forall j by definition, inequality (4) directly follows.

2.3 Proof of Corollary 3.1

Algorithm 3. AMSGrad (S0). Define m0=0m_{0}=0, v0=0v_{0}=0, {\mbox{\hat{v}}}_{0}=0; For t=1,,Tt=1,\cdots,T, do (S1). mt=β1,tmt1+(1β1,t)gtm_{t}=\beta_{1,t}m_{t-1}+(1-\beta_{1,t})g_{t} (S2). vt=β2vt1+(1β2)gt2v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g_{t}^{2} (S3). {\mbox{\hat{v}}}_{t}=\max\{{\mbox{\hat{v}}}_{t-1},v_{t}\} (S4). x_{t+1}=x_{t}-\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}} End

We first bound non-constant terms in RHS of (3), which is given by

For the term with C1C_{1}, assume \min_{j\in[d]}{(\sqrt{{\mbox{\hat{v}}}_{1}})_{j}}\geq c>0 (this is natural since if it is 0, division by 0 error will happen), we have

where the first inequality is due to ({\mbox{\hat{v}}}_{t})_{j}\geq({\mbox{\hat{v}}}_{t-1})_{j}, and the last inequality is due to t=1T1/t1+logT\sum_{t=1}^{T}1/t\leq 1+\log T.

where the first equality is due to({\mbox{\hat{v}}}_{t})_{j}\geq({\mbox{\hat{v}}}_{t-1})_{j} and αtαt1\alpha_{t}\leq\alpha_{t-1}, and the second equality is due to telescope sum.

where the first inequality is due to |(\alpha_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}-\alpha_{t-1}/\sqrt{{\mbox{\hat{v}}}_{t-1}})_{j}|\leq 1/c.

Now we lower bound the effective stepsizes, since {\mbox{\hat{v}}}_{t} is exponential moving average of gt2g_{t}^{2} and gtH\|g_{t}\|\leq H, we have ({\mbox{\hat{v}}}_{t})_{j}\leq H^{2}, we have

One more thing is to verify the assumption \|\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\|\leq G in Theorem 3.1, since {\alpha_{t+1}}/{(\sqrt{{\mbox{\hat{v}}}_{t+1}})_{j}}\leq{\alpha_{t}}/{(\sqrt{{\mbox{\hat{v}}}_{t}})_{j}} and {\alpha_{1}}/{(\sqrt{{\mbox{\hat{v}}}_{1}})_{j}}\leq 1/c in the algorithm, we have \|\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\|\leq\|m_{t}\|/c\leq H/c.

2.4 Proof of Corollary 3.2

Algorithm 4. AdaFom (S0). Define m0=0m_{0}=0, {\mbox{\hat{v}}}_{0}=0; For t=1,,Tt=1,\cdots,T, do (S1). mt=β1,tmt1+(1β1,t)gtm_{t}=\beta_{1,t}m_{t-1}+(1-\beta_{1,t})g_{t} (S2). {\mbox{\hat{v}}}_{t}=(1-1/t){\mbox{\hat{v}}}_{t-1}+(1/t)g_{t}^{2} (S3). x_{t+1}=x_{t}-\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}} End

The proof is similar to proof for Corollary 3.1, first let’s bound RHS of (3) which is

We recall from Table 1 that in AdaGrad, {\mbox{\hat{v}}}_{t}=\frac{1}{t}\sum_{i=1}^{t}g_{t}^{2}. Thus, when αt=1/t\alpha_{t}=1/\sqrt{t}, we obtain \alpha_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}=1/\sum_{i=1}^{t}g_{i}^{2}. We assume minj[d](g1)jc>0\min_{j\in[d]}{|(g_{1})_{j}|}\geq c>0, which is equivalent to \min_{j\in[d]}(\sqrt{{\mbox{\hat{v}}}_{1}})_{j}\geq c>0 (a requirement of the AdaGrad). For C1C_{1} term we have

where the third inequality used Lemma 6.9 and the last inequality used gtH\|g_{t}\|\leq H and minj[d](g1)jc>0\min_{j\in[d]}{|(g_{1})_{j}|}\geq c>0.

where the first inequality is due to |(\alpha_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}-\alpha_{t-1}/\sqrt{{\mbox{\hat{v}}}_{t-1}})_{j}|\leq 1/c.

Now we lower bound the effective stepsizes \alpha_{t}/(\sqrt{{\mbox{\hat{v}}}_{t}})_{j},

where we recall that αt=1/t\alpha_{t}=1/\sqrt{t} and gtH\|g_{t}\|\leq H. Following the same argument in the proof of Corollary 3.1 and the previously derived upper bounds, we have

The last thing is to verify the assumption \|\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\|\leq G in Theorem 3.1, since {\alpha_{t+1}}/{(\sqrt{{\mbox{\hat{v}}}_{t+1}})_{j}}\leq{\alpha_{t}}/{(\sqrt{{\mbox{\hat{v}}}_{t}})_{j}} and {\alpha_{1}}/{(\sqrt{{\mbox{\hat{v}}}_{1}})_{j}}\leq 1/c in the algorithm, we have \|\alpha_{t}m_{t}/\sqrt{{\mbox{\hat{v}}}_{t}}\|\leq\|m_{t}\|/c\leq H/c.

For at0a_{t}\geq 0 and i=1tai0\sum_{i=1}^{t}a_{i}\neq 0, we have

Proof. [Proof of Lemma 6.9] We will prove it by induction. Suppose

Applying the definition of concavity to log(x)\log(x), with f(z)log(z)f(z)\triangleq\log(z), we have f(z)f(z0)+f(z0)(zz0)f(z)\leq f(z_{0})+f^{\prime}(z_{0})(z-z_{0}), then substitute z=xb,z0=xz=x-b,z_{0}=x, we have f(xb)f(x)+f(x)(b)f(x-b)\leq f(x)+f^{\prime}(x)(-b) which is equivalent to log(x)log(xb)+b/x\log(x)\geq\log(x-b)+b/x for b<xb<x, using x=i=1Taix=\sum_{i=1}^{T}a_{i}, b=aTb=a_{T}, we have

Now it remains to check first iteration. We have