On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

Dongruo Zhou, Jinghui Chen, Yuan Cao, Ziyan Yang, Quanquan Gu

Introduction

Stochastic gradient descent (SGD) (Robbins and Monro, 1951) and its variants have been widely used in training deep neural networks. Among those variants, adaptive gradient methods (AdaGrad) (Duchi et al., 2011; McMahan and Streeter, 2010), which scale each coordinate of the gradient by a function of past gradients, can achieve better performance than vanilla SGD in practice when the gradients are sparse. An intuitive explanation for the success of AdaGrad is that it automatically adjusts the learning rate for each feature based on the partial gradient, which accelerates the convergence. However, AdaGrad was later found to demonstrate degraded performance especially in cases where the loss function is nonconvex or the gradient is dense, due to rapid decay of learning rate. This problem is especially exacerbated in deep learning due to the huge number of optimization variables. To overcome this issue, RMSProp (Tieleman and Hinton, 2012) was proposed to use exponential moving average rather than the arithmetic average to scale the gradient, which mitigates the rapid decay of the learning rate. Kingma and Ba (2014) proposed an adaptive momentum estimation method (Adam), which incorporates the idea of momentum (Polyak, 1964; Sutskever et al., 2013) into RMSProp. Other related algorithms include AdaDelta (Zeiler, 2012) and Nadam (Dozat, 2016), which combine the idea of exponential moving average of the historical gradients, Polyak’s heavy ball (Polyak, 1964) and Nesterov’s accelerated gradient descent (Nesterov, 2013). Recently, by revisiting the original convergence analysis of Adam, Reddi et al. (2018) found that for some handcrafted simple convex optimization problem, Adam does not even converge to the global minimizer. In order to address this convergence issue of Adam, Reddi et al. (2018) proposed a new variant of the Adam algorithm named AMSGrad, which has guaranteed convergence in the convex setting. The update rule of AMSGrad is as followsWith slight abuse of notation, here we denote by vt\sqrt{\mathbf{v}_{t}} the element-wise square root of the vector vt\mathbf{v}_{t}, mt/vt\mathbf{m}_{t}/\sqrt{\mathbf{v}_{t}} the element-wise division between mt\mathbf{m}_{t} and vt\sqrt{\mathbf{v}_{t}}, and max(v^t1,vt)\max(\widehat{\mathbf{v}}_{t-1},\mathbf{v}_{t}) the element-wise maximum between v^t1\widehat{\mathbf{v}}_{t-1} and vt\mathbf{v}_{t}.:

Here β1,β2\beta_{1},\beta_{2}\in are algorithm hyperparameters, and gt\mathbf{g}_{t} is the stochastic gradient at xt\mathbf{x}_{t}.

Despite the successes of adaptive gradient methods for training deep neural networks, the convergence guarantees for these algorithms are mostly restricted to online convex optimization (Duchi et al., 2011; Kingma and Ba, 2014; Reddi et al., 2018). Therefore, there is a huge gap between existing online convex optimization guarantees for adaptive gradient methods and the empirical successes of adaptive gradient methods in nonconvex optimization. In order to bridge this gap, there are a few recent attempts to prove the nonconvex optimization guarantees for adaptive gradient methods. More specifically, Basu et al. (2018) proved the convergence rate of RMSProp and Adam when using deterministic gradient rather than stochastic gradient. Li and Orabona (2018) proved the convergence rate of AdaGrad, assuming the gradient is LL-Lipschitz continuous. Ward et al. (2018) proved the convergence rate of AdaGrad-Norm where the moving average of the norms of the gradient vectors is used to adjust the gradient vector in both deterministic and stochastic settings for smooth nonconvex functions. Nevertheless, the convergence guarantees in Basu et al. (2018); Ward et al. (2018) are still limited to simplified algorithms. Another attempt to obtain the convergence rate under stochastic setting is prompted recently by Zou and Shen (2018), in which they only focus on the condition when the momentum vanishes. Chen et al. (2018a) studies the convergence properties of adaptive gradient methods in the nonconvex setting, however, its convergence rate has a quadratic dependency on the problem dimension dd. Défossez et al. (2020) proves the convergence of Adam and Adagrad in nonconvex smooth optimization under the assumption of almost sure uniform bound on the LL_{\infty} norm of the gradients. In this paper, we provide a fine-grained convergence analysis of the adaptive gradient methods. In particular, we analyze several representative adaptive gradient methods, i.e., AMSGrad (Reddi et al., 2018), which fixed the non-convergence issue in Adam and the RMSProp (fixed version via Reddi et al. (2018)), and prove its convergence rate for smooth nonconvex objective functions in the stochastic optimization setting. Moreover, existing theoretical guarantees for adaptive gradient methods are mostly bounds in expectation over the randomness of stochastic gradients, and are therefore only on-average convergence guarantees. However, in practice the optimization algorithm is usually only run once, and therefore the performance cannot be guaranteed by the in-expectation bounds. To deal with this problem, we also provide high probability convergence rates for AMSGrad and RMSProp, which can characterize the performance of the algorithms on single runs.

The main contributions of our work are summarized as follows:

We prove that the convergence rate of AMSGrad to a stationary point for stochastic nonconvex optimization is

when g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s}. Here g1:T,i=[g1,i,g2,i,,gT,i]\mathbf{g}_{1:T,i}=[g_{1,i},g_{2,i},\ldots,g_{T,i}]^{\top} with {gt}t=1T\{\mathbf{g}_{t}\}_{t=1}^{T} being the stochastic gradients satisfying gtG\|\mathbf{g}_{t}\|_{\infty}\leq G_{\infty}, and s[0,1/2]s\in[0,1/2] is a parameter that characterizes the growth rate of the cumulative stochastic gradient g1:T,i\mathbf{g}_{1:T,i}. When the stochastic gradients are sparse, i.e., s<1/2s<1/2, AMSGrad achieves strictly faster convergence rate than that of vanilla SGD (Ghadimi and Lan, 2016) in terms of iteration number TT.

Our result implies that the worst case (i.e., s=1/2s=1/2) convergence rate for AMSGrad is

which has a better dependence on the dimension dd and TT than the convergence rate proved in Chen et al. (2018a), i.e.,

We also establish high probability bounds for adaptive gradient methods. To the best of our knowledge, it is the first high probability convergence guarantees for AMSGrad and RMSProp in nonconvex stochastic optimization setting.

Additional Related Work

Here we briefly review other related work that is not covered before.

Adaptive gradient methods: Mukkamala and Hein (2017) proposed SC-Adagrad and SC-RMSprop, which derives logarithmic regret bounds for strongly convex functions. Chen et al. (2018b) proposed SADAGRAD for solving stochastic strongly convex optimization and more generally stochastic convex optimization that satisfies the second order growth condition. Zaheer et al. (2018) studied the effect of adaptive denominator constant ϵ\epsilon and minibatch size in the convergence of adaptive gradient methods. Chen et al. (2020) proposed a partially adaptive gradient method and proved its convergence in nonconvex settings.

Nonconvex Stochastic Optimization: Ghadimi and Lan (2013) proposed a randomized stochastic gradient (RSG) method, and proved its O(1/T)O(1/\sqrt{T}) convergence rate to a stationary point. Ghadimi and Lan (2016) proposed an randomized stochastic accelerated gradient (RSAG) method, which achieves O(1/T+σ2/T)O(1/T+\sigma^{2}/\sqrt{T}) convergence rate, where σ2\sigma^{2} is an upper bound on the variance of the stochastic gradient. Motivated by the success of stochastic momentum methods in deep learning (Sutskever et al., 2013), Yang et al. (2016) provided a unified convergence analysis for both stochastic heavy-ball method and the stochastic variant of Nesterov’s accelerated gradient method, and proved O(1/T)O(1/\sqrt{T}) convergence rate to a stationary point for smooth nonconvex functions. Reddi et al. (2016); Allen-Zhu and Hazan (2016) proposed variants of stochastic variance-reduced gradient (SVRG) method (Johnson and Zhang, 2013) that is provably faster than gradient descent in the nonconvex finite-sum setting. Lei et al. (2017) proposed a stochastically controlled stochastic gradient (SCSG), which further improves convergence rate of SVRG for finite-sum smooth nonconvex optimization. Recently, Zhou et al. (2018) proposed a new algorithm called stochastic nested variance-reduced gradient (SNVRG), which achieves strictly better gradient complexity than both SVRG and SCSG for finite-sum and stochastic smooth nonconvex optimization.

Stochastic Smooth Nonconvex Optimization: There is another line of research in stochastic smooth nonconvex optimization, which makes use of the λ\lambda-nonconvexity of a nonconvex function ff (i.e., 2fλI\nabla^{2}f\succeq-\lambda\mathbf{I}). More specifically, Natasha 1 (Allen-Zhu, 2017b) and Natasha 1.5 (Allen-Zhu, 2017a) have been proposed, which solve a modified regularized problem and achieve faster convergence rate to first-order stationary points than SVRG and SCSG in the finite-sum and stochastic settings respectively. In addition, Allen-Zhu (2018) proposed an SGD4 algorithm, which optimizes a series of regularized problems, and is able to achieve a faster convergence rate than SGD.

High Probability Bounds: There are only few work on high probability results. Kakade and Tewari (2009) proved high probability bounds for PEGASOS algorithm via Freeman’s inequality. Harvey et al. (2019a, b) proved convergence bounds for non-smooth, strongly convex case via generalized Freeman’s inequality. Jain et al. (2019) makes the last iterate of SGD information theoretically optimal by providing a high probability bound. Li and Orabona (2020) presented a high probability analysis for Delayed AdaGrad algorithm with momentum in the smooth nonconvex setting.

Notation

Algorithms

We mainly consider the following three algorithms: AMSGrad (Reddi et al., 2018), a corrected version of RMSProp (Tieleman and Hinton, 2012; Reddi et al., 2018), and AdaGrad (Duchi et al., 2011).

In Algorithm 3 we further present the AdaGrad algorithm (Duchi et al., 2011), which adopts the summation of past stochastic gradient squares instead of the running average of them to compute the effective learning rate.

Convergence of Adaptive Gradient Methods in Expectation

In this section we present our main theoretical results on the convergence of AMSGrad, RMSProp as well as AdaGrad. We study the following stochastic nonconvex optimization problem

Assumption 5.2 is a standard assumption in the analysis of gradient-based algorithms. It is equivalent to the LL-gradient Lipschitz condition, which is often written as f(x)f(y)2Lxy2\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_{2}\leq L\|\mathbf{x}-\mathbf{y}\|_{2}.

We are now ready to present our main result.

Suppose β1<β21/2\beta_{1}<\beta_{2}^{1/2}, αt=α\alpha_{t}=\alpha and g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s} for t=1,,T,0s1/2t=1,\ldots,T,0\leq s\leq 1/2. Then under Assumptions 5.1 and 5.2, the iterates xt\mathbf{x}_{t} of AMSGrad satisfy that

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined as follows:

and Δ=f(x1)infxf(x)\Delta=f(\mathbf{x}_{1})-\inf_{\mathbf{x}}f(\mathbf{x}).

Note that in Theorem 5.3 we have a condition that g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s}. Here ss characterizes the growth rate condition (Liu et al., 2019) of g1:T,i\mathbf{g}_{1:T,i}, i.e., the cumulative stochastic gradient. In the worse case where the stochastic gradients are not sparse at all, s=1/2s=1/2, while in practice when the stochastic gradients are sparse, we have s<1/2s<1/2.

If we choose \alpha=\Theta\big{(}d^{1/2}T^{1/4+s/2}\big{)}^{-1}, then (5.1) implies that AMSGrad achieves

convergence rate. In cases where the stochastic gradients are sparse, i.e., s<1/2s<1/2, we can see that the convergence rate of AMSGrad is strictly better than that of nonconvex SGD (Ghadimi and Lan, 2016), i.e., O(d/T+d/T)O(\sqrt{d/T}+d/T). In the worst case when s=1/2s=1/2, this result matches the convergence rate of nonconvex SGD (Ghadimi and Lan, 2016). Note that Chen et al. (2018a) also provided similar bound for AMSGrad that

It can be seen that the dependence of dd in their bound is quadratic, which is worse than the linear dependence implied by (5.1). A very recent work (Défossez et al., 2020) discussed the convergence issue of Adam by showing that the bound consists of a constant term and does not converge to zero. In comparison, our result for AMSGrad does not have such a constant term and converges to zero in a rate O(d1/2T3/4s/2)O(d^{1/2}T^{3/4-s/2}). This demonstrates that the convergence issue of Adam is indeed fixed in AMSGrad.

Under the same conditions of Theorem 5.3, if αt=α\alpha_{t}=\alpha and g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s} for t=1,,T,0s1/2t=1,\ldots,T,0\leq s\leq 1/2, then the the iterates xt\mathbf{x}_{t} of RMSProp satisfy that

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined as follows:

Under the same conditions of Theorem 5.3, if αt=α\alpha_{t}=\alpha and g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s} for t=1,,T,0s1/2t=1,\ldots,T,0\leq s\leq 1/2, then the the iterates xt\mathbf{x}_{t} of AdaGrad satisfy that

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined as follows:

Corollaries 5.6, 5.7 imply that RMSProp and AdaGrad algorithm achieve the same rate of convergence as AMSGrad. In worst case where s=1/2s=1/2, both algorithms again, achieves O(d/T+d/T)O(\sqrt{d/T}+d/T) convergence rate, which matches the convergences rate of nonconvex SGD given by Ghadimi and Lan (2016).

Défossez et al. (2020) gave a bound O(α1T1/2+(1+α)dT1/2)O(\alpha^{-1}T^{-1/2}+(1+\alpha)dT^{-1/2}) for Adagrad, which gives the following rate with optimal α\alpha:

For the ease of comparison, we calculate the iteration complexity of both results. To converge to a ϵtarget\epsilon_{\text{target}}-approximate first order stationary point, the result of Défossez et al. (2020) suggests that AdaGrad requires Ω(ϵtarget2d2)\Omega(\epsilon_{\text{target}}^{-2}d^{2}) iterations. In sharp contrast, our result suggests that AdaGrad only requires Ω(ϵtarget2d)\Omega(\epsilon_{\text{target}}^{-2}d) iterations. Evidently, our iteration complexity is better than theirs by a factor of dd.

Convergence of Adaptive Gradient Methods with High Probability

Theorem 5.3, Corollaries 5.6 and 5.7 bound the expectation of full gradients over the randomness of stochastic gradients. In other words, these bounds can only guarantee the average performance of a large number of trials of the algorithm, but cannot rule out extremely bad solutions. What’s more, for practical applications such as training deep neural networks, usually we only perform one single run of the algorithm since the training time can be fairly large. Hence, it is essential to get high probability bounds which guarantee the performance of the algorithm on single runs. To overcome this limitation, in this section, we further establish high probability bounds of the convergence rate for AMSGrad and RMSProp as well as AdaGrad. We make the following additional assumption.

The stochastic gradients are sub-Gaussian random vectors (Jin et al., 2019):

Sub-Gaussian gradient assumptions are commonly considered when studying high probability bounds (Li and Orabona, 2020). Note that our Assumption 6.1 is weaker than Assumption B2 in Li and Orabona (2020): for the case when f(x)f(x,ξ)\nabla f(\mathbf{x})-\nabla f(\mathbf{x},\xi) is a standard Gaussian vector, σ2\sigma^{2} defined in Li and Orabona (2020) is of order O(d)O(d), while σ2=O(1)\sigma^{2}=O(1) in our definition.

Suppose β1<β21/2\beta_{1}<\beta_{2}^{1/2}, αt=α1/2\alpha_{t}=\alpha\leq 1/2 and g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s} for t=1,,T,0s1/2t=1,\ldots,T,0\leq s\leq 1/2. Then for any δ>0\delta>0, under Assumptions 5.1, 5.2 and 6.1, with probability at least 1δ1-\delta, the iterates xt\mathbf{x}_{t} of AMSGrad satisfy that

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined as follows:

and Δf=f(x1)infxf(x)\Delta f=f(\mathbf{x}_{1})-\inf_{\mathbf{x}}f(\mathbf{x}).

Similar to the discussion in Remark 5.5, we can choose \alpha=\Theta\big{(}d^{1/2}T^{1/4+s/2}\big{)}^{-1}, to achieve an O(d1/2/T3/4s/2+d/T)O(d^{1/2}/T^{3/4-s/2}+d/T) convergence rate. When s<1/2s<1/2, this rate of AMSGrad is strictly better than that of nonconvex SGD (Ghadimi and Lan, 2016).

We also have the following corollaries characterizing the high probability bounds for RMSProp and AdaGrad.

Under the same conditions of Theorem 6.3, if αt=α1/2\alpha_{t}=\alpha\leq 1/2 and g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s} for t=1,,T,0s1/2t=1,\ldots,T,0\leq s\leq 1/2, then for any δ>0\delta>0, with probability at least 1δ1-\delta, the iterates xt\mathbf{x}_{t} of RMSProf satisfy that

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined as follows:

and Δ=f(x1)infxf(x)\Delta=f(\mathbf{x}_{1})-\inf_{\mathbf{x}}f(\mathbf{x}).

Under the same conditions of Theorem 6.3, if αt=α1/2\alpha_{t}=\alpha\leq 1/2 and g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s} for t=1,,T,0s1/2t=1,\ldots,T,0\leq s\leq 1/2, then for any δ>0\delta>0, with probability at least 1δ1-\delta, the iterates xt\mathbf{x}_{t} of AdaGrad satisfy that

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined as follows:

and Δ=f(x1)infxf(x)\Delta=f(\mathbf{x}_{1})-\inf_{\mathbf{x}}f(\mathbf{x}).

Proof Sketch of the Main Theory

In this section, we provide a proof sketch of Theorems 5.3 and 6.3, and the complete proofs as well as proofs for other corollaries and technical lemmas can be found in the appendix. Compared with the analysis of standard stochastic gradient descent, the main difficulty of analyzing the convergence rate of adaptive gradient methods is caused by the existence of stochastic momentum mt\mathbf{m}_{t} and adaptive stochastic gradient V^t1/2gt\widehat{\mathbf{V}}_{t}^{-1/2}\mathbf{g}_{t}. To address this challenge, following Yang et al. (2016), we define an auxiliary sequence zt\mathbf{z}_{t}: let x0=x1\mathbf{x}_{0}=\mathbf{x}_{1}, and for each t1t\geq 1,

The following lemma shows that zt+1zt\mathbf{z}_{t+1}-\mathbf{z}_{t} can be represented by mt,gt\mathbf{m}_{t},\mathbf{g}_{t} and V^t1/2\widehat{\mathbf{V}}_{t}^{-1/2}. This indicates that by considering the sequence {zt}\{\mathbf{z}_{t}\}, it is possible to analyze algorithms which include stochastic momentum, such as AMSGrad.

Let zt\mathbf{z}_{t} be defined in (7.1). Then for t2t\geq 2, we have the following expression for zt+1zt\mathbf{z}_{t+1}-\mathbf{z}_{t}.

We can also represent zt+1zt\mathbf{z}_{t+1}-\mathbf{z}_{t} as the following:

For t=1t=1, we have z2z1=α1V^11/2g1.\mathbf{z}_{2}-\mathbf{z}_{1}=-\alpha_{1}\widehat{\mathbf{V}}_{1}^{-1/2}\mathbf{g}_{1}.

With Lemma 7.1, we have the following two lemmas giving upper bounds for zt+1zt2\|\mathbf{z}_{t+1}-\mathbf{z}_{t}\|_{2} and f(zt)f(xt)2\|\nabla f(\mathbf{z}_{t})-\nabla f(\mathbf{x}_{t})\|_{2} , which are useful to the proof of the main theorem.

Let zt\mathbf{z}_{t} be defined in (7.1). For t2t\geq 2, we have

Let zt\mathbf{z}_{t} be defined in (7.1). For t2t\geq 2, we have

We also need the following lemma to bound f(x),v^t\|\nabla f(\mathbf{x})\|_{\infty},\|\widehat{\mathbf{v}}_{t}\|_{\infty} and mt\|\mathbf{m}_{t}\|_{\infty}. Basically, it shows that these quantities can be bounded by GG_{\infty}.

Let v^t\widehat{\mathbf{v}}_{t} and mt\mathbf{m}_{t} be as defined in Algorithm 1. Then under Assumption 5.1, we have f(x)G\|\nabla f(\mathbf{x})\|_{\infty}\leq G_{\infty}, v^tG2\|\widehat{\mathbf{v}}_{t}\|_{\infty}\leq G_{\infty}^{2} and mtG\|\mathbf{m}_{t}\|_{\infty}\leq G_{\infty}.

Last, we need the following lemma that provides upper bounds for V^t1/2mt2\|\hat{\mathbf{V}}_{t}^{-1/2}\mathbf{m}_{t}\|_{2} and V^t1/2gt2\|\hat{\mathbf{V}}_{t}^{-1/2}\mathbf{g}_{t}\|_{2}. More specifically, it shows that we can bound V^t1/2mt2\|\hat{\mathbf{V}}_{t}^{-1/2}\mathbf{m}_{t}\|_{2} and V^t1/2gt2\|\hat{\mathbf{V}}_{t}^{-1/2}\mathbf{g}_{t}\|_{2} with i=1dg1:T,i2\sum_{i=1}^{d}\|\mathbf{g}_{1:T,i}\|_{2}.

Let β1,β2\beta_{1},\beta_{2} be the weight parameters, αt\alpha_{t}, t=1,,Tt=1,\ldots,T be the step sizes in Algorithm 1. We denote γ=β1/β21/2\gamma=\beta_{1}/\beta_{2}^{1/2}. Suppose that αt=α\alpha_{t}=\alpha and γ1\gamma\leq 1, then under Assumption 5.1, we have the following two results:

With all lemmas provided, now we are ready to show the proof sketch of Theorem 5.3.

In the following, we bound I1I_{1}, I2I_{2} and I3I_{3} separately.

Bounding term I1I_{1}: We can prove that when t=1t=1,

For t2t\geq 2, by Lemma 7.1, we can prove the following result:

Bounding term I2I_{2}: For t1t\geq 1, by Lemma 7.1 and Lemma 7.2, we can prove that

Bounding term I3I_{3}: For t1t\geq 1, by Lemma 7.1, we have

Now we get back to (7.2). We provide upper bounds of (7.2) for t=1t=1 and t>1t>1 separately. For t=1t=1, substituting (7.3), (7.5) and (7.6) into (7.2), taking expectation and rearranging terms, we have

For t2t\geq 2, substituting (7.4), (7.5) and (7.6) into (7.2), taking expectation and rearranging terms, we have

We now telescope (7.8) for t=2t=2 to TT, and add it with (7.7). Rearranging it, we have

Finally, rearranging (7.10), and adopting the theorem condition that g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s}, we obtain

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined in Theorem 5.3. This completes the proof. ∎

We then show the proof sketch of for high probability result, i.e, Theorem 6.3.

Following the same procedure as in the proof for Theorem 5.3 until (7.6). For t=1t=1, substituting (7.3), (7.5) and (7.6) into (7.2), rearranging terms, we have

For t2t\geq 2, substituting (7.4), (7.5) and (7.6) into (7.2), rearranging terms, we have

We now telescope (7.12) for t=2t=2 to TT and add it with (7.11). Rearranging it, we have

Now consider the filtration Ft=σ(ξ1,,ξt)\mathcal{F}_{t}=\sigma(\xi_{1},\ldots,\xi_{t}). Since xt\mathbf{x}_{t} and V^t11/2\widehat{\mathbf{V}}_{t-1}^{-1/2} only depend on ξ1,,ξt1\xi_{1},\ldots,\xi_{t-1}, by Assumption 6.1 and an martingale concentration argument, we obtain we obtain

By using Lemma 7.5 and substituting (7.14) into (7.13), we have

Moreover, by Lemma 7.4, we have f(xt)V^t11/2f(xt)G1f(xt)22\nabla f(\mathbf{x}_{t})^{\top}\widehat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_{t})\geq G_{\infty}^{-1}\|\nabla f(\mathbf{x}_{t})\|_{2}^{2}, and therefore by choosing αt=α1/2\alpha_{t}=\alpha\leq 1/2 and rearranging terms, we have

where CC^{\prime} is an absolute constant. Finally, rearranging (7.15) and adopting the theorem condition g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s}, we have

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined in Theorem 6.3. This completes the proof. ∎

Conclusions

In this paper, we provided a fine-grained analysis of a general class of adaptive gradient methods, and proved their convergence rates for smooth nonconvex optimization. Our results provide faster convergence rates of AMSGrad and the corrected version of RMSProp as well as AdaGrad for smooth nonconvex optimization compared with previous works. In addition, we also prove high probability bounds on the convergence rates of AMSGrad and RMSProp as well as AdaGrad, which have not been established before.

Appendix A Detailed Proof of the Main Theory

Here we provide the detailed proof of the main theorem.

Let x0=x1\mathbf{x}_{0}=\mathbf{x}_{1}. To prove Theorem 5.3, we need the following lemmas:

Let v^t\widehat{\mathbf{v}}_{t} and mt\mathbf{m}_{t} be as defined in Algorithm 1. Then under Assumption 5.1, we have f(x)G\|\nabla f(\mathbf{x})\|_{\infty}\leq G_{\infty}, v^tG2\|\widehat{\mathbf{v}}_{t}\|_{\infty}\leq G_{\infty}^{2} and mtG\|\mathbf{m}_{t}\|_{\infty}\leq G_{\infty}.

Let β1,β2\beta_{1},\beta_{2}, β1,β2\beta_{1}^{\prime},\beta_{2}^{\prime} be the weight parameters such that

αt\alpha_{t}, t=1,,Tt=1,\ldots,T be the step sizes. We denote γ=β1/β21/2\gamma=\beta_{1}/\beta_{2}^{1/2}. Suppose that αt=α\alpha_{t}=\alpha and γ1\gamma\leq 1, then under Assumption 5.1, we have the following two results:

Note that Lemma A.2 is general and applicable to various algorithms. Specifically, set β1=β1\beta_{1}^{\prime}=\beta_{1} and β2=β2\beta_{2}^{\prime}=\beta_{2}, we recover the case in Algorithm 1. Further set β1=0\beta_{1}=0 we recover the case in Algorithm 2. Set β1=β1=0\beta_{1}^{\prime}=\beta_{1}=0 and β2=1,β2=0\beta_{2}=1,\beta_{2}^{\prime}=0 we recover the case in Algorithm 3.

To deal with stochastic momentum mt\mathbf{m}_{t} and stochastic weight V^t1/2\widehat{\mathbf{V}}_{t}^{-1/2}, following Yang et al. (2016), we define an auxiliary sequence zt\mathbf{z}_{t} as follows: let x0=x1\mathbf{x}_{0}=\mathbf{x}_{1}, and for each t1t\geq 1,

Lemma A.3 shows that zt+1zt\mathbf{z}_{t+1}-\mathbf{z}_{t} can be represented in two different ways.

Let zt\mathbf{z}_{t} be defined in (A.1). For t2t\geq 2, we have

By Lemma A.3, we connect zt+1zt\mathbf{z}_{t+1}-\mathbf{z}_{t} with xt+1xt\mathbf{x}_{t+1}-\mathbf{x}_{t} and αtV^t1/2gt\alpha_{t}\hat{\mathbf{V}}_{t}^{-1/2}\mathbf{g}_{t}

The following two lemmas give bounds on zt+1zt2\|\mathbf{z}_{t+1}-\mathbf{z}_{t}\|_{2} and f(zt)f(xt)2\|\nabla f(\mathbf{z}_{t})-\nabla f(\mathbf{x}_{t})\|_{2}, which play important roles in our proof.

Let zt\mathbf{z}_{t} be defined in (A.1). For t2t\geq 2, we have

Let zt\mathbf{z}_{t} be defined in (A.1). For t2t\geq 2, we have

In the following, we bound I1I_{1}, I2I_{2} and I3I_{3} separately.

Bounding term I1I_{1}: When t=1t=1, we have

where the first equality holds due to (A.3) in Lemma A.3. For f(xt)(αt1V^t11/2αtV^t1/2)mt1\nabla f(\mathbf{x}_{t})^{\top}(\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}-\alpha_{t}\widehat{\mathbf{V}}_{t}^{-1/2})\mathbf{m}_{t-1} in (A.7), we have

The first inequality holds because for a positive diagonal matrix A\mathbf{A}, we have xAyxA1,1y\mathbf{x}^{\top}\mathbf{A}\mathbf{y}\leq\|\mathbf{x}\|_{\infty}\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_{\infty}. The second inequality holds due to αt1V^t11/2αtV^t1/20\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}\succeq\alpha_{t}\widehat{\mathbf{V}}_{t}^{-1/2}\succeq 0. Next we bound f(xt)αtV^t1/2gt-\nabla f(\mathbf{x}_{t})^{\top}\alpha_{t}\widehat{\mathbf{V}}_{t}^{-1/2}\mathbf{g}_{t}. We have

The first inequality holds because for a positive diagonal matrix A\mathbf{A}, we have xAyxA1,1y\mathbf{x}^{\top}\mathbf{A}\mathbf{y}\leq\|\mathbf{x}\|_{\infty}\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_{\infty}. The second inequality holds due to αt1V^t11/2αtV^t1/20\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}\succeq\alpha_{t}\widehat{\mathbf{V}}_{t}^{-1/2}\succeq 0. Substituting (A.8) and (A.9) into (A.7), we have

Bounding term I2I_{2}: For t1t\geq 1, we have

where the second inequality holds because of Lemma A.3 and Lemma A.4, the last inequality holds due to Young’s inequality.

Bounding term I3I_{3}: For t1t\geq 1, we have

The first inequality is obtained by introducing Lemma A.3.

For t=1t=1, substituting (A.6), (A.11) and (A.12) into (A.5), taking expectation and rearranging terms, we have

For t2t\geq 2, substituting (A.10), (A.11) and (A.12) into (A.5), taking expectation and rearranging terms, we have

where γ=β1/β21/2\gamma=\beta_{1}/\beta_{2}^{1/2}. We also have

Substituting (A.16) and (A.17) into (A.15), and rearranging (A.15), we have

where the second inequality holds because αt=α\alpha_{t}=\alpha. Rearranging (A.18), and note that in the theorem condition we have g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s}, we obtain

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined in Theorem 5.3. This completes the proof. ∎

A.2 Proof of Corollary 5.6

Following the proof for Theorem 5.3, setting β1=β1=0\beta_{1}^{\prime}=\beta_{1}=0 and β2=β2=β\beta_{2}^{\prime}=\beta_{2}=\beta in Lemma A.2 we get the conclusion. ∎

A.3 Proof of Corollary 5.7

Following the proof for Theorem 5.3, setting β1=β1=0\beta_{1}^{\prime}=\beta_{1}=0, β2=1\beta_{2}=1 and β2=0\beta_{2}^{\prime}=0 in Lemma A.2 we get the conclusion. ∎

A.4 Proof of Theorem 6.3

In the following, we bound I1I_{1}, I2I_{2} and I3I_{3} separately.

Bounding term I1I_{1}: When t=1t=1, we have

where the first equality holds due to (A.3) in Lemma A.3. For f(xt)(αt1V^t11/2αtV^t1/2)mt1\nabla f(\mathbf{x}_{t})^{\top}(\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}-\alpha_{t}\widehat{\mathbf{V}}_{t}^{-1/2})\mathbf{m}_{t-1} in (A.21), we have

The first inequality holds because for a positive diagonal matrix A\mathbf{A}, we have xAyxA1,1y\mathbf{x}^{\top}\mathbf{A}\mathbf{y}\leq\|\mathbf{x}\|_{\infty}\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_{\infty}. The second inequality holds due to αt1V^t11/2αtV^t1/20\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}\succeq\alpha_{t}\widehat{\mathbf{V}}_{t}^{-1/2}\succeq 0. Next we bound f(xt)αtV^t1/2gt-\nabla f(\mathbf{x}_{t})^{\top}\alpha_{t}\widehat{\mathbf{V}}_{t}^{-1/2}\mathbf{g}_{t}. We have

The first inequality holds because for a positive diagonal matrix A\mathbf{A}, we have xAyxA1,1y\mathbf{x}^{\top}\mathbf{A}\mathbf{y}\leq\|\mathbf{x}\|_{\infty}\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_{\infty}. The second inequality holds due to αt1V^t11/2αtV^t1/20\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}\succeq\alpha_{t}\widehat{\mathbf{V}}_{t}^{-1/2}\succeq 0. Substituting (A.22) and (A.23) into (A.21), we have

Bounding term I2I_{2}: For t1t\geq 1, we have

where the second inequality holds because of Lemma A.3 and Lemma A.4, the last inequality holds due to Young’s inequality.

Bounding term I3I_{3}: For t1t\geq 1, we have

The first inequality is obtained by introducing Lemma A.3.

For t=1t=1, substituting (A.20), (A.25) and (A.26) into (A.19) and rearranging terms, we have

For t2t\geq 2, substituting (A.24), (A.25) and (A.26) into (A.19) and rearranging terms, we have

Telescoping (A.28) for t=2t=2 to TT and adding (A.27), we have

where γ=β1/β21/2\gamma=\beta_{1}/\beta_{2}^{1/2}. We also have

Moreover, consider the filtration Ft=σ(ξ1,,ξt)\mathcal{F}_{t}=\sigma(\xi_{1},\ldots,\xi_{t}). Since xt\mathbf{x}_{t} and V^t11/2\widehat{\mathbf{V}}_{t-1}^{-1/2} only depend on ξ1,,ξt1\xi_{1},\ldots,\xi_{t-1}. For any τ,λ>0\tau,\lambda>0, by Assumption 6.1 with v=λαt1V^t11/2f(xt)\mathbf{v}=\lambda\cdot\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_{t}), we have

Denote Zt=αt1f(xt)V^t11/2(gtf(xt))Z_{t}=\alpha_{t-1}\nabla f(\mathbf{x}_{t})^{\top}\widehat{\mathbf{V}}_{t-1}^{-1/2}(\mathbf{g}_{t}-\nabla f(\mathbf{x}_{t})). Then we have

With exactly the same proof, we also have

Choosing λ=[σ2αt12V^t11/2f(xt)22]1τ\lambda=[\sigma^{2}\alpha_{t-1}^{2}\|\widehat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_{t})\|_{2}^{2}]^{-1}\tau, we finally obtain

for all τ>0\tau>0, where σt=σαt1V^t11/2f(xt)2\sigma_{t}=\sigma\alpha_{t-1}\|\widehat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_{t})\|_{2}. The tail bound (A.32) enables the application of Lemma 6 in Jin et al. (2019), which gives that with probability at least 1δ1-\delta,

where CC is an absolute constant. Plugging in the definitions of ZtZ_{t} and σt\sigma_{t}, we obtain

where the second inequality is by the fact that the diagonal entries of V^t1\widehat{\mathbf{V}}_{t-1} are all loewr bounded by ϵ\epsilon. Substituting (A.30), (A.31) and (A.33) into (A.29), we have

Moreover, by Lemma A.1, we have f(xt)V^t11/2f(xt)G1f(xt)22\nabla f(\mathbf{x}_{t})^{\top}\widehat{\mathbf{V}}_{t-1}^{-1/2}\nabla f(\mathbf{x}_{t})\geq G_{\infty}^{-1}\|\nabla f(\mathbf{x}_{t})\|_{2}^{2}, and therefore by choosing αt=α\alpha_{t}=\alpha and rearranging terms, we have

where CC^{\prime} is an absolute constant.

Now by the theorem condition g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s}, we have

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined in Theorem 6.3. This completes the proof. ∎

A.5 Proof of Corollary 6.5

Following the proof for Theorem 6.3, setting β1=β1=0\beta_{1}^{\prime}=\beta_{1}=0 and β2=β2=β\beta_{2}^{\prime}=\beta_{2}=\beta in Lemma A.2 we get the conclusion. ∎

A.6 Proof of Corollary 6.6

Following the proof for Theorem 6.3, setting β1=β1=0\beta_{1}^{\prime}=\beta_{1}=0, β2=1\beta_{2}=1 and β2=0\beta_{2}^{\prime}=0 in Lemma A.2 we get the conclusion. ∎

Appendix B Proof of Technical Lemmas

Since ff has GG_{\infty}-bounded stochastic gradient, for any x\mathbf{x} and ξ\xi, f(x;ξ)G\|\nabla f(\mathbf{x};\xi)\|_{\infty}\leq G_{\infty}. Thus, we have

Next we bound mt\|\mathbf{m}_{t}\|_{\infty}. We have m0=0G\|\mathbf{m}_{0}\|_{\infty}=0\leq G_{\infty}. Suppose that mtG\|\mathbf{m}_{t}\|_{\infty}\leq G_{\infty}, then for mt+1\mathbf{m}_{t+1}, we have

Thus, for any t0t\geq 0, we have mtG\|\mathbf{m}_{t}\|_{\infty}\leq G_{\infty}. Finally we bound v^t\|\widehat{\mathbf{v}}_{t}\|_{\infty}. First we have v0=v^0=0G2\|\mathbf{v}_{0}\|_{\infty}=\|\widehat{\mathbf{v}}_{0}\|_{\infty}=0\leq G_{\infty}^{2}. Suppose that v^tG2\|\widehat{\mathbf{v}}_{t}\|_{\infty}\leq G_{\infty}^{2} and vtG2\|\mathbf{v}_{t}\|_{\infty}\leq G_{\infty}^{2}. Note that we have

and by definition, we have v^t+1=max{v^t,vt+1}G2\|\widehat{\mathbf{v}}_{t+1}\|_{\infty}=\max\{\|\widehat{\mathbf{v}}_{t}\|_{\infty},\|\mathbf{v}_{t+1}\|_{\infty}\}\leq G_{\infty}^{2}. Thus, for any t0t\geq 0, we have v^tG2\|\widehat{\mathbf{v}}_{t}\|_{\infty}\leq G_{\infty}^{2}. ∎

B.2 Proof of Lemma A.2

Recall that v^t,j,mt,j,gt,j\widehat{v}_{t,j},m_{t,j},g_{t,j} denote the jj-th coordinate of v^t,mt\widehat{\mathbf{v}}_{t},\mathbf{m}_{t} and gt\mathbf{g}_{t}. We have

where the first inequality holds since a+b2aba+b\geq 2\sqrt{ab} and the second inequality holds because v^t,ivt,i\widehat{v}_{t,i}\geq v_{t,i}. Next we have

where the first inequality holds due to Cauchy inequality, and the last inequality holds because j=1tβ1tj(1β1)1\sum_{j=1}^{t}\beta_{1}^{t-j}\leq(1-\beta_{1})^{-1}. Note that

where the equality holds due to the definition of γ\gamma. Substituting (B.2) and (B.3) into (B.1), we have

Telescoping (B.4) for t=1t=1 to TT, we have

where the inequality holds due to Hölder’s inequality. Substituting (B.6) into (B.5), we have

Specifically, taking β1=0\beta_{1}=0, we have mt=gt\mathbf{m}_{t}=\mathbf{g}_{t}, then

B.3 Proof of Lemma A.3

The equities above are based on definition. Then we have

The equalities above follow by combining the like terms. ∎

B.4 Proof of Lemma A.4

where the inequality holds because the term β1/(1β1)\beta_{1}/(1-\beta_{1}) is positive, and triangle inequality. Considering that αtv^t,j1/2αt1v^t1,j1/2\alpha_{t}\widehat{\mathbf{v}}_{t,j}^{-1/2}\leq\alpha_{t-1}\widehat{\mathbf{v}}_{t-1,j}^{-1/2}, when p>0p>0, we have \Big{\|}\mathbf{I}-(\alpha_{t}\widehat{\mathbf{V}}_{t}^{-1/2})(\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-1/2})^{-1}\Big{\|}_{\infty,\infty}\leq 1. With that fact, the term above can be bound as:

B.5 Proof of Lemma A.5

For term f(zt)f(xt)2\|\nabla f(\mathbf{z}_{t})-\nabla f(\mathbf{x}_{t})\|_{2}, we have:

where the last inequality holds because the term β1/(1β1)\beta_{1}/(1-\beta_{1}) is positive. ∎

References