On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes

Xiaoyu Li, Francesco Orabona

INTRODUCTION

In recent years, Stochastic Gradient Descent (SGD) has become the tool of choice to train machine learning models. In particular, in the Deep Learning community, it is widely used to minimize the training error of deep networks. In this setting, the stochasticity arises from the use of so-called mini-batches, that allows to keep the complexity per iteration constant with respect to the size of the training set.

More in details, SGD iteratively updates the solution as xt+1=xtηtg(xt,ξt)\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\eta_{t}\boldsymbol{g}(\boldsymbol{x}_{t},\xi_{t}), starting from an arbitrary point x1\boldsymbol{x}_{1}, where g(xt,ξt)\boldsymbol{g}(\boldsymbol{x}_{t},\xi_{t}) is a stochastic gradient in xt\boldsymbol{x}_{t} that depends on the stochastic variable ξt\xi_{t}.Classic convergence analysis of the SGD algorithm for non-convex smooth functions relies on conditions on the positive stepsizes ηt\eta_{t} (Robbins and Monro, 1951). In particular, sufficient conditions are that (ηt)t=1(\eta_{t})_{t=1}^{\infty} is a deterministic sequence of non-negative numbers that satisfies

However, state-of-the-art SGD variants use adaptive stepsizes, that is ηt\eta_{t} is a function of past stochastic gradients. These stepsizes are believed to require less tweaking to achieve good performance in machine learning applications and we have some partial explanations in the convex setting, i.e. sparsity of the gradients (Duchi et al., 2011). However, in the non-convex setting, we do not have any theory explaining the better performance.

Indeed, for a large number of SGD variants employed by practitioners the conditions above are not satisfied. In fact, these algorithms are often designed and analyzed for the convex domain under restrictive conditions, e.g. bounded domains, or they do not provide convergence guarantees at all, (e.g. Zeiler, 2012), or even worse they are known to fail to converge on simple one-dimensional convex stochastic optimization problems (Reddi et al., 2018). Even considering an infinite number of iterations, the behavior of these algorithms is often unknown.

We focus on a generalized version of the adaptive stepsizes popularized by AdaGrad (Duchi et al., 2011). This kind of stepsizes has become the basis of all other adaptive optimization algorithms used in machine learning, (e.g. Zeiler, 2012; Tieleman and Hinton, 2012; Kingma and Ba, 2015; Reddi et al., 2018). We analyze two types of step size: a global step size

where α>0\alpha>0 and β,ϵ0\beta,\epsilon\geq 0. Note that, with ϵ=0\epsilon=0, (3) are the coordinate-wise stepsizes used in AdaGrad (Duchi et al., 2011), while (2) have been used in online convex optimization to achieve adaptive regret guarantees, (e.g. Rakhlin and Sridharan, 2013; Orabona and Pál, 2018). The additional parameter ϵ\epsilon allows us to increase the decrease rate of the stepsize and it will be critical to obtain our almost sure convergence results.

In this paper, we want to answer two basic questions: 1) Are there conditions under which the generalized AdaGrad stepsize converge almost surely with an infinite number of iterations in the non-convex setting? 2) Are there conditions under which the rate is better than the one of the plain SGD with decreasing stepsizes?

We answer positively to both questions. More in details, the contributions of this paper are the following:

In Section 5, we prove for the first time in the non-convex setting almost sure asymptotic convergence to zero of the gradients of SGD with both coordinate-wise and global adaptive stepsizes.

In Section 6.1, we prove that in the convex setting the generalized global AdaGrad stepsizes adapts to the noise level, through a finite-time convergence rate. In particular, we show that, depending on the noise level, SGD with the generalized AdaGrad updates automatically interpolates between the convergence rates of Gradient Descent (GD) and SGD, up to polylogarithmic terms. We do so removing the strong assumptions present in previous analyses.

In Section 6.2, similarly to the results of Section 6.1, we show that in the non-convex setting the generalized global AdaGrad stepsizes adapts to the noise level, through a novel finite-time convergence rate. A low noise will result in an automatic faster convergence. As far as we know, these are the first theoretical results for the advantage of AdaGrad-like stepsizes over the plain SGD in the non-convex setting.

The next Section discusses more in details the related work, while Section 3 introduces formally the setting, and Section 4 discusses the details of the adaptive stepsizes considered in this work.

RELATED WORK

The convergence of a random iterate of SGD for non-convex smooth functions has been proved by Ghadimi and Lan (2013), and it was already implied by the results in Bottou (1991). With additional regularity assumptions, these results imply almost sure convergence of the gradient to zero (Bottou, 1991, 2016). In alternative to the regularity assumptions, Bottou (1998) proposed to assume that beyond a certain horizon the update always moves the iterate closer to the origin on average, that implies the confinement in a bounded domain and, in turn, the almost sure convergence. On the other hand, the weakest assumptions for the almost sure convergence of SGD for non-convex smooth functions have been established in Bertsekas and Tsitsiklis (2000): the variance of the noise on the gradient in xt\boldsymbol{x}_{t} can grow as 1+f(xt)21+\|\nabla f(\boldsymbol{x}_{t})\|^{2}, ff is lower bounded, and the stepsizes satisfy (1). However, both approaches do not cover adaptive stepsizes.

The first work we know on adaptive stepsizes for non-convex stochastic optimization is Kresoja et al. (2017). They study the convergence of a choice of adaptive stepsizes that require access to the function values, under strict conditions on the direction of the gradients. Wu et al. (2018) also consider adaptive stepsizes, but they only consider deterministic gradients in the non-convex setting. Later, Ward et al. (2018), independently and the same time with us, improved their guarantees proving results similar to our Theorems 3 and 4. They use the original AdaGrad stepsizes, but with the assumption of bounded expected squared norm of the stochastic gradients. Some other related works were proposed after our submission. Zhou et al. (2018) analyze an adaptive gradient method in the non-convex setting, but their bounds give advantages only in very sparse case.

A weak condition for almost sure convergence to the global optimum of non-convex functions was proposed in Bottou (1998) and recently independently reproposed in Zhou et al. (2017). However, this condition implies the very strong assumption that the gradients never point in the opposite direction of the global optimum. In this paper, in our most restrictive case in Section 5, we will only assume the function to be smooth and Lipschitz.

PROBLEM SET-UP

We will also make alternatively one of the following assumptions on the variance of the noise.

The noise in the stochastic gradient has bounded support, that is g(x,ξ)f(x)S, x\|\boldsymbol{g}(\boldsymbol{x},\xi)-\nabla f(\boldsymbol{x})\|\leq S,\ \forall\boldsymbol{x}.

(H4’) has been already used by Nemirovski et al. (2009) to prove high probability convergence guarantees. This condition allows to control the expectation of the maximum of the terms f(xt)g(xt,ξt)2\|\nabla f(\boldsymbol{x}_{t})-g(\boldsymbol{x}_{t},\xi_{t})\|^{2}. Note that, using Jensen’s inequality, this condition implies a bounded variance. Also, (H4) implies (H4’).

KEEPING THE UPDATE DIRECTION UNBIASED

A key difference between the generalized AdaGrad stepsizes in (2) and (3) with the AdaGrad stepsizes in Duchi et al. (2011) is the fact that g(xt,ξt)\boldsymbol{g}(\boldsymbol{x}_{t},\xi_{t}) is not used in ηt\eta_{t}. It is easy to see that doing otherwise introduces a spurious bias in the update direction. Indeed, as we show in the Example below, if the stepsize does depend on the current gradient, things can go wrong. The details can be found in the Appendix.

In words, the example says that including the current noisy gradient in ηt\eta_{t} (that is, using ηt+1\eta_{t+1}) can make the algorithm deviate in expectation more than 9090 degrees from the correct direction. While in the convex bounded case the algorithm can recover, it is intuitive that this could have catastrophic consequences in the unconstrained non-convex setting, especially when the function is not Lipschitz. So, in the following, we will analyze this minor variant of the AdaGrad stepsizes.

On the other hand, this difference makes the analysis more involved, because the quantity t=1Tηt2g(xt,ξt)2\sum_{t=1}^{T}\eta_{t}^{2}\|\boldsymbol{g}(\boldsymbol{x}_{t},\xi_{t})\|^{2} cannot be bounded anymore in a straightforward way, see Lemma 8 in the next Section. Previous analyses, (e.g. Duchi et al., 2011), solved this issue by assuming the knowledge of the Lipschitz constant of the function ff, while we will assume the function to be Lipschitz only to prove the asymptotic guarantee and no knowledge of it. We believe that removing the assumption of knowing the Lipschitz constant more closely follow the use in real-world applications.

In the following, we will show that these stepsize allow to prove adaptive guarantees in the convex and non-convex settings.

ALMOST SURE CONVERGENCE FOR NON-CONVEX FUNCTIONS

In this section, we show that SGD with the generalized AdaGrad stepsizes in (2) and (3) allows to decrease the gradients to zero almost surely, that is, with probability 1. This is considered a required basic property for any optimization algorithm.

The stepsizes in (2) and (3) do not satisfy (1), not even in expectation, because the g(xt,ξt)\boldsymbol{g}(\boldsymbol{x}_{t},\xi_{t}) could decrease fast enough to have t=1ηt2=\sum_{t=1}^{\infty}\eta_{t}^{2}=\infty. Hence, the results here cannot be obtained from the classic results in stochastic approximation (e.g. Bertsekas and Tsitsiklis, 2000).

Here, we will have to assume our strongest assumptions. In particular, we will need the function to be Lipschitz and the noise to have bounded support. This is mainly needed in order to be sure that the sum of the stepsizes diverges.

We now state our almost sure convergence results.

Assume (H1, H2, H3, H4). The stepsizes are chosen as in (2), where α,β>0\alpha,\beta>0 and ϵ(0,12]\epsilon\in(0,\frac{1}{2}]. Then, the gradients of SGD converges to zero almost surely. Moreover, liminftf(xt)2t\nicefrac12ϵ=0\lim\inf_{t\rightarrow\infty}\|\nabla f(\boldsymbol{x}_{t})\|^{2}t^{\nicefrac{{1}}{{2}}-\epsilon}=0 almost surely.

We also state a similar result for the coordinate-wise stepsizes in (3). We remind the reader that these stepsizes closely mirror the ones used in AdaGrad, but with the power of the denominator 12+ϵ\frac{1}{2}+\epsilon with ϵ>0\epsilon>0, rather than 12\frac{1}{2}. Also, differently from what is stated in the original AdaGrad paper, here we do not project onto a bounded closed convex set. This mirrors the actual implementation of AdaGrad in machine learning libraries, e.g. Tensorflow (Abadi et al., 2015).

Assume (H1, H2, H3, H4). The stepsizes are given by a diagonal matrix ηt\boldsymbol{\eta}_{t} whose diagonal values are defined in (3), where α,β>0\alpha,\beta>0 and ϵ(0,12]\epsilon\in(0,\frac{1}{2}]. Then, the gradients of SGD converges to zero almost surely. Moreover, liminftf(xt)2t\nicefrac12ϵ=0\lim\inf_{t\rightarrow\infty}\|\nabla f(\boldsymbol{x}_{t})\|^{2}t^{\nicefrac{{1}}{{2}}-\epsilon}=0 almost surely.

As far as we know, the above theorems are the first results on the almost sure convergence of the gradients using generalized AdaGrad stepsizes and assuming ϵ>0\epsilon>0. In particular, Theorem 2 is the first theoretical support to the common heuristic of selecting the last iterate, rather than the minimum over the iterations.

For the proofs, we will need some technical lemmas, whose proofs are in the Appendix.

(Alber et al., 1998, Proposition 2)(Mairal, 2013, Lemma A.5) Let (at)t1,(bt)t1(a_{t})_{t\geq 1},(b_{t})_{t\geq 1} be two non-negative real sequences. Assume that t=1atbt\sum_{t=1}^{\infty}a_{t}b_{t} converges and t=1at\sum_{t=1}^{\infty}a_{t} diverges, and there exists K0K\geq 0 such that bt+1btKat|b_{t+1}-b_{t}|\leq Ka_{t}. Then btb_{t} converges to 0.

Let a0>0a_{0}>0, ai0, i=1,,Ta_{i}\geq 0,\ i=1,\cdots,T and β>1\beta>1. Then t=1Tat(a0+i=1tai)β1(β1)a0β1\sum_{t=1}^{T}\frac{a_{t}}{(a_{0}+\sum_{i=1}^{t}a_{i})^{\beta}}\leq\frac{1}{(\beta-1)a_{0}^{\beta-1}}.

We now state a Lemma that allows us to study the progress made in TT steps. The proof is in the Appendix.

We can prove Theorem 1. Given that the proof of Theorem 2 is virtually identical to the one of Theorem 1, we defer its proof to the Appendix.

From the result in Lemma 3, taking the limit for TT\rightarrow\infty and exchanging the expectation and the limits because the terms are non-negative, we have

where in the first inequality we have used Lemma 2, and in the third one the elementary inequality x+y22x2+2y2\|\boldsymbol{x}+\boldsymbol{y}\|^{2}\leq 2\|\boldsymbol{x}\|^{2}+2\|\boldsymbol{y}\|^{2}.

Now, observe that the Lipschitzness of ff and the bounded support of the noise on the gradients gives

Using the fact the ff is LL-Lipschitz and MM-smooth, we have

Hence, we can use Lemma 1 to obtain limtf(xt)2=0\lim_{t\to\infty}\|\nabla f(\boldsymbol{x}_{t})\|^{2}=0.

For the second statement, observe that, with probability 1,

where in the first inequality we used the Lipschitzness of ff and the bounded support of the noise on the gradients. Hence, noting that t=11t=\sum_{t=1}^{\infty}\frac{1}{t}=\infty, we have that liminftf(xt)2t\nicefrac12ϵ=0\lim\inf_{t\rightarrow\infty}\|\nabla f(\boldsymbol{x}_{t})\|^{2}t^{\nicefrac{{1}}{{2}}-\epsilon}=0. ∎

Even if the above results hold with probability 1, the above convergence guarantees rates are only asymptotic. So, in the next Section, we show finite-time convergence rates in expectation. Moreover, we will show that the generalized AdaGrad stepsizes adapt to the level of noise for both the convex and non-convex case.

ADAPTIVE CONVERGENCE RATES

We will now show that the global generalized AdaGrad stepsizes give rises to adaptive convergence rates. In particular, we will show that for a large range of the parameters α,β,ϵ\alpha,\beta,\epsilon and independently from the noise variance σ\sigma, the algorithms will have a faster convergence when σ\sigma is small and worst-case optimal convergence when σ\sigma is large. Note that to achieve the same behavior with SGD we should use a different stepsize for each level of noise.

In the following, we will consider the convex and non-convex case.

As a warm-up, in this section, we show that the global stepsizes (2) give adaptive rates of convergence that interpolate between the rate of GD and SGD, for a wide range of the parameters α,β\alpha,\beta, and ϵ\epsilon and without knowledge of the variance of the noise. Note that, differently from the other proofs on SGD with adaptive rates (e.g. Duchi et al., 2011), we do not assume to use projections onto bounded domains. This makes our novel proof more technically challenging, but at the same time, it mirrors the setting of many applications of SGD in machine learning optimization problems.

Assume (H1, H3, H4’) and ff convex. Let the stepsizes set as in (2), where α,β>0\alpha,\beta>0, 0ϵ<120\leq\epsilon<\frac{1}{2}, and 4αM<β\nicefrac12+ϵ4\alpha M<\beta^{\nicefrac{{1}}{{2}}+\epsilon}. Then, the iterates of SGD satisfy the following bound

where xˉT=1Tt=1Txt\bar{\boldsymbol{x}}_{T}=\frac{1}{T}\sum_{t=1}^{T}\boldsymbol{x}_{t} and

Using Markov’s inequality, from the above bound it is immediate to get that, with probability at least 1δ1-\delta, we have

Up to polylog terms, if σ=0\sigma=0 this recovers the GD rate, O(1T)O(\tfrac{1}{T}), and otherwise we get the worst-case optimal rate of SGD, O(1T)O(\tfrac{1}{\sqrt{T}}). The same behavior was proved in Dekel et al. (2012) with the knowledge of σ\sigma and stepsize depending on it. Instead, here we do not need to know the noise level nor assuming a bounded domain. In the case the constants of the slow term are small compared with the ones of the first term, we can expect a first quick convergent phase, followed by a slow one, as it is often observed in empirical experiments.

For the proof, we first state some technical lemmas, whose proofs are in the Appendix.

Assume (H1). Then f(x)22M(f(x)minyf(y)), x\|\nabla f(\boldsymbol{x})\|^{2}\leq 2M(f(\boldsymbol{x})-\min_{\boldsymbol{y}}f(\boldsymbol{y})),\ \forall\boldsymbol{x}.

If x0x\geq 0 and xC(A+Bx)12+ϵx\leq C(A+Bx)^{\frac{1}{2}+\epsilon}, then x<max([C(2B)12+ϵ]11/2ϵ,C(2A)12+ϵ)x<\max([C(2B)^{\frac{1}{2}+\epsilon}]^{\frac{1}{1/2-\epsilon}},C(2A)^{\frac{1}{2}+\epsilon}).

If x0x\geq 0, A,C,D0A,C,D\geq 0, B>0B>0, and x2(A+Bx)(C+Dln(A+Bx))x^{2}\leq(A+Bx)(C+D\ln(A+Bx)), then x<32B3D2+2BC+8B2DC+A/Bx<32B^{3}D^{2}+2BC+8B^{2}D\sqrt{C}+A/B.

If x,y0x,y\geq 0 and 0p10\leq p\leq 1, then (x+y)pxp+yp(x+y)^{p}\leq x^{p}+y^{p}.

Assume (H1, H3, H4’). The stepsizes are chosen as (2), where α,β,ϵ0\alpha,\beta,\epsilon\geq 0. Then,

For simplicity, denote by δt:=f(xt)f(x)\delta_{t}:=f(\boldsymbol{x}_{t})-f(\boldsymbol{x}^{\star}) and by Δ:=t=1Tδt\Delta:=\sum_{t=1}^{T}\delta_{t}.

Taking the conditional expectation with respect to ξ1,,ξt1\xi_{1},\cdots,\xi_{t-1}, we have that

where in the inequality we used the fact that ff is convex. Hence, summing over t=1t=1 to TT, we have

From Lemma 4 and Lemma 8, when ϵ>0\epsilon>0 we have that

On the other hand, when ϵ=0\epsilon=0 we have

We can also lower bound the l.h.s. of (27) and (28) with

where in the first inequality we used the elementary inequality x+y22x2+2y2\|\boldsymbol{x}+\boldsymbol{y}\|^{2}\leq 2\|\boldsymbol{x}\|^{2}+2\|\boldsymbol{y}\|^{2} and Lemma 4 in the second one.

where KK will be defined in the following for the case ϵ=0\epsilon=0 and ϵ>0\epsilon>0.

where in the third inequality we used Lemma 7 and we define K=α22ϵβ2ϵα(14αMβ\nicefrac12+ϵ)K=\frac{\frac{\alpha^{2}}{2\epsilon\beta^{2\epsilon}}}{\alpha(1-\frac{4\alpha M}{\beta^{\nicefrac{{1}}{{2}}+\epsilon}})}. Proceeding in the same way, for the case ϵ=0\epsilon=0 we get

where A=β+2Tσ2A=\sqrt{\beta+2T\sigma^{2}}, B=2MB=2\sqrt{M}, D=α14αMβD=\frac{\alpha}{1-\frac{4\alpha M}{\sqrt{\beta}}} and C=βx1x2+4α2(1+lnT)σ22αβ(14αMβ)C=\frac{\beta\|\boldsymbol{x}_{1}-\boldsymbol{x}^{\star}\|^{2}+4\alpha^{2}(1+\ln T)\sigma^{2}}{2\alpha\beta(1-\frac{4\alpha M}{\sqrt{\beta}})}. Using Lemma 6, we have that

We use this upper bound in the logarithmic term, so that for ϵ0\epsilon\geq 0, we have (37) again, this time with K=Dln(2A+32B4D2+2B2C+8B3DC)=O(lnT14αMβ)K=D\ln(2A+32B^{4}D^{2}+2B^{2}C+8B^{3}D\sqrt{C})=O(\frac{\ln T}{1-\frac{4\alpha M}{\sqrt{\beta}}}).

Hence, we proceed using Lemma 5 to have for ϵ0\epsilon\geq 0

Using Jensen’s inequality on the l.h.s. of last inequality concludes the proof. ∎

2 Adaptive Convergence for Non-Convex Functions

We now prove that the generalized AdaGrad stepsizes in (2) allow a faster convergence of the gradients to zero when the noise over the gradients is small.

Given that SGD is not a descent method, we are not aware of any result of convergence with an explicit rate for the last iterate for non-convex functions. Hence, here we will prove a convergence guarantee for the best iterate over TT iterations rather than for the last one. Note that choosing a random stopping time as in Ghadimi and Lan (2013) would be equivalent in expectation to choose the best iterate. For simplicity, we choose to state the theorem for the best iterate.

Assume (H1, H3, H4’). Let the stepsizes set as (2), where α,β>0\alpha,\beta>0, ϵ(0,12)\epsilon\in(0,\frac{1}{2}), and 2αM<β12+ϵ2\alpha M<\beta^{\frac{1}{2}+\epsilon}. Then, the iterates of SGD satisfies the following bound

where γ={O(1+α2lnTα(12αβ))for ϵ=0O(1+α2(1ϵ+σ2lnT)α(12αβ\nicefrac12+ϵ))for ϵ>0.\gamma=\begin{cases}O\left(\frac{1+\alpha^{2}\ln T}{\alpha(1-\frac{2\alpha}{\sqrt{\beta}})}\right)&\text{for }\epsilon=0\\ O\left(\frac{1+\alpha^{2}(\frac{1}{\epsilon}+\sigma^{2}\ln T)}{\alpha(1-\frac{2\alpha}{\beta^{\nicefrac{{1}}{{2}}+\epsilon}})}\right)&\text{for }\epsilon>0.\end{cases}

As in the previous Section, using Markov’s inequality it’s easy to get that, with probability at least 1δ1-\delta,

This theorem mirrors Theorem 3, proving again a convergence rate that is adaptive to the noise level. Hence, the same observations on adaptation to the noise level and convergence hold here as well. The main difference w.r.t. Theorem 3 is that here we only prove that the gradients are converging to zero rather than the suboptimality gap, because we do not assume convexity.

Note that such bounds were already known with an oracle tuning of the stepsizes, in particular with the knowledge of the variance of the noise, see, e.g., Ghadimi and Lan (2013). In fact, the required stepsize in the deterministic case must be constant, while it has to be of the order of O(1σt)O(\frac{1}{\sigma\sqrt{t}}) in the stochastic case. However, here we obtain the same behavior automatically, without having to estimate the variance of the noise, thanks to the adaptive stepsizes. This shows for the first time a clear advantage of the global generalized AdaGrad stepsizes over plain SGD.

For simplicity, denote by Δ:=t=1Tf(xt)2\Delta:=\sum_{t=1}^{T}\|\nabla f(\boldsymbol{x}_{t})\|^{2}.

Using Lemma 8, we can upper bound the expected sum in the r.h.s. of last inequality. When ϵ>0\epsilon>0, we have

With similar methods in the proof of Theorem 3, we lower bound the l.h.s. of both (46) and (47) with

where KK will be defined separately for the case ϵ=0\epsilon=0 and ϵ>0.\epsilon>0.

where in this case we define K=αM4ϵβ2ϵα(12αMβ\nicefrac12+ϵ).K=\frac{\frac{\alpha^{M}}{4\epsilon\beta^{2\epsilon}}}{\alpha(1-\frac{2\alpha M}{\beta^{\nicefrac{{1}}{{2}}+\epsilon}})}. Proceeding in the same way, when ϵ=0\epsilon=0, we have

where A=β+2Tσ2A=\sqrt{\beta+2T\sigma^{2}}, B=2B=\sqrt{2}, D=αM12αMβD=\frac{\alpha M}{1-\frac{2\alpha M}{\sqrt{\beta}}}, C=β(f(x1)f)+2α(1+lnT)σ2αβ(12αMβ)C=\frac{\beta(f(\boldsymbol{x}_{1})-f^{\star})+2\alpha(1+\ln T)\sigma^{2}}{\alpha\beta(1-\frac{2\alpha M}{\sqrt{\beta}})}.

Similar with Theorem 3, we use this upper bound in the logarithmic term so that for ϵ0\epsilon\geq 0, we have (52) again, this time with K=Dln(2A+32B4D2+2B2C+8B3DC)=O(lnT12αMβ).K=D\ln(2A+32B^{4}D^{2}+2B^{2}C+8B^{3}D\sqrt{C})=O(\frac{\ln T}{1-\frac{2\alpha M}{\beta}}).

Hence, we proceed using Lemma 5 to have for ϵ0\epsilon\geq 0

DISCUSSION AND FUTURE WORK

We have presented an analysis of adaptive stepsizes based on the generalized AdaGrad stepsizes for stochastic gradient descent, with convex and non-convex functions. In the convex setting, our result shows an adaptive convergence rate, also overcoming the limitations of previous results. In the non-convex setting, we show almost sure convergence and adaptive convergence rates. Moreover, we show for the first time sufficient condition for a convergence guarantee for non-convex functions for a minor variation of AdaGrad.

We believe these results have twofold importance. First, we go in the direction of closing the gap between theory and practice for widely used optimization algorithms. Second, our adaptive rates provide a possible explanation for the empirical success of these kinds of algorithms in practical machine learning applications.

One of the limitations of the current analysis is the fact that our analysis implies high probability bounds that depends polynomially on 1δ\frac{1}{\delta}, due to the application of Markov’s inequality. It would be better to prove bounds that depend on ln(1δ)\ln(\frac{1}{\delta}), as they are possible for SGD under conditions (1). However, the generalized AdaGrad updates do not satisfy the conditions (1) and the analysis is not straightforward. Our future work will focus on shedding light on this issue.

In the future, we would also like to understand if the conditions we impose can be weakened. For example, the almost sure convergence requires a bounded support noise, that, while it might be verified in many practical scenarios, still seems unsatisfying from a theoretical point of view. Moreover, we would like to adapt the recent approaches for parameter-free online optimization (Orabona and Pál, 2016; Cutkosky and Orabona, 2018) to the non-convex setting.

The authors thank Dávid Pál for the comments and discussions and Léon Bottou for the comments on prior work. This material is based upon work supported by the National Science Foundation under grant no. 1740762 “Collaborative Research: TRIPODS Institute for Optimization and Learning” and by a Google Research Award.

References

Appendix A Appendix

Here, we report the proofs missing from the main text.

Consider the function f(x)=12x2f(x)=\frac{1}{2}x^{2}. The gradient in tt-th iteration is f(xt)=xt\nabla f(x_{t})=x_{t}. Let the stochastic gradient be defined as gt=f(xt)+ξt\boldsymbol{g}_{t}=\nabla f(x_{t})+\xi_{t}, where P(ξt=σt)=715P(\xi_{t}=\sigma_{t})=\frac{7}{15}, P(ξt=32σt)=15P(\xi_{t}=-\frac{3}{2}\sigma_{t})=\frac{1}{5} and P(ξt=12σt)=13P(\xi_{t}=-\frac{1}{2}\sigma_{t})=\frac{1}{3}.

Let Ai=1t1gi2+βA\triangleq\sum_{i=1}^{t-1}g_{i}^{2}+\beta. Then

This expression can be negative, for example, setting xt=1x_{t}=1, σt=10\sigma_{t}=10, A=10A=10, ϵ=0\epsilon=0 or ϵ=0.1\epsilon=0.1.

A.2 Proof of Lemma 2

Let ai0,,Ta_{i}\geq 0,\cdots,T and f:[0,+)[0,+)f:[0,+\infty)\rightarrow[0,+\infty) nonincreasing function. Then

Summing over i=1,,Ti=1,\cdots,T, we have the stated bound. ∎

A.3 Proofs of Section 6.1

Take y=1Mf(x)\boldsymbol{y}=-\frac{1}{M}\nabla f(\boldsymbol{x}), to have

If ABxA\leq Bx, then xC(2Bx)12+ϵx\leq C(2Bx)^{\frac{1}{2}+\epsilon}, so x[C(2B)12+ϵ]11/2ϵx\leq\left[C(2B)^{\frac{1}{2}+\epsilon}\right]^{\frac{1}{1/2-\epsilon}}. And if A>BxA>Bx, then x<C(2A)12+ϵx<C(2A)^{\frac{1}{2}+\epsilon}. Taking the maximum of the two cases, we have the stated bound. ∎

On the other hand, if BxABx\leq A, we have xABx\leq\frac{A}{B}. Taking the sum of these two case, we have the stated bound. ∎

Let f(x)=(x+y)pxpypf(x)=(x+y)^{p}-x^{p}-y^{p}. We can see that f(x)=p(x+y)p1pxp10f^{\prime}(x)=p(x+y)^{p-1}-px^{p-1}\leq 0 when x,y0x,y\geq 0. So f(x)f(0)=0f(x)\leq f(0)=0. The inequality holds. ∎

If x>0x>0, α>0\alpha>0, then ln(x)α(x1α1)\ln(x)\leq\alpha(x^{\frac{1}{\alpha}}-1).

Let f(x)=ln(x)αx1α+αf(x)=\ln(x)-\alpha x^{\frac{1}{\alpha}}+\alpha. f(x)=1xx1α1f^{\prime}(x)=\frac{1}{x}-x^{\frac{1}{\alpha}-1} is positive when 0<x<10<x<1, f(1)=0f^{\prime}(1)=0 and f(x)<0f^{\prime}(x)<0 when x>1x>1. So f(x)f(1)=0.f(x)\leq f(1)=0. The inequality holds. ∎

Using the assumption on the noise, we have

where in second inequality we used Lemma 2 and in fourth one we used (58). Note that the analysis after the second inequality also holds when ϵ=0\epsilon=0.

where in first inequality we used Lemma 10 and in the third one we used Jensen’s inequality. Putting things together, we have

A.4 Proofs of Section 5

Taking the conditional expectation with respect to ξ1,,ξt1\xi_{1},\cdots,\xi_{t-1}, we have that

Hence, from the law of total expectation, we have

Summing over t=1t=1 to TT and lower bounding f(xT+1)f(\boldsymbol{x}_{T+1}) with ff^{\star}, we have the stated bound. ∎

Since the series t=1at\sum_{t=1}^{\infty}a_{t} diverges, given that t=1atbt\sum_{t=1}^{\infty}a_{t}b_{t} converges, we necessarily have lim inftbt=0\liminf_{t\rightarrow\infty}b_{t}=0. So there exists a subsequence {bi(t)}\{b_{i(t)}\} of {bt}\{b_{t}\} such that limtbi(t)=0.\lim_{t\to\infty}b_{i(t)}=0.

Let us proceed by contradiction and assume that there exists some α>0\alpha>0 and some other subsequence {bm(t)}\{b_{m(t)}\} of {bt}\{b_{t}\} such that bm(t)αb_{m(t)}\geq\alpha for all tt. In this case, we can construct a third subsequence {bj(t)}\{b_{j(t)}\} of {bt}\{b_{t}\} where the subindices j(t)j(t) are chosen in the following way:

Note that the existence of {bi(t)}\{b_{i(t)}\} and {bm(t)}\{b_{m(t)}\} guarantees that j(t)j(t) is well defined. Also by \eqrefeq:indice2\eqref{eq: indice_2} and \eqrefeq:indice3\eqref{eq: indice_3}

Then, denoting ϕt=l=2tj(2t+1)1al\phi_{t}=\sum_{l=2t}^{j(2t+1)-1}a_{l}, we have

Therefore, we have limtϕt=0.\lim_{t\to\infty}\phi_{t}=0.

On the other hand, by \eqrefeq:indice2\eqref{eq: indice_2} and \eqrefeq:indice3\eqref{eq: indice_3}, we have bj(2t)αb_{j(2t)}\geq\alpha, bj(2t+1)1αb_{j(2t+1)}\leq\frac{1}{\alpha}, so that

So ϕtα2K\phi_{t}\geq\frac{\alpha}{2K}, which is in contradiction with limtϕt=0.\lim_{t\to\infty}\phi_{t}=0. Therefore, btb_{t} goes to zero.

We proceed similarly to the proof of Theorem 1, to get

where the last inequality comes from the same reasoning in (12). Hence, we have

Now, observe that the Lipschitzness of ff and the bounded support of the noise on the gradients gives

Using the fact the ff is LL-Lipschitz and MM-smooth, we also have

For the second statement, observe that, with probability 1,

Hence, noting that t=11t=\sum_{t=1}^{\infty}\frac{1}{t}=\infty, we have that liminft((f(xt))j)2t\nicefrac12ϵ=0\lim\inf_{t\rightarrow\infty}((\nabla f(\boldsymbol{x}_{t}))_{j})^{2}t^{\nicefrac{{1}}{{2}}-\epsilon}=0. ∎