Momentum Improves Normalized SGD

Ashok Cutkosky, Harsh Mehta

Non-Convex Stochastic Optimization

The rise of deep learning has focused research attention on the problem of solving optimization problems that are high-dimensional, large-scale, and non-convex. Modern neural networks can have billions of parameters (high-dimensional) (Raffel et al., 2019; Shazeer et al., 2017), are trained using datasets containing millions of examples (large scale) (Deng et al., 2009) on objective functions that are non-convex. Because of these considerations, stochastic gradient descent (SGD) has emerged as the de-facto method-of-choice for training deep models. SGD has several properties that make it attractive for this setting. First, it can operate in a streaming manner by running over the large dataset while only keeping a few examples in memory at any one time. Second, SGD has a dimension-free convergence guarantee for finding critical points on non-convex objectives. This means that SGD’s convergence rate is independent of the dimension of the problem, allowing it to scale more easily to massive neural networks. In this paper we will investigate a variant of SGD that incorporates gradient normalization and momentum, which are both commonly used empirical modifications to SGD that are poorly understood in the non-convex setting.

A common way to analyze of algorithms for training neural nets is through the lens of stochastic optimization. In this setting, we are interested in minimizing a function:

where $\|\cdot\|$ indicates the standard 2-norm. The expectation is taken over the randomness of the stochastic gradient oracle and any randomness inherent in the algorithm itself. In order to make the problem tractable, we make three structural assumptions. First, we assume that $F$ is $L$ -smooth, which means that for all $\vec{w}$ and $\vec{x}$ ,

Second, we assume that the objective $F$ is bounded from below, and without loss of generality (since our algorithms will only access gradients rather than function values), we may assume it is positive:

Finally, we assume that the stochastic gradient oracle has bounded variance:

Under these assumptions, SGD with an optimally tuned learning rate ensures that after $T$ iterations, we can output a point $\vec{w}$ such that (Ghadimi & Lan, 2013):

Spurred by the empirical success of SGD, there has been a tremendous amount of work in designing modifications that improve upon its performance in various ways. However, it has recently been shown that the $O(1/T^{1/4})$ rate is optimal in the worst-case (Arjevani et al., 2019), and so improvements must either make more assumptions about the problem setting, or provide an algorithm that can somehow interpolate between a worst-case and non-worst-case problem.

One popular modification to SGD is the use of adaptive learning rates, popularized by AdaGrad (Duchi et al., 2010). These learning rates enable SGD to converge faster when the objective under consideration is in some technical sense “easier” than a worst-case objectiveIn this paper we use “adaptive” in a statistical sense of the word: an algorithm is adaptive if it automatically adjusts itself to some unknown parameter, such as the variance of the gradients. This is different from the idea of using a different learning rate for each dimension of an optimization problem that was also popularized by Duchi et al. (2010)., specifically when the variance of the loss from one example to another is small (Li & Orabona, 2019; Ward et al., 2019). The success of AdaGrad has inspired a huge number of related algorithms, including notably the Adam algorithm (Kingma & Ba, 2014).

One of the key improvements added to AdaGrad by Adam is the use of momentum. Inspired by the heavy-ball and acceleration algorithms in convex optimization (Polyak, 1964; Nesterov, 1983), momentum attempts to improve the convergence rate on non-convex objectives by modifying the update to have the form:

Intuitively, the value $\vec{m}$ is holding a running average of the past gradient values, and the hope is that this style of update may provide some kind of better stability or conditioning that enables improvements over the base SGD. Momentum has had dramatic empirical success, but although prior analyses have considered momentum updates (Reddi et al., 2018; Zaheer et al., 2018), none of these have shown a strong theoretical benefit in using momentum, as their bounds do not improve on (1).

Finally, a third popular modification is the use of normalized updates. For example, the LARS (You et al., 2017) and LAMB (You et al., 2019) optimizers use updates similar to:

Intuitively, this style of update attempts to capture the idea that in non-convex objectives, unlike convex ones, the magnitude of the gradient provides less information about the value of the function, while the direction still indicates the direction of steepest descent. However, all analyses of normalized updates we know of (e.g. You et al. (2017, 2019); Hazan et al. (2015)) require the variance of the gradient oracle to be very small, or, equivalently, for the algorithm to make use of an extremely large batch-size in order to achieve any convergence guarantees. Intuitively, this requirement arises because the normalization can inflate very small errors: even if $m_{t}=\nabla F(\vec{w}_{t})+\zeta$ for some small error $\zeta$ , it might be that $\frac{m_{t}}{\|m_{t}\|}$ is actually very far from $\frac{\nabla F(\vec{w}_{t})}{\|\nabla F(\vec{w}_{t})\|}$ . Nevertheless, this update has also been used to empirically accelerate training neural networks.

From a more theoretical perspective, a recent approach to going beyong the SGD rate (1) is to add some additional structural assumptions to the loss. One appealing assumption is second-order smoothness, in which we assume that the third derivative of $F$ has bounded operator norm. Equivalently, for all $\vec{w}$ , and $\vec{z}$ , we have

Many previous works have achieved improved results by utilizing this assumption in concert with stronger oracles, such as evaluating two gradients per each example or evaluating Hessian-vector products (Allen-Zhu, 2018; Tripuraneni et al., 2018). However, using our same stochastic gradient oracle model and this second-order smoothness assumption, it was recently proven (Fang et al., 2019) that a variant of SGD can achieve a convergence rate of

where $d$ is the dimension of the problem.In fact, this result provided a convergence to a local minimum, which is stronger than a critical point. This break-through result shows that even for fairly high-dimensional problems, SGD can obtain faster convergence than the initial analysis suggests. However, in modern deep learning architectures, $d$ can be on the order of billions (Radford et al., 2019). In this regime, it may easily hold that the logarithmic term is large enough that this new analysis of SGD does not actually suggest improved performance over the previous $O(1/T^{1/4})$ rate. To deal with extremely high-dimensional regimes, we would like a convergence rate that is completely dimension-free.

Finally, another technique that also suggests a connection to momentum in reducing variance is implicit gradient transport (Arnold et al., 2019). Implicit gradient transport is a very recent discovery that provably reduces the variance in gradient estimates in the special case that the Hessian of the objective is constant (e.g. if the objective is a linear regression problem). The original presentation considers the following version of momentum:

To gain some intuition for why this is a good idea when the Hessian is constant, suppose we have $\vec{m}_{t-1}=\nabla F(\vec{w}_{t-1})+X$ where $X$ is some mean-zero random variable with variance $\sigma^{2}/t$ . Let us also write $\nabla f(\vec{w}_{t}+t(\vec{w}_{t}-\vec{w}_{t-1}),\mathbf{\xi}_{t})=\nabla F(\vec{w}_{t}+t(\vec{w}_{t}-\vec{w}_{t-1}))+Y$ where $Y$ is also a mean-zero random variable with variance $\sigma^{2}$ . Then since the Hessian is constant, we have:

Notice that $\frac{tX+Y}{t+1}$ has variance $\sigma^{2}/(t+1)$ , so that by induction, we will have for all $t$ that $\vec{m}_{t}$ is an unbiased estimate of $\nabla F(\vec{w}_{t})$ with variance only $\sigma^{2}/(t+1)$ . This technique shows great promise, but the current theoretical analysis is limited to convex quadratic objectives.

In this paper we address some of these preceding issues. First, we show that the normalized stochastic gradient descent update with momentum does not require small variance or large batches in order to match the convergence rate of SGD. Next, we tackle the case of second-order smooth objectives. For this, we introduce a further modification of the momentum update based on the implicit gradient transport idea. We show that combining this newer kind of momentum with normalized gradient descent enables a dimension-free convergence rate of $O(1/T^{2/7})$ . Moreover, we feel that our new analysis is substantially more straightforward than prior work, as demonstrated by the number of pages required to prove our results. Finally, we propose an adaptive version of our algorithm and show that this final algorithm’s convergence guarantee automatically improves when the stochastic gradient oracle has small variance. We hope our results can shed some light in the empirical success of momentum in training deep networks.

In addition to the theoretical discussion, we also demonstrate effectiveness of our method NIGT (pronounced “night”) on popular deep learning tasks. We show comparisons of our method on 1) BERT pretraining and 2) Resnet-50. Adam is typically used to achieve state of the art results on BERT task (Devlin et al., 2019) whereas SGD is used for Resnet-50 on Imagenet dataset since Adam fails to perform well on it (You et al., 2019; Anil et al., 2019). We show comparison with the stronger baseline in each case. Finally, since momentum is the only significant slot variable kept, our method incurs greatly reduced memory overhead compared to other methods such as Adam. This is a significant practical advantage due to the continuing trend of increasingly large models (Shazeer & Stern, 2018; Anil et al., 2019).

The rest of this paper is organized as follows: in Section 2, we provide our first result showing that momentum can be used to “rescue” normalized SGD in the high-variance/small-batch regime. In Section 3, we expand upon this theory to show that incorporating a momentum update inspired by Arnold et al. (2019) allows for improved convergence rates when the objective is second-order smooth, and we proceed to incorporate adaptivity into our analysis in Section 4. In Section 5 we describe our implementation and experimental study, and we provide some concluding remarks and open problems in Section 6.

Normalized SGD with Momentum

Suppose $F$ and $\mathcal{D}$ satisfy the assumptions (A1), (A2) and (A3). Let $\vec{w}_{1}$ be some given initial point and set $\vec{m}_{1}=\nabla f(\vec{w}_{1},\xi_{1})$ . Let $R$ be any upper bound on $F(\vec{w}_{1})$ . Set $\alpha=\min\left(\frac{\sqrt{RL}}{\sigma\sqrt{T}},1\right)$ and $\eta=\frac{\sqrt{R\alpha}}{\sqrt{TL}}$ Then let $\vec{w}_{t}$ be the iterates produced by the recurrences (2) and (3) with $\eta_{t}=\eta$ and $\beta_{t}=1-\alpha$ for all $t$ . Then the average norm of $\|\nabla F(\vec{w}_{t})\|$ satisfies:

Before proving this Theorem, we provide a general technical Lemma about each step of normalized SGD:

uppose $F$ and $\mathcal{D}$ satisfy the assumptions (A1) and (A3). Let $\vec{w}_{1}$ be some initial point, and consider the updates:

where $\hat{g}_{t}$ is generated by some arbitrary process or algorithm. Let $\hat{\epsilon}_{t}=\hat{g}_{t}-\nabla F(\vec{w}_{t})$ . Then we have

In particular, if $F$ also satisfies (A2) and $\eta_{t}$ is a constant $\eta$ for all $t$ , we have:

Now consider two cases, either $\|\nabla F(\vec{w}_{t})\|\geq 2\|\hat{\epsilon}_{t}\|$ or not. In the former case, we have

So that the same bound holds in both cases. Putting everything together now proves the first statement of the Lemma. The second statement follows by summing over $T$ , observing that the left-hand-side of the equation telescopes, and rearranging. ∎

By this Lemma, if we can show that our choices for $\beta$ and $\eta$ result in a small value for $\sum_{t=1}^{T}\|\hat{\epsilon}_{t}\|$ , then setting $\eta$ appropriately will prove Theorem 1. We carry out this agenda in more detail below:

Define $\hat{\epsilon}_{t}=\vec{m}_{t}-\nabla F(\vec{w}_{t})$ and define $\epsilon_{t}=\nabla f(\vec{w}_{t},\xi_{t})-\nabla F(\vec{w}_{t})$ . Notice that by our assumptions, we have

Further, define $S(a,b)=\nabla F(a)-\nabla F(b)$ . By smoothness, we must have $\|S(\vec{w}_{t},\vec{w}_{t+1})\|\leq L\|\vec{w}_{t}-\vec{w}_{t+1}\|=\eta L$ for all $t$ . With this notation, we have the following recursive formulation for any $t\geq 1$ :

Now, we unravel the recursion for $t$ iterations:

Next, take the magnitude of both sides and use triangle inequality:

Where in the above we have observed that $\hat{\epsilon}_{1}=\epsilon_{1}$ . Now take expectation, and use the fact that any two $\epsilon_{i}$ are uncorrelated and apply Jensen inequality to obtain:

Where we have used $\alpha\leq 1$ . Set $\alpha=\min\left(\frac{\sqrt{RL}}{\sigma\sqrt{T}},1\right)$ and $\eta=\frac{\sqrt{R\alpha}}{\sqrt{TL}}$ . Observe from the setting of $\eta$ that we have:

Now suppose $\alpha=1$ . This implies that $\sigma\leq\frac{\sqrt{RL}}{\sqrt{T}}$ . Therefore the RHS of the above expression is bounded by $29\sqrt{RLT}$ . On the other hand, if $\alpha=\frac{\sqrt{RL}}{\sigma\sqrt{T}}$ , the RHS is bounded by

Adding these expressions yields the desired result. ∎

Faster Convergence with Second-Order Smoothness

Now that we have seen a simpler example of how to analyze the convergence of normalized SGD with momentum in the non-convex setting, we can proceed to consider the more advanced case that $F$ is second-order smooth. Notice that in the proof of Theorem 1, we made use of the quantity $S(a,b)=\nabla F(a)-\nabla F(b)$ , which satisfies $\|S(a,b)\|\leq L\|a-b\|$ . In this section, we will need a related quantity:

Assuming $F$ satisfies (A4), Taylor’s theorem implies that $Z$ satisfies $\|Z(a,b)\|\leq\rho\|a-b\|^{2}$ . The improved dependency on $\|a-b\|$ from linear to quadratic will allow us to achieve a faster convergence rate.

Without further ado, we present our algorithm NIGT in Algorithm 1 below:

The auxiliary variable $\vec{x}_{t}$ is introduced for notational convenience - it can be recomputed every iteration from $\vec{w}_{t}$ and $\vec{w}_{t-1}$ . Notice that, with $\beta_{t}=\frac{t}{t+1}$ , the update for $\vec{m}_{t}$ corresponds exactly to the implicit gradient transport update proposed by Arnold et al. (2019). We will instead use the different setting $\beta=1-O(T^{-4/7})$ . We show that this value of $\beta$ , combined with the normalized SGD update, enables faster convergence on second-order smooth non-convex objectives. This significantly extends the prior results of Arnold et al. (2019), which hold only on convex quadratic objectives with constant Hessian, and helps realize the theoretical potential of the implicit gradient transport idea.

Suppose $f$ and $\mathcal{D}$ satisfies the assumptions (A1), (A2), (A3), and (A4). Let $\vec{w}_{1}$ be an arbitrary starting point, and set $\vec{m}_{1}=\nabla f(\vec{w}_{1},\xi_{1})$ . Let $R\geq F(\vec{w}_{1})$ and set $\eta=\min\left(\frac{R^{5/7}}{T^{5/7}\rho^{1/7}\sigma^{4/7}},\sqrt{\frac{R}{TL}}\right)$ and $\alpha=\min\left(\frac{R^{4/7}\rho^{2/7}}{T^{4/7}\sigma^{6/7}},1\right)$ . Suppose $\vec{w}_{1},\dots,\vec{w}_{T}$ are the iterates produced by Algorithm 1 with $\eta_{t}=\eta$ and $\beta_{t}=1-\alpha$ for all $t$ . Then the average expected gradient norm satisfies:

In particular, the dependence on $T$ is $O(1/T^{2/7})$ .

Our proof begins very similarly to that of Theorem 1. We define $\hat{\epsilon}_{t}=m_{t}-\nabla F(\vec{w}_{t})$ , and $\epsilon_{t}=\nabla f(\vec{x}_{t},\xi_{t})-\nabla F(\vec{x}_{t})$ . Then we have the following recursive formulation:

Here we have already used the key insight of implicit gradient transport, as well as our second-order smoothness assumption. The carefully constructed equation for $\vec{x}_{t}$ is such that the $\nabla^{2}F(\vec{w}_{t+1})$ terms cancel out, and we keep track of the error introduced by having a non-constant Hessian in the $Z$ function. Next, we unroll the recursion to obtain:

Now, recall that $Z(a,b)\leq\rho\|a-b\|$ . This implies:

Now just as in the proof of Theorem 1, we take magnitudes, use triangle inequality, and take expectations, but this time we use the bounds on $Z$ :

Where we have used $\hat{\epsilon}_{1}=\epsilon_{1}$ in the first line. Now sum over $t$ to obtain:

Now set $\eta=\min\left(\frac{R^{5/7}}{T^{5/7}\rho^{1/7}\sigma^{4/7}},\sqrt{\frac{R}{TL}}\right)$ , so that we have:

Further, set $\alpha=\min\left(\frac{R^{4/7}\rho^{2/7}}{T^{4/7}\sigma^{6/7}},1\right)$ . Notice that in all cases we have $T\sqrt{\alpha}\sigma\leq R^{2/7}\rho^{1/7}\sigma^{4/7}T^{5/7}$ , so we can conclude,

Let us consider the case that $\alpha=1$ . In this case, the last two terms in (4) are zero, so we have:

Next, suppose $\alpha=\frac{R^{4/7}\rho^{2/7}}{T^{4/7}\sigma^{6/7}}$ . Then instead we use the fact that $\eta=\frac{R^{5/7}}{T^{5/7}\rho^{1/7}\sigma^{4/7}}$ to refine (4) as:

Taking the maximum of (5) and (6) yields the desired convergence guarantee. ∎

Adaptive Algorithm

In the previous sections, we considered the case of constant learning rate $\eta$ and momentum $\beta$ parameters. We showed that with appropriate settings, we can obtain fast convergence rates. In this section, we go further and consider varying $\eta$ and $\beta$ values. In particular, we will provide an adaptive schedule in which $\eta$ and $\beta$ are adjusted on-the-fly based on the observed gradients. This adaptive schedule yields a convergence guarantee that automatically improves when $\sigma$ is small, without knowing $\sigma$ . In order to do this, we will need to sample two gradients at each point $\vec{x}_{t}$ . The two gradient samples are independent, so this still fits within our stochastic gradient oracle model of computation. We denote the second independent gradient estimate evaluated at the point $\vec{x}_{t}$ as $\nabla f(\vec{x}_{t},\xi_{t}^{\prime})$ . Finally, we will add another assumption: the function $F$ (or at least the values of $F(\vec{x}_{t})$ ) should be bounded above by some constant $M$ . With this notation, our adaptive algorithm is presented in Algorithm 2 and analyzed in Theorem 4 below.

Suppose that $f$ and $\mathcal{D}$ satisfies (A1), (A2), (A3), and (A4). Further, suppose $\|\nabla f(\vec{w},\xi)\|\leq\mathfrak{g}$ for all $\vec{w}$ with probability 1, and that $F(\vec{w})\leq M$ for all $\vec{w}$ for some $M$ . Then Algorithm 2 guarantees:

Experiments

Now, we turn to experimental evaluation of the proposed method NIGT on two popular large-scale deep learning benchmarks: BERT pretraining and ResNet-50. First we explain the setup and choice of hyperparameters for our method, and then elucidate both tasks and the details on the baseline used for each task.

We implemented our algorithm in the Tensorflow framework. For simplicity, we implemented a per-layer version of our algorithm, normalizing the gradients for each layer in the network, rather than normalizing the full gradient. Taking our cue from defaults from previous empirical literature on momentum, we set the $\beta$ parameter to 0.9 for NIGT for both BERT and ResNet-50. For BERT, we stick with the learning rate schedule used for Adam in Devlin et al. (2019) i.e linear warmup and polynomial decay of $\eta_{t}=\eta_{0}*(1-\frac{t}{T})$ . Whereas for ResNet-50, we found that linear warmup and polynomical decay of $\eta_{t}=\eta_{0}*(1-\frac{t}{T})^{2}$ worked best (You et al., 2017). We performed a grid search on base learning rate $\eta_{0}\in[10^{-5},10^{-4},10^{-3},10^{-2},10^{-1},10^{0}]$ for both the tasks. In our implementation, we also scale the learning rate with the norm of the weights for each layer similar to You et al. (2017). We did not normalize gradients for bias, batch normalization and layer normalization parameters, and we scaled their learning rates by a factor of 1000. All our experiments were conducted on a TPUv3 architecture.

2 BERT pretraining

We replicate most of the setup created by Devlin et al. (2019) for our BERT baseline. Our text corpus is a concatenation of Wikipedia and BooksCorpus, with a total of roughly 3.3B words. We use examples of sequence length 512 with token masking rate of 0.15 and train the BERT-Base model with Masked Language Modeling target accuracy of 70.0% for 500k steps. Fig 1 compares masked language modeling accuracy resulting with our method vs the baseline. We set NIGT base learning rate $\eta_{0}$ to 0.001 and batch size to 256. We replicate the hyperparameters used for our baseline method Adam exactly from Devlin et al. (2019) i.e. step size $\alpha=0.0001$ , $\beta_{1}=0.9$ , $\beta_{2}=0.99$ .

3 Resnet-50

We train the ImageNet dataset with Resnet-50 architecture and compare our method against the commonly used SGD optimizer with momentum in Fig 1. Others have observed that Adam/Adagrad fails to achieve accuracy attained by SGD on this task (You et al., 2019; Anil et al., 2019). For our experiments, we set the batch size to 1024 and train the models for 90 epochs. We set the base learning rate $\eta_{0}$ for NIGT to 0.01. For the SGD baseline, we stick with the common practice of employing gradual warmup for 5 epochs up to a base learning rate of 0.4 and then divide it by 10 at 30-th, 60-th and 80-th epoch, as done in Goyal et al. (2017); He et al. (2015).

Conclusion and Future Work

We have presented a new analysis of the effects of momentum on normalized SGD, and showed that when the objective is second-order smooth, normalized SGD with momentum can find an $\epsilon$ -critical point in $O(1/\epsilon^{3.5})$ iterations, matching the best-known rates in a dimension-free manner and surprisingly straightforward manner. Further, we provided an adaptive version of our algorithm whose convergence guarantee automatically improves when the underlying distribution has low variance. Finally, we provided an empirical evaluation showing that our algorithm matches the performance of the methods used to achieve state-of-the-art in training ResNet-50 on ImageNet and on BERT pretraining. Notably, these two tasks are often tackled with different optimizers (SGD and Adam respectively). Our method is one of the select few (You et al., 2019; Anil et al., 2019) which performs well on both simultaneously. Also, since NIGT does not maintain any second order statistic, it is significantly more memory efficient compared to popular optimizers such as Adam or LAMB.

Our results provide some intuition that can help explain the practical success of momentum in stochastic non-convex optimization. Prior analyses of momentum have failed to demonstrate significant theoretical benefits, but our results show that if one is willing to posit second-order smoothness, momentum may play a role in accelerating convergence. However, there are several open problems remaining. First, we suspect that the upper bound $M$ imposed on $F$ in our adaptive analysis is unnecessary, and also that the use of an extra gradient sample per iteration can be removed. Second, we note that in the noise-free case with second-order smoothness, the best-known dimension-free convergence rate is actually $O(1/\epsilon^{1.75})$ (Carmon et al., 2017) rather than the $O(1/\epsilon^{2})$ achieved by gradient descent. Is it possible to design an adaptive algorithm that interpolates between this rate and $O(1/\epsilon^{3.5})$ in the noisy setting?

Finally, our implementation employs a few extra learning rate heuristics popular in practice, notably linear warm-up and polynomial decay (Devlin et al., 2019), as well as scaling the learning rate by the norm of the weights (You et al., 2017). We found these heuristics to be better-performing than our adaptive algorithm, despite their lack of theoretical motivation. Understanding the principles underlying the success of these heuristics poses an interesting orthogonal question to understanding momentum, and we hope that our work inspires future investigation into these areas.

References

Appendix A Proof of Theorem 4

In this section we provide the missing proof of Theorem 4, restated below: See 4

Notice from the definition of $\alpha$ that we always have

where we defined $G_{0}=D$ . Thus $\alpha_{t}\leq 1$ always.

We begin with the now-familiar definitions:

Unfortunately, it is no longer clear how to unroll this recurrence to solve for $\hat{\epsilon}_{t}$ in a tractable manner. Instead, we will take a different path, inspired by the potential function analysis of Cutkosky & Orabona (2019). Start with the observations:

Define $K_{t}=\frac{1}{\alpha_{t}^{2}\eta_{t-1}G_{t}}$ . Then we use $(a+b)^{2}\leq(1+1/x)a^{2}+(1+x)b^{2}$ for all $x$ and the fact that $\epsilon_{t}$ is uncorrelated with anything that does not depend on $\xi_{t}$ to obtain:

where in the last inequality we have set $x=\alpha_{t}$ . This implies:

Now, observe that $\delta_{t+1}\leq 2\mathfrak{g}^{2}$ , so that we have

where we have used the fact that $\eta_{t}$ is non-increasing.

Next, we tackle $\rho^{2}\frac{8(1-\alpha_{t})^{2}\eta_{t-1}^{3}}{\alpha_{t}^{5}G_{t}}$ . We have

Now, finally we turn to bounding $-\left(\frac{1}{\alpha_{t-1}^{2}G_{t-1}\eta_{t-2}}-\frac{1}{\alpha_{t}^{2}G_{t}\eta_{t-1}}+\frac{1}{\alpha_{t}G_{t}\eta_{t-1}}\right)\|\epsilon_{t-1}\|^{2}$ . To do this, we first upper-bound $\frac{1}{\alpha_{t}^{2}G_{t}\eta_{t-1}}-\frac{1}{\alpha_{t-1}^{2}G_{t-1}\eta_{t-2}}$ . Note that:

So now we can upper bound $\frac{1}{\alpha_{t}^{2}\eta_{t-1}}-\frac{1}{\alpha_{t-1}^{2}\eta_{t-2}}$ and divide the bound by $G_{t}$ .

Next, we analyze $G_{t-1}^{8/7}-G_{t-2}^{8/7}$ . Recall our definition $\delta_{t}=G_{t}-G_{t-1}$ , and we have $0\leq\delta_{t}\leq 2\mathfrak{g}^{2}$ for all $t$ . Then by convexity of the function $x\mapsto x^{8/7}$ , we have

Then since we set $C$ so that $\frac{26C^{2}\mathfrak{g}^{6/7}}{7}=1/2$ , we obtain:

Putting all this together, we have shown:

Now, define the potential $\Phi_{t}=\frac{3F(\vec{w}_{t+1})}{\eta_{t}}+K_{t+1}\|\hat{\epsilon}_{t+1}\|^{2}$ . Then, by Lemma 2, we obtain:

So summing over $t$ and taking expectations yields:

Now, we examine the term $\sum_{t=1}^{T}8\|\hat{\epsilon}_{t}\|-\frac{\|\hat{\epsilon}_{t}\|^{2}}{2\alpha_{t+1}G_{t+1}\eta_{t}}$ . By Cauchy-Schwarz we have:

Finally, observe that since $G_{t}\geq\mathfrak{g}^{2}t^{1/4}$ , we have $\eta_{t}\leq\frac{C}{\sqrt{T}}$ . Therefore $\sum_{t=1}^{T}\eta_{t}\leq 2C\sqrt{T}$ . Putting all this together again, we have

Observe that we have $\Phi_{0}\leq\frac{3M}{\eta_{0}}+K_{1}\mathfrak{g}^{2}$ .

Let us define $Z=3M+\log(2\mathfrak{g}^{2}(T+1)/D)+\log(T+2)+32(\log(T)+1)$ . Then we have

Now we look carefully at the definition of $G_{t}$ and $\eta_{t}$ . By Jensen inequality, we have