An Improved Analysis of Stochastic Gradient Descent with Momentum

Yanli Liu, Yuan Gao, Wotao Yin

Introduction

Stochastic gradient methods have been a widespread practice in machine learning. They aim to minimize the following empirical risk:

In general, stochastic gradient methods can be written as

The idea behind SGDM originates from Polyak’s heavy-ball method for deterministic optimization. For strongly convex and smooth objectives, heavy-ball method enjoys an accelerated linear convergence rate over gradient descent . However, the theoretical understanding of its stochastic counterpart is far from being complete.

In the case of fixed stepsize and momentum weight, most of the current results only apply to restrictive settings. In and , the behavior of SGDM on least square regression is analyzed and linear convergence is established. analyzes the local convergence rate of SGDM for strongly convex and smooth functions, where the initial point x0x^{0} is assumed to be close enough to the minimizer xx^{*}. provides global convergence of SGDM, but only for objectives with uniformly bounded gradients, thus excluding many machine learning models such as Ridge regression. Very recently, presents a convergence bound of O(1kα+α1β)\mathcal{O}(\frac{1}{k\alpha}+\frac{\alpha}{1-\beta}) for general smooth nonconvex objectivesHere kk is the number of iterations. Note that in , a different but equivalent formulation of SGDM is analyzed; their stepsize γ\gamma is effectively α1β\frac{\alpha}{1-\beta} in our setting.. When β=0\beta=0, this recovers the classical convergence bound of O(1kα+α)\mathcal{O}(\frac{1}{k\alpha}+\alpha) of SGD . However, the size of stationary distribution O(α1β)\mathcal{O}(\frac{\alpha}{1-\beta}) is 11β\frac{1}{1-\beta} times larger than that of SGD. This factor is not negligible, especially when large β\beta values such as 0.990.99 and 0.9950.995 is applied . Therefore, their result does not explain the competitiveness of SGDM compared to SGD. Concurrent to this work, shows that SGDM converges as fast as SGD under convexity and strong convexity, and that it is asymptotically faster than SGD for overparameterized models. Remarkably, their analysis considers a different stepsize and momentum weight schedule from this work, and applies to arbitrary sampling without assuming the bounded variance of the gradient noise.

In deep learning, SGDM is often applied with various parameter tuning rules to achieve efficient training. One of the most widely adopted rules is called “constant and drop", where a constant stepsize is applied for a long period and is dropped by some constant factor to allow for refined training, while the momentum weight is either kept unchanged (usually 0.90.9) or gradually increasing. We call this strategy Multistage SGDM and summarize it in Algorithm 1. Practically, (multistage) SGDM was successfully applied to training large-scale neural networks , and it was found that appropriate parameter tuning leads to superior performance . Since then, (multistage) SGDM has become increasingly popular .

At each stage, Multistage SGDM (Algorithm 1) requires three parameters: stepsize, momentum weight, and stage length. In and , doubling argument based rules are analyzed for SGD on strongly convex objectives, where the stage length is doubled whenever the stepsize is halved. Recently, certain stepsize schedules are shown to yield faster convergence for SGD on nonconvex objectives satisfying growth conditions , and a nearly optimal stepsize schedule is provided for SGD on least square regression . These results consider only the momentum-free case. Another recent work focuses on the asymptotic convergence of SGDM (i.e., without convergence rate) , which requires the momentum weights to approach either or 11, and therefore contradicts the common practice in neural network training. In summary, the convergence rate of Multistage SGDM (Algorithm 1) has not been established except for the momentum-free case, and the role of parameters in different stages is unclear.

In this work, we provide new convergence analysis for SGDM and Multistage SGDM that resolve the aforementioned issues. A comparison of our results with prior work can be found in Table 1.

We show that for both strongly convex and nonconvex objectives, SGDM (2) enjoys the same convergence bound as SGD. This helps explain the empirical observations that SGDM is at least as fast as SGD . Our analysis relies on a new observation that, the update direction mkm^{k} of SGDM (2) has a controllable deviation from the current full gradient f(xk)\nabla f(x^{k}), and enjoys a smaller variance. Inspired by this, we construct a new Lyapunov function that properly handles this deviation and exploits an auxiliary sequence to take advantage of the reduced variance.

Compared to aforementioned previous work, our analysis applies to not only least squares, does not assume uniformly bounded gradient, and improves the convergence bound.

For the more popular SGDM in the multistage setting (Algorithm 1), we establish its convergence and demonstrate that the multistage strategy are faster at initial stages. Specifically, we allow larger stepsizes in the first few stages to boost initial performance, and smaller stepsizes in the final stages decrease the size of stationary distribution. Theoretically, we properly redefine the aforementioned auxiliary sequence and Lyapunov function to incorporate the stagewise parameters.

To the best of our knowledge, this is the first convergence guarantee for SGDM in the multistage setting.

2 Other related work

Nesterov’s momentum achieves optimal convergence rate in deterministic optimization , and has also been combined with SGD for neural network training . Recently, its multistage version has been analyzed for convex or strongly convex objectives . Other forms of momentum for stochastic optimization include PID Control-based methods , Accelerated SGD , and Quasi-Hyperbolic Momentum . In this work, we restrict ourselves to heavy-ball momentum, which is arguably the most popular form of momentum in current deep learning practice.

Notation and Preliminaries

The following assumption is effective throughout, which is standard in stochastic optimization.

Smoothness: The objective f(x)f(x) in (1) is LL-smooth.

Independent samples: the random samples {ζk}k=1\{\zeta_{k}\}_{k=1}^{\infty} are independent.

Unless otherwise noted, all the proof in the paper are deferred to the appendix.

Key Ingredients of Convergence Theory

In this section, we present some key insights for the analysis of stochastic momentum methods. For simplicity, we first focus on the case of fixed stepsize and momentum weight, and make proper generalizations for Multistage SGDM in App. C.

In this section, we make the following observation on the role of momentum:

With a momentum weight β[0,1)\beta\in[0,1), the update vector mkm^{k} enjoys a reduced “variance" of (1β)σ2(1-\beta)\sigma^{2}, while having a controllable deviation from the full gradient gkg^{k} in expectation.

First, without loss of generality, we can take m0=0m^{0}=0, and express mkm^{k} as

mkm^{k} is a moving average of the past stochastic gradients, with smaller weights for older onesNote the sum of weights (1β)i=1kβki=1βk1(1-\beta)\sum_{i=1}^{k}\beta^{k-i}=1-\beta^{k}\rightarrow 1 as kk\rightarrow\infty..

we have the following result regarding the “variance" of mkm^{k}, which is measured between mkm^{k} and its deterministic version (1β)i=1kβkigi(1-\beta)\sum_{i=1}^{k}\beta^{k-i}g^{i}.

Under Assumption 1, the update vector mkm^{k} in SGDM (2) satisfies

Lemma 1 follows directly from the property of the moving average.

On the other hand, (1β)i=1kβkigi(1-\beta)\sum_{i=1}^{k}\beta^{k-i}g^{i} is a moving average of all past gradients, which is in contrast to SGD. It seems unclear how far is (1β)i=1kβkigi(1-\beta)\sum_{i=1}^{k}\beta^{k-i}g^{i} from the ideal descent direction gkg^{k}, which could be unbounded unless stronger assumptions are imposed. Previous analysis such as and make the blanket assumption of bounded f\nabla f to circumvent this difficulty.

In this work, we provide a different perspective to resolve this issue.

From Lemma 2, we know the deviation of 11βk(1β)i=1kβkigi\frac{1}{1-\beta^{k}}(1-\beta)\sum_{i=1}^{k}\beta^{k-i}g^{i} from gkg^{k} is controllable sum of past successive iterate differences, in the sense that the coefficients ak,ia_{k,i} decays linearly for older ones. This inspires the construction of a new Lyapunov function to handle the deviation brought by the momentum, as we shall see next.

2 A new Lyapunov function

Let us construct the following Lyapunov function for SGDM:

In the Lyapunov function (5), {ci}i=1\{c_{i}\}_{i=1}^{\infty} are positive constants to be specified later corresponding to the deviation described in Lemma 2. Since the coefficients in (4) converges linearly to as kk\rightarrow\infty, we can choose {ci}i=1\{c_{i}\}_{i=1}^{\infty} in a diminishing fashion, such that this deviation can be controlled, and LkL^{k} defined in (5) is indeed a Lyapunov function under strongly convex and nonconvex settings (see Propositions 1 and 2).

In (5), zkz^{k} is an auxiliary sequence defined as

This auxiliary sequence first appeared in the analysis of deterministic heavy ball methods in , and later applied in the analysis of SGDM . It enjoys the following property.

Since the coefficients of the deviation in Lemma 2 converges linearly to as kk\rightarrow\infty, we can choose {ci}i=1\{c_{i}\}_{i=1}^{\infty} in a diminishing fashion, such that this deviation can be controlled. Remarkably, we shall see in Sec. 4 that with c1=O(L1β)c_{1}=\mathcal{O}\left(\frac{L}{1-\beta}\right), LkL^{k} defined in (5) is indeed a Lyapunov function under strongly convex and nonconvex settings, and that SGDM converges as fast as SGD.

Now, let us turn to the Multistage SGDM (Algorithm 1), which has been very successful in neural network training. However, its convergence still remains unclear except for the momentum-free case. To establish convergence, we require the parameters of Multistage SGDM to satisfy

where αi,βi,\alpha_{i},\beta_{i}, and TiT_{i} are the stepsize, momentum weight, and stage length of iith stage, respectively, and A1,A2A_{1},A_{2} are properly chosen constants. In principle, one applies larger stepsizes αi\alpha_{i} at the initial stages, which will accelerate initial convergence, and smaller stepsizes for the final stages, which will shrink the size of final stationary distribution. As a result, (7) stipulates that less iterations are required for stages with large stepsizes and more iterations for stages with small stepsizes. Finally, (7) requires the momentum weights to be monotonically increasing, which is consistent with what’s done in practice . often, using constant momentum weight also works.

Under the parameter choices in (7), let us define the auxiliary sequence zkz^{k} by

This {zk}k=1\{z^{k}\}_{k=1}^{\infty} sequence reduces to (6) when a constant stepsize and momentum weight are applied. Furthermore, the observations made in Lemmas 1, 2, and 3 can also be generalized (see Lemmas 4, 5, 6, and 7 in App. C). In Sec. 5. we shall see that with (7) and appropriately chosen {ci}i=1\{c_{i}\}_{i=1}^{\infty}, LkL^{k} in (5) also defines a Lyapunov function in the multistage setting, which in turn leads to the convergence of Multistage SGDM.

Convergence of SGDM

In this section, we proceed to establish the convergence of SGDM described in (2). First, by following the idea presented in Sec. 3, we can show that LkL^{k} defined in (5) is a Lyapunov function.

Let Assumption 1 hold. In (2), let α1β22Lβ+β2\alpha\leq\frac{1-\beta}{2\sqrt{2}L\sqrt{\beta+\beta^{2}}}. Let {ci}i=1\{c_{i}\}_{i=1}^{\infty} in (5) be defined by

By telescoping (9), we obtain the stationary convergence of SGDM under nonconvex settings.

Let Assumption 1 hold. In (2), let ααmin{1βL(4β+β2),1β22Lβ+β2}\alpha\leq\alpha\leq\min\{\frac{1-\beta}{L(4-\beta+\beta^{2})},\frac{1-\beta}{2\sqrt{2}L\sqrt{\beta+\beta^{2}}}\}. Then,

Now let us turn to the strongly convex setting, for which we have

Let Assumption 1 hold. Assume in addition that ff is μ\mu-strongly convex. In (2), let αmin{1β5L,1βL(3β+2β2+48β252L+18μL)}\alpha\leq\min\{\frac{1-\beta}{5L},\frac{1-\beta}{L\left(3-\beta+2\beta^{2}+\frac{48\sqrt{\beta}}{25}\frac{2L+18\mu}{L}\right)}\}. Then, there exists positive constants cic_{i} for (5) such that for all kk0log0.5logβk\geq k_{0}\coloneqq\lfloor\frac{\log 0.5}{\log\beta}\rfloor, we have

The choices of {ci}i=1\{c_{i}\}_{i=1}^{\infty} is similar to those of Proposition 1 and can be found in App. B.4. With Proposition 2, we immediately have

Let Assumption 1 hold and assume in addition that ff is μ\mu-strongly convex. Under the same settings as in Proposition 2, for all kk0=log0.5logβk\geq k_{0}=\lfloor\frac{\log 0.5}{\log\beta}\rfloor we have

Let Assumption 1 hold and assume in addition that ff is μ\mu-strongly convex. Under the same settings as in Proposition 2, for all kk0=log0.5logβk\geq k_{0}=\lfloor\frac{\log 0.5}{\log\beta}\rfloor we have

Under nonconvex settings, the classical convergence bound of SGD is O(f(x1)fkα+Lασ2)\mathcal{O}\left(\frac{f(x^{1})-f^{*}}{k\alpha}+L\alpha\sigma^{2}\right) with α=O(1L)\alpha=\mathcal{O}(\frac{1}{L}) (see, e.g., Theorem 4.8 of ). Therefore, Theorem 1 tells us that with α=O(1βL)\alpha=\mathcal{O}(\frac{1-\beta}{L}), SGDM achieves the same convergence bound as SGD.

In contrast, the radius of the stationary distribution for SGDM in and is O(ασ21β)\mathcal{O}(\frac{\alpha\sigma^{2}}{1-\beta}), and the latter one also assumes that f\nabla f is uniformly bounded.

In Theorem 2 and Corollary 1, the convergence bounds hold for kk0=log0.5logβk\geq k_{0}=\lfloor\frac{\log 0.5}{\log\beta}\rfloor, where k0k_{0} is a mild constantFor example, we have k0=6k_{0}=6 for the popular choice β=0.9\beta=0.9..

The convergence bound of SGD under strong convexity is O((1αμ)k+Lμασ2)\mathcal{O}\left((1-\alpha\mu)^{k}+\frac{L}{\mu}\alpha\sigma^{2}\right) (see, e.g, Theorem 4.6 of ), our result for SGDM in Corollary 1 recovers this when β=0\beta=0.

Convergence of Multistage SGDM

In this section, we switch to the Multistage SGDM (Algorithm 1).

Let us first show that when the (7) is applied, we can define the constants cic_{i} properly so that (5) still produces a Lyapunov function.

Let Assumption 1 hold. In Algorithm 1, let the parameters satisfy (7) with A1=1242LA_{1}=\frac{1}{24\sqrt{2}L}. In addition, let

Then, we have ci>0c_{i}>0 for any i1i\geq 1. Furthermore, with zkz^{k} defined in (8), for any k1k\geq 1, we have

where α(k),β(k)\alpha(k),\beta(k) are the stepsize and momentum weight applied at kkth iteration, respectively.

Let Assumption 1 hold. Under the same settings as in Proposition 3, let β112\beta_{1}\geq\frac{1}{2} and let A2A_{2} be large enough such that

On the left hand side of (11), we have the average of the averaged squared gradient norm of nn stages.

On the right hand side of (11), the first term dominates at initial stages, we can apply large αi\alpha_{i} for these stages to accelerate initial convergence, and use smaller αi\alpha_{i} for later stages so that the size of stationary distribution is small. In contrast, (static) SGDM need to use a small stepsize α\alpha to make the size of stationary distribution with small.

It is unclear whether the overall iteration complexity of Multistage SGDM is better than SGDM or not. Numerically, we do observe that Multistage SGDM is faster. We leave the possible improved analysis of Multistage SGDM to future work.

Experiments

In this section, we verify our theoretical claims by numerical experiments. For each combination of algorithm and training task, training is performed with 33 random seeds 1,2,31,2,3. Unless otherwise stated, we report the average of losses of the past mm batches, where mm is the number of batches for the whole dataset. Additional implementation details can be found in App. E.

Setup. The MNIST dataset consists of n=60000n=60000 labeled examples of 28×2828\times 28 gray-scale images of handwritten digits in K=10K=10 classes 0,1,,90,1,\dots,9. For all algorithms, we use batch size s=64s=64 (and hence number batches per epoch is m=1874m=1874), number of epochs T=50T=50. The regularization parameter is λ=5×104\lambda=5\times 10^{-4}.

The effect of α\alpha in (static) SGDM. By Theorem 2 we know that, with a fixed β\beta, a larger α\alpha leads to faster loss decrease to the stationary distribution. However, the size of the stationary distribution is also larger. This is well illustrated in Figure 1. For example, α=1.0\alpha=1.0 and α=0.5\alpha=0.5 make losses decrease more rapidly than α=0.1\alpha=0.1. During later iterations, α=0.1\alpha=0.1 leads to a lower final loss.

Multistage SGDM. We take 33 stages for Multistage SGDM. The parameters are chosen according to (7): T1=3,T2=6,T3=21T_{1}=3,T_{2}=6,T_{3}=21, αi=A2/Ti\alpha_{i}=A_{2}/T_{i}, βi=A1/(c2+αi)\beta_{i}=A_{1}/(c_{2}+\alpha_{i}), where A2=2.0A_{2}=2.0 and A1=1.0A_{1}=1.0.Here, A1A_{1} is not set by its theoretical value 124L\frac{1}{24L}, since the dataset is very large and the gradient Lipschitz constant LL cannot be computed easily. We compare Multistage SGDM with SGDM with (α,β)=(0.66,0.9)(\alpha,\beta)=(0.66,0.9) and (α,β)=(0.095,0.9)(\alpha,\beta)=(0.095,0.9), where 0.660.66, 0.0950.095 are the stepsizes of the first and last stage of Multistage SGDM, respectively. The training losses of initial and later iterations are shown in Figure 2.

We can see that SGDM with (α,β)=(0.66,0.9)(\alpha,\beta)=(0.66,0.9) converges faster initially, but has a higher final loss; while SGDM with (α,β)=(0.095,0.9)(\alpha,\beta)=(0.095,0.9) behaves the other way. Multistage SGDM takes the advantage of both, as predicted by Theorem 3. The performances of SGDM and Vanilla SGD with the same stepsize are similar.

2 Image classification

For the task of training ResNet18 on the CIFAR-10 dataset, we compare Multistage SGDM, a baseline SGDM, and YellowFin , an automatic momentum tuner based on heuristics from optimizing strongly convex quadratics. The initial learning rate of YellowFin is set to 0.10.1,We have experimented with initial learning rates 0.0010.001 (default), 0.010.01, 0.10.1 and 0.50.5, each repeated 33 times; we found that the choice 0.10.1 is the best in terms of the final training loss. and other parameters are set as their default values. All algorithms are run for T=50T=50 epochs and the batch size is fixed as s=128s=128.

For Multistage SGDM, the parameters choices are governed by (7): the stage lengths are T1=5T_{1}=5, T2=10T_{2}=10, and T3=35T_{3}=35. Take A1=1.0A_{1}=1.0, A2=2.0A_{2}=2.0, set the per-stage stepsizes and momentum weights as αi=A2/Ti\alpha_{i}=A_{2}/T_{i} and βi=A1/(A1+αi)\beta_{i}=A_{1}/(A_{1}+\alpha_{i}), for stages i=1,2,3i=1,2,3. For the baseline SGDM, the stepsize schedule of Multistage SGDM is applied, but with a fixed momentum β=0.9\beta=0.9.

In Figure 3, we present training losses and end-of-epoch validation accuracy of the tested algorithms. We can see that Multistage SGDM performs the best. Baseline SGDM is slightly worse, possibly because of its fixed momentum weight.

Summary and Future Directions

In this work, we provide new theoretical insights into the convergence behavior of SGDM and Multistage SGDM. For SGDM, we show that it is as fast as plain SGD in both nonconvex and strongly convex settings. For the widely adopted multistage SGDM, we establish its convergence and show the advantage of stagewise training.

There are still open problems to be addressed. For example, (a) Is it possible to show that SGDM converges faster than SGD for special objectives such as quadratic ones? (b) Are there more efficient parameter choices than (7) that guarantee even faster convergence?

References

Appendix A Proof of Preliminary Lemmas

It is straightforward to see that the same conclusion holds for i<ji<j.

Finally, we know from the item 4 in Assumption 1 that

A.2 Proof of Lemma 2

where we have applied Cauchy-Schwarz in the first inequality.

By applying triangle inequality and the smoothness of ff (item 1 in Assumption 1), we further have

Therefore, by defining ak,j=1β1βkL2i=1jβki(ki)a^{\prime}_{k,j}=\frac{1-\beta}{1-\beta^{k}}L^{2}\sum_{i=1}^{j}\beta^{k-i}(k-i), we get

Furthermore, ak,ja^{\prime}_{k,j} can be calculated as

Combining this with (12), we finally arrive at

A.3 Proof of Lemma 3

Let us consider the cases of k=1k=1 and k2k\geq 2 separately.

Appendix B Main Theory for SGDM

In order to prove Proposition 1, let us first show an auxiliary result.

Take Assumption 1. Then, for zkz^{k} defined in (6), we have

where we have applied Lemma 3 in the second step.

where ρ0>0\rho_{0}>0 can be any positive constant (to be determined later).

Combining (16) and the last inequality, we arrive at

Since zk=xkz^{k}=x^{k} when k=1k=1 and zk=11βxkβ1βxk1z^{k}=\frac{1}{1-\beta}x^{k}-\frac{\beta}{1-\beta}x^{k-1} when k2k\geq 2, it can be verified that zkxk=β1βαmk1z^{k}-x^{k}=-\frac{\beta}{1-\beta}\alpha m^{k-1}. Consequently,

On the other hand, from Lemma 1 we know that

Finally, using 1βk1<11-\beta^{k-1}<1 and ρ0=1β2Lα\rho_{0}=\frac{1-\beta}{2L\alpha} gives

B.2 Proof of Proposition 1

where ρ0=1β2Lα\rho_{0}=\frac{1-\beta}{2L\alpha}.

In the rest of the proof, let us show that the sum of the last three terms in (22) is non-positive.

Therefore, in order to make the sum of the last three terms of (22) to be non-positive, we need to have

Since 1βk<11-\beta^{k}<1, it suffices to enforce the following for all i1i\geq 1:

And in order for ci>0c_{i}>0 for all i1i\geq 1, we can determine c1c_{1} by

we have i=1iβi=β(1β)2\sum_{i=1}^{\infty}i\beta^{i}=\frac{\beta}{(1-\beta)^{2}} and

Notice that α1β4Lβ+β2\alpha\leq\frac{1-\beta}{4L\sqrt{\beta+\beta^{2}}} ensures c1>0c_{1}>0.

With the choices of cic_{i} in (23) and (24), the sum of the last three terms of (22) is non-positive. Therefore,

B.3 Proof of Theorem 1

In the rest the proof, we will bound R1R_{1} and R2R_{2} appropriately.

First, let us show that R1α2R_{1}\geq\frac{\alpha}{2} when ρ0=1β2Lα\rho_{0}=\frac{1-\beta}{2L\alpha} as in (26) and αmin{1βL(4β+β2),1β2Lβ+β2}\alpha\leq\min\{\frac{1-\beta}{L(4-\beta+\beta^{2})},\frac{1-\beta}{2L\sqrt{\beta+\beta^{2}}}\}.

Since α1β22Lβ+β2\alpha\leq\frac{1-\beta}{2\sqrt{2}L\sqrt{\beta+\beta^{2}}}, we have

Therefore, in order to ensure R1α2R_{1}\geq\frac{\alpha}{2} where R1R_{1} is defined in (28), it suffices to have

Applying ρ0=1β2Lα\rho_{0}=\frac{1-\beta}{2L\alpha} yields

where we have applied α1βL(4β+2β2)\alpha\leq\frac{1-\beta}{L(4-\beta+2\beta^{2})} in the last step.

Now let us turn to R2R_{2}. By (32) we know that

Since ρ0=1β2Lα\rho_{0}=\frac{1-\beta}{2L\alpha}, we have

By applying αmin{1βL(4β+β2),1β22Lβ+β2}1β3.75L<1β4L\alpha\leq\min\{\frac{1-\beta}{L(4-\beta+\beta^{2})},\frac{1-\beta}{2\sqrt{2}L\sqrt{\beta+\beta^{2}}}\}\leq\frac{1-\beta}{3.75L}<\frac{1-\beta}{4L}, we further have

By putting (34) and (35) into (30), we finally obtain

B.4 Proof of Proposition 2

In order to prove Proposition 2, we will set

Take ρ0=1β2Lα\rho_{0}=\frac{1-\beta}{2L\alpha} in (22), we have

Let us first derive a lower bound of the first term on the right hand side of (36).

where ρ\rho, ρ1>0\rho_{1}>0 are to be determined later.

On the other hand, we have from (18) that

Putting these two inequalities into (38) and rearranging gives

Taking ρ=11β\rho=\frac{1}{1-\beta} and ρ1=1β\rho_{1}=\frac{1}{\beta} gives

Since α1β5L\alpha\leq\frac{1-\beta}{5L}, we have that

Therefore, by α1βL(3β+2β2+48β252L+18μL)\alpha\leq\frac{1-\beta}{L\left(3-\beta+2\beta^{2}+\frac{48\sqrt{\beta}}{25}\frac{2L+18\mu}{L}\right)} we have

By combining (44) with (41), we further obtain

In the rest of the proof, we will show that if the constants cic_{i} are chosen such that

Then, we have ci>0c_{i}>0 for all i1i\geq 1 and

And therefore, we will have the desired result:

First of all. by klog0.5logβk\geq\frac{\log 0.5}{\log\beta}, we know that βk12\beta^{k}\leq\frac{1}{2}, and (47) gives

Therefore, in order for (51) to hold, it suffices to set

On the other hand, (50) is also equivalent to

Therefore, in order to have ci>0c_{i}>0 for all i1i\geq 1, we can set

Since ββ1+B1=1αμ11+8μL\beta\leq\sqrt{\beta}\leq 1+B_{1}=1-\alpha\mu\frac{1}{1+\frac{8\mu}{L}} and

for any q(0,1)q\in(0,1), (52) is equivalent to

Since α1βL\alpha\leq\frac{1-\beta}{L}, we further have

Since B1=αμ1+8μLB_{1}=-\frac{\alpha\mu}{1+\frac{8\mu}{L}} and α1β5L\alpha\leq\frac{1-\beta}{5L}, it can be verified that β1+B1β\frac{\beta}{1+B_{1}}\leq\sqrt{\beta} for all β[0,1)\beta\in[0,1) and μL\mu\leq L. Therefore,

As a result, in order to have (53), it suffices to set

Since α1β5L\alpha\leq\frac{1-\beta}{5L}, we have

which is exactly our choice of c1c_{1} as in (49).

B.5 Proof of Theorem 2

From Proposition 2 we know that for all kk0=log0.5logβk\geq k_{0}=\lfloor\frac{\log 0.5}{\log\beta}\rfloor,

where we have applied α1β5L\alpha\leq\frac{1-\beta}{5L} in the last step. Therefore,

By ci0c_{i}\geq 0 for all i1i\geq 1 and (42), we conclude that

B.6 Proof of Corollary 1

In fact, by (6) we can express xkx^{k} as a convex combination of {zi}i=1k\{z^{i}\}_{i=1}^{k}:

The desired result follows directly from the convexity of ff and Theorem 2.

Appendix C Generalizations of Lemmas 1, 2, and 3 for Multistage SGDM

In order to establish the convergence of Multistage SGDM(Algorithm 1), we need to generalize the Lemmas 1 and 2 , which play a key role in the convergence of SGDM in (2).

Under the assumptions of Theorem 3, the variance of update vector mkm^{k} in Algorithm 1 satisfies

where b_{k,i}=\big{(}1-\beta(i)\big{)}\prod_{j=i+1}^{k}\beta(j).

To begin with, let us express mkm^{k} by the past stochastic gradients:

where we have applied m0=0m^{0}=0 and defined

It can be verified that the sum of weights is

As a result, by applying Assumption 1 we have

Note that by setting k=T1++Tnk+rkk=T_{1}+\dots+T_{n_{k}}+r_{k}, we have

Under the assumptions of Theorem 3, the update vector mk1m^{k-1} in Algorithm 1 satisfies

By setting k1=T1++Tnk1+rk1k-1=T_{1}+\dots+T_{n_{k-1}}+r_{k-1}, we have

C.2 Generalization of Lemma 3 for Multistage SGDM

where α(k)\alpha(k) is the stepsize applied at the kkth step.

Recall that the auxiliary sequence zkz^{k} is defined by

where A1αiβi1βiA_{1}\equiv\frac{\alpha_{i}\beta_{i}}{1-\beta_{i}} and αi,βi\alpha_{i},\beta_{i} are the stepsize and momentum weight at the iith stage, respectively. Therefore, we also have

where α(k),β(k)\alpha(k),\beta(k) are the stepsize and momentum weight applied at the kkth step. Using this, we obtain

C.3 Generalization of Lemma 2 for Multistage SGDM

In Multistage SGDM(Algorithm 1), assume that the momentum weights at nn stages satisfy β1β2...βn\beta_{1}\leq\beta_{2}\leq...\leq\beta_{n}. Then, we have

where b_{k,i}=\big{(}1-\beta(i)\big{)}\prod_{j=i+1}^{k}\beta(j) and β(i)\beta(i) is the momentum weight applied at the iith iteration, and

By By (55), (56) and (57), we can compute that

where we have used (57) in the first and third equality, and Cauchy-Schwarz in the first inequality. In the last inequality, we have applied the triangle inequality and the LL-smoothness of ff.

In the Proposition 5 below, we shall see that dk,iak,id_{k,i}\leq a_{k,i} for all ik1i\leq k-1, where ak,ia_{k,i} is defined in (58). Therefore,

dk,id_{k,i} defined in (59) and ak,ia_{k,i} defined in (58) satisfy

We aim to show that dk,iak,id_{k,i}\leq a_{k,i} for all ik1i\leq k-1. Or equivalently, dk,jak,jd_{k,j}\leq a_{k,j} for all jk1j\leq k-1.

In order to show dk,jak,jd_{k,j}\leq a_{k,j}, we just need to show that

Let k=T1+T2++Tnk+rkk=T_{1}+T_{2}+\dots+T_{n_{k}}+r_{k}, where 0nkn10\leq n_{k}\leq n-1. If nk<n1n_{k}<n-1, then 0rkTnk+110\leq r_{k}\leq T_{n_{k}+1}-1. If nk=n1n_{k}=n-1, then 0rkTnk+1=Tn0\leq r_{k}\leq T_{n_{k}+1}=T_{n}.

Since jk1j\leq k-1, we have j=T1++Tnj+rjj=T_{1}+\dots+T_{n_{j}}+r_{j}, where 0njnk0\leq n_{j}\leq n_{k}.

Now, let us compute the left hand side of (60) explicitly.

where we have applied rjTnj+1r_{j}\leq T_{n_{j}+1} if nj<nkn_{j}<n_{k} and rjrkr_{j}\leq r_{k} if nj=nkn_{j}=n_{k} in the last term. Since

By applying these equalities on (61), we have

On the right hand side, the first njn_{j} terms are non-positive since β1β2...βn\beta_{1}\leq\beta_{2}\leq...\leq\beta_{n}. Therefore,

By applying βnj+1rjβnj+1Tnj+1\beta_{n_{j}+1}^{r_{j}}\geq\beta^{T_{n_{j}}+1}_{n_{j}+1} and kT1Tnj1=k(jrj)10k-T_{1}-\dots-T_{n_{j}}-1=k-(j-r_{j})-1\geq 0 (since jk1j\leq k-1), we arrive at

Now let us consider two cases: rk>0r_{k}>0 and rk=0r_{k}=0.

In this case, we apply β1...βn\beta_{1}\leq...\leq\beta_{n} to get

Since rk>0r_{k}>0, iteration kk is at the (nk+1)(n_{k}+1)-th stage, we have β(k)=βnk+1\beta(k)=\beta_{n_{k}+1}, and the above inequality is exactly what we want to show in (60).

In this case, we apply β1...βn\beta_{1}\leq...\leq\beta_{n} to get

Since rk=0r_{k}=0, we have β(k)=βnk\beta(k)=\beta_{n_{k}} and by jk1j\leq k-1 we deduce that njnk1n_{j}\leq n_{k}-1 (Otherwise j=T1++Tnj+rj=T1++Tnk+rjT1++Tnk=kj=T_{1}+\dots+T_{n_{j}}+r_{j}=T_{1}+\dots+T_{n_{k}}+r_{j}\geq T_{1}+\dots+T_{n_{k}}=k). Therefore, we have

which is exactly what we want to show in (60).

Appendix D Main Theory for Multistage SGDM

In this section, we prove the main convergence theory of Multistage SGDM.

Proposition 3 is a generalization of Propositions 4 and 1 to the multistage case. Therefore, its proof is similar to those of Propositions 4 and 1.

First of all, by the smoothness of ff we have

where we have applied Lemma 6 in the second step. Note that α(k)\alpha(k) is the stepsize applied at the kk-th iteration.

where ρ0,k>0\rho_{0,k}>0 can be any positive constant.

By (8) we know that zkxk=A1mk1z^{k}-x^{k}=-A_{1}m^{k-1}, which leads to

Plugging (65) and (67) into (64) gives us

In the rest of the proof, we will show that the sum of the last three terms in (68) is non-positive.

Therefore, in order to make the sum of the last three terms of (68) to be non-positive, we need to enforce that

Since 1i=1kβ(i)<11-\prod_{i=1}^{k}\beta(i)<1, β1β(k)βn\beta_{1}\leq\beta(k)\leq\beta_{n}, and α1α(k)αn\alpha_{1}\leq\alpha(k)\leq\alpha_{n}, we need to enforce the following for all i1i\geq 1:

Recall that αiβi1βiA1\frac{\alpha_{i}\beta_{i}}{1-\beta_{i}}\equiv A_{1} for all nn stages i=1,2,...,ni=1,2,...,n. This gives us

Since β1β(k)\beta_{1}\leq\beta(k), it suffices to enforce that

Note that the equalities in (70) does not depend on kk. In order for ci>0c_{i}>0 for all i1i\geq 1, we can determine c1c_{1} by

we have i=1iβni=βn(1βn)2\sum_{i=1}^{\infty}i\beta_{n}^{i}=\frac{\beta_{n}}{(1-\beta_{n})^{2}} and

Notice that A1=1242LA_{1}=\frac{1}{24\sqrt{2}L} and 1β1β1121βnβn+βn2\frac{1-\beta_{1}}{\beta_{1}}\leq 12\frac{1-\beta_{n}}{\sqrt{\beta_{n}+\beta^{2}_{n}}} ensures

With the choices of cic_{i} in (70) and (71), the sum of the last three terms of (68) is non-positive. Therefore,

Taking ρ0,k=1β(k)2Lα(k)\rho_{0,k}=\frac{1-\beta(k)}{2L\alpha(k)} in (73) gives

Finally, by applying Lemma 4 and Lemma 5, we arrive at

D.2 Proof of Theorem 3

In the rest the proof, we will bound R1,iR_{1,i} and R2,iR_{2,i} appropriately.

First, let us show that R1,iα(i)2R_{1,i}\geq\frac{\alpha(i)}{2} under ρ0,i=1β(i)2Lα(i)\rho_{0,i}=\frac{1-\beta(i)}{2L\alpha(i)} as in (69) and α(i)=A1(1β(i))β(i)=1β(i)242Lβ(i)\alpha(i)=\frac{A_{1}(1-\beta(i))}{\beta(i)}=\frac{1-\beta(i)}{24\sqrt{2}L\beta(i)}.

Therefore, in order for R1,iα(i)2R_{1,i}\geq\frac{\alpha(i)}{2}, it suffices to have

By β(i)β112\beta(i)\geq\beta_{1}\geq\frac{1}{2} we know that

Therefore, Lα2(i)2α(i)4\frac{L\alpha^{2}(i)}{2}\leq\frac{\alpha(i)}{4}. Furthermore, ρ0,i=1β(i)2Lα(i)\rho_{0,i}=\frac{1-\beta(i)}{2L\alpha(i)} yields

where in the inequality above, we have applied

Now let us turn to R2,iR_{2,i}. By (76) and (72) we know that

Since ρ0,i=1β(i)2Lα(i)\rho_{0,i}=\frac{1-\beta(i)}{2L\alpha(i)} and α(i)β(i)1β(i)A1\frac{\alpha(i)\beta(i)}{1-\beta(i)}\equiv A_{1}, we have

By applying Lemmas 4 and 5, we further have

By putting (79) and (80) into (77) with k=T1+T2++Tnk=T_{1}+T_{2}+\dots+T_{n}, we obtain

Dividing both sides by 12nA212nαlTl\frac{1}{2}nA_{2}\equiv\frac{1}{2}n\alpha_{l}T_{l} gives

Appendix E Details of computational infrastructure

All experiments were performed on a computing server with Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz and NVidia GeForce RTX 2080 P8. The weights of the neural networks are initialized by the default, random initialization routines.