Understanding the Role of Momentum in Stochastic Gradient Methods

Igor Gitman, Hunter Lang, Pengchuan Zhang, Lin Xiao

Introduction

Stochastic gradient methods have become extremely popular in machine learning for solving stochastic optimization problems of the form

where dkd^{k} is a (stochastic) search direction and αk>0\alpha_{k}>0 is the step size or learning rate. The classical stochastic gradient descent (SGD) method uses dk=xf(xk,ζk)d^{k}=\nabla_{x}f(x^{k},\zeta^{k}), where ζk\zeta^{k} is a random sample collected at step kk. For the ease of notation, we use gkg^{k} to denote xf(xk,ζk)\nabla_{x}f(x^{k},\zeta^{k}) throughout this paper.

There is a vast literature on modifications of SGD that aim to improve its theoretical and empirical performance. The most common such modification is the addition of a momentum term, which sets the search direction dkd^{k} as the combination of the current stochastic gradient gkg^{k} and past search directions. For example, the stochastic variant of Polyak’s heavy ball method uses

where βk[0,1)\beta_{k}\in[0,1). We call the combination of (2) and (3) the Stochastic Heavy Ball (SHB) method. Gupal and Bazhenov studied a “normalized” version of SHB, where

In the context of modern deep learning, Sutskever et al. proposed to use a stochastic variant of Nesterov’s accelerated gradient (NAG) method, where

The number of variations on momentum has kept growing in recent years; see, e.g., Synthesized Nesterov Variants (SNV) , Triple Momentum , Robust Momentum , PID Control-based methods , Accelerated SGD (AccSGD) , and Quasi-Hyperbolic Momentum (QHM) .

Despite various empirical successes reported for these different methods, there is a lack of clear understanding of how the different forms of momentum and their associated parameters affect convergence properties of the algorithms and other performance measures, such as final loss value. For example, Sutskever et al. show that momentum is critical to obtaining good performance in deep learning. But using different parametrizations, Ma and Yarats claim that momentum may have little practical effect. In order to clear up this confusion, several recent works [see, e.g., 40, 1, 18] have aimed to develop and analyze general frameworks that capture many different momentum methods as special cases.

In this paper, we focus on a class of algorithms captured by the general form of QHM :

Our theoretical results on the QHM model (6) cover three different aspects: asymptotic convergence with probability one, stability region and local convergence rates, and characterizations of the stationary distribution of {xk}\{x^{k}\} under constant parameters α\alpha, β\beta, and ν\nu. Specifically:

In Section 3, we show that for minimizing smooth nonconvex functions, QHM converges almost surely as βk0\beta_{k}\to 0 for arbitrary values of νk\nu_{k}. And more surprisingly, we show that QHM converges as νkβk1\nu_{k}\beta_{k}\to 1 (which requires both νk1\nu_{k}\to 1 and βk1\beta_{k}\to 1) as long as νkβk1\nu_{k}\beta_{k}\to 1 slow enough, as compared with the speed of αk0\alpha_{k}\to 0.

In Section 4, we consider local convergence behaviors of QHM for fixed parameters α\alpha, β\beta, and ν\nu. In particular, we derive joint conditions on (α,β,ν)(\alpha,\beta,\nu) that ensure local stability (or convergence when there is no stochastic noise in the gradient approximations) of the algorithm near a strict local minimum. We also characterize the local convergence rate within the stability region.

In Section 5, we investigate the stationary distribution of {xk}\{x^{k}\} generated by the QHM dynamics around a local minimum (using a simple quadratic model with noise). We derive the dependence of the stationary variance on (α,β,ν)(\alpha,\beta,\nu) up to the second-order Taylor expansion in α\alpha. These results reveal interesting effects of β\beta and ν\nu that cannot be seen from first-order expansions.

Our asymptotic convergence results in Section 3 give strong guarantees for the convergence of QHM with diminishing learning rates under different regimes (βk0\beta_{k}\to 0 and βk1\beta_{k}\to 1). However, as with most asymptotic results, they provide limited guidance on how to set the parameters in practice for fast convergence. Our results in Sections 4 and 5 complement the asymptotic results by providing principled guidelines for tuning these parameters. For example, one of the most effective schemes used in deep learning practice is called “constant and drop”, where constant parameters (α,β,ν)(\alpha,\beta,\nu) are used to train the model for a long period until it reaches a stationary state and then the learning rate α\alpha is dropped by a constant factor for refined training. Each stage of the constant-and-drop scheme runs variants of QHM with constant parameters, and their choices dictate the overall performance of the algorithm. In Section 6, by combining our results in Sections 4 and 5, we obtain new and, in some cases, counter-intuitive insight into how to set these parameters in practice.

Related work

Asymptotic convergence There exist many classical results concerning the asymptotic convergence of the stochastic gradient methods [see, e.g. 37, 28, 14, and references therein]. For the classical SGD method without momentum, i.e., (2) with dk=gkd^{k}=g^{k}, a well-known general condition for asymptotic convergence is k=0αk=\sum_{k=0}^{\infty}\alpha_{k}=\infty and k=0αk2<\sum_{k=0}^{\infty}\alpha_{k}^{2}<\infty. In general, we will always need αk0\alpha_{k}\to 0 to counteract the effect of noise. But interestingly, the conditions on βk\beta_{k} are much less restricted. For normalized SHB, Polyak and Kaniovski studied its asymptotic convergence properties in the regime of αk0\alpha_{k}\to 0 and βk0\beta_{k}\to 0, while Gupal and Bazhenov investigated asymptotic convergence in the regime of αk0\alpha_{k}\to 0 and βk1\beta_{k}\to 1, both for convex optimization problems. More recently, Gadat et al. extended asymptotic convergence analysis for the normalized SHB update to smooth nonconvex functions for βk1\beta_{k}\to 1. In this work we generalize the classical SGD and SHB results to the case of QHM for smooth nonconvex functions.

Local convergence rate The stability region and local convergence rate of the deterministic gradient descent and heavy ball algorithms were established by Boris Polyak for the case of convex functions near a strict twice-differentiable local minimum . For this class of functions heavy ball method is optimal in terms of the local convergence rate . However, it might fail to converge globally for the general strongly convex twice-differentiable functions and is no longer optimal for the class of smooth convex functions. For the latter case, Nesterov’s accelerated gradient was shown to attain the optimal global convergence rate . In this paper we extend the results of Polyak on local convergence to the more general QHM algorithm.

Stationary analysis The limit behavior analysis of SGD algorithms with momentum and constant step size was used in various applications. establish sufficient conditions on detecting whether iterates reach stationarity and use them in combination with statistical tests to automatically change learning rate during training. prove many properties of limiting behavior of SGD with constant step size by using tools from Markov chain theory. Our results are most closely related to the work of Mandt et al. who use stationary analysis of SGD with momentum to perform approximate Bayesian inference. In fact, our Theorem 4 extends their results to the case of QHM and our Theorem 5 establishes more precise relations (to the second order in α\alpha), revealing interesting dependence on the parameters β\beta and ν\nu which cannot be seen from the first order equations.

Asymptotic convergence

In this section, we generalize the classical asymptotic results to provide conditions under which QHM converges almost surely to a stationary point for smooth nonconvex functions. Throughout this section, "a.s." refers to "almost surely". We need to make the following assumptions.

The following conditions hold for FF defined in (1) and the stochastic gradient oracle:

FF is differentiable and F\nabla F is Lipschitz continuous, i.e., there is a constant LL such that

FF is bounded below and F(x)\|\nabla F(x)\| is bounded above, i.e., there exist FF_{*} and GG such that

For k=0,1,2,k=0,1,2,\ldots, the stochastic gradient gk=F(xk)+ξkg^{k}=\nabla F(x^{k})+\xi^{k}, where the random noise ξk\xi^{k} satisfies

where Ek[]\mathbf{E}_{k}[\cdot] denotes expectation conditioned on {x0,g0,,xk1,gk1,xk}\{x^{0},g^{0},\ldots,x^{k-1},g^{k-1},x^{k}\}, and CC is a constant.

Note that Assumption A.3 allows the distribution of ξk\xi^{k} to depend on xkx^{k}, and we simply require the second moment to be conditionally bounded uniformly in kk. The assumption F(x)G\|\nabla F(x)\|\leq G can be removed if we assume a bounded domain for xx. However, this will complicate the proof by requiring special treatment (e.g., using the machinery of gradient mapping ) when {xk}\{x^{k}\} converges to the boundary of the domain. Here we assume this condition to simplify the analysis.

By convergence to a stationary point, we mean that the sequence {xk}\{x^{k}\} satisfies the condition

Intuitively, as βk0\beta_{k}\to 0, regardless of νk\nu_{k}, the QHM dynamics become more like SGD, so there should be no issue with convergence. The following theorem, which generalizes the analysis technique of Ruszczyński and Syski to QHM, shows formally that this is indeed the case:

Let FF satisfy Assumption A. Additionally, assume 0νk10\leq\nu_{k}\leq 1 and the sequences {αk}\{\alpha_{k}\} and {βk}\{\beta_{k}\} satisfy the following conditions:

Then the sequence {xk}\{x^{k}\} generated by the QHM algorithm (6) satisfies (7). Moreover, we have

More surprisingly, however, one can actually send νkβk1\nu_{k}\beta_{k}\to 1 as long as νkβk1\nu_{k}\beta_{k}\to 1 slow enough, although we require a stronger condition on the noise ξ\xi. We extend the technique of Gupal and Bazhenov to show asymptotic convergence of QHM for minimizing smooth nonconvex functions.

Let FF satisfy assumption A, and additionally assume that ξk2<C||\xi^{k}||^{2}<C almost surely, i.e., the noise ξ\xi is a.s. bounded. Let the sequences {αk}\{\alpha_{k}\}, {βk}\{\beta_{k}\}, and {νk}\{\nu_{k}\} satisfy the following conditions:

Then the sequence {xk}\{x^{k}\} generated by Algorithm (6) satisfies (7).

The conditions in Theorem 2 can be satisfied by, for example, taking αk=kω\alpha_{k}=k^{-\omega} and (1νkβk)=kc(1-\nu_{k}\beta_{k})=k^{-c} for 1+c2<ω1\frac{1+c}{2}<\omega\leq 1 and 12<c<1\frac{1}{2}<c<1. We should note that, even though setting νkβk1\nu_{k}\beta_{k}\to 1 is somewhat unusual in practice, we think the result of Theorem 2 is interesting from both theoretical and practical points of view. From the theoretical side, this result shows that it is possible to always be increasing the amount of momentum (in the limit when νkβk=1\nu_{k}\beta_{k}=1, we are not using the fresh gradient information at all) and still obtain convergence for smooth functions. From the practical point of view, our Theorem 5 in Section 5 shows that for a fixed α\alpha, increasing νkβk\nu_{k}\beta_{k} might lead to smaller stationary distribution size, which may give better empirical results.

Also, note that when νk=βk\nu_{k}=\beta_{k}, Theorems 1 and 2 give asymptotic convergence guarantees for the common practical variant of NAG, which have not appeared in the literature before. However, we should mention that the bounded noise assumption of Theorem 2 (i.e. ξk2<C||\xi^{k}||^{2}<C a.s.) is quite restrictive. In fact, Ruszczyński and Syski prove a similar result for SGM with a more general noise condition, and their technique may extend to QHM, but bounded noise greatly simplifies the derivations. We provide the proofs of Theorems 1 and 2 in Appendix B.

The results in this section indicate that both βk0\beta_{k}\to 0 and νkβk1\nu_{k}\beta_{k}\to 1 are admissible from the perspective of asymptotic convergence. However, they give limited guidance on how to choose momentum parameters in practice, where non-asymptotic behaviors are of main concern. In the next two sections, we study local convergence and stationary behaviors of QHM with constant learning rate and momentum parameters; our analysis provides new insights that could be very useful in practice.

Stability region and local convergence rate

Let the sequence {xk}\{x^{k}\} be generated by the QHM algorithm (6) with constant parameters αk=α\alpha_{k}=\alpha, βk=β\beta_{k}=\beta and νk=ν\nu_{k}=\nu. In this case, xkx^{k} does not converge to any local minimum in the asymptotic sense, but its distribution may converge to a stationary distribution around a local minimum. Since the objective function FF is smooth, we can approximate FF around a strict local minimum xx^{*} by a convex quadratic function. Since F(x)=0\nabla F(x^{*})=0, we have

where the Hessian 2F(x)\nabla^{2}F(x^{*}) is positive definite. Therefore, for the ease of analysis, we focus on convex quadratic functions of the form F(x)=(1/2)(xx)TA(xx),F(x)=(1/2)(x-x^{*})^{T}A(x-x^{*}), where AA is positive definite (and we can set x=0x^{*}=0 without loss of generality). In addition, we assume

where the noise ξk\xi^{k} satisfies Assumption A.3 and in addition, ξk\xi^{k} is independent of xkx^{k} for all k0k\geq 0. Mandt et al. observe that this independence assumption often holds approximately when the dynamics of SHB are approaching stationarity around a local minimum.

where TT and SS are functions of (α,β,ν)(\alpha,\beta,\nu) and AA:

It is well-known that the linear system (10) is stable if and only if the spectral radius of TT, denoted by ρ(T)\rho(T), is less than 1. When ρ(T)<1\rho(T)<1, the dynamics of (10) is the superposition of two components:

A deterministic part described by the dynamics zk+1=Tzkz^{k+1}=Tz^{k} with initial condition z0=[0;x0]z^{0}=[0;x^{0}] (we always take d1=0d^{-1}=0). This part asymptotically decays to zero.

An auto-regressive stochastic process (10) driven by {ξk}\{\xi^{k}\} with zero initial condition z0=[0;0]z^{0}=[0;0].

Roughly speaking, ρ(T)\rho(T) determines how fast the dynamics converge from an arbitrary initial point x0x^{0} to the stationary distribution, while properties of the stationary distribution (such as its variance and auto-correlations) depends on the full spectrum of the matrix TT as well as SS. Both aspects have important implications for the practical performance of QHM on stochastic optimization problems. Often there are trade-offs that we have to make in choosing the parameters α\alpha, β\beta and ν\nu to balance the transient convergence behavior and stationary distribution properties.

In the rest of this section, we focus on the deterministic dynamics zk+1=Tzkz^{k+1}=Tz^{k} to derive the conditions on (α,β,ν)(\alpha,\beta,\nu) that ensure ρ(T)<1\rho(T)<1 and characterize the convergence rate. Let λi(A)\lambda_{i}(A) for i=1,,ni=1,\ldots,n denote the eigenvalues of AA (they are all real and positive). In addition, we define

where κ\kappa is the condition number. The local convergence rate for strictly convex quadratic functions is well studied for the case of gradient descent (ν=0\nu=0) and heavy ball (ν=1\nu=1) . In fact, heavy ball achieves the best possible convergence rate of (κ1)/(κ+1)(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1). Thus, it is immediately clear that the optimal convergence rate of QHM will be the same and will be achieved with ν=1\nu=1. However, there are no results in the literature characterizing how the optimal rate or optimal parameters change as a function of ν\nu. Our next result establishes the convergence region and dependence of the convergence rate on the parameters α\alpha, β\beta, and ν\nu. We present the result for quadratic functions, but it can be generalized to any LL-smooth and μ\mu-strongly convex functions, assuming the initial point x0x^{0} is close enough to the optimal point xx_{*} (see Theorem 6 in Appendix C).

Let’s denote θ={α,β,ν}\theta=\{\alpha,\beta,\nu\}We drop the dependence of some functions on θ\theta for brevity.. For any function F(x)=xTAx+bTx+cF(x)=x^{T}Ax+b^{T}x+c that satisfies 0<μλi(A)L0<\mu\leq\lambda_{i}(A)\leq L for all i=1,,ni=1,\ldots,n and any x0x^{0}, {ϵk}\exists\{\epsilon_{k}\}, with ϵk0\epsilon_{k}\geq 0, such that the deterministic QHM algorithm zk+1=Tzkz^{k+1}=Tz^{k} satisfies

where x=arg minxF(x)x_{*}=\operatorname*{arg\,min}_{x}{F(x)}, limkϵk=0\lim_{k\to\infty}\epsilon_{k}=0 and R(θ,μ,L)=ρ(T)R(\theta,\mu,L)=\rho(T), which can be characterized as

To ensure R(θ,μ,L)<1R(\theta,\mu,L)<1, the parameters α,β,ν\alpha,\beta,\nu must satisfy the following constraints:

In addition, the optimal rate depends only on κ\kappa: minθR(θ,μ,L)\min_{\theta}R(\theta,\mu,L) is a function of only κ\kappa.

The conditions in (13) characterize the stability region of QHM. Note that when ν=0\nu=0 we have the classical result for gradient descent: α<2/L\alpha<2/L; when ν=1\nu=1, the condition matches that of the normalized heavy ball: α<2(1+β)/(L(1β))\alpha<2(1+\beta)/(L(1-\beta)).

The equations (12) define the convergence rate for any fixed values of the parameters α,β,ν\alpha,\beta,\nu. While it does not give a simple analytic form, it allows us to conduct easy numerical investigations. To gain more intuition into the effect that momentum parameters ν\nu and β\beta have on the convergence rate, we study how the optimal ν\nu changes as a function of β\beta and vice versa. To find the optimal parameters and rate, we solve the corresponding optimization problem numerically (using the procedure described in Appendix D). For each pair {β,ν}\left\{\beta,\nu\right\} we set α\alpha to the optimal value in order to remove its effect. These plots are presented in Figure 1.

A natural way to think about the interplay between parameters α,β\alpha,\beta and ν\nu is in terms of the total “amount of momentum”. Intuitively, it should be controlled by the product of ν×β\nu\times\beta. This intuition helps explain Figure 1 (a), (b), which show the dependence of the optimal β\beta as a function of ν\nu for different values of κ\kappa. We can see that for bigger values of ν\nu we need to use smaller values of β\beta, since increasing each one of them increases the “amount of momentum” in QHM. However, the same intuition fails when considering ν\nu as a function of β\beta (and β\beta is big enough), as shown in Figure 1 (c), (d). In this case there are 3 regimes of different behavior. In the first regime, since β\beta is small, the amount of momentum is not enough for the problem and thus the optimal ν\nu is always 11. In this phase we also need to increase α\alpha when increasing β\beta (it is typical to use larger learning rate when the momentum coefficient is bigger). The second phase begins when we reach the optimal value of β\beta (rate is minimal) and, after that, the amount of momentum becomes too big and we need to decrease ν\nu and α\alpha. However, somewhat surprisingly, there is a third phase, when β\beta becomes big enough we need to start increasing ν\nu and α\alpha again. Thus we can see that it’s not just the product of νβ\nu\beta that governs the behavior of QHM, but a more complicated function.

Finally, based on our analytic and numerical investigations, we conjecture that the optimal convergence rate is a monotonically decreasing function of ν\nu (if α\alpha and β\beta are chosen optimally for each ν\nu). While we can’t prove this statementIn fact, we hypothesise that R(ν,κ)R^{*}(\nu,\kappa) might not have analytical formula, since it is possible to show that the optimization problem over α\alpha and β\beta is equivalent to the system of highly non-linear equations., we verify this conjecture numerically in Appendix D. The code of all of our experiments is available at https://github.com/Kipok/understanding-momentum.

Stationary analysis

In this section, we study the stationary behavior of QHM with constant parameters α\alpha, β\beta and ν\nu. Again we only consider quadratic functions for the same reasons as outlined in the beginning of Section 4. In other words, we focus on the linear dynamics of (10) driven by the noise ξk\xi^{k} as kk\to\infty (where the deterministic part depending on x0x^{0} dies out). Under the assumptions of Section 4 we have the following result on the covariance matrix defined as \Sigma_{x}\triangleq\lim_{k\to\infty}\mathbf{E}\bigl{[}x^{k}(x^{k})^{T}\bigr{]}.

Suppose F(x)=12xTAxF(x)=\frac{1}{2}x^{T}Ax, where AA is symmetric positive definite matrix. The stochastic gradients satisfy gk=F(xk)+ξg^{k}=\nabla F(x^{k})+\xi, where ξ\xi is a random vector independent of xkx^{k} with zero mean E[ξ]=0\mathbf{E}\left[\xi\right]=0 and covariance matrix E[ξξT]=Σξ\mathbf{E}\left[\xi\xi^{T}\right]=\Sigma_{\xi}. Also, suppose the parameters α,β,ν\alpha,\beta,\nu satisfy (13). Then the QHM algorithm (6), equivalently (10) in this case, converges to a stationary distribution satisfying

When ν=1\nu=1, this result matches the known formula for the stationary distribution of unnormalized SHB with reparametrization of αα/(1β)\alpha\to\alpha/(1-\beta). Note that Theorem 4 shows that for the normalized version of the algorithm, the stationary distribution’s covariance does not depend on β\beta (or ν\nu) to the first order in α\alpha. In order to explore such dependence, we need to expand the dependence on α\alpha to the second order. In that case, we are not able to obtain a matrix equation, but can get the following relation for tr(AΣx)\mathbf{tr}(A\Sigma_{x}).

Under the conditions of Theorem 4, we have

We note that tr(AΣx)\mathbf{tr}(A\Sigma_{x}) is twice the mean value of F(x)F(x) when the dynamics have reached stationarity, so the right-hand side of (15) is approximately the “achievable loss” given the values of α,β\alpha,\beta and ν\nu. It is interesting to consider several special cases:

ν=0\nu=0 (SGD): tr(AΣx)=α2tr(Σξ)+α24tr(AΣξ)+O(α3)\mathbf{tr}(A\Sigma_{x})=\frac{\alpha}{2}\mathbf{tr}(\Sigma_{\xi})+\frac{\alpha^{2}}{4}\mathbf{tr}(A\Sigma_{\xi})+O(\alpha^{3}).

ν=1\nu=1 (SHB): tr(AΣx)=α2tr(Σξ)+α241β1+βtr(AΣξ)+O(α3)\mathbf{tr}(A\Sigma_{x})=\frac{\alpha}{2}\mathbf{tr}(\Sigma_{\xi})+\frac{\alpha^{2}}{4}\frac{1-\beta}{1+\beta}\mathbf{tr}(A\Sigma_{\xi})+O(\alpha^{3}).

ν=β\nu=\beta (NAG): tr(AΣx)=α2tr(Σξ)+α24(12β2(1+2β)1+β)tr(AΣξ)+O(α3)\mathbf{tr}(A\Sigma_{x})=\frac{\alpha}{2}\mathbf{tr}(\Sigma_{\xi})+\frac{\alpha^{2}}{4}\left(1-\frac{2\beta^{2}(1+2\beta)}{1+\beta}\right)\mathbf{tr}(A\Sigma_{\xi})+O(\alpha^{3}) .

From the expressions for SHB and NAG, it might be beneficial to move β\beta to 11 during training in order to make the achievable loss smaller. While moving β\beta to 11 is somewhat counter-intuitive, we proved in Section 3 that QHM still converges asymptotically in this regime, assuming ν\nu also goes to 11 and νβ\nu\beta converges to 11 “slower” than α\alpha converges to . However, since we only consider Taylor expansion in α\alpha, there is no guarantee that the approximation remains accurate when ν\nu and β\beta converge to 11 (see Appendix G for evaluation of this approximation error). In order to precisely investigate the dependence on β\beta and ν\nu, it is necessary to further extend our results by considering Taylor expansion with respect to them as well, especially in terms of 1β1-\beta. We leave this for future work.

Figure 2 shows a visualization of the QHM stationary distribution on a 2-dimensional quadratic problem. We can see that our prediction about the dependence on α\alpha and β\beta holds in this case. However, the dependence on ν\nu is more complicated: the top and bottom rows of Figure 2 show opposite behavior. Comparing this experiment with our analysis of the convergence rate (Figure 1) we can see another confirmation that for big values of β\beta, increasing ν\nu can, in a sense, decrease the “amount of momentum” in the system. Next, we evaluate the average final loss for a large grid of parameters α,β\alpha,\beta and ν\nu on three problems: a 2-dimensional quadratic function (where all of our assumptions are satisfied), logistic regression on the MNIST dataset (where the quadratic assumption is approximately satisfied, but gradient noise comes from mini-batches) and ResNet-18 on CIFAR-10 (where all of our assumptions are likely violated). Figure 3 shows the results of this experiment. We can indeed see that β1\beta\to 1 and α0\alpha\to 0 make the final loss smaller in all cases. The dependence on ν\nu is less clear, but we can see that for large values of β\beta it is approximately quadratic, with a minimum at some ν<1\nu<1. Thus from this point of view ν1\nu\neq 1 helps when β\beta is big enough, which might be one of the reasons for the empirical success of the QHM algorithm. Notice that the empirical dependence on ν\nu is qualitatively the same as predicted by formula (15), but with optimal value shifted closer to 11. See Appendix F for details.

Some practical implications and guidelines

In this section, we present some practical implications and guidelines for setting learning rate and momentum parameters in practical machine learning applications. In particular, we consider the question of how to set the optimal parameters in each stage of the popular constant-and-drop scheme for deep learning. We argue that in order to answer this question, it is necessary to consider both convergence rate and stationary distribution perspectives. There is typically a trade-off between obtaining a fast rate and a small stationary distribution. You can see an illustration of this trade-off in Figure 4 (a). Interestingly, by combining stationary analysis of Section 5 and results for the convergence rate (3), we can find certain regimes of parameters α,β\alpha,\beta, and ν\nu where the final loss and the convergence speed do not compete with each other.

One of the most important of these regimes happens in the case of the SHB algorithm (ν=1\nu=1). In that case, we can see that when C12(l)C22(l)0,l{μ,L}C_{1}^{2}(l)-C_{2}^{2}(l)\leq 0,l\in\left\{\mu,L\right\}, the convergence rate equals β\sqrt{\beta} and does not depend on α\alpha. Thus, as long as this inequality is satisfied, we can set α\alpha as small as possible and it would not harm the convergence rate, but will decrease the size of stationary distribution. To get the best possible convergence rate, we, in fact, have to set α\alpha and β\beta in such a way that this inequality will turn into equality and thus there will be only a single value of α\alpha that could be used. However, as long as β\beta is not exactly at the optimal value, there is going to be some freedom in choosing α\alpha and it should be used to decrease the size of stationary distribution. From this point of view, the optimal value of \alpha=\bigl{(}1-\sqrt{\beta}\bigr{)}\big{/}\Bigl{(}\mu\bigl{(}1+\sqrt{\beta}\bigr{)}\Bigr{)}, which will be smaller then the largest possible α\alpha for convergence as long as κ>2\kappa>2 and β\beta is set close to 11 (see proof of Theorem 3 for more details). This guideline contradicts some typical advice to set α\alpha as big as possible while algorithm still convergesFor example says “So set β\beta as close to 1 as you can, and then find the highest α\alpha which still converges. Being at the knife’s edge of divergence, like in gradient descent, is a good place to be.”. The refined guideline for the constant-and-drop scheme would be to set α\alpha as small as possible until the convergence noticeably slows down. You can see an illustration of this behavior on a simple quadratic problem (Figure 4 (b)), as well as for ResNet-18 on CIFAR-10 (Figure 4 (c)). Such regimes of no trade-off can be identified for β\beta and ν\nu as well.

Conclusion

Using the general formulation of QHM, we have derived a unified set of new analytic results that give us better understanding of the role of momentum in stochastic gradient methods. Our results cover several different aspects: asymptotic convergence, stability region and local convergence rate, and characterizations of stationary distribution. We show that it is important to consider these different aspects together to understand the key trade-offs in tuning the learning rate and momentum parameters for better performance in practice. On the other hand, we note that the obtained guidelines are mainly for stochastic optimization, meaning the minimization of the training loss. There is evidence that different heuristics and guidelines may be necessary for achieving better generalization performance in machine learning, but this topic is beyond the scope of our current paper.

References

Appendix

Appendix A From NAG to QHM

In this appendix we will mention exact steps needed to come from the original NAG formulation to the formulation assumed by the QHM algorithm. We refer the reader to the ( Appendix A.1) for the derivation of NAG as the following momentum method Note that we change the notation to be consistent with the notation of QHM.

Next, we will move the learning rate out of the momentum into the iterates update:

When αk\alpha_{k} and βk\beta_{k} are constant, the two methods produce the same sequence of iterates xkx_{k} if d0d_{0} is initialized at . To make the notation more similar to the QHM algorithm, let’s move all indices (except for dkd_{k}) up by 1:

This again does not change the algorithm. Now, let’s normalize the momentum update by 1βk1-\beta_{k}:

This version is equivalent to the unnormalized by re-scaling αα/(1β)\alpha\to\alpha/(1-\beta) for constant parametersIn fact, for non-constant βk\beta_{k} the algorithms are no longer equivalent.. Finally, following we need to make a change of variables yk=xkαkβkdk1y_{k}=x_{k}-\alpha_{k}\beta_{k}d_{k-1} and additionally assume that βk=β\beta_{k}=\beta is constant:

Renaming yky_{k} back to xkx_{k} and replacing f(yk)\nabla f(y_{k}) with stochastic gradient if necessary we obtain the exact formula used in QHM update.

Overall, assuming d0=0d_{0}=0 and βk\beta_{k} is constant, the QHM version of NAG is indeed equivalent (up to a change of variable) to the original NAG with re-scaling of αα/(1β)\alpha\to\alpha/(1-\beta). However, if βk\beta_{k} is changing from iteration to iteration, the two algorithms are no longer equivalent.

Appendix B Asymptotic Convergence Proofs

In this section we prove Theorems 1 and 2. For simplicity, we assume throughout that αk\alpha_{k}, νk\nu_{k} and βk\beta_{k} are nonrandom.

Here we generalize the meta-analysis of Ruszczyński and Syski to include νk\nu_{k}.

We consider the following variant of QHM:

where νk\nu_{k} is in $andandi_{k}isabinaryswitchintroducedtohandleunboundednoise.Specifically,forsomeconstantis a binary switch introduced to handle unbounded noise. Specifically, for some constant\rho>0$, we let

Conditions repeated for convenience.

Let FF satisfy Assumption A. Additionally, assume the sequences {αk}\{\alpha_{k}\}, {βk}\{\beta_{k}\} and {νk}\{\nu_{k}\} satisfy the following conditions:

Then the dynamics (6) satisfy (7) and (8).

which we will use often in the convergence analysis.

Suppose Assumption A, (20), and (23) hold, and for every k0k\geq 0,

where {Sk}\{S_{k}\} and {Wk}\{W_{k}\} are sequences of scalar random variables satisfying and

For the first part, suppose for a contradiction that lim infkF(xk)>0\liminf_{k\to\infty}||\nabla F(x^{k})||>0. So there exists ϵ>0\epsilon>0 and k0k_{0} such that for all kk0k\geq k_{0}, F(xk)ϵ||\nabla F(x^{k})||\geq\epsilon. By (26), there exists k1k0k_{1}\geq k_{0} such that Skϵ/2S_{k}\leq\epsilon/2 for all kk1k\geq k_{1}. Then from (25) we obtain:

Summing over kk from k1k_{1} to \infty and using Assumption A.2, we get that:

The right-hand-side is finite by (27). But since F(xk)ϵ||\nabla F(x^{k})||\geq\epsilon for all kk1k\geq k_{1} and βkβˉ<1\beta_{k}\leq\bar{\beta}<1 and 0νk10\leq\nu_{k}\leq 1, we have:

This implies k=k1αk<\sum_{k=k_{1}}^{\infty}\alpha_{k}<\infty, which contradicts (20). So we must have (7), i.e. lim infkF(xk)=0\liminf_{k}||\nabla F(x^{k})||=0.

For every ll, define the index k(l)=max{kK:k<l}k(l)=\max\{k\in\mathcal{K}:k<l\}. Since K\mathcal{K} is infinite, k(l)k(l)\to\infty as ll\to\infty. Then for sufficiently large ll, i.e., when k(l)k1k(l)\geq k_{1}, (25) becomes

As ll\to\infty, because of (27) and k(l)k(l)\to\infty, we get i=k(l)l1Wi0\sum_{i=k(l)}^{l-1}W_{i}\to 0, so

Since the reverse inequality is trivial, we obtain (8).

Because αk<αˉ<\alpha_{k}<\bar{\alpha}<\infty for all kk, νk\nu_{k}, βk\beta_{k} are in $,and, and||\nabla F(x^{k(l)})||\to 0,thelattertwotermsintheaboveinequalityconvergetozeroas, the latter two terms in the above inequality converge to zero asl\to\infty$. So we obtain (30) again. This concludes the proof of the lemma. ∎

All that remains now is to use the smoothness inequality (24) to identify the sequences SkS_{k} and WkW_{k} for the dynamics of the modified algorithm (16)-(18), and prove (26) and (27).

From the update formula (18) and using gk=F(xk)+ξkg^{k}=\nabla F(x^{k})+\xi^{k}, we obtain

By the smoothness assumption A.1, we have

First we show Sk0S_{k}\to 0. From the update formula, we have

Then because ikdk1ρi_{k}||d^{k-1}||\leq\rho, βk0\beta_{k}\to 0, and supkβk=βˉ<1\sup_{k}\beta_{k}=\bar{\beta}<1, we have

Because Sk0S_{k}\geq 0, limkSk=0\lim_{k}S_{k}=0.

Now we show that WkW_{k} is summable almost surely. To begin, we need to show that Δxk+12||{\Delta x^{k+1}}||^{2} is summable, for which we need the following lemma:

There is a random variable CC, constant in kk, such that Ek[bk2]C\mathbf{E}_{k}[||b^{k}||^{2}]\leq C for all kk almost surely.

where in the last inequality we used ikdk1ρi_{k}||d^{k-1}||\leq\rho. Then

By assumption A.2 and A.3, the first and second conditional moments Ek[gk]\mathbf{E}_{k}[||g^{k}||] and Ek[gk2]\mathbf{E}_{k}[||g^{k}||^{2}] are both bounded uniformly in kk. Then because νk\nu_{k} and βk\beta_{k} are in $andand\rhoisconstantinis constant ink$, we can put

which is what we wanted. Note that CC could be a random variable (depending on ω\omega), but this bound holds almost surely. ∎

k=1Δxk+12<\sum_{k=1}^{\infty}||{\Delta x^{k+1}}||^{2}<\infty almost surely.

We will use the following useful proposition (known as Levy’s sharpening of Borel-Cantelli Lemma, see e.g. Meyer [20, Chapter 1, Theorem 21]):

Let {bk}\{b_{k}\} be a sequence of positive, integrable random variables, and let ai=E[biFi]a_{i}=\mathbf{E}[b_{i}|\mathcal{F}_{i}], where Fi={b0,bi1}\mathcal{F}_{i}=\{b_{0},\ldots b_{i-1}\}. Then defining the partial sums Bk=i=1kbiB_{k}=\sum_{i=1}^{k}b_{i}, Ak=i=1kaiA_{k}=\sum_{i=1}^{k}a_{i},

So to prove k=1Δxk+12<\sum_{k=1}^{\infty}||{\Delta x^{k+1}}||^{2}<\infty, we only need to prove k=1Ek[Δxk+12]<\sum_{k=1}^{\infty}\mathbf{E}_{k}[||{\Delta x^{k+1}}||^{2}]<\infty. To see this, observe that

so because Ek[bk2]C\mathbf{E}_{k}[||b^{k}||^{2}]\leq C a.s.,

where we used (21). Applying the proposition finishes the lemma. ∎

The last term remaining in WkW_{k} is αk(1νkβk)F(xk),ξk-\alpha_{k}(1-\nu_{k}\beta_{k})\langle\nabla F(x^{k}),\xi^{k}\rangle. We show that

is a convergent martingale. First, note that Ei[αi(1νiβi)F(xi),ξi]=0\mathbf{E}_{i}[\alpha_{i}(1-\nu_{i}\beta_{i})\langle\nabla F(x^{i}),\xi^{i}\rangle]=0, so Ek[Mk]=Mk1E_{k}[M_{k}]=M_{k-1}, and MkM_{k} is a martingale. Now we show that supkE[Mk2]\sup_{k}\mathbf{E}[M_{k}^{2}] is bounded, which will imply a.s. convergence by Doob’s forward convergence theorem [38, Section 11.5]. Indeed, E[Mk2]=E[M02]+i=1kE[(MiMi1)2]\mathbf{E}[M_{k}^{2}]=\mathbf{E}[M_{0}^{2}]+\sum_{i=1}^{k}\mathbf{E}[(M_{i}-M_{i-1})^{2}] [38, Section 12.1].

Because ξ0\xi^{0} depends only on x0x_{0}, we can upper bound this expectation with some constant CC by using Assumption A.3. Then we have that E[M02]C\mathbf{E}[M_{0}^{2}]\leq C, so E[Mk2]C+i=1kE[(MiMi1)2]\mathbf{E}[M_{k}^{2}]\leq C+\sum_{i=1}^{k}\mathbf{E}[(M_{i}-M_{i-1})^{2}]. Therefore,

where the last inquality used Assumption A.2 and the fact that 0νiβi10\leq\nu_{i}\beta_{i}\leq 1 almost surely. Moreover,

where the first equality is by the law of total expectation and the inequality comes from Assumption A.3. Because i=1αi2<\sum_{i=1}^{\infty}\alpha_{i}^{2}<\infty, we finally have

so MkM_{k} is a convergent martingale. In particular,

We have shown (26) and (27), which concludes the proof. ∎

Now we prove Theorem 2, where under a stronger noise assumption we show that βk1\beta_{k}\to 1 is admissible as long as it goes to 1 slow enough.

Assume the sequences {αk}\{\alpha_{k}\}, {βk}\{\beta_{k}\}, and {νk}\{\nu_{k}\} satisfy the following:

then sequence {xk}\{x^{k}\} generated by the algorithm (6) satisfies

where in the second inequality we used a,b14a2+b2\langle a,b\rangle\leq\frac{1}{4}\|a\|^{2}+\|b\|^{2} for any two vectors aa and bb, and in the last inequality we used 0νkβk10\leq\nu_{k}\beta_{k}\leq 1. Taking conditional expectation on both sides of the above inequality and using E[ξk]=0\mathbf{E}[\xi^{k}]=0, we get

Next we analyze the sequence {dk1F(xk)}\{d^{k-1}-\nabla F(x^{k})\}. From the update formula in (6), we have

where in the first inequality we used a+b22a2+2b2\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}, and in the second inequality we used 2a,bηa2+1ηb22\langle a,\,b\rangle\leq\eta\|a\|^{2}+\frac{1}{\eta}\|b\|^{2} for any η>0\eta>0. By the smoothness assumption, we have

which, combining with the previous inequality, leads to

Choosing η=1νkβk\eta=1-\nu_{k}\beta_{k}, we obtain

Taking expectation conditioned on {x0,g0,,xk1,dk1,xk}\{x^{0},g^{0},\ldots,x^{k-1},d^{k-1},x^{k}\}, we have

To show that dk1F(xk)2||d^{k-1}-\nabla F(x^{k})||^{2} is a convergent martingale, we prove the following lemma, similar to Ermoliev .

Assume we are given a sequence such that Ek[Xk+1]Xk+Yk\mathbf{E}_{k}[X_{k+1}]\leq X_{k}+Y_{k}, where 0XkC0\leq X_{k}\leq C and 0YkC0\leq Y_{k}\leq C almost surely for some constant CC, the random variables YkY_{k} are Fk\mathcal{F}_{k}-measurable, and they satisfy k=0Yk<\sum_{k=0}^{\infty}Y_{k}<\infty almost surely. Then the sequence XkX_{k} converges almost surely.

We show that Zk=Xk+k=0YkZ_{k}=X_{k}+\sum_{k=0}^{\infty}Y_{k} is a convergent supermartingale. By Doob decomposition, ZkZ_{k} is a supermartingale if and only if the sequence

satisfies P(Ak+1Ak)=1\mathbf{P}(A_{k+1}\leq A_{k})=1 for all kk [38, section 12.11]. Here

and we assumed this is non-positive. So Ak+1AkA_{k+1}\leq A_{k} almost surely. The upper bound on XkX_{k} and the convergence of k=0Yk\sum_{k=0}^{\infty}Y_{k} implies that the supermartingale ZZ is in L1\mathcal{L}^{1}, so the sequence {Zk}\{Z_{k}\} converges almost surely by Doob’s forward convergence theorem [38, chapter 11]. By the convergence of Yk\sum Y_{k}, this in turn implies that the sequence XkX_{k} converges almost surely.∎

We can apply the above lemma to show that \bigl{\|}d^{k-1}-\nabla F(x^{k})\bigr{\|}^{2} is a convergent semimartingale. Because the noise ξk2||\xi^{k}||^{2} is uniformly bounded almost surely and F(xk)G||\nabla F(x^{k})||\leq G, dk1F(xk)2||d^{k-1}-\nabla F(x^{k})||^{2} is uniformly bounded in kk. The uniform bound on the noise ξk2||\xi^{k}||^{2} also implies that bk2||b^{k}||^{2} is uniformly bounded in kk. In the notation of the lemma, we have

Note that Yk0Y_{k}\geq 0. To show convergence of Yk\sum Y_{k}, note that the uniform bounds imply

for suitably large CC. Then convergence follows from the conditions on the sequences (1νkβk)2(1-\nu_{k}\beta_{k})^{2}, αk2\alpha_{k}^{2}, and αk2/(1νkβk)\alpha_{k}^{2}/(1-\nu_{k}\beta_{k}). This proves that dk1F(xk)2||d^{k-1}-\nabla F(x^{k})||^{2} converges almost surely.

Summing up the two inequalities (41) and (43) gives

If (1+αk)νkβk1(1+\alpha_{k})\nu_{k}\beta_{k}\leq 1, then

Since αk/(1νkβk)0\alpha_{k}/(1-\nu_{k}\beta_{k})\to 0, there exists mm such that (1+αk)νkβk1(1+\alpha_{k})\nu_{k}\beta_{k}\leq 1 for all kmk\geq m. Taking full expectation on both sides of the above inequality and summing up for all kmk\geq m, we obtain

The right-hand side is bounded by assumption (the index mm is finite), so we have

So because kαk=\sum_{k}\alpha_{k}=\infty, there must be a subsequence ktk_{t} with F(xkt)20||\nabla F(x^{k_{t}})||^{2}\to 0. This proves (7). ∎

Appendix C Local Convergence Rate Proofs

In this section we give a proof to Theorem 3 and it’s generalized version, which we present below.

We will denote with λi(A)\lambda_{i}(A), ρ(A)\rho(A) the ii-th eigenvalue and spectral radius of the matrix AA respectively.

Let’s recall the equations of deterministic QHM algorithm (6) with constant parameters α,β,ν\alpha,\beta,\nu:

In this section we will assume that d0d^{0} is initialized with zero vector.

Taking the gradient of the quadratic function F(x)=xTAx+bTx+cF(x)=x^{T}Ax+b^{T}x+c and substituting it into (44) yields

where II denotes the n×nn\times n identity matrix and θ={α,β,ν}\theta=\left\{\alpha,\beta,\nu\right\}. It is known that the sequence of Tk(θ,A)T^{k}(\theta,A) converges to zero if and only if the spectral radius ρ(T)<1\rho(T)<1. Moreover, Gelfand’s Formula states that ρ(T)=limkTk1k\rho(T)=\lim_{k\to\infty}\left\|T^{k}\right\|^{\frac{1}{k}}, which means that {ϵk}0,limkϵk=0\exists\left\{\epsilon_{k}\right\}_{0}^{\infty},\lim_{k\to\infty}\epsilon_{k}=0 such that

Thus, the behavior of the algorithm is determined by the eigenvalues of T(θ)T(\theta). To find them, we will use a standard technique of changing basis. Let A=QΛQTA=Q\Lambda Q^{T} be an eigendecomposition of the matrix AA. Then, multiplying AA with QQ and appropriate permutation matrix PPSee, e.g. for the exact form of matrix PP. we get

Thus, to compute eigenvalues of TT, it is enough to compute the eigenvalues of all matrices TiT_{i}.

We use the following Lemma to establish the region when ρ(Ti)<1\rho(T_{i})<1:

Let α>0,β[0,1),ν,λi(A)>0\alpha>0,\beta\in[0,1),\nu\in,\lambda_{i}(A)>0. Then

Let’s denote with λ\lambda eigenvalues of TiT_{i}. Let’s also define lλi(A)l\triangleq\lambda_{i}(A). Then, λ\lambda satisfies the following equation:

Let’s denote by S(A)={α,β,ν:A is true}S(A)=\left\{\alpha,\beta,\nu:A\ \text{is true}\right\}. The final convergence set

Let’s look at the case when D0D\geq 0. Then S(λ<1D0)=S(D0)S(λ1<1)S(λ2<1)S(\left|\lambda\right|<1\cap D\geq 0)=S(D\geq 0)\cap S(\left|\lambda_{1}\right|<1)\cap S(\left|\lambda_{2}\right|<1)

Let’s look at S(λ1<1)=S(λ1<1)S(λ1>1)S(\left|\lambda_{1}\right|<1)=S(\lambda_{1}<1)\cap S(\lambda_{1}>-1)

Let’s solve the second inequality: S(λ1<1)S(\lambda_{1}<1). Since we are only interested in the case when D0D\geq 0 we get

The last inequality is true since 1+β(12ν)>01+\beta(1-2\nu)>0. Thus

Since 3+β>2(1+β)3+\beta>2(1+\beta) and l(1νβ)l(1+β(12ν))l(1-\nu\beta)\leq l(1+\beta(1-2\nu)).

Therefore we have that λ1>1\lambda_{1}>-1 always holds and thus λ1<1\left|\lambda_{1}\right|<1 always holds.

Now we compute S(λ2<1)=S(λ2<1)S(λ2>1)S(\left|\lambda_{2}\right|<1)=S(\lambda_{2}<1)\cap S(\lambda_{2}>-1). The first term is

Now let’s move to the second case and compute S(λ<1D<0)S(\left|\lambda\right|<1\cap D<0). If D<0D<0 we have that 1αl+ανl>01-\alpha l+\alpha\nu l>0 and then

Finally, let’s find a simplified form of S(D0)S(D\geq 0) and S(D<0)S(D<0).

Let’s denote the discriminant of that equation (divided by 4) with respect to αl\alpha l as D1D_{1}:

since (1+νβ)2<2(1+β)(1+\sqrt{\nu\beta})^{2}<2(1+\beta) (left side is less than 2 and right side is greater than 2) and also

Now, let’s establish a precise equation for the spectral radius ρ(Ti)\rho(T_{i}).

Let α>0,β[0,1),ν,λi(A)>0\alpha>0,\beta\in[0,1),\nu\in,\lambda_{i}(A)>0. Let’s define lλi(A)l\triangleq\lambda_{i}(A) and

In addition, r(θ,l)r(\theta,l) is non-increasing as a function of ll for 0<l<1βα(1νβ)20<l<\frac{1-\beta}{\alpha(1-\sqrt{\nu\beta})^{2}} and is non-decreasing for l>1βα(1νβ)2l>\frac{1-\beta}{\alpha(1-\sqrt{\nu\beta})^{2}}.

Following derivations from the proof of Lemmas 5 we get

Considering 4 cases for different signs of C1C_{1} and C2C_{2}, the first statement of the Lemma immediately follows. To prove the second statement, let’s define the following 3 points:

From equation (45) and definition of C1C_{1} we get that

and it is easy to check that if βνp1p2p3\beta\geq\nu\Rightarrow p_{1}\leq p_{2}\leq p_{3} and if β<νp1p3p2\beta<\nu\Rightarrow p_{1}\leq p_{3}\leq p_{2}. Moreover, both C1(l)C_{1}(l) and C2(l)C_{2}(l) are non-increasing function of ll, and C12(l)4C2(l)C_{1}^{2}(l)-4C_{2}(l) is non-increasing when lp2l\leq p_{2} and non-decreasing when lp1l\geq p_{1}.

Let’s first prove the second statement of the Lemma for the case when β<ν\beta<\nu. In that case, when l<p1l<p_{1} the function is non-increasing, since both C1(l)C_{1}(l) and C12(l)4C2(l)C_{1}^{2}(l)-4C_{2}(l) are non-increasing. When p1lp2p_{1}\leq l\leq p_{2}, the function is non-increasing, because C2(l)C_{2}(l) is non-increasing. Finally, when l>p2l>p_{2}, the function is non-decreasing, because both C12(l)4C2(l)C_{1}^{2}(l)-4C_{2}(l) and C1(l)-C_{1}(l) are non-decreasing.

When βν\beta\geq\nu, the same reasoning applies, but we additionally need to prove that the function is non-decreasing when p2lp3p_{2}\leq l\leq p_{3}. In that case r(θ,l)=0.5(C12(l)4C2(l)+C1(l))r(\theta,l)=0.5(\sqrt{C_{1}^{2}(l)-4C_{2}(l)}+C_{1}(l)). Taking the derivative of rr with respect to ll we get

Let’s show that this derivative is always non-negative when lp2l\geq p_{2}

which is always true since left side is less than zero.

The last thing that we need in order to prove Theorem 3 is given by the following Lemma:

Let μminiλi(A)\mu\leq\min_{i}\lambda_{i}(A) and Lmaxiλi(A)L\geq\max_{i}\lambda_{i}(A). Then

In addition, the minimal spectral radius with respect to θ\theta depends on μ\mu and LL only through κ\kappa, i.e. minθR(θ,μ,L)R(κ)\min_{\theta}R(\theta,\mu,L)\triangleq R^{*}(\kappa)

To prove this first statement of the Lemma, let’s notice that by definition

But Lemma 6 states that ρ(Ti(θ))\rho(T_{i}(\theta)) is first non-increasing and then non-decreasing with respect to the eigenvalues of AA. Thus, the maximum can only be achieved on the boundaries, which are precisely equal to or smaller than r(θ,μ)r(\theta,\mu) and r(θ,L)r(\theta,L).

Let’s prove the second statement of the Lemma by contradiction. Let’s assume that the optimal rate does in fact depend on μ\mu and LL not only through κ\kappa. That means that μ1,L1,μ2,L2\exists\mu_{1},L_{1},\mu_{2},L_{2}, such that L1/μ1=L2/μ2L_{1}/\mu_{1}=L_{2}/\mu_{2}, but minθR(θ,μ1,L1)minθR(θ,μ2,L2)\min_{\theta}R(\theta,\mu_{1},L_{1})\neq\min_{\theta}R(\theta,\mu_{2},L_{2}). Let’s consider the optimal rates if the function ff is divided by μ1\mu_{1} for the first case and by μ2\mu_{2} for the second. In that case, minθR(θ,1,L1/μ1)=minθR(θ,1,L2/μ2)\min_{\theta}R(\theta,1,L_{1}/\mu_{1})=\min_{\theta}R(\theta,1,L_{2}/\mu_{2}). But on the other hand, they can’t be equal, since we have that minθR(θ,1,L1/μ1)=minθR(θ,μ1,L1)\min_{\theta}R(\theta,1,L_{1}/\mu_{1})=\min_{\theta}R(\theta,\mu_{1},L_{1}) and minθR(θ,1,L2/μ2)=minθR(θ,μ2,L2)\min_{\theta}R(\theta,1,L_{2}/\mu_{2})=\min_{\theta}R(\theta,\mu_{2},L_{2}), because multiplying learning rate by μ1\mu_{1} for the first case and by μ2\mu_{2} for the second yields exactly the same sequence of iterates and thus the optimal rate can’t change. ∎

Now we are ready to prove Theorem 3. We restate it below for convenience

Theorem 3. Let’s denote θ={α,β,ν}\theta=\{\alpha,\beta,\nu\}. For any function F(x)=xTAx+bTx+cF(x)=x^{T}Ax+b^{T}x+c that satisfies μλi(A)L\mu\leq\lambda_{i}(A)\leq L for all i=1,,ni=1,\ldots,n and any x0x^{0}, the deterministic QHM algorithm zk+1=Tzkz^{k+1}=Tz^{k} satisfies

where x=arg minxF(x)x_{*}=\operatorname*{arg\,min}_{x}{F(x)}, limkϵk=0\lim_{k\to\infty}\epsilon_{k}=0 and R(θ,μ,L)=ρ(T)R(\theta,\mu,L)=\rho(T), which can be characterized as

To ensure R(θ,μ,L)<1R(\theta,\mu,L)<1, the parameters α,β,ν\alpha,\beta,\nu must satisfy the following constraints:

In addition, the optimal rate depends only on κ\kappa: minθR(θ,μ,L)\min_{\theta}R(\theta,\mu,L) is a function of only κ\kappa.

Lemma 6 and Lemma 7 immediately give the first statement of the Theorem. One can also get the bound on the function values by using definition of the Lipschitz continuous gradient:

Finally, to get the stability region, we apply Lemma 5 and notice that λi(A)L i\lambda_{i}(A)\leq L\ \forall i. ∎

To generalize this result, let’s define the following class of functions

Then, Theorem 3 can be generalized to any function FFμ,L1F\in\mathcal{F}^{1}_{\mu,L} in the following way:

Let’s denote θ={α,β,ν}\theta=\{\alpha,\beta,\nu\}. For any function FFμ,L1F\in\mathcal{F}^{1}_{\mu,L} that is additionally twice differentiable at the point x=arg minxF(x)x_{*}=\operatorname*{arg\,min}_{x}{F(x)}, deterministic QHM algorithm locally converges to xx_{*} with linear rate, from any initialization x0x^{0} sufficiently close to xx_{*}.

Precisely, for any ϵ[0,1R(θ,μ,L))  δ>0\epsilon\in[0,1-R(\theta,\mu,L))\ \exists\ \delta>0 and c0c\geq 0, such that k0\forall k\geq 0 the following holds

if x0xδ\left\|x^{0}-x_{*}\right\|\leq\delta and α,β,ν\alpha,\beta,\nu satisfy the following constraints:

In addition, the optimal rate depends on μ\mu and LL only through κ\kappa, i.e. minθR(θ,μ,L)R(κ)\min_{\theta}R(\theta,\mu,L)\triangleq R^{*}(\kappa).

To prove this result we apply Lyapunov’s method (see e.g. Chapter 2, Theorem 1 of ) to the QHM equations. The proof is then identical to the proof of Theorem 3, with matrix AA replaced by 2F(x)\nabla^{2}F(x_{*}). ∎

Appendix D Numerical Evaluation of the Convergence Rate

In this section we provide details on the numerical evaluation of the local convergence rate of QHM. We need to numerically estimate the following function

From Lemma 6 (Appendix C) we know that r(α,β,ν,l)r(\alpha,\beta,\nu,l) is a non-increasing function of ll until some point and non-decreasing after. Also note that in fact dependence of rr on α\alpha is the same as on ll, since they only appear in formulas as a product αl\alpha l. Thus, it is easy to see that for optimal α\alpha we will have

because otherwise α\alpha could be changed to decrease the value of the maximum.

Thus, to find optimal α\alpha for fixed β,ν\beta,\nu, we can solve equation (46) for α\alpha using binary search (with precision set to 10810^{-8}). To find optimal β\beta or ν\nu we just use grid search (with grid size equal to 10310^{3}) on [0,1105][0,1-10^{-5}] for β\beta and $forfor\nu$.

To numerically verify that the dependence of the optimal rate on ν\nu is monotonic, we run this procedure for 10310^{3} values of κ\kappa which are sampled (on a uniform grids) in the following way: 100 values on ,100valueson, 100 values on, 100 values on $,150valueson, 150 values on[10^{3},10^{4}],150valueson, 150 values on[10^{4},10^{5}],200valueson, 200 values on[10^{5},10^{6}],200valueson, 200 values on[10^{6},10^{7}]$. All experiments were run in parallel using GNU Parallel Command-Line Tool .

Since rate estimation is non-exact, it happens sometimes that very close points ν\nu show non-monotonic rate dependence, but it is always the case that the rate is approximately non-increasing in ν\nu. Precisely, we verify that the following condition holds for all estimated values of κ\kappa:

where Rˉ\bar{R}^{*} is estimated rate and νi\nu_{i} is i-th sample of ν\nu. Figure 5 shows the dependence of R(ν,κ)R^{*}(\nu,\kappa) on ν\nu for different values of κ\kappa.

Appendix E Stationary Distribution Proofs

In this section we present proofs of Theorems 4, 5. We will restate the combined statement of both theorems below for convenience.

Theorems 4, 5. Suppose F(x)=12xTAxF(x)=\frac{1}{2}x^{T}Ax, where AA is symmetric positive definite matrix. The stochastic gradients satisfy gk=F(xk)+ξg^{k}=\nabla F(x^{k})+\xi, where ξ\xi is a random vector independent of xkx^{k} with zero mean E[ξ]=0\mathbf{E}\left[\xi\right]=0 and covariance matrix E[ξξT]=Σξ\mathbf{E}\left[\xi\xi^{T}\right]=\Sigma_{\xi}. Also suppose the parameteres α,β,ν\alpha,\beta,\nu satisfy (13), then QHM algorithm (6), equivalently (10) in this case, converges to stationary distribution satisfying

Consequently, when ν=0\nu=0 (SGD), Σx\Sigma_{x} satisfies

When ν=1\nu=1 (SHB), Σx\Sigma_{x} satisfies

When ν=β\nu=\beta (NAG), Σx\Sigma_{x} satisfies

We consider the behavior of QHM with constant α\alpha, β\beta, and ν\nu, described in (47).

Under assumptions of Theorems 4, 5 we have that stochastic gradient gkg^{k} is generated as

where the noise ξk\xi^{k} is independent of xkx^{k}, has zero mean and constant covariance matrix. More explicitly, for all k0k\geq 0,

where Σξ\Sigma_{\xi} is a constant covariance matrix. Substituting the expression of gkg^{k} in (48) into (47) yields

where II denotes the n×nn\times n identity matrix.

Let L>0L>0 be the largest eigenvalue of AA. From Theorem 3 we know that under the conditions (13) the dynamical system (49) is stable, i.e., the spectral radius of the matrix

To simplify notation, we rewrite Equation (49) as

As kk\to\infty, the effect of the initial point z0z^{0} dies out and the covariance matrix of the state zkz^{k} becomes constant. Let

Then using the linear dynamics (50) and the assumption that {ξk}\{\xi^{k}\} is i.i.d. and has zero mean, we obtain

Following the partition of Σz\Sigma_{z}, we partition the above matrix equation into 2 by 2 blocks and obtain

Or, letting VV be the column block matrix with entries [Σd,Σxd,Σdx,AΣx,ΣxA,AΣxd,ΣdxA,AΣxA][\Sigma_{d},\Sigma_{xd},\Sigma_{dx},A\Sigma_{x},\Sigma_{x}A,A\Sigma_{xd},\Sigma_{dx}A,A\Sigma_{x}A], and defining symbolically UU to be the block matrix with coefficients:

(each block is an n×nn\times n identity matrix), we have

Next we use combinations of the above equations to obtain simplified relations: First, we can do

We take the following asymptotic expansion of Σz\Sigma_{z}:

Here, we explicitly write the zero’th order of (Σdx,Σxd,Σx)(\Sigma_{dx},\Sigma_{xd},\Sigma_{x}) to be zero. This can be easily proved from (54)-(54).

The first order term of (54) (and (54)) gives

Let’s now extend this result to the second-order in α\alpha. The first order term of (54) gives

The second order term of (54) (and (54)) gives

Plugging (64) into (65) and (66), we obtain

Let’s get an expression for tr(AΣξ)\mathbf{tr}(A\Sigma_{\xi}). From (70) we get (by taking trace and dividing by 2)

From (62) we get (by multiplying by A, taking trace and dividing by 2)

From (63) we get (by multiplying by A and taking trace)

The special cases of SGD, SHB and NAG can be straightforwardly obtained by substituting corresponding value of ν\nu into the general formula.

Appendix F Evaluation of Stationary Distribution Size

In this section we describe experimental details for evaluation of stationary distribution size on different machine learning problems (Section 5). The first problem we consider is a simple 2-dimensional quadratic function, where we add additive zero-mean Gaussian noise independent of the point xx, so that all assumptions of Theorem 5 are fully satisfied. For this function we have

We run the QHM algorithm for 1000 iterations starting at the optimal value and plot final loss as average across all 1000 iterations. We evaluate QHM for the following sweeps of hyperparameters: 30 values of α\alpha on a uniform grid on [0.01,1.5][0.01,1.5], 30 values of β\beta on a uniform grid on [0,0.999][0,0.999], 30 values of ν\nu on a uniform grid on $.Foreachcombinationofhyperparametersweverifythat. For each combination of hyperparameters we verify that\alpha\to 1,\beta\to 1indeeddecreasestheaverageloss.However,forsmallervaluesofindeed decreases the average loss. However, for smaller values of\nutheeffectofthe effect of\betaissmaller,asexpected.Thedependenceonis smaller, as expected. The dependence on\nucanbedescribedbyaquadraticfunctionwithminimumatsomecan be described by a quadratic function with minimum at some\nu<1.Notethatfromformula(15)thedependenceon. Note that from formula (15) the dependence on\nuisindeedquadraticwithoptimalis indeed quadratic with optimal\nu$ given by

From this equation the optimal ν(β)0.5\nu_{*}(\beta)\geq 0.5 and ν(β)0.5\nu_{*}(\beta)\to 0.5 as β1\beta\to 1. In the experiments we see the same qualitative behavior, but the optimal value of ν\nu is much closer to 11 than predicted by equation (76).

The second problem we consider is logistic regression on MNIST dataset. We run QHM for 5050 epochs with batch size of 128128 and weight decay (applied both to weights and biases) of 10410^{-4} (thus, μ104\mu\approx 10^{-4}). The final loss is averaged across last 10001000 batches. We evaluate algorithm for 50 values of α[0.01,30]\alpha\in[0.01,30] (log-uniform grid), 20 values of β[0,0.999]\beta\in[0,0.999] (uniform grid), 20 values of ν\nu\in (uniform grid).

The final problem we consider is ResNet-18 on CIFAR-10 dataset. We run QHM with batch size of 256256 and weight decay of 10410^{-4} (applied only to weights). We run algorithm for 8080 epochs with constant parameters and average final loss across last 100 batches. We evaluate α{0.01,0.05,0.1,0.5,1.0,2.0,3.0,5.0,7.0,8.5}\alpha\in\left\{0.01,0.05,0.1,0.5,1.0,2.0,3.0,5.0,7.0,8.5\right\}, β{0.0,0.01,0.2,0.5,0.7,0.9,0.99,0.999}\beta\in\left\{0.0,0.01,0.2,0.5,0.7,0.9,0.99,0.999\right\}. In this experiment we always set ν=1\nu=1.

Appendix G Approximation error of Theorem 5

In this section we run a set of experiments to check for which values of parameters the equation (15) is not accurate. In fact, we can immediately see that the approximation error grows unboundedly as β1\beta\to 1 if ν{0,β,1}\nu\notin\left\{0,\beta,1\right\}, because the right-hand-side of equation (15) converges to -\infty, while the left-hand-side is bounded from below.

Since we are interested in the approximation error from the higher-order terms, we run experiments on a 2-dimensional quadratic problem where all assumptions are satisfied. We follow the same experimental settings as in the appendix F. We test a uniform grid of 20 β\beta and 20 ν\nu values on $forfor\alpha\in\left\{0.05,0.1,0.2\right\}.Note,thatwecancomputetherighthandsideofequation(15)exactly,butneedtoestimatethelefthandside.ForthatwerunQHMfor10000iterationsandcomputeanempiricalcovarianceoftheiterates.Figure6showstheresultsofthisexperiment.Weplotarelativeerrorwithcolorandthresholditat0.2(i.e.weconsidertheformulatobeinaccurateiftherelativedifferencebetweenrighthandsideandlefthandsideisbiggerthan. Note, that we can compute the right-hand-side of equation (15) exactly, but need to estimate the left-hand-side. For that we run QHM for 10000 iterations and compute an empirical covariance of the iterates. Figure 6 shows the results of this experiment. We plot a relative error with color and threshold it at 0.2 (i.e. we consider the formula to be inaccurate if the relative difference between right-hand-side and left-hand-side is bigger than20\%).Wecanseethatindeedwhen). We can see that indeed when\alphaismoderatelybig,theformulabecomesimpreciseformanydifferentvaluesofis moderately big, the formula becomes imprecise for many different values of\nuandand\beta.However,when. However, when\alphaissmall,theformulaisonlyimpreciseforaverylargevaluesofis small, the formula is only imprecise for a very large values of\betaanditbecomesmoreinaccuratewhenand it becomes more inaccurate when\nu$ is far from 0 or 1.