Understanding the Role of Momentum in Stochastic Gradient Methods

Igor Gitman, Hunter Lang, Pengchuan Zhang, Lin Xiao

Introduction

Stochastic gradient methods have become extremely popular in machine learning for solving stochastic optimization problems of the form

where $d^{k}$ is a (stochastic) search direction and $\alpha_{k}>0$ is the step size or learning rate. The classical stochastic gradient descent (SGD) method uses $d^{k}=\nabla_{x}f(x^{k},\zeta^{k})$, where $\zeta^{k}$ is a random sample collected at step $k$. For ease of notation, we use $g^{k}$ to denote $\nabla_{x}f(x^{k},\zeta^{k})$ throughout this paper.

There is a vast literature on modifications of SGD that aim to improve its theoretical and empirical performance. The most common such modification is the addition of a momentum term, which sets the search direction $d^{k}$ as the combination of the current stochastic gradient $g^{k}$ and past search directions. For example, the stochastic variant of Polyak’s heavy ball method uses

where $\beta_{k}\in[0,1)$ . We call the combination of (2) and (3) the Stochastic Heavy Ball (SHB) method. Gupal and Bazhenov studied a “normalized” version of SHB, where

In the context of modern deep learning, Sutskever et al. proposed to use a stochastic variant of Nesterov’s accelerated gradient (NAG) method, where

The number of variations on momentum has kept growing in recent years; see, e.g., Synthesized Nesterov Variants (SNV), Triple Momentum, Robust Momentum, PID Control-based methods, Accelerated SGD (AccSGD), and Quasi-Hyperbolic Momentum (QHM).

Despite various empirical successes reported for these different methods, there is a lack of clear understanding of how the different forms of momentum and their associated parameters affect convergence properties of the algorithms and other performance measures, such as final loss value. For example, Sutskever et al. show that momentum is critical to obtaining good performance in deep learning. But using different parametrizations, Ma and Yarats claim that momentum may have little practical effect. In order to clear up this confusion, several recent works [see, e.g., 40, 1, 18] have aimed to develop and analyze general frameworks that capture many different momentum methods as special cases.

In this paper, we focus on a class of algorithms captured by the general form of QHM :

Our theoretical results on the QHM model (6) cover three different aspects: asymptotic convergence with probability one, stability region and local convergence rates, and characterizations of the stationary distribution of $\{x^{k}\}$ under constant parameters $\alpha$ , $\beta$ , and $\nu$ . Specifically:

In Section 3, we show that for minimizing smooth nonconvex functions, QHM converges almost surely as $\beta_{k}\to 0$ for arbitrary values of $\nu_{k}$. More surprisingly, we show that QHM converges as $\nu_{k}\beta_{k}\to 1$ (which requires both $\nu_{k}\to 1$ and $\beta_{k}\to 1$) as long as $\nu_{k}\beta_{k}\to 1$ slowly enough compared with the speed of $\alpha_{k}\to 0$.

In Section 4, we consider local convergence behaviors of QHM for fixed parameters $\alpha$ , $\beta$ , and $\nu$ . In particular, we derive joint conditions on $(\alpha,\beta,\nu)$ that ensure local stability (or convergence when there is no stochastic noise in the gradient approximations) of the algorithm near a strict local minimum. We also characterize the local convergence rate within the stability region.

In Section 5, we investigate the stationary distribution of $\{x^{k}\}$ generated by the QHM dynamics around a local minimum (using a simple quadratic model with noise). We derive the dependence of the stationary variance on $(\alpha,\beta,\nu)$ up to the second-order Taylor expansion in $\alpha$ . These results reveal interesting effects of $\beta$ and $\nu$ that cannot be seen from first-order expansions.

Our asymptotic convergence results in Section 3 give strong guarantees for the convergence of QHM with diminishing learning rates under different regimes ( $\beta_{k}\to 0$ and $\beta_{k}\to 1$ ). However, as with most asymptotic results, they provide limited guidance on how to set the parameters in practice for fast convergence. Our results in Sections 4 and 5 complement the asymptotic results by providing principled guidelines for tuning these parameters. For example, one of the most effective schemes used in deep learning practice is called “constant and drop”, where constant parameters $(\alpha,\beta,\nu)$ are used to train the model for a long period until it reaches a stationary state and then the learning rate $\alpha$ is dropped by a constant factor for refined training. Each stage of the constant-and-drop scheme runs variants of QHM with constant parameters, and their choices dictate the overall performance of the algorithm. In Section 6, by combining our results in Sections 4 and 5, we obtain new and, in some cases, counter-intuitive insight into how to set these parameters in practice.
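To make the object of study concrete, the following is a minimal sketch of a single QHM step in plain Python. It assumes the standard form of the QHM update, namely a normalized momentum buffer $d^{k}=(1-\beta)g^{k}+\beta d^{k-1}$ followed by $x^{k+1}=x^{k}-\alpha[(1-\nu)g^{k}+\nu d^{k}]$. The function names `qhm_step` and `run_qhm` are illustrative and are not taken from the authors' code.

```python
def qhm_step(x, d_prev, grad, alpha, beta, nu):
    """One QHM step on lists of floats (assumed standard QHM form).

    Returns the next iterate and the updated momentum buffer.
    nu = 0 recovers plain SGD; nu = 1 recovers normalized SHB.
    """
    # Normalized momentum buffer: d = (1 - beta) * g + beta * d_prev
    d = [(1.0 - beta) * g + beta * dp for g, dp in zip(grad, d_prev)]
    # The iterate moves along a nu-weighted mix of the fresh gradient and d
    x_next = [xi - alpha * ((1.0 - nu) * g + nu * di)
              for xi, g, di in zip(x, grad, d)]
    return x_next, d


def run_qhm(grad_fn, x0, alpha, beta, nu, steps):
    """Run QHM with constant parameters, starting from a zero buffer."""
    x, d = list(x0), [0.0] * len(x0)
    for _ in range(steps):
        x, d = qhm_step(x, d, grad_fn(x), alpha, beta, nu)
    return x


# Example: F(x) = 0.5 * ||x||^2, so the (noise-free) gradient is x itself.
x_final = run_qhm(lambda x: list(x), [1.0, -2.0],
                  alpha=0.5, beta=0.9, nu=0.7, steps=200)
```

One stage of a constant-and-drop schedule then corresponds to one call of `run_qhm` with fixed $(\alpha,\beta,\nu)$, followed by another call with a smaller $\alpha$.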

Related work

Asymptotic convergence There exist many classical results concerning the asymptotic convergence of stochastic gradient methods [see, e.g. 37, 28, 14, and references therein]. For the classical SGD method without momentum, i.e., (2) with $d^{k}=g^{k}$, a well-known general condition for asymptotic convergence is $\sum_{k=0}^{\infty}\alpha_{k}=\infty$ and $\sum_{k=0}^{\infty}\alpha_{k}^{2}<\infty$. In general, we will always need $\alpha_{k}\to 0$ to counteract the effect of noise. Interestingly, the conditions on $\beta_{k}$ are much less restrictive. For normalized SHB, Polyak and Kaniovski studied its asymptotic convergence properties in the regime of $\alpha_{k}\to 0$ and $\beta_{k}\to 0$, while Gupal and Bazhenov investigated asymptotic convergence in the regime of $\alpha_{k}\to 0$ and $\beta_{k}\to 1$, both for convex optimization problems. More recently, Gadat et al. extended the asymptotic convergence analysis for the normalized SHB update to smooth nonconvex functions for $\beta_{k}\to 1$. In this work we generalize the classical SGD and SHB results to the case of QHM for smooth nonconvex functions.

Local convergence rate The stability region and local convergence rate of the deterministic gradient descent and heavy ball algorithms were established by Polyak for the case of convex functions near a strict twice-differentiable local minimum. For this class of functions, the heavy ball method is optimal in terms of the local convergence rate. However, it might fail to converge globally for general strongly convex twice-differentiable functions, and it is no longer optimal for the class of smooth convex functions. For the latter case, Nesterov’s accelerated gradient was shown to attain the optimal global convergence rate. In this paper we extend the results of Polyak on local convergence to the more general QHM algorithm.

Stationary analysis The analysis of the limit behavior of SGD algorithms with momentum and constant step size has been used in various applications. One line of work establishes sufficient conditions for detecting whether the iterates have reached stationarity and uses them in combination with statistical tests to automatically change the learning rate during training. Another proves many properties of the limiting behavior of SGD with constant step size by using tools from Markov chain theory. Our results are most closely related to the work of Mandt et al., who use stationary analysis of SGD with momentum to perform approximate Bayesian inference. In fact, our Theorem 4 extends their results to the case of QHM, and our Theorem 5 establishes more precise relations (to the second order in $\alpha$), revealing an interesting dependence on the parameters $\beta$ and $\nu$ which cannot be seen from the first-order equations.

Asymptotic convergence

In this section, we generalize the classical asymptotic results to provide conditions under which QHM converges almost surely to a stationary point for smooth nonconvex functions. Throughout this section, "a.s." refers to "almost surely". We need to make the following assumptions.

The following conditions hold for $F$ defined in (1) and the stochastic gradient oracle:

$F$ is differentiable and $\nabla F$ is Lipschitz continuous, i.e., there is a constant $L$ such that

$F$ is bounded below and $\|\nabla F(x)\|$ is bounded above, i.e., there exist $F_{*}$ and $G$ such that

For $k=0,1,2,\ldots$ , the stochastic gradient $g^{k}=\nabla F(x^{k})+\xi^{k}$ , where the random noise $\xi^{k}$ satisfies

where $\mathbf{E}_{k}[\cdot]$ denotes expectation conditioned on $\{x^{0},g^{0},\ldots,x^{k-1},g^{k-1},x^{k}\}$ , and $C$ is a constant.

Note that Assumption A.3 allows the distribution of $\xi^{k}$ to depend on $x^{k}$, and we simply require the second moment to be conditionally bounded uniformly in $k$. The assumption $\|\nabla F(x)\|\leq G$ can be removed if we assume a bounded domain for $x$. However, this will complicate the proof by requiring special treatment (e.g., using the machinery of gradient mapping) when $\{x^{k}\}$ converges to the boundary of the domain. Here we assume this condition to simplify the analysis.

By convergence to a stationary point, we mean that the sequence $\{x^{k}\}$ satisfies the condition

Intuitively, as $\beta_{k}\to 0$ , regardless of $\nu_{k}$ , the QHM dynamics become more like SGD, so there should be no issue with convergence. The following theorem, which generalizes the analysis technique of Ruszczyński and Syski to QHM, shows formally that this is indeed the case:

Let $F$ satisfy Assumption A. Additionally, assume $0\leq\nu_{k}\leq 1$ and the sequences $\{\alpha_{k}\}$ and $\{\beta_{k}\}$ satisfy the following conditions:

Then the sequence $\{x^{k}\}$ generated by the QHM algorithm (6) satisfies (7). Moreover, we have

More surprisingly, however, one can actually send $\nu_{k}\beta_{k}\to 1$ as long as $\nu_{k}\beta_{k}\to 1$ slowly enough, although we require a stronger condition on the noise $\xi$. We extend the technique of Gupal and Bazhenov to show asymptotic convergence of QHM for minimizing smooth nonconvex functions.

Let $F$ satisfy Assumption A, and additionally assume that $||\xi^{k}||^{2}<C$ almost surely, i.e., the noise $\xi$ is a.s. bounded. Let the sequences $\{\alpha_{k}\}$, $\{\beta_{k}\}$, and $\{\nu_{k}\}$ satisfy the following conditions:

Then the sequence $\{x^{k}\}$ generated by Algorithm (6) satisfies (7).

The conditions in Theorem 2 can be satisfied by, for example, taking $\alpha_{k}=k^{-\omega}$ and $(1-\nu_{k}\beta_{k})=k^{-c}$ for $\frac{1+c}{2}<\omega\leq 1$ and $\frac{1}{2}<c<1$ . We should note that, even though setting $\nu_{k}\beta_{k}\to 1$ is somewhat unusual in practice, we think the result of Theorem 2 is interesting from both theoretical and practical points of view. From the theoretical side, this result shows that it is possible to always be increasing the amount of momentum (in the limit when $\nu_{k}\beta_{k}=1$ , we are not using the fresh gradient information at all) and still obtain convergence for smooth functions. From the practical point of view, our Theorem 5 in Section 5 shows that for a fixed $\alpha$ , increasing $\nu_{k}\beta_{k}$ might lead to smaller stationary distribution size, which may give better empirical results.
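The example schedule above can be sketched as follows. The helper `theorem2_example_schedule` is a hypothetical name introduced for illustration. It checks only the two inequalities on $\omega$ and $c$ quoted in the text, not the full conditions of Theorem 2, and returns the product $\nu_{k}\beta_{k}$ rather than the individual factors, since the text does not prescribe how to split the product.

```python
def theorem2_example_schedule(k, omega, c):
    """Return (alpha_k, nu_k * beta_k) for the example polynomial schedule.

    alpha_k = k^(-omega) and 1 - nu_k * beta_k = k^(-c), for k >= 1,
    with 1/2 < c < 1 and (1 + c) / 2 < omega <= 1 as stated in the text.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not (0.5 < c < 1.0):
        raise ValueError("need 1/2 < c < 1")
    if not ((1.0 + c) / 2.0 < omega <= 1.0):
        raise ValueError("need (1 + c) / 2 < omega <= 1")
    alpha_k = k ** (-omega)
    nu_beta_k = 1.0 - k ** (-c)
    return alpha_k, nu_beta_k


# Example: the learning rate decays while the momentum product grows toward 1.
for k in (1, 10, 100, 1000):
    print(k, theorem2_example_schedule(k, omega=0.9, c=0.6))
```

One possible way to split the product, for the NAG-like case $\nu_{k}=\beta_{k}$, is to take both factors equal to the square root of the returned product; this choice is an assumption made here, not a recommendation from the analysis.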

Also, note that when $\nu_{k}=\beta_{k}$ , Theorems 1 and 2 give asymptotic convergence guarantees for the common practical variant of NAG, which have not appeared in the literature before. However, we should mention that the bounded noise assumption of Theorem 2 (i.e. $||\xi^{k}||^{2}<C$ a.s.) is quite restrictive. In fact, Ruszczyński and Syski prove a similar result for SGM with a more general noise condition, and their technique may extend to QHM, but bounded noise greatly simplifies the derivations. We provide the proofs of Theorems 1 and 2 in Appendix B.

The results in this section indicate that both $\beta_{k}\to 0$ and $\nu_{k}\beta_{k}\to 1$ are admissible from the perspective of asymptotic convergence. However, they give limited guidance on how to choose momentum parameters in practice, where non-asymptotic behaviors are of main concern. In the next two sections, we study local convergence and stationary behaviors of QHM with constant learning rate and momentum parameters; our analysis provides new insights that could be very useful in practice.

Stability region and local convergence rate

Let the sequence $\{x^{k}\}$ be generated by the QHM algorithm (6) with constant parameters $\alpha_{k}=\alpha$ , $\beta_{k}=\beta$ and $\nu_{k}=\nu$ . In this case, $x^{k}$ does not converge to any local minimum in the asymptotic sense, but its distribution may converge to a stationary distribution around a local minimum. Since the objective function $F$ is smooth, we can approximate $F$ around a strict local minimum $x^{*}$ by a convex quadratic function. Since $\nabla F(x^{*})=0$ , we have

where the Hessian $\nabla^{2}F(x^{*})$ is positive definite. Therefore, for the ease of analysis, we focus on convex quadratic functions of the form $F(x)=(1/2)(x-x^{*})^{T}A(x-x^{*}),$ where $A$ is positive definite (and we can set $x^{*}=0$ without loss of generality). In addition, we assume

where the noise $\xi^{k}$ satisfies Assumption A.3 and in addition, $\xi^{k}$ is independent of $x^{k}$ for all $k\geq 0$ . Mandt et al. observe that this independence assumption often holds approximately when the dynamics of SHB are approaching stationarity around a local minimum.

where $T$ and $S$ are functions of $(\alpha,\beta,\nu)$ and $A$ :

It is well-known that the linear system (10) is stable if and only if the spectral radius of $T$ , denoted by $\rho(T)$ , is less than 1. When $\rho(T)<1$ , the dynamics of (10) is the superposition of two components:

A deterministic part described by the dynamics $z^{k+1}=Tz^{k}$ with initial condition $z^{0}=[0;x^{0}]$ (we always take $d^{-1}=0$ ). This part asymptotically decays to zero.

An auto-regressive stochastic process (10) driven by $\{\xi^{k}\}$ with zero initial condition $z^{0}=[0;0]$ .

Roughly speaking, $\rho(T)$ determines how fast the dynamics converge from an arbitrary initial point $x^{0}$ to the stationary distribution, while properties of the stationary distribution (such as its variance and auto-correlations) depend on the full spectrum of the matrix $T$ as well as $S$. Both aspects have important implications for the practical performance of QHM on stochastic optimization problems. Often there are trade-offs that we have to make in choosing the parameters $\alpha$, $\beta$ and $\nu$ to balance the transient convergence behavior and stationary distribution properties.
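The superposition of a deterministic and a stochastic component can be checked numerically on a scalar quadratic $F(x)=ax^{2}/2$. The sketch below assumes the standard QHM update with $d^{-1}=0$ and additive gradient noise; the function name and the parameter values are illustrative only. Because the recursion is linear, the trajectory started at $x^{0}$ with noise equals the noise-free trajectory from $x^{0}$ plus the noisy trajectory from zero.

```python
import random


def simulate_qhm_scalar(a, x0, noise, alpha, beta, nu):
    """Run QHM on F(x) = a * x^2 / 2 with additive gradient noise.

    `noise` is the sequence xi^0, xi^1, ...; the buffer starts at d^{-1} = 0.
    Returns the final iterate.
    """
    x, d = x0, 0.0
    for xi in noise:
        g = a * x + xi                      # noisy gradient
        d = (1.0 - beta) * g + beta * d     # normalized momentum buffer
        x = x - alpha * ((1.0 - nu) * g + nu * d)
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(50)]
params = dict(a=1.0, alpha=0.1, beta=0.9, nu=0.7)

full = simulate_qhm_scalar(x0=5.0, noise=noise, **params)
deterministic = simulate_qhm_scalar(x0=5.0, noise=[0.0] * 50, **params)
stochastic = simulate_qhm_scalar(x0=0.0, noise=noise, **params)

# By linearity, the two components add up to the full trajectory.
print(abs(full - (deterministic + stochastic)))
```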

In the rest of this section, we focus on the deterministic dynamics $z^{k+1}=Tz^{k}$ to derive the conditions on $(\alpha,\beta,\nu)$ that ensure $\rho(T)<1$ and characterize the convergence rate. Let $\lambda_{i}(A)$ for $i=1,\ldots,n$ denote the eigenvalues of $A$ (they are all real and positive). In addition, we define

where $\kappa$ is the condition number. The local convergence rate for strictly convex quadratic functions is well studied for the case of gradient descent ($\nu=0$) and heavy ball ($\nu=1$). In fact, heavy ball achieves the best possible convergence rate of $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$. Thus, it is immediately clear that the optimal convergence rate of QHM will be the same and will be achieved with $\nu=1$. However, there are no results in the literature characterizing how the optimal rate or optimal parameters change as a function of $\nu$. Our next result establishes the convergence region and dependence of the convergence rate on the parameters $\alpha$, $\beta$, and $\nu$. We present the result for quadratic functions, but it can be generalized to any $L$-smooth and $\mu$-strongly convex functions, assuming the initial point $x^{0}$ is close enough to the optimal point $x_{*}$ (see Theorem 6 in Appendix C).

Let $\theta=\{\alpha,\beta,\nu\}$ (we drop the dependence of some functions on $\theta$ for brevity). For any function $F(x)=x^{T}Ax+b^{T}x+c$ that satisfies $0<\mu\leq\lambda_{i}(A)\leq L$ for all $i=1,\ldots,n$ and any $x^{0}$, there exists a sequence $\{\epsilon_{k}\}$ with $\epsilon_{k}\geq 0$ such that the deterministic QHM algorithm $z^{k+1}=Tz^{k}$ satisfies

where $x_{*}=\operatorname*{arg\,min}_{x}{F(x)}$ , $\lim_{k\to\infty}\epsilon_{k}=0$ and $R(\theta,\mu,L)=\rho(T)$ , which can be characterized as

To ensure $R(\theta,\mu,L)<1$ , the parameters $\alpha,\beta,\nu$ must satisfy the following constraints:

In addition, the optimal rate depends only on $\kappa$ : $\min_{\theta}R(\theta,\mu,L)$ is a function of only $\kappa$ .

The conditions in (13) characterize the stability region of QHM. Note that when $\nu=0$ we have the classical result for gradient descent: $\alpha<2/L$ ; when $\nu=1$ , the condition matches that of the normalized heavy ball: $\alpha<2(1+\beta)/(L(1-\beta))$ .
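The two special cases above can be checked numerically. The sketch below is our own reduction, not the paper's expression for $T$: assuming the standard QHM update and $F(x)=\frac{1}{2}x^{T}Ax$, each eigenvalue $\lambda$ of $A$ gives a $2\times 2$ block acting on $(d^{k-1},x^{k})$ with trace $1+\beta-\alpha\lambda(1-\nu\beta)$ and determinant $\beta(1-\alpha\lambda(1-\nu))$. The spectral radius is the largest root modulus over all blocks.

```python
import cmath


def qhm_spectral_radius(alpha, beta, nu, eigenvalues):
    """Spectral radius of the deterministic QHM iteration on a quadratic.

    Assumes the standard QHM update; `eigenvalues` are those of A.
    Each eigenvalue contributes the roots of z^2 - trace * z + det = 0.
    """
    rho = 0.0
    for lam in eigenvalues:
        trace = 1.0 + beta - alpha * lam * (1.0 - nu * beta)
        det = beta * (1.0 - alpha * lam * (1.0 - nu))
        root = cmath.sqrt(trace * trace - 4.0 * det)
        rho = max(rho, abs((trace + root) / 2.0), abs((trace - root) / 2.0))
    return rho


L = 4.0
# nu = 0 (gradient descent): stable just below alpha = 2 / L, unstable above.
print(qhm_spectral_radius(0.49, 0.5, 0.0, [L]),
      qhm_spectral_radius(0.51, 0.5, 0.0, [L]))

# nu = 1 (normalized heavy ball): threshold 2 (1 + beta) / (L (1 - beta)).
beta = 0.5
threshold = 2.0 * (1.0 + beta) / (L * (1.0 - beta))
print(qhm_spectral_radius(0.99 * threshold, beta, 1.0, [L]),
      qhm_spectral_radius(1.01 * threshold, beta, 1.0, [L]))
```

For $\nu=0$ the buffer $d^{k}$ decouples from the iterates, so the block also carries the eigenvalue $\beta<1$, which does not affect stability.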

The equations (12) define the convergence rate for any fixed values of the parameters $\alpha,\beta,\nu$. While they do not give a simple analytic form, they allow us to conduct easy numerical investigations. To gain more intuition into the effect that the momentum parameters $\nu$ and $\beta$ have on the convergence rate, we study how the optimal $\nu$ changes as a function of $\beta$ and vice versa. To find the optimal parameters and rate, we solve the corresponding optimization problem numerically (using the procedure described in Appendix D). For each pair $\left\{\beta,\nu\right\}$ we set $\alpha$ to the optimal value in order to remove its effect. These plots are presented in Figure 1.

A natural way to think about the interplay between the parameters $\alpha,\beta$ and $\nu$ is in terms of the total “amount of momentum”. Intuitively, it should be controlled by the product $\nu\times\beta$. This intuition helps explain Figure 1 (a), (b), which show the optimal $\beta$ as a function of $\nu$ for different values of $\kappa$. We can see that for bigger values of $\nu$ we need to use smaller values of $\beta$, since increasing either one of them increases the “amount of momentum” in QHM. However, the same intuition fails when considering $\nu$ as a function of $\beta$ (when $\beta$ is big enough), as shown in Figure 1 (c), (d). In this case there are three regimes of different behavior. In the first regime, since $\beta$ is small, the amount of momentum is not enough for the problem and thus the optimal $\nu$ is always $1$. In this regime we also need to increase $\alpha$ when increasing $\beta$ (it is typical to use a larger learning rate when the momentum coefficient is bigger). The second regime begins when we reach the optimal value of $\beta$ (where the rate is minimal); after that, the amount of momentum becomes too big and we need to decrease $\nu$ and $\alpha$. However, somewhat surprisingly, there is a third regime: when $\beta$ becomes big enough, we need to start increasing $\nu$ and $\alpha$ again. Thus we can see that it is not just the product $\nu\beta$ that governs the behavior of QHM, but a more complicated function.

Finally, based on our analytic and numerical investigations, we conjecture that the optimal convergence rate is a monotonically decreasing function of $\nu$ (if $\alpha$ and $\beta$ are chosen optimally for each $\nu$). While we cannot prove this statement, we verify this conjecture numerically in Appendix D. (In fact, we hypothesize that $R^{*}(\nu,\kappa)$ might not have an analytical formula, since it is possible to show that the optimization problem over $\alpha$ and $\beta$ is equivalent to a system of highly nonlinear equations.) The code of all of our experiments is available at https://github.com/Kipok/understanding-momentum.

Stationary analysis

In this section, we study the stationary behavior of QHM with constant parameters $\alpha$ , $\beta$ and $\nu$ . Again we only consider quadratic functions for the same reasons as outlined in the beginning of Section 4. In other words, we focus on the linear dynamics of (10) driven by the noise $\xi^{k}$ as $k\to\infty$ (where the deterministic part depending on $x^{0}$ dies out). Under the assumptions of Section 4 we have the following result on the covariance matrix defined as $\Sigma_{x}\triangleq\lim_{k\to\infty}\mathbf{E}\bigl{[}x^{k}(x^{k})^{T}\bigr{]}$ .

Suppose $F(x)=\frac{1}{2}x^{T}Ax$, where $A$ is a symmetric positive definite matrix. The stochastic gradients satisfy $g^{k}=\nabla F(x^{k})+\xi$, where $\xi$ is a random vector independent of $x^{k}$ with zero mean $\mathbf{E}\left[\xi\right]=0$ and covariance matrix $\mathbf{E}\left[\xi\xi^{T}\right]=\Sigma_{\xi}$. Also, suppose the parameters $\alpha,\beta,\nu$ satisfy (13). Then the QHM algorithm (6), equivalently (10) in this case, converges to a stationary distribution satisfying

When $\nu=1$ , this result matches the known formula for the stationary distribution of unnormalized SHB with reparametrization of $\alpha\to\alpha/(1-\beta)$ . Note that Theorem 4 shows that for the normalized version of the algorithm, the stationary distribution’s covariance does not depend on $\beta$ (or $\nu$ ) to the first order in $\alpha$ . In order to explore such dependence, we need to expand the dependence on $\alpha$ to the second order. In that case, we are not able to obtain a matrix equation, but can get the following relation for $\mathbf{tr}(A\Sigma_{x})$ .

Under the conditions of Theorem 4, we have

We note that $\mathbf{tr}(A\Sigma_{x})$ is twice the mean value of $F(x)$ when the dynamics have reached stationarity, so the right-hand side of (15) is approximately the “achievable loss” given the values of $\alpha,\beta$ and $\nu$ . It is interesting to consider several special cases:

$\nu=0$ (SGD): $\mathbf{tr}(A\Sigma_{x})=\frac{\alpha}{2}\mathbf{tr}(\Sigma_{\xi})+\frac{\alpha^{2}}{4}\mathbf{tr}(A\Sigma_{\xi})+O(\alpha^{3})$ .

$\nu=1$ (SHB): $\mathbf{tr}(A\Sigma_{x})=\frac{\alpha}{2}\mathbf{tr}(\Sigma_{\xi})+\frac{\alpha^{2}}{4}\frac{1-\beta}{1+\beta}\mathbf{tr}(A\Sigma_{\xi})+O(\alpha^{3})$ .

$\nu=\beta$ (NAG): $\mathbf{tr}(A\Sigma_{x})=\frac{\alpha}{2}\mathbf{tr}(\Sigma_{\xi})+\frac{\alpha^{2}}{4}\left(1-\frac{2\beta^{2}(1+2\beta)}{1+\beta}\right)\mathbf{tr}(A\Sigma_{\xi})+O(\alpha^{3})$ .
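The special cases above can be sanity-checked in one dimension, where $\mathbf{tr}(\Sigma_{\xi})=\sigma^{2}$ and $\mathbf{tr}(A\Sigma_{\xi})=a\sigma^{2}$. The sketch below compares the second-order expansion with a closed form for normalized SHB. That closed form is derived here from the textbook stationary variance of an AR(2) process and is not taken from the paper; setting $\beta=0$ reduces it to SGD. All function names are illustrative.

```python
def coeff_sgd(beta):
    """Second-order coefficient for nu = 0 (SGD); independent of beta."""
    return 1.0


def coeff_shb(beta):
    """Second-order coefficient for nu = 1 (SHB)."""
    return (1.0 - beta) / (1.0 + beta)


def coeff_nag(beta):
    """Second-order coefficient for nu = beta (NAG)."""
    return 1.0 - 2.0 * beta**2 * (1.0 + 2.0 * beta) / (1.0 + beta)


def second_order_expansion(alpha, coeff, a, sigma2):
    """(alpha / 2) tr(Sigma_xi) + (alpha^2 / 4) * coeff * tr(A Sigma_xi), scalar case."""
    return 0.5 * alpha * sigma2 + 0.25 * alpha**2 * coeff * a * sigma2


def shb_exact_scalar(alpha, beta, a, sigma2):
    """Exact a * Var(x) at stationarity for normalized SHB on F(x) = a x^2 / 2.

    Assumption: obtained from the AR(2) stationary-variance formula applied
    to x^{k+1} = (1 + beta - eta * a) x^k - beta x^{k-1} - eta * xi^k,
    with eta = alpha * (1 - beta).
    """
    return (1.0 + beta) * alpha * sigma2 / (
        2.0 * (1.0 + beta) - alpha * (1.0 - beta) * a)


alpha, a, sigma2 = 0.1, 1.0, 1.0
for beta in (0.0, 0.5, 0.9):
    exact = shb_exact_scalar(alpha, beta, a, sigma2)
    approx = second_order_expansion(alpha, coeff_shb(beta), a, sigma2)
    print(beta, exact, approx, abs(exact - approx))
```

The gap between the exact value and the expansion should be of order $\alpha^{3}$, in line with the remainder term in the expressions above.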

From the expressions for SHB and NAG, it might be beneficial to move $\beta$ to $1$ during training in order to make the achievable loss smaller. While moving $\beta$ to $1$ is somewhat counter-intuitive, we proved in Section 3 that QHM still converges asymptotically in this regime, assuming $\nu$ also goes to $1$ and $\nu\beta$ converges to $1$ “slower” than $\alpha$ converges to $0$. However, since we only consider a Taylor expansion in $\alpha$, there is no guarantee that the approximation remains accurate when $\nu$ and $\beta$ converge to $1$ (see Appendix G for an evaluation of this approximation error). In order to precisely investigate the dependence on $\beta$ and $\nu$, it is necessary to further extend our results by considering Taylor expansions with respect to them as well, especially in terms of $1-\beta$. We leave this for future work.

Figure 2 shows a visualization of the QHM stationary distribution on a 2-dimensional quadratic problem. We can see that our prediction about the dependence on $\alpha$ and $\beta$ holds in this case. However, the dependence on $\nu$ is more complicated: the top and bottom rows of Figure 2 show opposite behavior. Comparing this experiment with our analysis of the convergence rate (Figure 1) we can see another confirmation that for big values of $\beta$ , increasing $\nu$ can, in a sense, decrease the “amount of momentum” in the system. Next, we evaluate the average final loss for a large grid of parameters $\alpha,\beta$ and $\nu$ on three problems: a 2-dimensional quadratic function (where all of our assumptions are satisfied), logistic regression on the MNIST dataset (where the quadratic assumption is approximately satisfied, but gradient noise comes from mini-batches) and ResNet-18 on CIFAR-10 (where all of our assumptions are likely violated). Figure 3 shows the results of this experiment. We can indeed see that $\beta\to 1$ and $\alpha\to 0$ make the final loss smaller in all cases. The dependence on $\nu$ is less clear, but we can see that for large values of $\beta$ it is approximately quadratic, with a minimum at some $\nu<1$ . Thus from this point of view $\nu\neq 1$ helps when $\beta$ is big enough, which might be one of the reasons for the empirical success of the QHM algorithm. Notice that the empirical dependence on $\nu$ is qualitatively the same as predicted by formula (15), but with optimal value shifted closer to $1$ . See Appendix F for details.

Some practical implications and guidelines

In this section, we present some practical implications and guidelines for setting learning rate and momentum parameters in practical machine learning applications. In particular, we consider the question of how to set the optimal parameters in each stage of the popular constant-and-drop scheme for deep learning. We argue that in order to answer this question, it is necessary to consider both convergence rate and stationary distribution perspectives. There is typically a trade-off between obtaining a fast rate and a small stationary distribution. You can see an illustration of this trade-off in Figure 4 (a). Interestingly, by combining stationary analysis of Section 5 and results for the convergence rate (3), we can find certain regimes of parameters $\alpha,\beta$ , and $\nu$ where the final loss and the convergence speed do not compete with each other.

One of the most important of these regimes happens in the case of the SHB algorithm ($\nu=1$). In that case, we can see that when $C_{1}^{2}(l)-C_{2}^{2}(l)\leq 0,l\in\left\{\mu,L\right\}$, the convergence rate equals $\sqrt{\beta}$ and does not depend on $\alpha$. Thus, as long as this inequality is satisfied, we can set $\alpha$ as small as possible without harming the convergence rate, while decreasing the size of the stationary distribution. To get the best possible convergence rate, we in fact have to set $\alpha$ and $\beta$ in such a way that this inequality turns into an equality, and thus there is only a single value of $\alpha$ that can be used. However, as long as $\beta$ is not exactly at the optimal value, there is some freedom in choosing $\alpha$, and it should be used to decrease the size of the stationary distribution. From this point of view, the optimal value is $\alpha=\bigl{(}1-\sqrt{\beta}\bigr{)}\big{/}\Bigl{(}\mu\bigl{(}1+\sqrt{\beta}\bigr{)}\Bigr{)}$, which will be smaller than the largest possible $\alpha$ for convergence as long as $\kappa>2$ and $\beta$ is set close to $1$ (see the proof of Theorem 3 for more details). This guideline contradicts some typical advice to set $\alpha$ as big as possible while the algorithm still converges. For example, one such source says “So set $\beta$ as close to 1 as you can, and then find the highest $\alpha$ which still converges. Being at the knife’s edge of divergence, like in gradient descent, is a good place to be.” The refined guideline for the constant-and-drop scheme would be to set $\alpha$ as small as possible until the convergence noticeably slows down. You can see an illustration of this behavior on a simple quadratic problem (Figure 4 (b)), as well as for ResNet-18 on CIFAR-10 (Figure 4 (c)). Such regimes of no trade-off can be identified for $\beta$ and $\nu$ as well.
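The no-trade-off regime for SHB can be illustrated with a short numerical sketch. It assumes the standard normalized SHB update on a quadratic, for which each eigenvalue $\lambda$ of the Hessian yields the characteristic polynomial $z^{2}-(1+\beta-\alpha\lambda(1-\beta))z+\beta$. The lower end of the interval is the value of $\alpha$ quoted above. The upper end $(1+\sqrt{\beta})/(L(1-\sqrt{\beta}))$ is derived here from the same polynomial and is an assumption of this sketch, as are the function names.

```python
import cmath
import math


def shb_rate(alpha, beta, eigenvalues):
    """Largest root modulus of z^2 - trace * z + beta over the eigenvalues."""
    rate = 0.0
    for lam in eigenvalues:
        trace = 1.0 + beta - alpha * lam * (1.0 - beta)
        root = cmath.sqrt(trace * trace - 4.0 * beta)
        rate = max(rate, abs((trace + root) / 2.0), abs((trace - root) / 2.0))
    return rate


def no_tradeoff_alpha_range(beta, mu, L):
    """Interval of alpha on which the SHB rate stays at sqrt(beta).

    The interval is nonempty only when beta is large enough relative
    to the condition number L / mu.
    """
    s = math.sqrt(beta)
    alpha_min = (1.0 - s) / (mu * (1.0 + s))   # value quoted in the text
    alpha_max = (1.0 + s) / (L * (1.0 - s))    # derived for this sketch
    return alpha_min, alpha_max


mu, L, beta = 1.0, 10.0, 0.9
alpha_min, alpha_max = no_tradeoff_alpha_range(beta, mu, L)
for alpha in (0.5 * alpha_min, alpha_min, 0.5 * (alpha_min + alpha_max), alpha_max):
    print(alpha, shb_rate(alpha, beta, [mu, L]), math.sqrt(beta))
```

Inside the interval the rate stays at $\sqrt{\beta}$, so choosing $\alpha$ at the lower end costs nothing in speed; below the lower end the rate degrades.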

Conclusion

Using the general formulation of QHM, we have derived a unified set of new analytic results that give us better understanding of the role of momentum in stochastic gradient methods. Our results cover several different aspects: asymptotic convergence, stability region and local convergence rate, and characterizations of stationary distribution. We show that it is important to consider these different aspects together to understand the key trade-offs in tuning the learning rate and momentum parameters for better performance in practice. On the other hand, we note that the obtained guidelines are mainly for stochastic optimization, meaning the minimization of the training loss. There is evidence that different heuristics and guidelines may be necessary for achieving better generalization performance in machine learning, but this topic is beyond the scope of our current paper.

References

Appendix

Appendix A From NAG to QHM

In this appendix we describe the exact steps needed to go from the original NAG formulation to the formulation assumed by the QHM algorithm. We refer the reader to the cited work (Appendix A.1) for the derivation of NAG as the following momentum method (note that we change the notation to be consistent with the notation of QHM).

Next, we will move the learning rate out of the momentum into the iterates update:

When $\alpha_{k}$ and $\beta_{k}$ are constant, the two methods produce the same sequence of iterates $x_{k}$ if $d_{0}$ is initialized at $0$. To make the notation more similar to the QHM algorithm, let’s move all indices (except for $d_{k}$) up by 1:

This again does not change the algorithm. Now, let’s normalize the momentum update by $1-\beta_{k}$ :

This version is equivalent to the unnormalized one under the re-scaling $\alpha\to\alpha/(1-\beta)$ for constant parameters (for non-constant $\beta_{k}$ the algorithms are no longer equivalent). Finally, as in the cited work, we make the change of variables $y_{k}=x_{k}-\alpha_{k}\beta_{k}d_{k-1}$ and additionally assume that $\beta_{k}=\beta$ is constant:

Renaming $y_{k}$ back to $x_{k}$ and replacing $\nabla f(y_{k})$ with stochastic gradient if necessary we obtain the exact formula used in QHM update.

Overall, assuming $d_{0}=0$ and $\beta_{k}$ is constant, the QHM version of NAG is indeed equivalent (up to a change of variable) to the original NAG with re-scaling of $\alpha\to\alpha/(1-\beta)$ . However, if $\beta_{k}$ is changing from iteration to iteration, the two algorithms are no longer equivalent.
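The re-scaling $\alpha\to\alpha/(1-\beta)$ can be illustrated numerically. The sketch below uses plain (heavy ball style) momentum rather than the full NAG update, which is a simplification made here for brevity; the function names are illustrative. With a zero initial buffer and constant parameters, the normalized buffer is exactly $(1-\beta)$ times the unnormalized one, so the two runs produce the same iterates up to rounding.

```python
def run_unnormalized(grad_fn, x0, alpha, beta, steps):
    """Momentum with d = beta * d + g, started from d = 0."""
    x, d, xs = x0, 0.0, []
    for _ in range(steps):
        d = beta * d + grad_fn(x)
        x = x - alpha * d
        xs.append(x)
    return xs


def run_normalized(grad_fn, x0, alpha, beta, steps):
    """Momentum with d = beta * d + (1 - beta) * g, started from d = 0."""
    x, d, xs = x0, 0.0, []
    for _ in range(steps):
        d = beta * d + (1.0 - beta) * grad_fn(x)
        x = x - alpha * d
        xs.append(x)
    return xs


def grad(x):
    """Gradient of the toy objective F(x) = 1.5 * x^2."""
    return 3.0 * x


alpha, beta = 0.01, 0.9
xs_unnormalized = run_unnormalized(grad, 1.0, alpha, beta, steps=100)
xs_normalized = run_normalized(grad, 1.0, alpha / (1.0 - beta), beta, steps=100)
max_gap = max(abs(u - v) for u, v in zip(xs_unnormalized, xs_normalized))
print(max_gap)
```

If $\beta$ changed between iterations, the factor $(1-\beta_{k})$ would no longer be a single constant that can be absorbed into $\alpha$, which is consistent with the loss of equivalence noted above.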

Appendix B Asymptotic Convergence Proofs

In this section we prove Theorems 1 and 2. For simplicity, we assume throughout that $\alpha_{k}$ , $\nu_{k}$ and $\beta_{k}$ are nonrandom.

Here we generalize the meta-analysis of Ruszczyński and Syski to include $\nu_{k}$ .

We consider the following variant of QHM:

where $\nu_{k}$ is in $[0,1]$ and $i_{k}$ is a binary switch introduced to handle unbounded noise. Specifically, for some constant $\rho>0$, we let

Conditions repeated for convenience.

Let $F$ satisfy Assumption A. Additionally, assume the sequences $\{\alpha_{k}\}$ , $\{\beta_{k}\}$ and $\{\nu_{k}\}$ satisfy the following conditions:

Then the dynamics (6) satisfy (7) and (8).

which we will use often in the convergence analysis.

Suppose Assumption A, (20), and (23) hold, and for every $k\geq 0$ ,

where $\{S_{k}\}$ and $\{W_{k}\}$ are sequences of scalar random variables satisfying and

For the first part, suppose for a contradiction that $\liminf_{k\to\infty}||\nabla F(x^{k})||>0$ . So there exists $\epsilon>0$ and $k_{0}$ such that for all $k\geq k_{0}$ , $||\nabla F(x^{k})||\geq\epsilon$ . By (26), there exists $k_{1}\geq k_{0}$ such that $S_{k}\leq\epsilon/2$ for all $k\geq k_{1}$ . Then from (25) we obtain:

Summing over $k$ from $k_{1}$ to $\infty$ and using Assumption A.2, we get that:

The right-hand-side is finite by (27). But since $||\nabla F(x^{k})||\geq\epsilon$ for all $k\geq k_{1}$ and $\beta_{k}\leq\bar{\beta}<1$ and $0\leq\nu_{k}\leq 1$ , we have:

This implies $\sum_{k=k_{1}}^{\infty}\alpha_{k}<\infty$ , which contradicts (20). So we must have (7), i.e. $\liminf_{k}||\nabla F(x^{k})||=0$ .

For every $l$ , define the index $k(l)=\max\{k\in\mathcal{K}:k<l\}$ . Since $\mathcal{K}$ is infinite, $k(l)\to\infty$ as $l\to\infty$ . Then for sufficiently large $l$ , i.e., when $k(l)\geq k_{1}$ , (25) becomes

As $l\to\infty$ , because of (27) and $k(l)\to\infty$ , we get $\sum_{i=k(l)}^{l-1}W_{i}\to 0$ , so

Since the reverse inequality is trivial, we obtain (8).

Because $\alpha_{k}<\bar{\alpha}<\infty$ for all $k$, $\nu_{k}$ and $\beta_{k}$ are in $[0,1]$, and $||\nabla F(x^{k(l)})||\to 0$, the latter two terms in the above inequality converge to zero as $l\to\infty$. So we obtain (30) again. This concludes the proof of the lemma. ∎

All that remains now is to use the smoothness inequality (24) to identify the sequences $S_{k}$ and $W_{k}$ for the dynamics of the modified algorithm (16)-(18), and prove (26) and (27).

From the update formula (18) and using $g^{k}=\nabla F(x^{k})+\xi^{k}$ , we obtain

By the smoothness assumption A.1, we have

First we show $S_{k}\to 0$ . From the update formula, we have

Then because $i_{k}||d^{k-1}||\leq\rho$ , $\beta_{k}\to 0$ , and $\sup_{k}\beta_{k}=\bar{\beta}<1$ , we have

Because $S_{k}\geq 0$ , $\lim_{k}S_{k}=0$ .

Now we show that $W_{k}$ is summable almost surely. To begin, we need to show that $||{\Delta x^{k+1}}||^{2}$ is summable, for which we need the following lemma:

There is a random variable $C$ , constant in $k$ , such that $\mathbf{E}_{k}[||b^{k}||^{2}]\leq C$ for all $k$ almost surely.

where in the last inequality we used $i_{k}||d^{k-1}||\leq\rho$ . Then

By Assumptions A.2 and A.3, the first and second conditional moments $\mathbf{E}_{k}[||g^{k}||]$ and $\mathbf{E}_{k}[||g^{k}||^{2}]$ are both bounded uniformly in $k$. Then because $\nu_{k}$ and $\beta_{k}$ are in $[0,1]$ and $\rho$ is constant in $k$, we can put

which is what we wanted. Note that $C$ could be a random variable (depending on $\omega$ ), but this bound holds almost surely. ∎

$\sum_{k=1}^{\infty}||{\Delta x^{k+1}}||^{2}<\infty$ almost surely.

We will use the following useful proposition (known as Levy’s sharpening of Borel-Cantelli Lemma, see e.g. Meyer [20, Chapter 1, Theorem 21]):

Let $\{b_{k}\}$ be a sequence of positive, integrable random variables, and let $a_{i}=\mathbf{E}[b_{i}|\mathcal{F}_{i}]$ , where $\mathcal{F}_{i}=\{b_{0},\ldots b_{i-1}\}$ . Then defining the partial sums $B_{k}=\sum_{i=1}^{k}b_{i}$ , $A_{k}=\sum_{i=1}^{k}a_{i}$ ,

So to prove $\sum_{k=1}^{\infty}||{\Delta x^{k+1}}||^{2}<\infty$ , we only need to prove $\sum_{k=1}^{\infty}\mathbf{E}_{k}[||{\Delta x^{k+1}}||^{2}]<\infty$ . To see this, observe that

so because $\mathbf{E}_{k}[||b^{k}||^{2}]\leq C$ a.s.,

where we used (21). Applying the proposition finishes the lemma. ∎

The last term remaining in $W_{k}$ is $-\alpha_{k}(1-\nu_{k}\beta_{k})\langle\nabla F(x^{k}),\xi^{k}\rangle$ . We show that

is a convergent martingale. First, note that $\mathbf{E}_{i}[\alpha_{i}(1-\nu_{i}\beta_{i})\langle\nabla F(x^{i}),\xi^{i}\rangle]=0$ , so $\mathbf{E}_{k}[M_{k}]=M_{k-1}$ , and $M_{k}$ is a martingale. Now we show that $\sup_{k}\mathbf{E}[M_{k}^{2}]$ is bounded, which will imply a.s. convergence by Doob’s forward convergence theorem [38, Section 11.5]. Indeed, $\mathbf{E}[M_{k}^{2}]=\mathbf{E}[M_{0}^{2}]+\sum_{i=1}^{k}\mathbf{E}[(M_{i}-M_{i-1})^{2}]$ [38, Section 12.1].

Because $\xi^{0}$ depends only on $x_{0}$ , we can upper bound this expectation with some constant $C$ by using Assumption A.3. Then we have that $\mathbf{E}[M_{0}^{2}]\leq C$ , so $\mathbf{E}[M_{k}^{2}]\leq C+\sum_{i=1}^{k}\mathbf{E}[(M_{i}-M_{i-1})^{2}]$ . Therefore,

where the last inequality used Assumption A.2 and the fact that $0\leq\nu_{i}\beta_{i}\leq 1$ almost surely. Moreover,

where the first equality is by the law of total expectation and the inequality comes from Assumption A.3. Because $\sum_{i=1}^{\infty}\alpha_{i}^{2}<\infty$ , we finally have

so $M_{k}$ is a convergent martingale. In particular,

We have shown (26) and (27), which concludes the proof. ∎

Now we prove Theorem 2, where under a stronger noise assumption we show that $\beta_{k}\to 1$ is admissible as long as it goes to 1 slowly enough.

Assume the sequences $\{\alpha_{k}\}$ , $\{\beta_{k}\}$ , and $\{\nu_{k}\}$ satisfy the following:

then sequence $\{x^{k}\}$ generated by the algorithm (6) satisfies

where in the second inequality we used $\langle a,b\rangle\leq\frac{1}{4}\|a\|^{2}+\|b\|^{2}$ for any two vectors $a$ and $b$ , and in the last inequality we used $0\leq\nu_{k}\beta_{k}\leq 1$ . Taking conditional expectation on both sides of the above inequality and using $\mathbf{E}[\xi^{k}]=0$ , we get

Next we analyze the sequence $\{d^{k-1}-\nabla F(x^{k})\}$ . From the update formula in (6), we have

where in the first inequality we used $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$ , and in the second inequality we used $2\langle a,\,b\rangle\leq\eta\|a\|^{2}+\frac{1}{\eta}\|b\|^{2}$ for any $\eta>0$ . By the smoothness assumption, we have

which, combining with the previous inequality, leads to

Choosing $\eta=1-\nu_{k}\beta_{k}$ , we obtain

Taking expectation conditioned on $\{x^{0},g^{0},\ldots,x^{k-1},d^{k-1},x^{k}\}$ , we have

To show that $||d^{k-1}-\nabla F(x^{k})||^{2}$ is a convergent martingale, we prove the following lemma, similar to Ermoliev .

Assume we are given a sequence such that $\mathbf{E}_{k}[X_{k+1}]\leq X_{k}+Y_{k}$ , where $0\leq X_{k}\leq C$ and $0\leq Y_{k}\leq C$ almost surely for some constant $C$ , the random variables $Y_{k}$ are $\mathcal{F}_{k}$ -measurable, and they satisfy $\sum_{k=0}^{\infty}Y_{k}<\infty$ almost surely. Then the sequence $X_{k}$ converges almost surely.

We show that $Z_{k}=X_{k}+\sum_{k=0}^{\infty}Y_{k}$ is a convergent supermartingale. By Doob decomposition, $Z_{k}$ is a supermartingale if and only if the sequence

satisfies $\mathbf{P}(A_{k+1}\leq A_{k})=1$ for all $k$ [38, section 12.11]. Here

and we assumed this is non-positive. So $A_{k+1}\leq A_{k}$ almost surely. The upper bound on $X_{k}$ and the convergence of $\sum_{k=0}^{\infty}Y_{k}$ implies that the supermartingale $Z$ is in $\mathcal{L}^{1}$ , so the sequence $\{Z_{k}\}$ converges almost surely by Doob’s forward convergence theorem [38, chapter 11]. By the convergence of $\sum Y_{k}$ , this in turn implies that the sequence $X_{k}$ converges almost surely.∎

We can apply the above lemma to show that $\bigl{\|}d^{k-1}-\nabla F(x^{k})\bigr{\|}^{2}$ is a convergent semimartingale. Because the noise $||\xi^{k}||^{2}$ is uniformly bounded almost surely and $||\nabla F(x^{k})||\leq G$ , $||d^{k-1}-\nabla F(x^{k})||^{2}$ is uniformly bounded in $k$ . The uniform bound on the noise $||\xi^{k}||^{2}$ also implies that $||b^{k}||^{2}$ is uniformly bounded in $k$ . In the notation of the lemma, we have

Note that $Y_{k}\geq 0$ . To show convergence of $\sum Y_{k}$ , note that the uniform bounds imply

for suitably large $C$ . Then convergence follows from the conditions on the sequences $(1-\nu_{k}\beta_{k})^{2}$ , $\alpha_{k}^{2}$ , and $\alpha_{k}^{2}/(1-\nu_{k}\beta_{k})$ . This proves that $||d^{k-1}-\nabla F(x^{k})||^{2}$ converges almost surely.

Summing up the two inequalities (41) and (43) gives

If $(1+\alpha_{k})\nu_{k}\beta_{k}\leq 1$ , then

Since $\alpha_{k}/(1-\nu_{k}\beta_{k})\to 0$ , there exists $m$ such that $(1+\alpha_{k})\nu_{k}\beta_{k}\leq 1$ for all $k\geq m$ . Taking full expectation on both sides of the above inequality and summing up for all $k\geq m$ , we obtain

The right-hand side is bounded by assumption (the index $m$ is finite), so we have

So because $\sum_{k}\alpha_{k}=\infty$ , there must be a subsequence $k_{t}$ with $||\nabla F(x^{k_{t}})||^{2}\to 0$ . This proves (7). ∎

Appendix C Local Convergence Rate Proofs

In this section we give a proof of Theorem 3 and its generalized version, which we present below.

We will denote with $\lambda_{i}(A)$ , $\rho(A)$ the $i$ -th eigenvalue and spectral radius of the matrix $A$ respectively.

Let’s recall the equations of the deterministic QHM algorithm (6) with constant parameters $\alpha,\beta,\nu$ :

In this section we will assume that $d^{0}$ is initialized with the zero vector.

Taking the gradient of the quadratic function $F(x)=x^{T}Ax+b^{T}x+c$ and substituting it into (44) yields

where $I$ denotes the $n\times n$ identity matrix and $\theta=\left\{\alpha,\beta,\nu\right\}$ . It is known that the sequence of $T^{k}(\theta,A)$ converges to zero if and only if the spectral radius $\rho(T)<1$ . Moreover, Gelfand’s Formula states that $\rho(T)=\lim_{k\to\infty}\left\|T^{k}\right\|^{\frac{1}{k}}$ , which means that $\exists\left\{\epsilon_{k}\right\}_{0}^{\infty},\lim_{k\to\infty}\epsilon_{k}=0$ such that

Thus, the behavior of the algorithm is determined by the eigenvalues of $T(\theta)$ . To find them, we will use a standard technique of changing basis. Let $A=Q\Lambda Q^{T}$ be an eigendecomposition of the matrix $A$ . Then, multiplying $A$ with $Q$ and an appropriate permutation matrix $P$ (see, e.g., for the exact form of the matrix $P$ ), we get

Thus, to compute eigenvalues of $T$ , it is enough to compute the eigenvalues of all matrices $T_{i}$ .
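The reduction to $2\times 2$ blocks can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in this section: it takes the normalized QHM update $d^{k}=(1-\beta)g^{k}+\beta d^{k-1}$, $x^{k+1}=x^{k}-\alpha[(1-\nu)g^{k}+\nu d^{k}]$, a diagonal $A$ (so $Q=I$), and the gradient $\nabla F(x)=Ax$. Under these assumptions each block acts on $(d^{k-1}_{i},x^{k}_{i})$, and the largest block spectral radius should match the observed contraction rate of the iterates. The function names are hypothetical.

```python
import cmath
import math


def block_spectral_radius(alpha, beta, nu, l):
    """Spectral radius of the assumed 2x2 block
    T_i = [[beta, (1 - beta) * l], [-alpha * nu * beta, 1 - alpha * (1 - nu * beta) * l]]."""
    trace = 1.0 + beta - alpha * l * (1.0 - nu * beta)
    det = beta * (1.0 - alpha * l * (1.0 - nu))
    disc = cmath.sqrt(trace * trace - 4.0 * det)  # complex sqrt covers D < 0
    return max(abs((trace + disc) / 2.0), abs((trace - disc) / 2.0))


def qhm_state_norms(alpha, beta, nu, eigs, steps):
    """Run deterministic QHM on F(x) = 0.5 * sum_i eigs[i] * x_i^2 (assumed form)
    from x^0 = (1, ..., 1), d^0 = 0, and record ||z^k|| for z^k = (d^{k-1}, x^k)."""
    x = [1.0] * len(eigs)
    d = [0.0] * len(eigs)
    norms = []
    for _ in range(steps):
        norms.append(math.sqrt(sum(v * v for v in x) + sum(v * v for v in d)))
        for i, l in enumerate(eigs):
            g = l * x[i]                        # gradient coordinate
            d[i] = (1.0 - beta) * g + beta * d[i]
            x[i] -= alpha * ((1.0 - nu) * g + nu * d[i])
    return norms


alpha, beta, nu = 0.1, 0.5, 0.7
eigs = [1.0, 10.0]  # eigenvalues of the diagonal matrix A

# Rate predicted by the per-eigenvalue blocks.
predicted = max(block_spectral_radius(alpha, beta, nu, l) for l in eigs)

# Empirical rate: the ratio of norms removes the constant factor in ||z^k|| ~ c * rho^k.
norms = qhm_state_norms(alpha, beta, nu, eigs, steps=301)
empirical = (norms[300] / norms[150]) ** (1.0 / 150.0)

print(predicted, empirical)
```

Taking the ratio of two norms rather than a single $k$-th root is a convenience choice here; it mirrors Gelfand's formula while converging faster for a fixed number of steps.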

We use the following Lemma to establish the region when $\rho(T_{i})<1$ :

Let $\alpha>0,\beta\in[0,1),\nu\in[0,1],\lambda_{i}(A)>0$ . Then

Let’s denote with $\lambda$ eigenvalues of $T_{i}$ . Let’s also define $l\triangleq\lambda_{i}(A)$ . Then, $\lambda$ satisfies the following equation:

Let’s denote by $S(A)=\left\{\alpha,\beta,\nu:A\ \text{is true}\right\}$ . The final convergence set

Let’s look at the case when $D\geq 0$ . Then $S(\left|\lambda\right|<1\cap D\geq 0)=S(D\geq 0)\cap S(\left|\lambda_{1}\right|<1)\cap S(\left|\lambda_{2}\right|<1)$

Let’s look at $S(\left|\lambda_{1}\right|<1)=S(\lambda_{1}<1)\cap S(\lambda_{1}>-1)$

Let’s solve the second inequality: $S(\lambda_{1}<1)$ . Since we are only interested in the case when $D\geq 0$ we get

The last inequality is true since $1+\beta(1-2\nu)>0$ . Thus

Since $3+\beta>2(1+\beta)$ and $l(1-\nu\beta)\leq l(1+\beta(1-2\nu))$ .

Therefore we have that $\lambda_{1}>-1$ always holds and thus $\left|\lambda_{1}\right|<1$ always holds.

Now we compute $S(\left|\lambda_{2}\right|<1)=S(\lambda_{2}<1)\cap S(\lambda_{2}>-1)$ . The first term is

Now let’s move to the second case and compute $S(\left|\lambda\right|<1\cap D<0)$ . If $D<0$ we have that $1-\alpha l+\alpha\nu l>0$ and then

Finally, let’s find a simplified form of $S(D\geq 0)$ and $S(D<0)$ .

Let’s denote the discriminant of that equation (divided by 4) with respect to $\alpha l$ as $D_{1}$ :

since $(1+\sqrt{\nu\beta})^{2}<2(1+\beta)$ (left side is less than 2 and right side is greater than 2) and also

Now, let’s establish a precise equation for the spectral radius $\rho(T_{i})$ .

Let $\alpha>0,\beta\in[0,1),\nu\in[0,1],\lambda_{i}(A)>0$ . Let’s define $l\triangleq\lambda_{i}(A)$ and

In addition, $r(\theta,l)$ is non-increasing as a function of $l$ for $0<l<\frac{1-\beta}{\alpha(1-\sqrt{\nu\beta})^{2}}$ and is non-decreasing for $l>\frac{1-\beta}{\alpha(1-\sqrt{\nu\beta})^{2}}$ .

Following the derivations from the proof of Lemma 5 we get

Considering 4 cases for different signs of $C_{1}$ and $C_{2}$ , the first statement of the Lemma immediately follows. To prove the second statement, let’s define the following 3 points:

From equation (45) and definition of $C_{1}$ we get that

and it is easy to check that if $\beta\geq\nu\Rightarrow p_{1}\leq p_{2}\leq p_{3}$ and if $\beta<\nu\Rightarrow p_{1}\leq p_{3}\leq p_{2}$ . Moreover, both $C_{1}(l)$ and $C_{2}(l)$ are non-increasing functions of $l$ , and $C_{1}^{2}(l)-4C_{2}(l)$ is non-increasing when $l\leq p_{2}$ and non-decreasing when $l\geq p_{1}$ .

Let’s first prove the second statement of the Lemma for the case when $\beta<\nu$ . In that case, when $l<p_{1}$ the function is non-increasing, since both $C_{1}(l)$ and $C_{1}^{2}(l)-4C_{2}(l)$ are non-increasing. When $p_{1}\leq l\leq p_{2}$ , the function is non-increasing, because $C_{2}(l)$ is non-increasing. Finally, when $l>p_{2}$ , the function is non-decreasing, because both $C_{1}^{2}(l)-4C_{2}(l)$ and $-C_{1}(l)$ are non-decreasing.

When $\beta\geq\nu$ , the same reasoning applies, but we additionally need to prove that the function is non-decreasing when $p_{2}\leq l\leq p_{3}$ . In that case $r(\theta,l)=0.5(\sqrt{C_{1}^{2}(l)-4C_{2}(l)}+C_{1}(l))$ . Taking the derivative of $r$ with respect to $l$ we get

Let’s show that this derivative is always non-negative when $l\geq p_{2}$

which is always true since the left side is less than zero.

The last thing that we need in order to prove Theorem 3 is given by the following Lemma:

Let $\mu\leq\min_{i}\lambda_{i}(A)$ and $L\geq\max_{i}\lambda_{i}(A)$ . Then

In addition, the minimal spectral radius with respect to $\theta$ depends on $\mu$ and $L$ only through $\kappa$ , i.e. $\min_{\theta}R(\theta,\mu,L)\triangleq R^{*}(\kappa)$

To prove the first statement of the Lemma, let’s notice that by definition

But Lemma 6 states that $\rho(T_{i}(\theta))$ is first non-increasing and then non-decreasing with respect to the eigenvalues of $A$ . Thus, the maximum can only be achieved on the boundaries, which are precisely equal to or smaller than $r(\theta,\mu)$ and $r(\theta,L)$ .

Let’s prove the second statement of the Lemma by contradiction. Let’s assume that the optimal rate does in fact depend on $\mu$ and $L$ not only through $\kappa$ . That means that $\exists\mu_{1},L_{1},\mu_{2},L_{2}$ , such that $L_{1}/\mu_{1}=L_{2}/\mu_{2}$ , but $\min_{\theta}R(\theta,\mu_{1},L_{1})\neq\min_{\theta}R(\theta,\mu_{2},L_{2})$ . Let’s consider the optimal rates if the function $f$ is divided by $\mu_{1}$ for the first case and by $\mu_{2}$ for the second. In that case, $\min_{\theta}R(\theta,1,L_{1}/\mu_{1})=\min_{\theta}R(\theta,1,L_{2}/\mu_{2})$ . But on the other hand, they can’t be equal, since we have that $\min_{\theta}R(\theta,1,L_{1}/\mu_{1})=\min_{\theta}R(\theta,\mu_{1},L_{1})$ and $\min_{\theta}R(\theta,1,L_{2}/\mu_{2})=\min_{\theta}R(\theta,\mu_{2},L_{2})$ , because multiplying learning rate by $\mu_{1}$ for the first case and by $\mu_{2}$ for the second yields exactly the same sequence of iterates and thus the optimal rate can’t change. ∎

Now we are ready to prove Theorem 3. We restate it below for convenience.

Theorem 3. Let’s denote $\theta=\{\alpha,\beta,\nu\}$ . For any function $F(x)=x^{T}Ax+b^{T}x+c$ that satisfies $\mu\leq\lambda_{i}(A)\leq L$ for all $i=1,\ldots,n$ and any $x^{0}$ , the deterministic QHM algorithm $z^{k+1}=Tz^{k}$ satisfies

where $x_{*}=\operatorname*{arg\,min}_{x}{F(x)}$ , $\lim_{k\to\infty}\epsilon_{k}=0$ and $R(\theta,\mu,L)=\rho(T)$ , which can be characterized as

To ensure $R(\theta,\mu,L)<1$ , the parameters $\alpha,\beta,\nu$ must satisfy the following constraints:

In addition, the optimal rate depends only on $\kappa$ : $\min_{\theta}R(\theta,\mu,L)$ is a function of only $\kappa$ .

Lemma 6 and Lemma 7 immediately give the first statement of the Theorem. One can also get the bound on the function values by using the definition of the Lipschitz continuous gradient:

Finally, to get the stability region, we apply Lemma 5 and notice that $\lambda_{i}(A)\leq L\ \forall i$ . ∎

To generalize this result, let’s define the following class of functions

Then, Theorem 3 can be generalized to any function $F\in\mathcal{F}^{1}_{\mu,L}$ in the following way:

Let’s denote $\theta=\{\alpha,\beta,\nu\}$ . For any function $F\in\mathcal{F}^{1}_{\mu,L}$ that is additionally twice differentiable at the point $x_{*}=\operatorname*{arg\,min}_{x}{F(x)}$ , the deterministic QHM algorithm converges locally to $x_{*}$ at a linear rate, from any initialization $x^{0}$ sufficiently close to $x_{*}$ .

Precisely, for any $\epsilon\in[0,1-R(\theta,\mu,L))\ \exists\ \delta>0$ and $c\geq 0$ , such that $\forall k\geq 0$ the following holds

if $\left\|x^{0}-x_{*}\right\|\leq\delta$ and $\alpha,\beta,\nu$ satisfy the following constraints:

In addition, the optimal rate depends on $\mu$ and $L$ only through $\kappa$ , i.e. $\min_{\theta}R(\theta,\mu,L)\triangleq R^{*}(\kappa)$ .

To prove this result we apply Lyapunov’s method (see e.g. Chapter 2, Theorem 1 of ) to the QHM equations. The proof is then identical to the proof of Theorem 3, with matrix $A$ replaced by $\nabla^{2}F(x_{*})$ . ∎

Appendix D Numerical Evaluation of the Convergence Rate

In this section we provide details on the numerical evaluation of the local convergence rate of QHM. We need to numerically estimate the following function

From Lemma 6 (Appendix C) we know that $r(\alpha,\beta,\nu,l)$ is a non-increasing function of $l$ up to some point and non-decreasing after it. Also note that the dependence of $r$ on $\alpha$ is the same as on $l$ , since they only appear in the formulas as the product $\alpha l$ . Thus, it is easy to see that for the optimal $\alpha$ we will have

because otherwise $\alpha$ could be changed to decrease the value of the maximum.

Thus, to find the optimal $\alpha$ for fixed $\beta,\nu$, we can solve equation (46) for $\alpha$ using binary search (with precision set to $10^{-8}$). To find the optimal $\beta$ or $\nu$ we just use grid search (with grid size equal to $10^{3}$) on $[0,1-10^{-5}]$ for $\beta$ and $[0,1]$ for $\nu$.
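A minimal sketch of this search procedure follows. It is not the authors' code and makes several assumptions: equation (46) is read as the balance condition $r(\alpha,\beta,\nu,\mu)=r(\alpha,\beta,\nu,L)$; $r$ is computed from the $2\times 2$ block of the normalized QHM update on a quadratic with gradient $Ax$; and the upper end of the bisection bracket is taken to be the stability bound $2(1+\beta)/\bigl(L(1+\beta(1-2\nu))\bigr)$. Only the grid over $\beta$ is shown, for a fixed $\nu$; the names `r`, `balanced_alpha`, and `best_rate` are hypothetical.

```python
import cmath


def r(alpha, beta, nu, l):
    """Spectral radius of the assumed 2x2 QHM block for curvature l."""
    trace = 1.0 + beta - alpha * l * (1.0 - nu * beta)
    det = beta * (1.0 - alpha * l * (1.0 - nu))
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return max(abs((trace + disc) / 2.0), abs((trace - disc) / 2.0))


def balanced_alpha(beta, nu, mu, L, tol=1e-8):
    """Bisection on alpha for the point where r(., L) overtakes r(., mu)."""
    lo = 0.0
    # Assumed bracket: at this alpha the block for l = L has spectral radius 1.
    hi = 2.0 * (1.0 + beta) / (L * (1.0 + beta * (1.0 - 2.0 * nu)))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if r(mid, beta, nu, L) > r(mid, beta, nu, mu):
            hi = mid   # step size too large: the L end dominates
        else:
            lo = mid   # step size too small: the mu end dominates (or a tie)
    return 0.5 * (lo + hi)


def best_rate(nu, mu, L, grid=1000):
    """Grid search over beta in [0, 1 - 1e-5]; alpha is balanced for each beta."""
    best = (1.0, None, None)  # (rate, alpha, beta)
    for j in range(grid):
        beta = (1.0 - 1e-5) * j / (grid - 1)
        alpha = balanced_alpha(beta, nu, mu, L)
        rate = max(r(alpha, beta, nu, mu), r(alpha, beta, nu, L))
        if rate < best[0]:
            best = (rate, alpha, beta)
    return best


# Plain gradient descent (beta = 0, nu = 0) with mu = 1, L = 10.
alpha_gd = balanced_alpha(0.0, 0.0, 1.0, 10.0)

# Heavy-ball case (nu = 1) with kappa = 100.
rate_hb, alpha_hb, beta_hb = best_rate(1.0, 1.0, 100.0)

print(alpha_gd, rate_hb, alpha_hb, beta_hb)
```

For $\beta=\nu=0$ the balance point reduces to the classical gradient-descent step size $2/(\mu+L)$, which gives a quick sanity check of the bisection.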

To numerically verify that the dependence of the optimal rate on $\nu$ is monotonic, we run this procedure for $10^{3}$ values of $\kappa$, sampled on uniform grids in the following way: 100 values on $[1,10]$, 100 values on $[10,100]$, 100 values on $[100,1000]$, 150 values on $[10^{3},10^{4}]$, 150 values on $[10^{4},10^{5}]$, 200 values on $[10^{5},10^{6}]$, 200 values on $[10^{6},10^{7}]$. All experiments were run in parallel using GNU Parallel Command-Line Tool .

Since rate estimation is non-exact, it happens sometimes that very close points $\nu$ show non-monotonic rate dependence, but it is always the case that the rate is approximately non-increasing in $\nu$ . Precisely, we verify that the following condition holds for all estimated values of $\kappa$ :

where $\bar{R}^{*}$ is the estimated rate and $\nu_{i}$ is the $i$-th sample of $\nu$ . Figure 5 shows the dependence of $R^{*}(\nu,\kappa)$ on $\nu$ for different values of $\kappa$ .

Appendix E Stationary Distribution Proofs

In this section we present proofs of Theorems 4, 5. We will restate the combined statement of both theorems below for convenience.

Theorems 4, 5. Suppose $F(x)=\frac{1}{2}x^{T}Ax$ , where $A$ is a symmetric positive definite matrix. The stochastic gradients satisfy $g^{k}=\nabla F(x^{k})+\xi$ , where $\xi$ is a random vector independent of $x^{k}$ with zero mean $\mathbf{E}\left[\xi\right]=0$ and covariance matrix $\mathbf{E}\left[\xi\xi^{T}\right]=\Sigma_{\xi}$ . Also suppose the parameters $\alpha,\beta,\nu$ satisfy (13). Then the QHM algorithm (6), equivalently (10) in this case, converges to a stationary distribution satisfying

Consequently, when $\nu=0$ (SGD), $\Sigma_{x}$ satisfies

When $\nu=1$ (SHB), $\Sigma_{x}$ satisfies

When $\nu=\beta$ (NAG), $\Sigma_{x}$ satisfies

We consider the behavior of QHM with constant $\alpha$ , $\beta$ , and $\nu$ , described in (47).

Under the assumptions of Theorems 4, 5, the stochastic gradient $g^{k}$ is generated as

where the noise $\xi^{k}$ is independent of $x^{k}$ , has zero mean and constant covariance matrix. More explicitly, for all $k\geq 0$ ,

where $\Sigma_{\xi}$ is a constant covariance matrix. Substituting the expression of $g^{k}$ in (48) into (47) yields

where $I$ denotes the $n\times n$ identity matrix.

Let $L>0$ be the largest eigenvalue of $A$ . From Theorem 3 we know that under the conditions (13) the dynamical system (49) is stable, i.e., the spectral radius of the matrix

To simplify notation, we rewrite Equation (49) as

As $k\to\infty$ , the effect of the initial point $z^{0}$ dies out and the covariance matrix of the state $z^{k}$ becomes constant. Let

Then using the linear dynamics (50) and the assumption that $\{\xi^{k}\}$ is i.i.d. and has zero mean, we obtain

Following the partition of $\Sigma_{z}$ , we partition the above matrix equation into 2 by 2 blocks and obtain

Or, letting $V$ be the column block matrix with entries $[\Sigma_{d},\Sigma_{xd},\Sigma_{dx},A\Sigma_{x},\Sigma_{x}A,A\Sigma_{xd},\Sigma_{dx}A,A\Sigma_{x}A]$ , and defining symbolically $U$ to be the block matrix with coefficients:

(each block is an $n\times n$ identity matrix), we have

Next we use combinations of the above equations to obtain simplified relations: First, we can do

We take the following asymptotic expansion of $\Sigma_{z}$ :

Here, we explicitly write the zero’th order of $(\Sigma_{dx},\Sigma_{xd},\Sigma_{x})$ to be zero. This can be easily proved from (54)-(54).

The first order term of (54) (and (54)) gives

Let’s now extend this result to the second-order in $\alpha$ . The first order term of (54) gives

The second order term of (54) (and (54)) gives

Plugging (64) into (65) and (66), we obtain

Let’s get an expression for $\mathbf{tr}(A\Sigma_{\xi})$ . From (70) we get (by taking trace and dividing by 2)

From (62) we get (by multiplying by A, taking trace and dividing by 2)

From (63) we get (by multiplying by A and taking trace)

The special cases of SGD, SHB and NAG can be straightforwardly obtained by substituting corresponding value of $\nu$ into the general formula.
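The stationary covariance can also be computed numerically, which gives an independent check on the expansion. The following sketch is an illustration under stated assumptions rather than part of the proof: it works in one dimension with $F(x)=\frac{1}{2}ax^{2}$ and noise variance $\sigma^{2}$, writes the dynamics as $z^{k+1}=Tz^{k}+B\xi^{k}$ with $z^{k}=(d^{k-1},x^{k})$ for the normalized QHM update, and iterates the covariance recursion $\Sigma\leftarrow T\Sigma T^{T}+\sigma^{2}BB^{T}$ to its fixed point. The result is compared with the closed form for SGD and with the leading-order value $\frac{\alpha}{2}\sigma^{2}$, which is how we read the first-order statement of the theorem. The function name is hypothetical.

```python
def stationary_x_variance(alpha, beta, nu, a, sigma2, tol=1e-15, max_iter=100000):
    """Fixed point of S = T S T^T + sigma2 * B B^T for 1-D QHM; returns Var(x)."""
    # Assumed dynamics with g = a * x + xi:
    #   d^k     = beta * d^{k-1} + (1 - beta) * (a * x^k + xi)
    #   x^{k+1} = x^k - alpha * (1 - nu * beta) * (a * x^k + xi) - alpha * nu * beta * d^{k-1}
    T = ((beta, (1.0 - beta) * a),
         (-alpha * nu * beta, 1.0 - alpha * (1.0 - nu * beta) * a))
    B = (1.0 - beta, -alpha * (1.0 - nu * beta))
    S = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(max_iter):
        TS = [[sum(T[i][k] * S[k][j] for k in range(2)) for j in range(2)]
              for i in range(2)]
        new = [[sum(TS[i][k] * T[j][k] for k in range(2)) + sigma2 * B[i] * B[j]
                for j in range(2)] for i in range(2)]
        change = max(abs(new[i][j] - S[i][j]) for i in range(2) for j in range(2))
        S = new
        if change < tol:
            break
    return S[1][1]  # variance of x, the lower-right entry


a, sigma2, alpha = 1.0, 1.0, 0.01

# SGD (nu = 0): x^{k+1} = (1 - alpha * a) x^k - alpha * xi has the exact
# stationary variance alpha * sigma2 / (a * (2 - alpha * a)).
sgd_var = stationary_x_variance(alpha, 0.9, 0.0, a, sigma2)
sgd_exact = alpha * sigma2 / (a * (2.0 - alpha * a))

# A generic QHM setting, compared with the leading-order value alpha / 2 * sigma2.
qhm_var = stationary_x_variance(alpha, 0.9, 0.7, a, sigma2)
leading_order = 0.5 * alpha * sigma2

print(sgd_var, sgd_exact, a * qhm_var, leading_order)
```

The gap between `a * qhm_var` and `leading_order` is the contribution of the higher-order terms in $\alpha$, so it should shrink as $\alpha$ decreases.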

Appendix F Evaluation of Stationary Distribution Size

In this section we describe experimental details for evaluation of stationary distribution size on different machine learning problems (Section 5). The first problem we consider is a simple 2-dimensional quadratic function, where we add additive zero-mean Gaussian noise independent of the point $x$ , so that all assumptions of Theorem 5 are fully satisfied. For this function we have

We run the QHM algorithm for 1000 iterations starting at the optimal value and plot the final loss as the average across all 1000 iterations. We evaluate QHM for the following sweeps of hyperparameters: 30 values of $\alpha$ on a uniform grid on $[0.01,1.5]$, 30 values of $\beta$ on a uniform grid on $[0,0.999]$, 30 values of $\nu$ on a uniform grid on $[0,1]$. For each combination of hyperparameters we verify that $\alpha\to 1,\beta\to 1$ indeed decreases the average loss. However, for smaller values of $\nu$ the effect of $\beta$ is smaller, as expected. The dependence on $\nu$ can be described by a quadratic function with minimum at some $\nu<1$. Note that from formula (15) the dependence on $\nu$ is indeed quadratic with optimal $\nu$ given by

From this equation the optimal $\nu_{*}(\beta)\geq 0.5$ and $\nu_{*}(\beta)\to 0.5$ as $\beta\to 1$ . In the experiments we see the same qualitative behavior, but the optimal value of $\nu$ is much closer to $1$ than predicted by equation (76).

The second problem we consider is logistic regression on the MNIST dataset. We run QHM for $50$ epochs with batch size of $128$ and weight decay (applied both to weights and biases) of $10^{-4}$ (thus, $\mu\approx 10^{-4}$ ). The final loss is averaged across the last $1000$ batches. We evaluate the algorithm for 50 values of $\alpha\in[0.01,30]$ (log-uniform grid), 20 values of $\beta\in[0,0.999]$ (uniform grid), 20 values of $\nu\in[0,1]$ (uniform grid).

The final problem we consider is ResNet-18 on the CIFAR-10 dataset. We run QHM with batch size of $256$ and weight decay of $10^{-4}$ (applied only to weights). We run the algorithm for $80$ epochs with constant parameters and average the final loss across the last 100 batches. We evaluate $\alpha\in\left\{0.01,0.05,0.1,0.5,1.0,2.0,3.0,5.0,7.0,8.5\right\}$ , $\beta\in\left\{0.0,0.01,0.2,0.5,0.7,0.9,0.99,0.999\right\}$ . In this experiment we always set $\nu=1$ .

Appendix G Approximation error of Theorem 5

In this section we run a set of experiments to check for which values of parameters the equation (15) is not accurate. In fact, we can immediately see that the approximation error grows unboundedly as $\beta\to 1$ if $\nu\notin\left\{0,\beta,1\right\}$ , because the right-hand-side of equation (15) converges to $-\infty$ , while the left-hand-side is bounded from below.

Since we are interested in the approximation error from the higher-order terms, we run experiments on a 2-dimensional quadratic problem where all assumptions are satisfied. We follow the same experimental settings as in Appendix F. We test a uniform grid of 20 $\beta$ and 20 $\nu$ values on $[0,1]$ for $\alpha\in\left\{0.05,0.1,0.2\right\}$. Note that we can compute the right-hand-side of equation (15) exactly, but need to estimate the left-hand-side. For that we run QHM for 10000 iterations and compute an empirical covariance of the iterates. Figure 6 shows the results of this experiment. We plot the relative error with color and threshold it at 0.2 (i.e. we consider the formula to be inaccurate if the relative difference between the right-hand-side and the left-hand-side is bigger than $20\%$). We can see that indeed when $\alpha$ is moderately big, the formula becomes imprecise for many different values of $\nu$ and $\beta$. However, when $\alpha$ is small, the formula is only imprecise for very large values of $\beta$ and it becomes more inaccurate when $\nu$ is far from 0 or 1.
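The estimation step can be illustrated with a small simulation. This is a simplified sketch and not the setup used for Figure 6: it uses a one-dimensional quadratic $F(x)=\frac{1}{2}ax^{2}$ with Gaussian gradient noise, the normalized QHM update, a longer run than in the text to reduce sampling error, and it compares the empirical value of $a\,\mathbf{E}[x^{2}]$ only with the leading-order term $\frac{\alpha}{2}\sigma^{2}$ rather than with the full right-hand-side of equation (15). The function name, seed, and parameter values are arbitrary choices.

```python
import random


def empirical_loss_trace(alpha, beta, nu, a, sigma, steps, seed=0):
    """Run stochastic QHM from the optimum x = 0 and return a * mean(x^2)."""
    rng = random.Random(seed)      # fixed seed for reproducibility
    x, d = 0.0, 0.0
    acc = 0.0
    for _ in range(steps):
        g = a * x + rng.gauss(0.0, sigma)            # noisy gradient
        d = (1.0 - beta) * g + beta * d              # momentum buffer
        x -= alpha * ((1.0 - nu) * g + nu * d)       # QHM step
        acc += x * x
    return a * acc / steps


alpha, beta, nu = 0.1, 0.5, 0.7
a, sigma = 1.0, 1.0

estimate = empirical_loss_trace(alpha, beta, nu, a, sigma, steps=200000)
leading_order = 0.5 * alpha * sigma ** 2

# Relative difference, to be compared against the 20% threshold used in the text.
rel_err = abs(estimate - leading_order) / leading_order
print(estimate, leading_order, rel_err)
```

Repeating the run over a grid of $\beta$ and $\nu$ values and coloring each cell by `rel_err` would give a plot of the same kind as the one described above.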