Understanding the Role of Momentum in Stochastic Gradient Methods
Igor Gitman, Hunter Lang, Pengchuan Zhang, Lin Xiao
Introduction
Stochastic gradient methods have become extremely popular in machine learning for solving stochastic optimization problems of the form
where is a (stochastic) search direction and is the step size or learning rate. The classical stochastic gradient descent (SGD) method uses , where is a random sample collected at step . For the ease of notation, we use to denote throughout this paper.
There is a vast literature on modifications of SGD that aim to improve its theoretical and empirical performance. The most common such modification is the addition of a momentum term, which sets the search direction as the combination of the current stochastic gradient and past search directions. For example, the stochastic variant of Polyak’s heavy ball method uses
where . We call the combination of (2) and (3) the Stochastic Heavy Ball (SHB) method. Gupal and Bazhenov studied a “normalized” version of SHB, where
In the context of modern deep learning, Sutskever et al. proposed to use a stochastic variant of Nesterov’s accelerated gradient (NAG) method, where
The number of variations on momentum has kept growing in recent years; see, e.g., Synthesized Nesterov Variants (SNV) , Triple Momentum , Robust Momentum , PID Control-based methods , Accelerated SGD (AccSGD) , and Quasi-Hyperbolic Momentum (QHM) .
Despite various empirical successes reported for these different methods, there is a lack of clear understanding of how the different forms of momentum and their associated parameters affect convergence properties of the algorithms and other performance measures, such as final loss value. For example, Sutskever et al. show that momentum is critical to obtaining good performance in deep learning. But using different parametrizations, Ma and Yarats claim that momentum may have little practical effect. In order to clear up this confusion, several recent works [see, e.g., 40, 1, 18] have aimed to develop and analyze general frameworks that capture many different momentum methods as special cases.
In this paper, we focus on a class of algorithms captured by the general form of QHM :
Our theoretical results on the QHM model (6) cover three different aspects: asymptotic convergence with probability one, stability region and local convergence rates, and characterizations of the stationary distribution of under constant parameters , , and . Specifically:
In Section 3, we show that for minimizing smooth nonconvex functions, QHM converges almost surely as for arbitrary values of . And more surprisingly, we show that QHM converges as (which requires both and ) as long as slow enough, as compared with the speed of .
In Section 4, we consider local convergence behaviors of QHM for fixed parameters , , and . In particular, we derive joint conditions on that ensure local stability (or convergence when there is no stochastic noise in the gradient approximations) of the algorithm near a strict local minimum. We also characterize the local convergence rate within the stability region.
In Section 5, we investigate the stationary distribution of generated by the QHM dynamics around a local minimum (using a simple quadratic model with noise). We derive the dependence of the stationary variance on up to the second-order Taylor expansion in . These results reveal interesting effects of and that cannot be seen from first-order expansions.
Our asymptotic convergence results in Section 3 give strong guarantees for the convergence of QHM with diminishing learning rates under different regimes ( and ). However, as with most asymptotic results, they provide limited guidance on how to set the parameters in practice for fast convergence. Our results in Sections 4 and 5 complement the asymptotic results by providing principled guidelines for tuning these parameters. For example, one of the most effective schemes used in deep learning practice is called “constant and drop”, where constant parameters are used to train the model for a long period until it reaches a stationary state and then the learning rate is dropped by a constant factor for refined training. Each stage of the constant-and-drop scheme runs variants of QHM with constant parameters, and their choices dictate the overall performance of the algorithm. In Section 6, by combining our results in Sections 4 and 5, we obtain new and, in some cases, counter-intuitive insight into how to set these parameters in practice.
Related work
Asymptotic convergence There exist many classical results concerning the asymptotic convergence of the stochastic gradient methods [see, e.g. 37, 28, 14, and references therein]. For the classical SGD method without momentum, i.e., (2) with , a well-known general condition for asymptotic convergence is and . In general, we will always need to counteract the effect of noise. But interestingly, the conditions on are much less restricted. For normalized SHB, Polyak and Kaniovski studied its asymptotic convergence properties in the regime of and , while Gupal and Bazhenov investigated asymptotic convergence in the regime of and , both for convex optimization problems. More recently, Gadat et al. extended asymptotic convergence analysis for the normalized SHB update to smooth nonconvex functions for . In this work we generalize the classical SGD and SHB results to the case of QHM for smooth nonconvex functions.
Local convergence rate The stability region and local convergence rate of the deterministic gradient descent and heavy ball algorithms were established by Boris Polyak for the case of convex functions near a strict twice-differentiable local minimum . For this class of functions heavy ball method is optimal in terms of the local convergence rate . However, it might fail to converge globally for the general strongly convex twice-differentiable functions and is no longer optimal for the class of smooth convex functions. For the latter case, Nesterov’s accelerated gradient was shown to attain the optimal global convergence rate . In this paper we extend the results of Polyak on local convergence to the more general QHM algorithm.
Stationary analysis The limit behavior analysis of SGD algorithms with momentum and constant step size was used in various applications. establish sufficient conditions on detecting whether iterates reach stationarity and use them in combination with statistical tests to automatically change learning rate during training. prove many properties of limiting behavior of SGD with constant step size by using tools from Markov chain theory. Our results are most closely related to the work of Mandt et al. who use stationary analysis of SGD with momentum to perform approximate Bayesian inference. In fact, our Theorem 4 extends their results to the case of QHM and our Theorem 5 establishes more precise relations (to the second order in ), revealing interesting dependence on the parameters and which cannot be seen from the first order equations.
Asymptotic convergence
In this section, we generalize the classical asymptotic results to provide conditions under which QHM converges almost surely to a stationary point for smooth nonconvex functions. Throughout this section, "a.s." refers to "almost surely". We need to make the following assumptions.
The following conditions hold for defined in (1) and the stochastic gradient oracle:
is differentiable and is Lipschitz continuous, i.e., there is a constant such that
is bounded below and is bounded above, i.e., there exist and such that
For , the stochastic gradient , where the random noise satisfies
where denotes expectation conditioned on , and is a constant.
Note that Assumption A.3 allows the distribution of to depend on , and we simply require the second moment to be conditionally bounded uniformly in . The assumption can be removed if we assume a bounded domain for . However, this will complicate the proof by requiring special treatment (e.g., using the machinery of gradient mapping ) when converges to the boundary of the domain. Here we assume this condition to simplify the analysis.
By convergence to a stationary point, we mean that the sequence satisfies the condition
Intuitively, as , regardless of , the QHM dynamics become more like SGD, so there should be no issue with convergence. The following theorem, which generalizes the analysis technique of Ruszczyński and Syski to QHM, shows formally that this is indeed the case:
Let satisfy Assumption A. Additionally, assume and the sequences and satisfy the following conditions:
Then the sequence generated by the QHM algorithm (6) satisfies (7). Moreover, we have
More surprisingly, however, one can actually send as long as slow enough, although we require a stronger condition on the noise . We extend the technique of Gupal and Bazhenov to show asymptotic convergence of QHM for minimizing smooth nonconvex functions.
Let satisfy assumption A, and additionally assume that almost surely, i.e., the noise is a.s. bounded. Let the sequences , , and satisfy the following conditions:
Then the sequence generated by Algorithm (6) satisfies (7).
The conditions in Theorem 2 can be satisfied by, for example, taking and for and . We should note that, even though setting is somewhat unusual in practice, we think the result of Theorem 2 is interesting from both theoretical and practical points of view. From the theoretical side, this result shows that it is possible to always be increasing the amount of momentum (in the limit when , we are not using the fresh gradient information at all) and still obtain convergence for smooth functions. From the practical point of view, our Theorem 5 in Section 5 shows that for a fixed , increasing might lead to smaller stationary distribution size, which may give better empirical results.
Also, note that when , Theorems 1 and 2 give asymptotic convergence guarantees for the common practical variant of NAG, which have not appeared in the literature before. However, we should mention that the bounded noise assumption of Theorem 2 (i.e. a.s.) is quite restrictive. In fact, Ruszczyński and Syski prove a similar result for SGM with a more general noise condition, and their technique may extend to QHM, but bounded noise greatly simplifies the derivations. We provide the proofs of Theorems 1 and 2 in Appendix B.
The results in this section indicate that both and are admissible from the perspective of asymptotic convergence. However, they give limited guidance on how to choose momentum parameters in practice, where non-asymptotic behaviors are of main concern. In the next two sections, we study local convergence and stationary behaviors of QHM with constant learning rate and momentum parameters; our analysis provides new insights that could be very useful in practice.
Stability region and local convergence rate
Let the sequence be generated by the QHM algorithm (6) with constant parameters , and . In this case, does not converge to any local minimum in the asymptotic sense, but its distribution may converge to a stationary distribution around a local minimum. Since the objective function is smooth, we can approximate around a strict local minimum by a convex quadratic function. Since , we have
where the Hessian is positive definite. Therefore, for the ease of analysis, we focus on convex quadratic functions of the form where is positive definite (and we can set without loss of generality). In addition, we assume
where the noise satisfies Assumption A.3 and in addition, is independent of for all . Mandt et al. observe that this independence assumption often holds approximately when the dynamics of SHB are approaching stationarity around a local minimum.
where and are functions of and :
It is well-known that the linear system (10) is stable if and only if the spectral radius of , denoted by , is less than 1. When , the dynamics of (10) is the superposition of two components:
A deterministic part described by the dynamics with initial condition (we always take ). This part asymptotically decays to zero.
An auto-regressive stochastic process (10) driven by with zero initial condition .
Roughly speaking, determines how fast the dynamics converge from an arbitrary initial point to the stationary distribution, while properties of the stationary distribution (such as its variance and auto-correlations) depends on the full spectrum of the matrix as well as . Both aspects have important implications for the practical performance of QHM on stochastic optimization problems. Often there are trade-offs that we have to make in choosing the parameters , and to balance the transient convergence behavior and stationary distribution properties.
In the rest of this section, we focus on the deterministic dynamics to derive the conditions on that ensure and characterize the convergence rate. Let for denote the eigenvalues of (they are all real and positive). In addition, we define
where is the condition number. The local convergence rate for strictly convex quadratic functions is well studied for the case of gradient descent () and heavy ball () . In fact, heavy ball achieves the best possible convergence rate of . Thus, it is immediately clear that the optimal convergence rate of QHM will be the same and will be achieved with . However, there are no results in the literature characterizing how the optimal rate or optimal parameters change as a function of . Our next result establishes the convergence region and dependence of the convergence rate on the parameters , , and . We present the result for quadratic functions, but it can be generalized to any -smooth and -strongly convex functions, assuming the initial point is close enough to the optimal point (see Theorem 6 in Appendix C).
Let’s denote We drop the dependence of some functions on for brevity.. For any function that satisfies for all and any , , with , such that the deterministic QHM algorithm satisfies
where , and , which can be characterized as
To ensure , the parameters must satisfy the following constraints:
In addition, the optimal rate depends only on : is a function of only .
The conditions in (13) characterize the stability region of QHM. Note that when we have the classical result for gradient descent: ; when , the condition matches that of the normalized heavy ball: .
The equations (12) define the convergence rate for any fixed values of the parameters . While it does not give a simple analytic form, it allows us to conduct easy numerical investigations. To gain more intuition into the effect that momentum parameters and have on the convergence rate, we study how the optimal changes as a function of and vice versa. To find the optimal parameters and rate, we solve the corresponding optimization problem numerically (using the procedure described in Appendix D). For each pair we set to the optimal value in order to remove its effect. These plots are presented in Figure 1.
A natural way to think about the interplay between parameters and is in terms of the total “amount of momentum”. Intuitively, it should be controlled by the product of . This intuition helps explain Figure 1 (a), (b), which show the dependence of the optimal as a function of for different values of . We can see that for bigger values of we need to use smaller values of , since increasing each one of them increases the “amount of momentum” in QHM. However, the same intuition fails when considering as a function of (and is big enough), as shown in Figure 1 (c), (d). In this case there are 3 regimes of different behavior. In the first regime, since is small, the amount of momentum is not enough for the problem and thus the optimal is always . In this phase we also need to increase when increasing (it is typical to use larger learning rate when the momentum coefficient is bigger). The second phase begins when we reach the optimal value of (rate is minimal) and, after that, the amount of momentum becomes too big and we need to decrease and . However, somewhat surprisingly, there is a third phase, when becomes big enough we need to start increasing and again. Thus we can see that it’s not just the product of that governs the behavior of QHM, but a more complicated function.
Finally, based on our analytic and numerical investigations, we conjecture that the optimal convergence rate is a monotonically decreasing function of (if and are chosen optimally for each ). While we can’t prove this statementIn fact, we hypothesise that might not have analytical formula, since it is possible to show that the optimization problem over and is equivalent to the system of highly non-linear equations., we verify this conjecture numerically in Appendix D. The code of all of our experiments is available at https://github.com/Kipok/understanding-momentum.
Stationary analysis
In this section, we study the stationary behavior of QHM with constant parameters , and . Again we only consider quadratic functions for the same reasons as outlined in the beginning of Section 4. In other words, we focus on the linear dynamics of (10) driven by the noise as (where the deterministic part depending on dies out). Under the assumptions of Section 4 we have the following result on the covariance matrix defined as \Sigma_{x}\triangleq\lim_{k\to\infty}\mathbf{E}\bigl{[}x^{k}(x^{k})^{T}\bigr{]}.
Suppose , where is symmetric positive definite matrix. The stochastic gradients satisfy , where is a random vector independent of with zero mean and covariance matrix . Also, suppose the parameters satisfy (13). Then the QHM algorithm (6), equivalently (10) in this case, converges to a stationary distribution satisfying
When , this result matches the known formula for the stationary distribution of unnormalized SHB with reparametrization of . Note that Theorem 4 shows that for the normalized version of the algorithm, the stationary distribution’s covariance does not depend on (or ) to the first order in . In order to explore such dependence, we need to expand the dependence on to the second order. In that case, we are not able to obtain a matrix equation, but can get the following relation for .
Under the conditions of Theorem 4, we have
We note that is twice the mean value of when the dynamics have reached stationarity, so the right-hand side of (15) is approximately the “achievable loss” given the values of and . It is interesting to consider several special cases:
(SGD): .
(SHB): .
(NAG): .
From the expressions for SHB and NAG, it might be beneficial to move to during training in order to make the achievable loss smaller. While moving to is somewhat counter-intuitive, we proved in Section 3 that QHM still converges asymptotically in this regime, assuming also goes to and converges to “slower” than converges to . However, since we only consider Taylor expansion in , there is no guarantee that the approximation remains accurate when and converge to (see Appendix G for evaluation of this approximation error). In order to precisely investigate the dependence on and , it is necessary to further extend our results by considering Taylor expansion with respect to them as well, especially in terms of . We leave this for future work.
Figure 2 shows a visualization of the QHM stationary distribution on a 2-dimensional quadratic problem. We can see that our prediction about the dependence on and holds in this case. However, the dependence on is more complicated: the top and bottom rows of Figure 2 show opposite behavior. Comparing this experiment with our analysis of the convergence rate (Figure 1) we can see another confirmation that for big values of , increasing can, in a sense, decrease the “amount of momentum” in the system. Next, we evaluate the average final loss for a large grid of parameters and on three problems: a 2-dimensional quadratic function (where all of our assumptions are satisfied), logistic regression on the MNIST dataset (where the quadratic assumption is approximately satisfied, but gradient noise comes from mini-batches) and ResNet-18 on CIFAR-10 (where all of our assumptions are likely violated). Figure 3 shows the results of this experiment. We can indeed see that and make the final loss smaller in all cases. The dependence on is less clear, but we can see that for large values of it is approximately quadratic, with a minimum at some . Thus from this point of view helps when is big enough, which might be one of the reasons for the empirical success of the QHM algorithm. Notice that the empirical dependence on is qualitatively the same as predicted by formula (15), but with optimal value shifted closer to . See Appendix F for details.
Some practical implications and guidelines
In this section, we present some practical implications and guidelines for setting learning rate and momentum parameters in practical machine learning applications. In particular, we consider the question of how to set the optimal parameters in each stage of the popular constant-and-drop scheme for deep learning. We argue that in order to answer this question, it is necessary to consider both convergence rate and stationary distribution perspectives. There is typically a trade-off between obtaining a fast rate and a small stationary distribution. You can see an illustration of this trade-off in Figure 4 (a). Interestingly, by combining stationary analysis of Section 5 and results for the convergence rate (3), we can find certain regimes of parameters , and where the final loss and the convergence speed do not compete with each other.
One of the most important of these regimes happens in the case of the SHB algorithm (). In that case, we can see that when , the convergence rate equals and does not depend on . Thus, as long as this inequality is satisfied, we can set as small as possible and it would not harm the convergence rate, but will decrease the size of stationary distribution. To get the best possible convergence rate, we, in fact, have to set and in such a way that this inequality will turn into equality and thus there will be only a single value of that could be used. However, as long as is not exactly at the optimal value, there is going to be some freedom in choosing and it should be used to decrease the size of stationary distribution. From this point of view, the optimal value of \alpha=\bigl{(}1-\sqrt{\beta}\bigr{)}\big{/}\Bigl{(}\mu\bigl{(}1+\sqrt{\beta}\bigr{)}\Bigr{)}, which will be smaller then the largest possible for convergence as long as and is set close to (see proof of Theorem 3 for more details). This guideline contradicts some typical advice to set as big as possible while algorithm still convergesFor example says “So set as close to 1 as you can, and then find the highest which still converges. Being at the knife’s edge of divergence, like in gradient descent, is a good place to be.”. The refined guideline for the constant-and-drop scheme would be to set as small as possible until the convergence noticeably slows down. You can see an illustration of this behavior on a simple quadratic problem (Figure 4 (b)), as well as for ResNet-18 on CIFAR-10 (Figure 4 (c)). Such regimes of no trade-off can be identified for and as well.
Conclusion
Using the general formulation of QHM, we have derived a unified set of new analytic results that give us better understanding of the role of momentum in stochastic gradient methods. Our results cover several different aspects: asymptotic convergence, stability region and local convergence rate, and characterizations of stationary distribution. We show that it is important to consider these different aspects together to understand the key trade-offs in tuning the learning rate and momentum parameters for better performance in practice. On the other hand, we note that the obtained guidelines are mainly for stochastic optimization, meaning the minimization of the training loss. There is evidence that different heuristics and guidelines may be necessary for achieving better generalization performance in machine learning, but this topic is beyond the scope of our current paper.
References
Appendix
Appendix A From NAG to QHM
In this appendix we will mention exact steps needed to come from the original NAG formulation to the formulation assumed by the QHM algorithm. We refer the reader to the ( Appendix A.1) for the derivation of NAG as the following momentum method Note that we change the notation to be consistent with the notation of QHM.
Next, we will move the learning rate out of the momentum into the iterates update:
When and are constant, the two methods produce the same sequence of iterates if is initialized at . To make the notation more similar to the QHM algorithm, let’s move all indices (except for ) up by 1:
This again does not change the algorithm. Now, let’s normalize the momentum update by :
This version is equivalent to the unnormalized by re-scaling for constant parametersIn fact, for non-constant the algorithms are no longer equivalent.. Finally, following we need to make a change of variables and additionally assume that is constant:
Renaming back to and replacing with stochastic gradient if necessary we obtain the exact formula used in QHM update.
Overall, assuming and is constant, the QHM version of NAG is indeed equivalent (up to a change of variable) to the original NAG with re-scaling of . However, if is changing from iteration to iteration, the two algorithms are no longer equivalent.
Appendix B Asymptotic Convergence Proofs
In this section we prove Theorems 1 and 2. For simplicity, we assume throughout that , and are nonrandom.
Here we generalize the meta-analysis of Ruszczyński and Syski to include .
We consider the following variant of QHM:
where is in $i_{k}\rho>0$, we let
Conditions repeated for convenience.
Let satisfy Assumption A. Additionally, assume the sequences , and satisfy the following conditions:
Then the dynamics (6) satisfy (7) and (8).
which we will use often in the convergence analysis.
Suppose Assumption A, (20), and (23) hold, and for every ,
where and are sequences of scalar random variables satisfying and
For the first part, suppose for a contradiction that . So there exists and such that for all , . By (26), there exists such that for all . Then from (25) we obtain:
Summing over from to and using Assumption A.2, we get that:
The right-hand-side is finite by (27). But since for all and and , we have:
This implies , which contradicts (20). So we must have (7), i.e. .
For every , define the index . Since is infinite, as . Then for sufficiently large , i.e., when , (25) becomes
As , because of (27) and , we get , so
Since the reverse inequality is trivial, we obtain (8).
Because for all , , are in $||\nabla F(x^{k(l)})||\to 0l\to\infty$. So we obtain (30) again. This concludes the proof of the lemma. ∎
All that remains now is to use the smoothness inequality (24) to identify the sequences and for the dynamics of the modified algorithm (16)-(18), and prove (26) and (27).
From the update formula (18) and using , we obtain
By the smoothness assumption A.1, we have
First we show . From the update formula, we have
Then because , , and , we have
Because , .
Now we show that is summable almost surely. To begin, we need to show that is summable, for which we need the following lemma:
There is a random variable , constant in , such that for all almost surely.
where in the last inequality we used . Then
By assumption A.2 and A.3, the first and second conditional moments and are both bounded uniformly in . Then because and are in $\rhok$, we can put
which is what we wanted. Note that could be a random variable (depending on ), but this bound holds almost surely. ∎
almost surely.
We will use the following useful proposition (known as Levy’s sharpening of Borel-Cantelli Lemma, see e.g. Meyer [20, Chapter 1, Theorem 21]):
Let be a sequence of positive, integrable random variables, and let , where . Then defining the partial sums , ,
So to prove , we only need to prove . To see this, observe that
so because a.s.,
where we used (21). Applying the proposition finishes the lemma. ∎
The last term remaining in is . We show that
is a convergent martingale. First, note that , so , and is a martingale. Now we show that is bounded, which will imply a.s. convergence by Doob’s forward convergence theorem [38, Section 11.5]. Indeed, [38, Section 12.1].
Because depends only on , we can upper bound this expectation with some constant by using Assumption A.3. Then we have that , so . Therefore,
where the last inquality used Assumption A.2 and the fact that almost surely. Moreover,
where the first equality is by the law of total expectation and the inequality comes from Assumption A.3. Because , we finally have
so is a convergent martingale. In particular,
We have shown (26) and (27), which concludes the proof. ∎
Now we prove Theorem 2, where under a stronger noise assumption we show that is admissible as long as it goes to 1 slow enough.
Assume the sequences , , and satisfy the following:
then sequence generated by the algorithm (6) satisfies
where in the second inequality we used for any two vectors and , and in the last inequality we used . Taking conditional expectation on both sides of the above inequality and using , we get
Next we analyze the sequence . From the update formula in (6), we have
where in the first inequality we used , and in the second inequality we used for any . By the smoothness assumption, we have
which, combining with the previous inequality, leads to
Choosing , we obtain
Taking expectation conditioned on , we have
To show that is a convergent martingale, we prove the following lemma, similar to Ermoliev .
Assume we are given a sequence such that , where and almost surely for some constant , the random variables are -measurable, and they satisfy almost surely. Then the sequence converges almost surely.
We show that is a convergent supermartingale. By Doob decomposition, is a supermartingale if and only if the sequence
satisfies for all [38, section 12.11]. Here
and we assumed this is non-positive. So almost surely. The upper bound on and the convergence of implies that the supermartingale is in , so the sequence converges almost surely by Doob’s forward convergence theorem [38, chapter 11]. By the convergence of , this in turn implies that the sequence converges almost surely.∎
We can apply the above lemma to show that \bigl{\|}d^{k-1}-\nabla F(x^{k})\bigr{\|}^{2} is a convergent semimartingale. Because the noise is uniformly bounded almost surely and , is uniformly bounded in . The uniform bound on the noise also implies that is uniformly bounded in . In the notation of the lemma, we have
Note that . To show convergence of , note that the uniform bounds imply
for suitably large . Then convergence follows from the conditions on the sequences , , and . This proves that converges almost surely.
Summing up the two inequalities (41) and (43) gives
If , then
Since , there exists such that for all . Taking full expectation on both sides of the above inequality and summing up for all , we obtain
The right-hand side is bounded by assumption (the index is finite), so we have
So because , there must be a subsequence with . This proves (7). ∎
Appendix C Local Convergence Rate Proofs
In this section we give a proof to Theorem 3 and it’s generalized version, which we present below.
We will denote with , the -th eigenvalue and spectral radius of the matrix respectively.
Let’s recall the equations of deterministic QHM algorithm (6) with constant parameters :
In this section we will assume that is initialized with zero vector.
Taking the gradient of the quadratic function and substituting it into (44) yields
where denotes the identity matrix and . It is known that the sequence of converges to zero if and only if the spectral radius . Moreover, Gelfand’s Formula states that , which means that such that
Thus, the behavior of the algorithm is determined by the eigenvalues of . To find them, we will use a standard technique of changing basis. Let be an eigendecomposition of the matrix . Then, multiplying with and appropriate permutation matrix See, e.g. for the exact form of matrix . we get
Thus, to compute eigenvalues of , it is enough to compute the eigenvalues of all matrices .
We use the following Lemma to establish the region when :
Let . Then
Let’s denote with eigenvalues of . Let’s also define . Then, satisfies the following equation:
Let’s denote by . The final convergence set
Let’s look at the case when . Then
Let’s look at
Let’s solve the second inequality: . Since we are only interested in the case when we get
The last inequality is true since . Thus
Since and .
Therefore we have that always holds and thus always holds.
Now we compute . The first term is
Now let’s move to the second case and compute . If we have that and then
Finally, let’s find a simplified form of and .
Let’s denote the discriminant of that equation (divided by 4) with respect to as :
since (left side is less than 2 and right side is greater than 2) and also
Now, let’s establish a precise equation for the spectral radius .
Let . Let’s define and
In addition, is non-increasing as a function of for and is non-decreasing for .
Following derivations from the proof of Lemmas 5 we get
Considering 4 cases for different signs of and , the first statement of the Lemma immediately follows. To prove the second statement, let’s define the following 3 points:
From equation (45) and definition of we get that
and it is easy to check that if and if . Moreover, both and are non-increasing function of , and is non-increasing when and non-decreasing when .
Let’s first prove the second statement of the Lemma for the case when . In that case, when the function is non-increasing, since both and are non-increasing. When , the function is non-increasing, because is non-increasing. Finally, when , the function is non-decreasing, because both and are non-decreasing.
When , the same reasoning applies, but we additionally need to prove that the function is non-decreasing when . In that case . Taking the derivative of with respect to we get
Let’s show that this derivative is always non-negative when
which is always true since left side is less than zero.
The last thing that we need in order to prove Theorem 3 is given by the following Lemma:
Let and . Then
In addition, the minimal spectral radius with respect to depends on and only through , i.e.
To prove this first statement of the Lemma, let’s notice that by definition
But Lemma 6 states that is first non-increasing and then non-decreasing with respect to the eigenvalues of . Thus, the maximum can only be achieved on the boundaries, which are precisely equal to or smaller than and .
Let’s prove the second statement of the Lemma by contradiction. Let’s assume that the optimal rate does in fact depend on and not only through . That means that , such that , but . Let’s consider the optimal rates if the function is divided by for the first case and by for the second. In that case, . But on the other hand, they can’t be equal, since we have that and , because multiplying learning rate by for the first case and by for the second yields exactly the same sequence of iterates and thus the optimal rate can’t change. ∎
Now we are ready to prove Theorem 3. We restate it below for convenience
Theorem 3. Let’s denote . For any function that satisfies for all and any , the deterministic QHM algorithm satisfies
where , and , which can be characterized as
To ensure , the parameters must satisfy the following constraints:
In addition, the optimal rate depends only on : is a function of only .
Lemma 6 and Lemma 7 immediately give the first statement of the Theorem. One can also get the bound on the function values by using definition of the Lipschitz continuous gradient:
Finally, to get the stability region, we apply Lemma 5 and notice that . ∎
To generalize this result, let’s define the following class of functions
Then, Theorem 3 can be generalized to any function in the following way:
Let’s denote . For any function that is additionally twice differentiable at the point , deterministic QHM algorithm locally converges to with linear rate, from any initialization sufficiently close to .
Precisely, for any and , such that the following holds
if and satisfy the following constraints:
In addition, the optimal rate depends on and only through , i.e. .
To prove this result we apply Lyapunov’s method (see e.g. Chapter 2, Theorem 1 of ) to the QHM equations. The proof is then identical to the proof of Theorem 3, with matrix replaced by . ∎
Appendix D Numerical Evaluation of the Convergence Rate
In this section we provide details on the numerical evaluation of the local convergence rate of QHM. We need to numerically estimate the following function
From Lemma 6 (Appendix C) we know that is a non-increasing function of until some point and non-decreasing after. Also note that in fact dependence of on is the same as on , since they only appear in formulas as a product . Thus, it is easy to see that for optimal we will have
because otherwise could be changed to decrease the value of the maximum.
Thus, to find optimal for fixed , we can solve equation (46) for using binary search (with precision set to ). To find optimal or we just use grid search (with grid size equal to ) on for and $\nu$.
To numerically verify that the dependence of the optimal rate on is monotonic, we run this procedure for values of which are sampled (on a uniform grids) in the following way: 100 values on , 100 values on $[10^{3},10^{4}][10^{4},10^{5}][10^{5},10^{6}][10^{6},10^{7}]$. All experiments were run in parallel using GNU Parallel Command-Line Tool .
Since rate estimation is non-exact, it happens sometimes that very close points show non-monotonic rate dependence, but it is always the case that the rate is approximately non-increasing in . Precisely, we verify that the following condition holds for all estimated values of :
where is estimated rate and is i-th sample of . Figure 5 shows the dependence of on for different values of .
Appendix E Stationary Distribution Proofs
In this section we present proofs of Theorems 4, 5. We will restate the combined statement of both theorems below for convenience.
Theorems 4, 5. Suppose , where is symmetric positive definite matrix. The stochastic gradients satisfy , where is a random vector independent of with zero mean and covariance matrix . Also suppose the parameteres satisfy (13), then QHM algorithm (6), equivalently (10) in this case, converges to stationary distribution satisfying
Consequently, when (SGD), satisfies
When (SHB), satisfies
When (NAG), satisfies
We consider the behavior of QHM with constant , , and , described in (47).
Under assumptions of Theorems 4, 5 we have that stochastic gradient is generated as
where the noise is independent of , has zero mean and constant covariance matrix. More explicitly, for all ,
where is a constant covariance matrix. Substituting the expression of in (48) into (47) yields
where denotes the identity matrix.
Let be the largest eigenvalue of . From Theorem 3 we know that under the conditions (13) the dynamical system (49) is stable, i.e., the spectral radius of the matrix
To simplify notation, we rewrite Equation (49) as
As , the effect of the initial point dies out and the covariance matrix of the state becomes constant. Let
Then using the linear dynamics (50) and the assumption that is i.i.d. and has zero mean, we obtain
Following the partition of , we partition the above matrix equation into 2 by 2 blocks and obtain
Or, letting be the column block matrix with entries , and defining symbolically to be the block matrix with coefficients:
(each block is an identity matrix), we have
Next we use combinations of the above equations to obtain simplified relations: First, we can do
We take the following asymptotic expansion of :
Here, we explicitly write the zero’th order of to be zero. This can be easily proved from (54)-(54).
The first order term of (54) (and (54)) gives
Let’s now extend this result to the second-order in . The first order term of (54) gives
The second order term of (54) (and (54)) gives
Plugging (64) into (65) and (66), we obtain
Let’s get an expression for . From (70) we get (by taking trace and dividing by 2)
From (62) we get (by multiplying by A, taking trace and dividing by 2)
From (63) we get (by multiplying by A and taking trace)
The special cases of SGD, SHB and NAG can be straightforwardly obtained by substituting corresponding value of into the general formula.
Appendix F Evaluation of Stationary Distribution Size
In this section we describe experimental details for evaluation of stationary distribution size on different machine learning problems (Section 5). The first problem we consider is a simple 2-dimensional quadratic function, where we add additive zero-mean Gaussian noise independent of the point , so that all assumptions of Theorem 5 are fully satisfied. For this function we have
We run the QHM algorithm for 1000 iterations starting at the optimal value and plot final loss as average across all 1000 iterations. We evaluate QHM for the following sweeps of hyperparameters: 30 values of on a uniform grid on , 30 values of on a uniform grid on , 30 values of on a uniform grid on $\alpha\to 1,\beta\to 1\nu\beta\nu\nu<1\nu\nu$ given by
From this equation the optimal and as . In the experiments we see the same qualitative behavior, but the optimal value of is much closer to than predicted by equation (76).
The second problem we consider is logistic regression on MNIST dataset. We run QHM for epochs with batch size of and weight decay (applied both to weights and biases) of (thus, ). The final loss is averaged across last batches. We evaluate algorithm for 50 values of (log-uniform grid), 20 values of (uniform grid), 20 values of (uniform grid).
The final problem we consider is ResNet-18 on CIFAR-10 dataset. We run QHM with batch size of and weight decay of (applied only to weights). We run algorithm for epochs with constant parameters and average final loss across last 100 batches. We evaluate , . In this experiment we always set .
Appendix G Approximation error of Theorem 5
In this section we run a set of experiments to check for which values of parameters the equation (15) is not accurate. In fact, we can immediately see that the approximation error grows unboundedly as if , because the right-hand-side of equation (15) converges to , while the left-hand-side is bounded from below.
Since we are interested in the approximation error from the higher-order terms, we run experiments on a 2-dimensional quadratic problem where all assumptions are satisfied. We follow the same experimental settings as in the appendix F. We test a uniform grid of 20 and 20 values on $\alpha\in\left\{0.05,0.1,0.2\right\}20\%\alpha\nu\beta\alpha\beta\nu$ is far from 0 or 1.