Improved Analysis of Clipping Algorithms for Non-convex Optimization
Bohang Zhang, Jikai Jin, Cong Fang, Liwei Wang
Introduction
The problem of the central interest in this paper is to minimize a general non-convex function presented below:
where can be potentially stochastic, i.e.
For non-convex optimization problems in form of (1), since obtaining the global minimum is NP-hard in general, this paper takes the concern on a reasonable relaxed criteria: finding an -approximate first-order stationary point such that .
Gradient clipping (Pascanu et al., 2012) is a simple and commonly used trick in algorithms that adaptively choose step sizes to make optimization stable. For the task of training deep neural networks (especially for language processing tasks), it is often a standard practice and is believed to be efficient in relieving the exploding gradient problem from empirical studies (Pascanu et al., 2013). More recently, Zhang et al. (2020a) proposed an inspiring theoretical justification on the clipping technique via introducing the -smoothness assumption. The concept of -smoothness is defined as follows.
This assumption can be further relaxed such that twice differentiability is not required (see Remark 2.3). Therefore the standard -smoothness assumption (i.e. the gradient of is -Lipschitz continuous) is stronger than the -smoothness one in the sense that the latter allows to have a linear growth with respect to .
Secondly, Zhang et al. (2020a) performed experiments to show that -smoothness is a preciser characterization of the landscapes for objective functions in many real-world tasks, especially for training a deep neural network model. It was observed that the local Lipschitz constant near the stationary point is thousands of times smaller than the global one in the LSTM training (see Figure 1(c) taken from Zhang et al. (2020a)).
Seeing this, it is desirable to give a comprehensive and deep analysis on iteration complexities for -smooth objectives. How fast can we achieve to find a first-order stationary point for -smooth functions? What are simple algorithms that provably achieve such a convergence rate? In this paper, we give affirmative answers to the above questions. In fact, due to the violent fluctuation of gradients, the efficiency of (stochastic) Gradient Descent with a constant step size degenerates, whereas we will show in this paper that by simply combining the clipping technique, a wide range of algorithms can achieve much better convergence rate for -smooth functions. In fact, when is small, the complexities (i.e. the number of gradient queries required) are for the deterministic setting and for the stochastic setting (see Section 3 for details), which are both independent of . Compared with Zhang et al. (2020a) who only studied clipped (stochastic) gradient descent, we consider proposing a unified framework which contains a variety of clipping-based algorithms and achieve much sharper complexities. The main technique for our proof is by introducing a novel Lyapunov function which does not appear in existing studies. We believe that our work provides better understandings for the clipping technique in training deep neural networks. We summarize the contributions of the paper in the following.
We provide a general framework to analyze the clipping technique for optimizing -smooth functions. It contains a variety of clipping algorithms, including gradient clipping and momentum clipping as special cases.
We provide convergence analysis for the general framework we propose. We show that our bounds are tight by comparing with existing lower bounds. For gradient clipping, a special case in our framework, our result is much sharper than that proposed by Zhang et al. (2020a).
We conduct extensive experiments on a variety of different tasks, and observe that the clipping algorithms consistently perform better than vanilla ones.
Clipping/normalizing Techniques. Clipping/normalizing has long been a popular technique in optimizing large-scale non-convex optimization problems (e.g. (Mikolov, 2012; Pascanu et al., 2013; Goodfellow et al., 2016; You et al., 2017)). There are several views which provide understandings for the clipping and normalizing techniques. Some show that clipping can reduce the stochastic noise. For example, Zhang et al. (2019); Gorbunov et al. (2020) showed that clipping is crucial for convergence when the stochastic gradient noise is heavy-tailed. Menon et al. (2020) pointed out that clipping can mitigate the effect of label noise. Cutkosky and Mehta (2020) found that adding momentum in normalized SGD provably reduces the stochastic noise. Another line of works try to understand the function of clipping and normalizing for the standard smooth optimization. For example, Levy (2016) showed that normalized GD can provably escape saddle points. Fang et al. (2018) designed a new algorithm based on normalized GD which achieves a faster convergence rate under suitable conditions. Gradient clipping has also been used to design differentially private optimization algorithms (Abadi et al., 2016).
The work of Zhang et al. (2020a) is mostly related to this paper. A detailed comparison between the two works is shown in Subsection 2.2.
Lower Bounds For Non-convex Optimization. A series of recent works establish lower bounds for finding an -stationary point of a general non-convex and -smooth function, either in deterministic setting (Carmon et al., 2019) or in stochastic setting (Drori and Shamir, 2019; Arjevani et al., 2019). In this paper we borrow their counter examples to show the tightness of our obtained complexities for the general -smooth functions.
Assumptions & Comparisons of Results
We first present the assumptions that will be used in our theoretical analysis, which follows from Zhang et al. (2020a).
We assume that is -smooth.
It does not need to be twice differentiable and is strictly weaker than -smoothness.
In the stochastic setting, for the briefness of our analysis, we assume that the noise is unbiased and bounded.
2 Comparisons of Results
In this paper, we mainly take our concern on the stochastic setting, though we also establish the complexities for the deterministic setting. We summarize the comparison in the stochastic setting with existing complexity results in Table 1, in which we only present the dominating complexities with respect to .
Compared with the bound for clipped SGD established in Zhang et al. (2020a), our results improve theirs on the dependencies for all problem-dependent parameters, i.e. , , , and especially by order. For SGD with arbitrarily chosen step sizes (thus include clipped SGD), the example in Drori and Shamir (2019) can be used to show that clipped SGD is optimal (cf. Section 3.3).
General Analysis of Clipping
We aim to present a general framework in which we can provide a unified analysis for commonly used clipping-based algorithms. Since momentum is one of the most popular acceleration technique in optimization community, our framework takes this acceleration procedure into account. We show our framework in Algorithm 1, where we can simply replace by for the deterministic setting.
We notice that our framework is similar to the Quasi-Hyperbolic Momentum(QHM) algorithm proposed by Ma and Yarats (2018), while they did not consider the clipping technique. They pointed out that QHM contains a wide range of popular algorithms (e.g. SGD+momentum, Nesterov Accelerated SGD, AccSGD, etc). As a result, for different choice of hyper-parameters, our framework encompasses the clipping version of all these algorithms. We now discuss several representative examples in our framework.
Gradient Clipping. By choosing in Algorithm 1, we obtain the clipped GD/SGD algorithm which can be written as:
It follows that in gradient clipping, the gradient is clipped to have its norm no more than .
Momentum Clipping. By choosing in Algorithm 1, we perform the update using a clipped version of momentum which can be written as:
The approach has already been used in previous works (Zhang et al., 2019, 2020b), albeit in different settings. To the best of our knowledge, there is no existing analysis of this algorithm even for optimizing standard -smooth functions.
Mixed Clipping. By choosing , we obtain the mixed clipping algorithm. Although this form of clipping is not widely used in practice, we observe from experiments that it typically converges faster than both gradient clipping and momentum clipping. Some explanations of this observation are provided in Appendix E.
Normalized Momentum. By choosing and in Algorithm 1, we recover the normalized SGD+momentum algorithm. This algorithm performs a normalized (rather than clipped) update in each iteration. It has been analyzed in Cutkosky and Mehta (2020) for -smooth functions and a layer-wise variant was used in the LARS algorithm (You et al., 2017). We will provide a detailed discussion of this algorithm in the Appendix C.
In this section we first deal with the deterministic case, in which we can get a strong justification that clipping is a natural choice to optimize -smooth functions. We have the following Theorem.
[Convergence of Algorithm 1, Deterministic Setting] Let the function satisfy Assumptions 2.1 and 2.2. Set in Algorithm 1 for simplicity. Fix be a small constant. For any and , if and where , , then
In Theorem 3.1, the -smoothness is precisely reflected in the restriction of hyper-parameters and . For large , we must use a small clipping hyper-parameter to guarantee convergence. This also coincides with the intuition that in highly non-smooth regions we should take a small step.
Theorem 3.1 states that in the deterministic setting, for any , our framework can find an -approximate stationary point in gradient evaluations if we choose and . When , the dominating term is .
Now we turn to our main result in the stochastic setting. We have the following theorem.
[Convergence of Algorithm 1, Stochastic setting] Let the function satisfy Assumptions 2.1 and 2.2, and the noise satisfies Assumption 2.4 with . Set in Algorithm 1 for simplicity. Fix be a small constant. For any and , if and where constants , then
Here the expectation is taken over all the randomness .
Theorem 3.2 shows that in the stochastic setting, for any , our framework can find an -approximate stationary point in gradient evaluations. When , the term reduces to . In this case no longer affects the choice of steps sizes and , and the complexity in (4) reduces to .
Theorem 3.2 suggests that the clipping threshold should take , which only depends on the noise and is several times larger than the its variance. This matches previous understanding of gradient clipping, in that clipping the stochastic gradient controls the variance while introducing some additional bias, and the clipping threshold should be tuned to trade-off variance with the introduced bias (Zhang et al., 2019).
We emphasize that in both settings, the dominating terms in our upper bounds are independent of the gradient upper bound and the smoothness parameter . In other words, the efficiency of Algorithm 1 is essentially unaffected by these quantities. Recall that and are related to steep cliffs in the landscape where the gradient may be large or fluctuate violently. Therefore, our results suggest that such non-smoothness can be tackled with clipping methods without sacrificing efficiency.
2 Proof Sketch
The analysis of Algorithm 1 is in fact challenging, as it uses both momentum and adaptive step sizes. Also, the general -smoothness assumption makes things more complicated. In this subsection we briefly introduce our proof technique. We hope our proof is also useful to a better understanding of other adaptive algorithms that combine momentum (such as Adam (Kingma and Ba, 2014)).
Proof sketch of Theorem 3.1. Due to the momentum term, each step in Algorithm 1 is not necessarily a descent one, which makes it difficult to prove convergence using traditional techniques. Instead, we construct a novel Lyapunov function as follows:
We prove this lemma by using the -smoothness properties deduced in Appendix A and conducting a comprehensive discussion on three cases in Lemmas B.1-B.3. Furthermore, we show in Lemma B.4 that . By choosing a small enough and carefully dealing with constants, we can conclude that the total amount of decrease of the Lyapunov function is .
For any , if and , then we have
We prove (7) in Lemma B.5-B.7 by the fact that Algorithm 1 performs an unclipped update if . Since the bound (7) is small in term of if and are close to 1, we convert to by proving that in Lemma B.8. The term related to can all be offset by the terms in Lemma 3.3, and the term related to combined with can be shown to ensure a descent amount of , which is as long as .
Finally, by combining the above two cases we obtain the conclusion in Theorem 3.1.
Firstly, since the gradients are not exact, the stochastic gradient we have access to is not guaranteed to be small even for the case of . Fortunately, the choice of parameters in Theorem 3.2 () settles such difficulty.
Finally, by choosing proper and , we can obtain Theorem 3.2.
3 Lower Bounds and Discussions
Theorem 3.1 and 3.2 provide upper bounds for the complexity of Algorithm 1. Now we compare these results with existing lower bounds and discuss the tightness of our results.
Deterministic Setting. Carmon et al. (2019) have shown that there exists an -smooth function such that any (possibly randomized) algorithm requires at least queries to gradient to ensure finding a point such that . Since Assumption 2.2 is weaker than -smoothness (), we have that the lower bound for -smooth functions is . From Theorem 3.1, Algorithm 1 is optimal since it can achieve the lower bound when ignoring numerical constants.
Stochastic Setting. From the example constructed in Drori and Shamir (2019), we have that for any SGD method with arbitrary (possibly adaptive) step sizes and aggregation schemesSee details in Appendix D., the complexity lower bound is exactly for -smooth functions that have -bounded gradient noises. Therefore Theorem 3.2 indicates that clipped SGD matches the lower bound.
One may ask : what is the lower bound for general stochastic gradient-based algorithms? In fact, from the example in Arjevani et al. (2019), we have that any algorithm needs stochastic gradient queries to find an -approximate stationary point for a hard -smooth function whose gradient noise has a -bounded variance. It is our conjecture that the lower bound for optimizing -smooth functions that have -bounded gradient noises is also . We leave the study as a future work.
Experiments
We conduct extensive experiments and find the clipping algorithms indeed consistently outperform their unclipped counterpart. We present experimental results on three deep learning benchmarks: CIFAR-10 classification using ResNet-32, Imagenet classification using ResNet-50 and language modeling on Penn Treebank (PTB) dataset using AWD-LSTM. We put all the experimental details in the Appendix G. Our code is available at https://github.com/zbh2047/clipping-algorithms.
CIFAR-10 classification with ResNet. We train the standard ResNet-32 (He et al., 2016) architecture on CIFAR-10. We use SGD with momentum for the baseline algorithm with a decaying learning rate schedule, which is the standard choice to train the ResNet architecture. We set learning rate , momentum and minibatch size 128, following the common practice. For all the clipping algorithms, we choose the best and based on a course grid search, while keeping other hyper-parameters and training strategy the same as SGD+momentum. We simply set the hyper-parameters and in mixed clipping, as suggested in Ma and Yarats (2018) (for its unclipped counterpart QHM). We run 5 times for each algorithm using different random seeds to make the results more reliable.
Figures 2(a) demonstrates the results. It can be seen that all the algorithms achieve a test accuracy more than 93% on CIFAR-10. Note that all clipping algorithms converge faster than SGD+momentum. Particularly, the mixed clipping (Algorithm 1) outperforms SGD+momentum by a large margin in term of training speed. As a result, one can possibly adopt a more aggressive learning rate decaying schedule to reduce training time considerably.
ImagNet classification with ResNet. We train the standard ResNet-50 (He et al., 2016) architecture on ImageNet. For the baseline algorithm, we choose SGD with learning rate and momentum , following Goyal et al. (2017). We use batch size 256 on 4 GPUs.
Figure 2(b) plot the training loss curve and validation accuracy curve on ImageNet. All the algorithms reach a validation accuracy of about 76%. However, all the clipping algorithms train faster than the baseline SGD. Mixed clipping performs the best among the four algorithms.
Language modeling with LSTM. We train the state-of-the-art AWD-LSTM (Merity et al., 2017) on Penn Treebank (PTB) dataset (Mikolov et al., 2010). We first follow the training strategy in Merity et al. (2017), where they use averaged SGD without momentum with learning rate and clipping parameter . Since our purpose is to compare different algorithms rather than to achieve state-of-the-art results, we only train AWD-LSTM for 250 epochs. We then evaluate other algorithms including standard SGD without clipping, momentum clipping, and mixed clipping. We choose the best and (using validation perplexity criterion) based on a course grid search. Results are shown in Figure 2(c).
Figure 2(c) clearly shows all clipping methods converge much faster than SGD without clipping, and are much better in term of validation perplexity. This is consistent with our theory, in that the vanilla SGD must use a very small learning rate to guarantee convergence (Zhang et al., 2020a), which will be slow and be harmful to generalization on validation set according to previous works (Huang et al., 2017; Kleinberg et al., 2018) . Therefore clipping technique is crucial in LSTM models. We can also find that the training and test curve of mixed clipping is much better than both gradient clipping and momentum clipping. The mixed clipping improves validation perplexity for more than 1 point compared to clipped SGD after 250 epochs.
Other experiments. We also conduct experiments to compare clipping algorithms with Adam, and to directly compare our work with previous results (Zhang et al., 2020a) under the same setting. See Appendix H for details. Finally, we construct a provably -smooth optimization problem using MNIST dataset. We then run experiments in both deterministic setting and stochastic setting. The results are shown in Appendix I.
Conclusion
This paper proposes a detailed study for clipping methods under a general framework. In particular, we explore the possibility of combining clipping with other popular techniques, e.g. momentum acceleration, in deep learning. We provide a general and tight analysis for the framework, showing the efficiency of clipping methods in optimizing a class of non-convex and non-smooth (in traditional sense) functions. Experiments confirm that these methods have superior performance. We hope that our work affords more understandings on the clipping technique and smooth functions.
There are still many open questions that have not yet been answered. Firstly, as discussed in Section 3.3, we are not aware of any lower bounds for general first-order methods that can be applied our setting. Thus, it is interesting to explore such lower bound, or to relax Assumption 2.4 to the more general bounded variance assumption. Secondly, although we have shown the superiority of clipping-based methods, we do not provide theoretical explanation why some clipping schemes are better than others as observed in experiments. We believe that this can only be done by exploring new and better smoothness assumptions. Thirdly, the empirical superiority of other adaptive methods ( e.g. AdaGrad (Duchi et al., 2011), Adam (Kingma and Ba, 2014) ) have not been justified from a theoretical point of view. We hope that our analysis is helpful for the analysis of these methods. Finally, we are looking forward to seeing better optimization algorithms with better convergence properties in future work.
Acknowledgement
This work was supported by National Key R&D Program of China (2018YFB1402600), Key-Area Research and Development Program of Guangdong Province (No. 2019B121204008)] and Beijing Academy of Artificial Intelligence.
References
In this section, we prove some important properties of -smooth functions. These properties will be frequently used in subsequent sections.
We first present a basic lemma without proof.
(Gronwall’s inequality) [Gronwall, 1919] Let denote an interval of the real line with . Let be continuous real-valued functions defined on . Assume is non-decreasing, is non-negative, and the negative part of is integrable on every closed and bounded subinterval of .If
The following result, Lemma A.2 , is a generalization of Lemma 9 in Zhang et al. [2020a].
Let be -smooth, and be a constant. Given , for any such that , we have .
Let be defined as , then we have
We then bound the norm of :
The first inequality uses the triangular inequality of 2-norm; The second inequality uses the property of spectral norm; The third inequality uses the definition of -smoothness. By applying the Gronwall’s inequality we get
The Lemma follows by setting .
Now we are able to prove a descent inequality, which is similar to the descent inequality for -smooth functions. In fact, if a function is -smooth, it is well-known that for any , we have
(Descent Inequality) Let be -smooth, and be a constant. For any and , as long as , we have
where .
Let be defined as . The following derivation uses Taylor’s theorem (in (16)), then uses triangular inequality, Cauchy-Schwarz inequality and the property of spectral norm (in (17)):
Then we use -smoothness and (14) to bound :
Substituting (20) into (18) concludes the proof.
Let be -smooth, and be a constant. For any and , as long as , we have
where .
Using (20) leads to the results.
Finally we prove a result which provides a way to upper-bound the gradient norm. A similar result for -smooth functions is the following: if is -smooth, then for any , we have
(Bounding the gradient norm) Let be an -smooth function, and be the optimal value. Then for any , we have
Define the constant and . It is easy to see that such exists. Let and . Then . By the descent inequality we have
If , then
If , then
The original definition of -smoothness requires the function to be twice-differentiable. Under this definition, -smoothness is actually not weaker than -smoothness, which only requires the function to be continuous differentiable. In this section we prove that the alternative definition provided in Remark 2.3 is sufficient for all the results in this paper.
We check that Lemma A.2 and A.3 still holds under the new assumption (with replaced by , up to numerical constants) We immediately obtain from (27) above that
which is of the same form as Lemma A.2. Next, we have
Since all the other results are established on the basis of these two lemmas, we can see that the conclusion still holds under (27).
Appendix B Proof of Theorems
We first prove the deterministic case (Theorem 3.1), then generalize the result to stochastic case (Theorem 3.2). In deterministic case we can use fewer notations, which will make the proof more readable and elegant. The proof in stochastic case will rely on all the techniques used in the deterministic case, as well as some new methods.
To simplify the notation, we write the update formula as
when analyzing a single iteration. The error between and is denoted as . Suppose for some constant , and we denote and , just the same as in the descent inequality (Lemma A.3).
Let be a real constant. For any vector and ,
To prove the theorem, we will construct an Lyapunov function and explore the decreasing property of this function. We define the Lyapunov function to be
and analyze . We first bound .
For any momentum vectors and , let , then
and . In this case
and . In this case
Thus in all cases can be upper bounded by .
Suppose . Then
We first write as
Based on Lemma B.2, we only need to bound . We will use the -smoothness assumption.
Where the first inequality uses the descent inequality (Lemma A.3), the second equation follows from the update rule, and the last inequality is obtained by the following two inequalities:
First we prove that (37) holds by considering the following three cases:
. In this case the algorithm performs a normalized update. Then (37) follows by directly using Lemma B.1 with :
and . In this case the algorithm performs an unnormalized update. We now prove .
and . This is the most complicated case. Due to the condition in Lemma B.3, . In this case, the algorithm also performs an unnormalized update. We first bound using the same calculation as in the second case:
where we use the fact that . We then bound as follows:
Combining the two inequalities, we obtain
Thus in all cases (37) holds. We now turn to (38) which is proven in a similar fashion. Specifically, consider the following three cases:
. In this case
and . In this case bound the same as in the third case of (37):
and . In this case . Using the same calculation above,
Thus (38) holds. Merging all the cases above, we finally obtain
Now we consider all the steps which satisfy the condition in Lemma B.3, denoted as . Similarly, use . Let , then .
Let set and be defined above. Then
We now focus on the summation of the term . Define . When , (see Lemma A.4). Thus we can expand using the recursive relation as follows
where the last inequality uses the fact that for all .
After substituting the above results into (LABEL:lemma_mom_clip_deterministic_large_sum_1) we obtain
Now we turn to the case in which .
Suppose . Then
where , and .
In the case of , we have , thus
We then bound and . Note that implies that the algorithm performs an update without normalization. Define , then again by descent inequality,
by definition of the Lyapunov function, we have
where .
Let and be defined in Lemma B.5. If , then the matrix
is symmetric and positive semi-definite, where is the identity matrix.
In fact we only need to consider the case when , because the eigenvalues of can only be those that appears in (). Denote two eigenvalues be when . A direct calculation shows that
If , then and , which is equivalent to the semi-definiteness of .
Suppose . If , Then
Let be defined in Lemma B.6. The result of Lemma B.5 can be written in a matrix form:
Using the fact that is positive semi-definite, we obtain the desired result.
Note that the amount of descent in Corollary B.7 is small in terms of if and are close to 1. We now try to convert the term into , which is stated in the following lemma.
Suppose and for some constant and . Let for simplicity. Let set and be defined above. Then
where the last inequality follows by Corollary A.4. Applying (50) recursively, we obtain
Using and some straightforward calculation, we obtain
Now we are ready to prove the main theorem.
Let be the optimal value, and . Assume for simplicity. If and , where constants , , and , then
By calculating , we can use Corollary B.7. Taking summation of the inequality (47) over steps , we obtain
Combining (56) and (40) in Corollary B.4 we obtain
we have and . Using Lemma B.14 we have
Therefore by standard inequality and (57) we obtain
We now simplify . Let . By (58) we have
Since , we have . Therefore by (LABEL:longineq) and Lemma A.5 we have
B.2 Proof of Theorem 3.2
We now prove the stochastic case. As before, to simplify the notation we write the update formula as
In stochastic case, we define the Lyapunov function to be
Suppose for some constant , and we denote and , just the same as in the descent inequality (Lemma A.3). When (in Theorem 3.2), we can take and .
Furthermore, using the noise assumption, for different time steps , we have
Based on Lemma B.2, we only need to bound . We use the -smooth condition:
Now we bound . The calculation is similar to the deterministic setting. We first bound . Consider the following three cases, all of which are analogous to the proof of Lemma B.3:
. The algorithm performs a normalized update. We have
and . The algorithm performs an unnormalized update. We have
where (77) uses the following two inequalities which can be obtained by Lemma B.10:
We next bound . Consider the following cases, all of which are analogous to the proof of Lemma B.3:
. In this case we can use Lemma B.1 with :
. In this case
We now bound . If , then . If and , then using the same calculation as in the deterministic case,
Let set and be defined above. Then
where . , and .
Because and , the algorithm performs an unnormalized update. The proof is similar to the one in Lemma B.5 except for bounding the term .
The proof of Lemma B.14 is similar to the proof of Lemma B.8. We first write (52) again as follows:
We now merge the two cases corresponding to Corollary B.12 and Lemma B.13. The proof of the following theorem involves many techniques which are different from the deterministic case and is far more challenging.
Let be the optimal value, and . Assume for simplicity. Fix be a small constant.If and where constants , then
Based on the previous results, we take summation over and obtain
We now simplify (93) by taking expectation. We first have
Applying the above equation recursively, we obtain
where the first inequality uses the proof of Corollary A.4 and Lemma B.10, and the last inequality uses . By taking summation of the above inequality we obtain
Let , and fix the ratio . Then for small enough and large enough noise ,
Applying the above estimates and rearranging (102), we have
Due to Lemma B.14 (), we clearly have
Plugging (LABEL:sigma_mt) into (LABEL:rearrange), we obtain
It clearly follows that . Therefore
We finally show . Using Lemma A.5,
For the term , if , we can similarly use Lemma A.5 to obtain . If , using and Lemma A.5 leads to the result.
Appendix C Discussion of the normalized momentum algorithm
In this section we analyze in detail the theoretical aspects of the normalized momentum algorithm, as well as some practical issues. Recall that this algorithm can be seen as a special case of our clipping framework. For convenience we re-write it in Algorithm 2.
We remark that SNM is different from the clipping methods in traditional sense, in that it makes a normalized update each iteration. This algorithm has been analyzed in Cutkosky and Mehta for -smooth functions. In that setting they were able to prove that SNM achieves a complexity of .
For -smooth functions, we show that: (a). With carefully chosen momentum parameter and step size , SNM can achieve a complexity of , which is the same as the complexity we obtain in Theorem 3.2. (b). There are some practical issues that make SNM less favorable than traditional clipping methods (such as the other three special cases of our framework discussed in Section 3 of the main paper).
The following results provides convergence guarantee for Algorithm 2.
Consider the algorithm that starts at and make updates . Define be the estimation error. Assume for some and let constants . Then
Since , by Lemma A.3 we have
where in the second inequality we use Lemma B.1.
holds in iterations.
Define the estimation errors . Denote , then for such that , we can upper bound using Corollary A.4:
We can use to get a recursive relationship:
Denote , then
Using triangle inequality and plugging in the estimate (113) , we have
Taking a telescope summation of 115 and using Assumption 2.4 we obtain
If we choose and , then
We have shown the theoretical superiority of Algorithm 2. Specifically, it enjoys the same complexity as Theorem 3.2. However we notice some potential drawbacks of Algorithm 2:
Firstly, the step size of Algorithm 2 is at the order of , while the step size we chose in Theorem 3.2 is . Previous works have noticed that a smaller step size makes it easier to be trapped in a sharp local minima , which may result in worse generalization [Kleinberg et al., 2018].
Secondly, although the complexity of Algorithm 2 is the same as Theorem 3.2 for small , it requires a more restrictive upper bound of to ensure the term dominates. For instance with a poor initialization, may very large. This suggests that in practice, where we do not get into a very small neighbourhood of stationary point, the performance of Algorithm 2 may be worse.
Appendix D Details of Lower Bounds in Section 3.3
In this section we discuss the lower bound for SGD in Drori and Shamir in detail. The following result is taken from this paper:
Now we discuss why this shows the optimality of clipped SGD under Assumptions 2.1, 2.2 and 2.4.
Firstly, Theorem D.1 assumes an upper bound on rather than the one assumed in Assumption 2.1 (). However, in fact we only need to assume that to prove Theorem 3.2 for clipped SGD. The reason is as follows. In fact, since for clipped SGD , the momentum term in the Lyapunov function disappears, as well as the term in (102). So we no longer need to use Lemma A.5 to bound the term . The rest of the proof only needs (which is used in the telescope sum in (102)).
Secondly, although Theorem D.1 only assume that the variance of stochastic gradient is bounded, in their construction the noise is actually defined as
Therefore the norm of the noise is bounded by , and the example used to prove Theorem D.1 still works under Assumption 2.4.
Now suppose we need an output such that , then it follows from Theorem D.1 that . Therefore we have shown the optimality of clipped SGD in this class of algorithms, as stated in Section 3.3.
Appendix E Justifications on the Mixed Clipping
In the above optimization problem, the general update formula can be written as:
We now analyze three cases based on the proposition:
We further plot the value of (119) with respect to and in Figure 3 to visualize the above finding. It can be clearly seen that the using both gradient and momentum with a proper interpolation factor outperforms both SGD and SGD with momentum by a large margin (Figure 3(a)). Furthermore, we can drive to further improve convergence (Figure 3(b)), while in SGD with momentum we can not.
𝑥𝜉2f(x,\xi)=\frac{1}{2}(x+\xi)^{2}. The mixed update with proper outperforms both SGD and SGD with momentum by a large margin. Furthermore, for the mixed update we can drive to further improve convergence, while for SGD with momentum we can not. Although we use the simple function as an example, similar result exists in any general quadratic form with positive definite Hessian. Furthermore, the experiments in Section 4 also demonstrate that the mixed clipping outperforms both gradient clipping and momentum clipping.
Combining (120), (121) and (122), we can write the recursive relationship into a matrix form:
Denote the above matrix as . After a straightforward calculation, we can find that is an eigenvalue of , and
is the only eigenvector associated with . Similarly, is also an eigenvalue of . Let the other two eigenvalues be and , then
It follows that and . Since , we have . Therefore and (note that and can be composite numbers). If , we can further conclude that the four eigenvalues are different from each other (otherwise , which contradicts to ).
E.1.2 Proof of the general case
Now we prove Proposition E.1 for general .
Combining (120), (124) and (125), we obtain the following recursive matrix :
Using the same calculation as in the previous section, we finally get
Appendix F Soft Clipping
For Algorithm 1, as long as the norm of the gradient (or momentum) exceeds a constant, it is then clipped; we refer to this form of clipping as hard clipping. One can also consider a soft form of clipping, as presented in Algorithm 3.
We take for example to analyze soft clipping. For any gradient norm , the norm of the update is a function of :
For hard clipping, we can similarly write
Therefore soft clipping is in fact equivalent to hard clipping up to a constant factor 2 in the step size choice. Thus it’s easy to see that our results also hold for Algorithm 3. However, compared to hard clipping, soft clipping has the advantage that the function in (127) is smooth while in (128) is not, as shown in Figure 4. We also empirically observe that the training curve of soft clipping is more smooth than hard clipping.
Appendix G Experimental Details in Section 4
Based on the discussion in Appendix F, we use the soft version of clipping algorithms in all the experiments.
The CIFAR-10 dataset contains 50k images for training and 10k for testing. All the images are RGB bitmaps. We use the standard ResNet-32 architecture. The total number of parameters is 466,906. For all algorithms, we use mini-batch size 128 and weight decay . For the baseline algorithm, we use SGD with momentum using learning rate and momentum factor . Note that we use the momentum defined in Algorithm 1, which is equivalent to a Pytorch implementation with and . We optimize ResNet-32 for 150 epochs, and decrease the learning rate at epoch 80 and epoch 120. For other algorithms, we perform a course grid search for an , while keeping all the training strategy the same as SGD. We use 5 random seeds ranging from 2016 to 2020, and the results are similar. The plot in Figure 2 uses the random seed 2020.
G.2 PTB
The Penn Treebank dataset has a vocabulary of size 10k, and 887k/70k/78k words for training/validation/testing. We use the state-of-the-art AWD-LSTM architecture using hidden size 1150 and embedding size 400. The total number of parameters is 23,941,600. For the baseline algorithm, we follow Merity et al. who use averaged SGD clipping without momentum using learning rate and . Note that here means that the gradient norm will be clipped to be no more than 0.25. We use the same dropout rate and regularization hyper-parameters in [Merity et al., 2017]. We train AWD-LSTM for 250 epochs, and averaging is triggered when the validation perplexity stops improving. For other algorithms, we perform a course grid search for an , while keeping all the training strategy the same as SGD clipping. We use 5 random seeds ranging from 2016 to 2020, and the results are similar. The plot in Figure 2 uses the random seed 2020.
G.3 ImageNet
We also conduct experiments on ImageNet dataset. This dataset contains about 1.28 million training images and 50k validation images with various sizes. We train the standard ResNet-50 architecture on this dataset. The total number of parameters is 25,557,032. We use a batch size of 256 on 4 GPUs and a weight decay of . For the baseline algorithm, we choose SGD with learning rate and momentum , following Goyal et al. . Note that we use the momentum defined in Algorithm 1, which is equivalent to a Pytorch implementation with and . We train the ResNet-50 for 90 epochs, and decrease the learning rate in epoch 30, epoch 60 and epoch 80. For the other algorithms, we perform a course grid search for an , while keeping all the training strategy the same as SGD.
Appendix H Additional Experimental Results
Comparsion with Adam optimizer. Adam is a popular optimizer to train neural networks on a variety of tasks. We also run the same experiments in Section 4 using Adam optimizer with best hyper-parameters in order to compare it with clipping algorithms. We turn the learning rate of Adam based on a grid search, and choose the momentum hyper-parameters and . The results are shown in Figure 5(a) and 5(b).
It is clear from the figure that Adam trains really fast on both datasets. It is even faster than gradient clipping and momentum clipping. However, it does not outperform the mixed clipping method. Note that like clipping algorithms, Adam also uses adaptive gradient. However, Adam generalizes much worse than vanilla SGD or clipping algorithms. Particularly, the test accuracy of CIFAR-10 using Adam is only 91.5%, while all other algorithms reach 93% test accuracy.
CIFAR-10 classificatoin with ResNet-18. The original work of gradient clipping in Zhang et al. [2020a] conducts experiments using ResNet-18 on CIFAR-10. To compare with it, we also run the same experiments using different algorithms, e.g. SGD, SGD clipping, momentum clipping and mixed clipping. All the algorithms reach 95% test accuracy, and the result is similar to that of using ResNet-32 architeture (see Figure 5(c)).
In this section, we are aiming to construct an optimization problem which provably satisfies the -smoothness condition in this paper rather than the traditional -smoothness condition. We then conduct experiments in both deterministic setting and stochastic setting.
In fact, if the exponential function is replaced by , the problem becomes the well-known logistic regression. However, logistic loss has bounded second-order derivative (thus is -smooth), while does not. Furthermore, exponential function is (0,1)-smooth, thus we expect is also -smooth for some (see the following proposition). This is why we use exponential loss here. We point out that such exponential loss is also used in a variety of algorithms, such as boosting (AdaBoost).
When the dataset is linearly separable, parameter will be driven to infinity through optimization, thus adding some regularization is prevalent in linear classification. We use the following term (131) rather than norm for regularization, in order to be compatible with .
In fact, is similar to weight decay regularization in that when is small.
The total loss . We now claim that is indeed -smooth.
Assume bias term for simplicity. Suppose the data points have bounded norm, i.e. for all and . Let the loss function be defined above. Then for every , , is -smooth w.r.t for
We use MNIST dataset in this section, which contains 60,000 hand-writing training images. We only evaluate the training speed for different algorithms on the training set rather than the generalization capability. The loss functions is defined to be the sum of ten losses, each of which corresponds to the loss of a binary classification problem to recognize number 0 to 9. Regularization coefficient is set to be 0.02.
To compare different algorithms, we choose the best hyperparameters and for each algorithm based on a careful grid search. is set to be 0.7 for mixed clipping. The parameter initalization and all inputs in the schocastic setting are the same for all algorithms. For each run, we average the loss of the last 5 epoch in order to reduce variance. In the deterministic setting we train 500 epochs, each of which uses the entire dataset. In the stochastic setting we train 50 epochs with a mini-batch size 200. We run on 5 different random seeds ranging from 2016 to 2020 altogether and average their results.
Figure 6 plots the results. It is clear that in both settings, clipping is vital to a fast convergence. Also, momentum helps training, and mixed clipping performs the best in the stochastic setting.
Let . Let be two constants. Pick . We consider the following two cases:
(1). In this case can be directly upper bounded:
The first inequality in (134) uses the triangular inequality of matrix spectral norm and .
(2). Decompose to be where
Define set and . Then
In (136) we use the Cauchy-Schwartz inequality; In (137) we partition the index to three subsets , and , and use the lower bound and upper bound of for each set.
Similar, we can upper bound :
To bound , we again bound from a different perspective:
where (142) is obtained by selecting the with the largest which is equal to . Substitute (138) and (142) into (140) then we get
Since implies that for all from (132), we can upper bound the norm of : . Substitute this into (144) we get
Combining the above two cases concludes the proof.