Improved Analysis of Clipping Algorithms for Non-convex Optimization

Bohang Zhang, Jikai Jin, Cong Fang, Liwei Wang

Introduction

The problem of the central interest in this paper is to minimize a general non-convex function presented below:

where $F(x)$ can be potentially stochastic, i.e.

For non-convex optimization problems in form of (1), since obtaining the global minimum is NP-hard in general, this paper takes the concern on a reasonable relaxed criteria: finding an $\varepsilon$ -approximate first-order stationary point such that $\|\nabla F(x)\|\leq\varepsilon$ .

Gradient clipping (Pascanu et al., 2012) is a simple and commonly used trick in algorithms that adaptively choose step sizes to make optimization stable. For the task of training deep neural networks (especially for language processing tasks), it is often a standard practice and is believed to be efficient in relieving the exploding gradient problem from empirical studies (Pascanu et al., 2013). More recently, Zhang et al. (2020a) proposed an inspiring theoretical justification on the clipping technique via introducing the $(L_{0},L_{1})$ -smoothness assumption. The concept of $(L_{0},L_{1})$ -smoothness is defined as follows.

This assumption can be further relaxed such that twice differentiability is not required (see Remark 2.3). Therefore the standard $L$ -smoothness assumption (i.e. the gradient of $f$ is $L$ -Lipschitz continuous) is stronger than the $(L_{0},L_{1})$ -smoothness one in the sense that the latter allows $\|\nabla^{2}F(x)\|$ to have a linear growth with respect to $\|\nabla F(x)\|$ .

Secondly, Zhang et al. (2020a) performed experiments to show that $(L_{0},L_{1})$ -smoothness is a preciser characterization of the landscapes for objective functions in many real-world tasks, especially for training a deep neural network model. It was observed that the local Lipschitz constant $L_{0}$ near the stationary point is thousands of times smaller than the global one $L$ in the LSTM training (see Figure 1(c) taken from Zhang et al. (2020a)).

Seeing this, it is desirable to give a comprehensive and deep analysis on iteration complexities for $(L_{0},L_{1})$ -smooth objectives. How fast can we achieve to find a first-order stationary point for $(L_{0},L_{1})$ -smooth functions? What are simple algorithms that provably achieve such a convergence rate? In this paper, we give affirmative answers to the above questions. In fact, due to the violent fluctuation of gradients, the efficiency of (stochastic) Gradient Descent with a constant step size degenerates, whereas we will show in this paper that by simply combining the clipping technique, a wide range of algorithms can achieve much better convergence rate for $(L_{0},L_{1})$ -smooth functions. In fact, when $\varepsilon$ is small, the complexities (i.e. the number of gradient queries required) are $\mathcal{O}\left(\Delta L_{0}\varepsilon^{-2}\right)$ for the deterministic setting and $\mathcal{O}\left(\Delta L_{0}\sigma^{2}\varepsilon^{-4}\right)$ for the stochastic setting (see Section 3 for details), which are both independent of $L_{1}$ . Compared with Zhang et al. (2020a) who only studied clipped (stochastic) gradient descent, we consider proposing a unified framework which contains a variety of clipping-based algorithms and achieve much sharper complexities. The main technique for our proof is by introducing a novel Lyapunov function which does not appear in existing studies. We believe that our work provides better understandings for the clipping technique in training deep neural networks. We summarize the contributions of the paper in the following.

We provide a general framework to analyze the clipping technique for optimizing $(L_{0},L_{1})$ -smooth functions. It contains a variety of clipping algorithms, including gradient clipping and momentum clipping as special cases.

We provide convergence analysis for the general framework we propose. We show that our bounds are tight by comparing with existing lower bounds. For gradient clipping, a special case in our framework, our result is much sharper than that proposed by Zhang et al. (2020a).

We conduct extensive experiments on a variety of different tasks, and observe that the clipping algorithms consistently perform better than vanilla ones.

Clipping/normalizing Techniques. Clipping/normalizing has long been a popular technique in optimizing large-scale non-convex optimization problems (e.g. (Mikolov, 2012; Pascanu et al., 2013; Goodfellow et al., 2016; You et al., 2017)). There are several views which provide understandings for the clipping and normalizing techniques. Some show that clipping can reduce the stochastic noise. For example, Zhang et al. (2019); Gorbunov et al. (2020) showed that clipping is crucial for convergence when the stochastic gradient noise is heavy-tailed. Menon et al. (2020) pointed out that clipping can mitigate the effect of label noise. Cutkosky and Mehta (2020) found that adding momentum in normalized SGD provably reduces the stochastic noise. Another line of works try to understand the function of clipping and normalizing for the standard smooth optimization. For example, Levy (2016) showed that normalized GD can provably escape saddle points. Fang et al. (2018) designed a new algorithm based on normalized GD which achieves a faster convergence rate under suitable conditions. Gradient clipping has also been used to design differentially private optimization algorithms (Abadi et al., 2016).

The work of Zhang et al. (2020a) is mostly related to this paper. A detailed comparison between the two works is shown in Subsection 2.2.

Lower Bounds For Non-convex Optimization. A series of recent works establish lower bounds for finding an $\varepsilon$ -stationary point of a general non-convex and $L$ -smooth function, either in deterministic setting (Carmon et al., 2019) or in stochastic setting (Drori and Shamir, 2019; Arjevani et al., 2019). In this paper we borrow their counter examples to show the tightness of our obtained complexities for the general $(L_{0},L_{1})$ -smooth functions.

Assumptions & Comparisons of Results

We first present the assumptions that will be used in our theoretical analysis, which follows from Zhang et al. (2020a).

We assume that $F(x)$ is $(L_{0},L_{1})$ -smooth.

It does not need $F$ to be twice differentiable and is strictly weaker than $L$ -smoothness.

In the stochastic setting, for the briefness of our analysis, we assume that the noise is unbiased and bounded.

2 Comparisons of Results

In this paper, we mainly take our concern on the stochastic setting, though we also establish the complexities for the deterministic setting. We summarize the comparison in the stochastic setting with existing complexity results in Table 1, in which we only present the dominating complexities with respect to $\varepsilon$ .

Compared with the bound for clipped SGD established in Zhang et al. (2020a), our results improve theirs on the dependencies for all problem-dependent parameters, i.e. $\Delta$ , $\sigma$ , $L_{0}$ , and especially $L_{1}$ by order. For SGD with arbitrarily chosen step sizes (thus include clipped SGD), the example in Drori and Shamir (2019) can be used to show that clipped SGD is optimal (cf. Section 3.3).

General Analysis of Clipping

We aim to present a general framework in which we can provide a unified analysis for commonly used clipping-based algorithms. Since momentum is one of the most popular acceleration technique in optimization community, our framework takes this acceleration procedure into account. We show our framework in Algorithm 1, where we can simply replace $\nabla f(x_{t},\xi_{t})$ by $\nabla F(x_{t})$ for the deterministic setting.

We notice that our framework is similar to the Quasi-Hyperbolic Momentum(QHM) algorithm proposed by Ma and Yarats (2018), while they did not consider the clipping technique. They pointed out that QHM contains a wide range of popular algorithms (e.g. SGD+momentum, Nesterov Accelerated SGD, AccSGD, etc). As a result, for different choice of hyper-parameters, our framework encompasses the clipping version of all these algorithms. We now discuss several representative examples in our framework.

Gradient Clipping. By choosing $\nu=0$ in Algorithm 1, we obtain the clipped GD/SGD algorithm which can be written as:

It follows that in gradient clipping, the gradient is clipped to have its norm no more than $\gamma/\eta$ .

Momentum Clipping. By choosing $\nu=1$ in Algorithm 1, we perform the update using a clipped version of momentum which can be written as:

The approach has already been used in previous works (Zhang et al., 2019, 2020b), albeit in different settings. To the best of our knowledge, there is no existing analysis of this algorithm even for optimizing standard $L$ -smooth functions.

Mixed Clipping. By choosing $\nu\in(0,1)$ , we obtain the mixed clipping algorithm. Although this form of clipping is not widely used in practice, we observe from experiments that it typically converges faster than both gradient clipping and momentum clipping. Some explanations of this observation are provided in Appendix E.

Normalized Momentum. By choosing $\nu=1$ and $\eta\rightarrow+\infty$ in Algorithm 1, we recover the normalized SGD+momentum algorithm. This algorithm performs a normalized (rather than clipped) update in each iteration. It has been analyzed in Cutkosky and Mehta (2020) for $L$ -smooth functions and a layer-wise variant was used in the LARS algorithm (You et al., 2017). We will provide a detailed discussion of this algorithm in the Appendix C.

In this section we first deal with the deterministic case, in which we can get a strong justification that clipping is a natural choice to optimize $(L_{0},L_{1})$ -smooth functions. We have the following Theorem.

[Convergence of Algorithm 1, Deterministic Setting] Let the function $F$ satisfy Assumptions 2.1 and 2.2. Set $m_{0}=\nabla F(x_{0})$ in Algorithm 1 for simplicity. Fix $\varepsilon>0$ be a small constant. For any $0\leq\beta<1$ and $0\leq\nu\leq 1$ , if $\gamma\leq\frac{1-\beta}{10BL_{1}}$ and $\eta\leq\frac{1-\beta}{10AL_{0}}$ where $A=1.06$ , $B=1.06$ , then

In Theorem 3.1, the $(L_{0},L_{1})$ -smoothness is precisely reflected in the restriction of hyper-parameters $\gamma=\mathcal{O}(1/L_{1})$ and $\eta=\mathcal{O}(1/L_{0})$ . For large $L_{1}$ , we must use a small clipping hyper-parameter to guarantee convergence. This also coincides with the intuition that in highly non-smooth regions we should take a small step.

Theorem 3.1 states that in the deterministic setting, for any $\varepsilon>0$ , our framework can find an $\varepsilon$ -approximate stationary point in $\mathcal{O}\left(\Delta\max\left\{\frac{L_{0}}{\varepsilon^{2}},\frac{L_{1}^{2}}{L_{0}}\right\}\right)$ gradient evaluations if we choose $\gamma=\Theta\left({1}/{L_{1}}\right)$ and $\eta=\Theta\left({1}/{L_{0}}\right)$ . When $\varepsilon=\mathcal{O}(L_{0}/L_{1})$ , the dominating term is $\mathcal{O}\left(\Delta{L_{0}}{\varepsilon^{-2}}\right)$ .

Now we turn to our main result in the stochastic setting. We have the following theorem.

[Convergence of Algorithm 1, Stochastic setting] Let the function $F$ satisfy Assumptions 2.1 and 2.2, and the noise satisfies Assumption 2.4 with $\sigma\geq 1$ . Set $m_{0}=\nabla F(x_{0})$ in Algorithm 1 for simplicity. Fix $0<\varepsilon\leq 0.1$ be a small constant. For any $0\leq\beta<1$ and $0\leq\nu\leq 1$ , if $\gamma\leq\frac{\varepsilon}{2\sigma}\min\left\{\frac{\varepsilon}{AL_{0}},\frac{1-\beta}{AL_{0}},\frac{1-\beta}{25BL_{1}}\right\}$ and $\gamma/\eta=5\sigma$ where constants $A=1.01,B=1.01$ , then

Here the expectation is taken over all the randomness $\xi_{0},\cdots,\xi_{T-1}$ .

Theorem 3.2 shows that in the stochastic setting, for any $\varepsilon>0$ , our framework can find an $\varepsilon$ -approximate stationary point in $\mathcal{O}\left(\Delta\sigma^{2}\left(\max\left\{\frac{L_{0}}{\varepsilon^{4}},\frac{L_{1}^{4}}{L_{0}^{3}}\right\}\right)\right)$ gradient evaluations. When $\varepsilon<\min\left\{1,\frac{L_{0}}{25L_{1}}\right\}(1-\beta)$ , the term $\min\left\{\frac{\varepsilon}{AL_{0}},\frac{1-\beta}{AL_{0}},\frac{1-\beta}{25BL_{1}}\right\}$ reduces to $\frac{\varepsilon}{AL_{0}}$ . In this case $L_{1}$ no longer affects the choice of steps sizes $\eta$ and $\gamma$ , and the complexity in (4) reduces to $\mathcal{O}\left(\Delta L_{0}\sigma^{2}\varepsilon^{-4}\right)$ .

Theorem 3.2 suggests that the clipping threshold should take $\gamma/\eta=\Theta(\sigma)$ , which only depends on the noise and is several times larger than the its variance. This matches previous understanding of gradient clipping, in that clipping the stochastic gradient controls the variance while introducing some additional bias, and the clipping threshold should be tuned to trade-off variance with the introduced bias (Zhang et al., 2019).

We emphasize that in both settings, the dominating terms in our upper bounds are independent of the gradient upper bound $M$ and the smoothness parameter $L_{1}$ . In other words, the efficiency of Algorithm 1 is essentially unaffected by these quantities. Recall that $M$ and $L_{1}$ are related to steep cliffs in the landscape where the gradient may be large or fluctuate violently. Therefore, our results suggest that such non-smoothness can be tackled with clipping methods without sacrificing efficiency.

2 Proof Sketch

The analysis of Algorithm 1 is in fact challenging, as it uses both momentum and adaptive step sizes. Also, the general $(L_{0},L_{1})$ -smoothness assumption makes things more complicated. In this subsection we briefly introduce our proof technique. We hope our proof is also useful to a better understanding of other adaptive algorithms that combine momentum (such as Adam (Kingma and Ba, 2014)).

Proof sketch of Theorem 3.1. Due to the momentum term, each step in Algorithm 1 is not necessarily a descent one, which makes it difficult to prove convergence using traditional techniques. Instead, we construct a novel Lyapunov function as follows:

We prove this lemma by using the $(L_{0},L_{1})$ -smoothness properties deduced in Appendix A and conducting a comprehensive discussion on three cases in Lemmas B.1-B.3. Furthermore, we show in Lemma B.4 that $\sum_{t\in\mathcal{S}}\|\nabla F(x_{t})-m_{t}\|=\mathcal{O}\left(\gamma T_{\mathcal{S}}(\rho+\sum_{t\in\mathcal{S}}\|\nabla F(x_{t})\|)\right)$ . By choosing a small enough $\gamma$ and carefully dealing with constants, we can conclude that the total amount of decrease of the Lyapunov function is $\Omega\left(\rho\gamma T_{\mathcal{S}}\right)$ .

For any $t\in\overline{\mathcal{S}}$ , if $\eta=\mathcal{O}(1/L_{0})$ and $\gamma=\mathcal{O}(1/L_{1})$ , then we have

We prove (7) in Lemma B.5-B.7 by the fact that Algorithm 1 performs an unclipped update if $t\in\overline{\mathcal{S}}$ . Since the bound (7) is small in term of $\|\nabla F(x_{t})\|$ if $\beta$ and $\nu$ are close to 1, we convert $\|m_{t}\|$ to $\|\nabla F(x_{t})\|$ by proving that $\sum_{t\in\overline{\mathcal{S}}}\|m_{t}\|=\Omega\left(\sum_{t\in\overline{\mathcal{S}}}\|\nabla F(x_{t})\|-\gamma(\rho T_{\mathcal{S}}+\sum_{t\in\mathcal{S}}\|\nabla F(x_{t})\|)\right)$ in Lemma B.8. The term related to $\mathcal{S}$ can all be offset by the terms in Lemma 3.3, and the term related to $\overline{\mathcal{S}}$ combined with $\eta(1-\nu\beta)\|\nabla F(x_{t})\|^{2}$ can be shown to ensure a descent amount of $\Omega\left(\varepsilon\eta\|\nabla F(x_{t})\|-\varepsilon^{2}\eta\right)$ , which is $\Omega\left(\varepsilon^{2}\eta\right)$ as long as $\|\nabla F(x_{t})\|\geq 2\varepsilon$ .

Finally, by combining the above two cases we obtain the conclusion in Theorem 3.1.

Firstly, since the gradients are not exact, the stochastic gradient we have access to is not guaranteed to be small even for the case of $t\in\overline{\mathcal{S}}$ . Fortunately, the choice of parameters in Theorem 3.2 ( $\rho=5\sigma$ ) settles such difficulty.

Finally, by choosing proper $\eta$ and $\gamma$ , we can obtain Theorem 3.2.

3 Lower Bounds and Discussions

Theorem 3.1 and 3.2 provide upper bounds for the complexity of Algorithm 1. Now we compare these results with existing lower bounds and discuss the tightness of our results.

Deterministic Setting. Carmon et al. (2019) have shown that there exists an $L$ -smooth function $F$ such that any (possibly randomized) algorithm requires at least $\Omega\left(\Delta L\varepsilon^{-2}\right)$ queries to gradient to ensure finding a point $x$ such that $\|\nabla F(x)\|\leq\varepsilon$ . Since Assumption 2.2 is weaker than $L$ -smoothness ( $L_{0}\leq L$ ), we have that the lower bound for $(L_{0},L_{1})$ -smooth functions is $\Omega\left(\Delta L_{0}\varepsilon^{-2}\right)$ . From Theorem 3.1, Algorithm 1 is optimal since it can achieve the lower bound when ignoring numerical constants.

Stochastic Setting. From the example constructed in Drori and Shamir (2019), we have that for any SGD method with arbitrary (possibly adaptive) step sizes and aggregation schemesSee details in Appendix D., the complexity lower bound is exactly $\Omega\left(\Delta L_{0}\sigma^{2}\varepsilon^{-4}\right)$ for $(L_{0},L_{1})$ -smooth functions that have $\sigma$ -bounded gradient noises. Therefore Theorem 3.2 indicates that clipped SGD matches the lower bound.

One may ask : what is the lower bound for general stochastic gradient-based algorithms? In fact, from the example in Arjevani et al. (2019), we have that any algorithm needs $\Omega\left(\Delta L_{0}\sigma^{2}_{1}\varepsilon^{-4}\right)$ stochastic gradient queries to find an $\varepsilon$ -approximate stationary point for a hard $(L_{0},L_{1})$ -smooth function whose gradient noise has a $\sigma^{2}_{1}$ -bounded variance. It is our conjecture that the lower bound for optimizing $(L_{0},L_{1})$ -smooth functions that have $\sigma$ -bounded gradient noises is also $\mathcal{O}\left(\Delta L_{0}\sigma^{2}\varepsilon^{-4}\right)$ . We leave the study as a future work.

Experiments

We conduct extensive experiments and find the clipping algorithms indeed consistently outperform their unclipped counterpart. We present experimental results on three deep learning benchmarks: CIFAR-10 classification using ResNet-32, Imagenet classification using ResNet-50 and language modeling on Penn Treebank (PTB) dataset using AWD-LSTM. We put all the experimental details in the Appendix G. Our code is available at https://github.com/zbh2047/clipping-algorithms.

CIFAR-10 classification with ResNet. We train the standard ResNet-32 (He et al., 2016) architecture on CIFAR-10. We use SGD with momentum for the baseline algorithm with a decaying learning rate schedule, which is the standard choice to train the ResNet architecture. We set learning rate $\eta=1.0$ , momentum $\beta=0.9$ and minibatch size 128, following the common practice. For all the clipping algorithms, we choose the best $\eta$ and $\gamma$ based on a course grid search, while keeping other hyper-parameters and training strategy the same as SGD+momentum. We simply set the hyper-parameters $\nu=0.7$ and $\beta=0.999$ in mixed clipping, as suggested in Ma and Yarats (2018) (for its unclipped counterpart QHM). We run 5 times for each algorithm using different random seeds to make the results more reliable.

Figures 2(a) demonstrates the results. It can be seen that all the algorithms achieve a test accuracy more than 93% on CIFAR-10. Note that all clipping algorithms converge faster than SGD+momentum. Particularly, the mixed clipping (Algorithm 1) outperforms SGD+momentum by a large margin in term of training speed. As a result, one can possibly adopt a more aggressive learning rate decaying schedule to reduce training time considerably.

ImagNet classification with ResNet. We train the standard ResNet-50 (He et al., 2016) architecture on ImageNet. For the baseline algorithm, we choose SGD with learning rate $lr=1.0$ and momentum $\beta=0.9$ , following Goyal et al. (2017). We use batch size 256 on 4 GPUs.

Figure 2(b) plot the training loss curve and validation accuracy curve on ImageNet. All the algorithms reach a validation accuracy of about 76%. However, all the clipping algorithms train faster than the baseline SGD. Mixed clipping performs the best among the four algorithms.

Language modeling with LSTM. We train the state-of-the-art AWD-LSTM (Merity et al., 2017) on Penn Treebank (PTB) dataset (Mikolov et al., 2010). We first follow the training strategy in Merity et al. (2017), where they use averaged SGD without momentum with learning rate $\eta=30$ and clipping parameter $\gamma=7.5$ . Since our purpose is to compare different algorithms rather than to achieve state-of-the-art results, we only train AWD-LSTM for 250 epochs. We then evaluate other algorithms including standard SGD without clipping, momentum clipping, and mixed clipping. We choose the best $\eta$ and $\gamma$ (using validation perplexity criterion) based on a course grid search. Results are shown in Figure 2(c).

Figure 2(c) clearly shows all clipping methods converge much faster than SGD without clipping, and are much better in term of validation perplexity. This is consistent with our theory, in that the vanilla SGD must use a very small learning rate to guarantee convergence (Zhang et al., 2020a), which will be slow and be harmful to generalization on validation set according to previous works (Huang et al., 2017; Kleinberg et al., 2018) . Therefore clipping technique is crucial in LSTM models. We can also find that the training and test curve of mixed clipping is much better than both gradient clipping and momentum clipping. The mixed clipping improves validation perplexity for more than 1 point compared to clipped SGD after 250 epochs.

Other experiments. We also conduct experiments to compare clipping algorithms with Adam, and to directly compare our work with previous results (Zhang et al., 2020a) under the same setting. See Appendix H for details. Finally, we construct a provably $(L_{0},L_{1})$ -smooth optimization problem using MNIST dataset. We then run experiments in both deterministic setting and stochastic setting. The results are shown in Appendix I.

Conclusion

This paper proposes a detailed study for clipping methods under a general framework. In particular, we explore the possibility of combining clipping with other popular techniques, e.g. momentum acceleration, in deep learning. We provide a general and tight analysis for the framework, showing the efficiency of clipping methods in optimizing a class of non-convex and non-smooth (in traditional sense) functions. Experiments confirm that these methods have superior performance. We hope that our work affords more understandings on the clipping technique and $(L_{0},L_{1})$ smooth functions.

There are still many open questions that have not yet been answered. Firstly, as discussed in Section 3.3, we are not aware of any lower bounds for general first-order methods that can be applied our setting. Thus, it is interesting to explore such lower bound, or to relax Assumption 2.4 to the more general bounded variance assumption. Secondly, although we have shown the superiority of clipping-based methods, we do not provide theoretical explanation why some clipping schemes are better than others as observed in experiments. We believe that this can only be done by exploring new and better smoothness assumptions. Thirdly, the empirical superiority of other adaptive methods ( e.g. AdaGrad (Duchi et al., 2011), Adam (Kingma and Ba, 2014) ) have not been justified from a theoretical point of view. We hope that our analysis is helpful for the analysis of these methods. Finally, we are looking forward to seeing better optimization algorithms with better convergence properties in future work.

Acknowledgement

This work was supported by National Key R&D Program of China (2018YFB1402600), Key-Area Research and Development Program of Guangdong Province (No. 2019B121204008)] and Beijing Academy of Artificial Intelligence.

References

In this section, we prove some important properties of $(L_{0},L_{1})$ -smooth functions. These properties will be frequently used in subsequent sections.

We first present a basic lemma without proof.

(Gronwall’s inequality) [Gronwall, 1919] Let $I=[a,b]$ denote an interval of the real line with $a<b$ . Let $f,g,h$ be continuous real-valued functions defined on $I$ . Assume $g$ is non-decreasing, $h$ is non-negative, and the negative part of $g$ is integrable on every closed and bounded subinterval of $I$ .If

The following result, Lemma A.2 , is a generalization of Lemma 9 in Zhang et al. [2020a].

Let $F$ be $(L_{0},L_{1})$ -smooth, and $c>0$ be a constant. Given $x$ , for any $x^{+}$ such that $\left\|x^{+}-x\right\|\leq c/L_{1}$ , we have $\left\|\nabla f\left(x^{+}\right)\right\|\leq e^{c}\left(\frac{cL_{0}}{L_{1}}+\|\nabla F(x)\|\right)$ .

Let $\gamma(t)$ be defined as $\gamma(t)=t(x^{+}-x)+x,t\in$ , then we have

We then bound the norm of $\nabla F(\gamma(t))$ :

The first inequality uses the triangular inequality of 2-norm; The second inequality uses the property of spectral norm; The third inequality uses the definition of $(L_{0},L_{1})$ -smoothness. By applying the Gronwall’s inequality we get

The Lemma follows by setting $t=1$ . $\square$

Now we are able to prove a descent inequality, which is similar to the descent inequality for $L$ -smooth functions. In fact, if a function $F$ is $L$ -smooth, it is well-known that for any $x,y$ , we have

(Descent Inequality) Let $F$ be $(L_{0},L_{1})$ -smooth, and $c>0$ be a constant. For any $x_{k}$ and $x_{k+1}$ , as long as $\|x_{k}-x_{k+1}\|\leq c/L_{1}$ , we have

where $A=1+e^{c}-\frac{e^{c}-1}{c},B=\frac{e^{c}-1}{c}$ .

Let $\gamma(t)$ be defined as $\gamma(t)=t(x_{k+1}-x_{k})+x_{k},t\in$ . The following derivation uses Taylor’s theorem (in (16)), then uses triangular inequality, Cauchy-Schwarz inequality and the property of spectral norm (in (17)):

Then we use $(L_{0},L_{1})$ -smoothness and (14) to bound $\left\|\nabla^{2}F(\gamma(t))\right\|$ :

Substituting (20) into (18) concludes the proof. $\square$

Let $F$ be $(L_{0},L_{1})$ -smooth, and $c>0$ be a constant. For any $x_{k}$ and $x_{k+1}$ , as long as $\|x_{k}-x_{k+1}\|\leq c/L_{1}$ , we have

where $A=1+e^{c}-\frac{e^{c}-1}{c},B=\frac{e^{c}-1}{c}$ .

Using (20) leads to the results. $\square$

Finally we prove a result which provides a way to upper-bound the gradient norm. A similar result for $L$ -smooth functions is the following: if $F$ is $L$ -smooth, then for any $x$ , we have

(Bounding the gradient norm) Let $F(x)$ be an $(L_{0},L_{1})$ -smooth function, and $F^{*}$ be the optimal value. Then for any $x_{0}$ , we have

Define the constant $c=\frac{L_{1}\|\nabla F(x_{0})\|}{AL_{0}+BL_{1}\|\nabla F(x_{0})\|}$ and $A=1+e^{c}-\frac{e^{c}-1}{c},B=\frac{e^{c}-1}{c}$ . It is easy to see that such $0\leq c<1$ exists. Let $\lambda=\frac{1}{AL_{0}+BL_{1}\|\nabla F(x_{0})\|}$ and $x=x_{0}-\lambda\nabla F(x_{0})$ . Then $\|x-x_{0}\|\leq c/L_{1}$ . By the descent inequality we have

If $\|\nabla F(x)\|\geq\frac{AL_{0}}{BL_{1}}$ , then

If $\|\nabla F(x)\|<\frac{AL_{0}}{BL_{1}}$ , then

The original definition of $(L_{0},L_{1})$ -smoothness requires the function to be twice-differentiable. Under this definition, $(L_{0},L_{1})$ -smoothness is actually not weaker than $L$ -smoothness, which only requires the function to be continuous differentiable. In this section we prove that the alternative definition provided in Remark 2.3 is sufficient for all the results in this paper.

We check that Lemma A.2 and A.3 still holds under the new assumption (with $L_{0},L_{1}$ replaced by $K_{0},K_{1}$ , up to numerical constants) We immediately obtain from (27) above that

which is of the same form as Lemma A.2. Next, we have

Since all the other results are established on the basis of these two lemmas, we can see that the conclusion still holds under (27).

Appendix B Proof of Theorems

We first prove the deterministic case (Theorem 3.1), then generalize the result to stochastic case (Theorem 3.2). In deterministic case we can use fewer notations, which will make the proof more readable and elegant. The proof in stochastic case will rely on all the techniques used in the deterministic case, as well as some new methods.

To simplify the notation, we write the update formula as

when analyzing a single iteration. The error between $m^{+}$ and $\nabla F(x)$ is denoted as $\delta=m^{+}-\nabla F(x)$ . Suppose $\gamma\leq c/L_{1}$ for some constant $c$ , and we denote $A=1+e^{c}-\frac{e^{c}-1}{c}$ and $B=\frac{e^{c}-1}{c}$ , just the same as in the descent inequality (Lemma A.3).

Let $\mu\geq 0$ be a real constant. For any vector $u$ and $v$ ,

To prove the theorem, we will construct an Lyapunov function and explore the decreasing property of this function. We define the Lyapunov function $G(x,m)$ to be

and analyze $G(x^{+},m^{+})-G(x,m)$ . We first bound $\min\left({\eta}\|m^{+}\|^{2},{\gamma}\|m^{+}\|\right)-\min\left({\eta}\|m\|^{2},{\gamma}\|m\|\right)$ .

For any momentum vectors $m$ and $m^{+}=\beta m+(1-\beta)\nabla F(x)$ , let $\delta=m^{+}-\nabla F(x)$ , then

$\|m\|<\gamma/\eta$ and $\|m^{+}\|<\gamma/\eta$ . In this case

$\|m\|<\gamma/\eta$ and $\|m^{+}\|>\gamma/\eta$ . In this case

Thus in all cases $\min\left({\eta}\|m^{+}\|^{2},{\gamma}\|m^{+}\|\right)-\min\left({\eta}\|m\|^{2},{\gamma}\|m\|\right)$ can be upper bounded by $\frac{2(1-\beta)}{\beta}\gamma\|\delta\|$ . $\square$

Suppose $\max(\|\nabla F(x)\|,\|m^{+}\|,\|m\|)\geq\gamma/{\eta}$ . Then

We first write $G(x^{+},m^{+})-G(x,m)$ as

Based on Lemma B.2, we only need to bound $F(x^{+})-F(x)$ . We will use the $(L_{0},L_{1})$ -smoothness assumption.

Where the first inequality uses the descent inequality (Lemma A.3), the second equation follows from the update rule, and the last inequality is obtained by the following two inequalities:

First we prove that (37) holds by considering the following three cases:

$\|m^{+}\|\geq\gamma/\eta$ . In this case the algorithm performs a normalized update. Then (37) follows by directly using Lemma B.1 with $\mu=2/5$ :

$\|m^{+}\|<\gamma/\eta$ and $\|\nabla F(x)\|\geq\gamma/\eta$ . In this case the algorithm performs an unnormalized update. We now prove $-\eta\left\langle\nabla F(x),m^{+}\right\rangle\leq-\frac{2}{5}\gamma\|\nabla F(x)\|-\frac{3\gamma^{2}}{5\eta}+\frac{7}{5}\gamma\|\nabla F(x)-m^{+}\|$ .

$\|m^{+}\|<\gamma/\eta$ and $\|\nabla F(x)\|<\gamma/\eta$ . This is the most complicated case. Due to the condition in Lemma B.3, $\|m\|\geq\gamma/\eta$ . In this case, the algorithm also performs an unnormalized update. We first bound $\eta\left\langle\nabla F(x),m\right\rangle$ using the same calculation as in the second case:

where we use the fact that $\|\nabla F(x)-m\|=\|\delta\|/\beta$ . We then bound $\eta\|\nabla F(x)\|^{2}$ as follows:

Combining the two inequalities, we obtain

Thus in all cases (37) holds. We now turn to (38) which is proven in a similar fashion. Specifically, consider the following three cases:

$\|\nabla F(x)\|\geq\gamma/\eta$ . In this case

$\|\nabla F(x)\|<\gamma/\eta$ and $\|m^{+}\|\geq\gamma/\eta$ . In this case bound $\eta\|\nabla F(x)\|^{2}$ the same as in the third case of (37):

$\|\nabla F(x)\|<\gamma/\eta$ and $\|m^{+}\|<\gamma/\eta$ . In this case $\|m\|\geq\gamma/\eta$ . Using the same calculation above,

Thus (38) holds. Merging all the cases above, we finally obtain

Now we consider all the steps $t$ which satisfy the condition in Lemma B.3, denoted as $\mathcal{S}=\{t\in[0,T-1]:\max(\|F(x_{t})\|,\|m_{t+1}\|,\|m_{t}\|)\geq\gamma/{\eta}\}$ . Similarly, use $\overline{\mathcal{S}}=[0,T-1]\backslash\mathcal{S}$ . Let $T_{\mathcal{S}}=|\mathcal{S}|$ , then $T-T_{\mathcal{S}}=|\overline{\mathcal{S}}|$ .

Let set $\mathcal{S}$ and $T_{\mathcal{S}}$ be defined above. Then

We now focus on the summation of the term $\|\delta_{t}\|$ . Define $S(a,b)=\nabla F(a)-\nabla F(b)$ . When $\|a-b\|\leq\gamma$ , $\|S(a,b)\|\leq\gamma\left(AL_{0}+BL_{1}\|\nabla F(b)\|\right)$ (see Lemma A.4). Thus we can expand $\delta_{t}=m_{t+1}-\nabla F(x_{t})$ using the recursive relation $\delta_{t}=\beta\delta_{t-1}+\beta S(x_{t-1},x_{t})$ as follows

where the last inequality uses the fact that $\|\nabla F(x_{\tau})\|\leq\gamma/\eta$ for all $\tau\in[1,t]\backslash\mathcal{S}$ .

After substituting the above results into (LABEL:lemma_mom_clip_deterministic_large_sum_1) we obtain

Now we turn to the case in which $\max(\|\nabla F(x)\|,\|m^{+}\|,\|m\|)\leq\gamma/{\eta}$ .

Suppose $\max(\|\nabla F(x)\|,\|m^{+}\|,\|m\|)\leq\gamma/{\eta}$ . Then

where $c_{1}=\nu(1-\beta)(2-\beta)-L\eta(1-\beta\nu)^{2}+2(1-\nu),c_{2}=\nu\beta(1-\beta)-L\eta\beta\nu(1-\beta\nu),c_{3}=\nu\beta(1+\beta)-L\eta(\beta\nu)^{2}$ , and $L=AL_{0}+BL_{1}\gamma/\eta$ .

In the case of $\|m\|\leq\gamma/{\eta}$ , we have $\eta\|m\|^{2}\leq\gamma\|m\|$ , thus

We then bound $F(x^{+})-F(x)$ and $\|m^{+}\|^{2}-\|m\|^{2}$ . Note that $\|m^{+}\|\leq\gamma/\eta$ implies that the algorithm performs an update without normalization. Define $L:=AL_{0}+BL_{1}\gamma/\eta$ , then again by descent inequality,

by definition of the Lyapunov function, we have

where $c_{1}=\nu(1-\beta)(2-\beta)-L\eta(1-\beta\nu)^{2}+2(1-\nu),c_{2}=\nu\beta(1-\beta)-L\eta\beta\nu(1-\beta\nu),c_{3}=\nu\beta(1+\beta)-L\eta(\beta\nu)^{2}$ . $\square$

Let $c_{1},c_{2},c_{3}$ and $L$ be defined in Lemma B.5. If $L\eta\leq 1$ , then the matrix

is symmetric and positive semi-definite, where $I_{d}$ is the $d\times d$ identity matrix.

In fact we only need to consider the case when $d=1$ , because the eigenvalues of $H_{2d\times 2d}$ can only be those that appears in $H_{2\times 2}$ ( $d=1$ ). Denote two eigenvalues be $\lambda_{1},\lambda_{2}$ when $d=1$ . A direct calculation shows that

If $L\eta\leq 1$ , then $\lambda_{1}\lambda_{2}\geq 0$ and $\lambda_{1}+\lambda_{2}\geq 0$ , which is equivalent to the semi-definiteness of $H$ . $\square$

Suppose $\max(\|\nabla F(x)\|,\|m\|,\|m^{+}\|)\leq\gamma/{\eta}$ . If $L\eta\leq 1$ , Then

Let $H$ be defined in Lemma B.6. The result of Lemma B.5 can be written in a matrix form:

Using the fact that $H$ is positive semi-definite, we obtain the desired result. $\square$

Suppose $AL_{0}\eta\leq c_{1}(1-\beta)$ and $BL_{1}\gamma\leq c_{3}(1-\beta)$ for some constant $c_{1}$ and $c_{3}$ . Let $m_{0}=\nabla F(x_{0})$ for simplicity. Let set $\mathcal{S}$ and $\overline{\mathcal{S}}$ be defined above. Then

where the last inequality follows by Corollary A.4. Applying (50) recursively, we obtain

Using $\|m_{t}\|\geq\|\nabla F(x_{t})\|-\|m_{t}-\nabla F(x_{t})\|$ and some straightforward calculation, we obtain

Now we are ready to prove the main theorem.

Let $F^{*}$ be the optimal value, and $\Delta=F(x_{0})-F^{*}$ . Assume $m_{0}=\nabla F(x_{0})$ for simplicity. If $\gamma\leq\frac{1-\beta}{10BL_{1}}$ and $\eta\leq\frac{1-\beta}{10AL_{0}}$ , where constants $A=1+e^{1/10}-10{(e^{1/10}-1)<1.06}$ , $B=10(e^{1/10}-1)<1.06$ , and $\varepsilon<\frac{\gamma}{5\eta}$ , then

By calculating $L\eta=AL_{0}\eta+BL_{1}\gamma\leq(1-\beta)/5<1$ , we can use Corollary B.7. Taking summation of the inequality (47) over steps $t\in\overline{\mathcal{S}}=[0,T-1]\backslash\mathcal{S}$ , we obtain

Combining (56) and (40) in Corollary B.4 we obtain

we have $AL_{0}\eta\leq(1-\beta)/10$ and $BL_{1}\gamma\leq(1-\beta)/10$ . Using Lemma B.14 we have

Therefore by standard inequality $x^{2}\geq 2\varepsilon x-\varepsilon^{2}$ and (57) we obtain

We now simplify $U(x)$ . Let $\varepsilon\leq\frac{\gamma}{5\eta}$ . By (58) we have

Since $\varepsilon<\dfrac{\gamma}{5\eta}$ , we have $U(x)\geq V(x)$ . Therefore by (LABEL:longineq) and Lemma A.5 we have

B.2 Proof of Theorem 3.2

We now prove the stochastic case. As before, to simplify the notation we write the update formula as

In stochastic case, we define the Lyapunov function to be

Suppose $\gamma\leq c/L_{1}$ for some constant $c$ , and we denote $A=1+e^{c}-\frac{e^{c}-1}{c}$ and $B=\frac{e^{c}-1}{c}$ , just the same as in the descent inequality (Lemma A.3). When $\gamma\leq\frac{1-\beta}{50L_{1}}\varepsilon\leq\frac{1}{500L_{1}}$ (in Theorem 3.2), we can take $c=1/500$ and $A=B=1.002$ .

Furthermore, using the noise assumption, for different time steps $t,t^{\prime}$ , we have

Based on Lemma B.2, we only need to bound $F(x^{+})-F(x)$ . We use the $(L_{0},L_{1})$ -smooth condition:

Now we bound $\left\langle\nabla F(x),x^{+}-x\right\rangle$ . The calculation is similar to the deterministic setting. We first bound $-\min\left(\eta,\frac{\gamma}{\|m^{+}\|}\right)\left\langle m^{+},\nabla F(x)\right\rangle$ . Consider the following three cases, all of which are analogous to the proof of Lemma B.3:

$\|m^{+}\|\geq\gamma/\eta$ . The algorithm performs a normalized update. We have

$\|m^{+}\|<\gamma/\eta$ and $\|\nabla F(x)\|\geq 4\gamma/5\eta$ . The algorithm performs an unnormalized update. We have

where (77) uses the following two inequalities which can be obtained by Lemma B.10:

We next bound $-\min\left(\eta,\frac{\gamma}{\|\nabla f(x,\xi)\|}\right)\left\langle\nabla f(x,\xi),\nabla F(x)\right\rangle$ . Consider the following cases, all of which are analogous to the proof of Lemma B.3:

$\|\nabla f(x,\xi)\|\geq\gamma/\eta$ . In this case we can use Lemma B.1 with $\mu=2/5$ :

$\|\nabla f(x,\xi)\|<\gamma/\eta$ . In this case

We now bound $-\eta\|\nabla F(x)\|^{2}$ . If $\|\nabla F(x)\|\geq\frac{4\gamma}{5\eta}$ , then $-\eta\|\nabla F(x)\|^{2}\leq-\frac{4}{5}\gamma\|\nabla F(x)\|$ . If $\|\nabla F(x)\|<\frac{4\gamma}{5\eta}$ and $\|m^{+}\|\geq\frac{4\gamma}{5\eta}$ , then using the same calculation as in the deterministic case,

Let set $\mathcal{S}$ and $T_{\mathcal{S}}$ be defined above. Then

where $c_{1}=\nu(1-\beta)(2-\beta)-AL_{0}\eta(1-\beta\nu)^{2}+2(1-\nu),c_{2}=\nu\beta(1-\beta)-AL_{0}\eta\beta\nu(1-\beta\nu),c_{3}=\nu\beta(1+\beta)-AL_{0}\eta(\beta\nu)^{2}$ . $c_{1}=(1-\beta)[2-\beta-AL_{0}\eta(1-\beta)]$ , $c_{2}=\beta[1-\beta-AL_{0}\eta(1-\beta)]$ and $c_{3}=\beta(1+\beta-AL_{0}\eta\beta)$ .

Because $\|\nabla f(x,\xi)\|\leq 4\gamma/5\eta+\sigma=\gamma/\eta$ and $\|m^{+}\|\leq\gamma/\eta$ , the algorithm performs an unnormalized update. The proof is similar to the one in Lemma B.5 except for bounding the term $F(x^{+})-F(x)$ .

The proof of Lemma B.14 is similar to the proof of Lemma B.8. We first write (52) again as follows:

We now merge the two cases corresponding to Corollary B.12 and Lemma B.13. The proof of the following theorem involves many techniques which are different from the deterministic case and is far more challenging.

Let $F^{*}$ be the optimal value, and $\Delta=F(x_{0})-F^{*}$ . Assume $m_{0}=\nabla F(x_{0})$ for simplicity. Fix $\varepsilon\leq 0.1$ be a small constant.If $\gamma\leq\frac{\varepsilon}{\sigma}\min\left(\frac{\varepsilon}{AL_{0}},\frac{1-\beta}{AL_{0}},\frac{1-\beta}{50BL_{1}}\right)$ and $\gamma/\eta=5\sigma$ where constants $A=1.01,B=1.01$ , then

Based on the previous results, we take summation over $t$ and obtain

We now simplify (93) by taking expectation. We first have

Applying the above equation recursively, we obtain

where the first inequality uses the proof of Corollary A.4 and Lemma B.10, and the last inequality uses $\gamma/\eta=5\sigma$ . By taking summation of the above inequality we obtain

Let $\gamma\leq\frac{\varepsilon}{2\sigma}\min\left(\frac{\varepsilon}{AL_{0}},\frac{1-\beta}{AL_{0}},\frac{1-\beta}{25BL_{1}}\right)$ , and fix the ratio $\gamma/\eta=5\sigma$ . Then for small enough $\varepsilon<0.1$ and large enough noise $\sigma>1$ ,

Applying the above estimates and rearranging (102), we have

Due to Lemma B.14 ( $AL_{0}\eta\sigma\leq\frac{\varepsilon}{10}(1-\beta),BL_{1}\gamma\leq\frac{\varepsilon}{50}(1-\beta)$ ), we clearly have

Plugging (LABEL:sigma_mt) into (LABEL:rearrange), we obtain

It clearly follows that $\min\{U(x),V(x)\}\geq\frac{4}{5}\varepsilon\eta\|\nabla F(x)\|-\frac{4}{5}\varepsilon^{2}\eta$ . Therefore

We finally show $G(x_{0})-F^{*}+P_{0}=O(F(x_{0})-F^{*})$ . Using Lemma A.5,

For the term $P_{0}$ , if $\|\nabla F(x_{0})\|=\Omega(L_{0}/L_{1})$ , we can similarly use Lemma A.5 to obtain $P_{0}=\mathcal{O}(F(x_{0})-F^{*})$ . If $\|\nabla F(x_{0})\|=\mathcal{O}(L_{0}/L_{1})$ , using $L_{0}\|\nabla F(x_{0})\|\leq L_{1}\|\nabla F(x_{0})\|^{2}$ and Lemma A.5 leads to the result. $\square$

Appendix C Discussion of the normalized momentum algorithm

In this section we analyze in detail the theoretical aspects of the normalized momentum algorithm, as well as some practical issues. Recall that this algorithm can be seen as a special case of our clipping framework. For convenience we re-write it in Algorithm 2.

We remark that SNM is different from the clipping methods in traditional sense, in that it makes a normalized update each iteration. This algorithm has been analyzed in Cutkosky and Mehta for $L$ -smooth functions. In that setting they were able to prove that SNM achieves a complexity of $\mathcal{O}(\Delta L\sigma^{2}\varepsilon^{-4})$ .

For $(L_{0},L_{1})$ -smooth functions, we show that: (a). With carefully chosen momentum parameter $\beta$ and step size $\eta$ , SNM can achieve a complexity of $\mathcal{O}(\Delta L_{0}\sigma^{2}\varepsilon^{-4})$ , which is the same as the complexity we obtain in Theorem 3.2. (b). There are some practical issues that make SNM less favorable than traditional clipping methods (such as the other three special cases of our framework discussed in Section 3 of the main paper).

The following results provides convergence guarantee for Algorithm 2.

Consider the algorithm that starts at $x_{0}$ and make updates $x_{t+1}=x_{t}-\eta m_{t+1}$ . Define $\delta_{t}:=m_{t+1}-\nabla F(x_{t})$ be the estimation error. Assume $\eta\leq c/L_{1}$ for some $c>0$ and let constants $A=1+e^{c}-\frac{e^{c}-1}{c},B=\frac{e^{c}-1}{c}$ . Then

Since $\|x_{t+1}-x_{t}\|=\eta_{t}$ , by Lemma A.3 we have

where in the second inequality we use Lemma B.1. $\square$

holds in $T=\mathcal{O}(\Delta L_{0}\sigma^{2}\varepsilon^{-4})$ iterations.

Define the estimation errors $\delta_{t}:=m_{t+1}-\nabla F(x_{t})$ . Denote $S(a,b):=\nabla F(a)-\nabla F(b)$ , then for $a,b$ such that $\|a-b\|=\eta\leq c/L_{1}$ , we can upper bound $S(a,b)$ using Corollary A.4:

We can use $S(a,b)$ to get a recursive relationship:

Denote $\delta_{t}^{\prime}=\nabla f(x_{t},\xi_{t})-\nabla F(x_{t})$ , then

Using triangle inequality and plugging in the estimate (113) , we have

Taking a telescope summation of 115 and using Assumption 2.4 we obtain

If we choose $\eta=\Theta\left(\min(L_{1}^{-1},L_{0}^{-1}\varepsilon)\alpha\right)$ and $\alpha=\Theta\left(\sigma^{-2}\varepsilon^{2}\right)$ , then

We have shown the theoretical superiority of Algorithm 2. Specifically, it enjoys the same complexity as Theorem 3.2. However we notice some potential drawbacks of Algorithm 2:

Firstly, the step size of Algorithm 2 is at the order of $\mathcal{O}\left(\varepsilon^{3}\right)$ , while the step size we chose in Theorem 3.2 is $\mathcal{O}\left(\varepsilon^{2}\right)$ . Previous works have noticed that a smaller step size makes it easier to be trapped in a sharp local minima , which may result in worse generalization [Kleinberg et al., 2018].

Secondly, although the complexity of Algorithm 2 is the same as Theorem 3.2 for small $\varepsilon$ , it requires a more restrictive upper bound of $\varepsilon$ to ensure the $\varepsilon^{-4}$ term dominates. For instance with a poor initialization, $\Delta$ may very large. This suggests that in practice, where we do not get into a very small neighbourhood of stationary point, the performance of Algorithm 2 may be worse.

Appendix D Details of Lower Bounds in Section 3.3

In this section we discuss the lower bound for SGD in Drori and Shamir in detail. The following result is taken from this paper:

Now we discuss why this shows the optimality of clipped SGD under Assumptions 2.1, 2.2 and 2.4.

Firstly, Theorem D.1 assumes an upper bound $\Delta$ on $F(x_{0})-F(x_{t})$ rather than the one assumed in Assumption 2.1 ( $F(x_{0})-F^{*}\leq\Delta$ ). However, in fact we only need to assume that $F(x_{0})-F(x_{T})\leq\Delta$ to prove Theorem 3.2 for clipped SGD. The reason is as follows. In fact, since $\beta=0$ for clipped SGD , the momentum term in the Lyapunov function disappears, as well as the term $\frac{\nu\beta}{(1-\beta)^{2}}AL_{0}\eta^{2}\sigma\|\nabla F(x_{0})\|$ in (102). So we no longer need to use Lemma A.5 to bound the term $\|\nabla F(x_{0})\|$ . The rest of the proof only needs $F(x_{0})-F(x_{T})\leq\Delta$ (which is used in the telescope sum in (102)).

Secondly, although Theorem D.1 only assume that the variance of stochastic gradient is bounded, in their construction the noise is actually defined as

Therefore the norm of the noise is bounded by $\sigma$ , and the example used to prove Theorem D.1 still works under Assumption 2.4.

Now suppose we need an output such that $\|\nabla f(x_{\text{out}})\|=\|\gamma\|\leq\varepsilon$ , then it follows from Theorem D.1 that $T=\Omega\left(L\Delta\sigma^{2}\varepsilon^{-4}\right)$ . Therefore we have shown the optimality of clipped SGD in this class of algorithms, as stated in Section 3.3.

Appendix E Justifications on the Mixed Clipping

In the above optimization problem, the general update formula can be written as:

We now analyze three cases based on the proposition:

We further plot the value of (119) with respect to $\nu$ and $\beta$ in Figure 3 to visualize the above finding. It can be clearly seen that the using both gradient and momentum with a proper interpolation factor $\nu$ outperforms both SGD and SGD with momentum by a large margin (Figure 3(a)). Furthermore, we can drive $\beta\rightarrow 1$ to further improve convergence (Figure 3(b)), while in SGD with momentum we can not.

𝑥𝜉2f(x,\xi)=\frac{1}{2}(x+\xi)^{2}. The mixed update with proper $\nu$ outperforms both SGD and SGD with momentum by a large margin. Furthermore, for the mixed update we can drive $\beta\rightarrow 1$ to further improve convergence, while for SGD with momentum we can not. Although we use the simple function $F(x)=\frac{1}{2}x^{2}$ as an example, similar result exists in any general quadratic form with positive definite Hessian. Furthermore, the experiments in Section 4 also demonstrate that the mixed clipping outperforms both gradient clipping and momentum clipping.

Combining (120), (121) and (122), we can write the recursive relationship into a matrix form:

Denote the above matrix as $M$ . After a straightforward calculation, we can find that $\lambda_{1}=1$ is an eigenvalue of $M$ , and

is the only eigenvector associated with $\lambda_{1}=1$ . Similarly, $\lambda_{2}=\beta$ is also an eigenvalue of $M$ . Let the other two eigenvalues be $\lambda_{3}$ and $\lambda_{4}$ , then

It follows that $\lambda_{3}\lambda_{4}=\beta^{2}$ and $\lambda_{3}+\lambda_{4}=(\eta(1-\beta)-\beta-1)^{2}-2\beta$ . Since $(1+\beta-\eta(1-\beta))^{2}<(1+\beta)^{2}$ , we have $\lambda_{3}+\lambda_{4}<1+\beta^{2}$ . Therefore $|\lambda_{3}|<1$ and $|\lambda_{4}|<1$ (note that $\lambda_{3}$ and $\lambda_{4}$ can be composite numbers). If $\eta<1$ , we can further conclude that the four eigenvalues are different from each other (otherwise $\lambda_{3}=\lambda_{4}=\beta$ , which contradicts to $\lambda_{3}+\lambda_{4}=(\eta(1-\beta)-\beta-1)^{2}-2\beta$ ).

E.1.2 Proof of the general case

Now we prove Proposition E.1 for general $\nu$ .

Combining (120), (124) and (125), we obtain the following recursive matrix $M$ :

Using the same calculation as in the previous section, we finally get

Appendix F Soft Clipping

For Algorithm 1, as long as the norm of the gradient (or momentum) exceeds a constant, it is then clipped; we refer to this form of clipping as hard clipping. One can also consider a soft form of clipping, as presented in Algorithm 3.

We take $\nu=0$ for example to analyze soft clipping. For any gradient norm $l_{\text{g}}$ , the norm of the update $l_{\text{u}}$ is a function of $l_{\text{g}}$ :

For hard clipping, we can similarly write

Therefore soft clipping is in fact equivalent to hard clipping up to a constant factor 2 in the step size choice. Thus it’s easy to see that our results also hold for Algorithm 3. However, compared to hard clipping, soft clipping has the advantage that the function $h_{\text{soft}}$ in (127) is smooth while $h_{\text{hard}}$ in (128) is not, as shown in Figure 4. We also empirically observe that the training curve of soft clipping is more smooth than hard clipping.

Appendix G Experimental Details in Section 4

Based on the discussion in Appendix F, we use the soft version of clipping algorithms in all the experiments.

The CIFAR-10 dataset contains 50k images for training and 10k for testing. All the images are $32\times 32$ RGB bitmaps. We use the standard ResNet-32 architecture. The total number of parameters is 466,906. For all algorithms, we use mini-batch size 128 and weight decay $5\times 10^{-4}$ . For the baseline algorithm, we use SGD with momentum using learning rate $lr=1.0$ and momentum factor $\beta=0.9$ . Note that we use the momentum defined in Algorithm 1, which is equivalent to a Pytorch implementation with $lr=0.1$ and $\beta=0.9$ . We optimize ResNet-32 for 150 epochs, and decrease the learning rate at epoch 80 and epoch 120. For other algorithms, we perform a course grid search for $lr$ an $\gamma$ , while keeping all the training strategy the same as SGD. We use 5 random seeds ranging from 2016 to 2020, and the results are similar. The plot in Figure 2 uses the random seed 2020.

G.2 PTB

The Penn Treebank dataset has a vocabulary of size 10k, and 887k/70k/78k words for training/validation/testing. We use the state-of-the-art AWD-LSTM architecture using hidden size 1150 and embedding size 400. The total number of parameters is 23,941,600. For the baseline algorithm, we follow Merity et al. who use averaged SGD clipping without momentum using learning rate $lr=30$ and $\gamma=7.5$ . Note that here $\gamma=7.5$ means that the gradient norm will be clipped to be no more than 0.25. We use the same dropout rate and regularization hyper-parameters in [Merity et al., 2017]. We train AWD-LSTM for 250 epochs, and averaging is triggered when the validation perplexity stops improving. For other algorithms, we perform a course grid search for $lr$ an $\gamma$ , while keeping all the training strategy the same as SGD clipping. We use 5 random seeds ranging from 2016 to 2020, and the results are similar. The plot in Figure 2 uses the random seed 2020.

G.3 ImageNet

We also conduct experiments on ImageNet dataset. This dataset contains about 1.28 million training images and 50k validation images with various sizes. We train the standard ResNet-50 architecture on this dataset. The total number of parameters is 25,557,032. We use a batch size of 256 on 4 GPUs and a weight decay of $10^{-4}$ . For the baseline algorithm, we choose SGD with learning rate $lr=1.0$ and momentum $\beta=0.9$ , following Goyal et al. . Note that we use the momentum defined in Algorithm 1, which is equivalent to a Pytorch implementation with $lr=0.1$ and $\beta=0.9$ . We train the ResNet-50 for 90 epochs, and decrease the learning rate in epoch 30, epoch 60 and epoch 80. For the other algorithms, we perform a course grid search for $lr$ an $\gamma$ , while keeping all the training strategy the same as SGD.

Appendix H Additional Experimental Results

Comparsion with Adam optimizer. Adam is a popular optimizer to train neural networks on a variety of tasks. We also run the same experiments in Section 4 using Adam optimizer with best hyper-parameters in order to compare it with clipping algorithms. We turn the learning rate of Adam based on a grid search, and choose the momentum hyper-parameters $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . The results are shown in Figure 5(a) and 5(b).

It is clear from the figure that Adam trains really fast on both datasets. It is even faster than gradient clipping and momentum clipping. However, it does not outperform the mixed clipping method. Note that like clipping algorithms, Adam also uses adaptive gradient. However, Adam generalizes much worse than vanilla SGD or clipping algorithms. Particularly, the test accuracy of CIFAR-10 using Adam is only 91.5%, while all other algorithms reach 93% test accuracy.

CIFAR-10 classificatoin with ResNet-18. The original work of gradient clipping in Zhang et al. [2020a] conducts experiments using ResNet-18 on CIFAR-10. To compare with it, we also run the same experiments using different algorithms, e.g. SGD, SGD clipping, momentum clipping and mixed clipping. All the algorithms reach 95% test accuracy, and the result is similar to that of using ResNet-32 architeture (see Figure 5(c)).

In this section, we are aiming to construct an optimization problem which provably satisfies the $(L_{0},L_{1})$ -smoothness condition in this paper rather than the traditional $L$ -smoothness condition. We then conduct experiments in both deterministic setting and stochastic setting.

In fact, if the exponential function $\exp(\cdot)$ is replaced by $\log(1+\exp(\cdot))$ , the problem becomes the well-known logistic regression. However, logistic loss has bounded second-order derivative (thus is $L$ -smooth), while $\exp(\cdot)$ does not. Furthermore, exponential function is (0,1)-smooth, thus we expect $L(w,b)$ is also $(L_{0},L_{1})$ -smooth for some $L_{0},L_{1}$ (see the following proposition). This is why we use exponential loss here. We point out that such exponential loss is also used in a variety of algorithms, such as boosting (AdaBoost).

When the dataset is linearly separable, parameter $w$ will be driven to infinity through optimization, thus adding some regularization is prevalent in linear classification. We use the following term (131) rather than $L_{2}$ norm for regularization, in order to be compatible with $L(w,b)$ .

In fact, $R_{\lambda}(w)$ is similar to weight decay regularization in that $R_{\lambda}(w)=\frac{1}{2}\lambda^{2}\|w\|^{2}+O(\lambda^{4}\|w\|^{4})$ when $w$ is small.

The total loss $E_{\lambda}(w,b)=L(w,b)+R_{\lambda}(w)$ . We now claim that $E_{\lambda}(w,b)$ is indeed $(L_{0},L_{1})$ -smooth.

Assume bias term $b=0$ for simplicity. Suppose the data points have bounded norm, i.e. $\|x_{i}\|\leq R$ for all $i$ and $\lambda<R$ . Let the loss function $E_{\lambda}(w,0)$ be defined above. Then for every $\rho_{1}>0,\rho_{2}>0$ , $\rho=\rho_{1}+\rho_{2}$ , $E_{\lambda}(w,0)$ is $(L_{0},L_{1})$ -smooth w.r.t $w$ for

We use MNIST dataset in this section, which contains 60,000 hand-writing training images. We only evaluate the training speed for different algorithms on the training set rather than the generalization capability. The loss functions is defined to be the sum of ten losses, each of which corresponds to the loss of a binary classification problem to recognize number 0 to 9. Regularization coefficient $\lambda$ is set to be 0.02.

To compare different algorithms, we choose the best hyperparameters $lr$ and $\gamma$ for each algorithm based on a careful grid search. $\nu$ is set to be 0.7 for mixed clipping. The parameter initalization and all inputs in the schocastic setting are the same for all algorithms. For each run, we average the loss of the last 5 epoch in order to reduce variance. In the deterministic setting we train 500 epochs, each of which uses the entire dataset. In the stochastic setting we train 50 epochs with a mini-batch size 200. We run on 5 different random seeds ranging from 2016 to 2020 altogether and average their results.

Figure 6 plots the results. It is clear that in both settings, clipping is vital to a fast convergence. Also, momentum helps training, and mixed clipping performs the best in the stochastic setting.

Let $M=\mathop{\max}\limits_{i\in[n+2d]}w^{T}z_{i}$ . Let $\rho_{1}>0,\rho_{2}>0$ be two constants. Pick $M_{0}=\left(1+\frac{1}{\rho_{2}}\right)\log\frac{n(R^{2}+d\lambda^{2})}{\rho_{1}R^{2}}$ . We consider the following two cases:

(1) $M\leq M_{0}$ . In this case $\|\nabla^{2}E(w)\|$ can be directly upper bounded:

The first inequality in (134) uses the triangular inequality of matrix spectral norm and $\|zz^{T}\|=\|z^{T}z\|=\|z\|^{2}$ .

(2) $M>M_{0}$ . Decompose $M_{0}$ to be $M_{0}=M_{1}+M_{2}$ where

Define set $I=\{i\in[n+2d]:w^{T}z_{i}\geq M-M_{1}\}$ and $I_{2}=\{i\in[n+2d]:w^{T}z_{i}<0\}$ . Then

In (136) we use the Cauchy-Schwartz inequality; In (137) we partition the index $\{i:i\in[n+2d]\}$ to three subsets $I$ , $I_{2}$ and $[n+2d]\backslash(I\cup I_{2})$ , and use the lower bound and upper bound of $w^{T}z_{i}>0$ for each set.

Similar, we can upper bound $\|\nabla^{2}E(w)\|$ :

To bound $\exp(M_{1})$ , we again bound $\|\nabla E(w)\|$ from a different perspective:

where (142) is obtained by selecting the $i$ with the largest $w^{T}z_{i}$ which is equal to $M$ . Substitute (138) and (142) into (140) then we get

Since $M=\mathop{\max}\limits_{i\in[n+2d]}w^{T}z_{i}$ implies that $|\lambda w_{k}|\leq M$ for all $k\in[d]$ from (132), we can upper bound the norm of $w$ : $\|w\|\leq\frac{M\sqrt{d}}{\lambda}$ . Substitute this into (144) we get

Combining the above two cases concludes the proof.