Improved Analysis of Clipping Algorithms for Non-convex Optimization

Bohang Zhang, Jikai Jin, Cong Fang, Liwei Wang

Introduction

The problem of the central interest in this paper is to minimize a general non-convex function presented below:

where F(x)F(x) can be potentially stochastic, i.e.

For non-convex optimization problems in form of (1), since obtaining the global minimum is NP-hard in general, this paper takes the concern on a reasonable relaxed criteria: finding an ε\varepsilon-approximate first-order stationary point such that F(x)ε\|\nabla F(x)\|\leq\varepsilon.

Gradient clipping (Pascanu et al., 2012) is a simple and commonly used trick in algorithms that adaptively choose step sizes to make optimization stable. For the task of training deep neural networks (especially for language processing tasks), it is often a standard practice and is believed to be efficient in relieving the exploding gradient problem from empirical studies (Pascanu et al., 2013). More recently, Zhang et al. (2020a) proposed an inspiring theoretical justification on the clipping technique via introducing the (L0,L1)(L_{0},L_{1})-smoothness assumption. The concept of (L0,L1)(L_{0},L_{1})-smoothness is defined as follows.

This assumption can be further relaxed such that twice differentiability is not required (see Remark 2.3). Therefore the standard LL-smoothness assumption (i.e. the gradient of ff is LL-Lipschitz continuous) is stronger than the (L0,L1)(L_{0},L_{1})-smoothness one in the sense that the latter allows 2F(x)\|\nabla^{2}F(x)\| to have a linear growth with respect to F(x)\|\nabla F(x)\|.

Secondly, Zhang et al. (2020a) performed experiments to show that (L0,L1)(L_{0},L_{1})-smoothness is a preciser characterization of the landscapes for objective functions in many real-world tasks, especially for training a deep neural network model. It was observed that the local Lipschitz constant L0L_{0} near the stationary point is thousands of times smaller than the global one LL in the LSTM training (see Figure 1(c) taken from Zhang et al. (2020a)).

Seeing this, it is desirable to give a comprehensive and deep analysis on iteration complexities for (L0,L1)(L_{0},L_{1})-smooth objectives. How fast can we achieve to find a first-order stationary point for (L0,L1)(L_{0},L_{1})-smooth functions? What are simple algorithms that provably achieve such a convergence rate? In this paper, we give affirmative answers to the above questions. In fact, due to the violent fluctuation of gradients, the efficiency of (stochastic) Gradient Descent with a constant step size degenerates, whereas we will show in this paper that by simply combining the clipping technique, a wide range of algorithms can achieve much better convergence rate for (L0,L1)(L_{0},L_{1})-smooth functions. In fact, when ε\varepsilon is small, the complexities (i.e. the number of gradient queries required) are O(ΔL0ε2)\mathcal{O}\left(\Delta L_{0}\varepsilon^{-2}\right) for the deterministic setting and O(ΔL0σ2ε4)\mathcal{O}\left(\Delta L_{0}\sigma^{2}\varepsilon^{-4}\right) for the stochastic setting (see Section 3 for details), which are both independent of L1L_{1}. Compared with Zhang et al. (2020a) who only studied clipped (stochastic) gradient descent, we consider proposing a unified framework which contains a variety of clipping-based algorithms and achieve much sharper complexities. The main technique for our proof is by introducing a novel Lyapunov function which does not appear in existing studies. We believe that our work provides better understandings for the clipping technique in training deep neural networks. We summarize the contributions of the paper in the following.

We provide a general framework to analyze the clipping technique for optimizing (L0,L1)(L_{0},L_{1})-smooth functions. It contains a variety of clipping algorithms, including gradient clipping and momentum clipping as special cases.

We provide convergence analysis for the general framework we propose. We show that our bounds are tight by comparing with existing lower bounds. For gradient clipping, a special case in our framework, our result is much sharper than that proposed by Zhang et al. (2020a).

We conduct extensive experiments on a variety of different tasks, and observe that the clipping algorithms consistently perform better than vanilla ones.

Clipping/normalizing Techniques. Clipping/normalizing has long been a popular technique in optimizing large-scale non-convex optimization problems (e.g. (Mikolov, 2012; Pascanu et al., 2013; Goodfellow et al., 2016; You et al., 2017)). There are several views which provide understandings for the clipping and normalizing techniques. Some show that clipping can reduce the stochastic noise. For example, Zhang et al. (2019); Gorbunov et al. (2020) showed that clipping is crucial for convergence when the stochastic gradient noise is heavy-tailed. Menon et al. (2020) pointed out that clipping can mitigate the effect of label noise. Cutkosky and Mehta (2020) found that adding momentum in normalized SGD provably reduces the stochastic noise. Another line of works try to understand the function of clipping and normalizing for the standard smooth optimization. For example, Levy (2016) showed that normalized GD can provably escape saddle points. Fang et al. (2018) designed a new algorithm based on normalized GD which achieves a faster convergence rate under suitable conditions. Gradient clipping has also been used to design differentially private optimization algorithms (Abadi et al., 2016).

The work of Zhang et al. (2020a) is mostly related to this paper. A detailed comparison between the two works is shown in Subsection 2.2.

Lower Bounds For Non-convex Optimization. A series of recent works establish lower bounds for finding an ε\varepsilon-stationary point of a general non-convex and LL-smooth function, either in deterministic setting (Carmon et al., 2019) or in stochastic setting (Drori and Shamir, 2019; Arjevani et al., 2019). In this paper we borrow their counter examples to show the tightness of our obtained complexities for the general (L0,L1)(L_{0},L_{1})-smooth functions.

Assumptions & Comparisons of Results

We first present the assumptions that will be used in our theoretical analysis, which follows from Zhang et al. (2020a).

We assume that F(x)F(x) is (L0,L1)(L_{0},L_{1})-smooth.

It does not need FF to be twice differentiable and is strictly weaker than LL-smoothness.

In the stochastic setting, for the briefness of our analysis, we assume that the noise is unbiased and bounded.

2 Comparisons of Results

In this paper, we mainly take our concern on the stochastic setting, though we also establish the complexities for the deterministic setting. We summarize the comparison in the stochastic setting with existing complexity results in Table 1, in which we only present the dominating complexities with respect to ε\varepsilon.

Compared with the bound for clipped SGD established in Zhang et al. (2020a), our results improve theirs on the dependencies for all problem-dependent parameters, i.e. Δ\Delta, σ\sigma, L0L_{0}, and especially L1L_{1} by order. For SGD with arbitrarily chosen step sizes (thus include clipped SGD), the example in Drori and Shamir (2019) can be used to show that clipped SGD is optimal (cf. Section 3.3).

General Analysis of Clipping

We aim to present a general framework in which we can provide a unified analysis for commonly used clipping-based algorithms. Since momentum is one of the most popular acceleration technique in optimization community, our framework takes this acceleration procedure into account. We show our framework in Algorithm 1, where we can simply replace f(xt,ξt)\nabla f(x_{t},\xi_{t}) by F(xt)\nabla F(x_{t}) for the deterministic setting.

We notice that our framework is similar to the Quasi-Hyperbolic Momentum(QHM) algorithm proposed by Ma and Yarats (2018), while they did not consider the clipping technique. They pointed out that QHM contains a wide range of popular algorithms (e.g. SGD+momentum, Nesterov Accelerated SGD, AccSGD, etc). As a result, for different choice of hyper-parameters, our framework encompasses the clipping version of all these algorithms. We now discuss several representative examples in our framework.

Gradient Clipping. By choosing ν=0\nu=0 in Algorithm 1, we obtain the clipped GD/SGD algorithm which can be written as:

It follows that in gradient clipping, the gradient is clipped to have its norm no more than γ/η\gamma/\eta.

Momentum Clipping. By choosing ν=1\nu=1 in Algorithm 1, we perform the update using a clipped version of momentum which can be written as:

The approach has already been used in previous works (Zhang et al., 2019, 2020b), albeit in different settings. To the best of our knowledge, there is no existing analysis of this algorithm even for optimizing standard LL-smooth functions.

Mixed Clipping. By choosing ν(0,1)\nu\in(0,1), we obtain the mixed clipping algorithm. Although this form of clipping is not widely used in practice, we observe from experiments that it typically converges faster than both gradient clipping and momentum clipping. Some explanations of this observation are provided in Appendix E.

Normalized Momentum. By choosing ν=1\nu=1 and η+\eta\rightarrow+\infty in Algorithm 1, we recover the normalized SGD+momentum algorithm. This algorithm performs a normalized (rather than clipped) update in each iteration. It has been analyzed in Cutkosky and Mehta (2020) for LL-smooth functions and a layer-wise variant was used in the LARS algorithm (You et al., 2017). We will provide a detailed discussion of this algorithm in the Appendix C.

In this section we first deal with the deterministic case, in which we can get a strong justification that clipping is a natural choice to optimize (L0,L1)(L_{0},L_{1})-smooth functions. We have the following Theorem.

[Convergence of Algorithm 1, Deterministic Setting] Let the function FF satisfy Assumptions 2.1 and 2.2. Set m0=F(x0)m_{0}=\nabla F(x_{0}) in Algorithm 1 for simplicity. Fix ε>0\varepsilon>0 be a small constant. For any 0β<10\leq\beta<1 and 0ν10\leq\nu\leq 1, if γ1β10BL1\gamma\leq\frac{1-\beta}{10BL_{1}} and η1β10AL0\eta\leq\frac{1-\beta}{10AL_{0}} where A=1.06A=1.06, B=1.06B=1.06, then

In Theorem 3.1, the (L0,L1)(L_{0},L_{1})-smoothness is precisely reflected in the restriction of hyper-parameters γ=O(1/L1)\gamma=\mathcal{O}(1/L_{1}) and η=O(1/L0)\eta=\mathcal{O}(1/L_{0}). For large L1L_{1}, we must use a small clipping hyper-parameter to guarantee convergence. This also coincides with the intuition that in highly non-smooth regions we should take a small step.

Theorem 3.1 states that in the deterministic setting, for any ε>0\varepsilon>0, our framework can find an ε\varepsilon-approximate stationary point in O(Δmax{L0ε2,L12L0})\mathcal{O}\left(\Delta\max\left\{\frac{L_{0}}{\varepsilon^{2}},\frac{L_{1}^{2}}{L_{0}}\right\}\right) gradient evaluations if we choose γ=Θ(1/L1)\gamma=\Theta\left({1}/{L_{1}}\right) and η=Θ(1/L0)\eta=\Theta\left({1}/{L_{0}}\right). When ε=O(L0/L1)\varepsilon=\mathcal{O}(L_{0}/L_{1}) , the dominating term is O(ΔL0ε2)\mathcal{O}\left(\Delta{L_{0}}{\varepsilon^{-2}}\right).

Now we turn to our main result in the stochastic setting. We have the following theorem.

[Convergence of Algorithm 1, Stochastic setting] Let the function FF satisfy Assumptions 2.1 and 2.2, and the noise satisfies Assumption 2.4 with σ1\sigma\geq 1. Set m0=F(x0)m_{0}=\nabla F(x_{0}) in Algorithm 1 for simplicity. Fix 0<ε0.10<\varepsilon\leq 0.1 be a small constant. For any 0β<10\leq\beta<1 and 0ν10\leq\nu\leq 1, if γε2σmin{εAL0,1βAL0,1β25BL1}\gamma\leq\frac{\varepsilon}{2\sigma}\min\left\{\frac{\varepsilon}{AL_{0}},\frac{1-\beta}{AL_{0}},\frac{1-\beta}{25BL_{1}}\right\} and γ/η=5σ\gamma/\eta=5\sigma where constants A=1.01,B=1.01A=1.01,B=1.01, then

Here the expectation is taken over all the randomness ξ0,,ξT1\xi_{0},\cdots,\xi_{T-1}.

Theorem 3.2 shows that in the stochastic setting, for any ε>0\varepsilon>0, our framework can find an ε\varepsilon-approximate stationary point in O(Δσ2(max{L0ε4,L14L03}))\mathcal{O}\left(\Delta\sigma^{2}\left(\max\left\{\frac{L_{0}}{\varepsilon^{4}},\frac{L_{1}^{4}}{L_{0}^{3}}\right\}\right)\right) gradient evaluations. When ε<min{1,L025L1}(1β)\varepsilon<\min\left\{1,\frac{L_{0}}{25L_{1}}\right\}(1-\beta) , the term min{εAL0,1βAL0,1β25BL1}\min\left\{\frac{\varepsilon}{AL_{0}},\frac{1-\beta}{AL_{0}},\frac{1-\beta}{25BL_{1}}\right\} reduces to εAL0\frac{\varepsilon}{AL_{0}}. In this case L1L_{1} no longer affects the choice of steps sizes η\eta and γ\gamma, and the complexity in (4) reduces to O(ΔL0σ2ε4)\mathcal{O}\left(\Delta L_{0}\sigma^{2}\varepsilon^{-4}\right).

Theorem 3.2 suggests that the clipping threshold should take γ/η=Θ(σ)\gamma/\eta=\Theta(\sigma), which only depends on the noise and is several times larger than the its variance. This matches previous understanding of gradient clipping, in that clipping the stochastic gradient controls the variance while introducing some additional bias, and the clipping threshold should be tuned to trade-off variance with the introduced bias (Zhang et al., 2019).

We emphasize that in both settings, the dominating terms in our upper bounds are independent of the gradient upper bound MM and the smoothness parameter L1L_{1}. In other words, the efficiency of Algorithm 1 is essentially unaffected by these quantities. Recall that MM and L1L_{1} are related to steep cliffs in the landscape where the gradient may be large or fluctuate violently. Therefore, our results suggest that such non-smoothness can be tackled with clipping methods without sacrificing efficiency.

2 Proof Sketch

The analysis of Algorithm 1 is in fact challenging, as it uses both momentum and adaptive step sizes. Also, the general (L0,L1)(L_{0},L_{1})-smoothness assumption makes things more complicated. In this subsection we briefly introduce our proof technique. We hope our proof is also useful to a better understanding of other adaptive algorithms that combine momentum (such as Adam (Kingma and Ba, 2014)).

Proof sketch of Theorem 3.1. Due to the momentum term, each step in Algorithm 1 is not necessarily a descent one, which makes it difficult to prove convergence using traditional techniques. Instead, we construct a novel Lyapunov function as follows:

We prove this lemma by using the (L0,L1)(L_{0},L_{1})-smoothness properties deduced in Appendix A and conducting a comprehensive discussion on three cases in Lemmas B.1-B.3. Furthermore, we show in Lemma B.4 that tSF(xt)mt=O(γTS(ρ+tSF(xt)))\sum_{t\in\mathcal{S}}\|\nabla F(x_{t})-m_{t}\|=\mathcal{O}\left(\gamma T_{\mathcal{S}}(\rho+\sum_{t\in\mathcal{S}}\|\nabla F(x_{t})\|)\right). By choosing a small enough γ\gamma and carefully dealing with constants, we can conclude that the total amount of decrease of the Lyapunov function is Ω(ργTS)\Omega\left(\rho\gamma T_{\mathcal{S}}\right).

For any tSt\in\overline{\mathcal{S}}, if η=O(1/L0)\eta=\mathcal{O}(1/L_{0}) and γ=O(1/L1)\gamma=\mathcal{O}(1/L_{1}), then we have

We prove (7) in Lemma B.5-B.7 by the fact that Algorithm 1 performs an unclipped update if tSt\in\overline{\mathcal{S}}. Since the bound (7) is small in term of F(xt)\|\nabla F(x_{t})\| if β\beta and ν\nu are close to 1, we convert mt\|m_{t}\| to F(xt)\|\nabla F(x_{t})\| by proving that tSmt=Ω(tSF(xt)γ(ρTS+tSF(xt)))\sum_{t\in\overline{\mathcal{S}}}\|m_{t}\|=\Omega\left(\sum_{t\in\overline{\mathcal{S}}}\|\nabla F(x_{t})\|-\gamma(\rho T_{\mathcal{S}}+\sum_{t\in\mathcal{S}}\|\nabla F(x_{t})\|)\right) in Lemma B.8. The term related to S\mathcal{S} can all be offset by the terms in Lemma 3.3, and the term related to S\overline{\mathcal{S}} combined with η(1νβ)F(xt)2\eta(1-\nu\beta)\|\nabla F(x_{t})\|^{2} can be shown to ensure a descent amount of Ω(εηF(xt)ε2η)\Omega\left(\varepsilon\eta\|\nabla F(x_{t})\|-\varepsilon^{2}\eta\right), which is Ω(ε2η)\Omega\left(\varepsilon^{2}\eta\right) as long as F(xt)2ε\|\nabla F(x_{t})\|\geq 2\varepsilon.

Finally, by combining the above two cases we obtain the conclusion in Theorem 3.1.

Firstly, since the gradients are not exact, the stochastic gradient we have access to is not guaranteed to be small even for the case of tSt\in\overline{\mathcal{S}}. Fortunately, the choice of parameters in Theorem 3.2 (ρ=5σ\rho=5\sigma) settles such difficulty.

Finally, by choosing proper η\eta and γ\gamma, we can obtain Theorem 3.2.

3 Lower Bounds and Discussions

Theorem 3.1 and 3.2 provide upper bounds for the complexity of Algorithm 1. Now we compare these results with existing lower bounds and discuss the tightness of our results.

Deterministic Setting. Carmon et al. (2019) have shown that there exists an LL-smooth function FF such that any (possibly randomized) algorithm requires at least Ω(ΔLε2)\Omega\left(\Delta L\varepsilon^{-2}\right) queries to gradient to ensure finding a point xx such that F(x)ε\|\nabla F(x)\|\leq\varepsilon. Since Assumption 2.2 is weaker than LL-smoothness (L0LL_{0}\leq L), we have that the lower bound for (L0,L1)(L_{0},L_{1})-smooth functions is Ω(ΔL0ε2)\Omega\left(\Delta L_{0}\varepsilon^{-2}\right). From Theorem 3.1, Algorithm 1 is optimal since it can achieve the lower bound when ignoring numerical constants.

Stochastic Setting. From the example constructed in Drori and Shamir (2019), we have that for any SGD method with arbitrary (possibly adaptive) step sizes and aggregation schemesSee details in Appendix D., the complexity lower bound is exactly Ω(ΔL0σ2ε4)\Omega\left(\Delta L_{0}\sigma^{2}\varepsilon^{-4}\right) for (L0,L1)(L_{0},L_{1})-smooth functions that have σ\sigma-bounded gradient noises. Therefore Theorem 3.2 indicates that clipped SGD matches the lower bound.

One may ask : what is the lower bound for general stochastic gradient-based algorithms? In fact, from the example in Arjevani et al. (2019), we have that any algorithm needs Ω(ΔL0σ12ε4)\Omega\left(\Delta L_{0}\sigma^{2}_{1}\varepsilon^{-4}\right) stochastic gradient queries to find an ε\varepsilon-approximate stationary point for a hard (L0,L1)(L_{0},L_{1})-smooth function whose gradient noise has a σ12\sigma^{2}_{1}-bounded variance. It is our conjecture that the lower bound for optimizing (L0,L1)(L_{0},L_{1})-smooth functions that have σ\sigma-bounded gradient noises is also O(ΔL0σ2ε4)\mathcal{O}\left(\Delta L_{0}\sigma^{2}\varepsilon^{-4}\right). We leave the study as a future work.

Experiments

We conduct extensive experiments and find the clipping algorithms indeed consistently outperform their unclipped counterpart. We present experimental results on three deep learning benchmarks: CIFAR-10 classification using ResNet-32, Imagenet classification using ResNet-50 and language modeling on Penn Treebank (PTB) dataset using AWD-LSTM. We put all the experimental details in the Appendix G. Our code is available at https://github.com/zbh2047/clipping-algorithms.

CIFAR-10 classification with ResNet. We train the standard ResNet-32 (He et al., 2016) architecture on CIFAR-10. We use SGD with momentum for the baseline algorithm with a decaying learning rate schedule, which is the standard choice to train the ResNet architecture. We set learning rate η=1.0\eta=1.0, momentum β=0.9\beta=0.9 and minibatch size 128, following the common practice. For all the clipping algorithms, we choose the best η\eta and γ\gamma based on a course grid search, while keeping other hyper-parameters and training strategy the same as SGD+momentum. We simply set the hyper-parameters ν=0.7\nu=0.7 and β=0.999\beta=0.999 in mixed clipping, as suggested in Ma and Yarats (2018) (for its unclipped counterpart QHM). We run 5 times for each algorithm using different random seeds to make the results more reliable.

Figures 2(a) demonstrates the results. It can be seen that all the algorithms achieve a test accuracy more than 93% on CIFAR-10. Note that all clipping algorithms converge faster than SGD+momentum. Particularly, the mixed clipping (Algorithm 1) outperforms SGD+momentum by a large margin in term of training speed. As a result, one can possibly adopt a more aggressive learning rate decaying schedule to reduce training time considerably.

ImagNet classification with ResNet. We train the standard ResNet-50 (He et al., 2016) architecture on ImageNet. For the baseline algorithm, we choose SGD with learning rate lr=1.0lr=1.0 and momentum β=0.9\beta=0.9, following Goyal et al. (2017). We use batch size 256 on 4 GPUs.

Figure 2(b) plot the training loss curve and validation accuracy curve on ImageNet. All the algorithms reach a validation accuracy of about 76%. However, all the clipping algorithms train faster than the baseline SGD. Mixed clipping performs the best among the four algorithms.

Language modeling with LSTM. We train the state-of-the-art AWD-LSTM (Merity et al., 2017) on Penn Treebank (PTB) dataset (Mikolov et al., 2010). We first follow the training strategy in Merity et al. (2017), where they use averaged SGD without momentum with learning rate η=30\eta=30 and clipping parameter γ=7.5\gamma=7.5. Since our purpose is to compare different algorithms rather than to achieve state-of-the-art results, we only train AWD-LSTM for 250 epochs. We then evaluate other algorithms including standard SGD without clipping, momentum clipping, and mixed clipping. We choose the best η\eta and γ\gamma (using validation perplexity criterion) based on a course grid search. Results are shown in Figure 2(c).

Figure 2(c) clearly shows all clipping methods converge much faster than SGD without clipping, and are much better in term of validation perplexity. This is consistent with our theory, in that the vanilla SGD must use a very small learning rate to guarantee convergence (Zhang et al., 2020a), which will be slow and be harmful to generalization on validation set according to previous works (Huang et al., 2017; Kleinberg et al., 2018) . Therefore clipping technique is crucial in LSTM models. We can also find that the training and test curve of mixed clipping is much better than both gradient clipping and momentum clipping. The mixed clipping improves validation perplexity for more than 1 point compared to clipped SGD after 250 epochs.

Other experiments. We also conduct experiments to compare clipping algorithms with Adam, and to directly compare our work with previous results (Zhang et al., 2020a) under the same setting. See Appendix H for details. Finally, we construct a provably (L0,L1)(L_{0},L_{1})-smooth optimization problem using MNIST dataset. We then run experiments in both deterministic setting and stochastic setting. The results are shown in Appendix I.

Conclusion

This paper proposes a detailed study for clipping methods under a general framework. In particular, we explore the possibility of combining clipping with other popular techniques, e.g. momentum acceleration, in deep learning. We provide a general and tight analysis for the framework, showing the efficiency of clipping methods in optimizing a class of non-convex and non-smooth (in traditional sense) functions. Experiments confirm that these methods have superior performance. We hope that our work affords more understandings on the clipping technique and (L0,L1)(L_{0},L_{1}) smooth functions.

There are still many open questions that have not yet been answered. Firstly, as discussed in Section 3.3, we are not aware of any lower bounds for general first-order methods that can be applied our setting. Thus, it is interesting to explore such lower bound, or to relax Assumption 2.4 to the more general bounded variance assumption. Secondly, although we have shown the superiority of clipping-based methods, we do not provide theoretical explanation why some clipping schemes are better than others as observed in experiments. We believe that this can only be done by exploring new and better smoothness assumptions. Thirdly, the empirical superiority of other adaptive methods ( e.g. AdaGrad (Duchi et al., 2011), Adam (Kingma and Ba, 2014) ) have not been justified from a theoretical point of view. We hope that our analysis is helpful for the analysis of these methods. Finally, we are looking forward to seeing better optimization algorithms with better convergence properties in future work.

Acknowledgement

This work was supported by National Key R&D Program of China (2018YFB1402600), Key-Area Research and Development Program of Guangdong Province (No. 2019B121204008)] and Beijing Academy of Artificial Intelligence.

References

In this section, we prove some important properties of (L0,L1)(L_{0},L_{1})-smooth functions. These properties will be frequently used in subsequent sections.

We first present a basic lemma without proof.

(Gronwall’s inequality) [Gronwall, 1919] Let I=[a,b]I=[a,b] denote an interval of the real line with a<ba<b. Let f,g,hf,g,h be continuous real-valued functions defined on II. Assume gg is non-decreasing, hh is non-negative, and the negative part of gg is integrable on every closed and bounded subinterval of II.If

The following result, Lemma A.2 , is a generalization of Lemma 9 in Zhang et al. [2020a].

Let FF be (L0,L1)(L_{0},L_{1})-smooth, and c>0c>0 be a constant. Given xx, for any x+x^{+} such that x+xc/L1\left\|x^{+}-x\right\|\leq c/L_{1}, we have f(x+)ec(cL0L1+F(x))\left\|\nabla f\left(x^{+}\right)\right\|\leq e^{c}\left(\frac{cL_{0}}{L_{1}}+\|\nabla F(x)\|\right).

Let γ(t)\gamma(t) be defined as γ(t)=t(x+x)+x,t\gamma(t)=t(x^{+}-x)+x,t\in, then we have

We then bound the norm of F(γ(t))\nabla F(\gamma(t)):

The first inequality uses the triangular inequality of 2-norm; The second inequality uses the property of spectral norm; The third inequality uses the definition of (L0,L1)(L_{0},L_{1})-smoothness. By applying the Gronwall’s inequality we get

The Lemma follows by setting t=1t=1. \square

Now we are able to prove a descent inequality, which is similar to the descent inequality for LL-smooth functions. In fact, if a function FF is LL-smooth, it is well-known that for any x,yx,y, we have

(Descent Inequality) Let FF be (L0,L1)(L_{0},L_{1})-smooth, and c>0c>0 be a constant. For any xkx_{k} and xk+1x_{k+1}, as long as xkxk+1c/L1\|x_{k}-x_{k+1}\|\leq c/L_{1}, we have

where A=1+ecec1c,B=ec1cA=1+e^{c}-\frac{e^{c}-1}{c},B=\frac{e^{c}-1}{c}.

Let γ(t)\gamma(t) be defined as γ(t)=t(xk+1xk)+xk,t\gamma(t)=t(x_{k+1}-x_{k})+x_{k},t\in. The following derivation uses Taylor’s theorem (in (16)), then uses triangular inequality, Cauchy-Schwarz inequality and the property of spectral norm (in (17)):

Then we use (L0,L1)(L_{0},L_{1})-smoothness and (14) to bound 2F(γ(t))\left\|\nabla^{2}F(\gamma(t))\right\|:

Substituting (20) into (18) concludes the proof. \square

Let FF be (L0,L1)(L_{0},L_{1})-smooth, and c>0c>0 be a constant. For any xkx_{k} and xk+1x_{k+1}, as long as xkxk+1c/L1\|x_{k}-x_{k+1}\|\leq c/L_{1}, we have

where A=1+ecec1c,B=ec1cA=1+e^{c}-\frac{e^{c}-1}{c},B=\frac{e^{c}-1}{c}.

Using (20) leads to the results. \square

Finally we prove a result which provides a way to upper-bound the gradient norm. A similar result for LL-smooth functions is the following: if FF is LL-smooth, then for any xx, we have

(Bounding the gradient norm) Let F(x)F(x) be an (L0,L1)(L_{0},L_{1})-smooth function, and FF^{*} be the optimal value. Then for any x0x_{0}, we have

Define the constant c=L1F(x0)AL0+BL1F(x0)c=\frac{L_{1}\|\nabla F(x_{0})\|}{AL_{0}+BL_{1}\|\nabla F(x_{0})\|} and A=1+ecec1c,B=ec1cA=1+e^{c}-\frac{e^{c}-1}{c},B=\frac{e^{c}-1}{c}. It is easy to see that such 0c<10\leq c<1 exists. Let λ=1AL0+BL1F(x0)\lambda=\frac{1}{AL_{0}+BL_{1}\|\nabla F(x_{0})\|} and x=x0λF(x0)x=x_{0}-\lambda\nabla F(x_{0}). Then xx0c/L1\|x-x_{0}\|\leq c/L_{1}. By the descent inequality we have

If F(x)AL0BL1\|\nabla F(x)\|\geq\frac{AL_{0}}{BL_{1}}, then

If F(x)<AL0BL1\|\nabla F(x)\|<\frac{AL_{0}}{BL_{1}}, then

The original definition of (L0,L1)(L_{0},L_{1})-smoothness requires the function to be twice-differentiable. Under this definition, (L0,L1)(L_{0},L_{1})-smoothness is actually not weaker than LL-smoothness, which only requires the function to be continuous differentiable. In this section we prove that the alternative definition provided in Remark 2.3 is sufficient for all the results in this paper.

We check that Lemma A.2 and A.3 still holds under the new assumption (with L0,L1L_{0},L_{1} replaced by K0,K1K_{0},K_{1}, up to numerical constants) We immediately obtain from (27) above that

which is of the same form as Lemma A.2. Next, we have

Since all the other results are established on the basis of these two lemmas, we can see that the conclusion still holds under (27).

Appendix B Proof of Theorems

We first prove the deterministic case (Theorem 3.1), then generalize the result to stochastic case (Theorem 3.2). In deterministic case we can use fewer notations, which will make the proof more readable and elegant. The proof in stochastic case will rely on all the techniques used in the deterministic case, as well as some new methods.

To simplify the notation, we write the update formula as

when analyzing a single iteration. The error between m+m^{+} and F(x)\nabla F(x) is denoted as δ=m+F(x)\delta=m^{+}-\nabla F(x). Suppose γc/L1\gamma\leq c/L_{1} for some constant cc, and we denote A=1+ecec1cA=1+e^{c}-\frac{e^{c}-1}{c} and B=ec1cB=\frac{e^{c}-1}{c}, just the same as in the descent inequality (Lemma A.3).

Let μ0\mu\geq 0 be a real constant. For any vector uu and vv,

To prove the theorem, we will construct an Lyapunov function and explore the decreasing property of this function. We define the Lyapunov function G(x,m)G(x,m) to be

and analyze G(x+,m+)G(x,m)G(x^{+},m^{+})-G(x,m). We first bound min(ηm+2,γm+)min(ηm2,γm)\min\left({\eta}\|m^{+}\|^{2},{\gamma}\|m^{+}\|\right)-\min\left({\eta}\|m\|^{2},{\gamma}\|m\|\right).

For any momentum vectors mm and m+=βm+(1β)F(x)m^{+}=\beta m+(1-\beta)\nabla F(x), let δ=m+F(x)\delta=m^{+}-\nabla F(x), then

m<γ/η\|m\|<\gamma/\eta and m+<γ/η\|m^{+}\|<\gamma/\eta. In this case

m<γ/η\|m\|<\gamma/\eta and m+>γ/η\|m^{+}\|>\gamma/\eta. In this case

Thus in all cases min(ηm+2,γm+)min(ηm2,γm)\min\left({\eta}\|m^{+}\|^{2},{\gamma}\|m^{+}\|\right)-\min\left({\eta}\|m\|^{2},{\gamma}\|m\|\right) can be upper bounded by 2(1β)βγδ\frac{2(1-\beta)}{\beta}\gamma\|\delta\|. \square

Suppose max(F(x),m+,m)γ/η\max(\|\nabla F(x)\|,\|m^{+}\|,\|m\|)\geq\gamma/{\eta}. Then

We first write G(x+,m+)G(x,m)G(x^{+},m^{+})-G(x,m) as

Based on Lemma B.2, we only need to bound F(x+)F(x)F(x^{+})-F(x). We will use the (L0,L1)(L_{0},L_{1})-smoothness assumption.

Where the first inequality uses the descent inequality (Lemma A.3), the second equation follows from the update rule, and the last inequality is obtained by the following two inequalities:

First we prove that (37) holds by considering the following three cases:

m+γ/η\|m^{+}\|\geq\gamma/\eta. In this case the algorithm performs a normalized update. Then (37) follows by directly using Lemma B.1 with μ=2/5\mu=2/5:

m+<γ/η\|m^{+}\|<\gamma/\eta and F(x)γ/η\|\nabla F(x)\|\geq\gamma/\eta. In this case the algorithm performs an unnormalized update. We now prove ηF(x),m+25γF(x)3γ25η+75γF(x)m+-\eta\left\langle\nabla F(x),m^{+}\right\rangle\leq-\frac{2}{5}\gamma\|\nabla F(x)\|-\frac{3\gamma^{2}}{5\eta}+\frac{7}{5}\gamma\|\nabla F(x)-m^{+}\|.

m+<γ/η\|m^{+}\|<\gamma/\eta and F(x)<γ/η\|\nabla F(x)\|<\gamma/\eta. This is the most complicated case. Due to the condition in Lemma B.3, mγ/η\|m\|\geq\gamma/\eta. In this case, the algorithm also performs an unnormalized update. We first bound ηF(x),m\eta\left\langle\nabla F(x),m\right\rangle using the same calculation as in the second case:

where we use the fact that F(x)m=δ/β\|\nabla F(x)-m\|=\|\delta\|/\beta. We then bound ηF(x)2\eta\|\nabla F(x)\|^{2} as follows:

Combining the two inequalities, we obtain

Thus in all cases (37) holds. We now turn to (38) which is proven in a similar fashion. Specifically, consider the following three cases:

F(x)γ/η\|\nabla F(x)\|\geq\gamma/\eta. In this case

F(x)<γ/η\|\nabla F(x)\|<\gamma/\eta and m+γ/η\|m^{+}\|\geq\gamma/\eta. In this case bound ηF(x)2\eta\|\nabla F(x)\|^{2} the same as in the third case of (37):

F(x)<γ/η\|\nabla F(x)\|<\gamma/\eta and m+<γ/η\|m^{+}\|<\gamma/\eta. In this case mγ/η\|m\|\geq\gamma/\eta. Using the same calculation above,

Thus (38) holds. Merging all the cases above, we finally obtain

Now we consider all the steps tt which satisfy the condition in Lemma B.3, denoted as S={t[0,T1]:max(F(xt),mt+1,mt)γ/η}\mathcal{S}=\{t\in[0,T-1]:\max(\|F(x_{t})\|,\|m_{t+1}\|,\|m_{t}\|)\geq\gamma/{\eta}\}. Similarly, use S=[0,T1]\S\overline{\mathcal{S}}=[0,T-1]\backslash\mathcal{S}. Let TS=ST_{\mathcal{S}}=|\mathcal{S}|, then TTS=ST-T_{\mathcal{S}}=|\overline{\mathcal{S}}|.

Let set S\mathcal{S} and TST_{\mathcal{S}} be defined above. Then

We now focus on the summation of the term δt\|\delta_{t}\|. Define S(a,b)=F(a)F(b)S(a,b)=\nabla F(a)-\nabla F(b). When abγ\|a-b\|\leq\gamma, S(a,b)γ(AL0+BL1F(b))\|S(a,b)\|\leq\gamma\left(AL_{0}+BL_{1}\|\nabla F(b)\|\right) (see Lemma A.4). Thus we can expand δt=mt+1F(xt)\delta_{t}=m_{t+1}-\nabla F(x_{t}) using the recursive relation δt=βδt1+βS(xt1,xt)\delta_{t}=\beta\delta_{t-1}+\beta S(x_{t-1},x_{t}) as follows

where the last inequality uses the fact that F(xτ)γ/η\|\nabla F(x_{\tau})\|\leq\gamma/\eta for all τ[1,t]\S\tau\in[1,t]\backslash\mathcal{S}.

After substituting the above results into (LABEL:lemma_mom_clip_deterministic_large_sum_1) we obtain

Now we turn to the case in which max(F(x),m+,m)γ/η\max(\|\nabla F(x)\|,\|m^{+}\|,\|m\|)\leq\gamma/{\eta}.

Suppose max(F(x),m+,m)γ/η\max(\|\nabla F(x)\|,\|m^{+}\|,\|m\|)\leq\gamma/{\eta}. Then

where c1=ν(1β)(2β)Lη(1βν)2+2(1ν),c2=νβ(1β)Lηβν(1βν),c3=νβ(1+β)Lη(βν)2c_{1}=\nu(1-\beta)(2-\beta)-L\eta(1-\beta\nu)^{2}+2(1-\nu),c_{2}=\nu\beta(1-\beta)-L\eta\beta\nu(1-\beta\nu),c_{3}=\nu\beta(1+\beta)-L\eta(\beta\nu)^{2}, and L=AL0+BL1γ/ηL=AL_{0}+BL_{1}\gamma/\eta.

In the case of mγ/η\|m\|\leq\gamma/{\eta}, we have ηm2γm\eta\|m\|^{2}\leq\gamma\|m\|, thus

We then bound F(x+)F(x)F(x^{+})-F(x) and m+2m2\|m^{+}\|^{2}-\|m\|^{2}. Note that m+γ/η\|m^{+}\|\leq\gamma/\eta implies that the algorithm performs an update without normalization. Define L:=AL0+BL1γ/ηL:=AL_{0}+BL_{1}\gamma/\eta, then again by descent inequality,

by definition of the Lyapunov function, we have

where c1=ν(1β)(2β)Lη(1βν)2+2(1ν),c2=νβ(1β)Lηβν(1βν),c3=νβ(1+β)Lη(βν)2c_{1}=\nu(1-\beta)(2-\beta)-L\eta(1-\beta\nu)^{2}+2(1-\nu),c_{2}=\nu\beta(1-\beta)-L\eta\beta\nu(1-\beta\nu),c_{3}=\nu\beta(1+\beta)-L\eta(\beta\nu)^{2}. \square

Let c1,c2,c3c_{1},c_{2},c_{3} and LL be defined in Lemma B.5. If Lη1L\eta\leq 1, then the matrix

is symmetric and positive semi-definite, where IdI_{d} is the d×dd\times d identity matrix.

In fact we only need to consider the case when d=1d=1, because the eigenvalues of H2d×2dH_{2d\times 2d} can only be those that appears in H2×2H_{2\times 2} (d=1d=1). Denote two eigenvalues be λ1,λ2\lambda_{1},\lambda_{2} when d=1d=1. A direct calculation shows that

If Lη1L\eta\leq 1, then λ1λ20\lambda_{1}\lambda_{2}\geq 0 and λ1+λ20\lambda_{1}+\lambda_{2}\geq 0, which is equivalent to the semi-definiteness of HH. \square

Suppose max(F(x),m,m+)γ/η\max(\|\nabla F(x)\|,\|m\|,\|m^{+}\|)\leq\gamma/{\eta}. If Lη1L\eta\leq 1, Then

Let HH be defined in Lemma B.6. The result of Lemma B.5 can be written in a matrix form:

Using the fact that HH is positive semi-definite, we obtain the desired result. \square

Note that the amount of descent in Corollary B.7 is small in terms of F(x)\|\nabla F(x)\| if β\beta and ν\nu are close to 1. We now try to convert the term m\|m\| into F(x)\|\nabla F(x)\|, which is stated in the following lemma.

Suppose AL0ηc1(1β)AL_{0}\eta\leq c_{1}(1-\beta) and BL1γc3(1β)BL_{1}\gamma\leq c_{3}(1-\beta) for some constant c1c_{1} and c3c_{3}. Let m0=F(x0)m_{0}=\nabla F(x_{0}) for simplicity. Let set S\mathcal{S} and S\overline{\mathcal{S}} be defined above. Then

where the last inequality follows by Corollary A.4. Applying (50) recursively, we obtain

Using mtF(xt)mtF(xt)\|m_{t}\|\geq\|\nabla F(x_{t})\|-\|m_{t}-\nabla F(x_{t})\| and some straightforward calculation, we obtain

Now we are ready to prove the main theorem.

Let FF^{*} be the optimal value, and Δ=F(x0)F\Delta=F(x_{0})-F^{*}. Assume m0=F(x0)m_{0}=\nabla F(x_{0}) for simplicity. If γ1β10BL1\gamma\leq\frac{1-\beta}{10BL_{1}} and η1β10AL0\eta\leq\frac{1-\beta}{10AL_{0}}, where constants A=1+e1/1010(e1/101)<1.06A=1+e^{1/10}-10{(e^{1/10}-1)<1.06}, B=10(e1/101)<1.06B=10(e^{1/10}-1)<1.06, and ε<γ5η\varepsilon<\frac{\gamma}{5\eta}, then

By calculating Lη=AL0η+BL1γ(1β)/5<1L\eta=AL_{0}\eta+BL_{1}\gamma\leq(1-\beta)/5<1, we can use Corollary B.7. Taking summation of the inequality (47) over steps tS=[0,T1]\St\in\overline{\mathcal{S}}=[0,T-1]\backslash\mathcal{S}, we obtain

Combining (56) and (40) in Corollary B.4 we obtain

we have AL0η(1β)/10AL_{0}\eta\leq(1-\beta)/10 and BL1γ(1β)/10BL_{1}\gamma\leq(1-\beta)/10. Using Lemma B.14 we have

Therefore by standard inequality x22εxε2x^{2}\geq 2\varepsilon x-\varepsilon^{2} and (57) we obtain

We now simplify U(x)U(x). Let εγ5η\varepsilon\leq\frac{\gamma}{5\eta}. By (58) we have

Since ε<γ5η\varepsilon<\dfrac{\gamma}{5\eta}, we have U(x)V(x)U(x)\geq V(x). Therefore by (LABEL:longineq) and Lemma A.5 we have

B.2 Proof of Theorem 3.2

We now prove the stochastic case. As before, to simplify the notation we write the update formula as

In stochastic case, we define the Lyapunov function to be

Suppose γc/L1\gamma\leq c/L_{1} for some constant cc, and we denote A=1+ecec1cA=1+e^{c}-\frac{e^{c}-1}{c} and B=ec1cB=\frac{e^{c}-1}{c}, just the same as in the descent inequality (Lemma A.3). When γ1β50L1ε1500L1\gamma\leq\frac{1-\beta}{50L_{1}}\varepsilon\leq\frac{1}{500L_{1}} (in Theorem 3.2), we can take c=1/500c=1/500 and A=B=1.002A=B=1.002.

Furthermore, using the noise assumption, for different time steps t,tt,t^{\prime}, we have

Based on Lemma B.2, we only need to bound F(x+)F(x)F(x^{+})-F(x). We use the (L0,L1)(L_{0},L_{1})-smooth condition:

Now we bound F(x),x+x\left\langle\nabla F(x),x^{+}-x\right\rangle. The calculation is similar to the deterministic setting. We first bound min(η,γm+)m+,F(x)-\min\left(\eta,\frac{\gamma}{\|m^{+}\|}\right)\left\langle m^{+},\nabla F(x)\right\rangle. Consider the following three cases, all of which are analogous to the proof of Lemma B.3:

m+γ/η\|m^{+}\|\geq\gamma/\eta. The algorithm performs a normalized update. We have

m+<γ/η\|m^{+}\|<\gamma/\eta and F(x)4γ/5η\|\nabla F(x)\|\geq 4\gamma/5\eta. The algorithm performs an unnormalized update. We have

where (77) uses the following two inequalities which can be obtained by Lemma B.10:

We next bound min(η,γf(x,ξ))f(x,ξ),F(x)-\min\left(\eta,\frac{\gamma}{\|\nabla f(x,\xi)\|}\right)\left\langle\nabla f(x,\xi),\nabla F(x)\right\rangle. Consider the following cases, all of which are analogous to the proof of Lemma B.3:

f(x,ξ)γ/η\|\nabla f(x,\xi)\|\geq\gamma/\eta. In this case we can use Lemma B.1 with μ=2/5\mu=2/5:

f(x,ξ)<γ/η\|\nabla f(x,\xi)\|<\gamma/\eta. In this case

We now bound ηF(x)2-\eta\|\nabla F(x)\|^{2}. If F(x)4γ5η\|\nabla F(x)\|\geq\frac{4\gamma}{5\eta}, then ηF(x)245γF(x)-\eta\|\nabla F(x)\|^{2}\leq-\frac{4}{5}\gamma\|\nabla F(x)\|. If F(x)<4γ5η\|\nabla F(x)\|<\frac{4\gamma}{5\eta} and m+4γ5η\|m^{+}\|\geq\frac{4\gamma}{5\eta}, then using the same calculation as in the deterministic case,

Let set S\mathcal{S} and TST_{\mathcal{S}} be defined above. Then

where c1=ν(1β)(2β)AL0η(1βν)2+2(1ν),c2=νβ(1β)AL0ηβν(1βν),c3=νβ(1+β)AL0η(βν)2c_{1}=\nu(1-\beta)(2-\beta)-AL_{0}\eta(1-\beta\nu)^{2}+2(1-\nu),c_{2}=\nu\beta(1-\beta)-AL_{0}\eta\beta\nu(1-\beta\nu),c_{3}=\nu\beta(1+\beta)-AL_{0}\eta(\beta\nu)^{2}. c1=(1β)[2βAL0η(1β)]c_{1}=(1-\beta)[2-\beta-AL_{0}\eta(1-\beta)], c2=β[1βAL0η(1β)]c_{2}=\beta[1-\beta-AL_{0}\eta(1-\beta)] and c3=β(1+βAL0ηβ)c_{3}=\beta(1+\beta-AL_{0}\eta\beta).

Because f(x,ξ)4γ/5η+σ=γ/η\|\nabla f(x,\xi)\|\leq 4\gamma/5\eta+\sigma=\gamma/\eta and m+γ/η\|m^{+}\|\leq\gamma/\eta, the algorithm performs an unnormalized update. The proof is similar to the one in Lemma B.5 except for bounding the term F(x+)F(x)F(x^{+})-F(x).

The proof of Lemma B.14 is similar to the proof of Lemma B.8. We first write (52) again as follows:

We now merge the two cases corresponding to Corollary B.12 and Lemma B.13. The proof of the following theorem involves many techniques which are different from the deterministic case and is far more challenging.

Let FF^{*} be the optimal value, and Δ=F(x0)F\Delta=F(x_{0})-F^{*}. Assume m0=F(x0)m_{0}=\nabla F(x_{0}) for simplicity. Fix ε0.1\varepsilon\leq 0.1 be a small constant.If γεσmin(εAL0,1βAL0,1β50BL1)\gamma\leq\frac{\varepsilon}{\sigma}\min\left(\frac{\varepsilon}{AL_{0}},\frac{1-\beta}{AL_{0}},\frac{1-\beta}{50BL_{1}}\right) and γ/η=5σ\gamma/\eta=5\sigma where constants A=1.01,B=1.01A=1.01,B=1.01, then

Based on the previous results, we take summation over tt and obtain

We now simplify (93) by taking expectation. We first have

Applying the above equation recursively, we obtain

where the first inequality uses the proof of Corollary A.4 and Lemma B.10, and the last inequality uses γ/η=5σ\gamma/\eta=5\sigma. By taking summation of the above inequality we obtain

Let γε2σmin(εAL0,1βAL0,1β25BL1)\gamma\leq\frac{\varepsilon}{2\sigma}\min\left(\frac{\varepsilon}{AL_{0}},\frac{1-\beta}{AL_{0}},\frac{1-\beta}{25BL_{1}}\right), and fix the ratio γ/η=5σ\gamma/\eta=5\sigma. Then for small enough ε<0.1\varepsilon<0.1 and large enough noise σ>1\sigma>1,

Applying the above estimates and rearranging (102), we have

Due to Lemma B.14 (AL0ησε10(1β),BL1γε50(1β)AL_{0}\eta\sigma\leq\frac{\varepsilon}{10}(1-\beta),BL_{1}\gamma\leq\frac{\varepsilon}{50}(1-\beta)), we clearly have

Plugging (LABEL:sigma_mt) into (LABEL:rearrange), we obtain

It clearly follows that min{U(x),V(x)}45εηF(x)45ε2η\min\{U(x),V(x)\}\geq\frac{4}{5}\varepsilon\eta\|\nabla F(x)\|-\frac{4}{5}\varepsilon^{2}\eta. Therefore

We finally show G(x0)F+P0=O(F(x0)F)G(x_{0})-F^{*}+P_{0}=O(F(x_{0})-F^{*}). Using Lemma A.5,

For the term P0P_{0}, if F(x0)=Ω(L0/L1)\|\nabla F(x_{0})\|=\Omega(L_{0}/L_{1}), we can similarly use Lemma A.5 to obtain P0=O(F(x0)F)P_{0}=\mathcal{O}(F(x_{0})-F^{*}). If F(x0)=O(L0/L1)\|\nabla F(x_{0})\|=\mathcal{O}(L_{0}/L_{1}), using L0F(x0)L1F(x0)2L_{0}\|\nabla F(x_{0})\|\leq L_{1}\|\nabla F(x_{0})\|^{2} and Lemma A.5 leads to the result. \square

Appendix C Discussion of the normalized momentum algorithm

In this section we analyze in detail the theoretical aspects of the normalized momentum algorithm, as well as some practical issues. Recall that this algorithm can be seen as a special case of our clipping framework. For convenience we re-write it in Algorithm 2.

We remark that SNM is different from the clipping methods in traditional sense, in that it makes a normalized update each iteration. This algorithm has been analyzed in Cutkosky and Mehta for LL-smooth functions. In that setting they were able to prove that SNM achieves a complexity of O(ΔLσ2ε4)\mathcal{O}(\Delta L\sigma^{2}\varepsilon^{-4}).

For (L0,L1)(L_{0},L_{1})-smooth functions, we show that: (a). With carefully chosen momentum parameter β\beta and step size η\eta, SNM can achieve a complexity of O(ΔL0σ2ε4)\mathcal{O}(\Delta L_{0}\sigma^{2}\varepsilon^{-4}), which is the same as the complexity we obtain in Theorem 3.2. (b). There are some practical issues that make SNM less favorable than traditional clipping methods (such as the other three special cases of our framework discussed in Section 3 of the main paper).

The following results provides convergence guarantee for Algorithm 2.

Consider the algorithm that starts at x0x_{0} and make updates xt+1=xtηmt+1x_{t+1}=x_{t}-\eta m_{t+1}. Define δt:=mt+1F(xt)\delta_{t}:=m_{t+1}-\nabla F(x_{t}) be the estimation error. Assume ηc/L1\eta\leq c/L_{1} for some c>0c>0 and let constants A=1+ecec1c,B=ec1cA=1+e^{c}-\frac{e^{c}-1}{c},B=\frac{e^{c}-1}{c}. Then

Since xt+1xt=ηt\|x_{t+1}-x_{t}\|=\eta_{t}, by Lemma A.3 we have

where in the second inequality we use Lemma B.1. \square

holds in T=O(ΔL0σ2ε4)T=\mathcal{O}(\Delta L_{0}\sigma^{2}\varepsilon^{-4}) iterations.

Define the estimation errors δt:=mt+1F(xt)\delta_{t}:=m_{t+1}-\nabla F(x_{t}). Denote S(a,b):=F(a)F(b)S(a,b):=\nabla F(a)-\nabla F(b), then for a,ba,b such that ab=ηc/L1\|a-b\|=\eta\leq c/L_{1}, we can upper bound S(a,b)S(a,b) using Corollary A.4:

We can use S(a,b)S(a,b) to get a recursive relationship:

Denote δt=f(xt,ξt)F(xt)\delta_{t}^{\prime}=\nabla f(x_{t},\xi_{t})-\nabla F(x_{t}), then

Using triangle inequality and plugging in the estimate (113) , we have

Taking a telescope summation of 115 and using Assumption 2.4 we obtain

If we choose η=Θ(min(L11,L01ε)α)\eta=\Theta\left(\min(L_{1}^{-1},L_{0}^{-1}\varepsilon)\alpha\right) and α=Θ(σ2ε2)\alpha=\Theta\left(\sigma^{-2}\varepsilon^{2}\right), then

We have shown the theoretical superiority of Algorithm 2. Specifically, it enjoys the same complexity as Theorem 3.2. However we notice some potential drawbacks of Algorithm 2:

Firstly, the step size of Algorithm 2 is at the order of O(ε3)\mathcal{O}\left(\varepsilon^{3}\right), while the step size we chose in Theorem 3.2 is O(ε2)\mathcal{O}\left(\varepsilon^{2}\right). Previous works have noticed that a smaller step size makes it easier to be trapped in a sharp local minima , which may result in worse generalization [Kleinberg et al., 2018].

Secondly, although the complexity of Algorithm 2 is the same as Theorem 3.2 for small ε\varepsilon, it requires a more restrictive upper bound of ε\varepsilon to ensure the ε4\varepsilon^{-4} term dominates. For instance with a poor initialization, Δ\Delta may very large. This suggests that in practice, where we do not get into a very small neighbourhood of stationary point, the performance of Algorithm 2 may be worse.

Appendix D Details of Lower Bounds in Section 3.3

In this section we discuss the lower bound for SGD in Drori and Shamir in detail. The following result is taken from this paper:

Now we discuss why this shows the optimality of clipped SGD under Assumptions 2.1, 2.2 and 2.4.

Firstly, Theorem D.1 assumes an upper bound Δ\Delta on F(x0)F(xt)F(x_{0})-F(x_{t}) rather than the one assumed in Assumption 2.1 (F(x0)FΔF(x_{0})-F^{*}\leq\Delta). However, in fact we only need to assume that F(x0)F(xT)ΔF(x_{0})-F(x_{T})\leq\Delta to prove Theorem 3.2 for clipped SGD. The reason is as follows. In fact, since β=0\beta=0 for clipped SGD , the momentum term in the Lyapunov function disappears, as well as the term νβ(1β)2AL0η2σF(x0)\frac{\nu\beta}{(1-\beta)^{2}}AL_{0}\eta^{2}\sigma\|\nabla F(x_{0})\| in (102). So we no longer need to use Lemma A.5 to bound the term F(x0)\|\nabla F(x_{0})\|. The rest of the proof only needs F(x0)F(xT)ΔF(x_{0})-F(x_{T})\leq\Delta (which is used in the telescope sum in (102)).

Secondly, although Theorem D.1 only assume that the variance of stochastic gradient is bounded, in their construction the noise is actually defined as

Therefore the norm of the noise is bounded by σ\sigma, and the example used to prove Theorem D.1 still works under Assumption 2.4.

Now suppose we need an output such that f(xout)=γε\|\nabla f(x_{\text{out}})\|=\|\gamma\|\leq\varepsilon, then it follows from Theorem D.1 that T=Ω(LΔσ2ε4)T=\Omega\left(L\Delta\sigma^{2}\varepsilon^{-4}\right). Therefore we have shown the optimality of clipped SGD in this class of algorithms, as stated in Section 3.3.

Appendix E Justifications on the Mixed Clipping

In the above optimization problem, the general update formula can be written as:

We now analyze three cases based on the proposition:

We further plot the value of (119) with respect to ν\nu and β\beta in Figure 3 to visualize the above finding. It can be clearly seen that the using both gradient and momentum with a proper interpolation factor ν\nu outperforms both SGD and SGD with momentum by a large margin (Figure 3(a)). Furthermore, we can drive β1\beta\rightarrow 1 to further improve convergence (Figure 3(b)), while in SGD with momentum we can not.

𝑥𝜉2f(x,\xi)=\frac{1}{2}(x+\xi)^{2}. The mixed update with proper ν\nu outperforms both SGD and SGD with momentum by a large margin. Furthermore, for the mixed update we can drive β1\beta\rightarrow 1 to further improve convergence, while for SGD with momentum we can not. Although we use the simple function F(x)=12x2F(x)=\frac{1}{2}x^{2} as an example, similar result exists in any general quadratic form with positive definite Hessian. Furthermore, the experiments in Section 4 also demonstrate that the mixed clipping outperforms both gradient clipping and momentum clipping.

Combining (120), (121) and (122), we can write the recursive relationship into a matrix form:

Denote the above matrix as MM. After a straightforward calculation, we can find that λ1=1\lambda_{1}=1 is an eigenvalue of MM, and

is the only eigenvector associated with λ1=1\lambda_{1}=1. Similarly, λ2=β\lambda_{2}=\beta is also an eigenvalue of MM. Let the other two eigenvalues be λ3\lambda_{3} and λ4\lambda_{4}, then

It follows that λ3λ4=β2\lambda_{3}\lambda_{4}=\beta^{2} and λ3+λ4=(η(1β)β1)22β\lambda_{3}+\lambda_{4}=(\eta(1-\beta)-\beta-1)^{2}-2\beta. Since (1+βη(1β))2<(1+β)2(1+\beta-\eta(1-\beta))^{2}<(1+\beta)^{2}, we have λ3+λ4<1+β2\lambda_{3}+\lambda_{4}<1+\beta^{2}. Therefore λ3<1|\lambda_{3}|<1 and λ4<1|\lambda_{4}|<1 (note that λ3\lambda_{3} and λ4\lambda_{4} can be composite numbers). If η<1\eta<1, we can further conclude that the four eigenvalues are different from each other (otherwise λ3=λ4=β\lambda_{3}=\lambda_{4}=\beta, which contradicts to λ3+λ4=(η(1β)β1)22β\lambda_{3}+\lambda_{4}=(\eta(1-\beta)-\beta-1)^{2}-2\beta).

E.1.2 Proof of the general case

Now we prove Proposition E.1 for general ν\nu.

Combining (120), (124) and (125), we obtain the following recursive matrix MM:

Using the same calculation as in the previous section, we finally get

Appendix F Soft Clipping

For Algorithm 1, as long as the norm of the gradient (or momentum) exceeds a constant, it is then clipped; we refer to this form of clipping as hard clipping. One can also consider a soft form of clipping, as presented in Algorithm 3.

We take ν=0\nu=0 for example to analyze soft clipping. For any gradient norm lgl_{\text{g}}, the norm of the update lul_{\text{u}} is a function of lgl_{\text{g}}:

For hard clipping, we can similarly write

Therefore soft clipping is in fact equivalent to hard clipping up to a constant factor 2 in the step size choice. Thus it’s easy to see that our results also hold for Algorithm 3. However, compared to hard clipping, soft clipping has the advantage that the function hsofth_{\text{soft}} in (127) is smooth while hhardh_{\text{hard}} in (128) is not, as shown in Figure 4. We also empirically observe that the training curve of soft clipping is more smooth than hard clipping.

Appendix G Experimental Details in Section 4

Based on the discussion in Appendix F, we use the soft version of clipping algorithms in all the experiments.

The CIFAR-10 dataset contains 50k images for training and 10k for testing. All the images are 32×3232\times 32 RGB bitmaps. We use the standard ResNet-32 architecture. The total number of parameters is 466,906. For all algorithms, we use mini-batch size 128 and weight decay 5×1045\times 10^{-4}. For the baseline algorithm, we use SGD with momentum using learning rate lr=1.0lr=1.0 and momentum factor β=0.9\beta=0.9. Note that we use the momentum defined in Algorithm 1, which is equivalent to a Pytorch implementation with lr=0.1lr=0.1 and β=0.9\beta=0.9. We optimize ResNet-32 for 150 epochs, and decrease the learning rate at epoch 80 and epoch 120. For other algorithms, we perform a course grid search for lrlr an γ\gamma, while keeping all the training strategy the same as SGD. We use 5 random seeds ranging from 2016 to 2020, and the results are similar. The plot in Figure 2 uses the random seed 2020.

G.2 PTB

The Penn Treebank dataset has a vocabulary of size 10k, and 887k/70k/78k words for training/validation/testing. We use the state-of-the-art AWD-LSTM architecture using hidden size 1150 and embedding size 400. The total number of parameters is 23,941,600. For the baseline algorithm, we follow Merity et al. who use averaged SGD clipping without momentum using learning rate lr=30lr=30 and γ=7.5\gamma=7.5. Note that here γ=7.5\gamma=7.5 means that the gradient norm will be clipped to be no more than 0.25. We use the same dropout rate and regularization hyper-parameters in [Merity et al., 2017]. We train AWD-LSTM for 250 epochs, and averaging is triggered when the validation perplexity stops improving. For other algorithms, we perform a course grid search for lrlr an γ\gamma, while keeping all the training strategy the same as SGD clipping. We use 5 random seeds ranging from 2016 to 2020, and the results are similar. The plot in Figure 2 uses the random seed 2020.

G.3 ImageNet

We also conduct experiments on ImageNet dataset. This dataset contains about 1.28 million training images and 50k validation images with various sizes. We train the standard ResNet-50 architecture on this dataset. The total number of parameters is 25,557,032. We use a batch size of 256 on 4 GPUs and a weight decay of 10410^{-4}. For the baseline algorithm, we choose SGD with learning rate lr=1.0lr=1.0 and momentum β=0.9\beta=0.9, following Goyal et al. . Note that we use the momentum defined in Algorithm 1, which is equivalent to a Pytorch implementation with lr=0.1lr=0.1 and β=0.9\beta=0.9. We train the ResNet-50 for 90 epochs, and decrease the learning rate in epoch 30, epoch 60 and epoch 80. For the other algorithms, we perform a course grid search for lrlr an γ\gamma, while keeping all the training strategy the same as SGD.

Appendix H Additional Experimental Results

Comparsion with Adam optimizer. Adam is a popular optimizer to train neural networks on a variety of tasks. We also run the same experiments in Section 4 using Adam optimizer with best hyper-parameters in order to compare it with clipping algorithms. We turn the learning rate of Adam based on a grid search, and choose the momentum hyper-parameters β1=0.9\beta_{1}=0.9 and β2=0.999\beta_{2}=0.999. The results are shown in Figure 5(a) and 5(b).

It is clear from the figure that Adam trains really fast on both datasets. It is even faster than gradient clipping and momentum clipping. However, it does not outperform the mixed clipping method. Note that like clipping algorithms, Adam also uses adaptive gradient. However, Adam generalizes much worse than vanilla SGD or clipping algorithms. Particularly, the test accuracy of CIFAR-10 using Adam is only 91.5%, while all other algorithms reach 93% test accuracy.

CIFAR-10 classificatoin with ResNet-18. The original work of gradient clipping in Zhang et al. [2020a] conducts experiments using ResNet-18 on CIFAR-10. To compare with it, we also run the same experiments using different algorithms, e.g. SGD, SGD clipping, momentum clipping and mixed clipping. All the algorithms reach 95% test accuracy, and the result is similar to that of using ResNet-32 architeture (see Figure 5(c)).

In this section, we are aiming to construct an optimization problem which provably satisfies the (L0,L1)(L_{0},L_{1})-smoothness condition in this paper rather than the traditional LL-smoothness condition. We then conduct experiments in both deterministic setting and stochastic setting.

In fact, if the exponential function exp()\exp(\cdot) is replaced by log(1+exp())\log(1+\exp(\cdot)), the problem becomes the well-known logistic regression. However, logistic loss has bounded second-order derivative (thus is LL-smooth), while exp()\exp(\cdot) does not. Furthermore, exponential function is (0,1)-smooth, thus we expect L(w,b)L(w,b) is also (L0,L1)(L_{0},L_{1})-smooth for some L0,L1L_{0},L_{1} (see the following proposition). This is why we use exponential loss here. We point out that such exponential loss is also used in a variety of algorithms, such as boosting (AdaBoost).

When the dataset is linearly separable, parameter ww will be driven to infinity through optimization, thus adding some regularization is prevalent in linear classification. We use the following term (131) rather than L2L_{2} norm for regularization, in order to be compatible with L(w,b)L(w,b).

In fact, Rλ(w)R_{\lambda}(w) is similar to weight decay regularization in that Rλ(w)=12λ2w2+O(λ4w4)R_{\lambda}(w)=\frac{1}{2}\lambda^{2}\|w\|^{2}+O(\lambda^{4}\|w\|^{4}) when ww is small.

The total loss Eλ(w,b)=L(w,b)+Rλ(w)E_{\lambda}(w,b)=L(w,b)+R_{\lambda}(w). We now claim that Eλ(w,b)E_{\lambda}(w,b) is indeed (L0,L1)(L_{0},L_{1})-smooth.

Assume bias term b=0b=0 for simplicity. Suppose the data points have bounded norm, i.e. xiR\|x_{i}\|\leq R for all ii and λ<R\lambda<R. Let the loss function Eλ(w,0)E_{\lambda}(w,0) be defined above. Then for every ρ1>0,ρ2>0\rho_{1}>0,\rho_{2}>0, ρ=ρ1+ρ2\rho=\rho_{1}+\rho_{2}, Eλ(w,0)E_{\lambda}(w,0) is (L0,L1)(L_{0},L_{1})-smooth w.r.t ww for

We use MNIST dataset in this section, which contains 60,000 hand-writing training images. We only evaluate the training speed for different algorithms on the training set rather than the generalization capability. The loss functions is defined to be the sum of ten losses, each of which corresponds to the loss of a binary classification problem to recognize number 0 to 9. Regularization coefficient λ\lambda is set to be 0.02.

To compare different algorithms, we choose the best hyperparameters lrlr and γ\gamma for each algorithm based on a careful grid search. ν\nu is set to be 0.7 for mixed clipping. The parameter initalization and all inputs in the schocastic setting are the same for all algorithms. For each run, we average the loss of the last 5 epoch in order to reduce variance. In the deterministic setting we train 500 epochs, each of which uses the entire dataset. In the stochastic setting we train 50 epochs with a mini-batch size 200. We run on 5 different random seeds ranging from 2016 to 2020 altogether and average their results.

Figure 6 plots the results. It is clear that in both settings, clipping is vital to a fast convergence. Also, momentum helps training, and mixed clipping performs the best in the stochastic setting.

Let M=maxi[n+2d]wTziM=\mathop{\max}\limits_{i\in[n+2d]}w^{T}z_{i}. Let ρ1>0,ρ2>0\rho_{1}>0,\rho_{2}>0 be two constants. Pick M0=(1+1ρ2)logn(R2+dλ2)ρ1R2M_{0}=\left(1+\frac{1}{\rho_{2}}\right)\log\frac{n(R^{2}+d\lambda^{2})}{\rho_{1}R^{2}}. We consider the following two cases:

(1)MM0M\leq M_{0}. In this case 2E(w)\|\nabla^{2}E(w)\| can be directly upper bounded:

The first inequality in (134) uses the triangular inequality of matrix spectral norm and zzT=zTz=z2\|zz^{T}\|=\|z^{T}z\|=\|z\|^{2}.

(2)M>M0M>M_{0}. Decompose M0M_{0} to be M0=M1+M2M_{0}=M_{1}+M_{2} where

Define set I={i[n+2d]:wTziMM1}I=\{i\in[n+2d]:w^{T}z_{i}\geq M-M_{1}\} and I2={i[n+2d]:wTzi<0}I_{2}=\{i\in[n+2d]:w^{T}z_{i}<0\}. Then

In (136) we use the Cauchy-Schwartz inequality; In (137) we partition the index {i:i[n+2d]}\{i:i\in[n+2d]\} to three subsets II, I2I_{2} and [n+2d]\(II2)[n+2d]\backslash(I\cup I_{2}), and use the lower bound and upper bound of wTzi>0w^{T}z_{i}>0 for each set.

Similar, we can upper bound 2E(w)\|\nabla^{2}E(w)\|:

To bound exp(M1)\exp(M_{1}), we again bound E(w)\|\nabla E(w)\| from a different perspective:

where (142) is obtained by selecting the ii with the largest wTziw^{T}z_{i} which is equal to MM. Substitute (138) and (142) into (140) then we get

Since M=maxi[n+2d]wTziM=\mathop{\max}\limits_{i\in[n+2d]}w^{T}z_{i} implies that λwkM|\lambda w_{k}|\leq M for all k[d]k\in[d] from (132), we can upper bound the norm of ww: wMdλ\|w\|\leq\frac{M\sqrt{d}}{\lambda}. Substitute this into (144) we get

Combining the above two cases concludes the proof.