Robustness to Unbounded Smoothness of Generalized SignSGD

Michael Crawshaw, Mingrui Liu, Francesco Orabona, Wei Zhang, Zhenxun Zhuang

Introduction

Recent years have witnessed a surge in non-convex machine learning models, with a focus on deep neural networks . DNNs have achieved tremendous progress in a variety of tasks, including computer vision , natural language processing , and a lot more. Despite their huge empirical success, the theoretical analyses of non-convex optimization prove to be fundamentally more challenging than the established convex optimization theory . Among the numerous literature, many of them assume smoothness of the objective function, namely requiring the gradients to be Lipschitz. Under this scenario, past works have succeeded in proving the convergence rates for a number of algorithms, e.g., Stochastic Gradient Descent , AdaGrad , and STORM .

Nevertheless, it was recently observed that the smoothness assumption does not capture the training of LSTMs : the Hessian can grow with the size of the gradients . Inspired by this, Zhang et al. proposed a relaxed smoothness assumption, named (L0,L1)(L_{0},L_{1}) smoothness:

They also showed that the well-known gradient clipping technique can ensure SGD’s convergence in such scenarios. Later, their results were improved to show that SGD with clipping can be made unaffected by the L1L_{1} in (1) and is able to recover the optimal convergence rate of SGD under the original smoothness setting .

Nevertheless, the (L0,L1)(L_{0},L_{1}) condition has not yet been empirically verified beyond LSTMs. Therefore, our first contribution lies in studying the applicability and generalization of the (L0,L1)(L_{0},L_{1}) condition. In particular, we have empirically verified that the popular Transformer model also seems to satisfy this assumption, see Figure 1. Yet, we noticed that different coordinates, especially when they are in different layers of the model, exhibit very distinct L0L_{0} and L1L_{1} values as shown in Figure 3. Hence, we propose to refine the (L0,L1)(L_{0},L_{1}) assumption in (1) to a coordinate-wise version (Assumption 2) and consider this to better capture the loss surface when training deep neural networks like Transformers.

Given that we assume (a generalization) of the (L0,L1)(L_{0},L_{1}) assumption, it would be natural to use some clipping procedure. However, we found out that the use of clipping on Adam , while carried out in common practice [e.g., 53], has no effect on the training and testing performance on optimizing a large transformer model as shown in Figure 2. In retrospect, this might not be surprising: It is known that Adam has an implicit clipping behavior due to the normalization by the estimated second moment of the gradients. Indeed, Adam can be interpreted as a variant of SignSGD .

Inspired by this, our second contribution is to propose and analyze a generalized SignSGD algorithm under the relaxed smoothness assumption. It is parameterized in such a way that it on one end recovers SignSGD while on the other end closely resembles Adam. Apart from the convergence rates, we also located the critical role the momentum plays in analyzing Adam-type algorithms: it not only reduces the effects of noise but also gives an exponential decaying effect on the unbounded gradient norms and smoothness. This can partly explain the phenomenon that clipping does not help Adam.

The structure of this paper is as follows. Section 2 discusses related works and how our paper builds upon and distinguishes from them. The settings and assumptions are carried out in Section 3. We will introduce formally the generalized SignSGD algorithm and its analysis in Section 4, with a detailed discussion on the bounds and the role of momentum. The experimental results are shown in Section 5, comparing our algorithm with some popular competitors in deep learning tasks. Finally, we draw some conclusions and discuss the limitations of our work in Section 6.

Related Works

Adaptive Gradient Methods Adaptive gradient methods are popular optimizers for training deep neural networks. The traditional analysis of adaptive gradient methods is providing regret bounds under the online convex optimization framework . Recently, there are some analysis of adaptive gradient methods for nonconvex smooth functions . Zou et al. introduces an intriguing connection between Adam and SignGD when training a two-layer neural network in the deterministic setting, where SignGD is an algorithm following the negative gradient sign direction to perform the update. However, these works cannot be directly extended to nonconvex functions with unbounded smoothness in the stochastic setting. To the best of our knowledge, this work is the first one establishing guarantees for coordinate-wise type optimizers like generalized SignSGD as well as Adam-type updates under a relaxed smoothness condition.

Gradient Clipping The algorithm and analysis of gradient clipping can be traced back to under the assumption that the function is convex and rapidly growing. Hazan et al. considered gradient clipping in quasi-convex optimization. Mai and Johansson showed the stability and convergence of stochastic gradient clipping algorithms for convex problems without the smoothness condition. Gradient clipping is a standard technique in training deep neural networks such as RNNs and LSTMs. The theoretical analysis of gradient clipping for nonconvex models is pioneered by , in which the authors analyzed the convergence of gradient clipping under the relaxed smoothness assumption rather than the standard smoothness assumption. Zhang et al. further improved the convergence rate bound under the same assumption as in . Gradient clipping is also used when there is a heavy tail noise in the stochastic gradient to establish high probability convergence rates . Cutkosky and Mehta proved that normalized momentum improves normalized SGD under a second-order smoothness condition. A close algorithm is the one in which employs gradient normalization, momentum, and no gradient clipping to tackle the (L0,L1)(L_{0},L_{1}) condition (1) and control noise. Yet, their algorithm normalizes each coordinate with the same scale unlike popular optimizers such as Adam . Moreover, we observe empirically that normalized SGD with momentum performs worse than Adam. Motivated by this, we propose a coordinate-wise optimization algorithm which requires new analysis tools compared with .

Employ mt2\boldsymbol{m}_{t}^{2} to compute vt\boldsymbol{v}_{t} in Adam Designed to combine the advantages of Adagrad and RMSProp , the update of Adam employs the ratio between the exponential moving average of the stochastic gradient (mt\boldsymbol{m}_{t}) and the exponential moving average of the squared stochastic gradient (vt\boldsymbol{v}_{t}). Many variants of Adam have been proposed ever since. Among them, one idea is to use mt2\boldsymbol{m}_{t}^{2} to compute vt\boldsymbol{v}_{t} instead of gt2\boldsymbol{g}_{t}^{2}. The intuition is that mt\boldsymbol{m}_{t} represents a better update direction than gt\boldsymbol{g}_{t} and can thus better capture the second-moment information. Reddi et al. adopted this change to prove the convergence of Adam in a federated learning setting; yet, they only consider the smooth setting and require a large ϵ\epsilon to obtain convergence in contrast to the original Adam. Later, Wang et al. explored this idea in more detail, but their analyses are still restricted to the smooth setting.

Settings and Preliminaries

In this paper, we focus on the following stochastic optimization problem:

where ξ\xi is a random variable representing a randomly selected data sample or random noise following an unknown distribution D\mathcal{D}. We will use the following assumptions.

We will denote L0:=[L0,1,L0,2,,L0,d]T\boldsymbol{L_{0}}:=[L_{0,1},L_{0,2},\ldots,L_{0,d}]^{T} and L1:=[L1,1,L1,2,,L1,d]T\boldsymbol{L_{1}}:=[L_{1,1},L_{1,2},\ldots,L_{1,d}]^{T}.

Indeed, our Assumption 2 implies (4) when L0,j=L0L_{0,j}=L_{0} and L1,j=L1L_{1,j}=L_{1} for all j[d]j\in[d], up to constants (See Lemma 3 in the Appendix). Note that the 1d\frac{1}{\sqrt{d}} factor in ours is exactly for easy comparison with (4). The reason we turn to the current coordinate-wise version is that we observed a vast variance across different layers in training Transformer models: (1) is still true globally (Figure 1), but each layer or even each coordinate satisfies has a very different (L0,L1)(L_{0},L_{1}) pair (Figure 3). The smoothness assumption has been generalized in orthogonal directions in other work .

One merit of Assumption 2 is that it gives us the following descent lemma.

Our last assumption is common in the literature studying the (L0,L1)(L_{0},L_{1}) smooth condition .

A Generalized SignSGD Algorithm

In this section, we present in Algorithm 1 a generalized SignSGD algorithm. This algorithm encompasses a variety of optimization algorithms.

At first sight, it seems very similar to Adam. Indeed, if we employ gt2\boldsymbol{g}_{t}^{2} in computing vt\boldsymbol{v}_{t} instead of mt2\boldsymbol{m}_{t}^{2}, then it is exactly Adam, except for the bias correction terms. We would like to clarify that the idea of this change has been explored before, as detailed in Section 2. In this paper, the motivation for adopting this idea comes from the known effect of momentum on reducing the influence of noises . Indeed, in our analysis the difference between mt\boldsymbol{m}_{t} and F(xt)\nabla F(\boldsymbol{x}_{t}) is much more controllable than between gt\boldsymbol{g}_{t} and F(xt)\nabla F(\boldsymbol{x}_{t}). Thus, we consider employing mt\boldsymbol{m}_{t} in computing vt\boldsymbol{v}_{t} a better choice.

Note that both SignSGD and Adam are good candidates for optimization algorithms whose update must be bounded on functions that satisfy the (L0,L1)(\boldsymbol{L_{0}},\boldsymbol{L_{1}}) condition. Indeed, SignSGD can be seen as an extreme form of gradient clipping. On the other hand, as said in the introduction, Adam does not seem to require gradient clipping at all when used to train the large Transformer model in Figure 2.

Hence, we expect our algorithm, a generalization of SignSGD and a close resemblance to Adam, can enjoy the merits of both and be robust to the unbounded smoothness in the (L0,L1)(\boldsymbol{L_{0}},\boldsymbol{L_{1}}) scenario. In the next section, we will formalize this claim by presenting the theoretical analysis of Algorithm 1.

Under Assumptions 1, 2, and 3, assume Mj:=sup{Fxj(x):F(x)F(x1)}M_{j}:=\sup\left\{\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x})\right|:F(\boldsymbol{x})\leq F(\boldsymbol{x}_{1})\right\} is finite for each j[d]j\in[d], let Δ\Delta be any upper bound on F(x1)FF(\boldsymbol{x}_{1})-F^{*}, α=min(L01Δσ1T,1)\alpha=\min\left(\frac{\sqrt{\|\boldsymbol{L_{0}}\|_{1}}\sqrt{\Delta}}{\|\boldsymbol{\sigma}\|_{1}\sqrt{T}},1\right), β1=1α\beta_{1}=1-\alpha, β2β1<1\frac{\sqrt{\beta_{2}}}{\beta_{1}}<1, ρ=1β2β1\rho=1-\frac{\sqrt{\beta_{2}}}{\beta_{1}}, η=ΔαL01T\eta=\frac{\sqrt{\Delta\alpha}}{\sqrt{\|\boldsymbol{L_{0}}\|_{1}}\sqrt{T}}, for Tmax(100dΔL12(1β2)ρ2L01,10000d2Δσ12L14(1β2)2ρ4L013)T\geq\max\left(\frac{100d\Delta\|\boldsymbol{L_{1}}\|_{\infty}^{2}}{(1-\beta_{2})\rho^{2}\|\boldsymbol{L_{0}}\|_{1}},\frac{10000d^{2}\Delta\|\boldsymbol{\sigma}\|_{1}^{2}\|\boldsymbol{L_{1}}\|_{\infty}^{4}}{(1-\beta_{2})^{2}\rho^{4}\|\boldsymbol{L_{0}}\|_{1}^{3}}\right), Algorithm 1 guarantees, with probability at least 1δ1-\delta, that

Furthermore, for the case when β2=0\beta_{2}=0, we have the following refined guarantee:

Here, MjM_{j} denotes the maximum absolute value of the partial derivative of FF for coordinate jj among the sub-level set of F(x1)F(\boldsymbol{x}_{1}), namely any point x\boldsymbol{x} with F(x)F(x1)F(\boldsymbol{x})\leq F(\boldsymbol{x}_{1}). In other words, we assume gradients to be bounded in the sub-level set of F(x1)F(\boldsymbol{x}_{1}); yet, we do not make any restriction on gradients outside of this set. We believe this is not a strong assumption, for example, when the sub-level set of F(x1)F(\boldsymbol{x}_{1}) is bounded, then by the assumed continuity of gradients it trivially holds. Also, we just require an upper bound and it can even be exponentially large as we have an exponentially decaying coefficient to counteract it: notice how the term M1\|\boldsymbol{M}\|_{1} is multiplied by a term that decays exponentially with TT. Better still, when β2=0\beta_{2}=0, we no longer even need this assumption and the algorithm is entirely free of the influence of M1\|\boldsymbol{M}\|_{1}. To see why this is good, we show a refined lower bound of Gradient Descent under the relaxed smoothness scenario below which is originally in .

Theorem 2 shows that in the relaxed smoothness setting, GD with any constant step size will suffer from a linear term depending on L1ML_{1}M. On a side note, it is a fixed version of the lower bound in : we provide in Appendix an explanation of errors in their lower bound and our corrected proof.

Compared with GD, our algorithm only has an exponentially decaying dependence on L1ML_{1}M. We consider this to be substantial merit of our algorithm. Furthermore, when β2=0\beta_{2}=0 in which case we recover the SignSGD with Momentum algorithm, we can even show that it completely removes the effects of the unbounded gradient norms. Also notice that in such case we actually no longer need the assumption of Mj:=sup{Fxj(x):F(x)F(x1)}M_{j}:=\sup\left\{\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x})\right|:F(\boldsymbol{x})\leq F(\boldsymbol{x}_{1})\right\} being finite for each j[d]j\in[d] anymore, and the L1\|\boldsymbol{L_{1}}\|_{\infty} term does not appear in the final bound anymore.

We also would like to point out that this bound closely resembles the one achieved by SGD with gradient clipping algorithm except that we consider the coordinate-wise setting: take the setting of β2=0\beta_{2}=0 for example, we need at most O(Δmax{σ12L01ϵ4,d2σ12L14L013,dL12L01})\mathcal{O}\left(\Delta\max\left\{\frac{\|\boldsymbol{\sigma}\|_{1}^{2}\|\boldsymbol{L_{0}}\|_{1}}{\epsilon^{4}},\frac{d^{2}\|\boldsymbol{\sigma}\|_{1}^{2}\|\boldsymbol{L_{1}}\|_{\infty}^{4}}{\|\boldsymbol{L_{0}}\|_{1}^{3}},\frac{d\|\boldsymbol{L_{1}}\|_{\infty}^{2}}{\|\boldsymbol{L_{0}}\|_{1}}\right\}\right) to get a point x\boldsymbol{x} with F(x)1ϵ\|\nabla F(\boldsymbol{x})\|_{1}\leq\epsilon with high probability.

Remark 1 The almost surely bounded assumption 3 can be relaxed to sub-gaussian noise, using standard extensions of Freedman inequality [e.g., 16].

Remark 2 When β2=0\beta_{2}=0, we can prove an average-iterate complexity bound (see Proof of Theorem 1 for β2=0\beta_{2}=0 in Appendix A.3); yet, we use the min form for consistency between the two cases.

The proof of the theorem is highly technical and it uses recent advancements in the analysis of momentum methods , key techniques to deal with the (L0,L1)(L_{0},L_{1}) assumption , as well as a novel and essential inductive argument to control the norm of past gradients. We want to stress that the difficulty mainly comes from analyzing Adam-type updates when β2>0\beta_{2}>0, while for the other case of β2=0\beta_{2}=0 the proof is significantly simpler. The full proof is in the Appendix, but here we present a proof sketch that underlines the main steps. First, we list some key lemmas we used but move their proofs to the appendix due to space constraints.

With notations in Algorithm 1, for ττˉ=1β2ηdL1\tau\leq\bar{\tau}=\frac{\sqrt{1-\beta_{2}}}{\eta\sqrt{d}\|\boldsymbol{L_{1}}\|_{\infty}}, we have xtτxt21L1\|\boldsymbol{x}_{t-\tau}-\boldsymbol{x}_{t}\|_{2}\leq\frac{1}{\|\boldsymbol{L_{1}}\|_{\infty}}.

Lemma 5 limits our focus to the most recent τˉ\bar{\tau} steps on which Assumption 2 and Lemma 1 can apply.

Assume Assumption 3. With the notation of Algorithm 1, let j[d]j\in[d] and β11\beta_{1}\leq 1. Then, with probability at least 13δ1-3\delta, for any t0[t]t_{0}\in[t], we have

Lemma 7 is the major tool we use to handle the noise we incur during drawing stochastic gradients. It is derived based on Lemma 12 in .

With the notation of Algorithm 1 and under the assumptions of Theorem 1, if Fxj(xτ)Mj\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{\tau})\right|\leq M_{j} holds for all τt\tau\leq t and j[d]j\in[d], and D>0D>0, then, with probability at least 13tδ1-3t\delta we have that,

where BjηL0,j1β2(1β1)+β1τˉ(Mj+σj)+(1β1)EjB_{j}\triangleq\frac{\eta L_{0,j}}{\sqrt{1-\beta_{2}}(1-\beta_{1})}+\beta_{1}^{\bar{\tau}}(M_{j}+\sigma_{j})+(1-\beta_{1})E_{j} and D12ηdL11β2(1β1)D\triangleq 1-\frac{2\eta\sqrt{d}\|\boldsymbol{L_{1}}\|_{\infty}}{\sqrt{1-\beta_{2}}(1-\beta_{1})}.

Lemma 10 is similar to Lemma A.2 in which considered the deterministic and smooth setting; in contrast, our proof is much more challenging in that we need to tackle both the noise and the unbounded smoothness. With this lemma, we know that either the true gradient is small or that the update of our Algorithm 1 can be lower bounded.

Under Assumptions 1, 2, and 3, using the hyperparameters in Theorem 1, denoting α=1β1\alpha=1-\beta_{1} and ϵt=mtF(xt)\boldsymbol{\epsilon}_{t}=\boldsymbol{m}_{t}-\nabla F(\boldsymbol{x}_{t}), for all tt and j[d]j\in[d] we have, with probability at least 13δ1-3\delta,

Lemma 12 shows how the use of momentum can help control the noise by choosing β1\beta_{1} wisely. It is adapted from the proof of Theorem 2 in but with the added difficulty of unbounded smoothness.

Observing the formula of setting β1\beta_{1}, we can see that when σ1L01Δ/T\|\boldsymbol{\sigma}\|_{1}\leq\sqrt{\|\boldsymbol{L_{0}}\|_{1}\Delta}/\sqrt{T}, β1=0\beta_{1}=0. As β2<β1\beta_{2}<\beta_{1}, Algorithm 1 reduces to SignSGD. In this case, the key component is Lemma 12 using which we are able to show that t=1Tmt,jFxj(xt)\sum^{T}_{t=1}\left|m_{t,j}-\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right| can be controlled as C1t=1TFxj(xt)+C2C_{1}\sum^{T}_{t=1}\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|+C_{2}. The summation of true gradients over time can then be offsetted by choosing η\eta and β1\beta_{1} wisely when we invoke the descent lemma 1. The rest is standard.

Now for the other case in which σ1>L01Δ/T\|\boldsymbol{\sigma}\|_{1}>\sqrt{\|\boldsymbol{L_{0}}\|_{1}\Delta}/\sqrt{T}, we take a different route.

First, notice that Assumption 2 and the Descent Lemma 1 only hold when two points are not too far away. Thus, we need to restrict our attention to the recent updates (Lemma 5), beyond which we would have no control. This means we want the influence of those updates too long ago to not have a big effect on the current one. To make this happen, one natural idea is to use a bounded gradient assumption, then with the use of exponential averaging, their effect would be quickly reduced. Yet, assuming directly that all gradients are bounded would trivialize the (L0,L1)(\boldsymbol{L_{0}},\boldsymbol{L_{1}}) assumption. Thus, we pose a much weaker condition, assuming that Mj:=sup{Fxj(x):F(x)F(x1)}M_{j}:=\sup\left\{\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x})\right|:F(\boldsymbol{x})\leq F(\boldsymbol{x}_{1})\right\} being finite for each j[d]j\in[d]. Then, we prove that MjM_{j} will provide an upper bound to all the true gradients the algorithm see. We prove it using induction, analyzing separately the case that either the true gradient is already very small and we have reached the proximity of a stationary point, or the objective function is monotonically non-increasing and the gradient remains bounded.

Having controlled the past gradients, we prove in Lemma 10 that the update of Algorithm 1 is either very small that we can pass or having a constant lower bound that we can use in the Descent Lemma 1.

Also, considering that this is the stochastic setting, noise typically slows down convergence or can even cause the algorithm to diverge if the hyperparameters are not chosen wisely. To handle this, we invoke Freedman’s inequality to show that the addition of adjacent stochastic noise almost cancels out each other and the absolute value of the sum remains controlled (Lemma 7).

Yet, we still need another block to handle the difference between the true gradient and the momentum as we are updating in the direction of the momentum instead of the true gradient. Turns our that we can prove that sign(mt,j)=sign(Fxj(xt))\text{sign}(m_{t,j})=\text{sign}\left(\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right) when Fxj(xt)\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right| is not too small. As before, in the case Fxj(xt)\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right| is small, we have converged on that coordinate.

Combining all these blocks together, we are able to arrive at the final results. ∎

Experiments

We conducted our experiments using PyTorch on Nvidia V100 GPUs.

To validate the efficacy of our Algorithm 1, we compare it with Adam , SGD , SGD Momentum Normalized , SGDClipGrad, and SGDClipMomentum. The latter two are from Algorithm 1 in where SGDClipGrad corresponds to the case when ν=0\nu=0 and SGDClipMomentum corresponds to when ν=1\nu=1.

Training Unless otherwise specified, we use grid-search to fine-tune the initial learning rate for all optimizers, as well as the clipping threshold for SGDClipGrad and SGDClipMomentum, and β2\beta_{2} for Adam and our algorithm, to select the one giving the best validation performance on a separated validation set. We then employ the best performing hyperparameters to train the model over all training data and report the testing performance. The testing is repeated with random seeds 5 times to eliminate the influence of stochasticity. For more details, please refer to Section A.4.

Resnet for Image Classification on CIFAR-10 We employ the 20-layer Residual Network model to do image classification on the CIFAR-10 dataset. Images are normalized per channel using the means and standard deviations computed from all training images. We adopt the data augmentation technique following (for training only): 4 pixels are padded on each side of an image and a 32 × 32 crop is randomly sampled from the padded image or its horizontal flip. The mini-batch size is 128128 and we train all algorithms for 164164 epochs. We do not employ any learning rate decay schedule in order to focus on the comparison of the optimizers themselves. We fixed the weight decay value to be 0.00010.0001 and the momentum parameter (β1)\beta_{1}) to be 0.90.9. Figure 4 and Table 1 report the training and testing performance for each algorithm, showing that ours is among the best.

LSTM for Language Modeling on Penn Treebank We adopt a 3-layer AWD-LSTM to do language modeling on the Penn Treebank (PTB) dataset (word level). The mini-batch size is 4040 and we trained each algorithm for 750750 epochs. Apart from the hyperparameters we stated above, we further fine-tuned the weight decay value for all algorithms noticing its significant influence on the performance. We choose the set of hyperparameters that give the smallest final validation perplexity. We report the results in Figure 5 and Table 1. It can be seen that we can match the performance of Adam while beating the others.

For Figure 1 which verifies the original form (1) of the (L0,L1)(L_{0},L_{1}) condition using the norm, we followed the method in Section H.3 of . Specifically, given xt\boldsymbol{x}_{t} and xt+1\boldsymbol{x}_{t+1}, denote d:=xt+1xt\boldsymbol{d}:=\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}. We estimate the smoothness at xt\boldsymbol{x}_{t} by

where {δ1,δ2,,δN}\{\delta_{1},\delta_{2},\ldots,\delta_{N}\} denotes the sample locations and we use {16,26,36,46,56}\{\frac{1}{6},\frac{2}{6},\frac{3}{6},\frac{4}{6},\frac{5}{6}\}.

For Figure 3 verifying the coordinate-wise version (3) of the (L0,L1)(\boldsymbol{L_{0}},\boldsymbol{L_{1}}) condition, note that the equation is symmetric in that if we just swap x\boldsymbol{x} and y\boldsymbol{y} it shall still holds. Thus, during plotting, we compare \nicefracFxj(xt+1)Fxj(xt)xt+1,jxt,j\nicefrac{{\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t+1})-\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|}}{{|x_{t+1,j}-x_{t,j}|}} vs. min(Fxj(xt),Fxj(xt+1))\min\left(\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|,\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t+1})\right|\right).

Figure 1(a) is on training a 22-layer Transformer Encoder to do language modeling on the Wikitext-2 dataset. The implementation, settings, and parameter choices follow this.https://pytorch.org/tutorials/beginner/transformer_tutorial.html We only plot the first 55 training epochs. Figure 1(b) and 3 are on training a 66-layer Transformer to do machine translation on the WMT’16 Multimodal Machine Translation Task German-English dataset. The implementation of the transformer is forked from herehttps://github.com/jadore801120/attention-is-all-you-need-pytorch and we also follow their default settings. The mini-batch size is 256256 and we trained for 400400 epochs using Adam and report the whole training trajectory.

3 Clipping does not Affect Adam’s Performance

We compare clipping and non-clipping for Adam optimizer on the Wikitext-103 (103 million tokens, 180MB) language modeling task, with a 16-layer GPT-2 transformer model . This GPT-2 model has an input length of 256 tokens, 410-dimension word embedding, 16 Attention layers with 10 Attention heads and 2100 hidden dimensions. Model size is 201.58 MB. The vocabulary size is 28996. We use the hyper-parameter settings prescribed in : batch size 256, warm up learning rate from 0 to 2.5×1042.5\times 10^{-4} in the first 64000 samples (i.e., 250 iterations) and then cosine-anneal learning rate to zero, on top of an Adam optimizer. It takes about 40 hours to train 200 epochs on 8 V100 GPUs. We use clipping threshold max_norm 0.25 for the entire model as prescribed in the literature . We also count that with this clipping scheme, clipping occurs in every single batch. As we can see from Figure 2, neither training loss (2.79 vs 2.76) nor perplexity score (27.92 vs 27.97) differs much in the clipping and non-clipping case, which is consistent with our theory that Adam naturally achieves gradients clipping effect.

Conclusion and Limitations

Smoothness has been a widely adopted condition for proving convergence rates of algorithms in the non-convex optimization scenario. Yet, it has been found that this assumption does not capture losses when employing some deep learning models including RNNs and LSTMs. In light of this, a relaxed smoothness assumption was proposed that aligns well with the practice. We observed that the loss surface of training using Transformers also exhibits this relaxed smoothness. Under this assumption, SGD with clipped gradient has been proven to work well. However, we found that clipping is not necessary for achieving convergence in such a setting. Indeed, we showed that a generalized SignSGD algorithm does not require explicit clipping but can almost guarantee the same bound as SGD with clipping. In the analyses, we identified the key effect of using momentum in analyzing Adam-type algorithms, that it reduces both the noise and the unbounded gradient norms. Finally, we conducted a variety of deep learning tasks showing that our algorithm can match Adam’s performance while exceeding others.

Limitations The current work is in no way a perfect one and there are many directions worth exploring beyond it. First of all, though our algorithm could be seen as a close resemblance to the original Adam algorithm, they are still not equal. Considering the huge popularity of Adam and its established effectivity in practice, it is worth studying whether Adam in its original form can converge in the relaxed smooth setting. Second, while our Theorem 1 are upper bounds and cannot be directly compared between the two cases of β2\beta_{2}, it does suggests that β2=0\beta_{2}=0 minimizes the worst-case convergence rate. However, it still does not fully explain the phenomenon that a choice of β2\beta_{2} close to 11 yields better performance in using our Algorithm 1 as well as Adam in practice. Third, despite there are lower bounds showing that, for example, GD with a constant step size can be arbitrarily worse than GD with clipping, it would be more meaningful to study whether the relaxed smooth condition is inherently more difficult, possibly by establishing a lower bound for all first-order optimization algorithms. Fourth, we did show that Transformers observe the relaxed smoothness condition, but we consider it more beneficial to research in-depth what properties or structures make a model satisfy such conditions. Finally, when conducting our experiments, we observed that the weight decay value plays a prominent role in each optimizer’s performance, and that the best weight decay value varies for different optimizers. Thus, one potential direction would be to explore different ways of incorporating the regularization in a way to preserve the scale-freeness of Algorithm 1, just as AdamW does .

Acknowledgements

Michael Crawshaw is supported by the Institute for Digital Innovation fellowship from George Mason University. Michael Crawshaw and Mingrui Liu are both supported by a grant from George Mason University. Francesco Orabona is supported by the National Science Foundation under the grants no. 1908111 “AF: Small: Collaborative Research: New Representations for Learning Algorithms and Secure Computation”, no. 2022446 “Foundations of Data Science Institute”, and no. 2046096 “CAREER: Parameter-free Optimization Algorithms for Machine Learning”.

References

Appendix A Appendix

Below, we prove the descent lemma for coordinate-wisely (L0,L1)(\boldsymbol{L_{0}},\boldsymbol{L_{1}})-smooth functions satisfying Assumption 2.

where the second inequality uses the fact that abF(x)dxabF(x)dx\left|\int^{b}_{a}F(x)dx\right|\leq\int^{b}_{a}|F(x)|dx and the final one is due to Assumption 2. ∎

The following Lemma shows that our coordinate-wise (L0,L1)(\boldsymbol{L_{0}},\boldsymbol{L_{1}}) smooth assumption 2 is equivalent to the original (L0,L1)(L_{0},L_{1}) smooth assumption (1) at least in 11-d case.

(2) \Rightarrow (1) This is a special case for 11-d and c=1c=1 of Corollary A.4 in . ∎

When L0,j=L0L_{0,j}=L_{0} and L1,j=L1L_{1,j}=L_{1} for all j[d]j\in[d], Assumption 2 implies (4) (up to constants).

Suppose all L0,j,L1,jL_{0,j},L_{1,j} are the same across jj, then we have

A.2 Proof of Lower Bound

By Lemma 2, we know that, in 1-d case, our coordinate-wise (L0,L1)(\boldsymbol{L_{0}},\boldsymbol{L_{1}}) assumption 2 is equivalent as the original one (1). Thus, without loss of generality, we use the original condition (1) in the proof. We will construct two different (L0,L1)(L_{0},L_{1})-smooth functions based on the value of η\eta.

Case η>2ML1(lnML1L0+1)\eta>\frac{2}{ML_{1}}\left(\ln\frac{ML_{1}}{L_{0}}+1\right). In this case, we can construct a function on which GD does not converge, hence the lower bound is trivially true. Consider the function

Note that ff is (L0,L1)(L_{0},L_{1})-smooth. Without loss of generality, we can assume x0=1L1(lnML1L0+1)x_{0}=\frac{1}{L_{1}}\left(\ln\frac{ML_{1}}{L_{0}}+1\right), in fact if this is not the case we can translate the function ff accordingly. This setting of x0x_{0} guarantees that the bound on the gradient is correct. Moreover, with this choice, we claim the function will diverge. To see this, we use mathematical induction to show that xt+1>xt|x_{t+1}|>|x_{t}| and sign(xt+1)sign(xt)\text{sign}(x_{t+1})\neq\text{sign}(x_{t}) for any t0t\geq 0. First, for the case when t=0t=0, we have

Then suppose the condition holds up until tt and we prove for t+1t+1. From the formula of ff, we have that sign(f(x))=sign(x)\text{sign}(f^{\prime}(x))=\text{sign}(x) and that ff is monotonically increasing with x|x|. Thus, from the update of gradient descent which moves along the negative direction of the gradient, if we can show that xt+1>xt|x_{t+1}|>|x_{t}|, then sign(xt+1)sign(xt)\text{sign}(x_{t+1})\neq\text{sign}(x_{t}). This yields

Now, note that ψ(x)=2xL1exp(L1x1)\psi(x)=\frac{2|x|L_{1}}{\exp(L_{1}|x|-1)} is decreasing for x>1L1x>\frac{1}{L_{1}} and increasing for x<1L1x<-\frac{1}{L_{1}}. Hence, we have that

where the first inequality is true by the choice of x0>1L1x_{0}>\frac{1}{L_{1}} and the condition on η\eta and the second one is true by the induction hypothesis.

Case η2ML1(lnML1L0+1)\eta\leq\frac{2}{ML_{1}}\left(\ln\frac{ML_{1}}{L_{0}}+1\right).

We have that ff is (L0,0)(L_{0},0)-smooth, hence also (L0,L1)(L_{0},L_{1})-smooth. Note that the presence of the fourth power makes this function twice differentiable. Moreover, the maximum gradient in this case is ϵM\epsilon\leq M.

As before, without loss of generality, let the initial point x0=3ϵ2L0+Δx_{0}=\frac{3\epsilon}{2L_{0}}+\Delta, where Δ>0\Delta>0. We have that f(x0)f=ϵ(Δ+3ϵ2L0)9ϵ216L0f(x_{0})-f^{*}=\epsilon\left(\Delta+\frac{3\epsilon}{2L_{0}}\right)-\frac{9\epsilon^{2}}{16L_{0}}, hence Δ=1ϵ(f(x0)f)15ϵ16L0\Delta=\frac{1}{\epsilon}\left(f(x_{0})-f^{*}\right)-\frac{15\epsilon}{16L_{0}}. Now, while we stay on the last branch of the function, we have

we guarantee f(xt)=ϵ|f^{\prime}(x_{t})|=\epsilon. ∎

As we said in the main text, unfortunately, the lower bound theorem in is wrong, both statement and proof. First of all, they have a logarithm of a quantity with units, MM, which is an undefined mathematical operation. A closer look at the proof reveals that, differently from the statement of their theorem, they construct a function with L0=L1L_{0}=L_{1}, which explains why these terms are missing in the logarithm. Moreover, it is also unclear if the second constructed function satisfies the assumptions of the theorem. We correct all these issues by properly scaling the constructed functions so that they always satisfy the (L0,L1)(L_{0},L_{1}) condition and all the units are coherent. This result in the correct term inside the logarithm and the right conditions on L0L_{0}, L1L_{1}, MM, and ϵ\epsilon.

A.3 Proof of Theorem 1

We first write down some notations here that we will use heavily later for easier reference:

Also, we would need the following formula many times:

where in the first inequality we used the fact that (1x)1/x1e(1-x)^{1/x}\leq\frac{1}{e} for 0<x<10<x<1.

With the notations in Algorithm 1, for each coordinate j[d]j\in[d] we have

The following lemma shows when we can apply Assumption 2 and Lemma 1.

With notations in Algorithm 1, for ττˉ=1β2ηdL1\tau\leq\bar{\tau}=\frac{\sqrt{1-\beta_{2}}}{\eta\sqrt{d}\|\boldsymbol{L_{1}}\|_{\infty}}, we have xtτxt21L1\|\boldsymbol{x}_{t-\tau}-\boldsymbol{x}_{t}\|_{2}\leq\frac{1}{\|\boldsymbol{L_{1}}\|_{\infty}}.

The following two lemmas are the major tools we use to analyze the effects of noises.

Assume Assumption 3. With the notation of Algorithm 1, let j[d]j\in[d] and β1<1\beta_{1}<1. Then, with probability at least 13δ1-3\delta, for any t0[t]t_{0}\in[t], we have

The following lemma upper bounds the differences between recent true gradients and the current one.

With the notation of Algorithm 1 and under the assumptions in Theorem 1, for any j[d]j\in[d] and any t0t_{0} with tt0τˉ=1β2ηdL1t-t_{0}\leq\bar{\tau}=\frac{\sqrt{1-\beta_{2}}}{\eta\sqrt{d}\|\boldsymbol{L_{1}}\|_{\infty}}, we have

where the first inequality is due to Assumption 2 and Lemma 5, the second inequality uses Lemma 4, and the final inequality uses the fact that k=1Nkak1(1a)2\sum^{N}_{k=1}ka^{k}\leq\frac{1}{(1-a)^{2}} for any 0<a<10<a<1. ∎

The following lemma upper bounds a past momentum with the current one.

With the notation of Algorithm 1 and under the assumptions of Theorem 1, for any ττˉ=1β2ηdL1\tau\leq\bar{\tau}=\frac{\sqrt{1-\beta_{2}}}{\eta\sqrt{d}\|\boldsymbol{L_{1}}\|_{\infty}}, with probability at least 13δ1-3\delta, it holds that

We now upper bound the first term of (59) using Lemma 8 by using the fact that τ1β2ηL11\tau\leq\frac{\sqrt{1-\beta_{2}}}{\eta\|\boldsymbol{L_{1}}\|_{1}}:

Finally, the second term of (59) can be bounded using Lemma 7 by noticing that

The following Lemma is adapted from . Yet, they only considered Adam under the LL-smooth setting and when there is no noise. The existence of noise and the relaxed smoothness assumption makes the proofs substantially more challenging. With this lemma, we know that either the true gradient is small or that the update of our Algorithm 1 can be lower bounded.

With the notation of Algorithm 1 and under the assumptions of Theorem 1, if Fxj(xτ)Mj\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{\tau})\right|\leq M_{j} holds for all τt\tau\leq t and j[d]j\in[d], and D>0D>0, then, with probability at least 13tδ1-3t\delta we have that,

Given that Fxj(xτ)Mj\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{\tau})\right|\leq M_{j} for any τt\tau\leq t and j[d]j\in[d], using Lemma 4 and Assumption 3, it is immediate to show that mt,jMj+σj|m_{t,j}|\leq M_{j}+\sigma_{j}. Then, denote τ^=τˉ=1β2ηdL1\hat{\tau}=\lfloor\bar{\tau}\rfloor=\left\lfloor\frac{\sqrt{1-\beta_{2}}}{\eta\sqrt{d}\|\boldsymbol{L_{1}}\|_{\infty}}\right\rfloor namely the largest integer that is no greater than τˉ\bar{\tau}, from Lemma 4, we have

where the final inequality uses the assumption that β2<β1\sqrt{\beta_{2}}<\beta_{1}. Using Lemma 9 and the definition of ρ=1β21/2β11(0,1]\rho=1-\beta_{2}^{1/2}\beta_{1}^{-1}\in(0,1], with probability at least 13tδ1-3t\delta, as we need to invoke Lemma 7 for at most tt times, we have

where in the last inequality we used the fact that 1ρ11β2\frac{1}{\rho}\geq\frac{1}{\sqrt{1-\beta_{2}}}.

Thus, we consider following two cases depending on the relative size of mt,j|m_{t,j}| vs. CjFxj(xt)+BjC_{j}\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|+B_{j}.

Case 1: mt,j>CjFxj(xt)+Bj|m_{t,j}|>C_{j}\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|+B_{j}, then

Case 2: mt,jCjFxj(xt)+Bj|m_{t,j}|\leq C_{j}\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|+B_{j}, then we have

Also, for mt,j|m_{t,j}| we have from Lemma 4 that

The first term can be bounded below by using Lemma 8 and that τ^+1τˉ\hat{\tau}+1\geq\bar{\tau}:

The second term can be bounded using Lemma 7. Thus,

where we used (39) and that ex1xe^{-x}\leq\frac{1}{x} for x>0x>0.

Therefore, with probability at least 13tδ1-3t\delta we have

Given that D>0D>0, depending on the relative size of Fxj(xt)\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right| vs. BjB_{j}, we have following two cases.

Case 2.1: Fxj(xt)<5BjD\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|<\frac{5B_{j}}{D}.

Case 2.2: Fxj(xt)5BjD\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|\geq\frac{5B_{j}}{D}, using the fact that the r.h.s of (79) is decreasing in BjB_{j}, we have

where in the last inequality we used the fact that Cj+D2C_{j}+D\leq 2. Note that D<1D<1 so the above lower bound is smaller than (71). ∎

The following two lemmas are for the special case of β2=0\beta_{2}=0.

With choices of parameters in Theorem 1, when β2=0\beta_{2}=0, we have xt+1xt2=ηd1L1\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}\|_{2}=\eta\sqrt{d}\leq\frac{1}{\|\boldsymbol{L_{1}}\|_{\infty}}.

Using the fact that α1\alpha\leq 1 and the condition on TT, we have

The following lemma is adapted from the proof of Theorem 2 in .

Under Assumptions 1, 2, and 3, using the settings of the hyperparameters in Theorem 1, denoting α=1β1\alpha=1-\beta_{1} and ϵt=mtF(xt)\boldsymbol{\epsilon}_{t}=\boldsymbol{m}_{t}-\nabla F(\boldsymbol{x}_{t}), for all t1t\geq 1 and j[d]j\in[d] we have, with probability at least 13δ1-3\delta,

We can derive the following recursive formulation for any t1t\geq 1:

Take the absolute value of both sides, to obtain

where the second inequality uses (84) and Lemma 7, the fourth and fifth inequalities use (84), and the final one is due to (83). ∎

Under Assumptions 1, 2, and 3, assume Mj:=sup{Fxj(x):F(x)F(x1)}M_{j}:=\sup\left\{\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x})\right|:F(\boldsymbol{x})\leq F(\boldsymbol{x}_{1})\right\} is finite for each j[d]j\in[d], let Δ\Delta be any upper bound on F(x1)FF(\boldsymbol{x}_{1})-F^{*}, α=min(L01Δσ1T,1)\alpha=\min\left(\frac{\sqrt{\|\boldsymbol{L_{0}}\|_{1}}\sqrt{\Delta}}{\|\boldsymbol{\sigma}\|_{1}\sqrt{T}},1\right), β1=1α\beta_{1}=1-\alpha, β2β1<1\frac{\sqrt{\beta_{2}}}{\beta_{1}}<1, ρ=1β2β1\rho=1-\frac{\sqrt{\beta_{2}}}{\beta_{1}}, η=ΔαL01T\eta=\frac{\sqrt{\Delta\alpha}}{\sqrt{\|\boldsymbol{L_{0}}\|_{1}}\sqrt{T}}, for Tmax(100dΔL12(1β2)ρ2L01,10000d2Δσ12L14(1β2)2ρ4L013)T\geq\max\left(\frac{100d\Delta\|\boldsymbol{L_{1}}\|_{\infty}^{2}}{(1-\beta_{2})\rho^{2}\|\boldsymbol{L_{0}}\|_{1}},\frac{10000d^{2}\Delta\|\boldsymbol{\sigma}\|_{1}^{2}\|\boldsymbol{L_{1}}\|_{\infty}^{4}}{(1-\beta_{2})^{2}\rho^{4}\|\boldsymbol{L_{0}}\|_{1}^{3}}\right), Algorithm 1 guarantees, with probability at least 1δ1-\delta, that

Furthermore, for the case when β2=0\beta_{2}=0, we have the following refined guarantee:

From Lemma 11 we know that xt+1xt21L1\|\boldsymbol{x}_{t+1}-\boldsymbol{x}_{t}\|_{2}\leq\frac{1}{\|\boldsymbol{L_{1}}\|_{\infty}} for all t[T]t\in[T]. Thus, we can apply Lemma 1 to have

Thus, denoting by ϵt=mtF(xt)\boldsymbol{\epsilon}_{t}=\boldsymbol{m}_{t}-\nabla F(\boldsymbol{x}_{t}) gives

Sum both sides over t=1,,Tt=1,\dots,T, to have

Use Lemma 12 to bound each coordinate of t=1Tϵt1\sum^{T}_{t=1}\|\boldsymbol{\epsilon}_{t}\|_{1}:

The above one holds with probability at least 13Tδ1-3T\delta as we invoked Lemma 12 for TT times which in turns means invoking Lemma 7 for TT times. Yet, the above inequality would need to sum from j=1j=1 to dd, meaning in total we would invoke Lemma 7 for dTdT times. Thus, following results hold with probability at least 13dTδ1-3dT\delta.

Now, put the above inequality back into (103) to have

Now, using the definitions of η\eta and α\alpha, and the conditions on TT, we have

Divide both sides by TT and rearrange terms to give

Now, we need to consider the following two cases:

σ1<L01ΔT\|\boldsymbol{\sigma}\|_{1}<\frac{\sqrt{\|\boldsymbol{L_{0}}\|_{1}}\sqrt{\Delta}}{\sqrt{T}}: then α=1\alpha=1 and η=ΔL01T\eta=\frac{\sqrt{\Delta}}{\sqrt{\|\boldsymbol{L_{0}}\|_{1}}\sqrt{T}}

σ1L01ΔT\|\boldsymbol{\sigma}\|_{1}\geq\frac{\sqrt{\|\boldsymbol{L_{0}}\|_{1}\Delta}}{\sqrt{T}}: then α=L01Δσ1T1\alpha=\frac{\sqrt{\|\boldsymbol{L_{0}}\|_{1}\Delta}}{\|\boldsymbol{\sigma}\|_{1}\sqrt{T}}\leq 1 and η=Δ3/4L011/4σ1T3/4\eta=\frac{\Delta^{3/4}}{\|\boldsymbol{L_{0}}\|_{1}^{1/4}\sqrt{\|\boldsymbol{\sigma}\|_{1}}T^{3/4}}

Put δ=δ3dT\delta^{\prime}=\frac{\delta}{3dT} concludes the proof. ∎

Note that when σ1L01ΔT\|\boldsymbol{\sigma}\|_{1}\leq\frac{\sqrt{\|\boldsymbol{L_{0}}\|_{1}\Delta}}{\sqrt{T}}, α=1\alpha=1, β2<β12=0\beta_{2}<\beta_{1}^{2}=0. Then our Generalized SignSGD algorithm 1 reduces to the SignSGD algorithm, and thus has the same guarantee of (125). Therefore, we only consider the other case when σ1L01ΔT\|\boldsymbol{\sigma}\|_{1}\geq\frac{\sqrt{\|\boldsymbol{L_{0}}\|_{1}\Delta}}{\sqrt{T}}.

Note that the only randomness comes from evaluating stochastic gradients. In the following proof, we will need to invoke Lemma 7 for TT times for each coordinate j[d]j\in[d]. Therefore, the following results hold with probability at least 13dTδ1-3dT\delta. For simplicity, we use the term "with high probability" later in the proof to denote this.

We derive the following quantity which will be used multiple times later:

First, from Lemma 10 we have D=12dL1η(1β1)1β2D=1-\frac{2\sqrt{d}\|\boldsymbol{L_{1}}\|_{\infty}\eta}{(1-\beta_{1})\sqrt{1-\beta_{2}}}. Then, from the choice of the hyperparameters, we have

Thus, we have D11512D\geq 1-\frac{1}{5}\geq\frac{1}{2} and, as β2/β1<1\sqrt{\beta_{2}}/\beta_{1}<1,

Also, for those coordinates with small gradients Fxj(xt)<5BjD10Bj\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|<\frac{5B_{j}}{D}\leq 10B_{j}, we have

We are now ready to prove the theorem. We will need to use Lemma 10, hence we first need to show that all past true gradients are bounded by MjM_{j}, namely that, for any tt, Fxj(xτ)Mj\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{\tau})\right|\leq M_{j} holds for all τt\tau\leq t and all j[d]j\in[d]. From the definition of MjM_{j} stated in the theorem, in order to guarantee this, we only need to prove that F(xτ)F(x1)F(\boldsymbol{x}_{\tau})\leq F(\boldsymbol{x}_{1}) for all τt\tau\leq t. We will prove this by induction.

For t=1t=1 the condition is trivially true.

We then assume that the condition holds for tt and prove based on this that it still holds for t+1t+1.

For those coordinates with Fxj(xt)5BjD\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|\geq\frac{5B_{j}}{D}, denote τ^=τˉ=1β2ηdL1\hat{\tau}=\lfloor\bar{\tau}\rfloor=\left\lfloor\frac{\sqrt{1-\beta_{2}}}{\eta\sqrt{d}\|\boldsymbol{L_{1}}\|_{\infty}}\right\rfloor, with high probability, we have

where the first equality comes from Lemma 4; for the second inequality, the third term can be bounded using Lemma 8, and the final term can be bounded using Lemma 7; for the third inequality, we used (39) and that ex1xe^{-x}\leq\frac{1}{x} for x>0x>0. This inequality implies that sign(mt,j)=sign(Fxj(xt))\text{sign}(m_{t,j})=\text{sign}\left(\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right) with high probability.

Denote Ut={j[d]:Fxj(xt)5BjD}\mathcal{U}_{t}=\left\{j\in[d]:\left|\frac{\partial F}{\partial x_{j}}(\boldsymbol{x}_{t})\right|\geq\frac{5B_{j}}{D}\right\}. From the choices of hyperparameters we can show that 1τˉ1\leq\bar{\tau} which means xt+1xt21L1\|\boldsymbol{x}_{t+1}-x_{t}\|_{2}\leq\frac{1}{\|\boldsymbol{L_{1}}\|_{\infty}} (Lemma 5). Thus, using Lemma 1, with high probability we have

where the second inequality uses (136), (143), and Lemma 4, and the third inequality uses Lemma 10 and (133).

Now, noticing the conditions on η\eta, α\alpha, β2<β12<β1\beta_{2}<\beta_{1}^{2}<\beta_{1}, and TT, use (131) to have

or F(xt+1)F(xt)0F(\boldsymbol{x}_{t+1})-F(\boldsymbol{x}_{t})\leq 0.

This concludes the mathematical induction up until (151) is met for the first time which we denote as T0T_{0}. In the following, we will explain that if (151) holds then the algorithm has found an approximate stationary point.

Now, suppose TT0T\leq T_{0}, then (150) holds all the time and we sum both sides of it from 11 to TT to have, with high probability,

Note that RHS of (151) is less than RHS of (153). Thus, for the other case of T>T0T>T_{0}, (153) still holds.

Recalling that ρ=1β2β1\rho=1-\frac{\sqrt{\beta_{2}}}{\beta_{1}}, A=ρ101β2A=\frac{\rho}{10\sqrt{1-\beta_{2}}}, β2<β12<β1\beta_{2}<\beta_{1}^{2}<\beta_{1}, and BjηL0,j(1β1)1β2+β1τˉ(Mj+σj)+6(1β1)σjmax(1,log(1/δ))+6(1β1)1β12σj2max(1,log(1/δ))B_{j}\triangleq\frac{\eta L_{0,j}}{(1-\beta_{1})\sqrt{1-\beta_{2}}}+\beta_{1}^{\bar{\tau}}(M_{j}+\sigma_{j})+6(1-\beta_{1})\sigma_{j}\max(1,\log(1/\delta))+\frac{6(1-\beta_{1})}{\sqrt{1-\beta_{1}^{2}}}\sqrt{\sigma_{j}^{2}\max(1,\log(1/\delta))}, we have

When σ1L01ΔT\|\boldsymbol{\sigma}\|_{1}\geq\frac{\sqrt{\|\boldsymbol{L_{0}}\|_{1}\Delta}}{\sqrt{T}}, then α=L01Δσ1T\alpha=\frac{\sqrt{\|\boldsymbol{L_{0}}\|_{1}\Delta}}{\|\boldsymbol{\sigma}\|_{1}\sqrt{T}} and η=Δ3/4L011/4σ1T3/4\eta=\frac{\Delta^{3/4}}{\|\boldsymbol{L_{0}}\|_{1}^{1/4}\sqrt{\|\boldsymbol{\sigma}\|_{1}}T^{3/4}}. Hence, we obtain

Finally, taking δ=δ3dT\delta^{\prime}=\frac{\delta}{3dT}, we obtain the stated result. ∎

A.4 More Experiment Details

In this section, we provide more details for the experiments we show in Section 5.

Hyperparameter Tuning During the validation stage, we used grid-search to fine-tune respective hyperparameters and choose the ones that yield the best validation results. We tuned the hyperparameters using the following two-stage grid searching strategy: First, search over a coarse grid, and select the one yielding the best validation result. Next, continue searching in a fine grid centering at the best-performing hyperparameters found in the coarse stage, and in turn, take the best one as the final choice. Also, whenever the best-performing hyperparameters lie in the boundary of the searching grid, we always extend the grid to make the final best-performing hyperparameters fall into the interior of the grid, if possible.

Resnet on CIFAR-10 We randomly selected 10%10\% images from the training dataset for validation. Yet, during testing, we trained on the whole training dataset. The detailed search ranges and the hyperparameter choices yielding the highest validation accuracy for each optimizer are listed in Table 2.

AWD-LSTM on Penn Treebank We used the original train-validation-test split that comes with the dataset. The momentum parameter (β1)\beta_{1}) is fixed to be 0.90.9 except for SGDClipGrad which does not use momentum. The detailed search ranges and the hyperparameter choices yielding the lowest validation perplexity for each optimizer are listed in Table 3.

A.5 Training a Transformer Model on WMT’16 German-English Translation Task

We noted that Transformers are gaining huge popularities recently and reported in Figure 1 and 3 that Transformers observe the relaxed smoothness conditions. Thus, to further showcase the effectivity of our algorithm 1 compared with other optimizers listed in 5.1, we train a 66-layer Transformer model to do machine translation on the WMT’16 Multimodal Machine Translation Task German-English dataset. The implementation of the transformer is forked from herehttps://github.com/jadore801120/attention-is-all-you-need-pytorch and we inherited the default model structure. We also adopted the warm-up steps of 128000128000 and the learning rate decay strategy as recommended by the GitHub repo. The mini-batch size is 256256 and we trained for 400400 epochs.

For the hyperparameter tuning, the momentum parameter (β1)\beta_{1}) is fixed to be 0.90.9 except for SGDClipGrad which does not use momentum. We then used grid-search to fine-tune the initial learning rate and the weight decay value for all optimizers, as well as the clipping threshold for SGDClipGrad and SGDClipMomentum, and β2\beta_{2} for Adam and our algorithm. We select the combination of hyperparameters that gives the lowest validation perplexity. The detailed search ranges and the hyperparameter choices yielding the lowest validation perplexity for each optimizer are listed in Table 3.

We then employ the best-performing hyperparameters to report the testing performance. The testing is repeated with random seeds 5 times to eliminate the influence of stochasticity. The results are reported in Figure 6 and Table 4. Here the accuracy means the proportion of correct correspondences of words, namely the same word at the same location, between the machine-translated output and the target. It can be seen that we can still match the performance of Adam while beating the others. Also, note that the curves of SGD Momentum and the ones of SGDClipMomentum overlap as they utilize the same weight decay values and initial learning rates, and turns out clipping is seldomly performed when employing SGDClipMomentum.