Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu

Introduction

Stochastic gradient descent (SGD) is now one of the most dominant approaches for training deep neural networks (Goodfellow et al., 2016). In each iteration, SGD only performs one parameter update on a mini-batch of training examples. SGD is simple and has been proved to be efficient, especially for tasks on large datasets. In recent years, adaptive variants of SGD have emerged and shown their successes for their convenient automatic learning rate adjustment mechanism. Adagrad (Duchi et al., 2011) is probably the first along this line of research, and significantly outperforms vanilla SGD in the sparse gradient scenario. Despite the first success, Adagrad was later found to demonstrate degraded performance especially in cases where the loss function is nonconvex or the gradient is dense. Many variants of Adagrad, such as RMSprop (Hinton et al., 2012), Adam (Kingma and Ba, 2015), Adadelta (Zeiler, 2012), Nadam (Dozat, 2016), were then proposed to address these challenges by adopting exponential moving average rather than the arithmetic average used in Adagrad. This change largely mitigates the rapid decay of learning rate in Adagrad and hence makes this family of algorithms, especially Adam, particularly popular on various tasks. Recently, it has also been observed (Reddi et al., 2018) that Adam does not converge in some settings where rarely encountered large gradient information quickly dies out due to the “short momery” problem of exponential moving average. To address this issue, Amsgrad (Reddi et al., 2018) has been proposed to keep an extra “long term memory” variable to preserve the past gradient information and to correct the potential convergence issue in Adam. There are also some other variants of adaptive gradient method such as SC-Adagrad / SC-RMSprop (Mukkamala and Hein, 2017), which derives logarithmic regret bounds for strongly convex functions.

On the other hand, people recently found that for largely over-parameterized neural networks, e.g., more complex modern convolutional neural network (CNN) architectures such as VGGNet (He et al., 2016), ResNet (He et al., 2016), Wide ResNet (Zagoruyko and Komodakis, 2016), DenseNet (Huang et al., 2017), training with Adam or its variants typically generalizes worse than SGD with momentum, even when the training performance is better. In particular, people found that carefully-tuned SGD, combined with proper momentum, weight decay and appropriate learning rate decay schedules, can significantly outperform adaptive gradient algorithms eventually (Wilson et al., 2017). As a result, many recent studies train their models with SGD-Momentum (He et al., 2016; Zagoruyko and Komodakis, 2016; Huang et al., 2017; Simonyan and Zisserman, 2014; Ren et al., 2015; Xie et al., 2017; Howard et al., 2017) despite that adaptive gradient algorithms usually converge faster. Different from SGD, which adopts a universal learning rate for all coordinates, the effective learning rate of adaptive gradient methods, i.e., the universal base learning rate divided by the second order moment term, is different for different coordinates. Due to the normalization of the second order moment, some coordinates will have very large effective learning rates. To alleviate this problem, one usually chooses a smaller base learning rate for adaptive gradient methods than SGD with momentum. This makes the learning rate decay schedule less effective when applied to adaptive gradient methods, since a much smaller base learning rate will lead to diminishing effective learning rate for most coordinates after several rounds of decay. We refer to the above phenomenon as the “small learning rate dilemma” (see more details in Section 3).

With all these observations, a natural question is:

Can we take the best from both Adam and SGD-Momentum, i.e., design an algorithm that not only enjoys the fast convergence rate as Adam, but also generalizes as well as SGD-Momentum?

In this paper, we answer this question affirmatively. We close the generalization gap of adaptive gradient methods by proposing a new algorithm, called partially adaptive momentum estimation (Padam) method, which unifies Adam/Amsgrad with SGD-Momentum to achieve the best of both worlds, by a partially adaptive parameter. The intuition behind our algorithm is: by controlling the degree of adaptiveness, the base learning rate in Padam does not need to be as small as other adaptive gradient methods. Therefore, it can maintain a larger learning rate while preventing the gradient explosion. We note that there exist several studies (Zaheer et al., 2018; Loshchilov and Hutter, 2019; Luo et al., 2019) that also attempted to address the same research question. In detail, Yogi (Zaheer et al., 2018) studied the effect of adaptive denominator constant ϵ\epsilon and minibatch size in the convergence of adaptive gradient methods. AdamW (Loshchilov and Hutter, 2019) proposed to fix the weight decay regularization in Adam by decoupling the weight decay from the gradient update and this improves the generalization performance of Adam. AdaBound (Luo et al., 2019) applies dynamic bound of learning rate on Adam and make them smoothly converge to a constant final step size as in SGD. Our algorithm is very different from Yogi, AdamW and AdaBound. Padam is built upon a simple modification of Adam without extra complicated algorithmic design and it comes with a rigorous convergence guarantee in the nonconvex stochastic optimization setting.

We highlight the main contributions of our work as follows:

\bullet We propose a novel and simple algorithm Padam with a partially adaptive parameter, which resolves the “small learning rate dilemma” for adaptive gradient methods and allows for faster convergence, hence closing the gap of generalization.

\bullet We provide a convergence guarantee for Padam in nonconvex optimization. Specifically, we prove that the convergence rate of Padam to a stationary point for stochastic nonconvex optimization is

where ss characterizes the growth rate of the cumulative stochastic gradient g1:T,i=[g1,i,g2,i,,gT,i]\mathbf{g}_{1:T,i}=[g_{1,i},g_{2,i},\ldots,g_{T,i}]^{\top} (g1,,gT\mathbf{g}_{1},\ldots,\mathbf{g}_{T} are the stochastic gradients) and 0s1/20\leq s\leq 1/2. When the stochastic gradients are sparse, i.e., s<1/2s<1/2, (1.1) is strictly better than the convergence rate of SGD in terms of the rate of TT.

\bullet We also provide thorough experiments about our proposed Padam method on training modern deep neural architectures. We empirically show that Padam achieves the fastest convergence speed while generalizing as well as SGD with momentum. These results suggest that practitioners should pick up adaptive gradient methods once again for faster training of deep neural networks.

\bullet Last but not least, compared with the recent work on adaptive gradient methods, such as Yogi (Zaheer et al., 2018), AdamW (Loshchilov and Hutter, 2019), AdaBound (Luo et al., 2019), our proposed Padam achieves better generalization performance than these methods in our experiments.

Here we review additional related work that is not covered before. Zhang et al. (2017) proposed a normalized direction-preserving Adam (ND-Adam), which changes the adaptive terms from individual dimensions to the whole gradient vector. Keskar and Socher (2017) proposed to improve the generalization performance by switching from Adam to SGD. On the other hand, despite the great successes of adaptive gradient methods for training deep neural networks, the convergence guarantees for these algorithms are still understudied. Most convergence analyses of adaptive gradient methods are restricted to online convex optimization (Duchi et al., 2011; Kingma and Ba, 2015; Mukkamala and Hein, 2017; Reddi et al., 2018). A few recent attempts have been made to analyze adaptive gradient methods for stochastic nonconvex optimization. More specifically, Basu et al. (2018) proved the convergence rate of RMSProp and Adam when using deterministic gradient rather than stochastic gradient. Li and Orabona (2018) analyzed convergence rate of AdaGrad under both convex and nonconvex settings but did not consider more complicated Adam-type algorithms. Ward et al. (2018) also proved the convergence rate of AdaGrad under both convex and nonconvex settings without considering the effect of stochastic momentum. Chen et al. (2018) provided a convergence analysis for a class of Adam-type algorithms for nonconvex optimization. Zou and Shen (2018) analyzed the convergence rate of AdaHB and AdaNAG, two modified version of AdaGrad with the use of momentum. Liu et al. (2019) proposed Optimistic Adagrad and showed its convergence in non-convex non-concave min-max optimization. However, none of these results are directly applicable to Padam. Our convergence analysis in Section 4 is quite general and implies the convergence rate of AMSGrad for nonconvex optimization. In terms of learning rate decay schedule, Wu et al. (2018) studied the learning rate schedule via short-horizon bias. Xu et al. (2016); Davis et al. (2019) analyzed the convergence of stochastic algorithms with geometric learning rate decay. Ge et al. (2019) studied the learning rate schedule for quadratic functions.

The remainder of this paper is organized as follows: in Section 2, we briefly review existing adaptive gradient methods. We present our proposed algorithm in Section 3, and the main theory in Section 4. In Section 5, we compare the proposed algorithm with existing algorithms on modern neural network architectures on benchmark datasets. Finally, we conclude this paper and point out the future work in Section 6.

Review of Adaptive Gradient Methods

Various adaptive gradient methods have been proposed in order to achieve better performance on various stochastic optimization tasks. Adagrad (Duchi et al., 2011) is among the first methods with adaptive learning rate for each individual dimension, which motivates the research on adaptive gradient methods in the machine learning community. In detail, AdagradThe formula is equivalent to the one from the original paper (Duchi et al., 2011) after simple manipulations. adopts the following update form:

where gt\mathbf{g}_{t} stands for the stochastic gradient ft(xt)\nabla f_{t}(\mathbf{x}_{t}), and αt=α/t\alpha_{t}=\alpha/\sqrt{t} is the step size. In this paper, we call αt\alpha_{t} base learning rate, which is the same for all coordinates of xt\mathbf{x}_{t}, and we call αt/vt,i\alpha_{t}/\sqrt{v_{t,i}} effective learning rate for the ii-th coordinate of xt\mathbf{x}_{t}, which varies across the coordinates. Adagrad is proved to enjoy a huge gain in terms of convergence especially in sparse gradient situations. Empirical studies also show a performance gain even for non-sparse gradient settings. RMSprop (Hinton et al., 2012) follows the idea of adaptive learning rate and it changes the arithmetic averages used for vt\mathbf{v}_{t} in Adagrad to exponential moving averages. Even though RMSprop is an empirical method with no theoretical guarantee, the outstanding empirical performance of RMSprop raised people’s interests in exponential moving average variants of Adagrad. Adam (Kingma and Ba, 2015)Here for simplicity and consistency, we ignore the bias correction step in the original paper of Adam. Yet adding the bias correction step will not affect the argument in the paper. is the most popular exponential moving average variant of Adagrad. It combines the idea of RMSprop and momentum acceleration, and takes the following update form:

Adam also requires αt=α/t\alpha_{t}=\alpha/\sqrt{t} for the sake of convergence analysis. In practice, any decaying step size or even constant step size works well for Adam. Note that if we choose β1=0\beta_{1}=0, Adam basically reduces to RMSprop. Reddi et al. (2018) identified a non-convergence issue in Adam. Specifically, Adam does not collect long-term memory of past gradients and therefore the effective learning rate could be increasing in some cases. They proposed a modified algorithm namely Amsgrad. More specifically, Amsgrad adopts an additional step to ensure the decay of the effective learning rate αt/v^t\alpha_{t}/\sqrt{\widehat{\mathbf{v}}_{t}}, and its key update formula is as follows:

mt\mathbf{m}_{t} and vt\mathbf{v}_{t} are the same as Adam. By introducing the v^t\widehat{\mathbf{v}}_{t} term, Reddi et al. (2018) corrected some mistakes in the original proof of Adam and proved an O(1/T)O(1/\sqrt{T}) convergence rate of Amsgrad for convex optimization. Note that all the theoretical guarantees on adaptive gradient methods (Adagrad, Adam, Amsgrad) are only proved for convex functions.

The Proposed Algorithm

In this section, we propose a new algorithm for bridging the generalization gap for adaptive gradient methods. Specifically, we introduce a partial adaptive parameter pp to control the level of adaptiveness of the optimization procedure. The proposed algorithm is displayed in Algorithm 1.

In Algorithm 1, gt\mathbf{g}_{t} denotes the stochastic gradient and v^t\widehat{\mathbf{v}}_{t} can be seen as a moving average over the second order moment of the stochastic gradients. As we can see from Algorithm 1, the key difference between Padam and Amsgrad (Reddi et al., 2018) is that: while mt\mathbf{m}_{t} is still the momentum as in Adam/Amsgrad, it is now “partially adapted” by the second order moment. We call p[0,1/2]p\in[0,1/2] the partially adaptive parameter. Note that 1/21/2 is the largest possible value for pp and a larger pp will result in non-convergence in the proof (see the proof details in the supplementary materials). When p0p\to 0, Algorithm 1 reduces to SGD with momentumThe only difference between Padam with p=0p=0 and SGD-Momentum is an extra constant factor (1β1)(1-\beta_{1}), which can be moved into the learning rate such that the update rules for these two algorithms are identical. and when p=1/2p=1/2, Algorithm 1 is exactly Amsgrad. Therefore, Padam indeed unifies Amsgrad and SGD with momentum.

With the notations defined above, we are able to formally explain the “small learning rate dilemma”. In order to make things clear, we first emphasize the relationship between adaptiveness and learning rate decay. We refer the actual learning rate applied to mt\mathbf{m}_{t} as the effective learning rate, i.e., αt/v^tp\alpha_{t}/\widehat{\mathbf{v}}_{t}^{p}. Now suppose that a learning rate decay schedule is applied to αt\alpha_{t}. If pp is large, then at early stages, the effective learning rate αt/v^t,ip\alpha_{t}/{\widehat{v}_{t,i}}^{p} could be fairly large for certain coordinates with small v^t,i\widehat{v}_{t,i} valueThe coordinate v^t,i\widehat{v}_{t,i}’s are much less than 11 for most commonly used network architectures. . To prevent those coordinates from overshooting we need to enforce a smaller αt\alpha_{t}, and therefore the base learning rate must be set small (Keskar and Socher, 2017; Wilson et al., 2017). As a result, after several rounds of decaying, the learning rates of the adaptive gradient methods are too small to make any significant progress in the training processThis does not mean the learning rate decay schedule weakens adaptive gradient method. On the contrary, applying the learning rate decay schedule still gives performance boost to the adaptive gradient methods in general but this performance boost is not as significant as SGD + momentum.. We call this phenomenon “small learning rate dilemma”. It is also easy to see that the larger pp is, the more severe “small learning rate dilemma” is. This suggests that intuitively, we should consider using Padam with a proper adaptive parameter pp, and choosing p<1/2p<1/2 can potentially make Padam suffer less from the “small learning rate dilemma” than Amsgrad, which justifies the range of pp in Algorithm 1. We will show in our experiments (Section 5) that Padam with p<1/2p<1/2 can adopt an equally large base learning rate as SGD with momentum.

Note that even though in Algorithm 1, the choice of αt\alpha_{t} covers different choices of learning rate decay schedule, the main focus of this paper is not about finding the best learning rate decay schedule, but designing a new algorithm to control the adaptiveness for better empirical generalization result. In other words, our focus is not on αt\alpha_{t}, but on v^t\widehat{\mathbf{v}}_{t}. For this reason, we simply fix the learning rate decay schedule for all methods in the experiments to provide a fair comparison for different methods.

Figure 1 shows the comparison of test error performances under the different partial adaptive parameter pp for ResNet on both CIFAR-10 and CIFAR-100 datasets. We can observe that a larger pp will lead to fast convergence at early stages and worse generalization performance later, while a smaller pp behaves more like SGD with momentum: slow in early stages but finally catch up. With a proper choice of pp (e.g., 1/81/8 in this case), Padam can obtain the best of both worlds. Note that besides Algorithm 1, our partially adaptive idea can also be applied to other adaptive gradient methods such as Adagrad, Adadelta, RMSprop, AdaMax (Kingma and Ba, 2015). For the sake of conciseness, we do not list the partially adaptive versions for other adaptive gradient methods here. We also would like to comment that Padam is totally different from the pp-norm generalized version of Adam in Kingma and Ba (2015), which induces AdaMax method when pp\to\infty. In their case, pp-norm is used to generalize 22-norm of their current and past gradients while keeping the scale of adaptation unchanged. In sharp contrast, we intentionally change (reduce) the scale of the adaptive term in Padam to get better generalization performance.

Convergence Analysis of the Proposed Algorithm

In this section, we establish the convergence theory of Algorithm 1 in the stochastic nonconvex optimization setting, i.e., we aim at solving the following stochastic nonconvex optimization problem

Assumption 4.2 is frequently used in analysis of gradient-based algorithms. It is equivalent to the LL-gradient Lipschitz condition, which is often written as f(x)f(y)2Lxy2\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_{2}\leq L\|\mathbf{x}-\mathbf{y}\|_{2}. Next we provide the main convergence rate result for our proposed algorithm. The detailed proof can be found in the longer version of this paper.

In Algorithm 1, suppose that p[0,1/2]p\in[0,1/2], β1<β22p\beta_{1}<\beta_{2}^{2p}, αt=α\alpha_{t}=\alpha and g1:T,i2GTs\|\mathbf{g}_{1:T,i}\|_{2}\leq G_{\infty}T^{s} for t=1,,Tt=1,\ldots,T, 0s1/20\leq s\leq 1/2, under Assumptions 4.1 and 4.2, let Δf=f(x1)infxf(x)\Delta f=f(\mathbf{x}_{1})-\inf_{\mathbf{x}}f(\mathbf{x}), for any q[max{0,4p1},1]q\in[\max\{0,4p-1\},1], the output xout\mathbf{x}_{\text{out}} of Algorithm 1 satisfies that

From Theorem 4.3, we can see that M1M_{1} and M3M_{3} are independent of the number of iterations TT and dimension dd. In addition, if v^11=O(1)\|\widehat{\mathbf{v}}_{1}^{-1}\|_{\infty}=O(1), it is easy to see that M2M_{2} also has an upper bound that is independent of TT and dd. ss characterizes the growth rate condition (Liu et al., 2019) of the cumulative stochastic gradient g1:T,i\mathbf{g}_{1:T,i}. In the worse case, s=1/2s=1/2, while in practice when the stochastic gradients are sparse, s<1/2s<1/2.

The following corollary simplifies the result of Theorem 4.3 by choosing q=0q=0 under the condition p[0,1/4]p\in[0,1/4].

Under the same conditions in Theorem 4.3, if p[0,1/4]p\in[0,1/4], Padam’s output satisfies

where M1M_{1} and M2M_{2} and Δf\Delta f are the same as in Theorem 4.3, and M3M_{3}^{\prime} is defined as follows:

We show the convergence rate under optimal choice of step size α\alpha. If

Experiments

In this section, we empirically evaluate our proposed algorithm for training various modern deep learning models and test them on several standard benchmarks.The code is available at https://github.com/uclaml/Padam. We show that for nonconvex loss functions in deep learning, our proposed algorithm still enjoys a fast convergence rate, while its generalization performance is as good as SGD with momentum and much better than existing adaptive gradient methods such as Adam and Amsgrad.

We compare Padam against several state-of-the-art algorithms, including: (1) SGD-Momentum, (2) Adam (Kingma and Ba, 2015), (3) Amsgrad (Reddi et al., 2018), (4) AdamW (Loshchilov and Hutter, 2019) (5) Yogi (Zaheer et al., 2018) and (6) AdaBound (Luo et al., 2019). We use several popular datasets for image classifications and language modeling: CIFAR-10 (Krizhevsky and Hinton, 2009), ImageNet dataset (ILSVRC2012) (Deng et al., 2009) and Penn Treebank dataset (Marcus et al., 1993). We adopt three popular CNN architectures for image classification task: VGGNet-16 (Simonyan and Zisserman, 2014), Residual Neural Network (ResNet-18) (He et al., 2016), Wide Residual Network (WRN-16-4) (Zagoruyko and Komodakis, 2016). We test the language modeling task using 2-layer and 3-layer Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997). For CIFAR-10 and Penn Treebank experiments, we test for 200200 epochs and decay the learning rate by 0.10.1 at the 100100th and 150150th epoch. We test ImageNet tasks for 100100 epochs with similar multi-stage learning rate decaying scheme at the 3030th, 6060th and 8080th epoch.

We perform grid searches to choose the best hyper-parameters for all algorithms in both image classification and language modeling tasks. For the base learning rate, we do grid search over {104,,102}\{10^{-4},\ldots,10^{2}\} for all algorithms, and choose the partial adaptive parameter pp from {2/5,1/4,1/5,1/8,1/16}\{2/5,1/4,1/5,1/8,1/16\} and the second order moment parameter β2\beta_{2} from {0.9,0.99,0.999}\{0.9,0.99,0.999\}. For image classification experiments, we set the base learning rate of 0.10.1 for SGD with momentum and Padam, 0.0010.001 for all other adaptive gradient methods. β1\beta_{1} is set as 0.90.9 for all methods. β2\beta_{2} is set as 0.990.99 for Adam and Amsgrad, 0.9990.999 for all other methods. For Padam, the partially adaptive parameter pp is set to be 1/81/8. For AdaBound, the final learning rate is set to be 0.10.1. For AdamW, the normalized weight decay factor is set to 2.5×1022.5\times 10^{-2} for CIFAR-10 and 5×1025\times 10^{-2} for ImageNet. For Yogi, ϵ\epsilon is set as 10310^{-3} as suggested in the original paper. The minibatch size for CIFAR-10 is set to be 128128 and for ImageNet dataset we set it to be 256256. Regarding the LSTM experiments, for SGD with momentum, the base learning rate is set to be 11 for 2-layer LSTM model and 1010 for 3-layer LSTM. The momentum parameter is set to be 0.90.9 for both models. For all adaptive gradient methods except Padam and Yogi, we set the base learning rate as 0.0010.001. For Yogi, we set the base learning rate as 0.010.01 for 2-layer LSTM model and 0.10.1 for 3-layer LSTM model. For Padam, we set the base learning rate as 0.010.01 for 2-layer LSTM model and 11 for 3-layer LSTM model. For all adaptive gradient methods, we set β1=0.9\beta_{1}=0.9, β2=0.999\beta_{2}=0.999. In terms of algorithm specific parameters, for Padam, we set the partially adaptive parameter pp as 0.40.4 for 2-layer LSTM model and 0.20.2 for 3-layer LSTM model. For AdaBound, we set the final learning rate as 1010 for 2-layer LSTM model and 100100 for 3-layer LSTM model. For Yogi, ϵ\epsilon is set as 10310^{-3} as suggested in the original paper. For AdamW, the normalized weight decay factor is set to 4×1044\times 10^{-4}. The minibatch size is set to be 2020 for all LSTM experiments.

We compare our proposed algorithm with other baselines on training the aforementioned three modern CNN architectures for image classification on the CIFAR-10 and ImageNet datasets. Figure 2 plots the train loss and test error (top-11 error) against training epochs on the CIFAR-10 dataset. As we can see from the figure, at the early stage of the training process, (partially) adaptive gradient methods including Padam, make rapid progress lowing both the train loss and the test error, while SGD with momentum converges relatively slowly. After the first learning rate decaying at the 100100-th epoch, different algorithms start to behave differently. SGD with momentum makes a huge drop while fully adaptive gradient methods (Adam and Amsgrad) start to generalize badly. Padam, on the other hand, maintains relatively good generalization performance and also holds the lead over other algorithms. After the second decaying at the 150150-th epoch, Adam and Amsgrad basically lose all the power of traversing through the parameter space due to the “small learning dilemma”, while the performance of SGD with momentum finally catches up with Padam. AdamW, Yogi and AdaBound indeed improve the performance compared with original Adam but the performance is still worse than Padam. Overall we can see that Padam achieves the best of both worlds (i.e., Adam and SGD with momentum): it maintains faster convergence rate while also generalizing as well as SGD with momentum in the end.

Figure 3 (a)(b)(d)(e) plot the Top-11 and Top-55 error against training epochs on the ImageNet dataset for both VGGNet and ResNet. We can see that on the ImageNet dataset, all methods behave similarly as in our CIFAR-10 experiments. Padam method again obtains the best from both worlds by achieving the fastest convergence while generalizing as well as SGD with momentum. Even though methods such as AdamW, Yogi and AdaBound have better performance than standard Adam, they still suffer from a big generalization gap on the ImageNet dataset. Note that we did not conduct WideResNet experiment on the Imagenet dataset due to GPU memory limits.

We also perform experiments on the language modeling tasks to test our proposed algorithm on Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997), where adaptive gradient methods such as Adam are currently the mainstream optimizers for these tasks. Figure 3 (c)(f) plot the test perplexity against training epochs on the Penn Treebank dataset (Marcus et al., 1993) for both 2-layer LSTM and 3-layer LSTM models. We can observe that the differences on simpler 2-layer LSTM model is not very obvious but on more complicated 3-layer LSTM model, different algorithms have quite different optimizing behaviors. Even though Adam, Amsgrad and AdamW have faster convergence in the early stages, Padam achieves the best final test perplexity on this language modeling task for both of our experiments.

For a more quantitative comparison, we also provide the test accuracy for all above experiments. Table 1 shows the test accuracy of all algorithms on the CIFAR-10 dataset. On the CIFAR-10 dataset, methods such as Adam and Amsgrad have the lowest test accuracy. Even though more recent algorithms AdamW, Yogi, AdaBound improve upon original Adam, they still fall behind or barely match the performance of SGD with momentum. In contrast, Padam achieves the highest test accuracy for VGGNet and WideResNet on CIFAR-10 dataset. For training ResNet on the CIFAR-10 dataset, Padam is also on a par with SGD with momentum at the final epoch (differences less than 0.2%0.2\%). Table 2 shows the final test accuracy of all algorithms on the ImageNet dataset. Again, we can observe that Padam achieves the best test accuracy for VGGNet (both Top-1 and Top-5) and Top-1 accuracy for ResNet. It stays very close to the best baseline of Top-1 accuracy for the ResNet model. Table 3 shows the final test perplexity of all algorithms on the Penn Treebank dataset. As we can see, Padam achieves the best (lowest) test perplexity on both 2-layer LSTM and 3-layer LSTM models. All these experimental results suggest that practitioners can use Padam for training deep neural networks, without worrying about the generalization performances.

Conclusions and Future Work

In this paper, we proposed Padam, which unifies Adam/Amsgrad with SGD-Momentum. With an appropriate choice of the partially adaptive parameter, we show that Padam can achieve the best from both worlds, i.e., maintaining fast convergence rate while closing the generalization gap. We also provide a theoretical analysis towards the convergence rate of Padam to a stationary point for stochastic nonconvex optimization.

It would also be interesting to see how well Padam performs in other types of neural networks, such as generative adversarial network (GAN) (Goodfellow et al., 2014) and graph convolutional neural network (GCN) (Kipf and Welling, 2017; Zou et al., 2019). We leave it as a future work.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation CAREER Award IIS-1906169, BIGDATA IIS-1855099 and IIS-1903202. We also thank AWS for providing cloud computing credits associated with the NSF BIGDATA award. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

Appendix A Proof of the Main Theory

In this section, we provide a detailed version of proof of Theorem 4.3.

Let zt\mathbf{z}_{t} be defined in (A.1). For t2t\geq 2, we have

Let zt\mathbf{z}_{t} be defined in (A.1). For t2t\geq 2, we have

Let zt\mathbf{z}_{t} be defined in (A.1). For t2t\geq 2, we have

Let v^t\widehat{\mathbf{v}}_{t} and mt\mathbf{m}_{t} be as defined in Algorithm 1. Then under Assumption 4.1, we have f(x)G\|\nabla f(\mathbf{x})\|_{\infty}\leq G_{\infty}, v^tG2\|\widehat{\mathbf{v}}_{t}\|_{\infty}\leq G_{\infty}^{2} and mtG\|\mathbf{m}_{t}\|_{\infty}\leq G_{\infty}.

Suppose that ff has GG_{\infty}-bounded stochastic gradient. Let β1,β2\beta_{1},\beta_{2} be the weight parameters, αt\alpha_{t}, t=1,,Tt=1,\ldots,T be the step sizes in Algorithm 1 and q[max{4p1,0},1]q\in[\max\{4p-1,0\},1]. We denote γ=β1/β22p\gamma=\beta_{1}/\beta_{2}^{2p}. Suppose that αt=α\alpha_{t}=\alpha and γ1\gamma\leq 1, then under Assumption 4.1, we have the following two results:

In the following, we bound I1I_{1}, I2I_{2} and I3I_{3} separately.

Bounding term I1I_{1}: When t=1t=1, we have

where the first equality holds due to (A.3) in Lemma A.1. For f(xt)(αt1V^t1pαtV^tp)mt1\nabla f(\mathbf{x}_{t})^{\top}(\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-p}-\alpha_{t}\widehat{\mathbf{V}}_{t}^{-p})\mathbf{m}_{t-1} in (A.7), we have

The first inequality holds because for a positive diagonal matrix A\mathbf{A}, we have xAyxA1,1y\mathbf{x}^{\top}\mathbf{A}\mathbf{y}\leq\|\mathbf{x}\|_{\infty}\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_{\infty}. The second inequality holds due to αt1V^t1pαtV^tp0\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-p}\succeq\alpha_{t}\widehat{\mathbf{V}}_{t}^{-p}\succeq 0. Next we bound f(xt)αtV^tpgt-\nabla f(\mathbf{x}_{t})^{\top}\alpha_{t}\widehat{\mathbf{V}}_{t}^{-p}\mathbf{g}_{t}. We have

The first inequality holds because for a positive diagonal matrix A\mathbf{A}, we have xAyxA1,1y\mathbf{x}^{\top}\mathbf{A}\mathbf{y}\leq\|\mathbf{x}\|_{\infty}\cdot\|\mathbf{A}\|_{1,1}\cdot\|\mathbf{y}\|_{\infty}. The second inequality holds due to αt1V^t1pαtV^tp0\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-p}\succeq\alpha_{t}\widehat{\mathbf{V}}_{t}^{-p}\succeq 0. Substituting (A.8) and (A.9) into (A.7), we have

Bounding term I2I_{2}: For t1t\geq 1, we have

where the second inequality holds because of Lemma A.1 and Lemma A.2, the last inequality holds due to Young’s inequality.

Bounding term I3I_{3}: For t1t\geq 1, we have

The first inequality is obtained by introducing Lemma A.1.

For t=1t=1, substituting (A.6), (A.11) and (A.12) into (A.5), taking expectation and rearranging terms, we have

For t2t\geq 2, substituting (A.10), (A.11) and (A.12) into (A.5), taking expectation and rearranging terms, we have

where γ=β1/β22p\gamma=\beta_{1}/\beta_{2}^{2p}. We also have

Substituting (A.16) and (A.17) into (A.15), and rearranging (A.15), we have

where the second inequality holds because αt=α\alpha_{t}=\alpha. Rearranging (A.18), we obtain

where {Mi}i=13\{M_{i}\}_{i=1}^{3} are defined in Theorem 4.3. This completes the proof. ∎

A.2 Proof of Corollary 4.5

From Theorem 4.3, let p[0,1/4]p\in[0,1/4], we have qq\in. Setting q=0q=0, we have

where M1M_{1} and M2M_{2} are defined in Theorem 4.3 and M3M_{3}^{\prime} is defined in Corollary 4.5. This completes the proof. ∎

Appendix B Proof of Technical Lemmas

The equities above are based on definition. Then we have

The equalities above follow by combining the like terms. ∎

B.2 Proof of Lemma A.2

where the inequality holds because the term β1/(1β1)\beta_{1}/(1-\beta_{1}) is positive, and triangle inequality. Considering that αtv^t,jpαt1v^t1,jp\alpha_{t}\widehat{\mathbf{v}}_{t,j}^{-p}\leq\alpha_{t-1}\widehat{\mathbf{v}}_{t-1,j}^{-p}, when p>0p>0, we have \Big{\|}\mathbf{I}-(\alpha_{t}\widehat{\mathbf{V}}_{t}^{-p})(\alpha_{t-1}\widehat{\mathbf{V}}_{t-1}^{-p})^{-1}\Big{\|}_{\infty,\infty}\leq 1. With that fact, the term above can be bound as:

B.3 Proof of Lemma A.3

For term f(zt)f(xt)2\|\nabla f(\mathbf{z}_{t})-\nabla f(\mathbf{x}_{t})\|_{2}, we have:

where the last inequality holds because the term β1/(1β1)\beta_{1}/(1-\beta_{1}) is positive. ∎

B.4 Proof of Lemma A.4

Since ff has GG_{\infty}-bounded stochastic gradient, for any x\mathbf{x} and ξ\xi, f(x;ξ)G\|\nabla f(\mathbf{x};\xi)\|_{\infty}\leq G_{\infty}. Thus, we have

Next we bound mt\|\mathbf{m}_{t}\|_{\infty}. We have m0=0G\|\mathbf{m}_{0}\|=0\leq G_{\infty}. Suppose that mtG\|\mathbf{m}_{t}\|_{\infty}\leq G_{\infty}, then for mt+1\mathbf{m}_{t+1}, we have

Thus, for any t0t\geq 0, we have mtG\|\mathbf{m}_{t}\|_{\infty}\leq G_{\infty}. Finally we bound v^t\|\widehat{\mathbf{v}}_{t}\|_{\infty}. First we have v0=v^0=0G2\|\mathbf{v}_{0}\|_{\infty}=\|\widehat{\mathbf{v}}_{0}\|_{\infty}=0\leq G_{\infty}^{2}. Suppose that v^tG2\|\widehat{\mathbf{v}}_{t}\|_{\infty}\leq G_{\infty}^{2} and vtG2\|\mathbf{v}_{t}\|_{\infty}\leq G_{\infty}^{2}. Note that we have

and by definition, we have v^t+1=max{v^t,vt+1}G2\|\widehat{\mathbf{v}}_{t+1}\|_{\infty}=\max\{\|\widehat{\mathbf{v}}_{t}\|_{\infty},\|\mathbf{v}_{t+1}\|_{\infty}\}\leq G_{\infty}^{2}. Thus, for any t0t\geq 0, we have v^tG2\|\widehat{\mathbf{v}}_{t}\|_{\infty}\leq G_{\infty}^{2}. ∎

B.5 Proof of Lemma A.5

Recall that v^t,j,mt,j,gt,j\widehat{v}_{t,j},m_{t,j},g_{t,j} denote the jj-th coordinate of v^t,mt\widehat{\mathbf{v}}_{t},\mathbf{m}_{t} and gt\mathbf{g}_{t}. We have

where the first inequality holds because v^t,ivt,i\widehat{v}_{t,i}\geq v_{t,i}. Next we have

where the first inequality holds due to Cauchy inequality, the second inequality holds because gj,iG|g_{j,i}|\leq G_{\infty}, the last inequality holds because j=1tβ1tj(1β1)1\sum_{j=1}^{t}\beta_{1}^{t-j}\leq(1-\beta_{1})^{-1}. Note that

where the equality holds due to the definition of γ\gamma. Substituting (B.2) and (B.3) into (B.1), we have

Telescoping (B.4) for t=1t=1 to TT, we have

where the first and second inequalities hold due to Hölder’s inequality. Substituting (B.6) into (B.5), we have

Specifically, taking β1=0\beta_{1}=0, we have mt=gt\mathbf{m}_{t}=\mathbf{g}_{t}, then

Appendix C Experiment Details

We use several popular datasets for image classifications.

CIFAR-10 [Krizhevsky and Hinton, 2009]: it consists of a training set of 50,00050,000 32×3232\times 32 color images from 1010 classes, and also 10,00010,000 test images.

CIFAR-100 [Krizhevsky and Hinton, 2009]: it is similar to CIFAR-10 but contains 100100 image classes with 600600 images for each.

ImageNet dataset (ILSVRC2012) [Deng et al., 2009]: ILSVRC2012 contains 1.281.28 million training images, and 50k50k validation images over 10001000 classes.

In addition, we adopt Penn Treebank (PTB) dataset [Marcus et al., 1993], which is widely used in Natural Language Processing (NLP) research. Note that word-level PTB dataset does not contain capital letters, numbers, and punctuation. Models are evaluated based on the perplexity metric (lower is better).

C.2 Architectures

VGGNet [Simonyan and Zisserman, 2014]: We use a modified VGG-16 architecture for this experiment. The VGG-16 network uses only 3×33\times 3 convolutional layers stacked on top of each other for increasing depth and adopts max pooling to reduce volume size. Finally, two fully-connected layers For CIFAR experiments, we change the ending two fully-connected layers from 20482048 nodes to 512512 nodes. For ImageNet experiments, we use batch normalized version (vgg16_bn) provided in Pytorch official package are followed by a softmax classifier.

ResNet [He et al., 2016]: Residual Neural Network (ResNet) introduces a novel architecture with “skip connections” and features a heavy use of batch normalization. Such skip connections are also known as gated units or gated recurrent units and have a strong similarity to recent successful elements applied in RNNs. We use ResNet-18 for this experiment, which contains 22 blocks for each type of basic convolutional building blocks in He et al. .

Wide ResNet [Zagoruyko and Komodakis, 2016]: Wide Residual Network further exploits the “skip connections” used in ResNet and in the meanwhile increases the width of residual networks. In detail, we use the 1616 layer Wide ResNet with 44 multipliers (WRN-16-4) in our experiments.

LSTM [Hochreiter and Schmidhuber, 1997]: Long Short-Term Memory (LSTM) network is a special kind of Recurrent Neural Network (RNN), capable of learning long-term dependencies. It was introduced by Hochreiter and Schmidhuber , and was refined and popularized in many followup work.

Appendix D Additional Experiments

We conducted additional experiments using CIFAR-100 dataset. Figure 4 plots the train loss and test error (top-11 error) against training epochs on the CIFAR-100 dataset. The parameter settings are the same as the setting for CIFAR-10 dataset. We can see that Padam achieves the best from both worlds by maintaining faster convergence rate while also generalizing as well as SGD with momentum in the end.

Table 4 show the test accuracy of all algorithms on CIFAR-100 dataset. We can observe that Padam achieves the highest test accuracy in VGGNet and ResNet experiments for CIFAR-100. On the other task (WideResNet for CIFAR-100), Padam is also on a par with SGD with momentum at the final epoch (differences less than 0.1%0.1\%).