SUPER-ADAM: Faster and Universal Framework of Adaptive Gradients

Feihu Huang, Junyi Li, Heng Huang

Introduction

In the paper, we consider solving the following stochastic optimization problem:

Adam recently has been shown great successes in current machine learning problems, e.g., it is a default method of choice for training DNNs and contrastive learning . Unfortunately, Reddi et al. still showed that Adam is frequently divergent in some settings where the gradient information quickly disappear. To deal with this issue, some variants of Adam algorithm, e.g., AMSGrad , YOGI and generalized Adam have been proposed. Specifically, AMSGrad applies an extra ‘long term memory’ variable to preserve the past gradient information in order to handle the convergence issue of Adam. YOGI introduces an adaptive denominator constant, and studies effect of the mini-batch size in its convergence. Subsequently, Chen et al. studied the convergence of a class of Adam-type algorithms for nonconvex optimization. Zhou et al. analyzed the convergence of a class of adaptive gradient algorithms for nonconvex optimization, and the result shows the advantage of adaptive gradient methods over SGD in sparse stochastic gradient setting. Meanwhile, Liu et al. studied the variances of these adaptive algorithms. More recently, Guo et al. presented a novel convergence analysis for a family of Adam-style methods (including Adam, AMSGrad, Adabound, etc.) with an increasing or large momentum parameter for the first-order moment.

Although the above these adaptive gradient methods show some good empirical performances, their generalization performance is worse than SGD (with momentum) on many deep learning tasks due to using the coordinate-wise learning rates . Thus, recently some adaptive gradient methods have been proposed to improve the generalization performance of Adam. For example, AdamW and Padam improve the generalization performance of Adam by decoupling weight decay regularization and introducing a partial adaptive parameter, respectively. Luo et al. proposed a new variant of Adam (i.e., Adabound) by employing dynamic bounds on learning rates to improve the generalization performance. Subsequently, AdaBelief has been presented to obtain a good generalization by adopting the stepsize according to the ‘belief’ in the current gradient direction. In addition, the norm version of AdaGrad (i.e., AdaGrad-Norm) has been proposed to obtain a good generalization performance.

To fill this gap, in the paper, we propose a faster and universal framework of adaptive gradients, i.e., SUPER-ADAM algorithm, by introducing a universal adaptive matrix. Moreover, we provide a novel convergence analysis framework for the adaptive gradient methods under the nonconvex setting based on the mirror descent algorithm . In summary, our main contributions are threefold:

We propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradients. Moreover, our framework can flexibly integrate the momentum and variance-reduced techniques.

We provide a novel convergence analysis framework for the adaptive gradient methods in the nonconvex setting under the milder conditions (Please see Table 1).

Preliminaries

2 Adaptive Gradient Algorithms

In the subsection, we review some existing typical adaptive gradient methods. Recently, many adaptive algorithms have been proposed to solve the problem (1), and achieve good performances. For example, Adagrad is the first adaptive gradient method with adaptive learning rate for each individual dimension, which adopts the following update form:

where gt=f(xt;ξt)g_{t}=\nabla f(x_{t};\xi_{t}) and vt=1tj=1tgj2v_{t}=\frac{1}{t}\sum_{j=1}^{t}g_{j}^{2}, and ηt=ηt\eta_{t}=\frac{\eta}{\sqrt{t}} with η>0\eta>0 is the step size. In fact, ηt\eta_{t} only is the basic learning rate that is the same for all coordinates of variable xtx_{t}, while ηtvt,i\frac{\eta_{t}}{\sqrt{v_{t,i}}} is the effective learning rate for the ii-th coordinate of xtx_{t}, which changes across the coordinates.

Adam is one of the most popular exponential moving average variant of Adagrad, which combines the exponential moving average technique with momentum acceleration. Its update form is:

where α1,α2(0,1)\alpha_{1},\alpha_{2}\in(0,1) and ε>0\varepsilon>0, and ηt=ηt\eta_{t}=\frac{\eta}{\sqrt{t}} with η>0\eta>0. However, Reddi et al. found a divergence issue of the Adam algorithm, and proposed a modified version of Adam (i.e., Amsgrad), which adopts a new step instead of the debiasing step in (2.2) to ensure the decay of the effective learning rate, defined as

Due to using the coordinate-wise learning rates, these adaptive gradient methods frequently have worse generalization performance than SGD (with momentum) . To improve the generalization performance of Adam, AdamW uses a decoupled weight decay regularization, defined as

where α1,α2(0,1)\alpha_{1},\alpha_{2}\in(0,1), α>0\alpha>0, λ>0\lambda>0 and ε>0\varepsilon>0. More recently, to further improve generalization performance, AdaBelief adopts a stepsize according to ‘belief’ in the current gradient direction,

where α1,α2(0,1)\alpha_{1},\alpha_{2}\in(0,1), and ηt=ηt\eta_{t}=\frac{\eta}{\sqrt{t}} with η>0\eta>0, and ε>0\varepsilon>0.

At the same time, to improve generalization performance, recently some effective adaptive gradient methods have been proposed with adopting the global adaptive learning rates instead of coordinate-wise counterparts. For example, AdaGrad-Norm applies a global adaptive learning rate to the following update form, for all t1t\geq 1

where η>0\eta>0. The adaptive-SGD adopts a global adaptive learning rate, defined as for all t1t\geq 1

where k>0k>0, ω>0\omega>0, and ε0\varepsilon\geq 0. Subsequently, STORM not only uses a global adaptive learning rate but also adopts the variance-reduced technique in gradient estimator to accelerate algorithm, defined as for all t1t\geq 1

SUPER-ADAM Algorithm

In the section, we propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Specifically, our SUPER-ADAM algorithm is summarized in Algorithm 1.

At the step 4 in Algorithm 1, we generate an adaptive matrix HtH_{t} based on stochastic gradient information, which can include both coordinate-wise and global learning rates. For example, HtH_{t} generated from the case 1 in Algorithm 1 is similar to the coordinate-wise adaptive learning rate used in Adam . HtH_{t} generated from the case 2 in Algorithm 1 is similar to the global adaptive learning rate used in the AdaGrad-Norm and Adaptive-SGD . Moreover, we can obtain some new adaptive learning rates by generating some specific adaptive matrices. In the case 3, based on Barzilai-Borwein technique , we design a novel adaptive matrix HtH_{t} defined as:

where λ>0\lambda>0. In the case 4, as the adaptive learning rate used in , we can generate a coordinate-wise-type adaptive matrix Ht=\mboxdiag(vt+λ)H_{t}=\mbox{diag}(\sqrt{v_{t}}+\lambda) and a global-type adaptive matrix Ht=(bt+λ)IdH_{t}=(b_{t}+\lambda)I_{d}, respectively, defined as: mt=β1mt1+(1β1)f(xt;ξt)m_{t}=\beta_{1}m_{t-1}+(1-\beta_{1})\nabla f(x_{t};\xi_{t}),

where β1,β2(0,1)\beta_{1},\beta_{2}\in(0,1) and λ>0\lambda>0. In fact, the adaptive matrix HtH_{t} can be given in a generic form Ht=At+λIdH_{t}=A_{t}+\lambda I_{d}, where the matrix AtA_{t} includes the adaptive information that is generated from stochastic gradients with noises, and the tuning parameter λ>0\lambda>0 balances these adaptive information with noises.

At the step 9 in Algorithm 1, we use a generalized gradient descent (i.e., mirror descent) iteration to update xx based on the adaptive matrix HtH_{t}, defined as

Under this case, γμt\gamma\mu_{t} is a basic stepsize as ηt\eta_{t} in the formula (2.2) of Adam algorithm, and Ht1H_{t}^{-1} is an adaptive stepsize as 1v^t\frac{1}{\sqrt{\hat{v}_{t}}} in the formula (2.2) of Adam algorithm.

At the step 11 of Algorithm 1, we use the stochastic gradient estimator gt+1g_{t+1} for all t1t\geq 1:

where τ{0,1}\tau\in\{0,1\} and αt+1(0,1]\alpha_{t+1}\in(0,1] for all t1t\geq 1. When τ=1\tau=1, we have g_{t+1}=\nabla f(x_{t+1};\xi_{t+1})+(1-\alpha_{t+1})\big{(}g_{t}-\nabla f(x_{t};\xi_{t+1})\big{)} for all t1t\geq 1, which is a momentum-based variance reduced gradient estimator used in STORM . When τ=0\tau=0, we have gt+1=αt+1f(xt+1;ξt+1)+(1αt+1)gtg_{t+1}=\alpha_{t+1}\nabla f(x_{t+1};\xi_{t+1})+(1-\alpha_{t+1})g_{t} for all t1t\geq 1, which is a basic momentum gradient estimator used in the Adam algorithm .

Theoretical Analysis

In this section, we study the convergence properties of our algorithm (SUPER-ADAM) under some mild conditions. All detailed proofs are in the supplementary materials.

The function f(x)f(x) is bounded from below in X\mathcal{X}, i.e., f=infxXf(x)f^{*}=\inf_{x\in\mathcal{X}}f(x).

Assume the adaptive matrix HtH_{t} for all t1t\geq 1 satisfies HtρId0H_{t}\succeq\rho I_{d}\succ 0, and ρ>0\rho>0 denotes a lower bound of the smallest eigenvalue of HtH_{t} for all t1t\geq 1.

Assumption 1 is commonly used in stochastic optimization . Assumption 2 ensures the feasibility of the problem (1). In fact, all adaptive algorithms in Table 1 require these mild Assumptions 1 and 2. Assumption 3 guarantees that the adaptive matrices {Ht}t1\{H_{t}\}_{t\geq 1} are positive definite and their smallest eigenvalues have a lower bound ρ>0\rho>0. From the above adaptive matrices {Ht}t1\{H_{t}\}_{t\geq 1} given in our SUPER-ADAM algorithm, we have ρλ>0\rho\geq\lambda>0. In fact, many existing adaptive algorithms also implicitly use Assumption 3. For example, Zaheer et al. and Zhuang et al. used the following iteration form to update the variable xx: xt+1=xtηtmtvt+εx_{t+1}=x_{t}-\eta_{t}\frac{m_{t}}{\sqrt{v_{t}}+\varepsilon} for all t0t\geq 0 and ε>0\varepsilon>0, which is equivalent to xt+1=xtηtHt1mtx_{t+1}=x_{t}-\eta_{t}H^{-1}_{t}m_{t} with Ht=\mboxdiag(vt+ε)H_{t}=\mbox{diag}(\sqrt{v_{t}}+\varepsilon). Clearly, we have HtεId0H_{t}\succeq\varepsilon I_{d}\succ 0. Ward et al. applied a global adaptive learning rate to the update form in (7), which is equivalent to the following form: xt=xt1ηHt1f(xt1;ξt1)x_{t}=x_{t-1}-\eta H_{t}^{-1}\nabla f(x_{t-1};\xi_{t-1}) with Ht=btIdH_{t}=b_{t}I_{d}. By the above (7), we have HtH0=b0Id0H_{t}\succeq\cdots\succeq H_{0}=b_{0}I_{d}\succ 0. Li et al. and Cutkosky et al. applied a global adaptive learning rate to the update forms in (8) and (9), which is equivalent to xt+1=xtHt1gtx_{t+1}=x_{t}-H_{t}^{-1}g_{t}, where Ht=(1/ηt)IdH_{t}=(1/\eta_{t})I_{d} and \eta_{t}=k/\big{(}\omega+\sum_{i=1}^{t}\|\nabla f(x_{i};\xi_{i})\|^{2}\big{)}^{\alpha} with k>0,ω>0,α(0,1)k>0,\omega>0,\alpha\in(0,1). By the above (8) and (9), we have HtH0=(ωα/k)Id0H_{t}\succeq\cdots\succeq H_{0}=(\omega^{\alpha}/k)I_{d}\succ 0. Reddi et al. and Chen et al. used the condition v^t=max(v^t1,vt)\hat{v}_{t}=\max(\hat{v}_{t-1},v_{t}), and let Ht=\mboxdiag(v^t)H_{t}=\mbox{diag}(\sqrt{\hat{v}_{t}}), thus we have HtH1=\mboxdiag(v^1)=1α2\mboxdiag(f(x1;ξ1))0H_{t}\succeq\cdots\succeq H_{1}=\mbox{diag}(\sqrt{\hat{v}_{1}})=\sqrt{1-\alpha_{2}}\mbox{diag}(|\nabla f(x_{1};\xi_{1})|)\succeq 0. Without loss of generality, choosing an initial point x1x_{1} and let (f(x1;ξ1))j0(\nabla f(x_{1};\xi_{1}))_{j}\neq 0 for all j[d]j\in[d], we have HtH10H_{t}\succeq\cdots\succeq H_{1}\succ 0. Interestingly, our SUPER-ADAM algorithm includes a class of novel momentum-based quasi-Newton algorithms by generating an approximated Hessian matrix HtH_{t}. In fact, the quasi-Newton algorithms generally require the bounded approximated Hessian matrices, i.e., κ^IdHtκˉId0\hat{\kappa}I_{d}\succeq H_{t}\succeq\bar{\kappa}I_{d}\succ 0 for all t1t\geq 1, where κ^κˉ>0\hat{\kappa}\geq\bar{\kappa}>0. Thus Assumption 3 is reasonable and mild. Due to Assumption 3, our convergence analysis can be easily applied to the stochastic quasi-Newton algorithms.

2 A Useful Convergence Measure

We provide a useful measure to analyze the convergence of our algorithm, defined as

We define a Bregman distance associated with function wt(x)=12xTHtxw_{t}(x)=\frac{1}{2}x^{T}H_{t}x as follows

Thus, the step 9 of Algorithm 1 is equivalent to the following mirror descent iteration:

As in , we define a gradient mapping GX(xt,f(xt),γ)=1γ(xtxt+1+)\mathcal{G}_{\mathcal{X}}(x_{t},\nabla f(x_{t}),\gamma)=\frac{1}{\gamma}(x_{t}-x^{+}_{t+1}), where

3 Convergence Analysis of SUPER-ADAM (τ=1)𝜏1(\tau=1)

In this subsection, we provide the convergence analysis of our SUPER-ADAM (τ=1)(\tau=1) algorithm using the momentum-based variance reduced gradient estimator .

Each component function f(x;ξ)f(x;\xi) is LL-smooth for all ξD\xi\in\mathcal{D}, i.e.,

where G=f(x1)fkργ+m1/3σ28k2L2γ2+k2c2σ24L2γ2ln(m+T)G=\frac{f(x_{1})-f^{*}}{k\rho\gamma}+\frac{m^{1/3}\sigma^{2}}{8k^{2}L^{2}\gamma^{2}}+\frac{k^{2}c^{2}\sigma^{2}}{4L^{2}\gamma^{2}}\ln(m+T).

where G=νL(f(x1)f)+ν2σ28+ν2k4c2σ24m1/3ln(m+T)G^{\prime}=\nu L(f(x_{1})-f^{*})+\frac{\nu^{2}\sigma^{2}}{8}+\frac{\nu^{2}k^{4}c^{2}\sigma^{2}}{4m^{1/3}}\ln(m+T).

4 Convergence Analysis of SUPER-ADAM (τ=0)𝜏0(\tau=0)

In this subsection, we provide the convergence analysis of our SUPER-ADAM (τ=0)(\tau=0) algorithm using the basic momentum stochastic gradient estimator .

Assumption 5 is widely used in adaptive algorithms , which is milder than Assumption 4.

where M=f(x1)fργk+2σ2ργkL+2mσ2ργkLln(m+T)M=\frac{f(x_{1})-f^{*}}{\rho\gamma k}+\frac{2\sigma^{2}}{\rho\gamma kL}+\frac{2m\sigma^{2}}{\rho\gamma kL}\ln(m+T).

where M=νL(f(x1)f)+2νσ2+2νmσ2ln(m+T)M^{\prime}=\nu L(f(x_{1})-f^{*})+2\nu\sigma^{2}+2\nu m\sigma^{2}\ln(m+T).

Differences between Our Algorithm and Related Algorithms

In this section, we show some significance differences between our algorithm and some related algorithms, i.e., STORM algorithm and Adam-type algorithms . Although our SUPER-ADAM (τ=1\tau=1) algorithm uses the same stochastic gradient estimator used in the STORM, there exist some significant differences:

Our algorithm focuses on both constrained and unconstrained optimizations, but STORM only focuses on unconstrained optimization.

In our algorithm, we introduce a weighted solution xt+1x_{t+1} at the step 10 by using momentum update. Under this case, our algorithm can easily incorporate various adaptive learning rates and variance reduced techniques. Specifically, we can flexibly use various adaptive learning rates and different stochastic gradient estimators gtg_{t} at the step 9 of our algorithm. In fact, this is one of important novelties of our paper. However, the STORM only uses a simple gradient descent iteration with a specific monotonically decreasing adaptive learning rate.

Similarly, although our SUPER-ADAM (τ=0\tau=0) algorithm uses the same stochastic gradient estimator used in these Adam-type algorithms, there exist some significant differences besides using different adaptive learning rates. These Adam-type algorithms use a decreasing learning rate ηt=ηt\eta_{t}=\frac{\eta}{\sqrt{t}} (Please see the above (2.2), (4) and (2.2)), while our algorithm only uses a constant learning rate γ\gamma besides an adaptive learning rate. Moreover, our algorithm introduces a weighted solution xt+1x_{t+1} at the step 10 with a decreasing parameter μt=km+t\mu_{t}=\frac{k}{\sqrt{m+t}} (Please see Theorem 2) and uses a decreasing parameter αt+1=cμt\alpha_{t+1}=c\mu_{t} in the gradient estimator, while these Adam-type algorithms only use a constant parameter α1(0,1)\alpha_{1}\in(0,1) in their gradient estimators. Under this case, our algorithm uses these decreasing parameters μt\mu_{t} and αt+1\alpha_{t+1} to control the noises in our gradient estimator, so our algorithm does not require some additional assumptions such as the bounded (stochastic) gradient assumption in our convergence analysis for the constrained optimization. For example, when τ=0\tau=0, our gradient estimator is gt+1=αt+1f(xt+1;ξt+1)+(1αt+1)gtg_{t+1}=\alpha_{t+1}\nabla f(x_{t+1};\xi_{t+1})+(1-\alpha_{t+1})g_{t}. Intuitively, with growing tt, αt+1=ckm+t\alpha_{t+1}=\frac{ck}{\sqrt{m+t}} will become small, so the new noises added in our gradient estimator gt+1g_{t+1} will also become less.

Numerical Experiments

In this section, we conduct some experiments to empirically evaluate our SUPER-ADAM algorithm on two deep learning tasks as in : image classification on CIFAR-10, CIFAR-100 and Image-Net datasets and language modeling on Wiki-Text2 dataset. In the experiments, we compare our SUPER-ADAM algorithm against several state-of-the-art adaptive gradient algorithms, including: (1) SGD, (2) Adam , (3) Amsgrad , (4) AdaGrad-Norm , (5) Adam+ , (6) STORM and (7) AdaBelief . For our SUPER-ADAM algorithm, we consider τ=1\tau=1 and τ=0\tau=0, respectively. Without loss of generality, in the following experiments, we only use the case 1 in Algorithm 1 to generate adaptive matrix HtH_{t} and let λ=0.0005\lambda=0.0005. All experiments are run over a machine with Intel Xeon E5-2683 CPU and 4 Nvidia Tesla P40 GPUs.

In the experiment, we conduct image classification task on CIFAR-10, CIFAR-100 and Image-Net datasets. We perform training over ResNet-18 and VGG-19 on CIFAR-10 and CIFAR-100 datasets, respectively. For all the optimizers, we set the batch size as 128 and trains for 200 epochs. For the learning rates and other hyper-parameters, we do grid search and report the best one for each optimizer. In Adam, Amsgrad and AdaBelief algorithms, we set the learning rate as 0.001. In AdaGrad-Norm, the best learning rate is 17 for CIFAR-10 and 10 for CIFAR-100, respectively. In Adam+, we use the recommended tuning parameters in . In STORM, the best result is obtained when w=6w=6, k=10k=10 and c=100c=100 for CIFAR-10, while w=3w=3, k=10k=10 and c=100c=100 for CIFAR-100. For our SUPER-ADAM algorithm, in both CIFAR-10 and CIFAR-100 datasets, we set k=1k=1, m=100m=100, c=40c=40, γ=0.001\gamma=0.001 when τ=1\tau=1, and k=1k=1, m=100m=100, c=20c=20, γ=0.001\gamma=0.001 when τ=0\tau=0. Note that although c>m2/3k2c>\frac{m^{2/3}}{k^{2}} (c>m1/2kc>\frac{m^{1/2}}{k}) in our algorithm, we set αt=min(αt,0.9)\alpha_{t}=\min(\alpha_{t},0.9) at the first several iterations. In our algorithm, μt=k(m+t)1/3\mu_{t}=\frac{k}{(m+t)^{1/3}} (μt=k(m+t)1/2\mu_{t}=\frac{k}{(m+t)^{1/2}}) decreases as the number of iteration tt increases, so αt+1=cμt2\alpha_{t+1}=c\mu_{t}^{2} (αt+1=cμt\alpha_{t+1}=c\mu_{t}) will be less than 1 after the first several iterations.

We train a ResNet-34 on ImageNet dataset. For all the optimizers, we set the batch size as 256 and trains for 60 epochs. In Adam,Amsgrad and AdaBelief, we set learning rate as 0.001. In AdaGrad-Norm, the best learning rate is 30. In Adam+, we set learning rate as 0.1. In STORM, the best result is obtained when k=5k=5, w=100w=100 and c=10c=10. For our algorithm, we set k=1k=1, m=100m=100, c=40c=40, γ=0.01\gamma=0.01 when τ=1\tau=1, and k=1k=1, m=100m=100, c=4c=4, γ=0.04\gamma=0.04 when τ=0\tau=0.

2 Language Modeling Task

In the experiment, we conduct language modeling task on the Wiki-Text2 dataset. Specifically, we train a 2-layer LSTM and a 2-layer Transformer over the WiKi-Text2 dataset. For the LSTM, we use 650 dimensional word embeddings and 650 hidden units per-layer. Due to space limitation, we provide the experimental results for the transformer in the supplementary materials. In the experiment, we set the batch size as 20 and trains for 40 epochs with dropout rate 0.5. We also clip the gradients by norm 0.25 in case of the exploding gradient in LSTM. We also decrease the learning by 4 whenever the validation error increases. For the learning rate, we also do grid search and report the best one for each optimizer. In Adam and Amsgrad algorithms, we set the learning rate as 0.001 in LSTM In AdaGrad-Norm algorithm, the best learning rate is 40. In Adam+ algorithm, we use the learning rate 20. In AdaBelief algorithm, we set the learing rate 0.1. In STORM algorithm, we set w=50w=50, k=10k=10 and c=100c=100. In our SUPER-ADAM algorithm, we set k=1k=1, m=100m=100, c=40c=40, γ=0.001\gamma=0.001 when τ=1\tau=1, while k=1k=1, m=100m=100, c=20c=20, γ=0.01\gamma=0.01 when τ=0\tau=0.

Figure 5 shows that both train and test perplexities (losses) for different optimizers. When τ=1\tau=1, our SUPER-ADAM algorithm outperforms all the other optimizers. When τ=0\tau=0, our SUPER-ADAM optimizer gets a comparable performance with the other Adam-type optimizers.

Conclusions

In the paper, we proposed a novel faster and universal adaptive gradient framework (i.e., SUPER-ADAM) by introducing a universal adaptive matrix including most existing adaptive gradient forms. In particular, our algorithm can flexibly work with the momentum and variance reduced techniques. Moreover, we provided a novel convergence analysis framework for the adaptive gradient methods under the nonconvex setting. Experimental studies were conducted on both image classification and language modeling tasks, and all empirical results verify the superior performances of our algorithm.

Acknowledgments and Disclosure of Funding

This work was partially supported by NSF IIS 1845666, 1852606, 1838627, 1837956, 1956002, OIA 2040588. Feihu and Heng are corresponding Authors.

References

Appendix A Proofs of Convergence Analysis

In this section, we detail the convergence analysis of our algorithm. We first provide some useful lemmas.

Given a ρ\rho-strongly convex function w(x)w(x), we define a prox-function (i.e., Bregman distance) associated with w(x)w(x) as follows:

Then we define a generalized projection problem as in :

where ρ>0\rho>0 depends on ρ\rho-strongly convex function w(x)w(x).

When h(x)=0h(x)=0, in the above Lemma 1, we have g,GX(x,g,γ)ρGX(x,g,γ)2\langle g,\mathcal{G}_{\mathcal{X}}(x,g,\gamma)\rangle\geq\rho\|\mathcal{G}_{\mathcal{X}}(x,g,\gamma)\|^{2}.

(Proposition 1 in ) Let x1x_{1}^{*} and x2x_{2}^{*} be given in (23) with gg replaced by g1g_{1} and g2g_{2} respectively. Further let GX(x,g1,γ)\mathcal{G}_{\mathcal{X}}(x,g_{1},\gamma) and GX(x,g2,γ)\mathcal{G}_{\mathcal{X}}(x,g_{2},\gamma) be defined in (24) with xx^{*} replaced by x1x_{1}^{*} and x2x_{2}^{*} respectively. Then we have

Suppose that the sequence {xt}t=1T\{x_{t}\}_{t=1}^{T} be generated from Algorithm 1. Let 0<μt10<\mu_{t}\leq 1 and 0<γρ2Lμt0<\gamma\leq\frac{\rho}{2L\mu_{t}}, then we have

According to Assumption 4 or 5, i.e., the function f(x)f(x) is LL-smooth, we have

According to Assumption 3, i.e., HtρIdH_{t}\succ\rho I_{d} for any t1t\geq 1, the function wt(x)=12xTHtxw_{t}(x)=\frac{1}{2}x^{T}H_{t}x is ρ\rho-strongly convex, then we have a prox-function associated with wt(x)w_{t}(x) as in , defined as

Next, consider the bound of the term T2T_{2}, we have

where the first inequality is due to the Cauchy-Schwarz inequality and the last is due to Young’s inequality. By combining the above inequalities (A), (31) with (A), we obtain

where the last inequality is due to 0<γρ2Lμt0<\gamma\leq\frac{\rho}{2L\mu_{t}}.

In this subsection, we provide the convergence analysis of our SUPER-ADAM (τ=1\tau=1) algorithm.

In Algorithm 1, given τ=1\tau=1 and 0<αt+110<\alpha_{t+1}\leq 1 for all t0t\geq 0, we have

This proof mainly follows the proof of Lemma 2 in . By the definition of gt+1g_{t+1} in Algorithm 1 with τ=1\tau=1, we have gt+1=f(xt+1;ξt+1)+(1αt+1)(gtf(xt;ξt+1))g_{t+1}=\nabla f(x_{t+1};\xi_{t+1})+(1-\alpha_{t+1})(g_{t}-\nabla f(x_{t};\xi_{t+1})). Then we have

where G=f(x1)fkργ+m1/3σ28k2L2γ2+k2c2σ24L2γ2ln(m+T)G=\frac{f(x_{1})-f^{*}}{k\rho\gamma}+\frac{m^{1/3}\sigma^{2}}{8k^{2}L^{2}\gamma^{2}}+\frac{k^{2}c^{2}\sigma^{2}}{4L^{2}\gamma^{2}}\ln(m+T).

Since μt=k(m+t)1/3\mu_{t}=\frac{k}{(m+t)^{1/3}} on tt is decreasing and mk3m\geq k^{3}, we have μtμ0=km1/31\mu_{t}\leq\mu_{0}=\frac{k}{m^{1/3}}\leq 1 for all t0t\geq 0. Due to cm2/3k2c\leq\frac{m^{2/3}}{k^{2}}, we have αt+1=cμt2cμ02ck2m2/3m2/3k2k2m2/3=1\alpha_{t+1}=c\mu_{t}^{2}\leq c\mu_{0}^{2}\leq\frac{ck^{2}}{m^{2/3}}\leq\frac{m^{2/3}}{k^{2}}\frac{k^{2}}{m^{2/3}}=1. Considering 0<γρm1/34Lk0<\gamma\leq\frac{\rho m^{1/3}}{4Lk}, we have 0<γρm1/34Lkρm1/32Lk=ρ2Lμ0ρ2Lμt0<\gamma\leq\frac{\rho m^{1/3}}{4Lk}\leq\frac{\rho m^{1/3}}{2Lk}=\frac{\rho}{2L\mu_{0}}\leq\frac{\rho}{2L\mu_{t}} for any t0t\geq 0. Thus, the parameters μt\mu_{t}, αt+1\alpha_{t+1} for all t0t\geq 0 and γ\gamma satisfy the conditions in the above Lemmas 3 and 4.

According to the concavity of the function f(x)=x1/3f(x)=x^{1/3}, i.e., (x+y)1/3x1/3+y3x2/3(x+y)^{1/3}\leq x^{1/3}+\frac{y}{3x^{2/3}} for any x,y>0x,y>0, we can obtain

where the second inequality holds by m3/2m\geq 3/2 and the last inequality is due to 0<μt10<\mu_{t}\leq 1.

where the second inequality is due to 0<αt+110<\alpha_{t+1}\leq 1, and the last equality holds by αt+1=cμt2\alpha_{t+1}=c\mu_{t}^{2}. Since c1k3+10L2γ2ρ2c\geq\frac{1}{k^{3}}+\frac{10L^{2}\gamma^{2}}{\rho^{2}}, we have

Due to cm2/3k2c\leq\frac{m^{2/3}}{k^{2}}, c1k3+10L2γ2ρ2c\geq\frac{1}{k^{3}}+\frac{10L^{2}\gamma^{2}}{\rho^{2}} and 0<γρm1/34kL0<\gamma\leq\frac{\rho m^{1/3}}{4kL}, we require

Then we obtain m83/2(3k)3/2m\geq\frac{8^{3/2}}{(3k)^{3/2}}. In the other words, we need m83/2(3k)3/2m\geq\frac{8^{3/2}}{(3k)^{3/2}} to ensure 1k3+10L2γ2ρ2cm2/3k2\frac{1}{k^{3}}+\frac{10L^{2}\gamma^{2}}{\rho^{2}}\leq c\leq\frac{m^{2/3}}{k^{2}}.

where the first inequality is due to the above inequality (A.1) and the above Lemma 3.

By using the above inequality (A.1), we have

where the last inequality holds by Assumptions 1 and 2. Since μt=k(m+t)1/3\mu_{t}=\frac{k}{(m+t)^{1/3}} on tt is not increasing, we have

where the second inequality is due to t=1Tμt3dt1Tμt3dt\sum_{t=1}^{T}\mu_{t}^{3}dt\leq\int^{T}_{1}\mu_{t}^{3}dt. Let G=f(x1)fkργ+m1/3σ28k2L2γ2+k2c2σ24L2γ2ln(m+T)G=\frac{f(x_{1})-f^{*}}{k\rho\gamma}+\frac{m^{1/3}\sigma^{2}}{8k^{2}L^{2}\gamma^{2}}+\frac{k^{2}c^{2}\sigma^{2}}{4L^{2}\gamma^{2}}\ln(m+T), we have

According to Jensen’s inequality, we have

where the last inequality holds by (a+b)1/6a1/6+b1/6(a+b)^{1/6}\leq a^{1/6}+b^{1/6} for any a,b>0a,b>0. Thus we can obtain

Let wt(x)=12xTHtxw_{t}(x)=\frac{1}{2}x^{T}H_{t}x, we give a prox-function (i.e., Bregman distance) associated with wt(x)w_{t}(x), defined as:

Then the step 9 of Algorithm 1 is equivalent to the following generalized projection problem:

According to the above Lemma 2, we have GX(xt,gt,γ)GX(xt,f(xt),γ)1ρf(xt)gt\|\mathcal{G}_{\mathcal{X}}(x_{t},g_{t},\gamma)-\mathcal{G}_{\mathcal{X}}(x_{t},\nabla f(x_{t}),\gamma)\|\leq\frac{1}{\rho}\|\nabla f(x_{t})-g_{t}\|. Then we have

By combining the above inequalities (47) with (A.1), we have

where G=νL(f(x1)f)+ν2σ28+ν2k4c2σ24m1/3ln(m+T)G^{\prime}=\nu L(f(x_{1})-f^{*})+\frac{\nu^{2}\sigma^{2}}{8}+\frac{\nu^{2}k^{4}c^{2}\sigma^{2}}{4m^{1/3}}\ln(m+T).

where the second last inequality holds by Htρ\|H_{t}\|\geq\rho. Then we have

By using Cauchy-Schwarz inequality, we have

According to the above inequality (45), we have

By combining the inequalities (56) and (57), we obtain

Since γ=ρm1/3νkL (ν4)\gamma=\frac{\rho m^{1/3}}{\nu kL}\ (\nu\geq 4), we have

where G=νL(f(x1)f)+ν2σ28+ν2k4c2σ24m1/3ln(m+T)G^{\prime}=\nu L(f(x_{1})-f^{*})+\frac{\nu^{2}\sigma^{2}}{8}+\frac{\nu^{2}k^{4}c^{2}\sigma^{2}}{4m^{1/3}}\ln(m+T). Plugging G=1ρ2m1/3GG=\frac{1}{\rho^{2}m^{1/3}}G^{\prime} into the above inequality (58), we have

A.2 Convergence Analysis of SUPER-ADAM (τ=0𝜏0\tau=0)

In this subsection, we provide the convergence analysis of our SUPER-ADAM (τ=0\tau=0) algorithm.

In Algorithm 1, given τ=0\tau=0 and 0<αt+110<\alpha_{t+1}\leq 1 for all t0t\geq 0, we have

By the definition of gt+1g_{t+1} in Algorithm 1 with τ=0\tau=0, we have gt+1=(1αt+1)gt+αt+1f(xt+1;ξt+1)g_{t+1}=(1-\alpha_{t+1})g_{t}+\alpha_{t+1}\nabla f(x_{t+1};\xi_{t+1}). Then we have

where M=f(x1)fργk+2σ2ργkL+2mσ2ργkLln(m+T)M=\frac{f(x_{1})-f^{*}}{\rho\gamma k}+\frac{2\sigma^{2}}{\rho\gamma kL}+\frac{2m\sigma^{2}}{\rho\gamma kL}\ln(m+T).

Since μt=k(m+t)1/2\mu_{t}=\frac{k}{(m+t)^{1/2}} is decreasing on tt and mk2m\geq k^{2}, we have μtμ0=km1/21\mu_{t}\leq\mu_{0}=\frac{k}{m^{1/2}}\leq 1 for all t0t\geq 0. Due to cm1/2kc\leq\frac{m^{1/2}}{k}, we have αt+1=cμtcμ0m1/2kkm1/2=1\alpha_{t+1}=c\mu_{t}\leq c\mu_{0}\leq\frac{m^{1/2}}{k}\frac{k}{m^{1/2}}=1 for all t0t\geq 0. Since 0<γρm1/28Lk0<\gamma\leq\frac{\rho m^{1/2}}{8Lk}, we have γρm1/28Lkρm1/22Lk=ρ2Lμ0ρ2Lμt\gamma\leq\frac{\rho m^{1/2}}{8Lk}\leq\frac{\rho m^{1/2}}{2Lk}=\frac{\rho}{2L\mu_{0}}\leq\frac{\rho}{2L\mu_{t}} for all t0t\geq 0. Thus, the parameters μt\mu_{t}, αt+1\alpha_{t+1} for all t0t\geq 0 and γ\gamma satisfy the conditions in the above Lemmas 3 and 5.

where the first equality is due to αt+1=cμt\alpha_{t+1}=c\mu_{t} and the last equality holds by 8Lγρcm1/2k\frac{8L\gamma}{\rho}\leq c\leq\frac{m^{1/2}}{k}.

where the first inequality follows by the above inequality (A.2) and the above Lemma 3.

Taking average over t=1,2,,Tt=1,2,\cdots,T on both sides of (A.2), we have

where the second inequality holds by Assumptions 1 and 2. Since μt=k(m+t)1/2\mu_{t}=\frac{k}{(m+t)^{1/2}} is decreasing on tt, we have

Let M=f(x1)fργk+2σ2ργkL+2mσ2ργkLln(m+T)M=\frac{f(x_{1})-f^{*}}{\rho\gamma k}+\frac{2\sigma^{2}}{\rho\gamma kL}+\frac{2m\sigma^{2}}{\rho\gamma kL}\ln(m+T), the above inequality (A.2) reduces to

According to Jensen’s inequality, we have

where the last inequality is due to the inequality (a+b)1/4a1/4+b1/4(a+b)^{1/4}\leq a^{1/4}+b^{1/4} for all a,b0a,b\geq 0. Thus, we have

By using the above inequality (A.1), we obtain

where M=νL(f(x1)f)+2νσ2+2νmσ2ln(m+T)M^{\prime}=\nu L(f(x_{1})-f^{*})+2\nu\sigma^{2}+2\nu m\sigma^{2}\ln(m+T).

According to the above inequality (56), we have

According to the above inequality (67), we have

By combining the inequalities (71) and (72), we obtain

Since γ=ρm1/2νLk (ν8)\gamma=\frac{\rho m^{1/2}}{\nu Lk}\ (\nu\geq 8), we have

where M=νL(f(x1)f)+2νσ2+2νmσ2ln(m+T)M^{\prime}=\nu L(f(x_{1})-f^{*})+2\nu\sigma^{2}+2\nu m\sigma^{2}\ln(m+T).

Plugging M=1ρ2m1/2MM=\frac{1}{\rho^{2}m^{1/2}}M^{\prime} into the above inequality (73), we obtain

Appendix B Additional Experimental Results

In the section, we conduct some numerical experiments to empirically evaluate our SUPER-ADAM algorithm on two deep learning tasks as in : image classification on CIFAR-10, CIFAR-100 and Image-Net datasets and language modeling on Wiki-Text2 dataset (Please see Table 2).

We add some experimental results of ImageNet data given in Figure 6.

B.2 Language Modeling Task

In the experiment, we conduct language modeling task on the Wiki-Text2 dataset. Specifically, we train a 2-layer LSTM and a 2-layer Transformer over the WiKi-Text2 dataset. For the 2-layer Transformer, we use 200 dimensional word embeddings, 200 hidden unites and 2 heads. What’s more, we set the batch size as 20 and trains for 40 epochs with dropout rate 0.5. We also clip the gradients by norm 0.25 same as when we use LSTM. We decrease the learning by 4 whenever the validation error increases. For the learning rate, we also do grid search and report the best one for each optimizer. In Adam and Amsgrad algorithms, we set the learning rate as 0.0002 in Transformer. In AdaGrad-Norm algorithm, the best learning rate is 10 in Transformer. In Adam+ algorithm, we use the learning rate 20. In AdaBelief algorithm, we set the learing 1. In STORM algorithm, we set k=12.5k=12.5, w=100w=100 and c=10c=10. In our SUPER-ADAM algorithm, we set k=1k=1, m=100m=100, c=20c=20, γ=0.002\gamma=0.002 when τ=1\tau=1, while k=1k=1, m=100m=100, c=20c=20, γ=0.003\gamma=0.003 when τ=0\tau=0.

Figure 7 shows that both train and test perplexities (losses) for different optimizers. When τ=1\tau=1, our SUPER-ADAM algorithm outperforms all the other optimizers. When τ=0\tau=0, our SUPER-ADAM optimizer gets a comparable performance with the other Adam-type algorithms.