SUPER-ADAM: Faster and Universal Framework of Adaptive Gradients

Feihu Huang, Junyi Li, Heng Huang

Introduction

In the paper, we consider solving the following stochastic optimization problem:

Adam recently has been shown great successes in current machine learning problems, e.g., it is a default method of choice for training DNNs and contrastive learning . Unfortunately, Reddi et al. still showed that Adam is frequently divergent in some settings where the gradient information quickly disappear. To deal with this issue, some variants of Adam algorithm, e.g., AMSGrad , YOGI and generalized Adam have been proposed. Specifically, AMSGrad applies an extra ‘long term memory’ variable to preserve the past gradient information in order to handle the convergence issue of Adam. YOGI introduces an adaptive denominator constant, and studies effect of the mini-batch size in its convergence. Subsequently, Chen et al. studied the convergence of a class of Adam-type algorithms for nonconvex optimization. Zhou et al. analyzed the convergence of a class of adaptive gradient algorithms for nonconvex optimization, and the result shows the advantage of adaptive gradient methods over SGD in sparse stochastic gradient setting. Meanwhile, Liu et al. studied the variances of these adaptive algorithms. More recently, Guo et al. presented a novel convergence analysis for a family of Adam-style methods (including Adam, AMSGrad, Adabound, etc.) with an increasing or large momentum parameter for the first-order moment.

Although the above these adaptive gradient methods show some good empirical performances, their generalization performance is worse than SGD (with momentum) on many deep learning tasks due to using the coordinate-wise learning rates . Thus, recently some adaptive gradient methods have been proposed to improve the generalization performance of Adam. For example, AdamW and Padam improve the generalization performance of Adam by decoupling weight decay regularization and introducing a partial adaptive parameter, respectively. Luo et al. proposed a new variant of Adam (i.e., Adabound) by employing dynamic bounds on learning rates to improve the generalization performance. Subsequently, AdaBelief has been presented to obtain a good generalization by adopting the stepsize according to the ‘belief’ in the current gradient direction. In addition, the norm version of AdaGrad (i.e., AdaGrad-Norm) has been proposed to obtain a good generalization performance.

To fill this gap, in the paper, we propose a faster and universal framework of adaptive gradients, i.e., SUPER-ADAM algorithm, by introducing a universal adaptive matrix. Moreover, we provide a novel convergence analysis framework for the adaptive gradient methods under the nonconvex setting based on the mirror descent algorithm . In summary, our main contributions are threefold:

We propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradients. Moreover, our framework can flexibly integrate the momentum and variance-reduced techniques.

We provide a novel convergence analysis framework for the adaptive gradient methods in the nonconvex setting under the milder conditions (Please see Table 1).

Preliminaries

2 Adaptive Gradient Algorithms

In the subsection, we review some existing typical adaptive gradient methods. Recently, many adaptive algorithms have been proposed to solve the problem (1), and achieve good performances. For example, Adagrad is the first adaptive gradient method with adaptive learning rate for each individual dimension, which adopts the following update form:

where $g_{t}=\nabla f(x_{t};\xi_{t})$ and $v_{t}=\frac{1}{t}\sum_{j=1}^{t}g_{j}^{2}$ , and $\eta_{t}=\frac{\eta}{\sqrt{t}}$ with $\eta>0$ is the step size. In fact, $\eta_{t}$ only is the basic learning rate that is the same for all coordinates of variable $x_{t}$ , while $\frac{\eta_{t}}{\sqrt{v_{t,i}}}$ is the effective learning rate for the $i$ -th coordinate of $x_{t}$ , which changes across the coordinates.

Adam is one of the most popular exponential moving average variant of Adagrad, which combines the exponential moving average technique with momentum acceleration. Its update form is:

where $\alpha_{1},\alpha_{2}\in(0,1)$ and $\varepsilon>0$ , and $\eta_{t}=\frac{\eta}{\sqrt{t}}$ with $\eta>0$ . However, Reddi et al. found a divergence issue of the Adam algorithm, and proposed a modified version of Adam (i.e., Amsgrad), which adopts a new step instead of the debiasing step in (2.2) to ensure the decay of the effective learning rate, defined as

Due to using the coordinate-wise learning rates, these adaptive gradient methods frequently have worse generalization performance than SGD (with momentum) . To improve the generalization performance of Adam, AdamW uses a decoupled weight decay regularization, defined as

where $\alpha_{1},\alpha_{2}\in(0,1)$ , $\alpha>0$ , $\lambda>0$ and $\varepsilon>0$ . More recently, to further improve generalization performance, AdaBelief adopts a stepsize according to ‘belief’ in the current gradient direction,

where $\alpha_{1},\alpha_{2}\in(0,1)$ , and $\eta_{t}=\frac{\eta}{\sqrt{t}}$ with $\eta>0$ , and $\varepsilon>0$ .

At the same time, to improve generalization performance, recently some effective adaptive gradient methods have been proposed with adopting the global adaptive learning rates instead of coordinate-wise counterparts. For example, AdaGrad-Norm applies a global adaptive learning rate to the following update form, for all $t\geq 1$

where $\eta>0$ . The adaptive-SGD adopts a global adaptive learning rate, defined as for all $t\geq 1$

where $k>0$ , $\omega>0$ , and $\varepsilon\geq 0$ . Subsequently, STORM not only uses a global adaptive learning rate but also adopts the variance-reduced technique in gradient estimator to accelerate algorithm, defined as for all $t\geq 1$

SUPER-ADAM Algorithm

In the section, we propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Specifically, our SUPER-ADAM algorithm is summarized in Algorithm 1.

At the step 4 in Algorithm 1, we generate an adaptive matrix $H_{t}$ based on stochastic gradient information, which can include both coordinate-wise and global learning rates. For example, $H_{t}$ generated from the case 1 in Algorithm 1 is similar to the coordinate-wise adaptive learning rate used in Adam . $H_{t}$ generated from the case 2 in Algorithm 1 is similar to the global adaptive learning rate used in the AdaGrad-Norm and Adaptive-SGD . Moreover, we can obtain some new adaptive learning rates by generating some specific adaptive matrices. In the case 3, based on Barzilai-Borwein technique , we design a novel adaptive matrix $H_{t}$ defined as:

where $\lambda>0$ . In the case 4, as the adaptive learning rate used in , we can generate a coordinate-wise-type adaptive matrix $H_{t}=\mbox{diag}(\sqrt{v_{t}}+\lambda)$ and a global-type adaptive matrix $H_{t}=(b_{t}+\lambda)I_{d}$ , respectively, defined as: $m_{t}=\beta_{1}m_{t-1}+(1-\beta_{1})\nabla f(x_{t};\xi_{t})$ ,

where $\beta_{1},\beta_{2}\in(0,1)$ and $\lambda>0$ . In fact, the adaptive matrix $H_{t}$ can be given in a generic form $H_{t}=A_{t}+\lambda I_{d}$ , where the matrix $A_{t}$ includes the adaptive information that is generated from stochastic gradients with noises, and the tuning parameter $\lambda>0$ balances these adaptive information with noises.

At the step 9 in Algorithm 1, we use a generalized gradient descent (i.e., mirror descent) iteration to update $x$ based on the adaptive matrix $H_{t}$ , defined as

Under this case, $\gamma\mu_{t}$ is a basic stepsize as $\eta_{t}$ in the formula (2.2) of Adam algorithm, and $H_{t}^{-1}$ is an adaptive stepsize as $\frac{1}{\sqrt{\hat{v}_{t}}}$ in the formula (2.2) of Adam algorithm.

At the step 11 of Algorithm 1, we use the stochastic gradient estimator $g_{t+1}$ for all $t\geq 1$ :

where $\tau\in\{0,1\}$ and $\alpha_{t+1}\in(0,1]$ for all $t\geq 1$ . When $\tau=1$ , we have $g_{t+1}=\nabla f(x_{t+1};\xi_{t+1})+(1-\alpha_{t+1})\big{(}g_{t}-\nabla f(x_{t};\xi_{t+1})\big{)}$ for all $t\geq 1$ , which is a momentum-based variance reduced gradient estimator used in STORM . When $\tau=0$ , we have $g_{t+1}=\alpha_{t+1}\nabla f(x_{t+1};\xi_{t+1})+(1-\alpha_{t+1})g_{t}$ for all $t\geq 1$ , which is a basic momentum gradient estimator used in the Adam algorithm .

Theoretical Analysis

In this section, we study the convergence properties of our algorithm (SUPER-ADAM) under some mild conditions. All detailed proofs are in the supplementary materials.

The function $f(x)$ is bounded from below in $\mathcal{X}$ , i.e., $f^{*}=\inf_{x\in\mathcal{X}}f(x)$ .

Assume the adaptive matrix $H_{t}$ for all $t\geq 1$ satisfies $H_{t}\succeq\rho I_{d}\succ 0$ , and $\rho>0$ denotes a lower bound of the smallest eigenvalue of $H_{t}$ for all $t\geq 1$ .

Assumption 1 is commonly used in stochastic optimization . Assumption 2 ensures the feasibility of the problem (1). In fact, all adaptive algorithms in Table 1 require these mild Assumptions 1 and 2. Assumption 3 guarantees that the adaptive matrices $\{H_{t}\}_{t\geq 1}$ are positive definite and their smallest eigenvalues have a lower bound $\rho>0$ . From the above adaptive matrices $\{H_{t}\}_{t\geq 1}$ given in our SUPER-ADAM algorithm, we have $\rho\geq\lambda>0$ . In fact, many existing adaptive algorithms also implicitly use Assumption 3. For example, Zaheer et al. and Zhuang et al. used the following iteration form to update the variable $x$ : $x_{t+1}=x_{t}-\eta_{t}\frac{m_{t}}{\sqrt{v_{t}}+\varepsilon}$ for all $t\geq 0$ and $\varepsilon>0$ , which is equivalent to $x_{t+1}=x_{t}-\eta_{t}H^{-1}_{t}m_{t}$ with $H_{t}=\mbox{diag}(\sqrt{v_{t}}+\varepsilon)$ . Clearly, we have $H_{t}\succeq\varepsilon I_{d}\succ 0$ . Ward et al. applied a global adaptive learning rate to the update form in (7), which is equivalent to the following form: $x_{t}=x_{t-1}-\eta H_{t}^{-1}\nabla f(x_{t-1};\xi_{t-1})$ with $H_{t}=b_{t}I_{d}$ . By the above (7), we have $H_{t}\succeq\cdots\succeq H_{0}=b_{0}I_{d}\succ 0$ . Li et al. and Cutkosky et al. applied a global adaptive learning rate to the update forms in (8) and (9), which is equivalent to $x_{t+1}=x_{t}-H_{t}^{-1}g_{t}$ , where $H_{t}=(1/\eta_{t})I_{d}$ and $\eta_{t}=k/\big{(}\omega+\sum_{i=1}^{t}\|\nabla f(x_{i};\xi_{i})\|^{2}\big{)}^{\alpha}$ with $k>0,\omega>0,\alpha\in(0,1)$ . By the above (8) and (9), we have $H_{t}\succeq\cdots\succeq H_{0}=(\omega^{\alpha}/k)I_{d}\succ 0$ . Reddi et al. and Chen et al. used the condition $\hat{v}_{t}=\max(\hat{v}_{t-1},v_{t})$ , and let $H_{t}=\mbox{diag}(\sqrt{\hat{v}_{t}})$ , thus we have $H_{t}\succeq\cdots\succeq H_{1}=\mbox{diag}(\sqrt{\hat{v}_{1}})=\sqrt{1-\alpha_{2}}\mbox{diag}(|\nabla f(x_{1};\xi_{1})|)\succeq 0$ . Without loss of generality, choosing an initial point $x_{1}$ and let $(\nabla f(x_{1};\xi_{1}))_{j}\neq 0$ for all $j\in[d]$ , we have $H_{t}\succeq\cdots\succeq H_{1}\succ 0$ . Interestingly, our SUPER-ADAM algorithm includes a class of novel momentum-based quasi-Newton algorithms by generating an approximated Hessian matrix $H_{t}$ . In fact, the quasi-Newton algorithms generally require the bounded approximated Hessian matrices, i.e., $\hat{\kappa}I_{d}\succeq H_{t}\succeq\bar{\kappa}I_{d}\succ 0$ for all $t\geq 1$ , where $\hat{\kappa}\geq\bar{\kappa}>0$ . Thus Assumption 3 is reasonable and mild. Due to Assumption 3, our convergence analysis can be easily applied to the stochastic quasi-Newton algorithms.

2 A Useful Convergence Measure

We provide a useful measure to analyze the convergence of our algorithm, defined as

We define a Bregman distance associated with function $w_{t}(x)=\frac{1}{2}x^{T}H_{t}x$ as follows

Thus, the step 9 of Algorithm 1 is equivalent to the following mirror descent iteration:

As in , we define a gradient mapping $\mathcal{G}_{\mathcal{X}}(x_{t},\nabla f(x_{t}),\gamma)=\frac{1}{\gamma}(x_{t}-x^{+}_{t+1})$ , where

3 Convergence Analysis of SUPER-ADAM (τ=1)𝜏1(\tau=1)

In this subsection, we provide the convergence analysis of our SUPER-ADAM $(\tau=1)$ algorithm using the momentum-based variance reduced gradient estimator .

Each component function $f(x;\xi)$ is $L$ -smooth for all $\xi\in\mathcal{D}$ , i.e.,

where $G=\frac{f(x_{1})-f^{*}}{k\rho\gamma}+\frac{m^{1/3}\sigma^{2}}{8k^{2}L^{2}\gamma^{2}}+\frac{k^{2}c^{2}\sigma^{2}}{4L^{2}\gamma^{2}}\ln(m+T)$ .

where $G^{\prime}=\nu L(f(x_{1})-f^{*})+\frac{\nu^{2}\sigma^{2}}{8}+\frac{\nu^{2}k^{4}c^{2}\sigma^{2}}{4m^{1/3}}\ln(m+T)$ .

4 Convergence Analysis of SUPER-ADAM (τ=0)𝜏0(\tau=0)

In this subsection, we provide the convergence analysis of our SUPER-ADAM $(\tau=0)$ algorithm using the basic momentum stochastic gradient estimator .

Assumption 5 is widely used in adaptive algorithms , which is milder than Assumption 4.

where $M=\frac{f(x_{1})-f^{*}}{\rho\gamma k}+\frac{2\sigma^{2}}{\rho\gamma kL}+\frac{2m\sigma^{2}}{\rho\gamma kL}\ln(m+T)$ .

where $M^{\prime}=\nu L(f(x_{1})-f^{*})+2\nu\sigma^{2}+2\nu m\sigma^{2}\ln(m+T)$ .

Differences between Our Algorithm and Related Algorithms

In this section, we show some significance differences between our algorithm and some related algorithms, i.e., STORM algorithm and Adam-type algorithms . Although our SUPER-ADAM ( $\tau=1$ ) algorithm uses the same stochastic gradient estimator used in the STORM, there exist some significant differences:

Our algorithm focuses on both constrained and unconstrained optimizations, but STORM only focuses on unconstrained optimization.

In our algorithm, we introduce a weighted solution $x_{t+1}$ at the step 10 by using momentum update. Under this case, our algorithm can easily incorporate various adaptive learning rates and variance reduced techniques. Specifically, we can flexibly use various adaptive learning rates and different stochastic gradient estimators $g_{t}$ at the step 9 of our algorithm. In fact, this is one of important novelties of our paper. However, the STORM only uses a simple gradient descent iteration with a specific monotonically decreasing adaptive learning rate.

Similarly, although our SUPER-ADAM ( $\tau=0$ ) algorithm uses the same stochastic gradient estimator used in these Adam-type algorithms, there exist some significant differences besides using different adaptive learning rates. These Adam-type algorithms use a decreasing learning rate $\eta_{t}=\frac{\eta}{\sqrt{t}}$ (Please see the above (2.2), (4) and (2.2)), while our algorithm only uses a constant learning rate $\gamma$ besides an adaptive learning rate. Moreover, our algorithm introduces a weighted solution $x_{t+1}$ at the step 10 with a decreasing parameter $\mu_{t}=\frac{k}{\sqrt{m+t}}$ (Please see Theorem 2) and uses a decreasing parameter $\alpha_{t+1}=c\mu_{t}$ in the gradient estimator, while these Adam-type algorithms only use a constant parameter $\alpha_{1}\in(0,1)$ in their gradient estimators. Under this case, our algorithm uses these decreasing parameters $\mu_{t}$ and $\alpha_{t+1}$ to control the noises in our gradient estimator, so our algorithm does not require some additional assumptions such as the bounded (stochastic) gradient assumption in our convergence analysis for the constrained optimization. For example, when $\tau=0$ , our gradient estimator is $g_{t+1}=\alpha_{t+1}\nabla f(x_{t+1};\xi_{t+1})+(1-\alpha_{t+1})g_{t}$ . Intuitively, with growing $t$ , $\alpha_{t+1}=\frac{ck}{\sqrt{m+t}}$ will become small, so the new noises added in our gradient estimator $g_{t+1}$ will also become less.

Numerical Experiments

In this section, we conduct some experiments to empirically evaluate our SUPER-ADAM algorithm on two deep learning tasks as in : image classification on CIFAR-10, CIFAR-100 and Image-Net datasets and language modeling on Wiki-Text2 dataset. In the experiments, we compare our SUPER-ADAM algorithm against several state-of-the-art adaptive gradient algorithms, including: (1) SGD, (2) Adam , (3) Amsgrad , (4) AdaGrad-Norm , (5) Adam+ , (6) STORM and (7) AdaBelief . For our SUPER-ADAM algorithm, we consider $\tau=1$ and $\tau=0$ , respectively. Without loss of generality, in the following experiments, we only use the case 1 in Algorithm 1 to generate adaptive matrix $H_{t}$ and let $\lambda=0.0005$ . All experiments are run over a machine with Intel Xeon E5-2683 CPU and 4 Nvidia Tesla P40 GPUs.

In the experiment, we conduct image classification task on CIFAR-10, CIFAR-100 and Image-Net datasets. We perform training over ResNet-18 and VGG-19 on CIFAR-10 and CIFAR-100 datasets, respectively. For all the optimizers, we set the batch size as 128 and trains for 200 epochs. For the learning rates and other hyper-parameters, we do grid search and report the best one for each optimizer. In Adam, Amsgrad and AdaBelief algorithms, we set the learning rate as 0.001. In AdaGrad-Norm, the best learning rate is 17 for CIFAR-10 and 10 for CIFAR-100, respectively. In Adam+, we use the recommended tuning parameters in . In STORM, the best result is obtained when $w=6$ , $k=10$ and $c=100$ for CIFAR-10, while $w=3$ , $k=10$ and $c=100$ for CIFAR-100. For our SUPER-ADAM algorithm, in both CIFAR-10 and CIFAR-100 datasets, we set $k=1$ , $m=100$ , $c=40$ , $\gamma=0.001$ when $\tau=1$ , and $k=1$ , $m=100$ , $c=20$ , $\gamma=0.001$ when $\tau=0$ . Note that although $c>\frac{m^{2/3}}{k^{2}}$ ( $c>\frac{m^{1/2}}{k}$ ) in our algorithm, we set $\alpha_{t}=\min(\alpha_{t},0.9)$ at the first several iterations. In our algorithm, $\mu_{t}=\frac{k}{(m+t)^{1/3}}$ ( $\mu_{t}=\frac{k}{(m+t)^{1/2}}$ ) decreases as the number of iteration $t$ increases, so $\alpha_{t+1}=c\mu_{t}^{2}$ ( $\alpha_{t+1}=c\mu_{t}$ ) will be less than 1 after the first several iterations.

We train a ResNet-34 on ImageNet dataset. For all the optimizers, we set the batch size as 256 and trains for 60 epochs. In Adam,Amsgrad and AdaBelief, we set learning rate as 0.001. In AdaGrad-Norm, the best learning rate is 30. In Adam+, we set learning rate as 0.1. In STORM, the best result is obtained when $k=5$ , $w=100$ and $c=10$ . For our algorithm, we set $k=1$ , $m=100$ , $c=40$ , $\gamma=0.01$ when $\tau=1$ , and $k=1$ , $m=100$ , $c=4$ , $\gamma=0.04$ when $\tau=0$ .

2 Language Modeling Task

In the experiment, we conduct language modeling task on the Wiki-Text2 dataset. Specifically, we train a 2-layer LSTM and a 2-layer Transformer over the WiKi-Text2 dataset. For the LSTM, we use 650 dimensional word embeddings and 650 hidden units per-layer. Due to space limitation, we provide the experimental results for the transformer in the supplementary materials. In the experiment, we set the batch size as 20 and trains for 40 epochs with dropout rate 0.5. We also clip the gradients by norm 0.25 in case of the exploding gradient in LSTM. We also decrease the learning by 4 whenever the validation error increases. For the learning rate, we also do grid search and report the best one for each optimizer. In Adam and Amsgrad algorithms, we set the learning rate as 0.001 in LSTM In AdaGrad-Norm algorithm, the best learning rate is 40. In Adam+ algorithm, we use the learning rate 20. In AdaBelief algorithm, we set the learing rate 0.1. In STORM algorithm, we set $w=50$ , $k=10$ and $c=100$ . In our SUPER-ADAM algorithm, we set $k=1$ , $m=100$ , $c=40$ , $\gamma=0.001$ when $\tau=1$ , while $k=1$ , $m=100$ , $c=20$ , $\gamma=0.01$ when $\tau=0$ .

Figure 5 shows that both train and test perplexities (losses) for different optimizers. When $\tau=1$ , our SUPER-ADAM algorithm outperforms all the other optimizers. When $\tau=0$ , our SUPER-ADAM optimizer gets a comparable performance with the other Adam-type optimizers.

Conclusions

In the paper, we proposed a novel faster and universal adaptive gradient framework (i.e., SUPER-ADAM) by introducing a universal adaptive matrix including most existing adaptive gradient forms. In particular, our algorithm can flexibly work with the momentum and variance reduced techniques. Moreover, we provided a novel convergence analysis framework for the adaptive gradient methods under the nonconvex setting. Experimental studies were conducted on both image classification and language modeling tasks, and all empirical results verify the superior performances of our algorithm.

Acknowledgments and Disclosure of Funding

This work was partially supported by NSF IIS 1845666, 1852606, 1838627, 1837956, 1956002, OIA 2040588. Feihu and Heng are corresponding Authors.

References

Appendix A Proofs of Convergence Analysis

In this section, we detail the convergence analysis of our algorithm. We first provide some useful lemmas.

Given a $\rho$ -strongly convex function $w(x)$ , we define a prox-function (i.e., Bregman distance) associated with $w(x)$ as follows:

Then we define a generalized projection problem as in :

where $\rho>0$ depends on $\rho$ -strongly convex function $w(x)$ .

When $h(x)=0$ , in the above Lemma 1, we have $\langle g,\mathcal{G}_{\mathcal{X}}(x,g,\gamma)\rangle\geq\rho\|\mathcal{G}_{\mathcal{X}}(x,g,\gamma)\|^{2}$ .

(Proposition 1 in ) Let $x_{1}^{*}$ and $x_{2}^{*}$ be given in (23) with $g$ replaced by $g_{1}$ and $g_{2}$ respectively. Further let $\mathcal{G}_{\mathcal{X}}(x,g_{1},\gamma)$ and $\mathcal{G}_{\mathcal{X}}(x,g_{2},\gamma)$ be defined in (24) with $x^{*}$ replaced by $x_{1}^{*}$ and $x_{2}^{*}$ respectively. Then we have

Suppose that the sequence $\{x_{t}\}_{t=1}^{T}$ be generated from Algorithm 1. Let $0<\mu_{t}\leq 1$ and $0<\gamma\leq\frac{\rho}{2L\mu_{t}}$ , then we have

According to Assumption 4 or 5, i.e., the function $f(x)$ is $L$ -smooth, we have

According to Assumption 3, i.e., $H_{t}\succ\rho I_{d}$ for any $t\geq 1$ , the function $w_{t}(x)=\frac{1}{2}x^{T}H_{t}x$ is $\rho$ -strongly convex, then we have a prox-function associated with $w_{t}(x)$ as in , defined as

Next, consider the bound of the term $T_{2}$ , we have

where the first inequality is due to the Cauchy-Schwarz inequality and the last is due to Young’s inequality. By combining the above inequalities (A), (31) with (A), we obtain

where the last inequality is due to $0<\gamma\leq\frac{\rho}{2L\mu_{t}}$ .

In this subsection, we provide the convergence analysis of our SUPER-ADAM ( $\tau=1$ ) algorithm.

In Algorithm 1, given $\tau=1$ and $0<\alpha_{t+1}\leq 1$ for all $t\geq 0$ , we have

This proof mainly follows the proof of Lemma 2 in . By the definition of $g_{t+1}$ in Algorithm 1 with $\tau=1$ , we have $g_{t+1}=\nabla f(x_{t+1};\xi_{t+1})+(1-\alpha_{t+1})(g_{t}-\nabla f(x_{t};\xi_{t+1}))$ . Then we have

where $G=\frac{f(x_{1})-f^{*}}{k\rho\gamma}+\frac{m^{1/3}\sigma^{2}}{8k^{2}L^{2}\gamma^{2}}+\frac{k^{2}c^{2}\sigma^{2}}{4L^{2}\gamma^{2}}\ln(m+T)$ .

Since $\mu_{t}=\frac{k}{(m+t)^{1/3}}$ on $t$ is decreasing and $m\geq k^{3}$ , we have $\mu_{t}\leq\mu_{0}=\frac{k}{m^{1/3}}\leq 1$ for all $t\geq 0$ . Due to $c\leq\frac{m^{2/3}}{k^{2}}$ , we have $\alpha_{t+1}=c\mu_{t}^{2}\leq c\mu_{0}^{2}\leq\frac{ck^{2}}{m^{2/3}}\leq\frac{m^{2/3}}{k^{2}}\frac{k^{2}}{m^{2/3}}=1$ . Considering $0<\gamma\leq\frac{\rho m^{1/3}}{4Lk}$ , we have $0<\gamma\leq\frac{\rho m^{1/3}}{4Lk}\leq\frac{\rho m^{1/3}}{2Lk}=\frac{\rho}{2L\mu_{0}}\leq\frac{\rho}{2L\mu_{t}}$ for any $t\geq 0$ . Thus, the parameters $\mu_{t}$ , $\alpha_{t+1}$ for all $t\geq 0$ and $\gamma$ satisfy the conditions in the above Lemmas 3 and 4.

According to the concavity of the function $f(x)=x^{1/3}$ , i.e., $(x+y)^{1/3}\leq x^{1/3}+\frac{y}{3x^{2/3}}$ for any $x,y>0$ , we can obtain

where the second inequality holds by $m\geq 3/2$ and the last inequality is due to $0<\mu_{t}\leq 1$ .

where the second inequality is due to $0<\alpha_{t+1}\leq 1$ , and the last equality holds by $\alpha_{t+1}=c\mu_{t}^{2}$ . Since $c\geq\frac{1}{k^{3}}+\frac{10L^{2}\gamma^{2}}{\rho^{2}}$ , we have

Due to $c\leq\frac{m^{2/3}}{k^{2}}$ , $c\geq\frac{1}{k^{3}}+\frac{10L^{2}\gamma^{2}}{\rho^{2}}$ and $0<\gamma\leq\frac{\rho m^{1/3}}{4kL}$ , we require

Then we obtain $m\geq\frac{8^{3/2}}{(3k)^{3/2}}$ . In the other words, we need $m\geq\frac{8^{3/2}}{(3k)^{3/2}}$ to ensure $\frac{1}{k^{3}}+\frac{10L^{2}\gamma^{2}}{\rho^{2}}\leq c\leq\frac{m^{2/3}}{k^{2}}$ .

where the first inequality is due to the above inequality (A.1) and the above Lemma 3.

By using the above inequality (A.1), we have

where the last inequality holds by Assumptions 1 and 2. Since $\mu_{t}=\frac{k}{(m+t)^{1/3}}$ on $t$ is not increasing, we have

where the second inequality is due to $\sum_{t=1}^{T}\mu_{t}^{3}dt\leq\int^{T}_{1}\mu_{t}^{3}dt$ . Let $G=\frac{f(x_{1})-f^{*}}{k\rho\gamma}+\frac{m^{1/3}\sigma^{2}}{8k^{2}L^{2}\gamma^{2}}+\frac{k^{2}c^{2}\sigma^{2}}{4L^{2}\gamma^{2}}\ln(m+T)$ , we have

According to Jensen’s inequality, we have

where the last inequality holds by $(a+b)^{1/6}\leq a^{1/6}+b^{1/6}$ for any $a,b>0$ . Thus we can obtain

Let $w_{t}(x)=\frac{1}{2}x^{T}H_{t}x$ , we give a prox-function (i.e., Bregman distance) associated with $w_{t}(x)$ , defined as:

Then the step 9 of Algorithm 1 is equivalent to the following generalized projection problem:

According to the above Lemma 2, we have $\|\mathcal{G}_{\mathcal{X}}(x_{t},g_{t},\gamma)-\mathcal{G}_{\mathcal{X}}(x_{t},\nabla f(x_{t}),\gamma)\|\leq\frac{1}{\rho}\|\nabla f(x_{t})-g_{t}\|$ . Then we have

By combining the above inequalities (47) with (A.1), we have

where $G^{\prime}=\nu L(f(x_{1})-f^{*})+\frac{\nu^{2}\sigma^{2}}{8}+\frac{\nu^{2}k^{4}c^{2}\sigma^{2}}{4m^{1/3}}\ln(m+T)$ .

where the second last inequality holds by $\|H_{t}\|\geq\rho$ . Then we have

By using Cauchy-Schwarz inequality, we have

According to the above inequality (45), we have

By combining the inequalities (56) and (57), we obtain

Since $\gamma=\frac{\rho m^{1/3}}{\nu kL}\ (\nu\geq 4)$ , we have

where $G^{\prime}=\nu L(f(x_{1})-f^{*})+\frac{\nu^{2}\sigma^{2}}{8}+\frac{\nu^{2}k^{4}c^{2}\sigma^{2}}{4m^{1/3}}\ln(m+T)$ . Plugging $G=\frac{1}{\rho^{2}m^{1/3}}G^{\prime}$ into the above inequality (58), we have

A.2 Convergence Analysis of SUPER-ADAM (τ=0𝜏0\tau=0)

In this subsection, we provide the convergence analysis of our SUPER-ADAM ( $\tau=0$ ) algorithm.

In Algorithm 1, given $\tau=0$ and $0<\alpha_{t+1}\leq 1$ for all $t\geq 0$ , we have

By the definition of $g_{t+1}$ in Algorithm 1 with $\tau=0$ , we have $g_{t+1}=(1-\alpha_{t+1})g_{t}+\alpha_{t+1}\nabla f(x_{t+1};\xi_{t+1})$ . Then we have

where $M=\frac{f(x_{1})-f^{*}}{\rho\gamma k}+\frac{2\sigma^{2}}{\rho\gamma kL}+\frac{2m\sigma^{2}}{\rho\gamma kL}\ln(m+T)$ .

Since $\mu_{t}=\frac{k}{(m+t)^{1/2}}$ is decreasing on $t$ and $m\geq k^{2}$ , we have $\mu_{t}\leq\mu_{0}=\frac{k}{m^{1/2}}\leq 1$ for all $t\geq 0$ . Due to $c\leq\frac{m^{1/2}}{k}$ , we have $\alpha_{t+1}=c\mu_{t}\leq c\mu_{0}\leq\frac{m^{1/2}}{k}\frac{k}{m^{1/2}}=1$ for all $t\geq 0$ . Since $0<\gamma\leq\frac{\rho m^{1/2}}{8Lk}$ , we have $\gamma\leq\frac{\rho m^{1/2}}{8Lk}\leq\frac{\rho m^{1/2}}{2Lk}=\frac{\rho}{2L\mu_{0}}\leq\frac{\rho}{2L\mu_{t}}$ for all $t\geq 0$ . Thus, the parameters $\mu_{t}$ , $\alpha_{t+1}$ for all $t\geq 0$ and $\gamma$ satisfy the conditions in the above Lemmas 3 and 5.

where the first equality is due to $\alpha_{t+1}=c\mu_{t}$ and the last equality holds by $\frac{8L\gamma}{\rho}\leq c\leq\frac{m^{1/2}}{k}$ .

where the first inequality follows by the above inequality (A.2) and the above Lemma 3.

Taking average over $t=1,2,\cdots,T$ on both sides of (A.2), we have

where the second inequality holds by Assumptions 1 and 2. Since $\mu_{t}=\frac{k}{(m+t)^{1/2}}$ is decreasing on $t$ , we have

Let $M=\frac{f(x_{1})-f^{*}}{\rho\gamma k}+\frac{2\sigma^{2}}{\rho\gamma kL}+\frac{2m\sigma^{2}}{\rho\gamma kL}\ln(m+T)$ , the above inequality (A.2) reduces to

According to Jensen’s inequality, we have

where the last inequality is due to the inequality $(a+b)^{1/4}\leq a^{1/4}+b^{1/4}$ for all $a,b\geq 0$ . Thus, we have

By using the above inequality (A.1), we obtain

where $M^{\prime}=\nu L(f(x_{1})-f^{*})+2\nu\sigma^{2}+2\nu m\sigma^{2}\ln(m+T)$ .

According to the above inequality (56), we have

According to the above inequality (67), we have

By combining the inequalities (71) and (72), we obtain

Since $\gamma=\frac{\rho m^{1/2}}{\nu Lk}\ (\nu\geq 8)$ , we have

where $M^{\prime}=\nu L(f(x_{1})-f^{*})+2\nu\sigma^{2}+2\nu m\sigma^{2}\ln(m+T)$ .

Plugging $M=\frac{1}{\rho^{2}m^{1/2}}M^{\prime}$ into the above inequality (73), we obtain

Appendix B Additional Experimental Results

In the section, we conduct some numerical experiments to empirically evaluate our SUPER-ADAM algorithm on two deep learning tasks as in : image classification on CIFAR-10, CIFAR-100 and Image-Net datasets and language modeling on Wiki-Text2 dataset (Please see Table 2).

We add some experimental results of ImageNet data given in Figure 6.

B.2 Language Modeling Task

In the experiment, we conduct language modeling task on the Wiki-Text2 dataset. Specifically, we train a 2-layer LSTM and a 2-layer Transformer over the WiKi-Text2 dataset. For the 2-layer Transformer, we use 200 dimensional word embeddings, 200 hidden unites and 2 heads. What’s more, we set the batch size as 20 and trains for 40 epochs with dropout rate 0.5. We also clip the gradients by norm 0.25 same as when we use LSTM. We decrease the learning by 4 whenever the validation error increases. For the learning rate, we also do grid search and report the best one for each optimizer. In Adam and Amsgrad algorithms, we set the learning rate as 0.0002 in Transformer. In AdaGrad-Norm algorithm, the best learning rate is 10 in Transformer. In Adam+ algorithm, we use the learning rate 20. In AdaBelief algorithm, we set the learing 1. In STORM algorithm, we set $k=12.5$ , $w=100$ and $c=10$ . In our SUPER-ADAM algorithm, we set $k=1$ , $m=100$ , $c=20$ , $\gamma=0.002$ when $\tau=1$ , while $k=1$ , $m=100$ , $c=20$ , $\gamma=0.003$ when $\tau=0$ .

Figure 7 shows that both train and test perplexities (losses) for different optimizers. When $\tau=1$ , our SUPER-ADAM algorithm outperforms all the other optimizers. When $\tau=0$ , our SUPER-ADAM optimizer gets a comparable performance with the other Adam-type algorithms.