Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration

Soham De, Anirbit Mukherjee, Enayat Ullah

Introduction

Momentum methods like HB and NAG have been shown to have superior convergence properties compared to gradient descent in the deterministic setting both for convex and non-convex functions [nesterov1983method, polyak1987introduction, zavriev1993heavy, ochs2016local, o2017behavior, jin2017accelerated]. While (to the best of our knowledge) there is no clear theoretical justification in the stochastic case of the benefits of NAG and HB over regular SGD in general [yuan2016influence, kidambi2018on, wiegerinck1994stochastic, orr1994momentum, yang2016unified, gadat2018stochastic], unless considering specialized function classes [loizou2017momentum]; in practice, these momentum methods, and in particular NAG, have been repeatedly shown to have good convergence and generalization on a range of neural net problems [sutskever2013importance, lucas2018aggregated, kidambi2018on].

The performance of NAG (as well as HB and SGD), however, are typically quite sensitive to the selection of its hyper-parameters: step size, momentum and batch size [sutskever2013importance]. Thus, “adaptive gradient” algorithms such as RMSProp (Algorithm 2) [tieleman2012lecture] and ADAM (Algorithm 3) [kingma2014adam] have become very popular for optimizing deep neural networks [melis2017state, xu2015show, denkowski2017stronger, gregor2015draw, radford2015unsupervised, bahar2017empirical, kiros2015skip]. The reason for their widespread popularity seems to be the fact that they are believed to be easier to tune than SGD, NAG or HB. Adaptive gradient methods use as their update direction a vector which is the image under a linear transformation (often called the “diagonal pre-conditioner”) constructed out of the history of the gradients, of a linear combination of all the gradients seen till now. It is generally believed that this “pre-conditioning” makes these algorithms much less sensitive to the selection of its hyper-parameters. A precursor to these RMSProp and ADAM was outlined in [duchi2011adaptive].

Despite their widespread use in neural net problems, adaptive gradients methods like RMSProp and ADAM lack theoretical justifications in the non-convex setting - even with exact/deterministic gradients [bernstein2018signsgd]. Further, there are also important motivations to study the behavior of these algorithms in the deterministic setting because of usecases where the amount of noise is controlled during optimization, either by using larger batches [martens2015optimizing, de2017automated, babanezhad2015stop] or by employing variance-reducing techniques [johnson2013accelerating, defazio2014saga].

Further, works like [wilson2017marginal] and [keskar2017improving] have shown cases where SGD (no momentum) and HB (classical momentum) generalize much better than RMSProp and ADAM with stochastic gradients. [wilson2017marginal] also show that ADAM generalizes poorly for large enough nets and that RMSProp generalizes better than ADAM on a couple of neural network tasks (most notably in the character-level language modeling task). But in general it’s not clear and no heuristics are known to the best of our knowledge to decide whether these insights about relative performances (generalization or training) between algorithms hold for other models or carry over to the full-batch setting.

In this work we try to shed some light on the above described open questions about adaptive gradient methods in the following two ways.

To the best of our knowledge, this work gives the first convergence guarantees for adaptive gradient based standard neural-net training heuristics. Specifically we show run-time bounds for deterministic RMSProp and ADAM to reach approximate criticality on smooth non-convex functions, as well as for stochastic RMSProp under an additional assumption.

Recently, [reddi2018convergence] have shown in the setting of online convex optimization that there are certain sequences of convex functions where ADAM and RMSprop fail to converge to asymptotically zero average regret. We contrast our findings with Theorem $3$ in [reddi2018convergence]. Their counterexample for ADAM is constructed in the stochastic optimization framework and is incomparable to our result about deterministic ADAM. Our proof of convergence to approximate critical points establishes a key conceptual point that for adaptive gradient algorithms one cannot transfer intuitions about convergence from online setups to their more common use case in offline setups.

Our second contribution is empirical investigation into adaptive gradient methods, where our goals are different from what our theoretical results are probing. We test the convergence and generalization properties of RMSProp and ADAM and we compare their performance against NAG on a variety of autoencoder experiments on MNIST data, in both full and mini-batch settings. In the full-batch setting, we demonstrate that ADAM with very high values of the momentum parameter ( $\beta_{1}=0.99$ ) matches or outperforms carefully tuned NAG and RMSProp, in terms of getting lower training and test losses. We show that as the autoencoder size keeps increasing, RMSProp fails to generalize pretty soon. In the mini-batch experiments we see exactly the same behaviour for large enough nets. We further validate this behavior on an image classification task on CIFAR-10 using a VGG-9 convolutional neural network, the results to which we present in the Appendix LABEL:vgg9_sec.

We note that recently it has been shown by [lucas2018aggregated], that there are problems where NAG generalizes better than ADAM even after tuning $\beta_{1}$ . In contrast our experiments reveal controlled setups where tuning ADAM’s $\beta_{1}$ closer to $1$ than usual practice helps close the generalization gap with NAG and HB which exists at standard values of $\beta_{1}$ .

Much after this work was completed we came to know of a related paper [li2018convergence] which analyzes convergence rates of a modification of AdaGrad (not RMSPRop or ADAM). After the initial version of our work was made public, a few other analysis of adaptive gradient methods have also appeared like [chen2018convergence], [zhou2018convergence] and [zaheer2018adaptive].

Notations and Pseudocodes

Firstly we define the smoothness property that we assume in our proofs for all our non-convex objectives. This is a standard assumption used in the optimization literature.

We need one more definition that of square-root of diagonal matrices,

Now we list out the pseudocodes used for NAG, RMSProp and ADAM in theory and experiments,

Nesterov’s Accelerated Gradient (NAG) Algorithm

Convergence Guarantees for ADAM and RMSProp

Previously it has been shown in [rangamani2017critical] that mini-batch RMSProp can off-the-shelf do autoencoding on depth $2$ autoencoders trained on MNIST data while similar results using non-adaptive gradient descent methods requires much tuning of the step-size schedule. Here we give the first result about convergence to criticality for stochastic RMSProp albeit under a certain technical assumption about the training set (and hence on the first order oracle). Towards that we need the following definition,