YellowFin and the Art of Momentum Tuning

Jian Zhang, Ioannis Mitliagkas

Introduction

Accelerated forms of stochastic gradient descent (SGD), pioneered by Polyak and Nesterov , are the de-facto training algorithms for deep learning. Their use requires a sane choice for their hyperparameters: typically a learning rate and momentum parameter . However, tuning hyperparameters is arguably the most time-consuming part of deep learning, with many papers outlining best tuning practices written . Deep learning researchers have proposed a number of methods to deal with hyperparameter optimization, ranging from grid-search and smart black-box methods to adaptive optimizers. Adaptive optimizers aim to eliminate hyperparameter search by tuning on the fly for a single training run: algorithms like AdaGrad , RMSProp and Adam use the magnitude of gradient elements to tune learning rates individually for each variable and have been largely successful in relieving practitioners of tuning the learning rate.

Recently some researchers have started favoring simple momentum SGD over the previously mentioned adaptive methods , often reporting better test scores . Motivated by this trend, we ask the question: can simpler adaptive methods based on momentum SGD perform as well or better? We empirically show, with a hand-tuned learning rate, Polyak’s momentum SGD achieves faster convergence than Adam for a large class of models. We then formulate the optimization update as a dynamical system and study certain robustness properties of the momentum operator. Inspired by our analysis, we design \tuner, an automatic hyperparameter tuner for momentum SGD. \tunersimultaneously tunes the learning rate and momentum on the fly, and can handle the complex dynamics of asynchronous execution. Our contribution and outline are as follows:

In Section 2, we demonstrate examples where momentum offers convergence robust to learning rate misspecification and curvature variation in a class of non-convex objectives. This robustness is desirable for deep learning. It stems from a known but obscure fact: the momentum operator’s spectral radius is constant in a large subset of the hyperparameter space.

In Section 3, we use these robustness insights and a simple quadratic model analysis to motivate the design of \tuner, an automatic tuner for momentum SGD. \tuneruses on-the-fly measurements from the gradients to tune both a single learning rate and a single momentum.

In Section 3.3, we discuss common stability concerns related to the phenomenon of exploding gradients . We present a natural extension to our basic tuner, using adaptive gradient clipping, to stabilize training for objectives with exploding gradients.

In Section 4 we present closed-loop YellowFin, suited for asynchronous training. It uses a novel component for measuring the total momentum in a running system, including any asynchrony-induced momentum, a phenomenon described in . This measurement is used in a negative feedback loop to control the value of algorithmic momentum.

We provide a thorough empirical evaluation of the performance and stability of our tuner. In Section 5, we demonstrate empirically that on ResNets and LSTMs \tunercan converge in fewer iterations compared to: (i) hand-tuned momentum SGD (up to 1.751.75x speedup); and (ii) hand-tuned Adam (0.770.77x to 3.283.28x speedup). Under asynchrony, the closed-loop control architecture speeds up \tuner, making it up to 2.692.69x faster than Adam. Our experiments include runs on 77 different models, randomized over at least 33 different random seeds. \tuneris stable and achieves consistent performance: the normalized sample standard deviation of test metrics varies from 0.05%0.05\% to 0.6%0.6\%. We released PyTorch and TensorFlow implementations TensorFlow: goo.gl/zC2rjG. PyTorch: goo.gl/N4sFfsthat can be used as drop-in replacements for any optimizer. \tunerhas also been implemented in various other packages. Its large-scale deployment in industry has taught us important lessons about stability; we discuss those challenges and our solution in Section 3.3. We conclude with related work and discussion in Section 6 and 7.

The momentum operator

In this section, we identify the main technical insight behind the design of \tuner: gradient descent with momentum can exhibit linear convergence robust to learning rate misspecification and to curvature variation. The robustness to learning rate misspecification means tolerance to a less-carefully-tuned learning rate. On the other hand, the robustness to curvature variation means empirical linear convergence on a class of non-convex objectives with varying curvatures. After preliminary on momentum, we discuss these two properties desirable for deep learning objectives.

We aim to minimize some objective f(x)f(x). In machine learning, xx is referred to as the model and the objective is some loss function. A low loss implies a well-fit model. Gradient descent-based procedures use the gradient of the objective function, f(x)\nabla f(x), to update the model iteratively. These procedures can be characterized by the convergence rate with respect to the distance to a minimum.

Let xx^{*} be a local minimum of f(x)f(x) and xtx_{t} denote the model after tt steps of an iterative procedure. The iterates converge to xx^{*} with linear rate β\beta, if

Polyak’s momentum gradient descent is one of these iterative procedures, given by

where α\alpha denotes a single learning rate and μ\mu a single momentum for all model variables. Momentum’s main appeal is its established ability to accelerate convergence . On a γ\gamma-strongly convex δ\delta-smooth function with condition number κ=δ/γ\kappa=\delta/\gamma, the optimal convergence rate of gradient descent without momentum is O(κ1κ+1)O(\frac{\kappa-1}{\kappa+1}) . On the other hand, for certain classes of strongly convex and smooth functions, like quadratics, the optimal momentum value,

yields the optimal accelerated linear convergence rate O(κ1κ+1)O(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}). This guarantee does not generalize to arbitrary strongly convex smooth functions . Nonetheless, this linear rate can often be observed in practice even on non-quadratics (cf. Section 2.2).

Key insight: Consider a quadratic objective with condition number κ>1\kappa>1. Even though its curvature is different along the different directions, Polyak’s momentum gradient descent, with μμ\mu\geq\mu^{*}, achieves the same linear convergence rate μ\sqrt{\mu} along all directions. Specifically, let xi,tx_{i,t} and xix_{i}^{*} be the i-th coordinates of xtx_{t} and x ⁣x^{*}\!. For any μμ\mu\geq\mu^{*} with an appropriate learning rate, the update in (1) can achieve xi,txiμtxi,0xi|x_{i,t}-x_{i}^{*}|\leq\sqrt{\mu}^{t}|x_{i,0}-x_{i}^{*}| simultaneously along all axes ii. This insight has been hidden away in proofs.

In this quadratic case, curvature is different across different axes, but remains constant on any one-dimensional slice. In the next section (Section 2.2), we extend this insight to non-quadratic one-dimensional functions. We then present the main technical insight behind the design of \tuner: similar linear convergence rate μ\sqrt{\mu} can be achieved in a class of one-dimensional non-convex objectives where curvature varies; this linear convergence behavior is robust to learning rate misspecification and to the varying curvature. These robustness properties are behind a tuning rule for learning rate and momentum in Section 2.2. We extend this rule to handle SGD noise and generalize it to multidimensional objectives in Section 3.

2 Robustness properties of the momentum operator

In this section, we analyze the dynamics of momentum on a class of one-dimensional, non-convex objectives. We first introduce the notion of generalized curvature and use it to describe the momentum operator. Then we discuss the robustness properties of the momentum operator.

Curvature along different directions is encoded in the different eigenvalues of the Hessian. It is the only feature of a quadratic needed to characterize the convergence of gradient descent. Specifically, gradient descent achieves a linear convergence rate 1αhc|1-\alpha h_{c}| on one-dimensional quadratics with constant curvature hch_{c}. On one-dimensional non-quadratic objectives with varying curvature, this neat characterization is lost. We can recover it by defining a new kind of “curvature” with respect to a specific minimum.

Generalized curvature describes, in some sense, non-local curvature with respect to minimum xx^{*}. It coincides with curvature on quadratics. On non-quadratic objectives, it characterizes the convergence behavior of gradient descent-based algorithms. Specifically, we recover the fact that starting at point xtx_{t}, distance from minimum xx^{*} is reduced by 1αh(xt)|1-\alpha h(x_{t})| in one step of gradient descent. Using a state-space augmentation, we can rewrite the momentum update of (1) as

where the momentum operator At\bm{\mathit{A}}_{t} at time tt is defined as

Assume that generalized curvature hh and hyperparameters α,μ\alpha,\mu satisfy

Then as proven in Appendix A, the spectral radius of the momentum operator at step tt depends solely on the momentum parameter: ρ(At)=μ\rho(\bm{\mathit{A}}_{t})=\sqrt{\mu}, for all tt. The inequalities in (6) define the robust region, the set of learning rate α\alpha and momentum μ\mu achieving this μ\sqrt{\mu} spectral radius.

We know that the spectral radius of an operator, A\bm{\mathit{A}}, describes its asymptotic behavior when applied multiple times: AtxO(ρ(A)t)\|A^{t}x\|\approx O(\rho(\bm{\mathit{A}})^{t}). For any ϵ>0\epsilon>0, there exists a matrix norm \|\cdot\| such that Aρ(A)+ϵ\|\bm{\mathit{A}}\|\leq\rho(A)+\epsilon . Unfortunately, the same does not always hold for the composition of different operators, even if they have the same spectral radius, ρ(At)=μ\rho(\bm{\mathit{A}}_{t})=\sqrt{\mu}. It is not always true that AtA1x=O(μt)\|\bm{\mathit{A}}_{t}\cdots\bm{\mathit{A}}_{1}x\|=O(\sqrt{\mu}^{t}). However, a homogeneous spectral radius often yields the μt\sqrt{\mu}^{t} rate empirically. In other words, this linear convergence rate is not guaranteed. Instead, we demonstrate examples to expose the robustness properties: if the learning rate α\alpha and momentum μ\mu are in the robust region, the homogeneity of spectral radii can empirically yield linear convergence with rate μ\sqrt{\mu}; this behavior is robust with respect to learning rate misspecification and to varying curvature.

For a one-dimensional quadratic with curvature hh, we have generalized curvature h(x)=hh(x)=h for all xx. Lemma 3 implies the spectral radius ρ(At) ⁣= ⁣μ\rho(\bm{\mathit{A}}_{t})\!=\!\sqrt{\mu} if

In Figure 2, we plot ρ(At)\rho(\bm{\mathit{A}}_{t}) for different α\alpha and μ\mu when h ⁣= ⁣1h\!=\!1. The solid line segments correspond to the robust region. As we increase momentum, a linear rate of convergence, μ\sqrt{\mu}, is robustly achieved by an ever-widening range of learning rates: higher values of momentum are more robust to learning rate mispecification. This property influences the design of our tuner: more generally for a class of one-dimensional non-convex objectives, as long as the learning rate α\alpha and momentum μ\mu are in the robust region, i.e. satisfy (6) at every step, then momentum operators at all steps tt have the same spectral radius. In the case of quadratics, this implies a convergence rate of μ\sqrt{\mu}, independent of the learning rate. Having established that, we can just focus on optimally tuning momentum.

Momentum is robust to varying curvature

As discussed in Section 2.1, the intuition hidden in classic results is that for certain strongly convex smooth objectives, momentum at least as high as the value in (2) can achieve the same rate of linear convergence along all axes with different curvatures. We extend this intuition to certain one-dimensional non-convex functions with varying curvatures along their domains; we discuss the generalization to multidimensional cases in Section 3.1. Lemma 3 guarantees constant, time-homogeneous spectral radii for momentum operators At\bm{\mathit{A}}_{t} assuming (6) is satisfied at every step. This assumption motivates a “long-range” extension of the condition number.

The GCN captures variations in generalized curvature along a scalar slice. From Lemma 3 we get

as the description of the robust region. The momentum and learning rate satisfying (9) guarantees a homogeneous spectral radius of μ\sqrt{\mu} for all At\bm{\mathit{A}}_{t}. Specifically, μ\mu^{*} is the smallest momentum value that allows for homogeneous spectral radii. We demonstrate with examples that homogeneous spectral radii suggest an empirical linear convergence behavior on a class of non-convex objectives. In Figure 3(a), the non-convex objective, composed of two quadratics with curvatures 11 and 10001000, has a GCN of 10001000. Using the tuning rule of (9), and running the momentum algorithm (Figure 3(b)) practically yields the linear convergence predicted by Lemma 3. In Figures 3(c,d), we demonstrate an LSTM as another example. As we increase the momentum value (the same value for all variables in the model), more model variables follow a μ\sqrt{\mu} convergence rate. In these examples, the linear convergence is robust to the varying curvature of the objectives. This property influences our tuner design: in the next section, we extend the tuning rules of (9) to handle SGD noise; we generalize the extended rule to multidimensional cases as the tuning rule in \tuner.

The \tunertuner

Here we describe our tuner for momentum SGD that uses the same learning rate for all variables. We first introduce a noisy quadratic model f(x)f(x) as the local approximation of an arbitrary one-dimensional objective. On this approximation, we extend the tuning rule of (9) to SGD. In section 3.1, we generalize the discussion to multidimensional objectives; it yields the \tunertuning rule.

Let f(x)f(x) be defined as in (10), x1=x0x_{1}=x_{0} and xtx_{t} follow the momentum update (1) with stochastic gradients fSt(xt1)\nabla f_{S_{t}}(x_{t-1}) for t2t\geq 2. Let e1=T\bm{\mathit{e}}_{1}=^{T}, the expectation of squared distance to the optimum xx^{*} is

where the first and second term correspond to squared bias and variance, and their corresponding momentum dynamics are captured by operators

Even though it is possible to numerically work on (11) directly, we use a scalar, asymptotic surrogate in (13) based on the spectral radii of operators to simplify analysis and expose insights. This decision is supported by our findings in Section 2: the spectral radii can capture empirical convergence rate.

One of our design decisions for \tuneris to always work in the robust region of Lemma 3. We know that this implies a spectral radius μ\sqrt{\mu} of the momentum operator, A\bm{\mathit{A}}, for the bias. Lemma 6 shows that under the exact same condition, the variance operator B\bm{\mathit{B}} has spectral radius μ\mu.

The spectral radius of the variance operator, B\bm{\mathit{B}} is μ\mu, if (1μ)2αh(1+μ)2{(1-\sqrt{\mu})^{2}}\leq\alpha h\leq{(1+\sqrt{\mu})^{2}}.

As a result, the surrogate objective of (13), takes the following form in the robust region.

We extend this surrogate to multidimensional cases to extract a noisy tuning rule for \tuner.

1 Tuning rule

In this section, we present SingleStep, the tuning rule of YellowFin (Algorithm 1). Based on the surrogate in (14), SingleStep is a multidimensional SGD version of the noiseless tuning rule in (9). We first generalize (9) and (14) to multidimensional cases, and then discuss SingleStep.

As discussed in Section 2.2, GCN ν\nu captures the dynamic range of generalized curvatures in a one-dimensional objective with varying curvature. The consequent robust region described by (9) implies homogeneous spectral radii. On a multidimensional non-convex objective, each one-dimensional slice passing a minimum xx^{*} can have varying curvature. As we use a single μ\mu and α\alpha for the entire model, if ν\nu simultaneously captures the dynamic range of generalized curvature over all these slices, μ\mu and α\alpha in (9) are in the robust region for all these slices. This implies homogeneous spectral radii μ\sqrt{\mu} according to Lemma 3, empirically facilitating convergence at a common rate along all the directions.

Let DD be an estimate of the current model’s distance to a local quadratic approximation’s minimum, and CC denote an estimate for gradient variance. SingleStep minimizes the multidimensional surrogate after a single step (i.e. t=1t=1) while ensuring μ\mu and α\alpha in the robust region for all directions. A single instance of SingleStep solves a single momentum and learning rate for the entire model at each iteration. Specifically, the extremal curvatures hminh_{min} and hmaxh_{max} denote estimates for the largest and smallest generalized curvature respectively. They are meant to capture both generalized curvature variation along all different directions (like the classic condition number) and also variation that occurs as the landscape evolves. The constraints keep the global learning rate and momentum in the robust region (defined in Lemma 3) for slices along all directions. SingleStep can be solved in closed form; we refer to Appendix D for relevant details on the closed form solution. \tuneruses functions CurvatureRange, Variance and Distance to measure quantities hmaxh_{\max}, hminh_{\min}, CC and DD respectively. These measurement functions can be designed in different ways. We present the implementations we used for our experiments, based completely on gradients, in Section 3.2.

2 Measurement functions in \tuner

This section describes our implementation of the measurement oracles used by \tuner: CurvatureRange, Variance, and Distance. We design the measurement functions with the assumption of a negative log-probability objective; this is in line with typical losses in machine learning, e.g. cross-entropy for neural nets and maximum likelihood estimation in general. Under this assumption, the Fisher information matrix—i.e. the expected outer product of noisy gradients—approximates the Hessian of the objective . This allows for measurements purely from minibatch gradients with overhead linear to model dimensionality. These implementations are not guaranteed to give accurate measurements. Nonetheless, their use in our experiments in Section 5 shows that they are sufficient for \tunerto outperform the state of the art on a variety of objectives. We also refer to Appendix E for details on zero-debias , slow start and smoothing for curvature range estimation.

Let gtg_{t} be a noisy gradient, we estimate the curvatures range in Algorithm 2. We notice that the outer product gtgtTg_{t}g_{t}^{T} has an eigenvalue ht=gt2h_{t}=\|g_{t}\|^{2} with eigenvector gtg_{t}. Thus under our negative log-likelihood assumption, we use hth_{t} to approximate the curvature of Hessian along gradient direction gtg_{t}. Specifically, we maintain hminh_{\min} and hmaxh_{\max} as running averages of extreme curvature hmin,th_{\min,t} and hmax,th_{\max,t}, from a sliding window of width 20. As gradient directions evolve, we estimate curvatures along different directions. Thus hminh_{\min} and hmaxh_{\max} capture the curvature variations.

Gradient variance

Distance to optimum

In Algorithm 4, we estimate the distance to the optimum of the local quadratic approximation. Inspired by the fact that f(x)Hxx\|\nabla f(\bm{\mathit{x}})\|\leq\|\bm{\mathit{H}}\|\|\bm{\mathit{x}}-\bm{\mathit{x}}^{\star}\| for a quadratic f(x)f(x) with Hessian H\bm{\mathit{H}} and minimizer x\bm{\mathit{x}}^{*}, we first maintain h\overline{h} and g\overline{\|g\|} as running averages of curvature hth_{t} and gradient norm gt\|g_{t}\|. Then the distance is approximated using g/h\overline{\|g\|}/\overline{h}.

3 Stability on non-smooth objectives

The process of training neural networks is inherently non-stationary, with the landscape abruptly switching from flat to steep areas. In particular, the objective functions of RNNs with hidden units can exhibit occasional but very steep slopes . To deal with this issue, we use adaptive gradient clipping heuristics as a very natural addition to our basic tuner. It is discussed with extensive details in Appendix F. In Figure 6 in Appendix F, we present an example of an LSTM that exhibits the ’exploding gradient’ issue. The proposed adaptive clipping can stabilize the training process using \tunerand prevent large catastrophic loss spikes.

We validate the proposed adaptive clipping on the convolutional sequence to sequence learning model for IWSLT 2014 German-English translation. The default optimizer uses learning rate 0.250.25 and Nesterov’s momentum 0.990.99, diverging to loss overflow due to ’exploding gradient’. It requires, as in Gehring et al. , strict manually set gradient norm threshold 0.10.1 to stabilize. In Table 3.3, we can see YellowFin, with adaptive clipping, outperforms the default optimizer using manually set clipping, with 0.84 higher validation BLEU4 after 120 epochs.

Closed-loop YellowFin

Asynchrony is a parallelization technique that avoids synchronization barriers . It yields better hardware efficiency, i.e. faster steps, but can increase the number of iterations to a given metric, i.e. statistical efficiency, as a tradeoff . Mitliagkas et al. interpret asynchrony as added momentum dynamics. We design closed-loop YellowFin, a variant of \tunerto automatically control algorithmic momentum, compensate for asynchrony and accelerate convergence. We use the formula in (16) to model the dynamics in the system, where the total momentum, μT\mu_{T}, includes both asynchrony-induced and algorithmic momentum, μ\mu, in (1).

We first use (16) to design an robust estimator μ^T\hat{\mu}_{T} for the value of total momentum at every iteration. Then we use a simple negative feedback control loop to adjust the value of algorithmic momentum so that μ^T\hat{\mu}_{T} matches the target momentum decided by \tunerin Algorithm 1. In Figure 4, we demonstrate momentum dynamics in an asynchronous training system. As directly using the target value as algorithmic momentum, \tuner(middle) presents total momentum μ^T\hat{\mu}_{T} strictly larger than the target momentum, due to asynchrony-induced momentum. Closed-loop YellowFin (right) automatically brings down algorithmic momentum, match measured total momentum μ^T\hat{\mu}_{T} to target value and, as we will see, speeds up convergence comparing to \tuner. We refer to Appendix G for details on estimator μ^T\hat{\mu}_{T} and Closed-loop YellowFin in Algorithm 5.

Experiments

We empirically validate the importance of momentum tuning and evaluate \tunerin both synchronous (single-node) and asynchronous settings. In synchronous settings, we first demonstrate that, with hand-tuning, momentum SGD is competitive with Adam, a state-of-the-art adaptive method. Then, we evaluate \tunerwithout any hand tuning in comparison to hand-tuned Adam and momentum SGD. In asynchronous settings, we show that closed-loop YellowFin accelerates with momentum closed-loop control, significantly outperforming Adam.

We evaluate on convolutional neural networks (CNN) and recurrent neural networks (RNN). For CNN, we train ResNet for image recognition on CIFAR10 and CIFAR100 . For RNN, we train LSTMs for character-level language modeling with the TinyShakespeare (TS) dataset , word-level language modeling with the Penn TreeBank (PTB) , and constituency parsing on the Wall Street Journal (WSJ) dataset . We refer to Table 3 in Appendix H for model specifications. To eliminate influences of a specific random seed, in our synchronous and asynchronous experiments, the training loss and validation metrics are averaged from 3 runs using different random seeds.

We tune Adam and momentum SGD on learning rate grids with prescribed momentum 0.90.9 for SGD. We fix the parameters of Algorithm 1 in all experiments, i.e. \tunerruns without any hand tuning. We provide full specifications, including the learning rate (grid) and the number of iterations we train on each model in Appendix I. For visualization purposes, we smooth training losses with a uniform window of width 10001000. For Adam and momentum SGD on each model, we pick the configuration achieving the lowest averaged smoothed loss. To compare two algorithms, we record the lowest smoothed loss achieved by both. Then the speedup is reported as the ratio of iterations to achieve this loss. We use this setup to validate our claims.

In Table 2, we compare tuned momentum SGD and tuned Adam on ResNets with training losses shown in Figure 8 in Appendix J. We can observe that momentum SGD achieves 1.711.71x and 1.871.87x speedup to tuned Adam on CIFAR10 and CIFAR100 respectively. In Figure 5 and Table 2, with the exception of PTB LSTM, momentum SGD also produces better training loss, as well as better validation perplexity in language modeling and validation F1 in parsing. For the parsing task, we also compare with tuned Vanilla SGD and AdaGrad, which are used in the NLP community. Figure 5 (right) shows that fixed momentum 0.9 can already speedup Vanilla SGD by 2.732.73x, achieving observably better validation F1. We refer to Appendix J.2 for further discussion on the importance of momentum adaptivity in \tuner.

\tunercan match hand-tuned momentum SGD and can outperform hand-tuned Adam

In our experiments, \tuner, without any hand-tuning, yields training loss matching hand-tuned momentum SGD for all the ResNet and LSTM models in Figure 5 and 8. When comparing to tuned Adam in Table 2, except being slightly slower on PTB LSTM, \tunerachieves 1.381.38x to 3.283.28x speedups in training losses on the other four models. More importantly, \tunerconsistently shows better validation metrics than tuned Adam in Figure 5. It demonstrates that \tunercan match tuned momentum SGD and outperform tuned state-of-the-art adaptive optimizers. In Appendix J.4, we show \tunerfurther speeding up with finer-grain manual learning rate tuning.

2 Asynchronous experiments

In this section, we evaluate closed-loop YellowFin with focus on the number of iterations to reach a certain solution. To that end, we run 1616 asynchronous workers on a single machine and force them to update the model in a round-robin fashion, i.e. the gradient is delayed for 1515 iterations. Figure 1 (right) presents training losses on the CIFAR100 ResNet, using \tunerin Algorithm 1, closed-loop YellowFin in Algorithm 5 and Adam with the learning rate achieving the best smoothed loss in Section 5.1. We can observe closed-loop \tunerachieves 20.120.1x speedup to \tuner, and consequently a 2.692.69x speedup to Adam. This demonstrates that (1) closed-loop YellowFin accelerates by reducing algorithmic momentum to compensate for asynchrony and (2) can converge in less iterations than Adam in asynchronous-parallel training.

Related work

Many techniques have been proposed on tuning hyperparameters for optimizers. General hyperparameter tuning approaches, such as random search and Bayesian approaches , can directly tune optimizers. As another trend, adaptive methods, including AdaGrad , RMSProp and Adam , uses per-dimension learning rate. Schaul et al. use a noisy quadratic model similar to ours to tune the learning rate in Vanilla SGD. However they do not use momentum which is essential in training modern neural nets. Existing adaptive momentum approach either consider the deterministic setting or only analyze the stochastic setting with O(1/t)O(1/t) learning rate . In contrast, we aim at practical momentum adaptivity for stochastically training neural nets.

Discussion

We presented \tuner, the first optimization method that automatically tunes momentum as well as the learning rate of momentum SGD. \tuneroutperforms the state-of-the-art adaptive optimizers on a large class of models both in synchronous and asynchronous settings. It estimates statistics purely from the gradients of a running system, and then tunes the hyperparameters of momentum SGD based on noisy, local quadratic approximations. As future work, we believe that more accurate curvature estimation methods, like the bbpropbbprop method can further improve \tuner. We also believe that our closed-loop momentum control mechanism in Section 4 could accelerate other adaptive methods in asynchronous-parallel settings.

Acknowledgements

We are grateful to Christopher Ré for his valuable guidance and support. We thank Bryan He, Paroma Varma, Chris De Sa, Tri Dao, Albert Gu, Fred Sala, Alex Ratner and Theodoros Rekatsinas for helpful discussions and feedbacks. We gratefully acknowledge the support of the D3M program under No. FA8750-17-2-0095. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA or the U.S. government.

References

Appendix A Proof of Lemma 3

To prove Lemma 3, we first prove a more generalized version in Lemma 7. By restricting ff to be a one dimensional quadratics function, the generalized curvature hth_{t} itself is the only eigenvalue. We can prove Lemma 3 as a straight-forward corollary. Lemma 7 also implies, in the multiple dimensional correspondence of (4), the spectral radius ρ(At)=μ\rho(\bm{\mathit{A}}_{t})=\sqrt{\mu} if the curvature on all eigenvector directions (eigenvalue) satisfies (6).

Let the gradients of a function ff be described by

where ytxtx\bm{\mathit{y}}_{t}\triangleq\bm{\mathit{x}}_{t}-\bm{\mathit{x}}^{*}. Now, assume that the following condition holds for all eigenvalues λ(H(xt))\lambda(\bm{\mathit{H}}(\bm{x}_{t})) of H(xt)\bm{\mathit{H}}(\bm{x}_{t}):

then the spectral radius of At\bm{\mathit{A}}_{t} is controlled by momentum with ρ(At)=μ.\rho(\bm{\mathit{A}}_{t})=\sqrt{\mu}.

Let λt\lambda_{t} be an eigenvalue of matrix At\bm{\mathit{A}}_{t}, it gives det(AtλtI)=0\det\left(\bm{\mathit{A}}_{t}-\lambda_{t}\bm{\mathit{I}}\right)=0. We define the blocks in At\bm{\mathit{A}}_{t} as C=IαHt+μIλtI\bm{\mathit{C}}=\bm{\mathit{I}}-\alpha\bm{\mathit{H}}_{t}+\mu\bm{\mathit{I}}-\lambda_{t}\bm{\mathit{I}}, D=μI\bm{\mathit{D}}=-\mu\bm{\mathit{I}}, E=I\bm{\mathit{E}}=\bm{\mathit{I}} and F=λtI\bm{\mathit{F}}=-\lambda_{t}\bm{\mathit{I}} which gives

assuming generally F\bm{\mathit{F}} is invertible. Note we use HtH(xt)\bm{\mathit{H}}_{t}\triangleq\bm{\mathit{H}}(\bm{\mathit{x}}_{t}) for simplicity in writing. The equation det(CDF1E)=0\det{\left(\bm{\mathit{C}}-\bm{\mathit{D}}\bm{\mathit{F}}^{-1}\bm{\mathit{E}}\right)}=0 implies that

with Mt=(IαHt+μI)\bm{\mathit{M}}_{t}=\left(\bm{\mathit{I}}-\alpha\bm{\mathit{H}}_{t}+\mu\bm{\mathit{I}}\right). In other words, λt\lambda_{t} satisfied that λt2λtλ(Mt)+μ=0\lambda_{t}^{2}-\lambda_{t}\lambda(\bm{\mathit{M}}_{t})+\mu=0 with λ(Mt)\lambda(\bm{\mathit{M}}_{t}) being one eigenvalue of Mt\bm{\mathit{M_{t}}}. I.e.

On the other hand, (19) guarantees that (1αλ(Ht)+μ)24μ(1-\alpha\lambda(\bm{\mathit{H}}_{t})+\mu)^{2}\leq 4\mu. We know both Ht\bm{\mathit{H}}_{t} and IαHt+μI\bm{\mathit{I}}-\alpha\bm{\mathit{H}}_{t}+\mu\bm{\mathit{I}} are symmetric. Thus for all eigenvalues λ(Mt)\lambda(\bm{\mathit{M}}_{t}) of Mt\bm{\mathit{M}}_{t}, we have λ(Mt)2=(1αλ(Ht)+μ)24μ\lambda(\bm{\mathit{M}}_{t})^{2}=(1-\alpha\lambda(\bm{\mathit{H}}_{t})+\mu)^{2}\leq 4\mu which guarantees λt=μ|\lambda_{t}|=\sqrt{\mu} for all λt\lambda_{t}. As the spectral radius is equal to the magnitude of the largest eigenvalue of At\bm{\mathit{A}}_{t}, we have the spectral radius of At\bm{\mathit{A}}_{t} being μ\sqrt{\mu}.

Appendix B Proof of Lemma 5

We first prove Lemma 8 and Lemma 9 as preparation for the proof of Lemma 5. After the proof for one dimensional case, we discuss the trivial generalization to multiple dimensional case.

From the recurrence of momentum SGD, we have

By putting the equation in to matrix form, (22) is a straight-forward result from unrolling the recurrence for tt times. Note as we set x1=x0x_{1}=x_{0} with no uncertainty in momentum SGD, we have [x0,x1]=[x0,x1][\overline{x}_{0},\overline{x}_{1}]=[x_{0},x_{1}]. ∎

We prove by first deriving the recurrence for UtU_{t} and VtV_{t} respectively and combining them in to a matrix form. For UtU_{t}, we have

Note the second term in the second equality is zero because x0x_{0} and x1x_{1} are deterministic. Thus U1 ⁣= ⁣U0 ⁣= ⁣V1 ⁣= ⁣0U_{1}\!=\!U_{0}\!=\!V_{1}\!=\!0. ∎

Appendix C Proof of Lemma 6

Again we first present a proof of a multiple dimensional generalized version of Lemma 6. The proof of Lemma 6 is a one dimensional special case of Lemma 10. Lemma 10 also implies that for multiple dimension quadratics, the corresponding spectral radius ρ(B)=μ\rho(\bm{\mathit{B}})=\mu if (1μ)2αh(1+μ)2α{(1-\sqrt{\mu})^{2}\over\alpha}\leq h\leq{(1+\sqrt{\mu})^{2}\over\alpha} on all the eigenvector directions with hh being the eigenvalue (curvature).

We have ρ(B)=μ\rho(\bm{\mathit{B}})=\mu if all eigenvalues λ(H)\lambda(\bm{\mathit{H}}) of H\bm{\mathit{H}} satisfies

Let λ\lambda be an eigenvalue of matrix B\bm{\mathit{B}}, it gives det(BλI)=0\det\left(\bm{\mathit{B}}-\lambda\bm{\mathit{I}}\right)=0 which can be alternatively expressed as

assuming F\bm{\mathit{F}} is invertible, i.e. λ+μ0\lambda+\mu\neq 0, where the blocks in B\bm{\mathit{B}}

with M=IαH+μI\bm{\mathit{M}}=\bm{\mathit{I}}-\alpha\bm{\mathit{H}}+\mu\bm{\mathit{I}}. (30) can be transformed using straight-forward algebra as

Using similar simplification technique as in (30), we can further simplify into

if λμ\lambda\neq\mu, as (λ+μ)2IλMM(\lambda+\mu)^{2}\bm{\mathit{I}}-\lambda\bm{\mathit{M}}^{\top}\bm{\mathit{M}} is diagonalizable, we have (λ+μ)2λλ(M)2=0(\lambda+\mu)^{2}-\lambda\lambda(\bm{\mathit{M}})^{2}=0 with λ(M)\lambda(\bm{\mathit{M}}) being an eigenvalue of symmetric M\bm{\mathit{M}}. The analytic solution to the equation can be explicitly expressed as

When the condition in (29) holds, we have λ(M)2=(1αλ(H)+μ)24μ\lambda(M)^{2}=(1-\alpha\lambda(\bm{\mathit{H}})+\mu)^{2}\leq 4\mu. One can verify that

Thus the roots in (33) are conjugate with λ=μ|\lambda|=\mu. In conclusion, the condition in (29) can guarantee all the eigenvalues of B\bm{\mathit{B}} has magnitude μ\mu. Thus the spectral radius of B\bm{\mathit{B}} is controlled by μ\mu. ∎

Appendix D Analytical solution to (15)

The problem in (15) does not need iterative solver but has an analytical solution. Substituting only the second constraint, the objective becomes p(x)=x2D2+(1x)4/hmin2Cp(x)=x^{2}D^{2}+(1-x)^{4}/h_{\min}^{2}C with x=μ[0,1)x=\sqrt{\mu}\in[0,1). By setting the gradient of p(x)p(x) to 0, we can get a cubic equation whose root x=μpx=\sqrt{\mu_{p}} can be computed in closed form using Vieta’s substitution. As p(x)p(x) is uni-modal in [0,1)[0,1), the optimizer for (15) is exactly the maximum of μp\mu_{p} and (hmax/hmin1)2/(hmax/hmin+1)2(\sqrt{h_{\max}/h_{\min}}-1)^{2}/(\sqrt{h_{\max}/h_{\min}}+1)^{2}, the right hand-side of the first constraint in (15).

Appendix E Practical implementation

In Section 3.2, we discuss estimators for learning rate and momentum tuning in \tuner. In our experiment practice, we have identified a few practical implementation details which are important for improving estimators. Zero-debias is proposed by Kingma and Ba , which accelerates the process where exponential average adapts to the level of original quantity in the beginning. We applied zero-debias to all the exponential average quantities involved in our estimators. In some LSTM models, we observe that our estimated curvature may decrease quickly along the optimization process. In order to better estimate extremal curvature hmaxh_{\max} and hminh_{\min} with fast decreasing trend, we apply zero-debias exponential average on the logarithmic of hmax,th_{\max,t} and hmin,th_{\min,t}, instead of directly on hmax,th_{\max,t} and hmin,th_{\min,t}. Except from the above two techniques, we also implemented the slow start heuristic proposed by . More specifically, we use α=min{αt,tαt/(10w)}\alpha=\min\{\alpha_{t},t\cdot\alpha_{t}/(10\cdot w)\} as our learning rate with ww as the size of our sliding window in hmaxh_{\max} and hminh_{\min} estimation. It discount the learning rate in the first 10w10\cdot w steps and helps to keep the learning rate small in the beginning when the exponential averaged quantities are not accurate enough.

Appendix F Adaptive gradient clipping in \tuner

Gradient clipping has been established in literature as a standard—almost necessary—tool for training such objectives . However, the classic tradeoff between adaptivity and stability applies: setting a clipping threshold that is too low can hurt performance; setting it to be high, can compromise stability. \tuner, keeps running estimates of extremal gradient magnitude squares, hmaxh_{max} and hminh_{min} in order to estimate a generalized condition number. We posit that hmax\sqrt{h_{max}} is an ideal gradient norm threshold for adaptive clipping. In order to ensure robustness to extreme gradient spikes, like the ones in Figure 6, we also limit the growth rate of the envelope hmaxh_{max} in Algorithm 2 as follows:

Our heuristics follows along the lines of classic recipes like . However, instead of using the average gradient norm to clip, it uses a running estimate of the maximum norm hmaxh_{\max}.

In Section 3.3, we saw that adaptive clipping stabilizes the training on objectives that exhibit exploding gradients. In Figure 7, we demonstrate that the adaptive clipping does not hurt performance on models that do not exhibit instabilities without clipping. Specifically, for both PTB LSTM and CIFAR10 ResNet, the difference between \tunerwith and without adaptive clipping diminishes quickly.

Appendix G Closed-loop \tunerfor asynchronous training

In Section 4, we briefly discuss the closed-loop momentum control mechanism in closed-loop YellowFin. In this section, after presenting more preliminaries on asynchrony, we show with details on the mechanism: it measures the dynamics on a running system and controls momentum with a negative feedback loop.

Asynchrony is a popular parallelization technique that avoids synchronization barriers. When training on MM asynchronous workers, staleness (the number of model updates between a worker’s read and write operations) is on average τ=M1\tau=M-1, i.e., the gradient in the SGD update is delayed by τ\tau iterations as fStτ(xtτ)\nabla f_{S_{t-\tau}}(x_{t-\tau}). Asynchrony yields faster steps, but can increase the number of iterations to achieve the same solution, a tradeoff between hardware and statistical efficiency . Mitliagkas et al. interpret asynchrony as added momentum dynamics. Experiments in Hadjis et al. support this finding, and demonstrate that reducing algorithmic momentum can compensate for asynchrony-induced momentum and significantly reduce the number of iterations for convergence. Motivated by that result, we use the model in (36), where the total momentum, μT\mu_{T}, includes both asynchrony-induced and algorithmic momentum, μ\mu, in (1).

We will use this expression to design an estimator for the value of total momentum, μ^T\hat{\mu}_{T}. This estimator is a basic building block of closed-loop YellowFin, that removes the need to manually compensate for the effects of asynchrony.

Measuring the momentum dynamics

Closed-loop YellowFin estimates total momentum μT\mu_{T} on a running system and uses a negative feedback loop to adjust algorithmic momentum accordingly. Equation (16) gives an estimate of μ^T\hat{\mu}_{T} on a system with staleness τ\tau, based on (16).

We use τ\tau-stale model values to match the staleness of the gradient, and perform all operations in an elementwise fashion. This way we get a total momentum measurement from each variable; the median combines them into a more robust estimate.

Closing the asynchrony loop

Given a reliable measurement of μT\mu_{T}, we can use it to adjust the value of algorithmic momentum so that the total momentum matches the target momentum as decided by \tunerin Algorithm 1. Closed-loop YellowFin in Algorithm 5 uses a simple negative feedback loop to achieve the adjustment.

Appendix H Model specification

The model specification is shown in Table 3 for all the experiments in Section 5. CIRAR10 ResNet uses the regular ResNet units while CIFAR100 ResNet uses the bottleneck units. Only the convolutional layers are shown with filter size, filter number as well as the repeating count of the units. The layer counting for ResNets also includes batch normalization and Relu layers. The LSTM models are also diversified for different tasks with different vocabulary sizes, word embedding dimensions and number of layers.

Appendix I Specification for synchronous experiments

In Section 5.1, we demonstrate the synchronous experiments with extensive discussions. For the reproducibility, we provide here the specification of learning rate grids. The number of iterations as well as epochs, i.e. the number of passes over the full training sets, are also listed for completeness. For \tunerin all the experiments in Section 5, we uniformly use sliding window size 2020 for extremal curvature estimation and β=0.999\beta=0.999 for smoothing. For momentum SGD and Adam, we use the following configurations.

Momentum SGD learning rates {0.001,0.01(best),0.1,1.0}\{0.001,0.01\text{(best)},0.1,1.0\}, momentum 0.9

Adam learning rates {0.0001,0.001(best),0.01,0.1}\{0.0001,0.001\text{(best)},0.01,0.1\}

Momentum SGD learning rates {0.001,0.01(best),0.1,1.0}\{0.001,0.01\text{(best)},0.1,1.0\}, momentum 0.9

Adam learning rates {0.00001,0.0001(best),0.001,0.01}\{0.00001,0.0001\text{(best)},0.001,0.01\}

Momentum SGD learning rates {0.01,0.1,1.0(best),10.0}\{0.01,0.1,1.0\text{(best)},10.0\}, momentum 0.9

Adam learning rates {0.0001,0.001(best),0.01,0.1}\{0.0001,0.001\text{(best)},0.01,0.1\}

Momentum SGD learning rates {0.05,0.1,0.5,1.0(best),5.0}\{0.05,0.1,0.5,1.0\text{(best)},5.0\}, momentum 0.9

Adam learning rates {0.0005,0.001,0.005(best),0.01,0.05}\{0.0005,0.001,0.005\text{(best)},0.01,0.05\}

Decrease learning rate by factor 0.97 every epoch for all optimizers, following the design by Karpathy et al. .

Momentum SGD learning rates {0.05,0.1,0.5(best),1.0,5.0}\{0.05,0.1,0.5\text{(best)},1.0,5.0\}, momentum 0.9

Adam learning rates {0.0001,0.0005,0.001(best),0.005,0.01}\{0.0001,0.0005,0.001\text{(best)},0.005,0.01\}

Vanilla SGD learning rates {0.05,0.1,0.5,1.0(best),5.0}\{0.05,0.1,0.5,1.0\text{(best)},5.0\}

Adagrad learning rates {0.05,0.1,0.5(best),1.0,5.0}\{0.05,0.1,0.5(\text{best}),1.0,5.0\}

Decrease learning rate by factor 0.9 every epochs after 14 epochs for all optimizers, following the design by Choe and Charniak .

Appendix J Additional experiment results

In Figure 8, we demonstrate the training loss on CIFAR10 ResNet and CIFAR100 ResNet. Specifically, \tunercan match the performance of hand-tuned momentum SGD, and achieves 1.93x and 1.38x speedup comparing to hand-tuned Adam respectively on CIFAR10 and CIFAR100 ResNet.

J.2 Importance of momentum adaptivity

To further emphasize the importance of momentum adaptivity in \tuner, we run YF on CIFAR100 ResNet and TS LSTM. In the experiments, \tunertunes the learning rate. Instead of also using the momentum tuned by YF, we continuously feed prescribed momentum value 0.00.0 and 0.90.9 to the underlying momentum SGD optimizer which YF is tuning. In Figure 9, when comparing to \tunerwith prescribed momentum 0.0 or 0.9, \tunerwith adaptively tuned momentum achieves observably faster convergence on both TS LSTM and CIFAR100 ResNet. It empirically demonstrates the essential role of momentum adaptivity in \tuner.

J.3 Tuning momentum can improve Adam in async.-parallel setting

We conduct experiments on PTB LSTM with 16 asynchronous workers using Adam using the same protocol as in Section 5.2. Fixing the learning rate to the value achieving the lowest smoothed loss in Section 5.1, we sweep the smoothing parameter β1\beta_{1} of the first order moment estimate in grid {0.2,0.0,0.3,0.5,0.7,0.9}\{-0.2,0.0,0.3,0.5,0.7,0.9\}. β1\beta_{1} serves the same role as momentum in SGD and we call it the momentum in Adam. Figure 10 shows tuning momentum for Adam under asynchrony gives measurably better training loss. This result emphasizes the importance of momentum tuning in asynchronous settings and suggests that state-of-the-art adaptive methods can perform sub-optimally when using prescribed momentum.

J.4 Accelerating \tunerwith finer grain learning rate tuning

As an adaptive tuner, \tunerdoes not involve manual tuning. It can present faster development iterations on model architectures than grid search on optimizer hyperparameters. In deep learning practice for computer vision and natural language processing, after fixing the model architecture, extensive optimizer tuning (e.g. grid search or random search) can further improve the performance of a model. A natural question to ask is can we also slightly tune \tunerto accelerate convergence and improve the model performance. Specifically, we can manually multiply a positive number, the learning rate factor, to the auto-tuned learning rate in \tunerto further accelerate.

In this section, we empirically demonstrate the effectiveness of learning rate factor on a 29-layer ResNext (2x64d) on CIFAR10 and a Tied LSTM model with 650 dimensions for word embedding and two hidden units layers on the PTB dataset. When running \tuner, we search for the optimal learning rate factor in grid {13,0.5,1,2(best for ResNext),3(best for Tied LSTM),10}\{\frac{1}{3},0.5,1,2(\text{best for ResNext}),3(\text{best for Tied LSTM}),10\}. Similarly, we search the same learning rate factor grid for Adam, multiplying the factor to its default learning rate 0.0010.001. To further strengthen the performance of Adam as a baseline, we also run it on conventional logarithmic learning rate grid {5e5,1e4,5e4,1e3,5e3}\{5e^{-5},1e^{-4},5e^{-4},1e^{-3},5e^{-3}\} for ResNext and {1e4,5e4,1e3,5e3,1e2}\{1e^{-4},5e^{-4},1e^{-3},5e^{-3},1e^{-2}\} for Tied LSTM. We report the best metric from searching the union of learning rate factor grid and logarithmic learning rate grid as searched Adam results. Empirically, learning factor 13\frac{1}{3} and 1.01.0 works best for Adam respectively on ResNext and Tied LSTM.

As shown in Figure 11, with the searched best learning rate factor, \tunercan improve validation perplexity on Tied LSTM from 88.788.7 to 80.580.5, an improvement of more than 9%9\%. Similarly, the searched learning rate factor can improve test accuracy from 92.6392.63 to 94.7594.75 on ResNext. More importantly, we can observe, with learning rate factor search on the two models, \tunercan achieve better validation metric than the searched Adam results. It demonstrates that finer-grain learning rate tuning, i.e. the learning rate factor search, can be effectively applied on \tunerto improve the performance of deep learning models.