Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer, Mitchell Stern

Introduction and Background

Gradient-based optimization forms the backbone of most modern approaches used to train deep neural networks. One of the simplest methods is stochastic gradient descent (SGD), wherein steps are taken along the direction of the negative gradient of the loss function evaluated on a minibatch. Building on this foundation, a variety of adaptive gradient-based methods have been proposed in which the gradient is divided by the componentwise square root of a vector summarizing the history of squared gradients, usually obtained through summation as in Adagrad (Duchi et al., 2011) or exponential averaging as in RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2015), and Adadelta (Zeiler, 2012). On convex problems, several of these methods offer theoretical advantages over SGD when gradients are sparse. While convergence guarantees have not yet been provided in the dense, non-convex setting in which most neural network training takes place, practitioners have nevertheless found these methods to empirically outperform SGD across a variety of domains.

The superior performance of these methods does come at a cost. Recent improvements in the computational capacity needed to train neural networks with larger numbers of parameters have far outstripped improvements in the memory capacity required to store those parameters during training. This has led to memory usage becoming an important constraint on model size. Adaptive optimization algorithms exacerbate this problem by requiring additional memory for extra accumulators, such as those required for momentum and per-coordinate gradient scaling. For example, Adam (Kingma & Ba, 2015) keeps two additional values for each parameter, tripling the memory requirements.

We propose a way to reduce memory usage while retaining the empirical benefits of adaptivity by maintaining a factored representation of the squared gradient accumulator across training steps. Specifically, by tracking moving averages of the row and column sums of the squared gradients for matrix-valued variables, we are able to reconstruct a low-rank approximation of the exponentially smoothed accumulator at each training step that is optimal with respect to the generalized Kullback-Leibler divergence. For an $n\times m$ matrix, this reduces the memory requirements from $O(nm)$ to $O(n+m)$ . We demonstrate empirically using Adam on a large-scale machine translation task known for its expensive models that our approach achieves comparable performance to that obtained using full accumulators.

Beyond this, we also investigate another issue related to Adam of recent interest. To further reduce memory requirements, we would like to run Adam without momentum, eliminating an additional auxiliary value per model parameter. But without making any other changes, eliminating momentum can cause training instability. We identify out-of-date second moment accumulators as a possible cause of this instability and propose two remedies.

Finally, while the learning rate in Adam denotes a target absolute step size, we follow the intuition that relative change in the parameters is more relevant, so we propose scaling the size of the updates relative to the scale of the parameters themselves.

A Brief Review of Adam

We reproduce the pseudocode for the Adam optimizer in Algorithm 1 for reference (Kingma & Ba, 2015). The setup of the problem is as follows. Suppose we are trying to minimize the expected value of a noisy objective function $f(x)$ . At each step, we receive a stochastic realization $f_{t}$ , e.g. the loss computed on a random minibatch of data, and we compute the gradient $g_{t}$ of this function with respect to our previous parameters. We then update the exponential running averages of the first and second moments of the gradient $m_{t}$ and $v_{t}$ , compute bias-corrected versions $\hat{m}_{t}$ and $\hat{v}_{t}$ to account for the zero initialization, and finally make a parameter update to obtain a new iterate $x_{t}$ . This repeats for $T$ steps, at which point we return the final iterate $x_{T}$ as our approximate solution.

The step size $\alpha_{t}$ is often held constant over the course of training, but recent work in large-scale optimization suggests that performance can be improved on some problems through a linear ramp-up followed by some form of decay (Goyal et al., 2017; Vaswani et al., 2017). We use the latter with an inverse square root decay scheme in our experiments, finding it to yield more stable results.

Factored Second Moment Estimation

Recent work has shown that for problems where vast quantities of data are available, e.g. language modeling and machine translation, task performance improves consistently as model size increases, even in the regime of models with several billions of parameters (Shazeer et al., 2017). As models continue to grow, the storage requirements of one or two auxiliary parameters per model parameter imposed by existing adaptive methods can be prohibitive, motivating the investigation of a low-memory alternative. In this section, we propose a novel approach in which model structure is exploited in order to reduce storage requirements without compromising empirical performance.

One common choice for low-rank approximation is to truncate the singular value decomposition at the top $k$ singular values. This is known to give the optimal projection onto the space of rank- $k$ matrices with respect to the Frobenius norm (Eckart & Young, 1936). While heavily tuned procedures exist for finding the top $k$ singular values and vectors of a matrix, these quantities in general do not decompose over matrix addition, implying an incompatibility with exponential smoothing. Moreover, there is no guarantee that the entries of the approximation will be nonnegative, which is problematic given that we would like to scale the gradient by the componentwise inverse square root.

In search of a more suitable alternative, we turn to techniques from nonnegative matrix factorization. In addition to the Frobenius norm, another popular cost function in the literature is the generalized Kullback-Leibler divergence, also known as the I-divergence (Lee & Seung, 1999). For nonnegative scalar inputs, the I-divergence is given by the equation

with the conventions that $0/0=0$ , $0\log 0=0$ , and $p/0=\infty$ for $p>0$ . It is easily seen that $d(p,q)\geq 0$ with equality iff $p=q$ by setting $x=p/q$ in the standard inequality $x\log x\geq x-1$ . Under this cost function, we aim to minimize the total elementwise divergence subject to componentwise nonnegativity constraints:

Solving this problem for general rank- $k$ factors is nontrivial, requiring for instance the use of an alternating minimization procedure (Finesso & Spreij, 2006). In the special case of rank-1 factors, however, we can derive an analytic solution.

Let $R$ and $S$ be any feasible solution. Noting that $[RS]_{ij}=R_{i}S_{j}$ and expanding the loss, we have

Setting the derivatives of this expression with respect to $R_{i}$ and $S_{j}$ equal to 0, we obtain the relations

Now note that for any minimizer $(R,S)$ , the solution $(\alpha R,S/\alpha)$ is also a minimizer for any $\alpha>0$ , since the loss only depends on the product $RS$ . Hence we may break the symmetry by fixing the sum of the components of $R$ at $\sum_{i=1}^{n}R_{i}=\sum_{i=1}^{n}\sum_{j=1}^{m}V_{ij}$ , in which case we obtain a canonical minimizer

By our discussion of symmetry above, it follows that the solution set consists more broadly of all pairs $(R,S)$ satisfying $RS=V1_{m}1_{n}^{\top}V/1_{n}^{\top}V1_{m}$ , and the claim follows. ∎

We now note some important properties of this rank-1 projection. First, if $V$ itself is a rank-1 matrix, then it will be exactly recovered as one would expect. Second, the projection can be expressed entirely in terms of the row sums $V1_{m}$ and column sums $1_{n}^{\top}V$ , which in particular are linear functions of $V$ . This convenient fact gives us the desired compatibility with exponential smoothing, since the row sums of the moving average equal the moving average of the row sums, and similarly for columns. Moreover, storing only the moving averages of these factors rather than the full matrix $V$ yields considerable memory savings, requiring space proportional to $n+m$ rather than $nm$ .

We present a concrete implementation of Adam with factored second moment accumulators in Algorithm 2 for the case where the parameter set $x$ can be viewed as a single matrix $X$ . In the event that the parameter set is most suitably partitioned into multiple matrices (treating vectors and scalars as special cases), the steps can be performed in parallel for each matrix individually. We present the algorithm with $\beta_{1}$ fixed at 0 so as to focus our attention on the second moments. First moments can be included as in Adam without modification if desired.

In the implementation, we keep running averages of the row sums $R_{t}$ and column sums $C_{t}$ of the squared gradients. The full accumulator is then approximated as the outer product divided by the sum of all entries, $R_{t}C_{t}/1_{n}^{\top}R_{t}$ , and is subsequently scaled by the same bias correction factor as in Adam. We note that the normalization term in the denominator $1_{n}^{\top}R_{t}$ could equivalently be expressed as $C_{t}1_{m}$ , so the treatment of row sums and column sums is not asymmetric despite the surface form of the approximation.

A preliminary version of this method was briefly mentioned in Appendix D of Shazeer et al. (2017). Also, Gupta et al. (2014) employ a similar technique, saving memory by averaging Adagrad accumulators across embedding vectors.

2 Experiments

We ran the Transformer model from Vaswani et al. (2017), using Adam with and without our factored second moment estimation for optimization. See Section 9 for more details on the experimental setup. Results were similar in all tested cases. See Table 2 (A) vs. (C) and (H) vs. (J).

We also tried simplified estimation schemes where the second-moment estimators for matrices were approximated by either the row means or the column means (but not their outer product). For this model, the results for the row-mean scheme were similar to baseline, but the results for the column mean scheme were much worse. See Table 2 (D) and (E). We suspect that these results are due to the model’s use of a shared weight matrix used both to represent the token embeddings and to produce the output probabilities. Each row in this matrix corresponds to one token in the vocabulary. Rows associated with very frequent tokens tend to receive gradients of much larger magnitude than rows associated with very infrequent tokens.

No Momentum

Adam requires two persistent accumulators per parameter for the first and second moments of the gradients. In Section 3, we reduced the memory requirements of the second-moment accumulator. To remove the need for a first-moment accumulator, we simply turn momentum off by setting $\beta_{1}=0$ .

For a step size schedule similar to the one used in Vaswani et al. (2017), which includes a warmup period, model quality is similar without and with momentum (BLEU = 23.6 vs. 23.4) – see Table 2 (A) vs. (B), second to last column.

Without the warmup period, the model without momentum becomes more unstable (BLEU = 0.1 vs. 23.1) – see Table 2 (A) vs. (B), last column. We hypothesize that removing the momentum unmasks an underlying problem with the stability of Adam, which we will discuss in the next section.

A Problem with Adam: Out-of-Date Second Moment Estimator

Reddi et al. (2018) discuss non-convergence issues when using a fast decay of the second-moment estimator (low $\beta_{2}$ ). We observe the same issues in our experiments – see Table 1, first result column. On the other hand, slow decay (high $\beta_{2}$ ) causes training instability when we turn off the step size warmup – see Table 1, second result column.

The fact that slow decay causes both larger-than-desired updates and training instability supports our hypothesis that the large updates are the cause of the instability, but does not prove it. One competing hypothesis is that the instability causes the larger-than-desired updates. We refute this particular competing hypothesis by noting that the $RMS(U_{t})$ values plotted in Figure 1 are for training runs with step size warmup, neither of which exhibited instability. In the next section, we further support our hypothesis by showing that we can cure the instability by clipping the larger-than-desired updates.

Update Clipping

The actual parameter update is then the product $\alpha_{t}\hat{U}_{t}$ of the step size and the clipped unscaled update, as in Algorithm 4.

Gradient clipping is a popular heuristic used for training neural networks in which the gradient is scaled down before an update if needed to ensure that its norm never exceeds some fixed threshold (Pascanu et al., 2013). For stochastic gradient descent, the update direction is exactly the gradient, so this also has the effect of putting an upper bound on the distance traveled in each step. While gradient clipping is also applied to adaptive methods in practice, the norm of the update direction may still exceed the user-imposed threshold due to the presence of additional per-parameter scaling factors. In update clipping, we cap the norm of the actual update rather than just the gradient.

2 Experiments

We added update clipping to the previously described fast-decay experiments. For the experiment without learning rate warmup, update clipping with $d=1$ significantly ameliorated the instability problem – see Table 2 (A) vs. (H). With $d=2$ , the instability was not improved. Update clipping did not significantly affect the experiments with warmup (with no instability problems).

Increasing Decay Parameter

An alternative solution to the problems described in Section 5 is to use an increasing schedule of $\beta_{2}$ , as proposed by Reddi et al. (2018). Perhaps this can give us the best of both worlds – see Table 1, where different decay rates are better in different situations.

We point out here that Adam already uses an increasing decay parameter if we rewrite the bias correction as a correction to $\beta_{2}$ . To do this, we define $\hat{\beta_{2}}_{t}=\beta_{2}\frac{1-\beta_{2}^{t-1}}{1-\beta_{2}^{t}}$ , and we compute $\hat{v}_{t}$ directly in terms of $\hat{v}_{t-1}$ as follows:

This, along with similar logic for $\beta_{1}$ , leads to the alternative formulation of Adam in Algorithm 3.

In our reformulation of Adam, the corrected decay parameter $\hat{\beta_{2}}_{t}=\beta_{2}\frac{1-\beta_{2}^{t-1}}{1-\beta_{2}^{t}}$ starts at when $t=1$ and asymptotically approaches $\beta_{2}$ for large values of $t$ .

2 Proposed Alternative

Alternatively, we propose the family of schedules

parameterized by a scalar $c>0$ controlling the rate of increase.

By inspection, it is clear that this schedule starts at 0 for $t=1$ and increases toward 1 as $t$ tends to $\infty$ . This allows us to benefit from the stability properties of a low $\hat{\beta_{2}}_{t}$ at the start of training while still realizing the gains in performance due to a high $\hat{\beta_{2}}_{t}$ as the run progresses.

Less obviously, this schedule also eliminates the need for bias correction. To see why, we begin by expanding the recursive definition of $v_{t}$ to arrive at

Taking expectations of both sides, we have

which means that the contributions of past gradients will go to 0 as training progresses rather than retaining nontrivial weight for all time.

We verify the first property with a simple induction argument. At time $t=1$ , we have $1-\hat{\beta_{2}}_{1}=1$ as desired. Then if the equality holds at time $t-1$ , we have

which completes the argument. We remark that this proof in fact goes through for any schedule for which $\hat{\beta_{2}}_{1}=0$ .

The second condition is more restrictive in comparison. For the proposed schedule, we would like it to be true that

for all $i\geq 1$ . Using the standard result that for a sequence $0\leq a_{n}<1$ , the infinite product $\prod_{n}(1-a_{n})$ converges to a nonzero value iff the series $\sum_{n}a_{n}$ converges, we see that the limit above will be 0 iff the series $\sum_{j=2}^{\infty}1/j^{c}$ diverges, which is only true for $c\leq 1$ . Hence the decay parameter must not increase too fast, as otherwise past gradients will maintain a weight bounded away from 0 for the full duration of training. In the special case where $c=1$ , we note that $v_{t}=\sum_{i=1}^{t}g_{i}^{2}/t$ reduces to a simple arithmetic moving average of the history of squared gradients.

3 Experiments

We added this alternative to our experimental baseline – see Table 2 lines (A) vs. (K), (L), (M). The schedule $\hat{\beta_{2}}_{t}=1-t^{-0.5}$ did in fact maintain both stability and convergence. When combined with update clipping, this method produced similar results to constant high $\beta_{2}$ with update clipping – see Table 2 lines (H) vs. (N).

Relative Step Size

Instead of defining the optimization algorithm in terms of absolute step sizes $\{\alpha_{t}\}_{t=1}^{T}$ , we propose defining the optimization algorithm in terms of relative step sizes $\{\rho_{t}\}_{t=1}^{T}$ , which get multiplied by the scale of the parameters. We define the scale of a parameter vector or matrix as the root-mean-square of its components, lower-bounded by a small constant $\epsilon_{2}$ . The reason for this lower bound is to allow zero-initialized parameters to escape 0. Combining this with the other proposals in this paper gives the Adafactor algorithm defined in Algorithms 4 and 5. Proposed hyperparameters for Adafactor are listed in Algorithm 6.

Experimental Setup

We evaluated the optimization algorithms described in this paper on the Transformer machine translation model described in Vaswani et al. (2017) on the same WMT 2014 English-to-German translation task described in that paper, using the latest version of the architecture from the Tensor2Tensor open-source repository.

Models were trained for 100,000 steps. Each training batch contained sentence pairs containing approximately 4,096 tokens in the input and 4,096 tokens in the target sentences. These batches are about 8 times smaller than the ones used by Vaswani et al. (2017). This causes our results to be slightly worse, but significantly speeds up training times (less than two hours each on one Google TPU v2).

In one set of experiments, we followed a similar step size schedule as Vaswani et al. (2017) consisting of a linear warmup followed by inverse-square root decay, given by $\alpha_{t}=0.1\cdot\min(10^{-6}\cdot t,\frac{1}{\sqrt{t}})$ . In order to test training stability, we ran a second set of experiments where the initial warmup was replaced by a flat learning rate: $\alpha_{t}=0.1\cdot\min(10^{-2},\frac{1}{\sqrt{t}})$ . For the experiments with relative step sizes, we used schedules $\rho_{t}=\min(10^{-6}\cdot t,\frac{1}{\sqrt{t}})$ and $\rho_{t}=\min(10^{-2},\frac{1}{\sqrt{t}})$ .

In addition, we tested plain SGD with learning rate schemes equal to the step size schemes above, multiplied by various constants, since SGD also requires little (zero) additional memory cost.

Results are listed in Table 2. The listed BLEU scores are on the development set, newstest2013, using beam search with beam size 4 and length penalty $\alpha=0.6$ . Higher scores are better. Note that the scores listed should not be compared to those in Vaswani et al. (2017), due to both our shorter training regime and various improvements in the open-source version of the model over the published version.

The schemes with warmup mostly achieved very similar results. Fast decay of the second-moment estimator (G) was significantly worse.

Without warmup, the baseline (A) becomes unstable. The instability is relieved by any of momentum (B), fast decay (G), variable decay (K), and gradient clipping (H). It is not clear whether relative step size has an affect on stability, since the step sizes used in the experiments are not directly comparable.

Rows (J) and (N) demonstrate algorithms with sub-linear additional memory requirements which attain comparable convergence and stability results to Adam with momentum.

Results for SGD (Q) were poorer and less stable than Adam, and highly dependent on choice of learning rate.

Conclusion

On a popular machine translation task, we have demonstrated similar quality results to Adam, using a sublinear amount of extra space for accumulators. This should enable training of significantly larger models on the same memory-constrained hardware. We have also introduced update clipping, a potentially more-generally-useful technique for stabilizing adaptive gradient methods.

Code for running Adafactor is available in the open-source Tensor2Tensor library.

Acknowledgements

Thanks to Łukasz Kaiser, the Tensor2Tensor team and the open-source community for helping test and debug Adafactor. Also thanks to Geoffrey Hinton, who asserted that training works well if the magnitudes of parameter updates are about $10^{-2}$ to $10^{-3}$ times the magnitude of the parameters.

Introduction and Background

A Brief Review of Adam

Factored Second Moment Estimation

2 Experiments

No Momentum

A Problem with Adam: Out-of-Date Second Moment Estimator

Update Clipping

2 Experiments

Increasing Decay Parameter

2 Proposed Alternative

3 Experiments

Relative Step Size

Experimental Setup

Conclusion

Acknowledgements

References