Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be

Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, Mark Schmidt

Introduction

Adam and its derivatives have been so successful in training deep learning models that they have become the default optimizer for some architectures. Adam often outperforms stochastic gradient descent (SGD) by such a margin that SGD is considered incapable of training certain models, to the point of being omitted from performance comparisons [e.g., 22, 2]. Despite this success, we still do not understand why Adam works, much less why it can outperform SGD by such a wide margin. We have made progress understanding why it should not, as in work showing that Adam can fail to converge, even on convex problems, but this does not answer why Adam outperforms SGD.

The limited effectiveness of standard theory. We usually analyse optimization algorithms under assumptions like Lipschitz continuity of the function or its gradient and convexity [e.g., 29]. Many works have focused on improving the analysis of Adam and its variants under those same assumptions. But these assumptions are only models of how losses behave. They do not convey the complexity of the optimization process in complex architectures, and are limited to showing that Adam does not do much worse than gradient descent. Analyses in online learning also struggle to illuminate the gap. The assumption that the gradients come from an adversary requires decreasing step-sizes [e.g., 16], which decrease too quickly to perform well in practice. Our theoretical understanding is thus still limited in that we cannot describe the empirical behavior we observe: that Adam outperforms SGD in many settings.

As a result, there is a sentiment in the community that the success of these heuristics need not be due to robust theoretical underpinnings, but rather to social dynamics and a co-evolution of deep learning architectures and optimization heuristics [see for example 30]. These “adaptive” algorithms might actually be adapted to the type of problems where they outperform SGD. But this suggests that they are leveraging some problem structure that our current theory and theory-derived algorithms are missing. Understanding this structure may be key to developing better practical algorithms.

Inconsistency with large batch results. Prior work notes a correlation between heavy-tailedness and cases where Adam outperforms SGD, and gives a mechanism to link heavy-tailed errors to this gap. However, the type of noise is not the only difference between image and language tasks. There is limited empirical evidence that stochasticity is the root cause of this gap. In fact, there are reasons to believe noise might not be a major contributor. For example, the lack of variance in the behavior of the optimizers on language tasks in Figure 1 suggests they are less sensitive to noise than image tasks. Moreover, we would expect the gap to diminish as the noise is reduced by increasing the batch size. However, methods such as LAMB or plain Adam find success in large batch settings, suggesting a competitive advantage even with reduced noise. Studies of batch size scaling also find that Adam scales better with batch size.

Alternative explanations. These empirical results cast doubt on the idea that robustness to heavy-tailed noise is the primary factor behind the performance improvement of Adam over SGD. Hypotheses based on deterministic properties might provide better descriptions of the cause of this gap; we discuss them in more detail in Sections 4 and 5. One such interpretation is the view of Adam as a variant of sign descent, which was a motivation of RMSprop and has been studied in prior work. However, it remains unclear whether the performance of Adam can be explained by its similarities to a simpler algorithm, or if the additional changes needed to obtain Adam are necessary to obtain good performance.

The key question we wish to address is whether the gap between SGD and Adam on transformer models, as studied in prior work, is really caused by noise, or whether there already is a large difference in the deterministic setting. Our key observations are:

Noise is not the key contributor to the performance gap between SGD and Adam on transformers. Removing noise by running in full batch does not reduce the gap (Figure 2).

SGD benefits less from the reduction in noise than Adam. Increasing the batch size actually increases the gap between the methods when controlling for the step-size and number of iterations (§3, Figure 3). The performance of SGD does not seem dominated by noise, as its progress per iteration does not improve with batch size as much as it does for Adam (§3.1, Figure 4).

That the performance gap between SGD and Adam grows when noise is removed suggests that the benefit of Adam over SGD cannot primarily be due to a robustness to noise. This raises the question as to which component of Adam enables its good performance in the deterministic setting and whether a simpler algorithm can display similar behavior. We show that:

Sign descent can close most of the gap between SGD and Adam in full batch. While sign descent performs poorly with small batch sizes, it improves drastically with larger batch sizes and can bridge most of the gap between gradient descent and Adam (§4, Figures 5–7).

We also present results using normalized gradient updates, which improve on the performance of SGD and scale better to large batch sizes. However, in full batch, sign descent outperforms plain normalization and approaches the performance of Adam. Momentum improves the performance of both Adam and sign descent, and the two perform similarly when both are run with momentum or both without.

Despite sign descent being one of the key motivations behind RMSprop, and Adam by extension, the similarity to sign descent in full batch had not been shown empirically. Although this does not give an explanation for why Adam outperforms SGD, it might give an indication of how. As sign descent is a simpler algorithm, understanding its behavior in the deterministic setting might prove a useful tool to understand the success of Adam and the properties of the problem classes where it works well.

Experimental design

Our first experiments focus on the performance of SGD and Adam as the batch size increases, and the noise induced by subsampling decreases. Our goal is to check whether the gap between SGD and Adam persists as noise is removed. We touch on the experimental design decisions necessary to interpret the results here. Note that there are practical limitations when comparing optimizers with large batch sizes due to computational challenges, which we discuss in Section 5.1 and Appendix A.

Problems. We consider two image and three language tasks.

Image classification on MNIST and CIFAR-10 using a small CNN and ResNet18.

Language modeling on PTB and WikiText-2 using a small transformer and Transformer-XL.

Question answering on SQuAD by fine-tuning a pre-trained DistilBERT model.

As we focus on the heavy-tail hypothesis, our attention is on the behavior of the optimizers on language tasks, though image tasks are included for comparison.

Training performance. Our goal is to study optimization dynamics in cases where SGD fails to make sufficient progress. Our results focus on the training loss, and we use it to select hyperparameters. Results on hold-out data are given in Appendix C, but generalization at large batch sizes is known to require additional tuning and regularization.

Batch sizes and reducing noise. To check if less noisy gradient estimates reduce the gap between SGD and Adam, we measure progress per iteration as we increase the batch size. The batch sizes used, labelled S, M, L and XL, correspond to a ${\approx}4\times$ relative increase. Larger batch sizes are run for more epochs, but the increase in batch size leads to fewer iterations (detailed in Tables 1 and 2, Appendix A). For the full batch runs, we remove a small percentage of the data ( ${\leq}0.5\%$ ) to ensure the dataset is divided evenly in batches. We check that the observed trends also hold when disabling dropout, the other source of randomness on transformers, in Section B.4.

Simple training procedure. We use a constant step-size tuned by grid search. While better performance can be obtained with step-size schedules, our primary objective is to gain insight on the behavior of simple updates rather than reproduce state-of-the-art performance.

Hyperparameter tuning. Increasing the batch size can allow for larger step-sizes to be stable, and previous work has reported that the best step-size across batch size settings need not be monotonic. To account for these effects, we tune the step-size for each batch size individually by grid search. We start with a sparse grid and increase the density around the best step-size. We repeat each run for 3 random seeds. Grid-search validation results are given in Appendix C. We use each method with and without momentum, indicated by (${+}\text{m}$) and (${-}\text{m}$). When using momentum, we fix it to $0.9$ ($\beta$ for SGD, $\beta_{1}$ for Adam). Other parameters are set to defaults ($\beta_{2}=0.999,\epsilon=10^{-8}$).

Behavior of SGD and Adam in large batch

Comparing SGD and Adam in small and full batch in Figures 1 and 2, we observe the following.

Adam outperforms SGD by a large margin on language tasks, as expected from previous observations.

Despite removing stochasticity, the gap between SGD and Adam is not reduced. It even increases on some problems, especially on the language tasks. The performance of Adam improves on CIFAR-10 and PTB, while the performance of SGD worsens on PTB and WikiText-2.

The importance of momentum is also increased in full batch, as the gap between the same optimizer with and without momentum is larger in Figure 2. But Adam without momentum still outperforms SGD with momentum in large batch on the language tasks.

However, it is not clear that this comparison of the small and full batch settings is a good indicator of the performance per iteration. Although run for more epochs, the optimizers take far fewer steps in the full batch setting. To account for this confounding factor, in Figure 3 we introduce intermediate batch sizes and show the loss in each setting when stopped at a comparable number of iterations. The observations from Figures 1 and 2 carry over to Figure 3, which shows the trend across batch sizes more clearly; the gap between Adam and SGD grows with batch size, especially on language models.

The view provided by Figure 3 highlights the gap in performance per iteration between Adam and SGD, but does not provide a full picture of their behavior with increasing batch-size. To stop each method at a comparable number of iterations across batch sizes, we have to stop the smallest batch sizes after one epoch. The results thus need not be representative of standard training, as neither SGD nor Adam can achieve reasonable training error after only one epoch.

To verify that the observed behavior holds beyond the first epoch in small batches, we show the full trajectory of the loss for each batch size in Figure 4, focusing on the transformer problems. For each problem, each optimizer is shown running with five batch sizes, giving a more detailed view of how the gap between the two optimizers increases with batch size, and is greatest at full batch:

The performance of plain SGD does not significantly improve with batch size. The trajectory of the loss for SGD is similar across batch sizes, indicating that the reduction in noise from increasing the batch size does not improve its performance; we might expect similar results from running with the same batch size, but terminating earlier.

Adam improves with batch size and achieves similar or lower error in fewer iterations.

We also observe that, for both optimizers, momentum improves the convergence speed more with larger batch size. These results are consistent with previous studies on the batch size scaling of momentum and Adam.

Our experiments provide additional evidence on language tasks that SGD does not benefit from a reduction in noise in the gradient estimate once beyond a certain batch-size. This suggests that stochasticity is not the limiting factor in the performance of SGD for these problems. That the gap between Adam and SGD grows with batch size is also inconsistent with the idea that the improvement of Adam over SGD results from a more robust estimation of the gradient. Those observations do not rule out the possibility that optimization dynamics for small batch sizes are better explained by heavy-tailed rather than sub-Gaussian noise. However, results with large batch sizes suggest that the benefits of Adam over SGD are deterministic in nature.

Alternative interpretations

As noise cannot explain the gap between (S)GD and Adam in the full batch setting, this raises the question: can the improved performance of Adam in full batch be attributed to a simpler algorithm? Are the algorithmic differences between Adam and SGD, such as moving averages or bias-correction, improving the performance in the stochastic setting, or are they necessary even in the deterministic setting? Our goal is to isolate small algorithmic changes from gradient descent that would exhibit a behavior similar to Adam, while being easier to interpret and analyse. Alternative hypotheses on the performance of Adam might provide insight on its performance, such as its similarity to sign descent or gradient clipping.

A common view of Adam is that it performs a form of smoothed sign descent. Indeed, the motivation for RMSprop was to make sign descent more stable in the stochastic setting. The update of RMSprop (with initial parameters $x_{0}$, second-moment buffer $v_{0}=0$, and hyperparameters $\alpha,\beta_{2},\epsilon$), writing $g_{t}$ for the (stochastic) gradient at $x_{t}$ and applying all operations element-wise,

$$v_{t+1}=\beta_{2}v_{t}+(1-\beta_{2})\,g_{t}^{2},\qquad x_{t+1}=x_{t}-\alpha\,\frac{g_{t}}{\sqrt{v_{t+1}}+\epsilon},$$

reduces to sign descent when setting the hyperparameters $\beta_{2}$ and $\epsilon$ to 0, as

$$x_{t+1}=x_{t}-\alpha\,\frac{g_{t}}{\sqrt{g_{t}^{2}}}=x_{t}-\alpha\,\operatorname{sign}(g_{t}).$$

The same applies to Adam when taking its momentum parameter $\beta_{1}$ to 0 (ignoring bias-correction). Consistent with the sign descent interpretation, the success of Adam is often attributed to its normalization effect: changes across coordinates are uniform despite large differences in their gradients [e.g., 22]. However, why sign descent should work in the deterministic setting and whether RMSprop succeeds in “stabilizing” sign descent have not been explored.
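The reduction of RMSprop to sign descent can be checked numerically. The following is a minimal sketch, assuming the standard element-wise form of the RMSprop update; the function names and the toy gradient values are illustrative and not taken from the paper's code.

```python
import math

def rmsprop_step(x, g, v, alpha, beta2, eps):
    """One element-wise RMSprop step on lists of floats; returns (new_x, new_v)."""
    new_v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    new_x = [xi - alpha * gi / (math.sqrt(vi) + eps)
             for xi, gi, vi in zip(x, g, new_v)]
    return new_x, new_v

def sign(z):
    """Sign of a scalar: -1, 0 or +1."""
    return (z > 0) - (z < 0)

def sign_descent_step(x, g, alpha):
    """One sign descent step: every coordinate moves by exactly alpha."""
    return [xi - alpha * sign(gi) for xi, gi in zip(x, g)]

# Hypothetical gradient whose coordinates differ by orders of magnitude.
# All entries are nonzero: with eps = 0 a zero entry would divide by zero.
x = [1.0, -2.0, 0.5]
g = [3.0, -0.01, 250.0]
v = [0.0, 0.0, 0.0]

x_rms, _ = rmsprop_step(x, g, v, alpha=0.1, beta2=0.0, eps=0.0)
x_sign = sign_descent_step(x, g, alpha=0.1)
print(x_rms, x_sign)
```

With $\beta_{2}=0$ the buffer holds only the current squared gradient, so each coordinate is divided by its own magnitude and the two updates coincide up to floating-point rounding.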

An alternative approach to improve the performance of SGD on language models is to use gradient clipping. Clipping and normalized updates were discussed recently in the context of Adam and sign-based methods, as element-wise clipping is equivalent to the sign if all elements are larger than the clipping threshold.
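The equivalence between element-wise clipping and the sign can be illustrated on a small vector. This is a sketch under the assumption that clipping means clamping each coordinate to $[-c, c]$; the threshold `c` and the gradient values are hypothetical.

```python
def clip_elementwise(g, c):
    """Clamp every coordinate of g to the interval [-c, c]."""
    return [max(-c, min(c, gi)) for gi in g]

def sign(z):
    """Sign of a scalar: -1, 0 or +1."""
    return (z > 0) - (z < 0)

c = 0.5  # hypothetical clipping threshold

# Every |g_i| exceeds c: clipping returns c * sign(g).
g_large = [3.0, -0.7, 250.0]
clipped_large = clip_elementwise(g_large, c)
scaled_sign_large = [c * sign(gi) for gi in g_large]

# One |g_i| is below c: that coordinate passes through unchanged,
# so clipping and the scaled sign no longer agree.
g_mixed = [3.0, -0.1]
clipped_mixed = clip_elementwise(g_mixed, c)
scaled_sign_mixed = [c * sign(gi) for gi in g_mixed]

print(clipped_large, scaled_sign_large)
print(clipped_mixed, scaled_sign_mixed)
```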

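To gain insights on whether the success of Adam in the deterministic setting can be attributed to those simpler heuristics, we repeat the prior experiments with the following updates. We use the normalized gradient, which changes the magnitude of the step, and sign descent, which also changes its direction. We implement the updates with heavy-ball momentum by accumulating the transformed gradient: with $g_{t}$ the gradient at $x_{t}$, momentum buffer $m_{0}=0$, and $h(g)=g/\|g\|_{2}$ for normalized GD or $h(g)=\operatorname{sign}(g)$ for sign descent,

$$m_{t+1}=\beta m_{t}+h(g_{t}),\qquad x_{t+1}=x_{t}-\alpha\,m_{t+1}.$$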
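A minimal sketch of normalized GD and sign descent with heavy-ball momentum follows. It assumes the momentum buffer accumulates the transformed gradient as `m = beta * m + h(g)` before the step `x = x - alpha * m`; the exact form used in the experiments is given by the paper's Algorithm 1, which is not reproduced here, and the toy quadratic is purely illustrative.

```python
import math

def normalized(g):
    """h(g) = g / ||g||_2: rescales the step but keeps the gradient direction."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g] if norm > 0 else [0.0] * len(g)

def sign(g):
    """h(g) = sign(g): element-wise sign, which also changes the direction."""
    return [float((gi > 0) - (gi < 0)) for gi in g]

def momentum_step(x, m, g, alpha, beta, transform):
    """Accumulate the transformed gradient, then step along the buffer."""
    h = transform(g)
    m = [beta * mi + hi for mi, hi in zip(m, h)]
    x = [xi - alpha * mi for xi, mi in zip(x, m)]
    return x, m

def grad(x):
    """Full-batch gradient of a toy objective f(x) = 0.5 * (100 * x0^2 + x1^2)."""
    return [100.0 * x[0], x[1]]

x0 = [1.0, 1.0]
x_sign, m_sign = momentum_step(x0, [0.0, 0.0], grad(x0),
                               alpha=0.1, beta=0.9, transform=sign)
x_norm, m_norm = momentum_step(x0, [0.0, 0.0], grad(x0),
                               alpha=0.1, beta=0.9, transform=normalized)
print(x_sign, x_norm)
```

On this badly scaled example the normalized step moves almost entirely along the first coordinate, while sign descent moves both coordinates by the same amount, which is the difference in direction referred to above. Setting `beta=0` recovers the variants without momentum.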

As before, we do not tune momentum and present results with ($\beta=0.9$) and without ($\beta=0$) it. We reproduce the visualizations of Figures 3–4 for the new updates in Figures 5–6 and observe that:

Sign descent performs poorly at small batch sizes but gets closer to Adam in full batch. At small batch sizes, sign descent can be worse than plain SGD. However, the improvement from increasing batch sizes is greater than for other optimizers, and sign descent goes from being the worst option with small batches to being competitive with Adam in full batch.

Normalized GD scales better but plateaus. While normalization improves performance and helps scale with batch size, the scaling plateaus earlier than for sign descent.

As with other optimizers, momentum helps both sign descent and normalized GD scale to larger batch sizes. The improvement in performance with batch size observed for Adam is closer to that of sign descent than normalized GD. In full batch, sign descent with momentum closes most of the gap between SGD and Adam, and it can even outperform Adam on some problems.

To better illustrate that the performance of sign descent is poor in small batch, but that it closes most of the gap between GD and Adam in full batch, we compare all optimizers with momentum in a traditional loss per epoch format in Figure 7 (without momentum in Figure 10, Appendix B).

While a gap still exists between Adam and sign descent, the improvement in performance over gradient descent supports the hypothesis that the performance gap between Adam and gradient descent has its roots in a difference in the deterministic case. Parts of the benefit of the second-moment estimation in Adam or RMSprop can be attributed to the difference between gradient descent and sign descent. The good scaling properties of Adam to large batch sizes are also shared with sign descent. This connection might provide a new avenue for analysis, by studying a simpler algorithm, to understand the behavior of Adam and the impact of noise.

Discussion

We present experimental evidence that runs counter to the idea that Adam outperforms SGD on transformers due to a more robust estimate of the descent direction. Rather, experimental results show that Adam outperforms gradient descent in the full batch setting, where gradient descent already performs poorly. Those results support the intuitive motivation for RMSProp (and by extension, Adam) that the sign of the gradient might be a more reliable direction than the gradient. In addition, the results show that the improvement in performance with large batch sizes exhibited by Adam, also observed by previous work, is shared by sign descent. As Adam outperforms sign descent in small batch sizes, this suggests that the algorithmic differences between sign descent and Adam indeed improve its behavior in small batch sizes.

Despite the reported similarities between sign descent and Adam, that sign descent outperforms gradient descent by such a margin in full batch on transformers is surprising. The sign descent perspective does not fit the common view of Adam as an “adaptive” algorithm, and there is limited theory explaining the benefits of sign descent in deep learning. But that sign descent performs well on deep learning tasks is corroborated by recent work. Using evolutionary search to discover optimization algorithms, that work obtains a variant of sign descent with momentum that outperforms Adam variants on a range of deep learning models and exhibits better relative performance with large batch sizes. Understanding how to capture the benefits of sign descent on those models might give us a better understanding of the loss landscape in deep learning. We end by discussing potential open questions, related work, and limitations of the results presented here.

Improvement of sign descent with batch size. Larger improvements with batch size are often associated with methods that are faster than gradient descent, such as second-order methods, momentum or Nesterov acceleration. This improved performance typically comes at the cost of an increased sensitivity to noise, requiring larger batch sizes to achieve better performance. This has been shown empirically for momentum on deep learning models, and can be established formally, e.g. on quadratics. Quadratic models have also been used to justify the improved performance of Adam with large batches. This justification relies on the interpretation of Adam as an approximate second-order method, similar to the KFAC approximation to natural gradients. However, this interpretation of Adam seems incompatible with its interpretation as a form of sign descent, and the quadratic model alone seems insufficient to explain the performance of sign descent.

Limitations of typical assumptions. Typical analyses in optimization assume the objective function is smooth (has bounded Hessian, $\|\nabla^{2}f\|\leq L$) to ensure the accuracy of its Taylor approximation. This assumption implies the following bound, sometimes called the descent lemma:

$$f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|y-x\|^{2}.$$

Without further assumptions, this bound is the only guarantee of performance. As the bound is optimized at the gradient descent step $y=x-\frac{1}{L}\nabla f(x)$, one would expect any corollary to not improve on GD. It is of course possible to provably improve on GD with further assumptions, e.g. with acceleration on convex functions, but this highlights the strong relationship between the typical smoothness assumption and GD. Moreover, recent works have called into question the adequacy of this assumption, showing that it does not hold even in “simple” deep learning problems. We discuss next recent works that explore relaxations of smoothness to better characterize the loss landscape and capture the benefit of clipping and sign-based algorithms.
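The statement that the smoothness bound is optimized at the gradient descent step follows from a short calculation, sketched here for the quadratic upper bound $f(y)\leq f(x)+\langle\nabla f(x),y-x\rangle+\frac{L}{2}\|y-x\|^{2}$ implied by $L$-smoothness. Setting the gradient of the right-hand side with respect to $y$ to zero gives the step, and substituting it back gives the guaranteed decrease.

```latex
\begin{align*}
\nabla_y \Big[ f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{L}{2}\|y - x\|^2 \Big]
  &= \nabla f(x) + L\,(y - x) = 0
  \;\Longrightarrow\;
  y = x - \tfrac{1}{L}\nabla f(x), \\
f\big(x - \tfrac{1}{L}\nabla f(x)\big)
  &\leq f(x) - \tfrac{1}{L}\|\nabla f(x)\|^2 + \tfrac{1}{2L}\|\nabla f(x)\|^2
   = f(x) - \tfrac{1}{2L}\|\nabla f(x)\|^2 .
\end{align*}
```

Any other choice of $y$ yields a weaker guarantee from this bound alone, which is why results derived only from uniform smoothness cannot favor sign descent or Adam over GD.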

Justifications for normalization approaches. Prior work provides empirical evidence that, on deep learning models, the Hessian and gradient norm can share a similar scale. This observation motivates a relaxation of the uniform smoothness bound ($\|\nabla^{2}f\|\leq L$) to allow the Hessian to grow with the gradient ($\|\nabla^{2}f\|\leq L_{0}+L_{1}\|\nabla f\|$). This suggests $f$ behaves like a smooth function when the gradient is small, and a gradient descent step can be expected to make progress. But if the gradient is large, the Hessian might be large. Dividing by the upper bound on the Hessian, which then scales with the gradient norm, normalizes the step and leads to clipping. This view of clipping has been further explored in subsequent work. However, our experiments suggest that normalization alone might not be sufficient to bridge the gap between SGD and Adam, and that element-wise clipping or sign-based methods exploit additional structure about the problem.
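One way to read "dividing by the upper bound on the Hessian" is as a step of the form $g/(L_{0}+L_{1}\|g\|)$. The sketch below uses this form as an assumption for illustration; it is not the specific clipping rule analysed in the cited works, and the constants `L0` and `L1` are hypothetical.

```python
import math

def relaxed_smoothness_step(g, L0, L1):
    """Step g / (L0 + L1 * ||g||): the gradient divided by the relaxed Hessian bound."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = 1.0 / (L0 + L1 * norm)
    return [scale * gi for gi in g]

def l2_norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

L0, L1 = 2.0, 1.0  # hypothetical smoothness constants

# Small gradient: the step is close to the plain GD step g / L0.
small = relaxed_smoothness_step([1e-6, 0.0], L0, L1)

# Large gradient: the step is close to the normalized gradient g / (L1 * ||g||),
# so its length saturates just below 1 / L1, as with norm clipping.
large = relaxed_smoothness_step([3e6, 4e6], L0, L1)

print(small[0], l2_norm(large))
```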

Justifications for sign- and element-wise clipping approaches. Prior works provide conditions under which sign-based approaches can outperform SGD in the deterministic or stochastic setting, assuming element-wise smoothness or measuring smoothness in the $L_{\infty}$ norm. As with the standard smoothness assumption, it is unclear whether they provide an accurate description of the optimization dynamics in deep learning. Following the relaxed smoothness assumption, subsequent work proposes a coordinate-wise relaxation to analyse a variant of sign descent inspired by Adam, using exponential moving averages to estimate the sign. While providing a justification for sign descent-like algorithms, it is not clear whether the coordinate-wise relaxation is practical. Having $2d$ parameters, two for each of the $d$ weights in the network, checking this variant requires checking each weight, and convergence results depend on the $2d$ problem-specific constants. Identifying a verifiable assumption that captures the advantage of sign-based methods in the deterministic setting and understanding when it holds remains open.

What architecture choices lead to the increased gap between SGD and Adam? We do not observe such a large gap between SGD and Adam on CNNs on image tasks compared to transformers on language tasks. What properties of the problem or architecture lead to a much better performance of Adam, sign descent or other normalization schemes, beyond the difference in the tails of the stochastic gradients noted in prior work? The prediction over a large vocabulary in language tasks, much larger than the number of classes in image tasks, could suggest a similarity with overparametrized logistic regression, which is known to exhibit faster convergence of normalized gradient descent. The architectures also differ in their choice of normalization layers, skip connections and initializations, choices which were crucial for the training of deep networks on image tasks but might not yet be as mature for transformer models. Reducing the complexity of such models to a minimal example that still exhibits a large gap between SGD and Adam, but without the computational hurdles of large deep learning models, would be helpful as a mental model and as a benchmark to test hypotheses or optimizers.

Designing versions of sign descent that are robust to small batch sizes? The exponential moving average used by RMSProp and Adam works well in practice, but capturing this benefit in theory is challenging. Existing guarantees show a negative dependence on $\beta_{2}$. In contrast, alternative mechanisms such as majority voting, popular in the distributed setting due to the low communication cost of sign descent, have provable guarantees. How to obtain similar guarantees for exponential moving averages, which are known to work for deep learning, remains open.
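For concreteness, the majority-voting mechanism mentioned above can be sketched as follows. This assumes the common formulation in which each worker communicates only the sign of its local gradient and the update uses the sign of the summed votes; the function names and worker gradients are hypothetical and not taken from the paper.

```python
def sign(z):
    """Sign of a scalar: -1, 0 or +1."""
    return (z > 0) - (z < 0)

def majority_vote_step(x, worker_grads, alpha):
    """Sign descent step where each coordinate follows the majority sign across workers."""
    votes = [sum(sign(g[i]) for g in worker_grads) for i in range(len(x))]
    # A tied vote gives sign(0) = 0 and leaves that coordinate unchanged.
    return [xi - alpha * sign(v) for xi, v in zip(x, votes)]

# Three hypothetical workers with noisy gradient estimates.
worker_grads = [
    [1.0, -2.0],
    [0.5, 3.0],
    [-0.1, -4.0],
]
x_new = majority_vote_step([0.0, 0.0], worker_grads, alpha=0.1)
print(x_new)
```

Each worker sends one sign per coordinate, which is the low communication cost referred to above, and a single outlier worker cannot flip a coordinate on its own.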

A major limitation whenever probing the full batch setting is computational costs. As we measure progress per iteration, keeping the number of iterations constant across batch sizes would be ideal, but it is computationally infeasible given our budget. We mitigate this limitation by introducing intermediate batch sizes that interpolate between the small and full batch setting. However, due to those limitations, our results might only probe the “beginning” of the training procedure compared to state-of-the-art workloads. We also use a constant step-size, and some of the difficulties encountered by SGD on transformers might be mitigated by a warm-up or decaying step-size schedule if the initialization has a strikingly different landscape from the remainder of the optimization path.

Focus on training performance. We do not attempt to optimize hyperparameters for validation error, and instead focus on understanding the optimization performance on the training loss. The results might be of limited applicability in practical settings focused on generalization. However, the validation metrics (available alongside our grid-search results in Appendix C) suggest that optimizers which perform better on training error achieve a lower testing loss faster (before over-fitting).

Qualitative differences across architectures. The striking difference in performance between SGD and Adam on transformers need not hold for other architectures. On the two image tasks included to serve as comparison points, the distinction between SGD and Adam and the observed trends are less clear. For example, on ResNet18/CIFAR10, SGD behaves differently than on the language tasks in Figure 8 (Appendix B), and improves with batch size as much as Adam. We hypothesize this is due to the network using Batch Normalization. We might need different assumptions, analysis tools and possibly optimizers for different families of architectures that have qualitatively different behaviors.

We thank Si Yi (Cathy) Meng, Aaron Mishkin, and Victor Sanches Portella for providing comments on the manuscript and earlier versions of this work. This research was partially supported by the Canada CIFAR AI Chair Program, the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants RGPIN-2022-03669, and enabled in part by support provided by the Digital Research Alliance of Canada (alliancecan.ca).

References

Appendix A Experimental details

The code to reproduce our experiments and data generated by each run is available at https://github.com/fkunstner/noise-sgd-adam-sign

Problems in the main text are referred to by their dataset and correspond to the following problems. For all problems, we use the default train/test split and report the performance on the test split as validation error.

MNIST. Image classification task using a modified LeNet5 architecture: a network with 3 convolutional and 2 linear layers, average pooling and $\tanh$ activations, using a linear rather than Gaussian output.

CIFAR-10. Image classification using ResNet18, the PyTorch implementation of the 18-layer CNN with residual connections.

PTB. Word-level language modeling with sequences of 35 tokens using a simple Transformer model used as a tutorial example in PyTorch. The architecture consists of a 200-dimensional embedding layer, 2 transformer layers (2-head self attention, layer normalization, linear(200x200)-ReLU-linear(200x200), layer normalization) followed by a linear layer. Data processing follows an existing implementation.

WikiText-2. Word-level language modeling with sequences of length 128 with Transformer-XL, using an existing implementation for the model and data preprocessing. Hyperparameters follow the ENWIK8 base experiment, except with modifications from prior work, using 6 layers and a target length of 128.

SQuAD. Question-answering by fine-tuning a pretrained DistilBERT model using the HuggingFace implementation of the model and data processing.

A.2 Batch sizes and epoch counts

Our goal is to compare the optimization performance while controlling for the number of iterations. But keeping the iteration count constant across batch sizes is computationally infeasible. For example, running Transformer-XL on WikiText-2 with a batch size of 20 for 40 epochs (32600 iterations) takes 1–2 hours. Running 320 epochs in full batch (320 iterations) takes 8–10 hours. Running $100\times$ more iterations would take over a month, making hyperparameter tuning infeasible. As increasing the batch size leads to better performance in fewer iterations, we run larger batch sizes for fewer iterations. We control how the number of iterations changes from setting to setting to have comparable results. The batch sizes indicated by S, M, L and XL correspond to a relative increase of $4\times$ in batch size, and we tune the number of epochs for each setting so that the number of iterations from one batch size to the next stays within a factor of ${\approx}2$–$8$. The settings used in our experiments are described in Table 1.
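The arithmetic behind the WikiText-2 example can be checked directly. The sketch below uses only the numbers quoted in the paragraph above; the variable names are illustrative.

```python
# Numbers quoted in the text for Transformer-XL on WikiText-2.
small_batch_epochs = 40
small_batch_iters = 32600
full_batch_iters = 320          # 320 epochs, one iteration per epoch
full_batch_hours = (8, 10)      # reported range for the full-batch run

# Iterations per epoch at batch size 20.
iters_per_epoch = small_batch_iters // small_batch_epochs

# How many times more iterations the small-batch run performs.
iteration_ratio = small_batch_iters / full_batch_iters

# Cost of running the full-batch setting for 100x more iterations.
scaled_hours = tuple(100 * h for h in full_batch_hours)
scaled_days = tuple(h / 24 for h in scaled_hours)

print(iters_per_epoch)   # iterations in one small-batch epoch
print(iteration_ratio)   # roughly 100x
print(scaled_days)       # range in days, above one month
```

Matching the iteration count of the small-batch run in full batch would therefore take roughly 33 to 42 days for a single run, before any grid search over step-sizes.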

To highlight the improvement in performance per iteration and account for the confounding factor that larger batch sizes run for fewer iterations (Sections B.1 and B.2), we introduce intermediate batch sizes and show the loss in each setting when stopped at a comparable number of iterations. The stopping times for these experiments (given in Section A.2) were selected to make the number of iterations similar. They could not be made equal as we did not record the loss on the entire dataset at each iteration. For example, on WikiText-2, the full-batch run is stopped after 320 epochs (320 iterations), while the small batch-size run is stopped at one epoch (815 iterations). On the larger language datasets (WikiText-2, SQuAD), the degradation in performance as the batch size increases in Figure 3 is attributable to this decrease in number of iterations. However, we still observe that the gap between SGD and Adam grows with batch size, despite this bias favoring small batch sizes.

For settings where computing the gradient would not fit into memory, we use gradient accumulation to reduce peak memory consumption (computing the gradients of smaller batches and averaging them). We use accumulation for Full batch runs, WikiText-2 ($\geq$ L) and SQuAD ($\geq$ M). Due to the batch normalization layers in ResNet18, computing the full-batch gradient would require passing the entire dataset through the network at once (the normalization makes it so that the average of the gradients of the minibatches is different from the full gradient). We use the average over the minibatch gradients with a batch size of 10000. Transformer models use layer normalization and are unaffected.
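The following sketch illustrates gradient accumulation on a toy problem, assuming a loss that is an average of independent per-example terms and micro-batches of equal size. The one-parameter least-squares model and the data are hypothetical; the actual experiments use PyTorch models.

```python
import math

def grad_mean(w, xs, ys):
    """Gradient in w of the mean of 0.5 * (w * x - y)^2 over the given examples."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch):
    """Average the gradients of equal-size micro-batches, one micro-batch at a time."""
    assert len(xs) % micro_batch == 0, "dataset must divide evenly into micro-batches"
    n_chunks = len(xs) // micro_batch
    total = 0.0
    for k in range(n_chunks):
        lo, hi = k * micro_batch, (k + 1) * micro_batch
        total += grad_mean(w, xs[lo:hi], ys[lo:hi])
    return total / n_chunks

# Hypothetical data for the toy model y = w * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 3.9, 6.1, 8.2, 9.8, 12.1]
w = 0.5

full = grad_mean(w, xs, ys)
accumulated = accumulated_grad(w, xs, ys, micro_batch=2)
print(full, accumulated)
```

The two values agree because each example contributes independently to the loss. With batch normalization the per-example outputs depend on the other examples in the same forward pass, so this identity no longer holds, which is the caveat noted for ResNet18.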

The transformer models on PTB and WikiText-2 use dropout by default, meaning that the increase to Full batch reduces noise to a floor but does not remove it entirely. In Section B.4, we verify that the observation that sign descent bridges the gap between (S)GD and Adam in full batch still holds on those two datasets after removing dropout.

A.3 Algorithms pseudocode

We use the PyTorch implementation of SGD and Adam, and implement the normalized variants (sign descent, normalized GD) with heavy-ball momentum following the pseudo-code in Algorithm 1. The pseudo-code of Adam is given in Algorithm 2.

For every algorithm, we tune the main step-size by grid search. The momentum parameters, $\beta$ in the gradient descent variants and $\beta_{1}$ in Adam, are set either to $0$ or $0.9$ . This is indicated in the main text and figure legends by the suffix ( $+$ m) (with momentum) or ( $-$ m) (without). We leave the remaining hyperparameters of Adam at their defaults ( $\beta_{2}=0.999,\epsilon=10^{-8}$ ).
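To make the normalized variants concrete, the following is a minimal sketch of one heavy-ball step on a transformed gradient. It assumes that the normalization (elementwise sign, or division by the Euclidean norm) is applied to the gradient before it enters the momentum buffer; this ordering, the helper names and the `tiny` safeguard against division by zero are assumptions made for illustration and may differ from the pseudo-code the text refers to.

```python
import math


def sign(v):
    """Elementwise sign, returning -1, 0 or 1 for each coordinate."""
    return [(x > 0) - (x < 0) for x in v]


def l2_normalize(v, tiny=1e-12):
    """Rescale v to unit Euclidean norm (tiny avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, tiny) for x in v]


def momentum_step(params, grad, buf, step_size, beta, direction):
    """One heavy-ball step on direction(grad); beta = 0 disables momentum."""
    d = direction(grad)
    buf = [beta * m + di for m, di in zip(buf, d)]
    params = [p - step_size * m for p, m in zip(params, buf)]
    return params, buf


def grad_quadratic(p):
    """Gradient of the toy objective 0.5 * (100 * p[0]**2 + p[1]**2)."""
    return [100.0 * p[0], p[1]]


p, buf = [1.0, 1.0], [0.0, 0.0]
for _ in range(3):
    # Sign descent without momentum: every coordinate moves by step_size.
    p, buf = momentum_step(p, grad_quadratic(p), buf, 0.1, 0.0, sign)
print(p)
```

Passing `l2_normalize` instead of `sign` gives the normalized GD variant, and `beta=0.9` corresponds to the ( $+$ m) setting.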

A.4 Histograms in Figure 1

The histograms in Figure 1 show the distribution of the stochastic gradient errors at initialization to confirm the observation of . also show the pattern is preserved through training. We use the batch sizes

As ResNet18 uses batch normalization, it does not have a well-defined “full” gradient (the average of minibatch gradients is not the gradient obtained by passing the entire dataset through the network). We use the average over the minibatch gradients.
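The mismatch can be seen on a toy example. The sketch below assumes a strongly simplified “normalization” that only subtracts the batch mean from the inputs (no variance scaling, no learned parameters) and a one-parameter model; `grad_centered` and the data are hypothetical and serve only to illustrate why batch statistics make the gradient depend on how the data is split.

```python
def grad_centered(w, xs, ys):
    """Gradient w.r.t. w of mean((w * (x - mean(xs)) - y)**2) over the batch."""
    mu = sum(xs) / len(xs)
    centered = [x - mu for x in xs]
    residuals = [w * c - y for c, y in zip(centered, ys)]
    return sum(2.0 * r * c for r, c in zip(residuals, centered)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 3.0]
w = 0.5

# Entire dataset normalized with its own mean.
full = grad_centered(w, xs, ys)

# Two minibatches, each normalized with its own batch mean, then averaged.
averaged = 0.5 * (grad_centered(w, xs[:2], ys[:2])
                  + grad_centered(w, xs[2:], ys[2:]))

print(full, averaged)
```

The two values differ because each minibatch is centered with its own mean, so the averaged quantity is a gradient of a different function than the one defined on the whole dataset.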

A.5 Grid search

We start with a sparse grid of integer powers of 10 (e.g. $[10^{-5},10^{-4},\ldots,10^{0}]$ ). After an initial run to identify a reasonable region, we increase the density to include half-powers (e.g. $[10^{-3},10^{-2.5},10^{-2},10^{-1.5},10^{-1}]$ ). The density of the grid is the same for all problems. We run those step-sizes for 3 random seeds determining the data ordering and initialization (except for DistilBERT on SQuAD, which is pretrained).

To select the step-size, we minimize the (maximum over seeds of the) training loss at the end of training. End-of-training is the epoch reported in Table 1 for most figures. The exceptions are Figures 3 and 5 which use the stopping times in Table 2.
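The selection rule can be sketched as follows. This is a minimal illustration, assuming the end-of-training losses are collected in a dictionary mapping each step-size to one loss per seed; the function names, the treatment of diverged runs (NaN counted as infinite loss) and the numbers are hypothetical.

```python
import math


def power_grid(lo_exp, hi_exp, half_powers=False):
    """Step-sizes 10**e for e from lo_exp to hi_exp, optionally in steps of 0.5."""
    stride = 1 if half_powers else 2  # counted in half-exponents
    return [10.0 ** (k / 2) for k in range(2 * lo_exp, 2 * hi_exp + 1, stride)]


def worst_case(losses):
    """Maximum over seeds; a diverged (NaN) run counts as infinite loss."""
    return max(float("inf") if math.isnan(x) else x for x in losses)


def select_step_size(final_losses):
    """Pick the step-size minimizing the maximum final training loss over seeds."""
    return min(final_losses, key=lambda s: worst_case(final_losses[s]))


coarse = power_grid(-5, 0)                   # 10^-5, 10^-4, ..., 10^0
dense = power_grid(-3, -1, half_powers=True)  # 10^-3, 10^-2.5, ..., 10^-1

# Hypothetical end-of-training losses for 3 seeds per step-size.
final_losses = {
    1e-3: [0.90, 0.95, 0.92],
    1e-2: [0.40, 0.45, 0.41],
    1e-1: [0.30, float("nan"), 0.35],  # one seed diverged
}
best = select_step_size(final_losses)
print(best)
```

Taking the maximum over seeds favors step-sizes that work for every seed: in the hypothetical data above, the largest step-size has the lowest loss on two seeds but is rejected because one run diverged.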

The final performance as a function of the step-size, in terms of the training loss and accuracy/PPL/F1-score (including on holdout data), is given in Appendix C.

Appendix B Additional plots

B.1 Gap vs. Batch size (Figure 4 including image tasks)

B.2 Gap vs. Batch size (Version of Figure 6 including image tasks)

B.3 Relative performance of sign descent in medium vs. full batch (Version of Figure 7 without momentum)

B.4 Verifying the results hold without Dropout

The increase of batch size to “Full batch” drives noise to a floor, but does not remove it entirely due to the use of dropout in the transformer models. Figure 11 shows those models after disabling dropout to verify that the same trends hold, showing the performance of the optimizers in full batch, as in the bottom row of Figure 7 (with momentum) and Figure 10 (without momentum). The settings used are the same as for the Full batch runs in those figures (see Full in Table 1 for batch size and epoch counts) and use step-sizes tuned by grid search, with the only modification being that dropout is disabled. We note that the models achieve lower training error, but the overall trend of sign descent closing most of the gap between SGD and Adam in full batch is preserved.

Appendix C Grid-searches and performance of selected runs

C.1.2 ResNet18 on Cifar10

C.1.3 Small Transformer on PTB

C.1.4 Transformer-XL on WikiText-2

C.1.5 DistilBERT finetuning on SQuAD

C.2 Sign Descent and Normalized GD

C.2.2 ResNet18 on Cifar10

C.2.3 Small Transformer on PTB

C.2.4 Transformer-XL on WikiText-2

C.2.5 DistilBERT finetuning on SQuAD