Toward Understanding Why Adam Converges Faster Than SGD for Transformers

Yan Pan, Yuanzhi Li

Introduction

Stochastic gradient descent (SGD) is one of the most widely used optimization algorithms for deep learning, due to its simplicity and efficiency on various large-scale neural networks. However, in some tasks, such as training transformers, which are powerful models for natural language processing and other domains, SGD often performs poorly compared to adaptive variants of stochastic gradient methods. Adaptive algorithms, such as Adagrad, Adam, and AMSGrad, adjust the learning rate for each parameter based on the magnitude and history of the gradients, which can help them exploit the local geometry of the objective function and escape from saddle points or plateaus. While adaptive algorithms have shown empirical advantages over SGD in many applications, the theoretical understanding of their superior performance in these tasks is limited. The best known non-convex convergence rate for AMSGrad only matches the best convergence rate of SGD but does not improve upon it. While pursuing a better general convergence rate for adaptive algorithms is a possible but challenging direction, a more realistic and relevant question is what makes Adam so effective and SGD so ineffective on certain architectures and tasks, such as transformers on language tasks. We aim to identify some properties of transformers that give rise to this phenomenon, and to find some quantities that can indicate the performance of different optimization algorithms in practice. Such insights could then be used to guide the selection and design of faster and more robust optimization algorithms for deep learning.

Inspired by this example, we study the local geometry of transformers in Section 3. Instead of analyzing the global convergence and trajectory of optimization algorithms, we focus on the simpler question of finding a good update direction in a fixed local geometry. We decompose the goal of locally minimizing the objective function into two components: gradient correlation, which measures the alignment of the update direction with the negative gradient, and directional sharpness, which measures the curvature of the function along the update direction. We argue that the directional sharpness of the update direction is a more useful indicator of the performance of optimization algorithms, as high sharpness usually implies low performance. Empirically, we observe through experiments that the update directions of SGD have much higher directional sharpness compared to adaptive algorithms. By studying more algorithms, we observe that in general, algorithms with high directional sharpness converge much slower than adaptive algorithms, which typically have low directional sharpness. We also visualize the corresponding landscape along the update directions, and our results show that algorithms with low directional sharpness can generally achieve a better local loss reduction if optimal step sizes are chosen.

We investigate the cause of SGD’s high directional sharpness and find that it is mainly due to the imbalanced distribution of gradient across coordinates. We observe that only a small fraction of the coordinates account for most of SGD’s directional sharpness and we infer that it is because of the positive correlation between the Hessian and gradient coordinates. To address this issue, we propose to use coordinate-wise clipping as a simple and effective technique to improve the convergence and directional sharpness of optimization algorithms. The intuition behind clipping is that when a few coordinates have large gradients and bad smoothness, clipping prevents them from dominating the update direction and inflating the directional sharpness. Theoretically, we show that clipping improves the worst-case directional sharpness and enables a better local loss reduction with a larger step size. Empirically, we show that clipping can consistently reduce the directional sharpness, which often leads to a better local function reduction and improves the convergence speed of various optimization algorithms including adaptive algorithms. We demonstrate our findings through two experiments under different settings and show that our observations are robust across different tasks, models, and iterations. Based on the experiments, we argue that the landscape of optimization algorithms in local geometry is a useful proxy for the global convergence speed. We conclude that the adaptive coordinate-wise scaling of Adam can effectively balance the trade-off between optimizing gradient correlation and directional sharpness, and that this ability is the key to Adam’s fast convergence in deep learning training.

Our main contributions can be summarized as follows:

We identify directional sharpness as a key indicator of the performance of optimization algorithms in local geometry, and show that adaptive algorithms have low directional sharpness compared to SGD, especially when training transformers.

We propose coordinate-wise clipping as a simple, effective and universal technique to improve the directional sharpness and convergence speed of various optimization algorithms, and provide theoretical and empirical support for its benefits.

Related Work

General Convergence Rates of Adaptive Algorithms. Adaptive algorithms have long been studied and applied in deep learning. Several previous works have proved convex and non-convex convergence rates for Adagrad and Adam or AMSGrad. The best known non-convex convergence rate is $O(\frac{\log T}{\sqrt{T}})$ for Adagrad and $O(\frac{1}{\sqrt{T}})$ for AMSGrad. While the latter result matches the non-convex convergence rate $O(\frac{1}{\sqrt{T}})$ of SGD, there is no theoretical proof that Adam can converge asymptotically faster than SGD for general functions. Therefore, there is still a significant gap between the theoretical understanding of Adam and its fast empirical performance.

Faster Convergence Rates Under Certain Settings. Another line of work focuses on specific settings in which Adam might work better than SGD. Adaptive algorithms can work asymptotically better when the stochastic gradients are sparse or when there is a sparse set of noise. Prior work proved that global clipping methods outperform SGD when the stochastic gradients have heavy-tailed noise, argued that Adam can also deal with heavy-tailed noise effectively, and designed a new algorithm based on coordinate-wise clipping.

Coordinate-Wise Clipping. Both global clipping and coordinate-wise clipping are commonly used in practice with SGD. While global norm clipping and normalization have been studied both theoretically and empirically, there has been very little research on coordinate-wise clipping methods. The most relevant prior work uses coordinate-wise clipping to propose the algorithms CClip and ACClip, which work well on transformers in practice. They use adaptive thresholds updated as momentum parameters and clip the coordinates to the corresponding thresholds. The authors show that ACClip can perform empirically better than Adam on various transformers.

The coordinate-wise properties of the gradient and Hessian are often used in coordinate descent methods. Recently, due to its ability to deal with heavy-tailed noise, coordinate-wise clipping has been applied in differentially private coordinate descent methods, as it adapts to the coordinate-wise imbalance of the objective. In particular, one such method chooses an adaptive clipping threshold based on the mean of the gradients, while we use the distribution of the gradients to select a threshold that clips exactly a constant fraction of the gradients.

Our work is inspired by the use of coordinate-wise clipping in the design of CClip and ACClip, but we propose different explanations of the effectiveness of coordinate-wise clipping with new empirical evidence. We highlight important differences between our work and the prior analysis of the CClip and ACClip algorithms. First, we propose a different explanation for the performance of clipping. The prior analysis claims that clipping can deal with heavy-tailed noise in transformers, while we identify directional sharpness as a quantitative metric that directly relates to loss minimization and whose properties can be verified easily. Second, while CClip and typical coordinate-wise clipping methods choose thresholds independent of the gradient, we choose an adaptive clipping threshold based on the distribution of the gradient. Most importantly, while the prior work focuses on designing a new algorithm that can outperform Adam, we aim to propose coordinate-wise clipping as a meta-algorithm that any optimization algorithm can adopt to improve its performance. Every algorithm can then improve on itself when clipping is added as a new component, similar to the role of momentum in deep learning.

Directional Sharpness of Optimization Algorithms

In this section, we introduce a new measurement, directional sharpness, that indicates the performance of optimization algorithms. We show that minimizing this term is extremely important for the fast convergence of optimization algorithms and argue that it is closely related to the slow convergence of SGD.

In convex and non-convex optimization, a typical proof strategy is to consider the quadratic Taylor expansion of the objective function

$$f(x_{t+1})=f(x_{t})+\nabla f(x_{t})^{\top}(x_{t+1}-x_{t})+\frac{1}{2}(x_{t+1}-x_{t})^{\top}\nabla^{2}f(x_{t})(x_{t+1}-x_{t})+O(\eta^{3})\qquad(1)$$

where $x_{t+1}-x_{t}$ is the update step of the optimization algorithm and $\eta$ is the step size. In order to get $f(x_{t+1})\leq f(x_{t})$ in expectation, the optimization algorithm should minimize the two terms that depend on the update step. We call the first-order term gradient correlation, which measures the alignment of the update direction with the negative gradient, and the second-order term directional sharpness, which measures the curvature of the function along the update direction. To bound the second-order term, the default method in convex and non-convex optimization is to assume that the objective function is $L$-smooth, which equivalently says $\|\nabla^{2}f(x)\|_{2}\leq L$ for every $x$, where $\|\cdot\|_{2}$ is the spectral norm. The local Hessian spectral norm is often called the sharpness of the function in deep learning. If we have $L$ as the global upper bound on the spectral norm of the Hessian, we would have

$$\frac{1}{2}(x_{t+1}-x_{t})^{\top}\nabla^{2}f(x_{t})(x_{t+1}-x_{t})\leq\frac{L}{2}\|x_{t+1}-x_{t}\|_{2}^{2}.$$

Then we have the following inequality, which is one of the most frequently used lemmas in optimization proofs

$$f(x_{t+1})\leq f(x_{t})+\nabla f(x_{t})^{\top}(x_{t+1}-x_{t})+\frac{L}{2}\|x_{t+1}-x_{t}\|_{2}^{2}.$$

If the function is $L$-smooth, the loss can decrease when the first-order term is negative and the norm of the update step is sufficiently small, since the second-order term is quadratic in the step size and the first-order term is linear. This can be guaranteed by using a small learning rate, and this leads to the convergence proofs of many optimization algorithms. However, there are disadvantages of the smoothness assumption in theoretical proofs. For example, the Hessian can adapt to the geometry of the trajectory and can vary significantly for different algorithms, so using a global upper bound in the convergence proof might not be fair for some algorithms. Furthermore, even if the local geometry and Hessian are fixed, the update direction $x_{t+1}-x_{t}$ is also extremely important for minimizing the second-order term. The current bound assumes that we are choosing the worst direction possible, but optimization algorithms typically find better directions with high probability. It is plausible that if a good direction is chosen, the second-order term can be much lower than the global upper bound, so the bound need not be tight.

Although our definition is motivated by the sharpness definition in deep learning, we highlight important differences between them. Sharpness describes the worst-case directional sharpness and is the supremum of directional sharpness over all directions. In contrast, directional sharpness considers the sharpness in the specific update direction of an iterative optimization algorithm, and can be much lower than the sharpness if the direction is “good”. The concept of sharpness is typically associated with the landscape and generalization of neural networks, such as in Sharpness-Aware Minimization and Edge of Stability. We are only interested in optimization of the objective function in the empirical risk minimization problem, that is, the loss on the training set.
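The two quantities can be illustrated on a toy problem. The following is a minimal sketch, taking directional sharpness to be $v^{\top}\nabla^{2}f(x)v$ for the unit vector $v$ along the update step (the expression used in Appendix B.3). The Hessian, the gradient, and the helper names `gradient_correlation` and `directional_sharpness` are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(H, v):
    return [dot(row, v) for row in H]

def gradient_correlation(grad, step):
    """First-order term grad^T step; negative values mean the step descends."""
    return dot(grad, step)

def directional_sharpness(H, step):
    """v^T H v for the unit vector v = step / ||step||_2."""
    norm = math.sqrt(dot(step, step))
    v = [s / norm for s in step]
    return dot(v, matvec(H, v))

# Hypothetical Hessian with one sharp coordinate; its spectral norm (sharpness) is 100.
H = [[100.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
grad = [100.0, 1.0, 1.0]  # gradient of 0.5 * x^T H x at x = (1, 1, 1)

sgd_step = [-g for g in grad]                        # negative gradient
sign_step = [-math.copysign(1.0, g) for g in grad]   # sign-based step

sgd_sharpness = directional_sharpness(H, sgd_step)
sign_sharpness = directional_sharpness(H, sign_step)
print(f"SGD direction:  sharpness = {sgd_sharpness:.2f}")
print(f"sign direction: sharpness = {sign_sharpness:.2f}")
```

In this toy case the gradient direction is almost aligned with the sharpest coordinate, so its directional sharpness is close to the worst-case value of 100, while the sign-based direction spreads the step across coordinates and has a much lower value.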

3.2 Directional Sharpness and Update Directions

We study the update step of different optimization algorithms under the same trajectory and local geometry, using pseudo-update steps to compute the momentum in order to rule out the impact of trajectory. We compute the directional sharpness of different optimization algorithms and visualize the optimization landscape in the update direction of a variety of optimization algorithms in Figures 2, 3 and 4 and Table 1. The details of the experiment are described in Section 5 and Appendix B. Empirically, we observe that there can be a significant gap between the directional sharpness in the update direction of different optimization algorithms. In particular, the directional sharpness is much lower for adaptive algorithms than for SGD.

Based on this observation, we argue that minimizing the directional sharpness is more important for fast convergence of optimization algorithms than minimizing the gradient correlation. The update step of SGD has the best correlation with the actual gradient, so the loss decreases faster when the step size is very small, since in this case the linear term dominates the quadratic term in Equation 1. However, because of the large directional sharpness, when the step size increases the quadratic term grows faster than the linear term, so the loss reaches the local minimum in the direction after a very small step size. For adaptive algorithms, the directional sharpness is much lower than for SGD, so they have the potential to use a much larger step size, and the optimal step could give a much lower loss compared to SGD.
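This argument can be made concrete with the local quadratic model. As a short worked step in notation introduced here only for illustration ($v$ is a unit update direction, $g=\nabla f(x_{t})$, and $s=v^{\top}\nabla^{2}f(x_{t})v$ is assumed positive with $g^{\top}v<0$), moving by a step size $\eta$ along $v$ gives

$$f(x_{t}+\eta v)\approx f(x_{t})+\eta\,g^{\top}v+\frac{\eta^{2}}{2}s,\qquad \eta^{\star}=\frac{-g^{\top}v}{s},\qquad f(x_{t})-f(x_{t}+\eta^{\star}v)\approx\frac{(g^{\top}v)^{2}}{2s}.$$

Under this model, a direction with somewhat weaker gradient correlation $|g^{\top}v|$ but much smaller $s$ admits both a larger optimal step size and a larger best-case loss reduction.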

In order to explain the sharpness reduction effect of adaptive algorithms, and since the strategy of adaptive algorithms is to find a coordinate-wise scaling of the gradient, we investigate the distribution of the gradient magnitude across different coordinates. We visualize a histogram of the absolute value of SGD momentum coordinates in Figure 1. We observe that the gradients are distributed unevenly across the coordinates: half of the coordinates have absolute values ranging from $10^{-12}$ to $10^{-6}$, but there is also a non-negligible portion of coordinates with absolute values as high as $10^{-4}$ to $10^{-2}$, contributing most of the gradient norm. The histogram suggests that the gradients are concentrated on a small fraction of the coordinates, and this small fraction of coordinates can contribute a large portion of the sharpness, making optimization hard. Since adaptive algorithms already use some form of scaling, their imbalanced gradient distribution is not as severe as that of SGD. As a result, they would have a better convergence rate.

In Appendix E, we run a simple experiment with ResNet on image classification, which shows that the property might be related to the transformer architecture. In particular, the directional sharpness of adaptive algorithms might be worse than that of SGD for ResNets. This is consistent with empirical observations of the performance of adaptive algorithms in vision tasks, where they are often slower than SGD.

Coordinate-wise Clipping

We propose to use coordinate-wise clipping as a solution to the aforementioned imbalanced distribution of the gradient, based on our experimental findings. We observe that the sharpness is also concentrated in the large coordinates of the gradient, and clipping those coordinates can significantly decrease directional sharpness. Although clipping can decrease gradient correlation, the dependence on the clipped entry is quadratic for the second-order term and linear for the first-order term, so it might not be beneficial to use these coordinates at full magnitude. The use of clipping in optimization algorithms is a trade-off between improving gradient correlation and reducing directional sharpness. By clipping the top coordinates in the gradient, although gradient correlation decreases, the directional sharpness can decrease even more and compensate for the loss.

For the clipping threshold, we use a small clipping fraction of 10% for SGD and normalized SGD since they do not have coordinate-wise scaling in their algorithms. Hence, we can observe a significant improvement with a small clipping fraction. For Adam and Adafactor, since they already perform coordinate-wise scaling in the original algorithm, we use a large clipping fraction of 50%. From Table 1, we can see that after clipping the top coordinates, the directional sharpness decreases significantly. Since we normalize the update step when we compute the directional sharpness, the sharpness reduction effect of coordinate-wise clipping is not due to a significant reduction of the norm of the update step, but to the improved flatness of the direction. The landscape visualization in Figure 2 gives a consistent message: clipped algorithms can find a direction that has a better local reduction of the loss in the local geometry.
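A clipping rule driven by a fraction rather than a fixed constant can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `clip_top_fraction` and the choice of threshold (the largest magnitude among the coordinates that are not clipped) are assumptions made here.

```python
def clip_top_fraction(grad, fraction):
    """Clip the largest `fraction` of coordinates (by magnitude) to a shared threshold.

    Assumption for this sketch: the threshold is the largest magnitude among
    the coordinates that are left unclipped (or the smallest magnitude overall
    when fraction == 1.0).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    d = len(grad)
    k = round(fraction * d)              # number of coordinates to clip
    if k == 0:
        return list(grad)
    magnitudes = sorted((abs(g) for g in grad), reverse=True)
    threshold = magnitudes[min(k, d - 1)]
    # Clamp every coordinate into [-threshold, threshold]; only the top k change.
    return [max(-threshold, min(threshold, g)) for g in grad]

# Hypothetical gradient with a few dominant coordinates.
grad = [10.0, -8.0, 0.5, -0.2, 0.1, 0.05, 0.02, -0.01, 0.005, 0.001]

clipped_10 = clip_top_fraction(grad, 0.1)    # only the largest coordinate is clipped
clipped_all = clip_top_fraction(grad, 1.0)   # every coordinate gets the same magnitude
print(clipped_10)
print(clipped_all)
```

With a fraction of 100%, every coordinate ends up with the same magnitude, so the clipped step is a rescaled sign vector.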

Finally, we demonstrate that clipping algorithms can converge faster than their original counterparts by directly training transformers with the clipping algorithms, with the loss curves shown in Figure 6. According to the results, clipping algorithms can speed up training significantly. For coordinate-wise scaling algorithms such as Adam, it is possible to consider larger clipping thresholds to improve the convergence of the algorithms. Our results suggest that clipping can be used as a universal technique in any non-coordinate-wise-scaling algorithm to speed up training. This finding can provide insight into designing new optimization algorithms.

4.2 Connection with Coordinate-wise Smoothness

Based on our experimental findings, we conjecture that there is a positive correlation between the absolute value of Hessian coordinates and gradient coordinates. A positive correlation is also mentioned in prior work, but the correlation proposed there is between the norm of the Hessian and the norm of the gradient. We further suggest that there is a positive correlation between the coordinates of the gradient and the Hessian, and that the success of Adam is due to its ability to scale down the bad coordinates and reduce the sharpness through coordinate-wise scaling of the gradient.

We revisit the example given in Section 1, where $f(x)=x^{\top}Ax$ with $A_{11}=100$ and $A_{ii}=1$ for all $i>1$. For SGD, the convergence depends on the worst coordinate with smoothness 100, and the gradient is also large in the first coordinate at most points since it is given by $200x_{1}$. This gives a bad sharpness on the first coordinate. But if we use clipping, the gradient cannot be too large on the first coordinate, so we can choose a larger learning rate even though the Hessian is unchanged.
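The effect can be checked numerically on this quadratic. The sketch below assumes that $A$ is diagonal, picks a hypothetical point `x` in dimension 10, and clips only the single largest gradient coordinate (10% of the coordinates) to the second-largest magnitude. It compares the directional sharpness and the best loss reduction achievable along each direction, which is exact here because $f$ is quadratic.

```python
d = 10
A_diag = [100.0] + [1.0] * (d - 1)   # assumed diagonal: A_11 = 100, A_ii = 1 otherwise
x = [1.0, 2.0, -1.0, 0.5, 1.0, -2.0, 0.5, 1.0, -1.0, 2.0]  # hypothetical point

grad = [2.0 * a * xi for a, xi in zip(A_diag, x)]   # gradient of x^T A x is 2 A x
hess_diag = [2.0 * a for a in A_diag]               # Hessian is 2 A

def sharpness(step):
    """Directional sharpness v^T H v of the normalized step."""
    return sum(h * s * s for h, s in zip(hess_diag, step)) / sum(s * s for s in step)

def best_reduction(step):
    """Largest loss decrease over all step sizes when moving along -step."""
    g_dot_s = sum(g * s for g, s in zip(grad, step))
    s_H_s = sum(h * s * s for h, s in zip(hess_diag, step))
    return g_dot_s ** 2 / (2.0 * s_H_s)

# Clip the single largest coordinate to the second-largest magnitude.
threshold = sorted(abs(g) for g in grad)[-2]
clipped = [max(-threshold, min(threshold, g)) for g in grad]

sgd_sharp, clip_sharp = sharpness(grad), sharpness(clipped)
sgd_red, clip_red = best_reduction(grad), best_reduction(clipped)
print(f"gradient: sharpness {sgd_sharp:.1f}, best reduction {sgd_red:.1f}")
print(f"clipped:  sharpness {clip_sharp:.1f}, best reduction {clip_red:.1f}")
```

At this point the clipped direction has both lower directional sharpness and a larger best-case loss reduction than the raw gradient direction, even though its correlation with the gradient is weaker.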

A closely related concept in optimization is the coordinate-wise version of the $L$-smooth assumption in convex and non-convex optimization, typically used in the analysis of coordinate descent methods. Instead of bounding the Hessian with a single constant $L$, each coordinate is bounded using its own constant among $L_{1},\dots,L_{d}$, such that $L_{i}\leq L$ and $\max_{i}L_{i}=L$. If the gradient has a balanced distribution, the convergence depends on the average of the constants. Hence, the bound could be better since most of $L_{1},\dots,L_{d}$ could be much less than $L$. However, if the gradient has an imbalanced distribution, where the gradient is concentrated in a small fraction of the coordinates, then the convergence mostly depends on the smoothness of that fraction of coordinates. Clipping then works well since it removes the imbalanced distribution of the gradients, ensuring “uniformity” of the gradient coordinates. When only an $\varepsilon$-fraction of coordinates have bad smoothness $L$, with clipping threshold $c_{t}$, the norm of the clipped gradient on the $\varepsilon$-fraction of coordinates is at most $\sqrt{\varepsilon d}c_{t}$, so the dependence on $L$ is at most $O(\sqrt{\varepsilon}L)$. Similarly, adaptive algorithms enforce the same constraint on the gradients, removing the correlation between the Hessian and gradient.

In Appendix D, we provide an additional simple experiment suggesting that only a small fraction of the coordinates have large smoothness. We approximate the Hessian of the neural network with the Gauss-Newton matrix and study the smoothness of the Hessian if we could remove a small fraction of the coordinates. The result shows that by removing $\leq 4\%$ of the coordinates, the smoothness of the neural network improves by a constant factor. This provides intuition into why coordinate-wise clipping improves the directional sharpness. Then, under the assumption that we can remove a small fraction of coordinates and achieve a better smoothness, we can formally study the local loss reduction of SGD with clipping, as described by the following informal theorem.

The formal statement and proof are given in Appendix A. The theorem sheds light on how gradient clipping can improve the loss locally. Understanding this phenomenon could be essential in proving convergence rates for Adam or clipping algorithms that are faster than those of SGD.

Experiment Setups

In this section, we describe the setting of our full experiments. We demonstrate our findings with two types of experiments, as described in Sections 3 and 4. We explore several different tasks and settings and show that our results hold in various settings. Further discussion of the results is in Appendix C.3.

Optimization algorithms. We select a variety of optimization algorithms. The algorithms all use momentum in their update steps for a fair comparison. The baseline algorithm is SGD momentum, against which we compare the sharpness of the other algorithms. For the class of adaptive algorithms, we choose Adam, Adafactor, and Lion. Adam is the most popular adaptive algorithm, and Adafactor and Lion both claim to be the state-of-the-art optimization algorithm on some specific tasks. We also include signSGD due to its similarity to the Lion optimizer and because it is probably the simplest form of adaptive algorithm. Note that signSGD is just SGD with clipping threshold 100%. To show that the improvement in directional sharpness and convergence speed is more related to coordinate-wise scaling than to weight-matrix-wise scaling, we also design an algorithm which we call normalized SGD, which normalizes the square of the Frobenius norm of each weight matrix to be proportional to the size of the matrix. By comparing normalized SGD with SGD clipping, we can see the importance of coordinate-wise scaling in adaptive algorithms and clipping.
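The difference between weight-matrix-wise and coordinate-wise scaling can be sketched on a single update matrix. The helpers `normalize_matrix_update` and `sign_update` are hypothetical names, and the proportionality constant in the normalization is assumed to be 1 here.

```python
import math

def normalize_matrix_update(update):
    """Rescale one weight matrix's update so its squared Frobenius norm equals its size.

    Assumption for this sketch: the proportionality constant is 1.
    """
    numel = sum(len(row) for row in update)
    sq_norm = sum(u * u for row in update for u in row)
    if sq_norm == 0.0:
        return [list(row) for row in update]
    scale = math.sqrt(numel / sq_norm)
    return [[u * scale for u in row] for row in update]

def sign_update(update):
    """Coordinate-wise scaling in its simplest form: keep only the sign of each entry."""
    return [[float((u > 0) - (u < 0)) for u in row] for row in update]

# Hypothetical momentum for a 2 x 2 weight matrix with two dominant entries.
momentum = [[3.0, -4.0],
            [0.0, 0.0]]

normalized = normalize_matrix_update(momentum)
signed = sign_update(momentum)
print(normalized)   # one scalar per matrix: relative imbalance between entries is kept
print(signed)       # one scale per coordinate: nonzero entries become equal in magnitude
```

Matrix-wise normalization fixes the overall scale of the update but preserves the imbalance between coordinates, which is the distinction the comparison with SGD clipping is meant to isolate.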

Tasks. We run our experiments on two tasks, machine translation and autoregressive language modeling, which are two popular tasks in language processing and can be solved efficiently with transformers. For machine translation, we train a small t5 model on the opus books English-French dataset. For autoregressive language modeling, we train a GPT-Neo model on the stack dataset for Python code generation. The code generation task is slightly different from natural language tasks such as machine translation since it deals with programming languages. We will show that most of our results still hold in this setting, suggesting that the observation is more related to properties of the transformer architecture.

Directional Sharpness and Landscape. We compute the directional sharpness of a variety of optimization algorithms, including SGD, normalized SGD, signSGD, Adam, Adafactor, and Lion, and visualize the corresponding loss landscape along the update direction, under different local geometries. We show that SGD has bad sharpness under all of the settings, regardless of the task, model, or local geometry. In addition, we demonstrate that clipping can always improve the directional sharpness of optimization algorithms, and often results in better local loss reduction in the update direction.

Global Convergence. We also implement clipping algorithms and use them to train different models, and demonstrate that clipping algorithms converge faster in practice. The result matches the quality of the direction as measured by the landscape visualization and directional sharpness: algorithms with better directional sharpness and better local loss reduction in the update direction in the SGD geometry generally converge faster. We conclude that the performance of optimization algorithms in local geometry can be a good indicator of the speed of global convergence.

Conclusion

In summary, our work provides a new insight into why Adam converges faster than SGD in practice. In contrast to assumptions on properties of the gradient, we propose to study directional sharpness as an important indicator for the performance of optimization algorithms in deep learning. We show that adaptive algorithms and clipped optimization algorithms can generally achieve significantly better directional sharpness compared to SGD. We argue that the slow convergence of SGD is related to the high directional sharpness, caused by a positive coordinate-wise gradient-Hessian correlation. We propose to use coordinate-wise clipping as a solution to the problem of high sharpness. We demonstrate the sharpness reduction effect of coordinate-wise clipping and show that it is possible to step into a lower loss in the update direction of clipping algorithms compared to the original algorithms. We further demonstrate the effectiveness of coordinate-wise clipping in a wide range of optimization algorithms without coordinate-wise scaling, including SGD, normalized SGD, and Adafactor. We suggest the use of coordinate-wise clipping as a universal technique to speed up any deep learning optimization algorithm. Our work provides useful explanations and conjectures about the superior performance of Adam, and further understanding of the results could be useful in the theoretical understanding of the empirical advantage of Adam over SGD.

Appendix A Convergence of Clipping with Coordinate-wise Smoothness

Then, we show the following version of the gradient descent lemma that establishes the expected loss decrement with respect to the norm of the gradient.

Without loss of generality, assume the first $\varepsilon d$ coordinates can be clipped. Since the Hessian is always symmetric, we can define

so $\nabla^{2}f(x_{t})=P_{1}+P_{2}+P_{3}$ . Then, we can bound the directional sharpness as

Then, we work on the gradient descent lemma.

Then, we use assumptions that $g_{t}$ is uniform, so

We know that for gradient descent, the optimal learning rate is obtained by choosing $\eta=\frac{1}{L}$ , in which case we would have

This finishes the proof for the gradient descent lemma for SGD clipping. ∎

Appendix B Experimental Details

The details of the dataset, training set size, and model we use are in Table 2. For machine translation, we use a batch size of 1024 and we randomly select a subset of 10240 examples as our training set, so we have 10 batches in each epoch. Since we are mainly interested in minimizing the training loss, we do not use any test or validation sets, nor any evaluation metrics other than the cross-entropy loss. For machine translation, we use the English to French opus books dataset and the t5 model. For autoregressive language modeling, we use the GPT-Neo model pretrained on the Code Clippy dataset. We use the “the-stack-smol” version of the stack dataset. In order to evaluate the function in an offline setting, we generate fixed masks with probability 0.15 at the beginning of training and do not generate new masks whenever we collate the data.

B.2 Optimization Algorithms and Clipping Methods

We use 6 optimization algorithms, including Adam, SGD, signSGD, normalized SGD, Adafactor, and Lion. The reasons for selecting these algorithms are described in Section 5.

We also test clipping the update step instead of the gradient for Adam and Lion. The results are also shown in the landscape visualization. However, since the update steps are already scaled based on the gradient, clipping the update step does not improve the result significantly.

B.3 Experiment for Directional Sharpness of Optimization Algorithms

Pseudo-Update Step. Since all algorithms we use have a momentum term, we need to compute the momentum term on a different trajectory using a “pseudo-update step.” Specifically, we compute the momentum term for all the optimization algorithms at time $t$ using the past values of $x_{1},\dots,x_{t-1}$, regardless of the optimization algorithm we use to perform the actual update step. The values we compute for the algorithms are only used to visualize the landscape and compare the sharpness, and are not used for training. The momentum parameters are set to the default values.
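The bookkeeping behind a pseudo-update step can be sketched as below. This is an illustrative assumption, not the code used in the experiments: the class `PseudoUpdateTracker` is hypothetical, the first moment is an exponential moving average shared by all candidate directions, and the constants `beta1=0.9`, `beta2=0.999`, `eps=1e-8` are the usual Adam defaults, assumed here.

```python
import math

class PseudoUpdateTracker:
    """Tracks momentum statistics from gradients seen along one training trajectory."""

    def __init__(self, dim, beta1=0.9, beta2=0.999, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [0.0] * dim   # first moment (momentum)
        self.v = [0.0] * dim   # second moment, used only by the Adam-style direction
        self.t = 0

    def observe(self, grad):
        """Fold in the gradient at the current iterate, whichever optimizer produced it."""
        self.t += 1
        self.m = [self.beta1 * m + (1 - self.beta1) * g for m, g in zip(self.m, grad)]
        self.v = [self.beta2 * v + (1 - self.beta2) * g * g for v, g in zip(self.v, grad)]

    def directions(self):
        """Candidate update directions at the current point; none is applied to the weights."""
        if self.t == 0:
            raise RuntimeError("observe() must be called at least once")
        m_hat = [m / (1 - self.beta1 ** self.t) for m in self.m]
        v_hat = [v / (1 - self.beta2 ** self.t) for v in self.v]
        return {
            "sgd_momentum": [-m for m in m_hat],
            "sign_sgd": [-math.copysign(1.0, m) if m != 0 else 0.0 for m in m_hat],
            "adam": [-m / (math.sqrt(v) + self.eps) for m, v in zip(m_hat, v_hat)],
        }

tracker = PseudoUpdateTracker(dim=2)
tracker.observe([0.5, -2.0])      # gradient at the iterate reached by the training optimizer
candidates = tracker.directions()
print(candidates)
```

Every candidate direction is computed at the same point from the same gradient history, so differences in sharpness or local loss reduction come from the update rule rather than from the trajectory.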

Training Optimizer. We use different training optimizers to compare our results across different local geometries and optimization trajectories. We use SGD momentum with learning rate $2\times 10^{-4}$ and Adam with learning rate $2\times 10^{-4}$ as training optimizers. The momentum parameters are set to the default values.

Test Batch. Since computation on the full-batch objective function is very computationally expensive, we sample a fixed random subset of size 1024 as the test dataset at the beginning of the training, and fix it during all epochs and batches, in order to speed up the landscape visualization process. The losses in all the plots are the losses on the test batch.

Landscape Visualization. To visualize the landscape, we update the weight with the desired update step and compute the loss. Afterwards, we reset the weight back to the original value before the update, and repeat the above step with a new step size.
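The update-evaluate-reset loop can be sketched on a toy objective. The function `loss_along_direction`, the loss, and the weights below are hypothetical stand-ins for the model, its loss on the test batch, and a candidate update step.

```python
def loss_along_direction(loss_fn, weights, step, step_sizes):
    """Evaluate the loss at weights + eta * step for each eta, restoring weights each time."""
    original = list(weights)
    losses = []
    for eta in step_sizes:
        # Apply the candidate update in place, as one would on model parameters.
        for i in range(len(weights)):
            weights[i] = original[i] + eta * step[i]
        losses.append(loss_fn(weights))
        # Reset to the original values before trying the next step size.
        weights[:] = original
    return losses

def toy_loss(w):
    return sum(wi * wi for wi in w)

weights = [1.0, -2.0]
step = [-1.0, 2.0]                 # points from the weights toward the minimizer
curve = loss_along_direction(toy_loss, weights, step, [0.0, 0.5, 1.0])
print(curve)
```

Plotting `curve` against the step sizes gives the one-dimensional landscape along the update direction, and its minimum corresponds to the best local loss reduction for that direction.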

Directional Sharpness. We utilize PyTorch’s Hessian-vector product to efficiently compute directional sharpness. Note that if we compute the directional sharpness as $v^{\top}\nabla^{2}f(x_{t})v$, then the sharpness can sometimes be negative. This is because the second-order Taylor expansion is given as

$$f(x_{t+1})=f(x_{t})+\nabla f(x_{t})^{\top}(x_{t+1}-x_{t})+\frac{1}{2}(x_{t+1}-x_{t})^{\top}\nabla^{2}f(\xi_{t})(x_{t+1}-x_{t})$$

for some $\xi_{t}$ on the line segment between $x_{t}$ and $x_{t+1}$. In general, we can approximate the directional sharpness using $x_{t}$ instead of $\xi_{t}$, but in the very rare cases where this gives a negative sharpness, we use the following formula to compute a more robust version of directional sharpness

for some small $\delta$, where we choose $\delta=0.01$. Then the sharpness becomes positive. In the experimental results in Appendix C, the reported SGD sharpness values are all positive and large. We mark the epochs where the SGD sharpness computed at $x_{t}$ is negative and Equation 4 is used to compute the directional sharpness instead.
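When a Hessian-vector product is not available, the curvature along a direction can also be estimated from loss values alone. The sketch below uses a standard central second difference as a stand-in for the autograd-based computation; it is not the robust formula referred to above, and the function name `directional_sharpness_fd` and the toy objective are assumptions made for illustration.

```python
import math

def directional_sharpness_fd(f, x, step, delta=1e-3):
    """Estimate v^T H v at x with a central second difference along v = step / ||step||."""
    norm = math.sqrt(sum(s * s for s in step))
    v = [s / norm for s in step]
    x_plus = [xi + delta * vi for xi, vi in zip(x, v)]
    x_minus = [xi - delta * vi for xi, vi in zip(x, v)]
    return (f(x_plus) - 2.0 * f(x) + f(x_minus)) / (delta * delta)

def toy_objective(x):
    # Hypothetical quadratic whose Hessian is diag(100, 1).
    return 50.0 * x[0] ** 2 + 0.5 * x[1] ** 2

point = [1.0, 1.0]
sharp_axis = directional_sharpness_fd(toy_objective, point, [1.0, 0.0])
sharp_diag = directional_sharpness_fd(toy_objective, point, [1.0, 1.0])
print(sharp_axis, sharp_diag)
```

The estimate needs three loss evaluations per direction and is exact up to rounding for quadratics; for a neural network it is only an approximation whose accuracy depends on `delta`.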

B.4 Experiment for Convergence of Clipped Optimization Algorithms

We demonstrate the convergence of clipped optimization algorithms. We manually tune the learning rate to find the best learning rate for the experiments. The learning rate configuration of our experiment is shown in Table 3.

Appendix C Directional Sharpness Results

In this section we show our experimental results for the directional sharpness of optimization algorithms. For each of the landscape visualizations, we show two plots, where one of them includes Adafactor and the other does not. The rest of the plots are the same with different scales. We repeat each experiment with 3 different random seeds.

C.2 Adam Trajectory

C.3 Discussion

As we can see, our observations are very consistent across different tasks, model architectures, iterations, and local geometries. The directional sharpness is relatively stable for the same task across iterations, and coordinate-wise clipping always improves the sharpness of the direction and finds a better direction to optimize.

Trade-off Between Directional Sharpness and Gradient Correlation. While we want the directional sharpness of our optimization algorithm to be small in order to decrease loss faster, having as small sharpness as possible does not necessarily lead to fast loss decrement. Adafactor almost always has the lowest directional sharpness across all tasks, iterations, and local geometry, but Adafactor does not always find a good direction to optimize. In many cases, the loss does not decrease significantly even for the optimal step size, and the direction can be even worse than SGD. This shows that merely minimizing the directional sharpness is not enough for an optimization algorithm to work well. As discussed in Section 4, gradient correlation is also important in the convergence of optimization algorithms. However, we can conclude that high sharpness will lead to bad performance, as demonstrated by the performance of SGD, even if SGD has good gradient correlation.

Effect of Trajectory. It is well known that different optimization algorithms can follow different trajectories and converge to different solutions in deep learning. Prior work also points out the impact of the local geometry of adaptive algorithms, which implicitly select trajectories with good smoothness. On the SGD trajectory for machine translation, the landscapes along the directions found by different optimization algorithms were similar at all epochs. On the Adam trajectory, the geometry is similar to that of SGD in the first few iterations, but changes significantly after more iterations. Landscape visualizations show that Adafactor performs well on the SGD trajectory but not on the Adam trajectory for the machine translation task. This shows that the trajectories of different optimization algorithms have local geometry with different properties. The effect of trajectory is therefore an interesting problem to study. However, we point out that in almost all cases, Adam has good performance and significantly outperforms SGD, so trajectory is not necessarily related to the explanation for Adam’s excellent performance in practice.

Appendix D Experiment on the Smoothness of Hessian

We use the standard BERT model, trained with Adam on the IMDb dataset, for a binary text classification task. We choose binary classification because its output dimension is 2, so the Gauss-Newton Hessian approximation of a batch of size $k$ has rank at most $2k$, which allows a larger batch size than in cases where the logits have high dimension. Due to space constraints, we still cannot use a large batch.

We use the PyTorch framework for all of our experiments, together with the Huggingface implementation of BERT and the pretrained BertForSequenceClassification model. We use learning rate $10^{-4}$ for both experiments and batch size $25$ .

D.2 Approximation of Hessian with Gauss-Newton Matrix

so it is positive semidefinite and one square root of it is

Then, by $\|G_{S}G_{S}^{\top}\|_{2}=\|G_{S}\|_{2}^{2}$ , it suffices to compute the spectral norm of a $p\times|S|$ matrix.
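The identity used above can be checked numerically. The following sketch uses a small hand-written matrix as a stand-in for $G_{S}$ (an assumption, not data from our experiments) and a plain power iteration (also an assumption about how the norm might be computed) to confirm that the largest eigenvalue of the $p\times p$ matrix $G_{S}G_{S}^{\top}$ equals that of the much smaller $|S|\times|S|$ matrix $G_{S}^{\top}G_{S}$, which is $\|G_{S}\|_{2}^{2}$.

```python
import math

def gram(a):
    """Return a^T a for a matrix stored as a list of rows."""
    k = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(k)] for i in range(k)]

def top_eigenvalue(m, iters=100):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(c * c for c in w))   # ||M v|| for unit v
        v = [c / lam for c in w]
    return lam

# Toy stand-in for G_S: p = 3 rows (parameters), |S| = 2 columns.
G = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
G_T = [list(col) for col in zip(*G)]

small = top_eigenvalue(gram(G))     # G^T G, a 2 x 2 matrix
big = top_eigenvalue(gram(G_T))     # gram(G^T) = G G^T, a 3 x 3 matrix
spectral_norm = math.sqrt(small)    # ||G||_2

print(f"||G G^T||_2 = {big:.6f}, ||G||_2^2 = {small:.6f}, ||G||_2 = {spectral_norm:.6f}")
```

Working with the small Gram matrix avoids ever forming a $p\times p$ matrix, which matters when $p$ is the number of model parameters.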

D.3 Results

The experimental results are shown in Table 8. We remove all coordinates whose row norm is at least 4 times the mean of the row norms. This removes $1\%$ to $2.5\%$ of the coordinates, and as a consequence the smoothness of the function is 2 to 3 times better than the smoothness of the full function; this is the case for all batches and epochs. This shows that our definition of robust smoothness is reasonable: it is indeed possible to optimize most of the coordinates under the robust smoothness setting.
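The filtering step can be sketched as follows. This is a minimal illustration on a hand-made matrix, not the code or data behind Table 8: the matrix `G`, the helper names, and the use of $\|G\|_{2}^{2}$ as the smoothness proxy are assumptions, and the toy values are chosen only so that a few rows are much heavier than the rest.

```python
import math

def row_norms(g):
    return [math.sqrt(sum(c * c for c in row)) for row in g]

def smoothness_proxy(g, iters=200):
    """||G||_2^2, computed by power iteration on the small Gram matrix G^T G."""
    k = len(g[0])
    gram = [[sum(row[i] * row[j] for row in g) for j in range(k)] for i in range(k)]
    v = [1.0 / math.sqrt(k)] * k
    lam = 0.0
    for _ in range(iters):
        w = [sum(gram[i][j] * v[j] for j in range(k)) for i in range(k)]
        lam = math.sqrt(sum(c * c for c in w))
        v = [c / lam for c in w]
    return lam

def drop_heavy_rows(g, factor=4.0):
    """Keep only rows whose norm is below `factor` times the mean row norm."""
    norms = row_norms(g)
    threshold = factor * sum(norms) / len(norms)
    return [row for row, n in zip(g, norms) if n < threshold]

# Toy matrix: 98 ordinary coordinates plus 2 heavy ones (illustrative values).
G = [[1.0, 0.5] if i % 2 == 0 else [0.5, 1.0] for i in range(98)]
G += [[6.0, 6.0], [6.0, 6.0]]

G_robust = drop_heavy_rows(G)
removed_fraction = 1 - len(G_robust) / len(G)
full = smoothness_proxy(G)
robust = smoothness_proxy(G_robust)
print(f"removed {removed_fraction:.1%} of rows, proxy {full:.2f} -> {robust:.2f}")
```

In this toy case a small fraction of heavy rows accounts for more than half of the spectral norm, which is the qualitative effect the robust smoothness definition is meant to capture.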

Appendix E Directional Sharpness of ResNet

We run an additional simple experiment with ResNet on the CIFAR-10 dataset, which shows that the properties of the directional sharpness that we discovered are related to the transformer architecture. We use the ResNet-152 architecture with a batch size of 1000. We compute the directional sharpness in the same setting as Appendix B. The results are shown in Table 9. As we can see, the directional sharpness of adaptive algorithms can be much worse than that of SGD. This shows that the property we discovered is related to the transformer architecture, and does not hold for ResNet.