SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade

Introduction

With ever-increasing costs of LLM training, optimization efficiency has become a central question in the field of deep learning. Several recent works have tackled this challenge by addressing both the memory (Zhao et al., 2024a; Wang et al., 2024) and compute (Anil et al., 2020) footprint of optimizers. In Algoperf (Dahl et al., 2023), a recent optimization efficiency benchmark, Shampoo (Gupta et al., 2018a), a second-order algorithm, outperformed all other submissions, including Adam (Kingma & Ba, 2015), reducing wall-clock time by 28%. Higher-order preconditioning has also been applied in large-scale training runs, such as Gemini-1.5 Flash (Gemini Team, 2024).

The success of Shampoo has drawn increasing attention from the deep learning community. Several works have explored ways to scale Shampoo by improving its memory and compute efficiency (Wang et al., 2024; Anil et al., 2020; Shi et al., 2023). Other research (Morwani et al., 2024) has examined the theoretical foundations of Shampoo and proposed minor adjustments (such as using power $1/2$ rather than $1/4$ ) that align with prior empirical findings (Anil et al., 2020). Moreover, Morwani et al. (2024) also showed that Shampoo with the aforementioned modifications is close to the optimal Kronecker approximation of the Adagrad (Duchi et al., 2011b) optimizer.

Our first contribution is demonstrating that the variant of Shampoo proposed by Morwani et al. (2024) is equivalent to running Adafactor (Shazeer & Stern, 2018; Zhai et al., 2022) in the eigenbasis provided by Shampoo’s preconditioner. (Given this connection, the results of Morwani et al. (2024) can be interpreted as showing that the eigenbasis provided by Shampoo’s preconditioner is close to the “optimal” basis for running Adafactor.) This interpretation of Shampoo connects it to a broader family of methods (e.g., George et al., 2018) that design second-order algorithms by running a first-order method in the eigenbasis provided by a second-order method. Building on this insight, we can explore a broader design space for combining first- and second-order methods. Many of our design choices are a synthesis of conceptual ideas from the prior works of George et al. (2018); Anil et al. (2020); Morwani et al. (2024), as well as implementation ideas from the works of Wang et al. (2024); Zhao et al. (2024a).

Explicitly, we study SOAP (ShampoO with Adam in the Preconditioner’s eigenbasis), an algorithm that runs AdamW in the eigenbasis provided by Shampoo. Our main contributions are as follows:

We make a formal connection between the Shampoo and Adafactor algorithms. This insight leads us to consider the SOAP algorithm, which runs AdamW in the preconditioned space provided by Shampoo.

SOAP outperforms both Shampoo and Adam in LLM pre-training tasks with model sizes 360m and 660m, even after extensive hyperparameter tuning of Shampoo.

SOAP reduces the number of hyperparameters compared to Shampoo, resulting in only one additional hyperparameter compared to AdamW: preconditioning frequency.

SOAP demonstrates greater robustness to large preconditioning frequency compared to Shampoo.

We should also note that while similar algorithmic variants have been discussed in the literature (e.g., see the appendix of Anil et al. (2020)), we are the first to systematically evaluate such an algorithm.

Organization: In Section 3, we discuss related works. In Section 4, we start by showing an equivalence between Shampoo (with exponent 1/2) and running Adafactor in the eigenspace given by Shampoo; with this equivalence as the starting point, we then describe SOAP. In Section 5, we provide our experimental methodology, and in Section 6, we compare the performance of AdamW, Shampoo and SOAP on language modeling tasks. In Sections 7.2 and 7.3, we discuss the space and time complexity of SOAP and how it can be improved.

Notation and Background

Adam (Kingma & Ba, 2015), a widely used first-order optimization algorithm in deep learning, is a diagonal approximation of Adagrad. It maintains an exponential moving average of the gradients $G_{t}$ (denoted as $M_{t}$) and of the element-wise squared gradients $G_{t}^{2}$ (denoted as $V_{t}$) for a given weight matrix $W$. Its update rule with learning rate $\eta$ is given by

$$W_{t}\leftarrow W_{t-1}-\eta\frac{M_{t}}{\sqrt{V_{t}}},$$

where the division is performed element-wise.
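As a minimal illustration of this update rule, the sketch below performs one Adam step in NumPy. The function name `adam_step`, the default hyperparameters, and the small `eps` added to the denominator for numerical safety are assumptions made for this example; bias correction and weight decay are omitted.

```python
import numpy as np

def adam_step(W, G, M, V, lr=1e-3, beta1=0.95, beta2=0.95, eps=1e-8):
    """One Adam step for a weight matrix W with gradient G (illustrative sketch)."""
    # Exponential moving average of the gradients.
    M = beta1 * M + (1 - beta1) * G
    # Exponential moving average of the element-wise squared gradients.
    V = beta2 * V + (1 - beta2) * G**2
    # Element-wise division; eps is an assumption, not part of the formula above.
    W = W - lr * M / (np.sqrt(V) + eps)
    return W, M, V

W = np.zeros((2, 3))
G = np.ones((2, 3))
W_new, M_new, V_new = adam_step(W, G, np.zeros_like(W), np.zeros_like(W))
```

Every entry of `W` receives its own step size through `V`, which is the sense in which Adam acts as a diagonal preconditioner.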

Adafactor (Shazeer & Stern, 2018; Zhai et al., 2022), a variant of Adam, replaces $V_{t}$ with its best rank-1 approximation $V_{t}^{\prime}$ to reduce memory usage. While the original Adafactor paper (Shazeer & Stern, 2018) proposed additional modifications, such as changes to the learning rate schedule, we focus on the version of Adafactor proposed in recent works (Zhai et al., 2022; Zhao et al., 2024c), whose update with learning rate $\eta$ is given by

$$W_{t}\leftarrow W_{t-1}-\eta\frac{M_{t}}{\sqrt{V_{t}^{\prime}}}.$$

In practice, Shampoo is implemented with several other modifications such as layerwise learning rate grafting and exponents other than $-1/4$ . We will use the DistributedShampoo (Shi et al., 2023) implementation which has these variations available as hyperparameters.

Related Work

We begin by discussing works that are closely related, including George et al. (2018); Anil et al. (2020) and Zhao et al. (2024a). Subsequently, we cover extended related works.

KFAC (Martens & Grosse, 2015) is a well-known second-order optimization algorithm designed for neural networks. E-KFAC (George et al., 2018) builds upon KFAC in a manner analogous to our extension of Shampoo, introducing a diagonal preconditioner that is updated between KFAC inversion steps. However, E-KFAC’s algorithm is not identical to running Adam in KFAC’s eigenbasis, as the diagonal preconditioner is not Adam.

Anil et al. (2020) introduced several algorithmic and numerical improvements to develop a practical and scalable version of Shampoo (Gupta et al., 2018b). Notably, they empirically found that using an exponent of $1/2$ outperforms the original exponent of $1/4$ in Shampoo. Of particular interest to our work is Appendix B of Anil et al. (2020), where, inspired by E-KFAC, they describe an algorithm that is essentially equivalent to SOAP for 2D layers. However, no experiments were provided, and the authors claimed that unpublished experiments showed no empirical improvement over Shampoo. The discrepancy between their findings and ours may be due to some of the implementation details of SOAP.

GaLore (Zhao et al., 2024a) was recently proposed as a method to reduce Adam’s memory footprint by maintaining momentum in a low-rank subspace derived from the singular value decomposition (SVD) of the gradients. Their algorithm’s full-rank version bears similarity to ours, with some notable distinctions. Firstly, their projection subspace is determined by the SVD of the current gradient, while we maintain an exponential moving average of $GG^{T}$ and $G^{T}G$ . Secondly, we retain momentum in the original space and project it onto the preconditioned space, whereas they maintain it in the preconditioned space and do not rotate it each time the preconditioned space is updated. In Appendix B, we study GaLore’s performance and find that our modifications are necessary for improving upon Shampoo. Moreover, their method only projects one side of a layer using the eigenbasis while using the identity basis on the other side. We examine the impact of one-sided projection for SOAP in Section 7.1.

Diagonal Preconditioning based Optimizers: Other than AdamW, there are other optimizers which involve diagonal preconditioning, such as Lion (Chen et al., 2023), Sophia (Liu et al., 2024), and Adafactor (Shazeer & Stern, 2018). Recent works of Kaddour et al. (2023); Zhao et al. (2024c) showed that these optimizers perform comparably to AdamW for LLM pretraining but do not surpass it. This suggests the need to explore non-diagonal preconditioners. We discuss prior works on non-diagonal preconditioners below.

Second-Order Optimization: Research on second-order optimization in deep learning is generally divided into two categories: Hessian-free methods and methods that estimate the Hessian.

Hessian-Free Methods: Hessian-free approaches (Martens, 2010; Martens & Grosse, 2015) optimize without explicitly computing the Hessian matrix, instead employing iterative techniques to approximate the Newton step. Other recent works (Li, 2018; 2024; Pooladzandi & Li, 2024) have focused on designing iterative preconditioners to improve the convergence specifically for stochastic optimization algorithms.

Hessian Estimation Methods: These methods maintain an efficient approximation of the Hessian for neural networks. KFAC (Martens & Grosse, 2015) and Shampoo (Gupta et al., 2018b) are two widely recognized methods in this area.

KFAC (Martens & Grosse, 2015) was one of the first approaches to go beyond diagonal preconditioners in neural networks, demonstrating that a layer-wise Kronecker-factored preconditioner approximates the layer-wise Hessian in multi-layer perceptrons (MLPs). Subsequent works (Martens et al., 2018; Osawa et al., 2019) extended KFAC to other architectures. Recent research (George et al., 2018; Gao et al., 2021) has further improved trace and diagonal estimates for KFAC. Efforts to scale up KFAC (Ba et al., 2017; Puiu, 2022; 2023; Eschenhagen et al., 2023) have focused on making the inversion step more efficient or enhancing distributed implementations.

Shampoo (Gupta et al., 2018b), another second-order optimization algorithm, is motivated by the online learning algorithm Adagrad (Duchi et al., 2011a). Shampoo also employs a layer-wise Kronecker-factored preconditioner. A recent distributed implementation of Shampoo (Shi et al., 2023) won an optimization efficiency benchmark (Dahl et al., 2023), highlighting the practical utility of second-order methods in deep learning. A few recent works (Duvvuri et al., 2024; Morwani et al., 2024) have provided theoretical advancements on top of Shampoo. Other works (Anil et al., 2020; Peirson et al., 2022; Lin et al., 2024; Wang et al., 2024) have proposed various strategies to improve Shampoo’s scalability. We defer a comparison of SOAP with these methods to future work.

Algorithm

We begin by describing an equivalence between Shampoo and running Adafactor in the eigenbasis of the Shampoo preconditioner. For simplicity we omit momentum but the equivalence also holds with momentum. For this equivalence we use Shampoo with the following modifications from the original Shampoo optimizer (Gupta et al., 2018b):

We use power $1/2$ instead of power $1/4$. This was already recommended in practical implementations (Anil et al., 2020; Shi et al., 2023), and a theoretical connection between the optimal Kronecker approximation of the Adagrad (Duchi et al., 2011b) preconditioner and Shampoo with power $1/2$ was established in Morwani et al. (2024).

We also use the scalar correction to per layer learning rates described in Ren & Goldfarb (2021); Morwani et al. (2024).

Instead of the running average of $L$ and $R$ across time steps, we use dataset averages.

With these changes in place (first occurrence of these changes is highlighted in red in the algorithm below) we formally define the two algorithms whose equivalence we show in Algorithms 1 and 2.
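The claimed equivalence can be checked numerically. The NumPy sketch below is an illustration under stated assumptions rather than the paper's algorithm listings: it assumes the scalar correction takes the form of dividing the Kronecker preconditioner $L\otimes R$ by $\mathrm{Trace}(L)$, it uses the standard Adafactor factorization (outer product of row sums and column sums divided by the total sum), it omits momentum and $\epsilon$, and it replaces the dataset with a hypothetical set of random per-batch gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
num_batches, m, n = 64, 5, 3
grads = rng.normal(size=(num_batches, m, n))  # hypothetical per-batch gradients
G = grads[0]                                  # gradient used for the current step

# Dataset averages of G G^T and G^T G.
L = np.einsum("bij,bkj->ik", grads, grads) / num_batches
R = np.einsum("bji,bjk->ik", grads, grads) / num_batches

def inv_sqrt(P):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    evals, evecs = np.linalg.eigh(P)
    return (evecs / np.sqrt(evals)) @ evecs.T

# Shampoo with power 1/2 and the (assumed) trace-based scalar correction.
shampoo_dir = np.sqrt(np.trace(L)) * inv_sqrt(L) @ G @ inv_sqrt(R)

# Adafactor in Shampoo's eigenbasis.
_, Q_L = np.linalg.eigh(L)
_, Q_R = np.linalg.eigh(R)
rotated = Q_L.T @ grads @ Q_R            # rotate every per-batch gradient
sq_mean = (rotated**2).mean(axis=0)      # dataset average of squared entries
A = sq_mean.sum(axis=1)                  # row sums
C = sq_mean.sum(axis=0)                  # column sums
V_hat = np.outer(A, C) / A.sum()         # rank-1 second-moment estimate
adafactor_dir = Q_L @ ((Q_L.T @ G @ Q_R) / np.sqrt(V_hat)) @ Q_R.T

print(np.allclose(shampoo_dir, adafactor_dir))
```

The two update directions agree because, in the eigenbasis, the row sums `A` equal the eigenvalues of $L$ and the column sums `C` equal the eigenvalues of $R$, so dividing by $\sqrt{\hat V}$ applies $L^{-1/2}$ and $R^{-1/2}$ up to the trace factor.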

While these two algorithms are equivalent in their idealized forms, practical considerations reveal some differences. Firstly, the algorithms differ when using running averages instead of dataset averages. Secondly, and more significantly in practice, we do not invert or compute the eigenvector decomposition of $L$ and $R$ at every step. This means that the “adaptivity” of learning rates in Shampoo is limited to the updates of $L$ and $R$. (We note that practical implementations of Shampoo use grafting, which allows for learning rate adaptivity at every step, but this adaptivity is restricted to a single scalar per layer.) In contrast, with Adafactor in Shampoo’s eigenspace, the second moment estimates (i.e., $A$ and $C$ in Algorithm 2) can be updated at every step as they are computationally inexpensive. Additionally, instead of using Adafactor, we can opt for Adam, which offers more generality. (Using AdamW over Adafactor gives only very small improvements in performance; see Figure 5 and Section 7.2. We also note that one can use any other diagonal-preconditioner-based optimizer in place of Adam, such as Lion (Chen et al., 2023), Sophia (Liu et al., 2024) or Schedule-Free AdamW (Defazio et al., 2024).) Combining these insights leads to Algorithm 3, which can be interpreted as running Adam in Shampoo’s eigenspace.

We now describe some additional implementation details:

Algorithm 3 describes the behavior of the algorithm for 2D layers. Following Zhao et al. (2024a), for 1D layers we run standard AdamW. This reduces the overhead as compared to standard implementations of Shampoo which solve an eigenvector problem for 1D layers too.

Following Wang et al. (2024), we compute eigenvectors of $L$ (and $R$) using one step of the power method (Algorithm 4). This requires one matrix multiplication followed by a QR decomposition. QR decomposition is faster (Documentation, 2024) than standard eigenvector decomposition in PyTorch. For the first iteration, eigenvectors are initialized by a standard eigenvector decomposition.

For layers with huge dimensions such as the first and last layer in language modeling transformers, maintaining the eigenvectors would be space and time prohibitive. For such dimensions we fix the rotation matrix ( $Q_{L}$ or $Q_{R}$ ) to be identity. Note that if we fix both $Q_{L}$ and $Q_{R}$ to be identity for a 2D layer, we would recover Adam.

Algorithm 3 omits bias correction and weight decay for simplicity, but these are used in the actual implementation, identical to their use in AdamW.
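The pieces described so far can be put together in a short NumPy sketch of one SOAP step for a 2D layer. This is an illustration under stated assumptions, not the implementation used in the experiments: the names `init_state`, `soap_step` and `refresh_eigenbasis` and the dictionary-based state are hypothetical, $L$ and $R$ are seeded from the first gradient, the same $\beta_2$ is assumed for $V$, $L$ and $R$, `eps` is added outside the square root, and bias correction and weight decay are omitted.

```python
import numpy as np

def refresh_eigenbasis(P, Q):
    """One power-iteration step on P starting from Q, re-orthonormalized by QR."""
    Q_new, _ = np.linalg.qr(P @ Q)
    return Q_new

def init_state(G0, identity=False):
    """Hypothetical optimizer state, seeded from a first gradient G0."""
    m, n = G0.shape
    L, R = G0 @ G0.T, G0.T @ G0
    if identity:   # fixed identity rotations (no eigenbasis is maintained)
        Q_L, Q_R = np.eye(m), np.eye(n)
    else:          # first basis from a standard eigendecomposition
        Q_L, Q_R = np.linalg.eigh(L)[1], np.linalg.eigh(R)[1]
    return {"t": 0, "M": np.zeros((m, n)), "V": np.zeros((m, n)),
            "L": L, "R": R, "Q_L": Q_L, "Q_R": Q_R, "refresh": not identity}

def soap_step(W, G, s, lr=1e-3, beta1=0.95, beta2=0.95, eps=1e-8, freq=10):
    """One SOAP step: Adam on the gradient rotated into the current eigenbasis."""
    s["t"] += 1
    Q_L, Q_R = s["Q_L"], s["Q_R"]
    # Momentum is kept in the original space and projected at every step.
    s["M"] = beta1 * s["M"] + (1 - beta1) * G
    G_rot = Q_L.T @ G @ Q_R
    M_rot = Q_L.T @ s["M"] @ Q_R
    # Second-moment estimate is maintained in the rotated space.
    s["V"] = beta2 * s["V"] + (1 - beta2) * G_rot**2
    N_rot = M_rot / (np.sqrt(s["V"]) + eps)
    # Rotate the normalized update back and apply it.
    W = W - lr * (Q_L @ N_rot @ Q_R.T)
    # Exponential moving averages of G G^T and G^T G.
    s["L"] = beta2 * s["L"] + (1 - beta2) * (G @ G.T)
    s["R"] = beta2 * s["R"] + (1 - beta2) * (G.T @ G)
    # Refresh the eigenbasis only every `freq` steps.
    if s["refresh"] and s["t"] % freq == 0:
        s["Q_L"] = refresh_eigenbasis(s["L"], Q_L)
        s["Q_R"] = refresh_eigenbasis(s["R"], Q_R)
    return W
```

With `identity=True` both rotations are fixed to the identity and the step reduces to the Adam update, matching the remark above about 2D layers with $Q_{L}$ and $Q_{R}$ both set to identity.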

The main focus of the next sections will be to explore the empirical performance of this algorithm and its variations.

Experimental Methodology

Hyperparameter tuning: We begin with hyperparameter values suggested by prior research for both AdamW and Distributed Shampoo (e.g., $\beta_{2}=0.95$ ). Initially, we conduct a learning rate sweep to determine the optimal learning rate for each optimizer. Once the optimal learning rate is identified, we perform two-dimensional sweeps for each of the remaining hyperparameters, where we vary the selected hyperparameter alongside the learning rate. The purpose of these sweeps is to demonstrate that our default hyperparameter settings are near-optimal, disregarding potential interactions between two non-learning-rate hyperparameters. A detailed discussion of the hyperparameter sweeps is provided in Appendix A.

Throughput Measurement: We evaluate the throughput of each optimizer by measuring the number of tokens processed per second. At present, we perform these measurements on a single H100 GPU and utilize gradient accumulation to accommodate large batch sizes. While this approach may seem to disadvantage AdamW (as the overhead of Shampoo/SOAP is compared against multiple gradient accumulation steps), it is important to note that the overhead of Shampoo/SOAP can be amortized across layers by distributing the updates across multiple GPUs. This technique is employed in the distributed implementation of Shampoo (Shi et al., 2023). A comprehensive comparison of distributed implementations of these algorithms is left to future work.

Efficiency Benefits: Simply running SOAP for the same duration as Shampoo and AdamW cannot be directly used to calculate the efficiency benefit (in terms of training steps or wall-clock time) of using SOAP, since we use a cosine schedule. Therefore, we run SOAP on fractions $0.5, 0.625, 0.75$ and $0.875$ of the training data and fit a scaling law of the form $a+bN^{-\beta}$ through the final losses obtained, where $N$ represents the number of training points and $a,b,\beta$ are the parameters of the fit. We show these points and the corresponding scaling laws in Figure 2. The scaling law is then used to calculate the efficiency benefit in terms of training steps and wall-clock time, also shown in Figure 2, where the horizontal lines represent the final losses of AdamW and Shampoo.
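The fitting procedure can be sketched as follows. The fitting method is not specified in the text, so this example assumes a simple grid search over $\beta$ with a linear least-squares solve for $a$ and $b$; the function names are hypothetical, $N$ is expressed as a fraction of the full run (which only rescales $b$), and the losses are synthetic values generated from a known law purely for illustration, not experimental results.

```python
import numpy as np

def fit_scaling_law(N, loss, betas=np.linspace(0.05, 2.0, 391)):
    """Fit loss ~ a + b * N**(-beta): grid over beta, least squares for (a, b)."""
    N, loss = np.asarray(N, float), np.asarray(loss, float)
    best = None
    for beta in betas:
        X = np.column_stack([np.ones_like(N), N ** (-beta)])
        coef, *_ = np.linalg.lstsq(X, loss, rcond=None)
        err = np.sum((X @ coef - loss) ** 2)
        if best is None or err < best[0]:
            best = (err, coef[0], coef[1], beta)
    return best[1:]

def data_fraction_to_reach(target_loss, a, b, beta):
    """Invert a + b * N**(-beta) = target_loss for N."""
    return ((target_loss - a) / b) ** (-1.0 / beta)

# Synthetic, purely illustrative final losses at the four run lengths.
fractions = np.array([0.5, 0.625, 0.75, 0.875])
true_a, true_b, true_beta = 3.0, 0.2, 0.5
losses = true_a + true_b * fractions ** (-true_beta)

a, b, beta = fit_scaling_law(fractions, losses)
# Hypothetical final loss of a baseline optimizer trained on the full data.
baseline_loss = true_a + true_b * 0.7 ** (-true_beta)
frac = data_fraction_to_reach(baseline_loss, a, b, beta)
saving = 1.0 - frac  # fraction of training steps saved relative to the baseline
```

In this synthetic setup the fitted curve crosses the baseline loss at about 70% of the full run, which corresponds to a saving of roughly 30% in training steps.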

Language Modeling Experiments

In this section we focus on empirically comparing AdamW, DistributedShampoo, and SOAP on language modeling tasks.

In Figure 1 (left and middle) and Figure 3 we show train loss curves of 360m and 660m models with 2m token batch size for AdamW, Shampoo, and SOAP, where SOAP outperforms the other two. To directly calculate the efficiency benefit of SOAP, we also run SOAP with cosine decay for a shorter learning rate schedule, as shown in Figures 1 and 3. This allows us to approximate the following efficiency benefits (when setting batch size to 2m and preconditioning frequency to 10): $\geq 40\%$ reduction in number of iterations and $\geq 35\%$ reduction in wall-clock time as compared to AdamW; $\approx 20\%$ reduction in iterations and wall-clock time as compared to Shampoo. Precise efficiency benefit calculations are presented in Figure 2 (left and middle).

6.2 Effect of Frequency of Finding Eigenvectors/Inverse

In Figure 1 (right), we compare SOAP and Shampoo with respect to preconditioning frequency. We observe the following:

For all frequencies we tried from 1 to 100, both optimizers outperform AdamW.

At frequency 1, SOAP and Shampoo are quite close in performance.

At higher frequencies, the performance of both SOAP and Shampoo degrades, but SOAP’s performance degrades significantly more slowly than Shampoo’s.

6.3 Effect of Batch Size

In this section, we examine the impact of batch size on the performance of the Shampoo and SOAP optimizers. Specifically, we reduce the batch size by a factor of 8, from 2m to 256k. To maintain the same FLOPS overhead for the eigenvector decomposition steps as in the 2m setting, we increase the preconditioning frequency by a factor of 8, from 10 to 80. In Figure 4, we present the optimal runs for each optimizer. Our results show that SOAP consistently outperforms both Shampoo and AdamW, demonstrating a reduction of 25% or more in the number of iterations compared to AdamW, and approximately a 10% reduction compared to Shampoo. In Figure 2 (right), we show that SOAP also improves in wall-clock time by $\geq 15\%$ over AdamW and approximately $10\%$ over Shampoo. Note that we present these results as a preliminary analysis for small batch size runs. It is quite likely that our increase in preconditioning frequency by a factor of 8 is not optimal and a better trade-off is achievable. Furthermore, the overhead of SOAP can likely be ameliorated by doing $L$ and $R$ updates in lower precision (instead of fp32).

We also note that the decrease in efficiency improvements at smaller batch sizes for second-order methods is consistent with prior works (Zhang et al., 2019; Ishikawa & Yokota, 2024).

Further Efficiency Improvements

In this section, we discuss space and time complexity of SOAP and provide an overview of potential avenues for further space and compute efficiency improvements in SOAP.

As described in Section 3, Zhao et al. (2024a) have an algorithm similar to ours. One of the differences is that they only project the smaller side of the layer using the eigenbasis while using identity as the rotation matrix for the larger side i.e. if $m<n$ we set $Q_{R}=I_{n}$ in Algorithm 3 and if $m>n$ we set $Q_{L}=I_{m}$ . Doing this leads to a reduction in space usage as well as reduction of optimizer time overhead, which is discussed in Sections 7.2.1 and 7.3.1.

In Figure 5, it is evident that the one-sided projection results in slightly reduced performance compared to the original SOAP optimizer. However, it still performs on par with, or marginally better than, Shampoo, while maintaining greater computational efficiency. Further investigation into the potential for these variants to surpass the computational efficiency of the original SOAP optimizer is left for future work.

7.2 Space usage of SOAP

For an $m\times n$ matrix where $m>n$, we require

$$2m^{2}+2n^{2}+3mn$$

space usage (beyond weights and activations), specifically for $L,Q_{L},R,Q_{R}$, momentum ($M$), AdamW’s second-moment estimate ($V$), and the gradient. This is the same space usage as DistributedShampoo, while AdamW uses $3mn$. (One $mn$ term is for storing the gradients; this can be avoided, as long as there is no gradient accumulation, by applying gradients along with backprop (Lv et al., 2024b), but this is not implemented by default in standard deep learning frameworks such as PyTorch. Hence we include this term in all of our calculations.)

The most direct way to reduce memory is using low precision to store the $L,R,Q_{L},Q_{R},V$ matrices, which is done by Dettmers et al. (2022); Wang et al. (2024). Orthogonal to the low precision approaches, there are two algorithmic approaches to improving the space usage of SOAP:

Using Adafactor instead of Adam as the diagonal preconditioner after rotating by $Q_{L}$ and $Q_{R}$ . This reduces the space usage by $mn$ .

Using the one-sided version of SOAP (Section 7.1). This reduces space usage from $2m^{2}+2n^{2}+3mn$ to $2\min(m,n)^{2}+3mn$.

Combining these approaches yields space usage of $2\min(m,n)^{2}+2mn$ .

For standard transformer architectures the last variant which combines the two approaches would yield less space usage overall compared to AdamW (which uses $3mn$ ).
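The space formulas above can be compared directly for a given layer shape. The helper names below are hypothetical, and the counts follow the formulas in the text (in number of stored entries, not bytes); the layer shape $4096\times 1024$ is chosen as an example of an MLP matrix with width 1024 and a 4x hidden dimension.

```python
def soap_space(m, n, adafactor=False, one_sided=False):
    """Entries stored for an m x n layer, following the formulas in the text."""
    k = min(m, n)
    # L, Q_L, R, Q_R for two-sided SOAP; only the smaller side if one-sided.
    precond = 2 * k * k if one_sided else 2 * m * m + 2 * n * n
    # Gradient and momentum, plus V unless Adafactor replaces it.
    dense = 2 * m * n if adafactor else 3 * m * n
    return precond + dense

def adamw_space(m, n):
    """Gradient, momentum and second-moment estimate."""
    return 3 * m * n

m, n = 4096, 1024
for name, kwargs in [("SOAP", {}),
                     ("SOAP + Adafactor", {"adafactor": True}),
                     ("one-sided SOAP", {"one_sided": True}),
                     ("one-sided SOAP + Adafactor",
                      {"adafactor": True, "one_sided": True})]:
    print(f"{name}: {soap_space(m, n, **kwargs) / adamw_space(m, n):.2f}x AdamW")
```

For this rectangular layer the combined variant stores fewer entries than AdamW, while for a square layer ($m=n$) the same formula gives $4n^{2}$ against AdamW's $3n^{2}$, so the overall comparison depends on the mix of layer shapes in the model.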

We try these approaches in Figure 5. We observe that using Adafactor instead of AdamW yields very small reductions in performance, while using the one-sided preconditioner results in larger reductions. Nonetheless, even after combining these two approaches, the resulting optimizer outperforms AdamW while having a smaller space requirement than AdamW. Regarding space usage, we also note that Adafactor (with momentum added back) itself uses only $2mn$ space and has been shown to perform comparably to AdamW for ViT training (Zhai et al., 2022) and for language model training (Zhao et al., 2024c). Further space reduction beyond Adafactor has been studied in the AdaLomo (Lv et al., 2024a), GaLore (Zhao et al., 2024a), and AdaMeM (Vyas et al., 2024) papers.

7.3 Time Overhead of SOAP

There are two types of overhead of Shampoo and SOAP over AdamW: the overhead per step and the overhead when changing the preconditioner (or for SOAP, the preconditioner’s eigenbasis). Let us first analyze the first one. For SOAP, per step and for a layer of size $m\times n$, we have an overhead of

$$m^{3}+n^{3}+2m^{2}n+2n^{2}m.$$

We note that this is more than the overhead of Shampoo, which is $m^{3}+n^{3}+m^{2}n+n^{2}m$. This can be observed in Figure 2 (bottom, right) but not in the other figures, since there the second type of overhead is the dominant term.

The second type of overhead is due to changing the preconditioner for Shampoo (or for SOAP, the preconditioner’s eigenbasis, i.e., $Q_{L}$ and $Q_{R}$). The DistributedShampoo (Shi et al., 2023) implementation of Shampoo uses a direct call to torch.linalg.eigh for this. Following Wang et al. (2024), we use Algorithm 4, a power-iteration-based approach which calls torch.linalg.qr. We note that torch.linalg.qr is faster than torch.linalg.eigh (Documentation, 2024). In Figure 6 (right) we see that the power-iteration-based approach (torch.linalg.qr) performs as well as a fresh eigenvector decomposition (torch.linalg.eigh).

Effect of frequency on overhead: In Figure 6 (left), we observe that the overhead decreases as the preconditioning frequency increases, i.e., the frequency of invoking Algorithm 4. If the only additional computation occurred in Algorithm 4, we would expect the overhead to scale as $1.0/(\text{preconditioning frequency})$ , approaching zero. However, empirical results (Figure 6 left) show that the overhead approaches an asymptote greater than zero. This is attributable to the additional matrix multiplications required to update $L$ , update $R$ , project the gradient, and reproject the gradient (for each layer) in the optimizer. Currently, these operations are performed in float32; reducing the precision of these operations, as proposed in Wang et al. (2024), could lower this asymptote.

The per-step overhead of SOAP can be reduced by using low precision to store the $L,R,Q_{L},Q_{R},V$ matrices (Dettmers et al., 2022; Wang et al., 2024), which in turn will speed up computation done using these matrices. This approach cannot be used for reducing the overhead of the preconditioner update in popular deep learning frameworks such as PyTorch, since torch.linalg.qr does not support precision lower than float32. Orthogonal to the low precision approach, we can improve the per-step time overhead of SOAP by the following algorithmic approaches:

Using Adafactor instead of Adam (Section 7.2) as the diagonal preconditioner after rotating by $Q_{L}$ and $Q_{R}$. In this version of SOAP the overhead can be improved from $m^{3}+n^{3}+2m^{2}n+2n^{2}m$ to $m^{3}+n^{3}+m^{2}n+n^{2}m+\max(m,n)^{2}\min(m,n)+\min(m,n)^{3}$ by merging the project and project-back steps for the smaller dimension.

Using the one-sided version of SOAP (Section 7.1). This reduces overhead from $m^{3}+n^{3}+2m^{2}n+2n^{2}m$ to $\min(m,n)^{3}+2\min(m,n)^{2}\max(m,n)$.

Combining these approaches yields an overhead of $\min(m,n)^{2}\max(m,n)+2\min(m,n)^{3}$.

Using the one-sided version also reduces the second type of overhead from calls to torch.linalg.qr on an $m\times m$ and an $n\times n$ matrix to a single call on a $\min(m,n)\times\min(m,n)$ matrix.
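For a concrete comparison, the per-step overhead expressions quoted in this section can be evaluated for a given layer shape. The function names are hypothetical and the returned values are the operation counts exactly as written in the text, not measured runtimes.

```python
def soap_step_overhead(m, n, adafactor=False, one_sided=False):
    """Per-step overhead of SOAP variants for an m x n layer (formulas from the text)."""
    lo, hi = min(m, n), max(m, n)
    if adafactor and one_sided:
        return lo**2 * hi + 2 * lo**3
    if one_sided:
        return lo**3 + 2 * lo**2 * hi
    if adafactor:
        return m**3 + n**3 + m**2 * n + n**2 * m + hi**2 * lo + lo**3
    return m**3 + n**3 + 2 * m**2 * n + 2 * n**2 * m

def shampoo_step_overhead(m, n):
    """Per-step overhead of Shampoo for an m x n layer (formula from the text)."""
    return m**3 + n**3 + m**2 * n + n**2 * m

m, n = 4096, 1024
base = soap_step_overhead(m, n)
for name, kwargs in [("SOAP + Adafactor", {"adafactor": True}),
                     ("one-sided SOAP", {"one_sided": True}),
                     ("one-sided SOAP + Adafactor",
                      {"adafactor": True, "one_sided": True})]:
    print(f"{name}: {soap_step_overhead(m, n, **kwargs) / base:.3f} of SOAP")
print(f"Shampoo: {shampoo_step_overhead(m, n) / base:.3f} of SOAP")
```

The one-sided variants remove every term that is cubic or quadratic in the larger dimension, which is where most of the per-step cost sits for strongly rectangular layers.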

Discussion and Future Work

We study an optimizer called SOAP: ShampoO with Adam in the Preconditioner’s eigenbasis. We show that SOAP outperforms both AdamW and Shampoo in language modeling tasks and show that it is more robust to changes in preconditioning frequency than Shampoo. For future work, we would like to explore further improvements to the design of SOAP, in particular, related to using lower precision for the preconditioners as well as a better distributed implementation. We would also like to explore the performance of SOAP on other domains such as vision.

References

Appendix A Experimental Setup

Many aspects of our setup such as models are the same as in Zhao et al. (2024c). We will restate those details verbatim for completeness.

We train language models on C4 tokenized with the T5 tokenizer (Raffel et al., 2020) and report results in terms of validation loss.

We start from the OLMo codebase (Groeneveld et al., 2024) and train decoder-only transformer models of three sizes: 210m, 360m, and 660m, where the parameter count refers to non-embedding parameters. The models have widths of 1024, 1024, and 1408 and depths of 12, 24, 24. We used the 210m model to explore various ablations; most of our reported results are on 360m and 660m. The MLP hidden dimension is 4x of the width. The activation function is GeLU (Hendrycks & Gimpel, 2016). We use RoPE positional encodings (Su et al., 2024). Attention heads are always dimension 64. We use PyTorch default LayerNorm. We use QK layer norm (Dehghani et al., 2023). Following Wortsman et al. (2024) we do not learn biases for the linear layers or LayerNorms. We train in mixed precision with bfloat16.

We use the standard PyTorch implementation of AdamW (Paszke et al., 2019) and the DistributedShampoo (Shi et al., 2023) implementation of Shampoo. We implement SOAP and GaLore ourselves, starting from an older version of the PyTorch implementation of AdamW and the official GaLore implementation (Zhao et al., 2024b).

We use $\beta_{1}=0.95$, as we found it to outperform $\beta_{1}=0.9$ in our sweeps for the 360m model. Following Wortsman et al. (2024), we use decoupled weight decay with coefficient $1e-4$ and z-loss with coefficient $1e-4$. We use the default value of $\epsilon=1e-8$ in AdamW (actual or when used for grafting), SOAP and GaLore. We use warmup followed by cosine decay as our scheduler. We start the warmup and end the cosine decay at $0.1\times$ the maximum learning rate.
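A schedule of this shape can be sketched as below. The text does not state the shape of the warmup, so a linear warmup is assumed here; the function name `lr_at` and the maximum learning rate in the usage example are hypothetical, while the step counts (3200 total, 600 warmup) are the 360m values given in this appendix.

```python
import math

def lr_at(step, total_steps, warmup_steps, max_lr, floor=0.1):
    """Warmup followed by cosine decay; both start/end at floor * max_lr.

    Linear warmup is an assumption; only the 0.1x endpoints come from the text.
    """
    min_lr = floor * max_lr
    if step < warmup_steps:
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# 360m setting: 3200 steps with 600 warmup steps; max_lr is a placeholder value.
schedule = [lr_at(t, 3200, 600, max_lr=1e-2) for t in range(3201)]
```

Because the decay is tied to the total number of steps, a run stopped early has not finished its decay, which is why the shorter runs are trained separately with their own schedules.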

Shi et al. (2023) state that they find the optimal exponent to be either $-1/2$ or $-1.82/4\approx-1/2.2$. Our preliminary findings were similar to this. Hence we set the default values of exponent to be $-1/2.5$ for both 1D and 2D parameters. We set $\epsilon_{\text{shampoo}}=1e-12$ and $\beta_{\text{shampoo}}=0.95$ based on our initial set of experiments on the 210m model.

GaLore introduces an additional hyperparameter called scale ( $\alpha$ ) since due to low rank updates the overall update magnitude decreases. Since we are running a full rank version of GaLore we set $\alpha=1$ .

For all of our runs we use a sequence length of 1024. For all models (except in Section 6.3), we use a token batch size of 2048k $\approx$ 2m. We default to training models for the approximately “chinchilla optimal” number of tokens that is $\approx$ 20 times the number of parameters. Explicitly, this means for our default batch size of 2m, the 210m models are trained for 1600 steps or $\approx$ 3.3b tokens. The 360m models are trained for 3200 steps, the 660m models are trained for 6400 steps.

A.1 Sweeping over hyperparameters

AdamW, 2m batch size: Starting from the default hyperparameters above we do the following sweeps:

We sweep over learning rate in $\{.1,.0316,.01,\ldots,3.16e-4\}$.

(360m) We sweep over the cross product of best 3 learning rates and $\beta_{1}\in\{0.9,0.95,0.99\}$ .

(360m) We sweep over the cross product of best 3 learning rates and $\beta_{2}\in\{0.9,0.95,0.99\}$ .

The last two of the sweeps did not yield any benefit for the 360m model with 2m batch size hence we only sweep over learning rate for the 660m model with 2m batch size.

DistributedShampoo, 2m batch size: Starting from the default hyperparameters above we do the following sweeps:

We sweep over learning rate in $\{.1,.0316,.01,\ldots,3.16e-4\}$.

(360m) We sweep over the cross product of best 3 learning rates from above and $\epsilon_{\text{shampoo}}\in\{1e-11,1e-12,1e-13\}$.

(360m) We sweep over the cross product of best 3 learning rates from above and $\beta_{\text{shampoo}}\in\{.9,.95,.975\}$.

Let $e_{1},e_{2}$ denote the exponents used in DistributedShampoo for 1D and 2D parameters respectively. We also sweep over the cross product of best 3 learning rates from above and $(e_{1},e_{2})$ in $\{(2,2),(2.5,2.5),(3,3),(2,4)\}$ .

These sweeps did not yield any significant improvement in performance ( $<.004$ ) for the 360m model. Hence we only sweep over the learning rate for the 660m model.

SOAP, 2m batch size: Starting from the default hyperparameters above we sweep over learning rate in $\{.1,.0316,.01,\ldots,3.16e-4\}$.

AdamW, 256k batch size: For the 360m model with 256k batch size we start from the default hyperparameters and do the following sweeps:

We sweep over learning rate in $\{.1,.0316,.01,\ldots,3.16e-4\}$.

We sweep over the cross product of best 3 learning rates and $\beta_{2}\in\{0.95,0.99\}$ .

In the second sweep we observe small improvements in performance by using $\beta_{2}=.99$ , so our final numbers use $\beta_{2}=.99$ . This (small) improvement in performance by using a larger $\beta_{2}$ at smaller batch sizes was also observed by Porian et al. (2024); Zhao et al. (2024c).

DistributedShampoo, 256k batch size: For the 360m model with 256k batch size we start from the default hyperparameters and do the following sweeps:

We sweep over learning rate in $\{.1,.0316,.01,\ldots,3.16e-4\}$.

We sweep over the cross product of best 3 learning rates and $(\beta_{2},\beta_{\text{shampoo}})\in\{(.95,.95),(.99,.99)\}$ .

In the second sweep we observe small improvements in performance by using $\beta_{2}=\beta_{\text{shampoo}}=.99$ , so our final numbers use $\beta_{2}=\beta_{\text{shampoo}}=.99$ .

SOAP, 256k batch size: For the 360m model with 256k batch size we start from the default hyperparameters and do the following sweeps:

We sweep over learning rate in $\{.1,.0316,.01,\ldots,3.16e-4\}$.

We sweep over the cross product of best 3 learning rates and $\beta_{2}\in\{.95,.99\}$ .

In the second sweep we observe small improvements in performance by using $\beta_{2}=.99$ , so our final numbers use $\beta_{2}=.99$ .

Preconditioning frequency sweeps: For the preconditioning frequency experiments of SOAP and Shampoo (Figure 1 (right)), for each frequency we do a learning rate sweep over the best 3 learning rates found at preconditioning frequency 10. Other hyperparameters are set to their optimal values obtained using the preconditioning frequency 10 sweeps.

360m and 660m shorter runs: For each of the shorter runs of 360m and 660m models for the SOAP optimizer (Figure 2), we did a learning rate sweep over the best 3 learning rates found for the standard length run. Other hyperparameters are set to their optimal values obtained using the standard length run.

Warmup: The warmup durations for the 360m and 660m models were 600 and 1200 steps respectively. For the shorter runs (Figure 2) of the 360m model, the warmup durations were 400, 400, 500 and 525 steps for the 0.5, 0.625, 0.75 and 0.875 runs respectively. For the 660m model, the warmup durations were 600, 750, 900 and 1050 steps for the 0.5, 0.625, 0.75 and 0.875 runs respectively.

Appendix B GaLore

We tried GaLore for the 210m model, and while it outperformed AdamW, it performed worse than Shampoo. Hence we do not try GaLore for larger model sizes.

Hyperparameter sweeps: We did the following sweeps:

We swept the cross product over learning rate ( $3.16e-4,1e-3,3.16e-3,1e-2$ ), preconditioning frequency ( $10,50,200$ ), both sided and one sided versions. Frequency 200 had the best results matching the observation of Zhao et al. (2024a).

We did a cross product sweep over learning rate ( $3.16e-4,1e-3,3.16e-3,1e-2$ ), both sided and one sided versions with $\beta_{2}=.99$ instead of $.95$ and preconditioning frequency 200.

We did a cross product sweep over learning rate ( $3.16e-4,1e-3,3.16e-3,1e-2$ ), both sided and one sided versions, preconditioning frequency ( $50,200$ ) with $\beta_{1}=.9$ instead of $.95$ .

The best performing run among all of these achieved a final loss of 3.12 while the best Shampoo run achieved a final loss of 3.10.