Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster

Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness

Introduction

Recent research in large language models (LLMs) shows important advances that can improve LLM quality and efficiency. Scaling law studies show predictable and significant improvements in model performance by increasing model and dataset size (Hestness et al., 2017; Kaplan et al., 2020). Language models can also be improved just by training on more data (Hoffmann et al., 2022; Touvron et al., 2023). Recent works, such as Maximal Update Parameterization (µP), also show techniques to improve training stability and performance as models scale up (e.g., Bachlechner et al. (2020); Yang et al. (2021)).

Concurrently with these advances, the research community has trained and released many open-source models. Models like GPT-J, GPT-NeoX, OPT, and Pythia have each held state-of-the-art accuracy for open source models for their size, and these models can be tested and used simply by downloading the pre-trained weights (Wang & Komatsuzaki, 2021; Black et al., 2022; Zhang et al., 2022; Biderman et al., 2023). While these models are important contributions, they have not aimed to be compute-efficient. The research community needs more reproducible scaling efforts that can guide collective decisions about training large foundation models in a compute-efficient way.

We introduce Cerebras-GPT, our open effort to combine recent LLM efficient scaling techniques to produce compute-optimal pre-trained models and corresponding scaling laws. Cerebras-GPT is a family of GPT-3-like models that we scale from 111M to 13B parameters. We train them on the open-source dataset, the Pile (Gao et al., 2020), following DeepMind’s Chinchilla scaling rules (Hoffmann et al., 2022). Cerebras-GPT models show state-of-the-art training efficiency when targeting both upstream Pile evaluations as well as a suite of downstream tasks. Our largest model shows state-of-the-art performance on pre-training and most downstream tasks compared to other comparably-sized public models. We also characterize some of the training stability challenges when scaling Cerebras-GPT. We address the challenges by training models with µP, which shows further accuracy improvements and hyperparameter predictability.

Cerebras-GPT models form the compute-optimal Pareto frontier for both pre-training and popular downstream objectives. Figure 1 shows the upstream Pile frontiers compared to contemporary works. We characterize the Pareto frontiers with scaling laws that can be used to predict the benefits of further model and dataset scaling efforts. We also observe and discuss that future open efforts should consider aggregate compute budget (both pre-training and expected inferences) when deciding the appropriate balance of model size and pre-training dataset size.

Overall, the contributions of this work are as follows:

We train Cerebras-GPT compute-optimal models scaled from 111M to 13B parameters on the Pile dataset following Chinchilla scaling rules to collect compute-efficient scaling laws.

We show that these models provide state-of-the-art training efficiency on both pre-training and downstream objectives compared to other open models, the first such open effort.

We provide detailed instructions to reproduce our results, including the use of µP to improve training stability and transfer hyperparameters as models scale up.

We document our experience training these models on the Andromeda AI Cluster, comprising 16 Cerebras CS-2 systems, and we describe the simplicity of scaling models and performance.

Finally, we aim to enable the research community to consume these results. We release our pre-trained models and code, and we share details about our training process here, so the community can use and reproduce our results. Pre-trained models are available on HuggingFace: https://huggingface.co/cerebras. Source code is available in the Cerebras Modelzoo: https://github.com/Cerebras/modelzoo. We hope these models will be a valuable addition for the open-source community.

Methodology

In this section, we describe the details of the models we trained, including hyperparameters used at each model scale and details about how we obtain and use the Pile dataset. We also motivate the need for techniques to stabilize scaling, and we describe how we use Maximal Update Parameterization (µP).

Cerebras-GPT models have a GPT-3-like architecture, an autoregressive transformer decoder model (Brown et al., 2020). The main difference is that unlike GPT-3, which uses alternating dense and sparse-banded attention, we use dense attention in all decoder blocks. We select model dimensions to either follow an aspect ratio of ~80 ($d_{\text{model}}/n_{\text{layers}}$) or the same shape as GPT-3 models. All models are trained with a maximum sequence length of 2048 tokens. Table 1 lists the specific model dimensions for each model size. Our formula for the number of parameters is provided in Appendix E.

2.2 Pre-training Corpus

We pre-train models on the Pile dataset, which consists of data from 22 data sources, including Common Crawl, PubMed Central, Books3, OpenWebText2, GitHub, and arXiv (Gao et al., 2020). We use the dataset splits for train, test, and validation sets provided in the Pile configuration. We tokenize the corpora with byte-pair encoding and the GPT-2 vocabulary of size 50257 (Sennrich et al., 2016; Radford et al., 2019). We do not deduplicate the Pile, but we believe that deduplication could further improve our results. We include more details about the Pile and dataset pre-processing in Appendix A.1.

To evaluate pre-training, we compare Cerebras-GPT models to several publicly-available models using cross-entropy loss on the Pile test set. To ensure fair comparisons, we run evaluation ourselves on all checkpoints rather than using published numbers, though in most cases, our evaluations match prior works. For models that use different vocabularies, we correct cross-entropy back to the equivalent value with the GPT-2 vocabulary based on the number of tokens in each dataset.
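The vocabulary correction described above can be sketched as follows. This is a minimal illustration that assumes the correction simply redistributes the dataset's total negative log-likelihood over the GPT-2 token count; the function name and the example numbers are hypothetical and are not taken from the released code.

```python
def to_gpt2_equivalent_loss(loss_per_token, n_tokens_model, n_tokens_gpt2):
    """Rescale a per-token cross-entropy (nats/token) to GPT-2-token units.

    Assumption: the total negative log-likelihood of a dataset does not
    depend on how the text is split into tokens, so we recover that total
    and divide it by the GPT-2 token count instead.
    """
    if n_tokens_model <= 0 or n_tokens_gpt2 <= 0:
        raise ValueError("token counts must be positive")
    total_nll = loss_per_token * n_tokens_model
    return total_nll / n_tokens_gpt2


# Hypothetical numbers for illustration only: a tokenizer that needs 10%
# more tokens than GPT-2 for the same text.
example = to_gpt2_equivalent_loss(
    loss_per_token=2.0, n_tokens_model=1_100, n_tokens_gpt2=1_000
)
print(example)
```

A model whose tokenizer produces more tokens for the same text reports a lower per-token loss, so the corrected value is larger than the raw one.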

2.3 Model Training

We train models using the following training configurations. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with $(\beta_1, \beta_2) = (0.9, 0.95)$. We set $\epsilon$ to 1e-8 for small models and to 1e-9 for the 6.7B and 13B parameter models. We use weight decay of 0.1 for all models. We do not use dropout for pre-training. For all runs, we use gradient norm clipping of 1.0.

We use learning rates and batch sizes consistent with prior works, as listed in Table 1. We find that linear learning rate decay tends to perform better than cosine decay, so we use it in most of our pre-training runs. With either decay type, we warm up the learning rate linearly over 375M tokens and then decay to 10% of the maximum learning rate. Table 1 also lists batch sizes. For the 13B parameter model, we train with a batch size of 720 sequences of length 2048 tokens for the first 84B tokens. At that point, we observed the gap between validation and train loss growing, indicating that the gradient noise was growing, so we increased the batch size to 1080 sequences for the rest of training.
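The warmup-then-linear-decay schedule described above can be sketched as a function of tokens seen. This is an illustrative sketch only: it assumes the schedule is tracked in tokens rather than optimizer steps, and the function name, the 2B-token run length, and the 6e-4 peak learning rate in the usage lines are hypothetical.

```python
def learning_rate(tokens_seen, total_tokens, max_lr,
                  warmup_tokens=375_000_000, final_fraction=0.10):
    """Linear warmup to max_lr, then linear decay to final_fraction * max_lr."""
    if total_tokens <= warmup_tokens:
        raise ValueError("total_tokens must exceed warmup_tokens")
    if tokens_seen < warmup_tokens:
        # Warmup phase: ramp linearly from 0 to max_lr.
        return max_lr * tokens_seen / warmup_tokens
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens))
    min_lr = final_fraction * max_lr
    return max_lr - (max_lr - min_lr) * progress


# Hypothetical run: 2B training tokens with a peak learning rate of 6e-4.
for tokens in (0, 375_000_000, 1_000_000_000, 2_000_000_000):
    print(tokens, learning_rate(tokens, 2_000_000_000, 6e-4))
```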

To scale Cerebras-GPT model training in a compute-efficient way, we follow the DeepMind Chinchilla scaling methodology outlined in Hoffmann et al. (2022). Specifically, we test and find that models trained with roughly 20 tokens per parameter offer the most compute-efficient pre-training, consistent with the Chinchilla results. We believe this paper is the first open effort to estimate the compute-efficient tokens per parameter for the Pile dataset. Section 3.1 of our results characterizes the effect of training with more tokens per parameter, and we include further test results in Appendix D.
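To make the 20-tokens-per-parameter rule concrete, the sketch below computes the token budget and a rough training FLOPs estimate for a given model size. The factor of 6 FLOPs per parameter per token is a common approximation and an assumption here, not necessarily the FLOPs accounting used for the reported results; the function name is hypothetical.

```python
def compute_optimal_budget(n_params, tokens_per_param=20.0):
    """Return (training tokens, approximate training FLOPs) for a model size."""
    n_tokens = tokens_per_param * n_params
    # Assumption: ~6 FLOPs per parameter per token (forward + backward pass),
    # ignoring attention-specific terms.
    train_flops = 6.0 * n_params * n_tokens
    return n_tokens, train_flops


# Model sizes mentioned in the text.
for n_params in (111e6, 2.7e9, 13e9):
    tokens, flops = compute_optimal_budget(n_params)
    print(f"{n_params:.3g} params -> {tokens:.3g} tokens, ~{flops:.3g} FLOPs")
```

Because the token budget grows linearly with model size, the estimated training FLOPs grow with the square of the parameter count.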

Finally, we train models using both FP16 mixed precision and bfloat16 precision (Micikevicius et al., 2018; Abadi et al., 2016). Overall, we find bfloat16 to be more stable due to its extra exponent range, so we use it for all Cerebras-GPT models that we release. We include further discussion of precision in Appendix A.2.

2.4 Standard (SP) and Maximal Update Parameterization (µP)

Standard Parameterization (SP): We configure our main Cerebras-GPT models with the common standard parameterization (SP) approach. In SP, model weights are initialized from normal distributions with constant standard deviation or standard deviation based on the shape of each layer (Glorot & Bengio, 2010). We initialize embedding and hidden layer weights with a truncated normal distribution with standard deviation $\sigma = 0.02$. An exception is that we use a standard deviation of $\sigma = 0.02/\sqrt{2 \cdot n_{\text{layers}}}$ for the last layer inside each residual network, following the GPT-2 initialization (Radford et al., 2019).
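As a small worked example of the SP initialization rule above, the helper below returns the standard deviation for a given layer. The function name and the `is_residual_output` flag are hypothetical, and the truncation of the normal distribution is not modeled.

```python
import math


def sp_init_std(n_layers, base_std=0.02, is_residual_output=False):
    """Standard deviation for SP weight initialization.

    Most weights use base_std; the last layer inside each residual branch
    uses base_std / sqrt(2 * n_layers), following the GPT-2 scheme.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    if is_residual_output:
        return base_std / math.sqrt(2 * n_layers)
    return base_std


# Hypothetical depth of 8 layers: sqrt(2 * 8) = 4, so the residual-output
# layers are initialized with one quarter of the base standard deviation.
print(sp_init_std(8), sp_init_std(8, is_residual_output=True))
```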

Unfortunately, the SP approach does not account for potential inter-layer interactions and resulting training dynamics that arise when scaling to very large models. As SP models scale, they tend to become unstable as weight and activation values bump up against the limits of the numerical representations used to train them. For large models, unstable training can cause very costly restarts and researchers might not have budget for extensive hyperparameter tuning.

Maximal Update Parameterization (µP): To address these issues, we also train a set of Cerebras-GPT models with Maximal Update Parameterization (µP) (Yang et al., 2021). µP controls initialization, layer-wise learning rates, and activation magnitudes to ensure analytically stable training independent of a model’s layer widths. In addition to improving training stability, µP also improves the transferability of training hyperparameters from smaller to larger scale models, a technique called µTransfer. µTransfer permits directly using the same settings for some optimizer hyperparameters, most notably the learning rate.

We train a set of Cerebras-GPT models using µP. We follow the µTransfer approach by first tuning hyperparameters for a small, 40M parameter µP model. Then, we transfer the hyperparameters along our µP scaling law up to a 2.7B parameter model. µP requires small changes to our baseline Cerebras-GPT models, including adding element-wise activation tensor scaling, adjusting initializers for affected layers, and adding layer-wise learning rates scaling to certain layers. We discuss the benefits we see with µP in Section 3.3. Refer to Appendix G for our tips to implement µP and our hyperparameter tuning notes.
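The kind of width-dependent adjustment involved can be sketched as below. This is a hedged illustration that assumes the width-multiplier form of the Adam rules described by Yang et al. (2021); it is not necessarily the exact set of rules given in Appendix G, and all names and numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MuPScaling:
    hidden_init_std: float          # init std for hidden (matrix-like) weights
    hidden_lr: float                # learning rate for hidden weights
    output_logit_multiplier: float  # scale applied to the output logits


def mup_scaling(d_model, d_model_base, base_init_std, base_lr):
    """Scale hyperparameters tuned at width d_model_base to width d_model.

    Assumption: with width multiplier m = d_model / d_model_base, hidden
    weights use std / sqrt(m) and lr / m, and output logits are scaled by 1 / m.
    """
    if d_model <= 0 or d_model_base <= 0:
        raise ValueError("widths must be positive")
    m_width = d_model / d_model_base
    return MuPScaling(
        hidden_init_std=base_init_std / m_width ** 0.5,
        hidden_lr=base_lr / m_width,
        output_logit_multiplier=1.0 / m_width,
    )


# Hypothetical example: settings tuned at width 256, transferred to width 1024.
print(mup_scaling(d_model=1024, d_model_base=256, base_init_std=0.08, base_lr=6e-3))
```

Under these assumptions the base learning rate and base initialization stay fixed across model sizes, and only the width multiplier changes.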

Results

In this section, we show pre-training and downstream evaluations of Cerebras-GPT models, scaled from 111M to 13B parameters, and we compare against recent related works. We characterize the compute-efficient Pareto frontier for pre-training models on the Pile dataset and show that models on this frontier are also competitive on downstream tasks. We believe this is the first study to release a compute-optimal scaling law for pre-training on the Pile dataset that is openly reproducible by the community.

We show that Cerebras-GPT models define the state-of-the-art compute-optimal Pareto frontier on both pre-training and downstream objectives. Further, our largest model with 13B parameters shows improved accuracy on most downstream tasks compared to other comparably-sized publicly-available models. (We believe the LLaMa 13B model is better than Cerebras-GPT on downstream tasks because it was trained on 4× more tokens, but we were unable to get access to test the model ourselves.) We also train Cerebras-GPT models configured using µP. We show that µP enables direct hyperparameter transfer from smaller to larger models and improves the compute-optimal frontier loss by 0.4%.

We scaled and pre-trained Cerebras-GPT models from 111M to 13B parameters on the Pile dataset. We compare the Pile test set loss for Cerebras-GPT models against other publicly available pre-trained models: GPT-J, GPT-NeoX, and Pythia (Wang & Komatsuzaki, 2021; Black et al., 2022; Biderman et al., 2023). (All Cerebras-GPT development and hyperparameter tuning was evaluated using the Pile validation set.) We believe these models to be fair comparisons because the models were trained either directly on the Pile or on similarly-prepared datasets.

Figure 2 plots pre-training efficiency (values also listed in Table 2). The horizontal axis plots floating-point operations (FLOPs) spent during pre-training (log scale), and the vertical axis plots Pile test loss (log scale). (Pile test loss is cross-entropy in nats/token. We correct all cross-entropy results for different vocabularies to be comparable to the GPT-2 vocabulary.) Across all model scales, Cerebras-GPT sets the efficiency frontier, largely because models were pre-trained with 20 tokens per parameter, consistent with findings in the Chinchilla paper. Other public models use more tokens per parameter, requiring more FLOPs to achieve similar loss.

There are a couple of notable observations from Figure 2. First, the scaling law for Cerebras-GPT models extrapolates accurately to larger model scales. We estimated the 13B model loss using a similar scaling law fit to models up to 6.7B parameters, and the 13B model trained to within 0.5% of projected loss. Extending the existing scaling law shows that if we budgeted to train a model with FLOPs equivalent to GPT-NeoX 20B, we would expect the Cerebras-GPT model loss to be ~1.2% better than GPT-NeoX 20B. For future reference, we include the compute-optimal frontier scaling law here ($f$ is compute in FLOPs, $\mathcal{L}$ is loss):

Second, increasing tokens per parameter above 20 leads to smoothly degraded loss for the FLOP budget. Pythia models are each trained using 299.9B tokens from the Pile. As model size increases, tokens per parameter decreases reciprocally, and losses move closer to the compute-optimal frontier. The largest Pythia model at 12B parameters is trained with 25.3 tokens per parameter and is just 0.3% loss above the Cerebras-GPT scaling law.
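The reciprocal relationship between model size and tokens per parameter at a fixed token budget can be checked with a few lines. The model sizes other than 12B are illustrative placeholders, and the helper name is hypothetical. With the rounded 12B figure the ratio comes out near 25; the 25.3 quoted above presumably reflects the model's exact parameter count.

```python
def tokens_per_parameter(n_tokens, n_params):
    """Ratio of training tokens to model parameters."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return n_tokens / n_params


PYTHIA_TOKENS = 299.9e9  # fixed token budget stated in the text

# Illustrative sizes; only the 12B model is named in the text.
for n_params in (1e9, 3e9, 12e9):
    ratio = tokens_per_parameter(PYTHIA_TOKENS, n_params)
    print(f"{n_params:.3g} params -> {ratio:.1f} tokens per parameter")
```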

The loss gap from the compute-optimal frontier appears to be predictable in terms of tokens per parameter. In Figure 3, we plot the percentage loss increase compared to the Cerebras-GPT frontier as a function of tokens per parameter. Here, Cerebras-GPT models cluster at 20 tokens per parameter, and Pythia results show the smooth curve away from the frontier for more tokens per parameter. We also include an estimate of the Chinchilla loss degradation from curve fitting data in their plots (Hoffmann et al. (2022), Figure 3). These results confirm the estimate that compute-optimal pre-training on the Pile should use roughly 20 tokens per parameter, a striking consistency with the Chinchilla results on the MassiveText dataset. Further tokens-per-parameter tests are in Appendix D.

3.2 Downstream Results

We evaluate Cerebras-GPT and publicly-available models on a suite of seven common sense reasoning tasks in both the zero-shot and five-shot settings using the EleutherAI evaluation harness (Gao et al., 2021). In particular, we evaluate models on the tasks HellaSwag, PIQA, WinoGrande, Lambada, ARC (both the easy and challenge versions), and OpenBookQA (Zellers et al., 2019; Bisk et al., 2020; Sakaguchi et al., 2021; Paperno et al., 2016; Clark et al., 2018; Mihaylov et al., 2018). We include more detail about these tasks in Appendix B. In addition to models we evaluate on upstream Pile, we add downstream results for OPT models (Zhang et al., 2022), which were trained on a broader dataset but still using 300B pre-training tokens.

Like the pre-training results, Cerebras-GPT models form the compute-optimal Pareto frontier for downstream tasks as well. Figure 4 summarizes the average downstream task results for both zero- and five-shot evaluations, comparing Cerebras-GPT to GPT-J, GPT-NeoX, and Pythia. (Here, we report accuracy results from each model's predictions using token-level probability, consistent with reported results in the GPT-NeoX paper. We report additional accuracy measures in Appendix C.2.) As Pythia and OPT models grow close to the 20 tokens per parameter count, they approach the Cerebras-GPT FLOPs-to-accuracy frontier. Here again, the Cerebras-GPT 13B model shows the best average downstream result for models of comparable size.

Figure 4 also plots downstream averages against model size in parameters (right column). For each model size smaller than 13B parameters, GPT-J, OPT, and Pythia models show significantly better downstream accuracy than Cerebras-GPT models, as expected. The Pythia and OPT accuracy frontiers deflect from straight lines (power-laws in log-log-scale), whereas Cerebras-GPT frontiers continue, indicating that downstream accuracy is predictable by model size for models trained with fixed tokens-per-parameter. The Cerebras-GPT trend suggests these models would be competitive with GPT-NeoX 20B if scaled to that size.
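A straight line in log-log space corresponds to a power law, which is what makes this kind of extrapolation possible. The sketch below fits a pure power law by least squares in log-log space and extrapolates to a larger model size. It uses synthetic data points and hypothetical function names, and it assumes a metric that decreases with scale (such as loss or error rate) and a fit with no constant offset, which may differ from the fits used for the figures.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b via linear regression on (log x, log y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space (the exponent)
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space (the prefactor)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** b


# Synthetic points generated from a known power law, for illustration only.
sizes = [1e8, 1e9, 1e10]
errors = [3.0 * s ** -0.05 for s in sizes]
a, b = fit_power_law(sizes, errors)
print(a, b, predict(a, b, 2e10))
```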

Finally, Table 2 shows more detailed downstream task comparisons for large publicly-available models, grouped into comparable sizes. We bold the results that are the best for each task and model size group. Each model family has at least one model that is best for some tasks. In this table, we also include results for Pythia models trained on a deduplicated version of the Pile. We separated these results, since they may not be directly comparable to others above, which were trained using the same or similar dataset preparation. As expected from the deduplication process, Pythia models show more difficulty generalizing to the pre-training Pile test loss task than other open models, which might have seen duplicated data during training. However, the Pythia Pile-dedup models typically improve accuracy on downstream tasks (1.8% on average), indicating the potential benefits of deduplication.

3.3 Maximal Update Parameterization (µP) and µTransfer

As we scaled the Cerebras-GPT models with standard parameterization (SP) along our scaling law, we experienced challenges predicting appropriate hyperparameters, and these models show substantial variance around their common scaling law. To address these challenges, we also apply µP and µTransfer-tuned hyperparameters to 111M–2.7B parameter Cerebras-GPT models. Across model sizes, our µP models exhibit an average of 0.43% improved Pile test loss and 1.7% higher average downstream task accuracy compared to our SP models. Here, we also show that µP performance scales more predictably, enabling more accurate performance extrapolation.

As we scaled up models with SP, we found model training could become unstable when configured with hyperparameters used in other prior works. At different model size scales, the numerical characteristics of different layers can cause training instability. These instabilities can lead the practitioner to adjust prior hyperparameters in an effort to work around the issues. (Appendix A.2 describes example stability challenges, such as FP16 mixed precision training causing numerical underflows.) However, moving away from known good configurations can lead to costly tuning efforts and blocked scaling progress. By shifting to µP, we find more stable training dynamics: key metrics like weight and gradient norms behave similarly at different scales.

We see the benefits of µP readily as we scale. First, after tuning hyperparameters with a small 40M parameter model, we were able to use the same learning rate hyperparameters for all model scales, as we noted in Table 1. µP features were the only changes we made to these models, so scaling was very simple.

Second, models show significantly more predictable scaling. Figure 5 plots the percentage loss increase for each SP and µP model relative to the SP scaling law (negative values are improved loss). µP models show an average of 0.43% better Pile test loss compared to the Cerebras-GPT SP scaling law fit. Further, µP models show substantially lower variance with just 0.04% standard deviation relative to the SP scaling law, while SP models show a standard deviation of 0.66% (~16× noisier). For perspective, the run-to-run standard deviation in loss when using different initialization and data random seeds is around 0.35%.

In addition to its pre-training advantages, µP also improves downstream capabilities of these models. In the previous Figure 4, we plotted downstream results for µP models, where we see improved accuracy and distinctively smoother scaling than SP models. Table 3 also lists these zero-shot downstream results for SP and µP models. In particular, µP models show a 1.7% relative improvement in downstream tasks on average. These results are robust across model scales besides the 2.7B parameter model. We believe that we were just lucky when choosing the SP 2.7B model hyperparameters such that it performs significantly better than the SP Pile scaling law. Despite the SP model’s upstream advantage, however, the 2.7B + µP model still performs as well on downstream tasks on average.

Trading Off Training and Inference FLOPs

Up to this point, our analysis has focused on compute-optimal pre-training, where compute cost is proportional to the square of the model’s size, because we train models to a constant number of tokens per parameter. However, recent work has started to also consider model inference costs, showing smaller models trained on more tokens can still significantly improve loss (Hoffmann et al., 2022; Touvron et al., 2023). At inference time, the compute cost is proportional to the model’s size and number of inferences. Thus, smaller models will have an overall inference cost advantage proportional to their size.

We propose a technique to identify training+inference compute-optimal frontiers that practitioners can use to estimate how they should pre-train their models when considering inference deployment costs. Specifically, we define a compute cost metric equal to pre-training FLOPs plus the model's per-token inference cost multiplied by the expected number of inference tokens. Here, $F$ is the total compute cost, $f$ represents FLOPs costs for full pre-training and per-token inference, $n_{\text{infer\_tokens}}$ is the number of expected inference tokens for the given model, and $p$ is the model's parameter count. (Note that the big-$\mathcal{O}$ order relations here could incorporate constant factors to account for model compression, quantization, or other techniques that decrease the relative inference costs.)
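Written out from the definitions in this section, the metric can be sketched as follows. The subscripts on $f$ are labels assumed here to distinguish the two FLOPs costs, and the order relations restate the proportionalities given earlier in this section (pre-training cost quadratic in $p$ at a fixed number of tokens per parameter, per-token inference cost linear in $p$). It is a sketch under those assumptions rather than a definitive statement of the metric.

```latex
\begin{align}
F &= f_{\text{pre-train}} + f_{\text{inference}} \cdot n_{\text{infer\_tokens}} \\
  &\sim \mathcal{O}(p^2) + \mathcal{O}(p) \cdot n_{\text{infer\_tokens}}
\end{align}
```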

With this formulation, we can estimate the number of model inferences before the total compute budget matches models trained on fewer or more tokens. Figure 6 plots a comparison of total pre-train + inference compute cost for Cerebras-GPT, GPT-J, GPT-NeoX, and Pythia models assuming either 20B, 200B, or 2T inference tokens. These results show that most Cerebras-GPT models would provide better Pile test-loss-per-compute-FLOP than Pythia models until all models reach roughly 200B inference tokens. Since this total compute metric forms a continuum trade-off, models pre-trained on some number of tokens in between the Cerebras-GPT and Pythia frontiers are likely to achieve better loss for the same total compute budget.
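The break-even point between two models can be estimated with a short calculation. The sketch below assumes roughly 6 FLOPs per parameter per training token and 2 FLOPs per parameter per inference token, which are common approximations rather than the accounting used for Figure 6. It also assumes the two hypothetical models reach the same loss, and all names and numbers are illustrative.

```python
def total_compute(n_params, n_train_tokens, n_infer_tokens):
    """Pre-training plus inference FLOPs under the stated approximations."""
    # Assumptions: ~6 FLOPs/param/token for training, ~2 for inference.
    return 6.0 * n_params * n_train_tokens + 2.0 * n_params * n_infer_tokens


def break_even_inference_tokens(p_large, d_large, p_small, d_small):
    """Inference tokens at which both models have spent equal total compute.

    The larger model is trained on fewer tokens (d_large); the smaller one
    on more (d_small). Returns None if the smaller model is cheaper for every
    non-negative number of inference tokens.
    """
    if p_large <= p_small:
        raise ValueError("p_large must exceed p_small")
    extra_pretrain = 6.0 * (p_small * d_small - p_large * d_large)
    if extra_pretrain < 0:
        return None
    return extra_pretrain / (2.0 * (p_large - p_small))


# Hypothetical pair: a 2B model at 20 tokens per parameter versus a 1B model
# trained on 300B tokens, assumed to reach the same loss.
n = break_even_inference_tokens(2e9, 40e9, 1e9, 300e9)
print(f"break-even at ~{n:.3g} inference tokens")
```

Below the break-even point the larger, compute-optimally trained model has the lower total cost; above it, the smaller model's cheaper inference outweighs its extra pre-training.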

Following this formulation, organizations and governments can better assess the total costs when budgeting large-scale training runs. Specifically, if a model is to be trained in a pre-training compute-inefficient way using too many data samples, that model may need to be used in a very large number of inferences before the training compute cost can be amortized and well-justified. Similar analysis can be applied to monetary, energy, or carbon footprint costs as well. We encourage the community to consider these total costs when training future models.

Cerebras Stack

To collect our compute-efficient LLM scaling laws, we run all studies on the Cerebras Wafer-Scale Cluster named “Andromeda”, which contains 16 Cerebras CS-2 systems. As far as we are aware, this is the first scaling laws study performed on Cerebras systems, which are capable of simple large-scale model training and high-performance scale-out to many systems. In this section, we describe the Andromeda AI Supercomputer, and the Cerebras software platform (CSoft) that we use for scaling and training. We show that Andromeda performance scales linearly up to the full 16 CS-2s, and we describe the simplicity of training models for this study.

Andromeda is a Cerebras Wafer-Scale Cluster composed of 16 CS-2 systems. Figure 7 shows the architecture of Andromeda, which aligns well with the large-scale parallel nature of deep learning training. Each CS-2 system contains a Cerebras Wafer-Scale Engine (WSE-2) processor, which has 40 GB of high bandwidth SRAM and compute capability of 7.5 PetaFLOP/s half precision peak throughput. The WSE-2’s processing cores are specifically designed to perform all compute operations required for deep learning models. Overall, Andromeda has peak throughput of 120 PFLOP/s from these CS-2s.

Weights and command servers drive the CS-2’s computation by broadcasting the weights and control instructions through a broadcast + reduce tree network. This same network collects and reduces gradients from the CS-2s for each training step. When weights servers receive the reduced gradients, they perform the optimizer step and update model weights. They also save and restore model checkpoints to/from disk.

Activation workers act as servers to handle input data and activations. Each worker reads an independent shard of the dataset from disk and creates subbatches to send to a corresponding CS-2 for training. In cases where models must be trained using activation checkpointing, the CS-2 can evict the activations to the corresponding activation worker, which can later refill the activation on the CS-2 when needed.

5.2 CSoft Platform and Weight Streaming Mode

Andromeda runs deep learning applications through the Cerebras Software Platform (CSoft). For this study, we write and train models in both TensorFlow and PyTorch (reported results are with PyTorch), and CSoft compiles and orchestrates running these models on the hardware. In this process, CSoft automatically selects settings such as data parallel subbatch sizing and gradient accumulation, activation recomputation and checkpointing, and appropriate data layouts and kernel configurations for high performance.

The logical data flow in Figure 7 is called the Weight Streaming mode, because weight servers stream the weights to the CS-2s and collect gradients on each training step. This execution mode permits training models of size only limited by the memory capacity of weight servers, and we have tested the ability to train beyond the full GPT-3 175B parameter model with no changes outside of model configurations.

The Weight Streaming design stands in contrast with existing accelerator execution modes. Recent trends in large language model training typically require parallelizing training across tens to thousands of accelerator devices, such as GPUs. These efforts require complicated combinations of data and model parallelism (e.g., Smith et al., 2022). Models must be carefully divided to fit into memory close to the devices to achieve high throughput at relatively small per-device batch sizes. Weight Streaming permits moving the weights to the wafer and gradients from the wafer, achieving solid performance at small per-system batch sizes without the need for model parallelism.

We find that CSoft Weight Streaming makes it significantly easier to develop and scale models than existing accelerator approaches. First, we were able to run each Cerebras-GPT model and even larger models for many training steps on a single CS-2 system. This capability made it easy to quickly test that features of our model and dataset loader implementations would work well even for very large models. Second, the cluster’s near-linear performance scaling meant that we could accurately estimate total training time for each run as we scaled to more CS-2 systems. Finally, it was easy to configure these large-scale runs: scaling to many CS-2 systems requires changing only the number of systems on which to train, and CSoft automatically chooses the data parallel configurations for us.

5.3 Performance Scalability

Andromeda provides near-linear performance scaling up to the full 16 CS-2s. We show performance (training speed) scaling from our initial model tests, followed by performance scaling results from our actual training runs. First, as Andromeda came online, we tested performance using a weak scaling approach: as we increased the number of systems, we increased the batch size proportionally (here, batch size is the number of sequences of length 2048). We ran 100 training steps for each configuration and took the average training step time over the 100 steps. Table 4 shows the weak scaling performance relative to 1 CS-2. Andromeda achieves linear scaling within 9% for all model sizes and CS-2 system counts.
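The weak scaling measurement described above can be expressed as a short calculation. The sketch below uses hypothetical average step times, not measured values, and the function names are illustrative. Because each system keeps the same per-system batch, ideal scaling means the step time stays constant as systems are added.

```python
def relative_speedup(step_time_one, step_time_n, n_systems):
    """Throughput on n systems relative to one system under weak scaling.

    Each system keeps the same per-system batch, so n systems process
    n times as many sequences per step as one system does.
    """
    if step_time_one <= 0 or step_time_n <= 0 or n_systems < 1:
        raise ValueError("step times must be positive and n_systems >= 1")
    return n_systems * step_time_one / step_time_n


def shortfall_from_linear(step_time_one, step_time_n, n_systems):
    """Fractional shortfall from ideal linear scaling (0.0 is perfectly linear)."""
    return 1.0 - relative_speedup(step_time_one, step_time_n, n_systems) / n_systems


# Hypothetical average step times in seconds over 100 steps.
step_times = {1: 1.00, 4: 1.03, 16: 1.08}
for n, t in step_times.items():
    speedup = relative_speedup(step_times[1], t, n)
    shortfall = shortfall_from_linear(step_times[1], t, n)
    print(n, round(speedup, 2), round(shortfall, 3))
```

A statement such as "linear scaling within 9%" corresponds to a shortfall of at most 0.09 for every configuration.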

We also show that Andromeda achieves high utilization even when strong scaling on batch size. We choose to scale out the fixed batch sizes from our training runs across different numbers of Andromeda systems. When running on fewer CS-2s, if the per-CS-2 batch size requires too much memory to fit in each WSE-2’s on-wafer SRAM, the software stack automatically selects a smaller per-CS-2 batch size and accumulates gradients up to the user’s chosen batch size. Table 5 lists the relative performance compared to running on a single CS-2. These results show consistent performance scalability for the batch sizes commonly chosen for these models.
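The interaction between a fixed global batch size and gradient accumulation can be illustrated with a simple planner. This is not CSoft's actual selection logic, which is not described here; it is a minimal sketch assuming an even split of the global batch across systems and a hypothetical per-system memory limit expressed as a maximum micro-batch size.

```python
import math


def plan_accumulation(global_batch, n_systems, max_micro_batch):
    """Return (micro_batch, accumulation_steps) for each system.

    Picks the fewest accumulation steps such that the per-system batch
    splits evenly into micro-batches no larger than max_micro_batch.
    """
    if n_systems < 1 or max_micro_batch < 1:
        raise ValueError("n_systems and max_micro_batch must be at least 1")
    if global_batch < n_systems or global_batch % n_systems != 0:
        raise ValueError("global_batch must divide evenly across systems")
    per_system = global_batch // n_systems
    steps = math.ceil(per_system / max_micro_batch)
    # Increase the step count until the split is even; this terminates
    # because steps == per_system always divides evenly.
    while per_system % steps != 0:
        steps += 1
    return per_system // steps, steps


# Batch size of 720 sequences (from the 13B run) with a hypothetical limit
# of 50 sequences per micro-batch.
for n_systems in (1, 4, 16):
    print(n_systems, plan_accumulation(720, n_systems, max_micro_batch=50))
```

With fewer systems, each system accumulates gradients over more micro-batches to reach the same global batch, so the optimizer sees the same batch size regardless of system count.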

Finally, as we increased model sizes along our scaling law, we tested and compared the cluster’s FLOP/s utilization for each training run. Table 6 lists Andromeda’s utilization relative to the 111M parameter model running on one CS-2. Performance deviates by less than 8% at all model scales. In addition to robust scaling across many machines, these results indicate consistent performance across a range of model and batch sizes.

Related Work

Early deep learning scaling law studies show that when scaling dataset and model size, loss improves predictably (Hestness et al., 2017; Kaplan et al., 2020). These studies indicate generally that scaling could give substantial modeling improvements. From this observation, many organizations scaled to train the largest possible models on their available infrastructure: GPT-3 175B (Brown et al., 2020), Jurassic-1 178B (Lieber et al., 2021), Gopher 280B (Rae et al., 2022), HyperCLOVA 82B (Kim et al., 2021), Ernie 3.0 Titan 260B (Wang et al., 2021), Yuan 1.0 (Wu et al., 2021), PanGu-α (Zeng et al., 2021), Megatron-Turing NLG 530B (Smith et al., 2022), PaLM 540B (Chowdhery et al., 2022), and LaMDA 137B (Thoppilan et al., 2022). These models show significant performance improvement on many downstream tasks compared to prior language models. However, these studies typically only scale model size without scaling the dataset size as suggested by the early works, often training on roughly 300B tokens. Further, these models could only be trained by select organizations with large compute clusters, and the datasets and resulting pre-trained models have not been released publicly for analysis by the research community.

The research community has released large datasets and pre-trained models—typically much smaller than the largest models above but still quite valuable—and we have noted many of them previously: GPT-J, GPT-Neo, GPT-NeoX, OPT, and Pythia. Another notable work that releases dataset and model is the Big Science collaborative effort to train BLOOM 176B (Scao et al., 2022; 2023). These studies, datasets, and models enable the community to test, compare, and use large language models they would otherwise not have access to or compute budget to train.

In 2022, studies started revisiting early scaling works to note that although model size scaling improves performance, consistently scaling the dataset size is still critical to get the best possible models. Hoffmann et al. (2022) show that for compute-optimal pre-training, the dataset size should grow linearly with transformer model size in parameters, and they scaled training up to a 70B parameter model on 1.4T tokens. The dataset and models are not publicly available, so our work aims to reproduce these results to offer the community an open and reproducible scaling law. Recently, the LLaMA paper (Touvron et al., 2023) also reproduces large models trained on large open datasets to improve pre-training. Although these models show strong performance, most are trained in a compute-inefficient way by training on larger datasets than would be compute-optimal for the given model sizes. LLaMA models are available through request. The resulting models trained in these works perform better than prior larger models trained on smaller datasets.

Large language model training is prone to instability, and it is very costly when large model training runs fail due to instability. Various techniques have been developed to control training dynamics and train models stably (Glorot & Bengio, 2010; Yang & Schoenholz, 2017; Schoenholz et al., 2017; Yang & Schoenholz, 2018; Zhang et al., 2019; Bachlechner et al., 2020; Huang et al., 2020; Liu et al., 2020; Li et al., 2022). µP is the first comprehensive method to analytically control width-related training instabilities and allow optimal hyperparameters of small models to be the same as optimal hyperparameters for very large models. We find that the comprehensive nature of µP simplifies our training efforts, so we feel it is useful to share our experience and encourage the community to use it rather than considering combinations of other techniques.

Limitations

In this work, we train well-established model architectures to create foundation models, but we did not explore recent architectural features, downstream task tuning procedures, or dataset cleaning approaches used in contemporary works. Model features worth exploring in future work include position embeddings, such as RoPE (Su et al., 2022) and ALiBi (Press et al., 2022), and activation functions, like SwiGLU (Shazeer, 2020). There are also training paradigms worth exploring, such as denoising pre-training objectives (Tay et al., 2023) and instruction fine-tuning (Ouyang et al., 2022). Finally, we expect that additional dataset cleaning can further improve pre-trained models. For instance, our testing in Appendix C.2 shows that the Pythia models improve downstream task accuracy when trained on a deduplicated version of the Pile.

We have not yet tested Cerebras-GPT models extensively in downstream tasks or in real application settings. Specifically, we have not tested for factual accuracy, profanity, toxicity, or other socially undesirable text generation. We do evaluate the bias of our Cerebras-GPT models using the CrowS-Pairs dataset in Appendix C.4. Further safety-related testing, mitigations, and output curation should be applied to our pre-trained models before presenting results to users. Please refer to the model card in the Appendix, Table 7.

Conclusion

In this paper, we introduce Cerebras-GPT, a family of open models scaled from 111M to 13B parameters and pre-trained in a compute-optimal way on the Pile dataset. These models show state-of-the-art training efficiency on both pre-training and downstream objectives when compared to other open-source models. We believe this is the first such open effort, and we provide detailed instructions to reproduce our results and release our pre-trained model checkpoints (pre-trained models are available on HuggingFace: https://huggingface.co/cerebras; source code is available in the Cerebras Modelzoo: https://github.com/Cerebras/modelzoo). We combine this scaling with µP, a comprehensive technique to improve large model stability, and we show it further improves our scaling results. We document our experience training these models on the Andromeda AI Cluster, comprising 16 Cerebras CS-2 systems, and we describe the simplicity of scaling models and performance.

References

Model Card

Table 7 shows the model card for the largest Cerebras-GPT model, following the guide in Mitchell et al. (2019).

Cerebras-GPT Open-Source References

We release our pre-trained models and code, so the community can use and reproduce our results. Pre-trained models are available on HuggingFace: https://huggingface.co/cerebras. We are initially releasing seven Cerebras-GPT models with 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B parameters trained with standard parameterization (SP). These models are released under Apache 2.0 license, which permits commercial and non-commercial use. Source code is available in the Cerebras Modelzoo: https://github.com/Cerebras/modelzoo. We hope these models will be a valuable addition for the open-source community.

Author Contributions and Acknowledgements

We would like to acknowledge the contributions of those who helped in preparation of this manuscript.

Experimental planning and strategy: Nolan Dey, Joel Hestness
Model training: Zhiming (Charles) Chen, Hemant Khachane, Ribhu Pathria, Gurpreet Gosal
Dataloader development and dataset preparation: Gurpreet Gosal
Numerical configuration and validation: Joel Hestness, Hemant Khachane, Gurpreet Gosal
Upstream loss comparisons: Gurpreet Gosal, Charles Chen
Downstream task comparisons: William Marshall
Manuscript preparation: Nolan Dey, Joel Hestness, Gurpreet Gosal, William Marshall
Overall project leadership: Joel Hestness, Marvin Tom
Overall technical leadership: Joel Hestness

In addition, we would like to thank others who helped in the preparation of this work. Bowen Yang and Faisal Al-Khateeb helped prepare the Pile dataset. We are also thankful for helpful feedback on the manuscript provided by Sean Lie, Anshul Samar, and Vithu Thangarasa. In general, we would also like to acknowledge the contributions of the many Cerebras engineers who made this work possible.

Appendix A Methods Details

We preprocess the Pile using tools and instructions provided by Eleuther and the community. We clean the raw text data sources using the ftfy library to normalize text, including cleaning corrupted unicode (Speer, 2019). Our tokenized version of the Pile training set contains roughly 371B tokens (validation 380M, test 371M), similar to results reported in the GPT-NeoX paper (Black et al., 2022). The resulting tokenized dataset files contain contiguous samples from the raw text. For the best model generalization, we find it critical to shuffle samples across all training set documents, rather than shuffling within a window of even a few thousand documents. So, we also shuffle the training dataset across all documents as a final preprocessing step. This dataset-wide shuffling improves validation loss by 0.7-1.5% compared to the aggressive shuffling settings over sets of contiguous documents that we tested with our dataloaders.
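The difference between window-limited and dataset-wide shuffling can be illustrated with a small sketch. It is not the preprocessing code used for this work; the function names, the window size, and the integer stand-ins for samples are all hypothetical. It only shows that a windowed shuffle leaves every sample close to its original position, while a global shuffle does not.

```python
import random

def windowed_shuffle(samples, window, seed=0):
    """Shuffle only within consecutive windows of `window` samples."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return out

def global_shuffle(samples, seed=0):
    """Shuffle across the entire dataset in one pass."""
    rng = random.Random(seed)
    out = list(samples)
    rng.shuffle(out)
    return out

def max_displacement(shuffled):
    """Largest distance any sample moved from its original index.

    Assumes the samples are the integers 0..n-1 in their original order.
    """
    return max(abs(pos - original) for pos, original in enumerate(shuffled))

samples = list(range(10_000))  # stand-ins for tokenized samples
local = windowed_shuffle(samples, window=100)
full = global_shuffle(samples)
print("windowed max displacement:", max_displacement(local))
print("global max displacement:", max_displacement(full))
```

With a windowed shuffle, neighbouring batches still draw from the same few source documents, which is the situation the dataset-wide shuffle is meant to avoid.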

The Pile dataset has been thoroughly analyzed from various ethical standpoints, and the dataset is known to contain content considered toxic, gender biased, pejorative, racially sensitive, etc. Please refer to Pile dataset references for further information.

A.2 Ensuring Stable Training

As we scaled up models to larger sizes, we encountered and resolved a few issues that improve training stability. We share some details here in hopes they assist others in their scaling efforts.

Mixed Precision Training: Initially, we trained models using FP16 mixed precision, a technique that carries model weights and activations in IEEE half precision floating-point (FP16) while performing dot-products and reductions in single precision 32-bit (FP32). This approach ensures that reductions maintain precision, while taking advantage of the smaller 16-bit data format for storing activations. Because FP16 has a significantly reduced exponent range compared to FP32, models need to be trained with loss scaling, an approach that multiplies the gradients by a large positive value before back-propagation, and then divides out this multiplier just before applying the calculated gradients to the weights in the optimizer step. A dynamic approach to loss scaling sets the scale value by periodically testing larger values to see the largest scale such that the gradients do not overflow FP16.
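A minimal sketch of dynamic loss scaling, written with NumPy, may make the mechanism concrete. It is an illustration under assumptions, not the implementation used for these models: the class name, the initial scale, the growth interval, and the halve/double policy are hypothetical choices, although they follow the common pattern of shrinking the scale on overflow and periodically trying a larger one.

```python
import numpy as np

def to_fp16(x):
    """Cast to FP16; values beyond the FP16 range become inf."""
    with np.errstate(over="ignore"):
        return np.asarray(x, dtype=np.float32).astype(np.float16)

class DynamicLossScaler:
    """Hypothetical dynamic loss scaler (illustrative policy, not the paper's)."""

    def __init__(self, init_scale=2.0**16, growth_interval=2000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.factor = factor
        self._good_steps = 0

    def unscale_and_check(self, scaled_grads):
        """Return (grads, ok) for FP16 gradients of (loss * scale)."""
        finite = all(np.isfinite(g).all() for g in scaled_grads)
        if not finite:
            # Overflow: skip this optimizer step and lower the scale.
            self.scale /= self.factor
            self._good_steps = 0
            return None, False
        # Divide the multiplier back out in FP32 before the optimizer step.
        grads = [g.astype(np.float32) / self.scale for g in scaled_grads]
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # Periodically test a larger scale.
            self.scale *= self.factor
            self._good_steps = 0
        return grads, True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
true_grad = np.array([1e-3, 100.0], dtype=np.float32)
for step in range(3):
    scaled = to_fp16(true_grad * scaler.scale)
    grads, ok = scaler.unscale_and_check([scaled])
    print(step, ok, scaler.scale)
```

In the usage loop, the first step overflows FP16 and is skipped, after which the reduced scale lets the gradients pass and the scaler begins probing larger values again.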

We found that for models larger than roughly 1.3B parameters (hidden size 2048), loss scaling alone was not sufficient to ensure stable model training. As model weights grow during training, the gradients through layers like softmaxes can become eccentric, leading to large gradient values that tend to overflow. These large values push down the maximum allowed loss scale and cause other gradients to be very small. Very small gradient values have a tendency to underflow in FP16. Underflow can cause weights to receive either no gradient or low-precision, eccentric gradients, which can further exacerbate dynamic loss scale and underflow.

Underflows and Weight Growth: We detect underflows by observing any significant increase in the number of identically zero values in tensors as they go through cast operations from FP32 to FP16. Specifically, we find that attention layer softmax gradients are particularly susceptible to underflow. To fix this issue, we recommend carrying gradients in FP32 from the softmax back through the corresponding query-key dot-product and when calculating the gradients for the query and key projection weights and biases. We have tested various open-source mixed precision attention implementations that suffer this same issue.
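The underflow check described above can be sketched as follows. This is a simplified illustration with NumPy rather than the instrumentation used in training; the function name and the example tensor are hypothetical. It counts entries that are nonzero in FP32 but become exactly zero after the cast to FP16.

```python
import numpy as np

def underflow_fraction(x_fp32):
    """Fraction of entries that are nonzero in FP32 but exactly zero in FP16."""
    x_fp32 = np.asarray(x_fp32, dtype=np.float32)
    x_fp16 = x_fp32.astype(np.float16)
    lost = (x_fp32 != 0) & (x_fp16 == 0)
    return float(np.count_nonzero(lost)) / x_fp32.size

# Hypothetical gradient values: two of the five are too small for FP16.
grads = np.array([1e-3, 1e-9, -1e-10, 0.0, 2.5e-5], dtype=np.float32)
print("underflow fraction:", underflow_fraction(grads))
```

Tracking this fraction per tensor over training steps gives a simple signal: a sustained rise for a particular layer (for example, the attention softmax gradients) suggests that layer's gradients should be carried in FP32.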

We also find specific layers to be most susceptible to eccentric gradients caused by underflow. In the attention layers, the bias weights of the keys projection, specifically, have expected value close to zero early in training. If gradients to these weights partially underflow, the remaining gradients will be eccentric and large relative to their expectation. These K bias weights will tend to grow very quickly under these circumstances. We detect this issue by inspecting weight growth—measuring the weight standard deviation and norms—over many training steps compared to an implementation that uses FP32.

Switching to bfloat16: Another approach to avoid underflows is to use a larger exponent range for activation and gradient tensors. Brain floating-point (bfloat16) is a numerical format introduced by Google Brain and used in various hardware platforms to improve half precision floating point range. Specifically, bfloat16 has 8 bits of exponent compared to 5 bits for FP16. Typical bfloat16 model training implementations still use FP32 for intermediate values (mixed precision) in reduction operations to ensure mantissa precision.
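The effect of the extra exponent bits can be worked out directly from the bit layouts. The sketch below assumes the standard IEEE-style encoding with 10 mantissa bits for FP16 and 7 for bfloat16 (the mantissa widths are not stated above), and the helper name is hypothetical.

```python
def float_format_range(exponent_bits, mantissa_bits):
    """Return (smallest positive normal value, largest finite value)."""
    bias = 2 ** (exponent_bits - 1) - 1
    min_normal = 2.0 ** (1 - bias)
    max_finite = (2.0 - 2.0 ** (-mantissa_bits)) * 2.0 ** bias
    return min_normal, max_finite

fp16 = float_format_range(exponent_bits=5, mantissa_bits=10)
bf16 = float_format_range(exponent_bits=8, mantissa_bits=7)
print(f"FP16:     min normal {fp16[0]:.3e}, max {fp16[1]:.3e}")
print(f"bfloat16: min normal {bf16[0]:.3e}, max {bf16[1]:.3e}")
```

The smallest normal bfloat16 value is many orders of magnitude below that of FP16, which is why small gradients are far less likely to underflow, at the cost of fewer mantissa bits.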

Bfloat16 eliminates the need for dynamic loss scaling that is used with mixed precision, because the exponent range significantly reduces the likelihood of underflows. We find that although bfloat16 does not completely eliminate low-precision training dynamics concerns, it does significantly improve training stability, so we use bfloat16 for all final models that we train in this paper and release publicly. We find that our experience with bfloat16 training stability is consistent with prior works.

Setting Adam Epsilon: When gradients for a set of weights are small, using a relatively large Adam epsilon value can cause weights to grow slowly. This might be an appealing approach in the presence of large weight growth among weights that are expected to be small. However, a large Adam epsilon can cause very poor weight resolution and degrade model quality. Given the Adam update at step $t$ on weights $\theta$:

$$\theta_{t} = \theta_{t-1} - \eta \, \frac{m_{t}}{\sqrt{v_{t}} + \epsilon}$$

Here, $\eta$ is the learning rate, $m_{t}$ is the momentum, a running average of the gradient, and $v_{t}$ is the velocity, a running average of the squared gradient. When gradients to a weight are small (e.g., in the case of the K bias weight growth above), $v_{t}$ will tend to be very small, because it averages squared gradients. In this case, $\epsilon$ needs to be chosen to be small relative to each $\sqrt{v_{t}}$, or the Adam update denominator will be large, causing the weight updates to be small. As a rule of thumb, we find $\epsilon$ should be less than $\sqrt{\mu_{v}}/1000$, where $\mu_{v}$ is the mean of the velocity state weights, to ensure models do not suffer from stagnant weight growth. This analysis is how we choose to lower epsilon for our 6.7B and 13B parameter models.
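The rule of thumb can be checked numerically with a short sketch. It assumes the Adam update takes the form $\theta_{t} = \theta_{t-1} - \eta\, m_{t} / (\sqrt{v_{t}} + \epsilon)$ without bias correction, and the function names and example values are hypothetical.

```python
import numpy as np

def max_recommended_epsilon(v):
    """Rule-of-thumb upper bound: sqrt(mean(v)) / 1000."""
    return float(np.sqrt(np.mean(v))) / 1000.0

def adam_step(m, v, lr, eps):
    """Per-weight update magnitude, lr * m / (sqrt(v) + eps)."""
    return lr * m / (np.sqrt(v) + eps)

# Hypothetical steady state for a weight with constant gradient 1e-6:
# momentum m equals the gradient and velocity v equals its square.
g = 1e-6
m = np.full(4, g)
v = np.full(4, g ** 2)
lr = 1e-4

bound = max_recommended_epsilon(v)
print("epsilon bound:", bound)
print("step with eps = bound / 10:", adam_step(m, v, lr, bound / 10)[0])
print("step with eps = 1e-5:      ", adam_step(m, v, lr, 1e-5)[0])
```

With epsilon below the bound, the step stays close to the learning rate, as Adam intends for a consistent gradient. With epsilon much larger than $\sqrt{v_{t}}$, the same weight moves more than ten times more slowly.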

Appendix B Downstream Task Details

We evaluate our models on the following six downstream tasks in both the zero-shot and the few-shot setting. Here, we briefly describe each of the tasks: HellaSwag, PIQA, WinoGrande, Lambada, ARC, and OpenBookQA.

HellaSwag is a dataset of multiple choice questions aimed to test a model’s common sense reasoning abilities (Zellers et al., 2019). For example,

The authors of the dataset adversarially select examples such that they are difficult for language models while still trivial for humans (with reported greater than 95% accuracy).

PIQA tests a model’s common sense reasoning about the physical world by posing a prompt and two potential completions (Bisk et al., 2020). For example

The model must choose which of the two continuations is more likely to follow from the prompt. Human performance on this dataset is approximately 95%.

WinoGrande consists of a set of pronoun resolution problems (Sakaguchi et al., 2021). Samples are constructed as pairs of similar sentences, each with a pronoun referring to a noun earlier in the sentence. The task is to predict which noun the pronoun refers to. For example, in the sample

in sentence (a), “it’s” refers to “trophy”, while in sentence (b), changing a single context word modifies the meaning of the sentence such that “it’s” now refers to “suitcase”.

Lambada is a word prediction task that tests a model’s ability to understand text, with a particular emphasis on global context (Paperno et al., 2016). For example

There are two versions of the Lambada dataset. The original version is the one published by Paperno et al. (2016). However, researchers more commonly use a version of the dataset with slightly different formatting that was created by Radford et al. in order to evaluate their GPT-2 model (Radford et al., 2019). In our evaluations we use the latter version, referred to as “lambada_openai” in the Eleuther eval harness (Gao et al., 2021).

ARC tests a model’s ability to answer multiple choice science questions (Clark et al., 2018). For example

This dataset is split into an “easy” set and a “challenge” set where samples are selected for the challenge set if they are answered incorrectly by word co-occurrence and retrieval based algorithms.

OpenBookQA is a multiple choice common sense question answering dataset (Mihaylov et al., 2018). One example question from this dataset is

Appendix C Additional Results

In Figure 8, we show the intermediate Pile test losses achieved throughout training for Pythia and Cerebras-GPT models. For all model sizes and compute budgets, Cerebras-GPT models tend to have a similar trajectory when approaching their final results along the scaling law. In contrast, Pythia models trained for more tokens per parameter follow a less efficient trajectory, trending away from the scaling law and indicating their over-training. For Pythia models trained closer to 20 tokens per parameter, the trajectories align more closely with Cerebras-GPT models.

C.2 Complete Downstream Task Testing

For completeness, we include all downstream task results we collected for this study. Table 8 includes upstream Pile evaluations and all downstream zero-shot tasks for models GPT-J, GPT-NeoX, OPT, and Pythia, as well as Cerebras-GPT. Similarly, Table 9 shows the few-shot (five-shot) results for all models. Full downstream results are plotted in Figures 10 and 11.

Some prior works use a different method to select model predictions when evaluating the model’s accuracy on some downstream tasks. Specifically, there are two commonly used techniques to select a model’s prediction. First, the model can predict the probability of an output (continuation) sequence given a context sequence. Here, the selection criterion is to select the continuation with maximum probability. We use this maximum probability approach in all results reported earlier in the paper to be consistent with results in the GPT-NeoX paper. The second approach is to normalize the model’s predicted probability in the log domain by the length of the continuation, and choose the continuation with the smallest length-normalized negative log-likelihood (NLL), $\text{argmin}_{i}(-\ln(p_{i})/|c_{i}|)$, where $p_{i}$ is the model’s predicted probability of continuation sequence $c_{i}$, and $|c_{i}|$ is the length of that sequence. This approach will tend to favor longer continuations with moderate probability, which might be preferred for some tasks. For comparison against prior works that report minimum length-normalized NLL, we report Cerebras-GPT results in Tables 10 and 10.
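The two selection rules can disagree on the same set of candidates, as the following sketch shows. The function names and the example probabilities are hypothetical, and measuring continuation length in tokens is an assumption (the text only says "length").

```python
import math

def select_max_probability(logprobs):
    """Index of the continuation with the highest total log-probability."""
    return max(range(len(logprobs)), key=lambda i: logprobs[i])

def select_length_normalized(logprobs, lengths):
    """Index minimizing the length-normalized NLL, -ln(p_i) / |c_i|."""
    return min(range(len(logprobs)), key=lambda i: -logprobs[i] / lengths[i])

# Hypothetical candidates: a short, likely continuation and a long,
# moderately likely one.
logprobs = [math.log(0.20), math.log(0.05)]
lengths = [2, 10]

print("max probability picks:   ", select_max_probability(logprobs))
print("length-normalized picks: ", select_length_normalized(logprobs, lengths))
```

Here the maximum-probability rule chooses the short continuation, while the length-normalized rule chooses the longer one, because its per-unit NLL is lower.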

C.3 Differences Between Cerebras-GPT and Other Models

The Cerebras-GPT 13B model improves over other publicly-available models of comparable size. This is a surprising result given that the creators of these other models modified the original GPT-2/3 architecture intending to improve convergence and training efficiency. There are many confounders that could contribute to Cerebras-GPT’s advantages, but we briefly list known differences here to give an idea of the space of possible opportunities for future study.

GPT-J, GPT-NeoX, and Pythia models use rotary positional embeddings, which show modest loss/accuracy improvements and ability to extend to longer sequence lengths (Su et al., 2022). Cerebras-GPT uses standard trainable positional embeddings.

Some GPT-J variants disable bias weights for fully-connected layers in transformer attention blocks. Other studies explain that disabling biases can increase accelerator utilization without loss degradation (Chowdhery et al., 2022; Dehghani et al., 2023). We believe this approach might also improve training stability issues caused by key projection bias weights growth as we describe in Appendix A.2. We have not tested the effects on loss/accuracy from disabling these bias weights.

GPT-J, GPT-NeoX, and Pythia use a parallel structure for attention and feed forward layers (Black et al., 2022). This residual architecture has been reported to cause degradation in model performance at similar scales of models, so it is typically only adopted to increase accelerator utilization (Chowdhery et al., 2022). OPT and Cerebras-GPT models use the standard GPT-2 transformer block, which orders attention sequentially before the feed forward layers.

GPT-NeoX and Pythia models use vocabulary and tokenization designed specifically for the Pile dataset (Black et al., 2022). The resulting vocabulary is different in a few ways from the GPT-2/3 vocabulary. GPT-J, OPT, and Cerebras-GPT models use the GPT-2/3 vocabulary and tokenizer.

Pythia models also include those that were trained on a deduplicated version of Pile (Biderman et al., 2023). These models show an average of 1.2% advantage on downstream tasks. This indicates further opportunity to improve models with further dataset curation.

OPT models are trained on a dataset combining the datasets used for RoBERTa, the PushShift.io Reddit dataset, and Pile, along with their own dataset pre-processing (Zhang et al., 2022).

C.4 Bias

Language models carry with them the risk of causing harm through the propagation of bias, toxicity, and other negative traits found in their training data. Accordingly, it is important to test models for such biases. We evaluate our models on the CrowS-Pairs dataset (Nangia et al., 2020), which measures bias across nine different categories. In Table 11, we compare bias measurements for our family of models to Pythia 70M–12B, as well as three well regarded baselines: GPT-3 175B (Brown et al., 2020), OPT 175B (Zhang et al., 2022), and LLaMA 65B (Touvron et al., 2023).

The Cerebras-GPT models exhibit less bias on average than any of the larger model baselines. However, Cerebras-GPT 13B does show bias greater than GPT-3, OPT, or LLaMA on six of the nine bias categories, indicating that compute-efficient pre-training is not immune to large bias.

When observing bias levels across Cerebras-GPT or Pythia models, we see that biases tend to grow with model size. The Cerebras-GPT models tend to show a larger range of bias values over the growing model sizes (e.g., gender), while Pythia models sometimes show similar bias across model sizes (e.g., disability). This suggests that models trained on a fixed dataset size may be likely to extract similar levels of bias regardless of model size. On the other hand, more compute-efficient training (smaller datasets for smaller models) might mitigate some bias issues compared to models trained on more data.

Finally, we note that when comparing Cerebras-GPT and Pythia models trained with similar compute budgets, Pythia models tend to have slightly lower bias. In particular, the Cerebras-GPT models 1.3B, 2.7B, 6.7B, and 13B use similar compute to Pythia models 160M, 410M, 2.8B, and 12B, respectively. These Cerebras-GPT models show roughly 1-2% higher bias on average. This suggests that bias is more efficiently extracted from the training data when using a compute-optimal pre-training setup, possibly due to the larger model sizes.

Overall, we recommend further bias evaluation and mitigations for Cerebras-GPT and larger models if deploying them in production settings.

Appendix D Additional Tokens-per-parameter Experiments

Here, we give more evidence that 20 tokens-per-parameter is nearly compute-optimal when pre-training GPT-like models on the Pile dataset.

First, in Figure 3 in Section 3, we include a curve that estimates the Chinchilla loss degradation when changing the number of tokens per parameter for which a model is pre-trained. To get that estimate, we start by fitting a curve to points we estimate from the Chinchilla paper plot (Figure 3), which shows loss for different model sizes and tokens trained with fixed compute budgets. Our approach is likely to introduce some error, because we read points off a plot and do not know the true functional form of their fits. However, we pull points from three different FLOP levels, and we validate that our curve fit has low error for a fourth held-out FLOP level. It seems surprising that large changes in model size and training tokens do not appear to have a large effect on the expected degradation from computationally-inefficient training, but this might indicate another invariant to training scale.

We use the Chinchilla curve fit to estimate the proportional loss degradation, $\Delta\mathcal{L}$, when changing tokens-per-parameter, $\tau$ (this is the Chinchilla trend plotted in Figure 3):

This degradation formulation shows good agreement with our tests. Further, Chinchilla models were trained on MassiveText, while our models and Pythia models were trained on the Pile, suggesting the two datasets have significant commonality in their scaling characteristics when training on more than 20 tokens per parameter.

Our Tokens-per-parameter Experiments

Figure 12 (left) plots the loss degradation (%) from our Cerebras-GPT compute-efficient scaling law for different tokens per parameter (similar to Figure 3), and we add our small scale experiments around 20 tokens per parameter for model sizes 111M, 256M, and 590M parameters. The right plot in Figure 12 zooms in on the region from 15 to 50 tokens per parameter. The right plot shows that losses are quite stable between 20 and 40 tokens per parameter and our empirical best results are between 20 and 30 for each of the three models. We note that the variance in loss for runs at this scale is roughly 0.35%, so most loss values here are also within expected run-to-run variance. Based on these results and the strong agreement between Chinchilla and Pythia results, we are comfortable concluding that 20 tokens-per-parameter is nearly compute-optimal for these models trained on the Pile dataset.

Appendix E Number of Model Parameters

Table 12 shows the formula we use to calculate parameter counts for Cerebras-GPT models.

Appendix F Number of Training FLOPs

We calculate the number of training FLOPs with a formula similar to Chinchilla, but with two modifications. First, we account for the dot product between $\text{softmax}(QK^{T})$ and $V$. Second, we account for the fact that embedding layers do not need to calculate a delta gradient for earlier layers. Code for FLOPs count is in Table 13. We consider this FLOP calculation to be a measure of the algorithmic calculations required for forward and backward gradient steps of training, or “Algorithmic FLOPs”. This formula does not include any additional FLOPs for things like activation checkpointing and recomputation or specifics related to software or hardware implementation.
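For a quick order-of-magnitude cross-check, the widely used approximation of 6 FLOPs per parameter per token (Kaplan et al., 2020) can be combined with the 20 tokens-per-parameter budget. The sketch below is not the Table 13 formula: it ignores the attention and embedding corrections described above, it uses nominal parameter counts, and its function names are hypothetical.

```python
def chinchilla_token_budget(n_params, tokens_per_param=20):
    """Training tokens for a model at a fixed tokens-per-parameter ratio."""
    return n_params * tokens_per_param

def approx_training_flops(n_params, n_tokens):
    """Coarse estimate: 6 FLOPs per parameter per token (forward + backward)."""
    return 6 * n_params * n_tokens

# Nominal sizes for three of the released models.
model_sizes = {"111M": 111e6, "1.3B": 1.3e9, "13B": 13e9}
for name, n in model_sizes.items():
    d = chinchilla_token_budget(n)
    flops = approx_training_flops(n, d)
    print(f"{name}: {d:.3g} tokens, ~{flops:.3g} FLOPs")
```

Because the token budget grows linearly with parameters, this estimate grows quadratically with model size, which is useful for sanity-checking compute budgets before using the more detailed count.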

Appendix G Additional µP Details

We spent time orienting ourselves to use µP, so this section describes that process for other practitioners.

In Table 14 we detail all the changes required to implement µP in GPT-like models, and we describe these changes in more detail here.

µP adds a tunable embedding output activation multiplier, $m_{\text{emb}}$, which is multiplied by the sum of token and position embeddings. This multiplier controls relative activation and gradient magnitudes between the embedding layers and the transformer backbone.

Similarly, to control the relative gradient magnitude between embeddings and the transformer backbone, the model’s output logits activation tensor (pre-softmax) is multiplied by $1/m_{\text{width}}$ (when using shared embedding and output weights).

To control the expected magnitude of activations in the transformer blocks, µP scales the initialization variance for each fully-connected layer’s weights by $1/m_{\text{width}}$.

To control the relative weight magnitudes throughout training, µP scales the learning rate of each fully-connected layer’s weights by $1/m_{\text{width}}$.

Under the assumption that query and key projections have significant alignment later in training, µP scales the key-query dot-product activations by $1/d_{\text{head}}$, rather than using the $1/\sqrt{d_{\text{head}}}$ scaling originally proposed in Vaswani et al. (2017).

The next subsection describes how we tune the three transferable µP hyperparameters: the base learning rate ($\eta_{\text{base}}$), the base initialization standard deviation ($\sigma_{\text{base}}$), and the embedding output multiplier ($m_{\text{emb}}$).
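The scaling rules listed above can be collected into one small helper. This is a sketch under assumptions rather than the Modelzoo implementation: it assumes the width multiplier is $m_{\text{width}} = d_{\text{model}} / d_{\text{model,base}}$, which this section does not spell out, and the dataclass and function names are hypothetical. The base values in the usage example are the tuned values reported in the hyperparameter search subsection, and the target width and head size are made up.

```python
import math
from dataclasses import dataclass

@dataclass
class MuPSettings:
    init_std: float                 # std for fully-connected weight init
    hidden_lr: float                # learning rate for fully-connected weights
    output_logit_multiplier: float  # applied to pre-softmax logits
    attention_scale: float          # applied to key-query dot products
    embedding_multiplier: float     # applied to token + position embeddings

def mup_settings(d_model, d_model_base, d_head, eta_base, sigma_base, m_emb):
    # Assumption: the width multiplier is the ratio of target to base width.
    m_width = d_model / d_model_base
    return MuPSettings(
        # Variance scaled by 1/m_width, so std is scaled by 1/sqrt(m_width).
        init_std=sigma_base / math.sqrt(m_width),
        hidden_lr=eta_base / m_width,
        output_logit_multiplier=1.0 / m_width,
        # 1/d_head instead of the usual 1/sqrt(d_head).
        attention_scale=1.0 / d_head,
        embedding_multiplier=m_emb,
    )

# Hypothetical target model that is 4x wider than a 256-wide proxy.
settings = mup_settings(d_model=1024, d_model_base=256, d_head=64,
                        eta_base=6e-3, sigma_base=0.08, m_emb=10)
print(settings)
```

Only the width-dependent quantities change between the proxy and the target; the three base hyperparameters are tuned once on the proxy and reused.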

G.2 µP Hyperparameter Search Details

We tune µP hyperparameters on a 40M parameter proxy model, and µTransfer those hyperparameters to our models with 111M–2.7B parameters. Figure 13 shows the results of a 200-sample random hyperparameter search on a 40M parameter proxy model ($d_{\text{model}}=d_{\text{model,base}}=256$, $n_{\text{layers}}=32$, $d_{\text{head}}=128$) trained on 600M tokens with a batch size of 131k tokens. From this sweep we obtained the following tuned hyperparameters for our µP models: $\eta_{\text{base}}=6\text{e-}3$, $\sigma_{\text{base}}=0.08$, $m_{\text{emb}}=10$. These hyperparameters also closely match those used by Yang et al. (2021).

G.3 Advice for Practitioners

The µP paper suggests small proxy models should be trained with batch sizes similar to the larger target model to which you would like to µTransfer the hyperparameters. We find, however, that the choice of proxy model batch size can be more flexible as long as it is large enough. When tuning on a small model, Yang and Hu choose batch size to be quite large such that it is an appropriate batch size for the largest models to which they µTransfer those hyperparameters. We find that proxy model tuning does not need to be performed on a batch size appropriate for the largest model. Rather, batch size must be sufficiently large such that the proxy model’s gradient dynamics are likely to be consistent with the larger models. More specifically, setting the proxy model’s batch size larger than its critical batch size (dictated by gradient noise (McCandlish et al., 2018)) is sufficient to get good hyperparameter transferability. Then, batch size should be scaled appropriately as model size scales.

Simple Batch Size + Learning Rate Scaling with µP

More precisely, we find that the µP learning rate transfers as long as each model size is trained with a batch size roughly consistent with or larger than the critical batch size. The closer the batch size is to the critical batch size for a given model, the better the loss will be when using the µTransferred learning rate. Further, when training models with a batch size smaller than the critical batch size, the learning rate should be reduced in linear proportion to the reduction in batch size, consistent with the findings of Shallue et al. (2018) and Yang et al. (2021).
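The batch size guidance can be written as a one-line rule. The sketch below is an illustrative reading of that guidance, not code from this work: the function name is hypothetical, the critical batch size is assumed to be known (for example, estimated from gradient noise), and the numbers in the usage example are made up.

```python
def scaled_learning_rate(base_lr, batch_size, critical_batch_size):
    """Keep the µTransferred learning rate at or above the critical batch
    size, and reduce it linearly when training with a smaller batch."""
    if batch_size >= critical_batch_size:
        return base_lr
    return base_lr * batch_size / critical_batch_size

# Hypothetical values: half the critical batch size gives half the rate.
print(scaled_learning_rate(6e-3, batch_size=128, critical_batch_size=256))
print(scaled_learning_rate(6e-3, batch_size=512, critical_batch_size=256))
```

The rule leaves the learning rate unchanged above the critical batch size, reflecting the observation that the transferred value works well there without retuning.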