Scaling Data-Constrained Language Models

Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel

cs.CL cs.AI cs.LG

Introduction

Recent work on compute-optimal language models shows that many previously trained large language models (LLMs, which we define as having more than one billion parameters) could have attained better performance for a given compute budget by training a smaller model on more data. Notably, the 70-billion parameter Chinchilla model outperforms the 280-billion parameter Gopher model while using a similar compute budget by being trained on four times more data. Extrapolating these laws for compute allocation (hereafter "Chinchilla scaling laws") to a 530 billion parameter model, such as the under-trained MT-NLG model, would require training on a massive 11 trillion tokens, corresponding to more than 30 terabytes of text data. For most languages, available data is several orders of magnitude smaller, meaning that LLMs in those languages are already data-constrained. Villalobos et al. estimate that even high-quality English language data will be exhausted by the year 2024 given the Chinchilla scaling laws and the trend of training ever-larger models. This motivates the question: what should we do when we run out of data?

In this work we investigate scaling large language models in a data-constrained regime, and whether training an LLM with multiple epochs of repeated data impacts scaling. Using multiple epochs is, of course, standard in machine learning generally; however, most prior large language models have been trained for a single epoch and some work explicitly advocates against reusing data. An exception is the recent Galactica models that were trained for 4.25 epochs and exhibit continually decreasing validation loss and improving downstream performance throughout training. However, the experiments of Galactica do not compare this setup to an alternative non-data-constrained model trained for one epoch on unique data. Without this comparison, it is difficult to quantify the trade-off between additional compute versus additional data collection.

Our main focus is to quantify the impact of multiple epochs in LLM training such that practitioners can decide how to allocate compute when scaling models. Toward this end, we assembled a battery of empirical training runs of varying data and compute constraints. Specifically, we train more than 400 models ranging from 10 million to 9 billion parameters for up to 1500 epochs and record final test loss. We use these results to fit a new data-constrained scaling law that generalizes the Chinchilla scaling law to the repeated data regime and yields a better prediction of loss in this setting. Figure 1 summarizes our main results targeting the value of repeated data (Return) and optimal allocation of resources in that regime (Allocation). We find that, while models trained for a single epoch consistently have the best validation loss per compute, differences tend to be insignificant among models trained for up to 4 epochs and do not lead to differences in downstream task performance. Additional epochs continue to be beneficial, but returns eventually diminish to zero. We find that, in the data-constrained regime, allocating new compute to both more parameters and epochs is necessary, and that epochs should be scaled slightly faster. These findings suggest a simple way to continue scaling total training compute budgets further ahead in the future than the previously anticipated limits.

Finally, given the challenges imposed by data constraints, we consider methods complementary to repeating for improving downstream accuracy without adding new natural language data. Experiments consider incorporating code tokens and relaxing data filtering. For code, English LLMs, such as PaLM or Gopher, are trained on a small amount of code data alongside natural language data, though no benchmarking was reported to justify that decision. We investigate training LLMs on a mix of language data and Python data at 10 different mixing rates and find that mixing in code is able to provide a 2$\times$ increase in effective tokens even when evaluating only natural language tasks. For filtering, we revisit perplexity and deduplication filtering strategies on both noisy and clean datasets and find that data filtering is primarily effective for noisy datasets.

Background

Predicting the scaling behavior of large models is critical when deciding on training resources. Specifically, two questions are of interest: (Allocation) What is the optimal balance of resources? (Return) What is the expected value of additional resources? For scaling LLMs, the resource is compute (measured in FLOPs), and it can be allocated to training a larger model or training for more steps. In this work we use the approximation $\text{FLOPs}(N,D)\approx 6ND$ for the compute cost, where $N$ denotes the number of model parameters and $D$ denotes the number of tokens processed. The metric used to quantify progress is the model’s loss on held-out data, i.e. the ability to predict the underlying data as measured in the model’s cross-entropy. We aim to minimize the loss ( $L$ ) subject to a compute resource constraint ( $C$ ) via optimal allocation to $N$ and $D$ as:

$$\operatorname*{argmin}_{N,D}\; L(N,D)\quad\text{s.t.}\quad\text{FLOPs}(N,D)=C \tag{1}$$

Currently, there are established best practices for scaling LLMs. Return follows a power-law: loss scales as a power-law with the amount of compute used for training. Allocation is balanced: resources are divided roughly equally between scaling of parameters and data. These scaling laws were established empirically by training LLMs and carefully extrapolating behavior.

Chinchilla uses three methods for making scaling predictions:

(Fixed Parameters) Train with a fixed model size but on varying amounts of data.

(Fixed FLOPs) Train with fixed computation while parameters and training tokens vary.

(Parametric Fit) Derive and fit a formula for the loss.

For the parametric fit, the loss ( $L$ ) is a function of parameters ( $N$ ) and training tokens ( $D$ ):

$$L(N,D)=\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}+E \tag{2}$$

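Where $\{A,\alpha,B,\beta,E\}$ are learned variables fit using the training runs from the first two approaches. Using these learned variables, they propose calculating the optimal allocation of compute ( $C$ ) to $N$ and $D$ as follows:

$$N_{opt}(C)=G\left(\frac{C}{6}\right)^{a},\quad D_{opt}(C)=G^{-1}\left(\frac{C}{6}\right)^{b},\quad G=\left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}},\quad a=\frac{\beta}{\alpha+\beta},\quad b=\frac{\alpha}{\alpha+\beta} \tag{3}$$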

These methods lead to the conclusion that $\alpha\approx\beta$ and hence $N$ and $D$ should be scaled proportionally for compute-optimal training. As loss can be an imperfect proxy for performance on natural language tasks, they also validate their conclusions on various downstream tasks.
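The compute-optimal allocation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the parametric loss $A/N^{\alpha}+B/D^{\beta}+E$, the cost approximation $\text{FLOPs}\approx 6ND$, and the closed-form minimizer of that loss under the compute constraint. The function names are hypothetical, and the coefficient values are the ones commonly quoted for the original Chinchilla parametric fit, used here only as placeholders (they are not the coefficients this paper fits in Appendix B).

```python
def parametric_loss(n_params, n_tokens, A, B, E, alpha, beta):
    """Parametric form L(N, D) = A / N^alpha + B / D^beta + E."""
    return A / n_params**alpha + B / n_tokens**beta + E


def optimal_allocation(compute, A, B, alpha, beta):
    """Split a FLOP budget into (N_opt, D_opt), assuming FLOPs(N, D) ~= 6 * N * D."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    a = beta / (alpha + beta)   # exponent applied to compute for parameters
    b = alpha / (alpha + beta)  # exponent applied to compute for tokens
    n_opt = G * (compute / 6.0) ** a
    d_opt = (compute / 6.0) ** b / G
    return n_opt, d_opt


# ASSUMPTION: placeholder coefficients (values commonly quoted for the
# Chinchilla parametric fit), not the C4 refit used in this paper.
COEFFS = {"A": 406.4, "B": 410.7, "E": 1.69, "alpha": 0.34, "beta": 0.28}

budget = 1e21  # arbitrary example FLOP budget
fit = {k: COEFFS[k] for k in ("A", "B", "alpha", "beta")}
n_opt, d_opt = optimal_allocation(budget, **fit)
print(f"N_opt ~ {n_opt:.3g} parameters, D_opt ~ {d_opt:.3g} tokens")
```

Because $a+b=1$, the returned pair always spends exactly the given budget; shifting compute toward either more parameters or more tokens at the same budget gives a higher predicted loss under these assumptions.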

Method: Data-Constrained Scaling Laws

We are interested in scaling behavior in the data-constrained regime. Specifically, given a limited amount of unique data, what is the best Allocation of and Return for computational resources? Prior work assumes that the necessary data to support scaling is unlimited. Our aim is therefore to introduce a modified version of Equation 2 that accounts for data constraints and fit the terms in the modified scaling law to data from a large body of experiments.

The primary method we consider is repeating data, i.e. allocating FLOPs to multiple epochs on the same data. Given a budget of unique data $D_{C}$ , we split the Chinchilla total data term $D$ into two parts: the number of unique tokens used, $U_{D}$ , and the number of repetitions, $R_{D}$ (i.e. epochs - 1). Given total training tokens $D$ and data budget $D_{C}$ these terms are simply computed as $U_{D}=\min\{D_{C},D\}$ and $R_{D}=(D/U_{D})-1$ . When training for a single epoch, as done in prior scaling studies, $R_{D}=0$ . We are thus interested in minimizing Equation 1 with the additional constraint of a data budget $D_{C}$ :

$$\operatorname*{argmin}_{N,D}\; L(N,D)\quad\text{s.t.}\quad\text{FLOPs}(N,D)=C,\quad U_{D}\leq D_{C}$$

Symmetrically, for mathematical convenience, we split the parameter term $N$ into two parts: the base number of parameters needed to optimally fit the unique tokens $U_{N}$ , and the number of times to “repeat” this initial allocation, $R_{N}$ . We compute $U_{N}$ by first rearranging Equation 3 to find the optimal compute budget for the unique tokens used ( $U_{D}$ ). We input this value into the $N_{opt}$ formula of Equation 3 to get $U_{N}=\min\{N_{opt},N\}$ . $U_{N}$ thus corresponds to the compute-optimal number of parameters for $U_{D}$ or less if $N<N_{opt}$ . Once we have $U_{N}$ , we compute the repeat value as $R_{N}=(N/U_{N})-1$ .
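The bookkeeping in the two paragraphs above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: `compute_optimal_params` maps unique tokens to compute-optimal parameters by inverting the Chinchilla-style allocation (with $G=(\alpha A/\beta B)^{1/(\alpha+\beta)}$), the function names are hypothetical, and the coefficients are placeholder values, not the ones fitted in Appendix B.

```python
def compute_optimal_params(unique_tokens, A, B, alpha, beta):
    """Compute-optimal parameter count for a given number of unique tokens."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    return G * (unique_tokens * G) ** (beta / alpha)


def decompose(n_params, n_tokens, data_budget, **coeffs):
    """Split D into (U_D, R_D) and N into (U_N, R_N)."""
    u_d = min(data_budget, n_tokens)   # unique tokens actually used
    r_d = n_tokens / u_d - 1.0         # repetitions = epochs - 1
    u_n = min(compute_optimal_params(u_d, **coeffs), n_params)
    r_n = n_params / u_n - 1.0         # "repeats" of the base parameter count
    return u_d, r_d, u_n, r_n


# ASSUMPTION: placeholder coefficients, for illustration only.
COEFFS = {"A": 406.4, "B": 410.7, "alpha": 0.34, "beta": 0.28}

# A 1B-parameter model trained on 400M tokens drawn from a 100M-token budget.
u_d, r_d, u_n, r_n = decompose(1e9, 4e8, 1e8, **COEFFS)
print(f"U_D={u_d:.3g}, R_D={r_d:.1f}, U_N={u_n:.3g}, R_N={r_n:.1f}")

# A model smaller than the compute-optimal size has no excess parameters.
small = decompose(1e6, 4e8, 1e8, **COEFFS)
```

In the first call the data is seen for four epochs, so $R_{D}=3$, and the model is far larger than the compute-optimal size for the unique data, so $R_{N}>0$.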

To empirically explore the scaling behavior in a data-limited setting we train LLMs under these constraints. We consider three different experimental protocols in this work:

(Fixed Unique Data) In §5 we fix the data constraint $D_{C}$ and train models varying epochs and parameters. These experiments target Allocation, specifically the tradeoff between $D$ and $N$ .

(Fixed FLOPs) In §6 we fix the computation available and vary $D_{C}$ (and thus also $U_{D}$ and $U_{N}$ ). These experiments target Return, i.e. how well does repeating scale compared to having more unique data.

(Parametric Fit) We fit a formula introduced in §3.1 on all our training runs and evaluate its predictive capability throughout §5 and §6.

Before discussing experimental results we describe the parametric assumptions.

To extrapolate scaling curves, it is necessary to incorporate repetition into the Chinchilla formula (Equation 2). We generalize Equation 2 by replacing $D$ and $N$ with terms corresponding to the effective data ( $D^{\prime}$ ) and effective model parameters ( $N^{\prime}$ ):

$$L(N,D)=\frac{A}{(N^{\prime})^{\alpha}}+\frac{B}{(D^{\prime})^{\beta}}+E$$

Intuitively, $D^{\prime}$ should be smaller than or equal to $D$ , where $D$ is the total number of processed tokens, since repeated tokens provide less useful information to the model than new ones. We use an exponential decay formulation, in which the value of a processed data token shrinks to roughly a $(1-1/R^{*}_{D})$ fraction of its previous value with each repetition, and $R^{*}_{D}$ is a learned constant. After some derivations and approximations (see Appendix A), this boils down to

$$D^{\prime}=U_{D}+U_{D}R^{*}_{D}\left(1-e^{-R_{D}/R^{*}_{D}}\right)$$

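Note that for $R_{D}=0$ (no repetitions), $D^{\prime}=U_{D}=D$ . For $R_{D}\ll R^{*}_{D}$ , $e^{-R_{D}/R^{*}_{D}}\approx 1-\tfrac{R_{D}}{R^{*}_{D}}$ and so

$$D^{\prime}\approx U_{D}+U_{D}R^{*}_{D}\left(1-1+\frac{R_{D}}{R^{*}_{D}}\right)=U_{D}(1+R_{D})=D$$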

and hence in this case, repeated data is worth almost the same as fresh data. (This is also consistent with the predictions of the “deep bootstrap” framework.) As $R_{D}$ grows, the value of repeated tokens tends to zero, and the effective data $D^{\prime}$ becomes much smaller than $D$ . The formula implies that no matter how many times we repeat the data, we will not get a better loss than could be obtained with a single epoch on $U_{D}+U_{D}R^{*}_{D}$ fresh tokens.

Just as processing repeated tokens yields a diminishing return, both intuitively and empirically, models with sizes that vastly outstrip the available data also offer diminishing returns per parameter. Hence we use a symmetric formula for the number of effective parameters, where again $R^{*}_{N}$ is learned,

$$N^{\prime}=U_{N}+U_{N}R^{*}_{N}\left(1-e^{-R_{N}/R^{*}_{N}}\right)$$

The learned constants $R^{*}_{D}$ , $R^{*}_{N}$ roughly correspond to the “half-life” of repeated data and excess parameters. For example, at $R_{D}=R^{*}_{D}$ , the number of effective tokens $D^{\prime}$ is $U_{D}+U_{D}R_{D}(1-e^{-1})$ which means that the $U_{D}R_{D}$ repeated tokens are worth on average $1-1/e$ fraction of fresh ones.
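The three regimes discussed above (no repetition, $R=R^{*}$, and many repetitions) can be checked numerically with a small sketch. The helper name `effective_count` is hypothetical; it implements the symmetric expression $U+U R^{*}(1-e^{-R/R^{*}})$ used for both $D^{\prime}$ and $N^{\prime}$. The value $R^{*}=15$ is used only as an example constant.

```python
import math


def effective_count(unique, repeats, r_star):
    """Effective tokens (or parameters): U + U * R* * (1 - exp(-R / R*))."""
    return unique + unique * r_star * (1.0 - math.exp(-repeats / r_star))


U = 100.0       # unique tokens, arbitrary units
R_STAR = 15.0   # example decay constant

no_repeat = effective_count(U, 0, R_STAR)        # equals U
few_repeats = effective_count(U, 0.1, R_STAR)    # close to U * (1 + R)
at_r_star = effective_count(U, R_STAR, R_STAR)   # U + U * R* * (1 - 1/e)
many_repeats = effective_count(U, 1000, R_STAR)  # saturates near U * (1 + R*)

print(no_repeat, few_repeats, at_r_star, many_repeats)
```

The last value illustrates the ceiling stated in the text: however large `repeats` becomes, the result does not exceed `U * (1 + R_STAR)`.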

Using a methodology similar to , $R_{N}^{*}$ and $R_{D}^{*}$ can be fit on empirical measurements, which yields data-driven estimates. See Appendix A for more details on the derivations and the fitting procedure.

Experimental Setup

For all experiments, we train transformer language models with the GPT-2 architecture and tokenizer. Models have up to 8.7 billion parameters and are trained for up to 900 billion total tokens. Following we use cosine learning rate schedules that decay 10$\times$ over the course of training for each model (different schedules led to different estimates in ). Unlike , we do not use early stopping to also explore the extent of overfitting when repeating. Other hyperparameters are based on prior work and detailed in Appendix S. Models are trained on subsets of C4. The data constraints are carefully defined to ensure maximal overlap as shown in Figure 2. Unlike , we always repeat the entire available data rather than subsets of it. Data is shuffled after each epoch. As repeating data can result in extreme overfitting (see Appendix H), we report loss on a held-out test set unless otherwise specified (see Appendix K). This contrasts with the training loss used in , but should not alter our findings as the held-out data stems from the same underlying dataset.

Results: Resource Allocation for Data-Constrained Scaling

Our first experimental setting considers scaling in a setting where all models have the same data constraint. For these experiments, the unique training data budget $D_{C}$ is fixed at either 100M, 400M or 1.5B tokens. For each data budget, we train a set of language models with increasing amounts of compute that is allocated to either more parameters or more epochs on the unique training data.

Figure 3 (left) shows the main results for scaling with 100M unique tokens (see Appendix C for 400M and 1.5B tokens). Although small, this is the order of magnitude of a realistic data constraint, for example the data available after filtering the OSCAR dataset for Basque, Punjabi, or Slovenian. For 100M tokens, the corresponding one-epoch compute-optimal model according to the Chinchilla scaling laws has $U_{N}$ of approximately 7M parameters (see Appendix B for the scaling coefficients we use). Results show that more than a 50% reduction in loss can be attained by training for several epochs ( $R_{D}>0$ ) and increasing model size beyond what would be compute-optimal for 100M tokens ( $R_{N}>0$ ). We find the best loss to be at around 20-60$\times$ more parameters and epochs, which corresponds to spending around 7000$\times$ more FLOPs. These results suggest that one-epoch models significantly under-utilize their training data and more signal can be extracted by repeating data and adding parameters at the cost of sub-optimal compute utilization.

Figure 3 (right) shows the predicted contours created by fitting our data-constrained scaling laws on 182 training runs. In the single-epoch case ( $R_{D}=0$ ) with near compute-optimal parameters ( $R_{N}=0$ ) our scaling equation (§3.1) reduces to the Chinchilla equation. In this case, both formulas predict the optimal allocation of compute to parameters and data to be the same, resulting in overlapping efficient frontiers. As data is repeated for more than a single epoch, our fit predicts that excess parameters decay faster in value than repeated data ( $R_{N}^{*}<R_{D}^{*}$ ). As a result, the data-constrained efficient frontier suggests allocating most additional compute to more epochs rather than more parameters. This contrasts with the Chinchilla scaling laws, which suggest equally scaling both. However, note that they do not repeat the entire training data and their parametric fit explicitly relies on the assumption that models are trained for a single epoch only. Thus, there is no guarantee that their scaling predictions hold for repeated data.

For all three data budgets, our results suggest that Allocation is optimized by scaling epochs faster than additional parameters. We confirm this at scale by training the data-constrained compute-optimal model for $9.3\times 10^{21}$ FLOPs and 25 billion unique tokens as suggested by our efficient frontier. Despite having 27% fewer parameters, this model achieves better loss than the model suggested by the Chinchilla scaling laws (Figure 1, right). Similarly, the 120 billion parameter Galactica model trained on repeated data should have been significantly smaller according to data-constrained scaling laws (Appendix G). An additional benefit of using a smaller model is cheaper inference, though adding parameters can make it easier to parallelize training across GPUs.

Adding parameters and epochs causes the loss to decrease and eventually increase again, suggesting that too much compute can hurt performance. Results from also show that loss can increase when too many parameters are used, even with early stopping. However, we expect that appropriate regularization (such as simply removing all excess parameters as an extreme case) could prevent this behavior. Thus, our formula presented in §3 and its predicted isoLoss contours in Figure 3 do not model the possibility that excess epochs or parameters could hurt performance.

Results: Resource Return for Data-Constrained Scaling

Next, consider the question of Return on scaling. To quantify this value, we run experiments with three FLOP budgets across eight respective data budgets to compare return on FLOPs.

Figure 4 shows the configurations and validation curves for models trained on the same number of total tokens. Conforming to intuition and prior work on deduplication, repeated data is worth less, thus models trained on less unique data (and, correspondingly, more epochs) have consistently higher loss. However, the loss difference for a few epochs is negligible. For example, the $N=8.7$ billion parameter model trained for four epochs ( $D_{C}=44$ billion unique tokens) finishes training with only 0.5% higher validation loss than the single-epoch model ( $D_{C}=178$ billion unique tokens).

In Figure 5 (left), we compare the final test loss of each model to predictions from our parametric fit. The data-constrained scaling laws can accurately measure the decay in the value of repeated data as seen by the proximity of empirical results (dots) and parametric fit (lines). We note however that it significantly underestimates the final test loss of failing models where loss increases midway through training, such as models trained for 44 epochs (not depicted).

In Figure 5 (right), we extrapolate the three budgets by further scaling compute while keeping the data constraints ( $D_{C}$ ) at 55B, 84B, and 178B tokens, respectively. The parameter $R_{D}^{*}$ introduced in §3 represents roughly the “half-life” of epochs: specifically the point where repeated tokens have lost $\tfrac{1}{e}$ of their value. Through our fitting in Appendix A, we found $R_{D}^{*}\approx 15$ , corresponding to 15 repetitions (or 16 epochs). Graphically, this can be seen by the stark diminishing returns in the proximity of the 16-epoch marker and the flattening out soon after.

Overall, the Return when repeating data is relatively good. Meaningful gains from repeating data can be made up to around 16 epochs ( $R_{D}^{*}$ ) beyond which returns diminish extremely fast.

Results: Complementary Strategies for Obtaining Additional Data

While repeating data is effective, it has diminishing returns. We therefore consider strategies for scaling $D$ targeting improved downstream performance as opposed to directly minimizing loss.

Figure 6 (left) illustrates the strategies: (a) Code augmentation: We use Python code from The Stack to make up for missing natural language data. The combined dataset consisting of code and natural language samples is shuffled randomly. (b) Adapting filtering: We investigate the performance impact of deduplication and perplexity filtering, two common filtering steps that can severely limit available data. Removing such filtering steps can free up additional training data.

For these experiments, we set a maximum data budget ( $D_{C}$ ) of 84 billion tokens. For repetition and code filling, only a subset of $D_{C}$ is available and the rest needs to be compensated for via repeating or adding code. For both filtering methods, we start out with approximately twice the budget (178 billion tokens), as it is easier to gather noisy data and filter it than it is to gather clean data for training. For perplexity filtering, we select the top 25% samples with the lowest perplexity according to a language model trained on Wikipedia. This results in 44 billion tokens that are repeated for close to two epochs to reach the full data budget. For deduplication filtering, all samples with a 100-char overlap are removed resulting in 21 billion tokens that are repeated for four epochs during training. See Appendix N for more details on the filtering procedures.
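The perplexity-filtering step and the resulting number of epochs can be sketched as below. This is a hedged illustration only: the per-sample perplexities are assumed to be precomputed by some reference language model (the paper uses one trained on Wikipedia; see Appendix N for the actual procedure), and both function names are hypothetical.

```python
def keep_lowest_perplexity(samples, perplexities, keep_fraction=0.25):
    """Keep the `keep_fraction` of samples with the lowest perplexity.

    `perplexities[i]` is assumed to be the precomputed perplexity of
    `samples[i]` under a reference language model. Original order is kept.
    """
    n_keep = int(len(samples) * keep_fraction)
    ranked = sorted(range(len(samples)), key=lambda i: perplexities[i])
    kept = sorted(ranked[:n_keep])
    return [samples[i] for i in kept]


def epochs_needed(token_budget, available_tokens):
    """Number of passes over the filtered data needed to fill the budget."""
    return token_budget / available_tokens


# Toy data: eight documents with made-up perplexity scores.
docs = ["a", "b", "c", "d", "e", "f", "g", "h"]
scores = [50.0, 10.0, 80.0, 30.0, 5.0, 60.0, 70.0, 90.0]
filtered = keep_lowest_perplexity(docs, scores)

ppl_epochs = epochs_needed(84e9, 44e9)    # perplexity-filtered data
dedup_epochs = epochs_needed(84e9, 21e9)  # deduplicated data
print(filtered, round(ppl_epochs, 2), dedup_epochs)
```

With the token counts quoted in the text, the perplexity-filtered data is repeated for just under two epochs and the deduplicated data for four.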

When comparing across data strategies, loss ceases to be a good evaluation metric as the models are trained on different data distributions. We thus evaluate models on 19 natural language tasks with zero to five in-context few-shot exemplars producing 114 scores per model. As our evaluation tasks cover different metrics and random baselines, we re-scale all scores to be in the same range to better reflect performance ranges before averaging. Details on the evaluation datasets are in Appendix K.

In Figure 6 (right) we compare the downstream performance of all strategies. For repeating data, differences in downstream performance are insignificant for up to around 4 epochs (25% budget) and then start dropping, which aligns with our results on test loss in §6. Filling up to 50% of data with code (42 billion tokens) also shows no deterioration. Beyond that, performance decreases quickly on natural language tasks. However, adding more code data may benefit non-natural language tasks, which are not considered in the benchmarking. Two of the tasks benchmarked, WebNLG, a generation task, and bAbI, a reasoning task, see jumps in performance as soon as code is added, possibly due to code enabling models to learn long-range state-tracking capabilities beneficial for these tasks.

Of the filtering approaches, we find perplexity-filtering to be effective, while deduplication does not help. Prior work found deduplication was able to improve perplexity; however, it did not evaluate on downstream tasks. Deduplication may have value not captured in our benchmark, such as reducing memorization. We also investigate filtering on a different noisier dataset in Appendix O, where we find it to be more effective. Overall, in a data-constrained regime, we recommend reserving filtering for noisy datasets and using both code augmentation and repeating to increase data tokens. For example, first doubling the available data by adding code and then repeating the new dataset for four epochs results in 8$\times$ more training tokens that are expected to be just as good as having had 8$\times$ more unique data from the start.

Related Work

Large language models. Scaling up transformer language models across parameter count and training data has been shown to result in continuous performance gains. Starting with the 1.5 billion parameter GPT-2 model, a variety of scaled-up language models have been trained, commonly referred to as large language models (LLMs). They can be grouped into dense models and sparse models depending on whether each forward pass makes use of all parameters. These models are generally pre-trained to predict the next token in a sequence, which makes them applicable to various language tasks directly after pre-training by reformulating said NLP tasks as context continuation tasks (see for an earlier proposal on this topic). We focus on the most common scenario, where a dense transformer model is trained to do next-token prediction on a large corpus and evaluated directly after pre-training using held-out loss or zero- to few-shot prompting.

Scaling laws. Prior work has estimated an optimal allocation of compute for the training of LLMs. Kaplan et al. suggested a 10$\times$ increase in compute should be allocated to a 5.5$\times$ increase in model size and a 1.8$\times$ increase in training tokens. This first scaling law has led to the creation of very large models trained on relatively little data, such as the 530 billion parameter MT-NLG model trained on 270 billion tokens. More recent work, however, showed that model size and training data should rather be scaled in equal proportions. These findings called for a renewed focus on the scaling of pre-training data rather than scaling model size via complex parallelization strategies. Up-sampling is often employed when pre-training data is partly limited, such as data from a high-quality domain like Wikipedia or text in a rare language for training multilingual LLMs. Hernandez et al. study up-sampling of data subsets and find that repeating only 0.1% of training data 100 times significantly degrades performance. In contrast, our work focuses on repeating the entire pre-training corpus for multiple epochs rather than up-sampling parts of it.

Alternative data strategies. Large pre-training datasets are commonly filtered to remove undesired samples or reduce noise. Perplexity-based filtering, whereby a trained model is used to filter out samples with high perplexity, has been found beneficial to reduce noise in web-crawled datasets. Mixing of data is employed for the pre-training data of multilingual LLMs, where text data from different languages is combined. However, both for code and natural language models, mixing different (programming) languages has been reported to under-perform monolingual models. Some work has investigated mixing code and natural language data for prediction tasks, such as summarizing code snippets or predicting function names. Several pre-training datasets for LLMs include low amounts of code data. However, these past works generally do not provide any ablation on the drawbacks of including code or the benefits for natural language task performance. We perform a detailed benchmarking of mixing Python and natural language in LLM pre-training at 10 different mixing rates.

Conclusion

This work studies data-constrained scaling, focusing on the optimal use of computational resources when unique data is limited. We propose an extension to the Chinchilla scaling laws that takes into account the decay in value of repeated data, and we fit this function using a large set of controlled experiments. We find that despite recommendations of earlier work, training large language models for multiple epochs by repeating data is beneficial and that scaling laws continue to hold in the multi-epoch regime, albeit with diminishing returns. We also consider complementary approaches to continue scaling models, and find that code gives the ability to scale an additional 2$\times$. We believe that our findings will enable further scaling of language models to unlock new capabilities with current data. However, our work also indicates that there are limits on the scaling horizon. In addition to collecting additional data, researchers should explore using current data in a more effective manner.

Acknowledgments and Disclosure of Funding

This work was co-funded by the European Union under grant agreement No 101070350. The authors wish to acknowledge CSC – IT Center for Science, Finland, for generous computational resources on the LUMI supercomputer (https://www.lumi-supercomputer.eu/). We are thankful for the immense support from teams at LUMI and AMD, especially Samuel Antao. Hugging Face provided storage and additional compute instances. This work was supported by a Simons Investigator Fellowship, NSF grant DMS-2134157, DARPA grant W911NF2010021, and DOE grant DE-SC0022199. We are grateful to Harm de Vries, Woojeong Kim, Mengzhou Xia and the EleutherAI community for exceptional feedback. We thank Loubna Ben Allal for help with the Python data and Big Code members for insightful discussions on scaling laws. We thank Thomas Wang, Helen Ngo and TurkuNLP members for support on early experiments.

Appendix A Derivation of Data-Constrained Scaling Laws

Let $N$ be the number of model parameters, $D$ be the training tokens and $U$ be the "unique" training tokens, i.e. the size of the dataset that is to be trained on for one or more epochs. Chinchilla only deals with non-repeated tokens, thus $D=U$ and we can write their formula (“Approach 3”) as:

$$L(N,D)=\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}+E \tag{7}$$

where $E$ represents the irreducible loss. $A$ , $B$ , $\alpha$ and $\beta$ are learned parameters.

We now want to generalize this expression to multiple epochs where tokens are repeated. We repeat the data $R_{D}$ times, where $R_{D}=0$ corresponds to the base case of a single epoch. We let $D^{\prime}$ be the “effective data size”: the number of unique data needed to get the same value as repeating $U$ unique tokens for $R_{D}$ repeats. Hence, if $R_{D}=0$ , the effective data is the same as the total data processed. Intuitively, each time a sample is repeated, it is worth less as the model has already learned some of its information. Assume that each time a model trains on a token, it learns a $1-\delta$ fraction of the information in it for some constant $0\leq\delta\leq 1$ . (Thus, if $\delta=0$ repeated tokens are as good as new ones, and if $\delta=1$ , repeated tokens are worth nothing.) In other words, we expect the decrease in value of each repetition to be proportional to the value of the prior repetition, which is equivalent to exponential decay. As we would like to sum up the value of all repetitions, we temporarily assume an integral number of repeats and express it as a geometric series:

$$D^{\prime}=U+(1-\delta)U+(1-\delta)^{2}U+\dots+(1-\delta)^{R_{D}}U \tag{8}$$

We know that the sum $S$ of a geometric series with a common ratio $r$ is:

$$S=\frac{a(1-r^{n})}{1-r} \tag{9}$$

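where $a$ is the first term and $n$ the number of terms in the series. As $r=(1-\delta)$ , $a=(1-\delta)U$ and $n=R_{D}$ for the repeated part of the series:

$$D^{\prime}=U+(1-\delta)U\,\frac{1-(1-\delta)^{R_{D}}}{\delta} \tag{10}$$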

Note that Equation 10 can also be used with a non-integer number of repetitions. We can directly use Equation 10 as our effective data and learn $\delta$ but for convenience and interpretability, we redefine it in terms of the number of epochs beyond which repeating does not help. Note that as more data is repeated, the right-hand side tends to $\tfrac{(1-\delta)U}{\delta}$ , as $\lim_{R_{D}\to\infty}(1-(1-\delta)^{R_{D}})=1$ . Let $R^{*}_{D}=\tfrac{1-\delta}{\delta}$ , hence $D^{\prime}$ “plateaus” at $U+R^{*}_{D}U$ as $R_{D}$ goes to infinity.

If we assume $\delta$ to be small, ${1-\delta}$ tends to one and we can approximate $1/R^{*}_{D}=\tfrac{\delta}{1-\delta}\approx\delta$ .

Next, define $e^{x}$ in terms of its Taylor series expansion:

$$e^{x}=\sum_{k=0}^{\infty}\frac{x^{k}}{k!}=1+x+\frac{x^{2}}{2!}+\frac{x^{3}}{3!}+\dots \tag{11}$$

If $x$ is small, later terms become increasingly small, thus $e^{x}\approx 1+x$ . As we have assumed $\delta$ to be small, let $x=-\delta$ , which yields

$$(1-\delta)\approx e^{-\delta}\approx e^{-1/R^{*}_{D}} \tag{12}$$

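Now inserting $(1-\delta)/\delta=R_{D}^{*}$ and $(1-\delta)^{R_{D}}\approx\left(e^{-1/R^{*}_{D}}\right)^{R_{D}}=e^{-R_{D}/R^{*}_{D}}$ into Equation 10 we get our final equation representing the effective data:

$$D^{\prime}=U+U\cdot R_{D}^{*}\cdot\left(1-e^{-R_{D}/R_{D}^{*}}\right) \tag{13}$$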

where $U$ and $R_{D}$ are given while $R_{D}^{*}$ is a learned constant. If no repeats are done, the second part of the sum is zero and the term simplifies to the single-epoch scaling laws from Equation 7. While $R_{D}\ll R_{D}^{*}$ , the second term is approximated as $U\cdot R_{D}$ and for $R_{D}\gg R_{D}^{*}$ , it plateaus at $U\cdot R_{D}^{*}$ . Hence $R^{*}_{D}$ corresponds to the number of times we can repeat tokens before seeing sharply diminishing returns.

Let us consider a concrete example to show that Equation 13 is a very good approximation of Equation 10 and make the equations more intuitive. Suppose repeated data retains 75% of its value ( $\delta=0.25$ ) and we train on a single token or data unit ( $U=1$ ) for five epochs, i.e. we repeat it four times ( $R_{D}=4$ ). In that case Equation 10 yields $D^{\prime}=U+(1-\delta)U\tfrac{(1-(1-\delta)^{R_{D}})}{\delta}=1+0.75\cdot 4\cdot(1-0.75^{4})=3.05$ . Thus despite training for 5 total units (4 of which are repetitions), we only get the value equivalent to $3.05$ units. As we have defined $R_{D}^{*}=(1-\delta)/\delta$ , the corresponding $R_{D}^{*}$ value is $3$ . Setting $R_{D}^{*}=3$ in Equation 13 yields $D^{\prime}=U+U\cdot R_{D}^{*}\cdot(1-e^{-R_{D}/R_{D}^{*}})=1+3\cdot(1-e^{-4/3})=3.21$ . Due to our approximations, the results are not the same, i.e. $3.21$ is slightly higher than $3.05$ . However, note that the data term is additionally raised to a power of $\beta=0.353$ (see Equation 7; Appendix B), thus the actual difference calculated as $((3.21^{0.353})/(3.05^{0.353}))-1$ is a mere 1.8% despite this relatively large $\delta$ of $0.25$ . Equation 13 has the benefit that we can interpret $R_{D}^{*}$ as the number of repetitions beyond which repeating yields sharply diminishing returns and flattens out soon after. Consider $R_{D}=100$ : then $D^{\prime}=1+3\cdot(1-e^{-100/3})\approx 4.00$ , just below $4$ . No matter how many repeats are done the effective data will never exceed $4$ , i.e. it plateaus at $U+R_{D}^{*}U$ as $R_{D}$ tends to infinity.
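The arithmetic of this worked example can be reproduced with a short script. This is only a numerical check of the numbers quoted above; the two function names are hypothetical labels for the geometric-series form (Equation 10) and the exponential approximation (Equation 13).

```python
import math


def effective_data_geometric(U, R_D, delta):
    """Equation 10: closed form of the geometric series."""
    return U + (1 - delta) * U * (1 - (1 - delta) ** R_D) / delta


def effective_data_exponential(U, R_D, R_star):
    """Equation 13: exponential-decay approximation."""
    return U + U * R_star * (1 - math.exp(-R_D / R_star))


delta, U, R_D = 0.25, 1.0, 4
R_star = (1 - delta) / delta  # = 3 for delta = 0.25

exact = effective_data_geometric(U, R_D, delta)
approx = effective_data_exponential(U, R_D, R_star)

# Direct summation of the series, as a cross-check of the closed form.
summed = sum((1 - delta) ** k * U for k in range(R_D + 1))

beta = 0.353  # data exponent quoted in the text
relative_gap = (approx / exact) ** beta - 1

plateau = effective_data_exponential(U, 100, R_star)

print(f"exact={exact:.2f} approx={approx:.2f} gap={relative_gap:.1%}")
```

The plateau value sits just below $U+R_{D}^{*}U=4$, matching the limit described in the text.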

Similarly, we consider repeating parameters. Symmetric to seeing the same data, excess parameters learn the same features and do not add any value in the extreme. For the Chinchilla equation (Equation 7), increasing parameters from 1 billion to 10 billion yields the same absolute decrease in loss regardless of whether the dataset is a single token or 1 billion tokens. However, intuition and our data (Appendix F) suggest that in the first case, adding parameters should not decrease loss at all, as the additional 9 billion parameters cannot possibly learn anything from the single token that the first 1 billion parameters have not already learned. Thus, to allow excess parameters to decay to adding nothing, we also replace $N$ with a symmetric version of Equation 13, yielding our final equation:

$$L(U_{N},U_{D},R_{N},R_{D})=\frac{A}{(U_{N}+U_{N}R_{N}^{*}(1-e^{\frac{-R_{N}}{R_{N}^{*}}}))^{\alpha}}+\frac{B}{(U_{D}+U_{D}R_{D}^{*}(1-e^{\frac{-R_{D}}{R_{D}^{*}}}))^{\beta}}+E\qquad(14)$$

We define $U_{N}$ as the number of "unique" parameters that provide an optimal fit for $U_{D}$. Additional parameters decay with a symmetric version of the expression for repeated data. $R_{N}$ is the number of times that the "unique" parameters are repeated, i.e. $R_{N}=\max\{(N/U_{N})-1,0\}$. If $R_{N}^{*}=\infty$, additional parameters do not decay at all and $(U_{N}+U_{N}R_{N}^{*}(1-e^{\frac{-R_{N}}{R_{N}^{*}}}))$ reduces to $N$. We compute $U_{N}$ from $U_{D}$ by setting $D_{opt}=U_{D}$ and rearranging Equation 3 to map from $D_{opt}$ to $N_{opt}$. $U_{N}$ is then $\min\{N_{opt},N\}$. This is equivalent to the following:
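The mapping from $U_{D}$ to $U_{N}$ and $R_{N}$ described above can be sketched in a few lines. The sketch assumes that Equation 3 has the closed form of Hoffmann et al., $N_{opt}=G(C/6)^{a}$ and $D_{opt}=G^{-1}(C/6)^{b}$ with $G=(\alpha A/(\beta B))^{1/(\alpha+\beta)}$, which is not restated in this appendix; eliminating $C$ then gives $N_{opt}=G\,(D_{opt}G)^{\beta/\alpha}$. The function names are illustrative and the coefficients are the C4 values from Appendix B.

```python
import math

# C4 coefficients from Appendix B (a, b are the logs of A, B)
A = math.exp(6.255414)
B = math.exp(7.3049974)
ALPHA = BETA = 0.3526596


def unique_params(U_D, N):
    """U_N = min(N_opt, N), with N_opt obtained by setting D_opt = U_D.

    Assumes the Hoffmann et al. closed form for the compute-optimal
    allocation (an assumption; Equation 3 is not shown in this section).
    """
    G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    N_opt = G * (U_D * G) ** (BETA / ALPHA)
    return min(N_opt, N)


def param_repetitions(N, U_N):
    """R_N = max(N / U_N - 1, 0), as defined in the text."""
    return max(N / U_N - 1, 0.0)


U_D = 100e6  # unique tokens
N = 1e9      # model parameters
U_N = unique_params(U_D, N)
R_N = param_repetitions(N, U_N)
print(f"U_N = {U_N:.3e}, R_N = {R_N:.1f}")
```

When $N$ is at most the compute-optimal size for $U_{D}$, the sketch returns $U_{N}=N$ and $R_{N}=0$, so no parameter decay is applied.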

Equation 14 is a generalization of Equation 7: It provides the same estimates for optimal model and data size in the single epoch case, but allows for decay in the value of parameters and tokens, thus generalizing to training for multiple epochs and with excess parameters. It can thus be used as a direct replacement of Equation 7. If $R_{N}^{*}$ and $R_{D}^{*}$ are unknown, one can simply set them to infinity by default, which will make Equation 14 completely equivalent to Equation 7.

To learn the parameters $R_{N}^{*}$ and $R_{D}^{*}$, we largely follow the approach from Hoffmann et al. We fix $a$, $b$, $e$, $\alpha$, $\beta$ to the values learned on C4 in Appendix B and minimize:

We use the LBFGS algorithm to find local minima of the objective above, started on a grid of initialization given by: $R_{N}^{*}\in\{0.,4.,\dots,20.\}$ and $R_{D}^{*}\in\{0.,4.,\dots,20.\}$. We fit on 182 samples with parameters varying from 7 million up to 9 billion and epochs ranging from 1 to 500. We removed outliers referenced in Appendix F from our fitting, as our formulas do not allow for excess parameters or excess epochs to negatively impact performance. We assume excess parameters or epochs only cause performance to plateau but never to worsen. However, it is difficult to identify all samples where excess parameters or epochs hurt, as for some data budgets we only train a single model, thus we do not know if the loss of that model is already in the range where it starts to increase again. Further, there are samples where loss initially increases and then decreases as a function of epochs (double descent, see Appendix D), which further contributes to noise in the fitting. Nevertheless, we are able to get a fairly stable fit resulting in $R_{N}^{*}=5.309743$ and $R_{D}^{*}=15.387756$. Since $R_{D}^{*}>R_{N}^{*}$, excess parameters decay faster. Hence, the data-constrained efficient frontiers in Figures 1 and 3 suggest scaling compute allocated to epochs faster than to parameters. This value of $R_{D}^{*}$ yields $\delta\approx 6*10^{-2}$ ($0.19$ for $R_{N}^{*}$), which respects the assumption that $\delta$ is small. Inserting these learned parameters and the parameters from Appendix B, and simplifying Equation 15 yields the precise formulation we use to predict loss ($L$) given unique tokens ($U_{D}$), parameter repetitions ($R_{N}$) and data repetitions ($R_{D}$):
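As an illustration, the fitted formulation can be written as a small function. This sketch assumes that Equation 14 keeps the Chinchilla form $A/N^{\alpha}+B/D^{\beta}+E$ with $N$ and $D$ replaced by their decayed counterparts, as described in the text; the function names are ours, and $U_{N}$ is passed in directly rather than derived from $U_{D}$.

```python
import math

# Coefficients from Appendix B (exponentiated) and the fitted decay constants
A = math.exp(6.255414)
B = math.exp(7.3049974)
E = math.exp(0.6254804)
ALPHA = BETA = 0.3526596
R_N_STAR = 5.309743
R_D_STAR = 15.387756


def effective(U, R, R_star):
    """Decayed quantity U + U * R_star * (1 - exp(-R / R_star))."""
    return U + U * R_star * (1 - math.exp(-R / R_star))


def data_constrained_loss(U_N, U_D, R_N, R_D):
    """Predicted loss under the assumed form of Equation 14."""
    N_eff = effective(U_N, R_N, R_N_STAR)
    D_eff = effective(U_D, R_D, R_D_STAR)
    return A / N_eff**ALPHA + B / D_eff**BETA + E


# Hypothetical inputs: 1B unique parameters, 20B unique tokens
single_epoch = data_constrained_loss(1e9, 20e9, R_N=0, R_D=0)
five_epochs = data_constrained_loss(1e9, 20e9, R_N=0, R_D=4)
print(f"1 epoch: {single_epoch:.3f}, 5 epochs: {five_epochs:.3f}")
```

With $R_{N}=R_{D}=0$ the function reduces to the single-epoch Chinchilla prediction, which is the property that makes Equation 14 usable as a direct replacement for Equation 7.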

We experiment with different versions of our formula and display the learned values in Table 1. No decay or decaying only $D$ or $N$ of Equation 14 leads to worse loss and $R^{2}$ than Equation 14. Thus, it is important to decay both the value of excess parameters and data repetitions. We also consider an explicit exponential where $D^{\prime}=\sum_{k=0}^{R_{D}}U*e^{-R_{D}^{*}k}$ , hence from Equation 9 it follows:

This explicit decay, Equation 10, and Equation 14 all yield similar results with $R^{2}$ around 80. Equation 14 fits the data slightly worse than Equation 10, likely due to our approximations. Nevertheless, we use Equation 14 throughout as it has fewer terms, and we find it easier to interpret.

In our case, consider the setting of a fixed compute budget $C$ and a fixed budget of unique tokens $U_{D}$ implying a set of unique parameters $U_{N}$ . Let $R_{D}$ denote the number of times we repeat data (we assume that we are in the multi-epoch regime and hence $R_{D}>0$ ).

Write $U_{D}=cU_{N}$ (for Chinchilla $c\approx 20$ ). When $R_{D}\ll R^{*}_{D}$ and $R_{N}\ll R^{*}_{N}$ , our scaling agrees with Chinchilla, and so the point $(U_{N},U_{D})$ , corresponding to $R_{D}=R_{N}=0$ is on the optimal compute curve. Increasing $R_{D}$ by $\epsilon$ corresponds to increasing the number of tokens by $\epsilon U_{D}=\epsilon cU_{N}$ , while increasing $R_{N}$ by $\epsilon$ corresponds to increasing the number of parameters by $\epsilon U_{N}$ . For small positive $R_{D},R_{N}$ , our curve agrees with Chinchilla and so we need to increase $R_{N},R_{D}$ by the same amount to maintain the proportionality. Hence up to some value $r>0$ , the optimal compute curve corresponds to $R_{N}=R_{D}=r$ . Our curve differs from Chinchilla when $r$ gets closer to either $R^{*}_{N}$ or $R^{*}_{D}$ . At this point, we start to see sharply diminishing returns.

In our setting, $R^{*}_{D}>R^{*}_{N}$, which means that we reach the point $r\approx R^{*}_{N}$ first. At this point, each added parameter is worth less (specifically, $e^{-r/R^{*}_{N}}$) than an added data point, despite them having equal computational cost. Hence processing more tokens will be more effective than increasing the number of parameters, and we expect the optimal compute curve to break away from proportionality. This is indeed what we see.

Appendix B C4 Scaling Coefficients

While Hoffmann et al. have shown that the equal scaling of model parameters and training tokens holds across different training datasets, the precise ratios vary considerably across datasets and approaches. For example, given the Gopher compute budget of $5.76\times 10^{23}$ FLOPs, their parametric loss function fitted on MassiveWeb predicts an optimal allocation of 40 billion parameters. Meanwhile, if the training dataset is C4, their IsoFLOP approach predicts 73 billion parameters to be optimal, almost twice as much. However, for C4, which is our training dataset, they do not provide the coefficients necessary to compute loss with their parametric loss function. Based on their IsoFLOP training runs on C4, they only provide the information that for C4, compute ($C$) allocated to data ($D$) and parameters ($N$) should be scaled exactly equally for optimality, i.e. $a=b=0.5$ in the relationship $N_{opt}\propto C^{a}$ and $D_{opt}\propto C^{b}$. This corresponds to $\alpha=\beta$ in the parametric loss function (Equation 2). Thus, we use this information together with the methodology and C4 data points from Hoffmann et al. to fit the parametric loss function. We tie the parameters $\alpha$ and $\beta$ to be equal and optimize

where LSE is the log-sum-exp operator and $N_{i}$ , $D_{i}$ and $L_{i}$ the model size, dataset size and loss of the $i$ th run, and $\delta=10^{-3}$ . We fit on 54 samples on a grid of initialization given by: $\alpha\in\{0.,0.5,\dots,2.\}$ , $\beta\in\{0.,0.5,\dots,2.\}$ , $e\in\{-1.,-.5,\dots,1.\}$ , $a\in\{0,5,\dots,25\}$ , and $b\in\{0,5,\dots,25\}$ . Our fit results in $a=6.255414$ , $b=7.3049974$ , $e=0.6254804$ , $\alpha=\beta=0.3526596$ . Exponentiating $a$ , $b$ and $e$ to get $A$ , $B$ and $E$ and inserting all learned coefficients into Equation 2 then allows us to compute loss ( $L$ ) as a function of parameters and data:
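The exponentiation step can be sketched as follows. The sketch assumes that Equation 2 is the standard parametric form $L(N,D)=E+A/N^{\alpha}+B/D^{\beta}$; the function name is illustrative.

```python
import math

# Fitted log-space coefficients reported above
a, b, e = 6.255414, 7.3049974, 0.6254804
alpha = beta = 0.3526596

# Exponentiate to recover A, B and E
A, B, E = math.exp(a), math.exp(b), math.exp(e)


def chinchilla_c4_loss(N, D):
    """Single-epoch loss for N parameters and D tokens (assumed form of Equation 2)."""
    return E + A / N**alpha + B / D**beta


print(f"A = {A:.1f}, B = {B:.1f}, E = {E:.3f}")
print(f"L(1e9, 20e9) = {chinchilla_c4_loss(1e9, 20e9):.3f}")
```

Rounded, the exponentiated coefficients are $A\approx 521$, $B\approx 1488$ and $E\approx 1.87$.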

To verify the accuracy of our fit, we benchmark the predictions against those of the IsoFLOP C4 curves in Hoffmann et al. Following their derivation, we can compute the optimal number of parameters $N_{opt}$ and tokens $D_{opt}$ for our fit using:

Given the Gopher compute budget of $C=5.76\times 10^{23}$ FLOPs, our fitted parameters predict an optimal allocation of $N_{opt}=70.0$ billion parameters and $D_{opt}=1.37$ trillion tokens. This is very close to the 73 billion parameters and 1.3 trillion tokens predicted by the IsoFLOP curves on C4 from Hoffmann et al., and thus we consider it a good fit. We use these fitted parameters rather than the MassiveWeb parameters for all computations involving Chinchilla scaling laws.
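The check above can be reproduced numerically. This sketch assumes the closed-form compute-optimal allocation from Hoffmann et al. under the approximation $C\approx 6ND$, namely $N_{opt}=G(C/6)^{a}$ and $D_{opt}=G^{-1}(C/6)^{b}$ with $G=(\alpha A/(\beta B))^{1/(\alpha+\beta)}$, $a=\beta/(\alpha+\beta)$ and $b=\alpha/(\alpha+\beta)$; the function name is ours.

```python
import math

# C4 coefficients fitted in this appendix
A = math.exp(6.255414)
B = math.exp(7.3049974)
alpha = beta = 0.3526596


def compute_optimal_allocation(C):
    """Return (N_opt, D_opt) for a FLOP budget C, assuming C = 6 * N * D."""
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    a = beta / (alpha + beta)  # exponent for parameters
    b = alpha / (alpha + beta)  # exponent for tokens
    N_opt = G * (C / 6) ** a
    D_opt = (C / 6) ** b / G
    return N_opt, D_opt


N_opt, D_opt = compute_optimal_allocation(5.76e23)  # Gopher budget
print(f"N_opt = {N_opt / 1e9:.1f}B parameters")
print(f"D_opt = {D_opt / 1e12:.2f}T tokens")
```

Because $\alpha=\beta$, both exponents equal $0.5$, so parameters and tokens scale equally with compute and the product $6\,N_{opt}D_{opt}$ recovers $C$.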

Appendix C Additional Contour Plots

Figure 8 contains additional empirical isoLoss contours for 400 million and 1.5 billion unique tokens. Results show that like in Figure 3 significantly lower loss can be achieved by increasing parameters and epochs beyond what is compute-optimal at a single epoch. The lowest loss is also achieved by allocating more extra compute to repeating data rather than to adding parameters.

Appendix D Double Descent

Prior work has reported double descent phenomena when repeating data, where the loss initially increases and then decreases again as the model is trained for more epochs . In Figure 9, we plot the loss curves of several models trained for varying epochs on 100 million tokens. We find double descent phenomena with the loss of all models increasing at 200 epochs before decreasing again. This contributes to additional noise in the fitting of our functions in Appendix A, as our functional form assumes loss to be monotonically decreasing as epochs increase. Thus, we remove most such examples from the fitting.

Appendix E Repeating on Heavily Deduplicated Data

To investigate whether Figure 3 is dependent on the inherent amount of duplicates in the selected 100 million tokens, we train several models on a deduplicated version of C4 (see Appendix N). We plot the performance of the models trained on the deduplicated C4 versus the regular C4 in Figure 10. All models are evaluated on the same validation dataset from the regular C4. Regardless of deduplication we find 59 epochs to be optimal and the overall trend to be very similar. Together with our results on OSCAR (Appendix I), this suggests that our work generalizes to different datasets with different inherent amounts of duplicates.

Appendix F Do Excess Parameters Hurt, Plateau or Help?

Figures 3, 8 suggest that excess parameters (or epochs) can harm performance. We hypothesize that this is due to suboptimal hyperparameters and could be prevented with better regularization. Thus, we expect with optimal regularization hyperparameters excess parameters would never hurt, but performance would merely plateau, as in extreme cases regularization could just take the form of removing the excess parameters. One approach to selecting optimal hyperparameters is $\mu$ P . We compare excessively large models trained with a data constraint of $D_{C}=100$ million tokens in Figure 11 across $\mu$ P, our default hyperparameters (Appendix S) and scaling law predictions. Surprisingly, $\mu$ P leads to even higher test loss than our default hyperparameters. Nevertheless, we find that also with $\mu$ P excessive parameters hurt: The models with more than 2 billion parameters have significantly higher validation loss after training than the models with 200 million to 1 billion parameters when trained on only 100 million tokens. However, $\mu$ P only covers hyperparameters such as the learning rate, but not explicit regularization hyperparameters like dropout rates, which we hypothesize would prevent this behavior. Thus, our proposed scaling equations predict loss to plateau, as seen in the straight line. As the compute-optimal parameter count for 100 million tokens is around 7 million, all depicted models have a significant amount of excess parameters and data-constrained scaling laws predict their losses to be all the same ( $R_{N}^{*}\ll R_{N}$ ). Meanwhile, the default Chinchilla scaling law predicts loss to continue decreasing as parameters are added, which is in stark contrast to the empirical data.

If one wants to incorporate excess parameters hurting performance into the scaling law equations, one could consider (a) modifying the exponential decay formulation introduced in Appendix A such that instead of the value of repeated data decaying to $0$ it decays to a large negative value, or (b) decaying the exponents $\alpha$ and $\beta$ in Equation 7 instead of $D$ and $N$. Decaying the exponents to $0$ has the effect of more repetitions eventually hurting performance as $\lim_{\beta\to 0}D^{\beta}=1$ and the same for $N^{\alpha}$. Thus, initially as $D$ and $N$ increase loss decreases, but ultimately the decay of $\alpha$ and $\beta$ pushes $N^{\alpha}$ and $D^{\beta}$ back to $1$, causing loss to increase. Specifically, approach (b) could take the form of:

Like the equations in Appendix A, this formulation also reduces to the Chinchilla scaling laws in the base case of $R_{D}=0$ and $R_{N}=0$. As the exponents decrease with more repetitions, adding parameters or epochs becomes less beneficial. Eventually, the decay in $\alpha$ or $\beta$ causes loss to increase again as it pushes $N^{\alpha}$ or $D^{\beta}$ back down to $1$. We fit this formula using the same approach outlined in Appendix A but including samples where excess parameters or epochs hurt (296 total samples). We use a grid of initialization given by: $R_{N}^{*}\in\{0.,2000.,\dots,100000.\}$ and $R_{D}^{*}\in\{0.,2000.,\dots,100000.\}$. This results in $R_{D}^{*}=26530.611$ and $R_{N}^{*}=2040.8163$. $R_{N}^{*}$ is significantly lower, resulting in excess parameters hurting faster than excess epochs, which is in line with empirical data from Figure 3. We visualize Figure 3 with the predictions from this alpha-beta decay formulation in Figure 12. Excess parameters eventually hurt, resulting in circle-shaped contours. Due to the very high $R_{D}^{*}$, the area where epochs start to hurt is outside of the boundaries of Figure 12. While the predicted optimal allocation (efficient frontier) is similar to Figure 3, the predicted return from repeated data differs significantly. The alpha-beta decay formulation incorrectly predicts returns to diminish significantly more slowly, as seen by the longer efficient frontier and the smaller distance in contours early on as compared to Figure 3. Beyond its potentially useful properties, we do not have a rigorous mathematical justification for this alpha-beta decay formulation, which could be the cause of the incorrect return predictions.

Ultimately, we settle on our exponential decay formulation from Appendix A that does not allow excess parameters or epochs to hurt, as preventing such behavior is trivial by stopping training (in the case of epochs hurting) or removing excess parameters (in the case of model parameters hurting). Further, accurately predicting how much loss increases in the limit is not very useful, as in practice one would want to stop training when it is expected to plateau anyway.

Appendix G Case Study: Galactica

The Galactica models are the only publicly known LLMs that explicitly trained for a significant number of epochs prior to this work. They trained their models on 106 billion unique tokens for 4.25 epochs. Our findings on Return from repeated data agree with their conclusion that multiple epochs are beneficial; however, we find that even more epochs can be beneficial and a small spike in validation loss does not justify stopping training (Appendix J). Meanwhile, our findings on Allocation significantly deviate from Galactica. Figure 13 visualizes the Galactica models with our predicted efficient frontier in the same style as Figure 1. The creators of Galactica decided to train a 120 billion parameter model on 450 billion tokens, a significant overallocation to parameters even in Chinchilla terms (black efficient frontier). This decision was likely driven by the intuition that repeated data is worth less, thus one should spend more compute on parameters. However, our empirical data contradicts this. Parameters learning from repeated data are worth even less than repeated data, thus one should overallocate to epochs, not parameters. Our data-constrained scaling laws thus predict that a better model could have been trained by allocating significantly more FLOPs to epochs rather than parameters for the largest Galactica model with 120 billion parameters. Specifically, 40 billion parameters trained for 1.35 trillion tokens (12.75 epochs) would have been optimal according to data-constrained scaling laws. Note that these scaling laws have been fitted on C4, which is not the dataset used to pre-train Galactica. The Galactica models are pre-trained on a predominantly scientific dataset, which includes code data among other data sources. Results from prior work show that there are differences in the scaling coefficients when training on C4 as compared to GitHub code; however, the overall allocation trend is the same. Thus, while we expect a smaller model trained for more epochs to be better than the 120 billion parameter model, the optimal allocation is unlikely to be exactly 40 billion parameters and 1.35 trillion tokens.

Appendix H Training Loss

Hoffmann et al. use training loss as their core metric. However, when repeating data for multiple epochs, training loss is a bad metric as models will overfit to the limited data available as shown in Figure 14. Thus, we use loss on a held-out test set as our key performance metric.

Appendix I Scaling Curves on the OSCAR Corpus

To ensure our findings are not dataset-dependent, we train models with the same configurations from Figure 4 on the OSCAR corpus. OSCAR is considered noisier than C4 due to its less stringent deduplication. Figures 15 and 16 depict the validation and training loss of these models. We find the trend to be the same as for models trained on C4: While models with fewer repeats have better loss, differences for a few repeats are insignificant.

Appendix J Validation Loss by Epoch

Taylor et al. decided to early-stop pre-training of the Galactica models due to a small increase in validation loss at the start of the fifth epoch. In Figure 17 we plot the validation loss curves of our isoFLOP models as a function of epochs. We do find small increases in validation loss when models enter a new epoch. For example, upon entering the third and fourth epoch, the 7-epoch 8.7 billion parameter OSCAR model shows loss spikes. However, these are temporary and loss continues to go down smoothly thereafter. Thus, we hypothesize that the Galactica models could have attained better performance by continuing pre-training beyond the loss spike experienced at the beginning of the fifth epoch.

Appendix K Evaluation Details

For all models trained on C4, the final test loss is computed on the same 210 million tokens from the C4 validation set after training. For held-out evaluation during training, such as in Figure 4, the configurations are displayed in Table 2. The small number of evaluation tokens for the 8.7 billion parameter models likely contributes to the loss spikes for 8.7 billion parameter models seen in Figure 4. Thus, we smooth the validation loss curves of 8.7 billion parameter models with exponential moving average smoothing and a weight of 0.85. For training OSCAR, configurations are the same, however, the validation split used is a held-out part from the OSCAR training split, as there is no official validation split for OSCAR. All training loss curves for C4 and OSCAR models are smoothed with exponential moving average smoothing and a weight of 0.999.
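The smoothing applied to the loss curves can be sketched as below. The exact variant (for example, whether a bias correction is applied) is not stated here, so this sketch assumes the plain recursive form in which each smoothed value is `weight` times the previous smoothed value plus `1 - weight` times the new observation; the function name is illustrative.

```python
def ema_smooth(values, weight):
    """Exponential moving average smoothing of a sequence of loss values.

    Assumed form: s_t = weight * s_{t-1} + (1 - weight) * x_t, with s_0 = x_0.
    """
    smoothed = []
    last = values[0] if values else None
    for value in values:
        last = weight * last + (1 - weight) * value
        smoothed.append(last)
    return smoothed


# Hypothetical validation losses with a spike at the third evaluation
validation_loss = [3.2, 3.0, 3.6, 2.9, 2.8]
print(ema_smooth(validation_loss, weight=0.85))
```

A weight of 0.85 dampens isolated spikes while still tracking the trend, whereas 0.999 (used for training loss) averages over a much longer window.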

We provide statistics of all downstream evaluation datasets in Table 3. We use the evaluation-harness frameworks from BigScience and EleutherAI to evaluate models on 19 evaluation datasets. For each dataset, a maximum of 3000 samples are evaluated with 0, 1, 2, 3, 4 and 5 few-shot examples to produce six scores, which are then averaged. We normalize scores to range from the random baseline of each task to 1 and report them as percentages. For example, if random guessing produces 50% accuracy and the maximum accuracy possible is 100%, then a raw accuracy of 55% would be normalized to 10%, and a raw accuracy of 45% would be normalized to -10% since it is worse than random. This is done to give all tasks the same weight. Otherwise average performance would heavily depend on generative tasks, where the random baselines are 0. Prompts are sourced from GPT-3 and PromptSource and detailed in Appendix T. We note that our evaluation is by no means comprehensive and a larger benchmarking would be helpful. However, by training five seeds for most models benchmarked, always averaging 0-5 few-shots, and ensuring maximum data overlap for repeated data (§4), we significantly reduce uncertainty.
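The score normalization described above amounts to a linear rescaling. The following is a minimal sketch with an illustrative function name; scores are expressed as fractions in $[0,1]$ rather than percentages.

```python
def normalize_score(raw, random_baseline, max_score=1.0):
    """Rescale a raw score so the random baseline maps to 0 and max_score to 1."""
    return (raw - random_baseline) / (max_score - random_baseline)


# The example from the text: binary task with a 50% random baseline
print(f"{normalize_score(0.55, 0.5):.0%}")  # slightly better than random
print(f"{normalize_score(0.45, 0.5):.0%}")  # worse than random

# Averaging six hypothetical few-shot scores (0 to 5 shots) for one dataset
few_shot_scores = [0.52, 0.54, 0.55, 0.56, 0.55, 0.58]
average = sum(normalize_score(s, 0.5) for s in few_shot_scores) / len(few_shot_scores)
print(f"{average:.1%}")
```

For generative tasks, where the random baseline is 0, the normalization leaves the raw score unchanged.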

Appendix L Downstream Repetition Results

In Tables 4-9 we report downstream results of all models trained on C4 and OSCAR according to the configurations in Figure 4. All scores are from the final checkpoints at the end of training. OSCAR is a noisier dataset than C4 due to less filtering, thus models trained on C4 generally perform better. Notably, models trained on C4 completely fail on bAbI , while OSCAR models are able to perform better than random. This is likely due to code data being present in OSCAR, which enables state-tracking capabilities like for code augmented models in §7. For C4 the creators strictly removed all data that resembles code . There are no significant differences between models trained for a single epoch and models trained for up to 4 epochs. Even models trained for more epochs (and thus on less unique data) have similar performance.

Appendix M Detailed Code Augmentation Results

We report tabular results for replacing part of C4 or OSCAR with code for 4.2 billion parameter and 2.8 billion parameter models in Tables 10-11. We find that training on up to 50% of Python data maintains performance on all natural language tasks while enabling huge performance gains on state-tracking (bAbI) for C4. For OSCAR, gains are less clear, which is likely due to OSCAR containing code, while code data was explicitly filtered out for C4.

Appendix N Filtering Procedure

We follow the approach of the BigScience data preparation pipeline to perform perplexity filtering and reuse its artifacts: a SentencePiece tokenizer and a KenLM 5-gram language model trained on Wikipedia introductions, both available to download from their repository (https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/01b_oscar_cleaning_and_filtering). We compute the model’s perplexity on all OSCAR and C4 samples and only select samples that fall within a certain percentile threshold. For example, to select the top 25%, we only select samples with perplexity lower than the 25th percentile. Figure 18 provides a visual representation of the perplexity distribution for the respective datasets, highlighting the relevant percentile thresholds.
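The percentile-based selection can be sketched as follows. This is a simplified illustration that assumes perplexities have already been computed for every sample (for instance with the KenLM model mentioned above); the function name is ours, and the threshold uses a simple rank-based percentile, which may differ slightly from the definition used in the actual pipeline.

```python
def perplexity_filter(samples, perplexities, keep_fraction):
    """Keep the samples whose perplexity falls within the lowest keep_fraction.

    Uses a rank-based threshold (an assumption); ties at the threshold
    are kept, so slightly more than keep_fraction may be returned.
    """
    if not samples:
        return []
    ranked = sorted(perplexities)
    cutoff_index = max(int(len(ranked) * keep_fraction) - 1, 0)
    threshold = ranked[cutoff_index]
    return [s for s, p in zip(samples, perplexities) if p <= threshold]


# Hypothetical documents with precomputed perplexities
documents = ["d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7"]
scores = [310.0, 95.0, 1200.0, 150.0, 480.0, 88.0, 2300.0, 640.0]
print(perplexity_filter(documents, scores, keep_fraction=0.25))
```

Lower perplexity under a model trained on Wikipedia introductions is used here as a proxy for text quality, so the filter retains the samples that the reference model finds most predictable.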

We perform deduplication leveraging the suffix array-based approach proposed by Lee et al. . We remove any document with at least a 100-character span overlapping with any other document in the corpus. We deduplicate the full C4 dataset. In the case of OSCAR, the memory requirements of the deduplication procedure make performing the full dataset deduplication infeasible. Instead, we select a 25% subset of the full OSCAR and build a suffix array for this subset. We experiment with leveraging the 25% OSCAR suffix array in two ways. First, we deduplicate the selected subset. This is very strict and preserves less than 5% of the full OSCAR. Subsequently, we use the 25% suffix array to deduplicate the full OSCAR, i.e. we remove any document which has at least a 100-character span overlapping with the 25% subset we selected. This is more permissive and allows us to preserve 31% of the original dataset. We refer to the latter as expanded in Table 12 and it is used for the training of the 4.2 billion parameter model in Table 14, while the smaller deduplicated version of OSCAR is used for the 2.8 billion parameter model.

In addition, we benchmark with the filtering procedure from the ROOTS corpus . It applies the following set of filters:

Discarding documents with overly repeated character- and word-n-grams

Discarding documents with too many special characters

Discarding documents with too few grammatical function words (e.g. “of”, “and”)

Discarding documents with too many flagged words

Discarding documents with a low fasttext language identification score

Appendix O Detailed Filtering Results

In Table 13, we report detailed perplexity filtering results on C4 and OSCAR. For C4, perplexity filtering is only effective at 4.2B parameters. Meanwhile, for OSCAR, which is noisier than C4, perplexity filtering seems effective both for 2.8B and 4.2B parameters. Table 14 contains deduplication results and results for the ROOTS filter. Deduplication does not improve downstream performance for C4 while being effective for OSCAR which has significantly more noise. Applying the ROOTS filter on OSCAR is not better than the unfiltered OSCAR on our benchmark, but might have other beneficial effects, such as reducing obscenity, templated messages, or repetition, depending on the final use case.

Appendix P Loss Curves for Complementary Strategies

To compare complementary data strategies in §7, we have used downstream performance on natural language tasks detailed in Appendix K instead of loss. This is because validation loss gives an unfair advantage to models trained on a larger fraction of data from the same distribution. For example, when making up for missing natural language data with code, models that are trained on more code will have better validation loss on code data while having worse loss on the natural language data as seen in Figure 19: The model pre-trained on 90% of Python code data and 10% of C4 has the highest C4 validation loss, but the lowest Python validation loss.

Models trained on deduplicated or perplexity-filtered data have higher validation loss as the held-out validation data has not gone through the same filtering steps. Thus, its distribution more closely resembles the training data of models trained on the unfiltered data resulting in worse validation loss for the two filtering strategies in Figure 20 (left). Meanwhile, for training loss in Figure 20 (right) the model trained on perplexity-filtered data has the lowest loss. Its training data has been filtered to the top 25% of examples with the lowest perplexity (Appendix N) thus high loss examples have been explicitly filtered out from the training data resulting in low training loss. The model trained on deduplicated data has the highest validation and training loss. This is because commonly repeated sequences have been filtered out from its training data. Thus, when encountering these common sequences in the unfiltered validation set, its loss is comparatively high as other models have likely simply memorized them. Similarly, fewer repeated sequences during training results in higher training loss as unseen sequences are harder to predict.

Appendix Q Limitations and Future Work

In this work we focus on repeating the entire unique dataset for several epochs. Alternatively, one can repeat only a fraction of the dataset. For example, repeating 10% of the dataset for 10 epochs while repeating the rest only for a single epoch as done by Hernandez et al. . To predict loss in that scenario, one may need to adapt our scaling laws with an additional parameter to account for the fraction that is repeated and possibly a parameter that captures at what point in training the data is repeated. Repeating earlier in training when most model weights are still randomly initialized is likely to cause less damage than later in training. Adapting our parametric fit to make concrete scaling predictions for such scenarios is an exciting future research direction.

The returns from additional epochs may heavily depend on hyperparameters such as learning rate, dropout, or the optimizer choice. It is likely that increasing the learning rate, for example, would lead to diminishing returns from additional epochs kicking in earlier. In this work, we have fixed most hyperparameters to commonly used values for the training of LLMs and leave such explorations to future work.

The optimal data strategy is dependent on the dataset at hand and we cannot give universally applicable filtering recommendations. By looking into C4 and OSCAR, we have covered two of the most commonly used English text datasets. Our findings on both datasets were overall in agreement with each other. We have highlighted some of the differences, such as deduplication being more effective on OSCAR due to it being more noisy than C4. Further, we have focused on large-scale pre-training datasets. There is a lot of research on the optimal fine-tuning dataset and methodology for LLMs . More investigations of resolving data-constraints when fine-tuning LLMs may be of interest for future work.

Our work focuses on text datasets and uses the GPT transformer architecture . Prior work has experimented with many variations to the GPT or transformer architecture , as well as scaling laws for non-text datasets . Overall, variations of the GPT or transformer architecture have proven very robust and generalizable to other domains . Nonetheless, it may be of interest for future work to test the applicability of our findings in this work to different data modalities or model architectures.

There are numerous strategies to solve data constraints not covered in this work that are worth exploring. Like we have shown for Python, future research may consider to what extent augmenting with a natural language (e.g. Chinese) improves performance in another language (e.g. English) and what is the best language to choose . Similarly, while we have looked at deduplication and perplexity filtering, other filtering strategies, such as popularity-based filters and toxicity filters are worth exploring.

Appendix R Contributions

Niklas Muennighoff led experiments, analysis, writing, and the overall project. He implemented, trained and evaluated all models.

Alexander M. Rush contributed to framing, results analysis, and paper writing.

Boaz Barak contributed to formal and experimental analysis as well as paper writing.

Teven Le Scao provided guidance, led data choices and preprocessing, and contributed to framing and writing.

Aleksandra Piktus created perplexity and deduplication datasets and contributed to writing.

Nouamane Tazi contributed to enabling high-performance training on AMD hardware.

Sampo Pyysalo contributed to enabling high-performance training and early repetition experiments.

Thomas Wolf provided guidance on experimental design and contributed to paper writing.

Colin Raffel provided guidance on experimental design and contributed to paper writing.

Appendix S Hyperparameters and Setup

For all training runs we use 1% of tokens for linear warm-up of the learning rate to a maximum learning rate of 2e-4 that is decayed to 2e-5 following a cosine schedule. We use a batch size of 256 for models with fewer than 2 billion parameters, 512 for models with 2 - 5 billion parameters and 1024 for models with more than 5 billion parameters. All models are trained in bfloat16 precision using the Adam optimizer with $eps=1e-8$ , $beta1=0.9$ . For $beta2$ , we found a value of $0.95$ to result in slightly lower final loss and fewer loss spikes than the default value of $0.999$ in implementations such as PyTorch. However, except for models with FLOP budgets of $C=9.3\times 10^{20}$ and $2.1\times 10^{21}$ , we always use $beta2=0.999$ . We use a dropout rate of $0.1$ , a weight decay rate of $0.1$ and clip gradients at $1.0$ . These hyperparameter choices are largely based on prior work and performance on test runs. As none of our hyperparameter choices is particularly exotic, we expect our setup to generalize to many other setups. In LABEL:tab:all_models we list the model architectures we use. They are an extended version of the architectures from . We calculate model parameters following , which includes embedding parameters:

$$P = 12lh^{2}\left(1+\frac{13}{12h}+\frac{V+s}{12lh}\right)$$

where $P$ is the final parameter count, $l$ is the number of layers, $h$ is the hidden dimension, $V=50257$ is the vocabulary size and $s=2048$ is the sequence length. We find the parameter counts reported in Chinchilla to differ significantly from our calculations, especially at larger scales. We report both in the table of model architectures (tab:all_models), but we use our parameter estimates everywhere in this work. Further, we have corrected the number of heads of the 3,530 and 4,084 million parameter models from Chinchilla to obey the relationship $d_{\text{model}}=\text{kv\_size}\cdot n_{\text{heads}}$.
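As a rough illustration of the setup described in this appendix, the following Python sketch encodes the learning-rate schedule, the batch-size rule, and a parameter count. The function names are hypothetical and not taken from our training code. Two things are assumptions rather than statements from the text: that the cosine decay spans the 99% of tokens remaining after warm-up, and that the parameter count uses the Megatron-LM-style closed form $P = 12lh^{2}(1 + 13/(12h) + (V+s)/(12lh))$.

```python
import math

# Values quoted from the appendix text.
MAX_LR = 2e-4
MIN_LR = 2e-5
WARMUP_FRACTION = 0.01
VOCAB_SIZE = 50257
SEQ_LEN = 2048


def learning_rate(tokens_seen: int, total_tokens: int) -> float:
    """Linear warm-up over the first 1% of tokens, then cosine decay to MIN_LR."""
    warmup_tokens = WARMUP_FRACTION * total_tokens
    if tokens_seen < warmup_tokens:
        return MAX_LR * tokens_seen / warmup_tokens
    # Assumption: the cosine spans the tokens remaining after warm-up.
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


def batch_size(num_params: float) -> int:
    """Batch size by model scale: <2B -> 256, 2B to 5B -> 512, >5B -> 1024."""
    if num_params < 2e9:
        return 256
    if num_params <= 5e9:
        return 512
    return 1024


def parameter_count(layers: int, hidden: int,
                    vocab: int = VOCAB_SIZE, seq_len: int = SEQ_LEN) -> int:
    """Assumed closed form: P = 12*l*h^2 * (1 + 13/(12h) + (V+s)/(12*l*h))."""
    # Expanding the product gives an exact integer expression:
    # 12*l*h^2 + 13*l*h + (V+s)*h
    return 12 * layers * hidden**2 + 13 * layers * hidden + (vocab + seq_len) * hidden
```

For example, `batch_size(parameter_count(layers, hidden))` picks the batch size for a given architecture. The `layers` and `hidden` values should be read from the architecture table, not guessed.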

To train our models, we have forked the Megatron-DeepSpeed framework and adapted it for ROCm to enable training on AMD GPUs. We have made our training code publicly available at https://github.com/TurkuNLP/Megatron-DeepSpeed. Models are trained using data, tensor and pipeline parallelism on up to 256 AMD Instinct MI250X GPUs distributed across up to 64 nodes on the LUMI supercomputer located in Finland. As of June 2023, LUMI is the largest supercomputer in Europe and ranks third worldwide with a performance of around 310 PFLOP/s (https://www.top500.org/lists/top500/2023/06/). We trained models in parallel using up to 2,200 nodes at a single point in time (equivalent to around 8,800 GPUs, 17,600 GCDs, or 86% of all GPUs on LUMI). In total, we used around 3 million GPU hours. The cluster is powered entirely by renewable energy (hydroelectricity), and its waste heat is used to heat the nearby city, reducing the city's carbon emissions by up to 20%. Thanks to the low temperatures in Finland, relatively little cooling is required for the cluster, further reducing its impact on the environment. As of June 2023, it ranks as the seventh greenest supercomputer (https://www.top500.org/lists/green500/2023/06/).
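The hardware figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text. The variable names are illustrative, and the implied total GPU count is derived from the rounded 86% share, so it is only approximate.

```python
# Figures quoted in the text.
PEAK_NODES = 2_200
PEAK_GPUS = 8_800
PEAK_GCDS = 17_600
SHARE_OF_LUMI_GPUS = 0.86
MAX_GPUS_PER_RUN = 256
MAX_NODES_PER_RUN = 64

gpus_per_node = PEAK_GPUS // PEAK_NODES   # 4 GPUs per node
gcds_per_gpu = PEAK_GCDS // PEAK_GPUS     # 2 GCDs per GPU

# The per-run maximum is consistent with the same node layout.
assert MAX_GPUS_PER_RUN // MAX_NODES_PER_RUN == gpus_per_node

# Approximate total GPU count implied by the rounded 86% share.
implied_total_gpus = PEAK_GPUS / SHARE_OF_LUMI_GPUS
print(gpus_per_node, gcds_per_gpu, round(implied_total_gpus))
```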

Appendix T Prompts and Samples

The following figures illustrate the prompts with samples from each evaluation dataset. Prompts stem from PromptSource or GPT-3. All data shown in this section comes from the ground-truth datasets; no model generations are shown.

Appendix U Other Experiments

We experimented with the UL2 objective for a causal model but did not find it to outperform regular causal language modeling on our evaluation tasks. This may stem from UL2 being better suited to encoder-decoder models or from mistakes in our UL2 implementation.

We have also trained several models on The Pile and found trends similar to those for OSCAR and C4. We make these models publicly available.

Appendix V Release of Artifacts

We open-source all of our models and code under Apache 2.0 licenses. Our filtered datasets are released with the same licenses as the datasets they stem from. All material can be found at: https://github.com/huggingface/datablations.

Appendix W Version Control

Added comparison of different fits in terms of loss and $R^{2}$ in Table 1

Added loss curves of complementary strategies in Appendix P

Fixed OSCAR validation plot in Appendix I

Clarified the usage of smoothing in training and validation plots in Appendix K

Added experiments decaying alpha and beta to allow excess epochs or parameters to hurt in Appendix F

Added more details on the calculation of $U_{N}$ given $U_{D}$ in Appendix A

Added hyperparameter sensitivity limitation in Appendix Q

Added more detail on how score normalization is done in Appendix K

Mentioned modification of number of heads in Appendix S

Appendix X Broader Impacts

Large language models carry potential risks such as outputting offensive language, propagating social biases, and leaking private information. By publicly releasing all of our models and providing new insights to improve the scaling of LLMs, we may contribute to the further proliferation of these harms. However, we note that there are already much larger and more capable models freely available that can be used in such harmful ways. Thus, we consider the benefits of the open-source release of our models and research to significantly outweigh its downsides.