SlimPajama-DC: Understanding Data Combinations for LLM Training
Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing
Introduction
The success of modern large-scale models is deeply rooted in their training data. For large language models, the emphasis is not merely on generic text but on “diverse text”. To guarantee the model’s linguistic expertise and its comprehensive understanding of the world, this text must span a broad spectrum of domains, genres, languages, and more. Consequently, the composition of the pretraining data domains, such as Github, Wikipedia, books, and web text like CommonCrawl, plays a critical role in the performance of large language models. In our research, we delve into the domain/source weightings of training data. Leveraging SlimPajama-DC, we investigate two primary areas: (1) global-level and local-level deduplication, and (2) the efficacy of various combinations of thoroughly deduplicated datasets. The first allows the model to be trained on all sources, since no cross-domain overlaps remain after global deduplication, and the second helps us understand how to manage the integration and proportions of diverse domains, especially as datasets for LLM training continue to expand in variety.
Generic Deduplication. Multi-source datasets often combine data from various origins, each with its unique distribution of information. When training large language models, handling data redundancy is critical to ensure that the model generalizes well and does not exhibit undue biases. Highly deduplicated datasets ensure that the model isn’t repeatedly exposed to the same or very similar data points, which makes training faster and more efficient. Redundant data can slow down convergence and might make the model overfit to frequently seen patterns, so deduplication also helps the model’s capacity to be used efficiently. In general, deduplication is the process of removing duplicate data to address this redundancy.
Global Deduplication vs. Local Deduplication. The global deduplication process removes duplicates from the entire combined dataset. When we’re using data from multiple sources, there might be overlaps across sources. Global deduplication identifies and removes these overlapping instances irrespective of their source. In local deduplication, duplicates are removed within each individual source dataset before merging them. However, if two source datasets have overlapping data, those duplicates will still be present in the final combined dataset since deduplication was only done locally within each dataset. In most current open-source LLM training data, only local deduplication is performed within each data source, which neglects the redundancy across the different sources. Because cross-source duplicates survive local deduplication, the global deduplication performed in SlimPajama is generally preferable for training large language models, especially when using multi-source datasets. It ensures a balanced representation of information and prevents the pitfalls associated with data redundancy. However, this strategy naturally requires more hardware memory.
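The difference between the two strategies can be illustrated with a small sketch. It uses exact-match hashing as a stand-in for fuzzy matching, and the function names and toy corpus are hypothetical, chosen only for illustration.

```python
import hashlib


def fingerprint(doc):
    """Exact-match fingerprint (an assumption; real pipelines use fuzzy matching)."""
    return hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()


def local_dedup(sources):
    """Remove duplicates inside each source independently."""
    result = {}
    for name, docs in sources.items():
        seen = set()  # a fresh set per source: cross-source overlap is invisible
        kept = []
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources):
    """Remove duplicates within and across sources using one shared set."""
    seen = set()  # shared across all sources
    result = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


def total_docs(sources):
    return sum(len(docs) for docs in sources.values())


# Hypothetical toy corpus: "hello world" appears in both sources.
toy = {
    "commoncrawl": ["the cat sat", "hello world", "hello world"],
    "github": ["print('hi')", "hello world"],
}
print(total_docs(local_dedup(toy)), total_docs(global_dedup(toy)))
```

The shared `seen` set is also why the global variant needs more memory: it must hold fingerprints for the whole combined corpus at once.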
Different Combinations of Highly-deduplicated Datasets. A model trained on diverse data is more likely to generalize well across various tasks. It’s exposed to a wider range of vocabulary, syntax, and semantics, enabling it to handle a broad scope of queries. If diverse sources are chosen such that they represent different cultures, beliefs, and demographics, the model might be more balanced and less prone to biases. However, if many sources share common biases, the final dataset might amplify them. Different sources can provide both a breadth and depth of knowledge on various topics. Combining a technical dataset with a general news dataset, for example, would allow the model to understand both in-depth technical details and broad general knowledge. It’s crucial to note that data quality often outweighs the quantity. In this work, we aim to shed light on this fascinating perspective of comprehensive data combination on SlimPajama.
Specialization vs. Generalization Trade-off. In general, combining many specialized datasets can lead to a jack-of-all-trades model, which might not be as adept at specific tasks as a model trained on a specialized dataset. While the model can tackle a wide range of tasks, it might not have the depth of understanding that a specialized model might have for a particular domain. In this study, we also explore specialization and generalization ability using both individual and combined data sources.
The remainder of this paper is organized as follows. In Section 2, we elaborate on the dataset statistics, token distributions, and data processing procedure. Section 3 describes the dataset combination configurations for this SlimPajama-DC study. Our model architecture and training details are provided in Section 4, followed by the results and analysis in Section 5 on a range of tasks in the zero- and few-shot settings. Section 6 presents an application of efficient Large Batch-size (LBS) training on a 7B model. Section 7 reviews related work and Section 8 concludes this study.
Dataset Overview
SlimPajama has a total of 627B tokens across different domains, as shown in Table 1. It includes validation and test sets with 500M tokens each, and these have been cleaned to ensure no overlap with the training data. For the SlimPajama-DC study, our entire training dataset for each configuration contains 330B tokens after tokenization which is carefully selected from the original SlimPajama dataset. We tested different sampling strategies for different domains of our training data: (1) each token is trained only once during training, such as Commoncrawl, and (2) we perform more than one epoch for training on particular sources, such as the Wikipedia and Github domains. The detailed domain source proportions of various combinations are shown in Table 3.
2 Dataset Token Frequency Statistics
To examine the similarity between various datasets in SlimPajama, we calculate the KL divergence between two domain distributions of token counts from different datasets, as shown in Fig. 1a. Given that distinct datasets may emphasize dissimilar token types, we subsequently delve into the differences in the distribution of these datasets across token subsets exhibiting distinct characteristics: (1) Tokens exclusively comprising letters (Fig. 1b); (2) The union set of tokens with the top 1000 frequencies on each dataset (Fig. 1c); (3) Numbers and commonly used operators, like ‘30’, ‘+’ and ‘=’ (Fig. 1d); (4) Whitespace tokens, like ‘\n\n’ and ‘\t’ (Fig. 1e); (5) Non-alphanumeric tokens, like ‘#’ and ‘====’ (Fig. 1f).
There exists a degree of similarity in the distribution of different token subsets among RefinedWeb, Book, C4, and CommonCrawl, as well as between Github and StackExchange. Notably, when it comes to the distribution of non-alphanumeric tokens, Arxiv differs significantly from most datasets. On the distribution of whitespace tokens, RefinedWeb shows notable distinctions in comparison to Github and StackExchange. Among numbers and commonly used operators, the distribution of all datasets is relatively consistent.
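The KL-divergence comparison of token-count distributions described above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization in place of the actual BPE tokenizer and a small additive smoothing constant `eps` so that tokens missing from one dataset do not produce a division by zero; both choices are assumptions, not details from the paper.

```python
import math
from collections import Counter


def token_distribution(tokens, vocab, eps=1e-9):
    """Smoothed relative frequency of each vocabulary token."""
    counts = Counter(tokens)
    total = sum(counts[t] + eps for t in vocab)
    return {t: (counts[t] + eps) / total for t in vocab}


def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary; note that it is not symmetric."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


# Hypothetical toy "datasets".
web = "the cat sat on the mat".split()
code = "def f ( x ) : return x + 1".split()

vocab = sorted(set(web) | set(code))
p_web = token_distribution(web, vocab)
p_code = token_distribution(code, vocab)

print(kl_divergence(p_web, p_code), kl_divergence(p_code, p_web))
```

Restricting `vocab` to a subset (for example only whitespace tokens) gives the per-subset comparisons of the kind listed above.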
3 Dataset Processing Procedure
SlimPajama was created by filtering low-length documents and applying MinHashLSH deduplication to the 1.2T token RedPajama dataset to reduce it to 627B tokens. RefinedWeb shows that training on deduplicated data improves training compute efficiency and decreases the chance of LLMs generating memorized text from the dataset. By removing duplicate and low-length examples, SlimPajama ultimately improves training compute efficiency and model performance. An overview of the SlimPajama preprocessing pipeline is shown in Fig. 2, and the preprocessing code is available at https://github.com/Cerebras/modelzoo.
Additional global filtering is performed to remove short, low-quality documents. After removing punctuation, consecutive spaces, newlines, tabs, and leading or trailing escape characters, documents with fewer than 200 characters were filtered out. These documents typically contain only metadata and no useful information. The low-length filter was applied to every corpus other than Books and GitHub, where short documents were found to be useful. The percentage of documents filtered out from each corpus within the SlimPajama dataset is detailed in Table 2. In total, this additional step removed 1.86% of the documents.
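A minimal sketch of such a low-length filter is given below. The 200-character threshold comes from the text; the exact normalization rules (ASCII punctuation only, all whitespace runs collapsed to one space) and the function names are assumptions made for illustration and may differ from the released preprocessing code.

```python
import re
import string

# Translation table that deletes ASCII punctuation (an assumption about
# which characters count as punctuation).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Strip punctuation, collapse whitespace runs, and trim both ends."""
    text = text.translate(_PUNCT_TABLE)
    text = re.sub(r"\s+", " ", text)  # spaces, newlines, and tabs
    return text.strip()


def keep_document(text, min_chars=200):
    """Keep a document only if its normalized form has at least min_chars characters."""
    return len(normalize(text)) >= min_chars


print(normalize("Hello,   world!\n\n"))
print(keep_document("a" * 200), keep_document("a" * 199))
```

Measuring length after normalization means that a document consisting mostly of markup or separators is dropped even when its raw length exceeds the threshold.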
3.2 Global Deduplication
When building SlimPajama, it was observed that every corpus included in it contained duplicates, with the most significant duplication found in CommonCrawl and GitHub. RefinedWeb also found similar rates of deduplication in the CommonCrawl data. It is most common to perform deduplication within each dataset source separately to reduce implementation complexity and meet resource constraints. This local deduplication approach cannot remove overlap between data sources, which can be significant for web-scraped data. Global deduplication, in contrast, removes duplication within and between data sources. Following prior work, global-level deduplication is performed using the MinHashLSH algorithm. To facilitate global deduplication efforts and reproducibility for other researchers, a tool designed for scalable performance is offered at the link above.
Specifically, global MinHashLSH deduplication is performed using a Jaccard similarity threshold of 0.8, document signatures constructed with preprocessed lowercase 13-grams, and a schema following prior work. To unify the representation of the same content, punctuation, consecutive spaces, newlines, tabs, and leading or trailing escape characters are removed. The level of deduplication performed per data source is presented in Table 2. The initial implementation of MinHashLSH did not scale to trillion-token datasets like RedPajama without running out of memory. This is overcome by optimizing the memory usage and parallelization to perform deduplication on 64 CPU cores with 1.4TB peak memory usage, which can be easily decreased by creating multiple MinHashLSH objects to query.
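The signature construction can be sketched in pure Python as follows. The 13-gram width and the 0.8 threshold come from the text. Treating the n-grams as character n-grams, using 128 hash functions, and using salted BLAKE2b hashes are all assumptions for illustration. The sketch also compares signatures pairwise and omits the LSH banding step that makes the real procedure scale.

```python
import hashlib
import re
import string

NGRAM = 13        # n-gram width stated in the text
THRESHOLD = 0.8   # Jaccard similarity threshold stated in the text
NUM_PERM = 128    # number of hash functions (assumption, not from the text)

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def shingles(text, n=NGRAM):
    """Set of lowercase character n-grams of the normalized text."""
    text = re.sub(r"\s+", " ", text.lower().translate(_PUNCT_TABLE)).strip()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def minhash_signature(shingle_set, num_perm=NUM_PERM):
    """One minimum per salted hash function; the salt plays the role of a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, which estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(doc_a, doc_b, threshold=THRESHOLD):
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold


doc_a = "The quick brown fox jumps over the lazy dog."
doc_b = "the quick  brown fox JUMPS over the lazy dog"
doc_c = "Completely unrelated text about training language models."
print(is_near_duplicate(doc_a, doc_b), is_near_duplicate(doc_a, doc_c))
```

Because normalization happens before shingling, `doc_a` and `doc_b` map to the same shingle set even though their raw strings differ in case, spacing, and punctuation.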
Dataset Combination Configurations
Combination Strategies. As shown in Table 3, the adjusted domain weights establish a new training distribution. Using this distribution, we adopt a standard training approach to learn a consistent model architecture. This architecture remains unchanged across various domain weights and is trained using data from diverse combination distributions. Across different setups, we maintain the total training tokens to be the same. Our examination of domain weights in large language model training focuses on three main areas: 1) Incrementally increasing the diversity of source combinations, as seen in configurations 1, 2, and 3. 2) With consistent data sources, we explore varying domain proportions as presented in configurations 2, 4, and 5. 3) We assess the significance of individual domain sources concerning the final model’s performance. Note that given the minimal impact of ArXiv and StackExchange, we have opted to omit them from the ablations in configuration 3 to conserve training resources and keep relatively sufficient training tokens for CommonCrawl. The detailed configurations are as follows:
Configuration-1: 330B CommonCrawl
Configuration-2: 300B CommonCrawl + 30B Github
Configuration-3: 250B CommonCrawl + 30B Github + 26B Books + 24B Wikipedia
Configuration-4: 250B CommonCrawl + 80B Github (adjust sampling proportion)
Configuration-5: 250B CommonCrawl + 80B Wikipedia (adjust sampling proportion)
Configuration-6: 330B RefinedWeb CommonCrawl
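The token budgets above translate directly into domain sampling weights. The following sketch records the configurations as a dictionary and normalizes each one. The DC-1 entry (330B SlimPajama CommonCrawl) is inferred from the description of DC-1 in the results section together with the fixed 330B budget; the variable and function names are hypothetical.

```python
# Token budgets in billions of tokens per configuration.
CONFIGS_B_TOKENS = {
    "DC-1": {"CommonCrawl": 330},  # inferred: CommonCrawl only, 330B total
    "DC-2": {"CommonCrawl": 300, "Github": 30},
    "DC-3": {"CommonCrawl": 250, "Github": 30, "Books": 26, "Wikipedia": 24},
    "DC-4": {"CommonCrawl": 250, "Github": 80},
    "DC-5": {"CommonCrawl": 250, "Wikipedia": 80},
    "DC-6": {"RefinedWeb": 330},
}


def domain_weights(mix):
    """Normalize token budgets into sampling proportions that sum to 1."""
    total = sum(mix.values())
    return {domain: tokens / total for domain, tokens in mix.items()}


for name, mix in CONFIGS_B_TOKENS.items():
    weights = domain_weights(mix)
    print(name, sum(mix.values()), {d: round(w, 3) for d, w in weights.items()})
```

Every configuration sums to the same 330B budget, so differences between runs come from the mixture rather than from the amount of data.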
2 RefinedWeb
RefinedWeb is a massive English web dataset that is constructed using rigorous filtering and extensive deduplication of CommonCrawl. We use it as the comparison to our SlimPajama-DC CommonCrawl-only training.
Network Architecture and Training Details
Cerebras-GPT Architecture. The Cerebras-GPT architecture shares similarities with those built on GPT-3, particularly in the use of an autoregressive transformer decoder. However, a key difference lies in the attention mechanism employed. While GPT-3 utilizes a mix of dense and sparse-banded attention, Cerebras-GPT consistently uses dense attention across all decoder blocks. In terms of model dimensions, we either adhere to an aspect ratio of approximately 80 ($d_{\text{model}}/n_{\text{layers}}$) or maintain dimensions that are congruent with GPT-3 models. Additionally, all of our models are trained to handle a maximum sequence length of 2,048 tokens. The detailed architecture is shown in Table 4.
Alibi. Alibi introduces a more streamlined and efficient positional approach called Attention with Linear Biases. Rather than adding positional embeddings to word embeddings, ALiBi applies a bias to query-key attention scores, penalizing them based on their distance.
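The distance penalty can be made concrete with a short sketch. It assumes the per-head slope schedule proposed in the ALiBi paper for head counts that are powers of two (a geometric sequence starting at 2^(-8/n)) and a causal mask; the head count of the models in this study is not stated here, so the value 8 below is only an example.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (assumes num_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Additive bias for causal attention scores.

    Entry [i][j] is -slope * (i - j) for keys at or before the query position,
    and -inf for future keys (causal mask).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)          # example head count, not from the paper
bias = alibi_bias(3, slopes[0])   # bias matrix for the first head
print(slopes[0], slopes[-1])
print(bias[2])
```

The bias is added to the query-key scores before the softmax, so more distant keys receive a larger penalty, and each head uses a different slope.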
SwiGLU. SwiGLU is an activation function which is a variant of GLU. The formulation is as follows:
$$\mathrm{SwiGLU}(x, W, V, b, c, \beta) = \mathrm{Swish}_{\beta}(xW + b) \otimes (xV + c)$$
where $x$ is a vector of the hidden representation at a particular position in the sequence. $W, V$ and $b, c$ are the matrices and bias vectors, respectively.
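As an illustration, the activation can be written in a few lines of dependency-free Python. The sketch assumes the standard SwiGLU form, Swish_β(xW + b) ⊗ (xV + c) with Swish_β(z) = z · sigmoid(βz); the helper names and the toy weights are hypothetical.

```python
import math


def swish(z, beta=1.0):
    """Swish_beta(z) = z * sigmoid(beta * z)."""
    return z / (1.0 + math.exp(-beta * z))


def affine(x, weight, bias):
    """Compute x @ weight + bias for a vector x and a d_in x d_out matrix."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def swiglu(x, W, V, b, c, beta=1.0):
    """Element-wise product of a Swish-gated projection and a linear projection."""
    gate = affine(x, W, b)
    value = affine(x, V, c)
    return [swish(g, beta) * v for g, v in zip(gate, value)]


# Toy example with identity projections and zero biases.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
out = swiglu([1.0, 2.0], identity, identity, zeros, zeros)
print(out)
```

With identity projections the output reduces to `swish(x) * x` per coordinate, which makes the gating behaviour easy to check by hand.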
2 Training Details
Tokenizer. We use an adapted GPT-NeoX BPE-based tokenizer similar to that used in GPT-2 for all of our experiments, which has a vocabulary size of 50,277. Our entire training dataset for each configuration contains 330B tokens after tokenization, and each model takes about 2.5 days on Cerebras 16 CS-2S cluster.
Optimizer. We employ the AdamW optimizer to train our models, adopting these specific hyper-parameters: $\beta_1$ = 0.9, $\beta_2$ = 0.95, and eps = 1.0e-08. Our chosen learning rate follows a linear scheduler, culminating in a final learning rate that’s 10% of its peak value. Additionally, we apply a weight decay of 0.1, limit the gradient using a clip value of 1.0, and implement a 150-step warmup.
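The learning-rate schedule described above can be sketched as a plain function of the step index. The 150-step warmup and the 10% final ratio come from the text; the linear shape of the warmup, the peak value, and the total step count used in the example are assumptions, since they are not given in this paragraph.

```python
def linear_lr(step, total_steps, peak_lr, warmup_steps=150, final_ratio=0.1):
    """Linear warmup to peak_lr, then linear decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        # Warmup shape is an assumption: ramp linearly up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    progress = min(progress, 1.0)
    return peak_lr * (1.0 - (1.0 - final_ratio) * progress)


# Hypothetical values for illustration only.
PEAK_LR = 1.0
TOTAL_STEPS = 1000
for step in (0, 149, 150, 575, 1000):
    print(step, linear_lr(step, TOTAL_STEPS, PEAK_LR))
```

Using a peak of 1.0 makes the printed values read directly as fractions of the peak learning rate.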
Other Hyperparameters. In our model, the filter size is 5,461, hidden size is 2,048 and attention dropout rate is 0. SwiGLU is used as the nonlinearity and alibi is used for position embedding. Mixed precision and bfloat16 are employed during model training. More hyperparameters are shown in Table 4.
Results and Analysis
This section presents the analytical experiments and results on different combinations of SlimPajama. We first discuss the results following Huggingface Leaderboard Evaluation. Then, we demonstrate the importance of global deduplication and a diverse range of data sources in enhancing LLM’s performance by conducting additional comprehensive evaluations across various topics. Finally, we visualize the training loss curves of different data domain combinations and provide insights on how they connect to the models’ performance.
Following the Huggingface Leaderboard Evaluation, we also assess our models on four key benchmarks using the Eleuther AI Language Model Evaluation Harness. This unified framework facilitates the evaluation of generative language models across a broad scope of tasks. Specifically, our tests comprised:
1) AI2 Reasoning Challenge (25-shot): This entails a series of grade-school level science questions.
2) HellaSwag (10-shot): This benchmark gauges commonsense inference. While straightforward for humans, with an average accuracy of 95%, it poses challenges for state-of-the-art models.
3) MMLU (5-shot): Designed to assess a text model’s multitask proficiency, this test spans 57 diverse tasks, including elementary mathematics, US history, computer science, and law, among others.
4) TruthfulQA (0-shot): This evaluates a model’s inclination to echo inaccurate information frequently encountered online. However, it’s pertinent to note that within the Harness, TruthfulQA is essentially a 6-shot task, as it consistently commences with six examples, even when initialized with zero for the number of few-shot examples.
As shown in Table 5, with the exception of DC-5, our average results are all better than RedPajama-1.3B which is also trained on 330B tokens. Among our combinations, the DC-1 (which relies solely on SlimPajama Commoncrawl) achieves the highest scores for ARC and MMLU among all tested configurations. Yet, its performance on TruthfulQA ranks at the bottom. On the other hand, DC-3 obtains the top average accuracy across all SlimPajama data combinations, while DC-6 stands out with the best results on HellaSwag and superior average performance across the board. A potential strategy to harness the strengths of each configuration might involve a sequential training process on DC-1, DC-3, and DC-6.
Furthermore, SlimPajama is built using global deduplication across all sources. This suggests that merging all domains typically yields better results than selective combinations, given the absence of overlaps among different domain datasets. This also highlights the importance of global deduplication and a diverse range of data sources in enhancing LLM overall performance.
2 More Evaluations
As shown in Table 6, we present additional evaluations across various domains to investigate the fine-grained capabilities offered by different data combinations. Except for DC-6 (model trained on RefinedWeb data), incorporating more sources, such as DC-3, typically leads to improved average performance.
Upon analysis, we find that specific mixtures excel in particular evaluation benchmarks. For example, DC-1 obtains the highest accuracy on arc challenge and race. Meanwhile, DC-3 outperforms others on wsc273, swag, and pawsx, and DC-5 emerges as the top performer on the xstory cloze evaluation. Moreover, all of our configurations achieve better average performance than the GPT-neo-1.3B and RedPajama-1.3B baselines.
where $i$ is the index of the sub-item in MMLU and $N$ is the number of items in MMLU. This metric uses the deviation of sub-item scores from the 25% baseline to assess the extent to which a model’s predictions resemble random guessing on the MMLU benchmark. The metric has three variations: (1) Consider only items with scores exceeding 25%. (2) Focus solely on items with scores less than 25%. (3) Include all items and sum them up. The results are shown in Table 7. Generally, a model with a higher MMLU average score will have a lower risk of random guessing.
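One way to compute a score of this kind is sketched below. The defining equation is not reproduced in the text above, so the form used here is an assumption: one minus the mean absolute deviation of sub-item accuracies from the 0.25 baseline, normalized by the total number of items in all three variants. The function name and the example scores are hypothetical.

```python
def random_guess_risk(scores, baseline=0.25, subset="all"):
    """Assumed form: 1 - (1/N) * sum(|s_i - baseline|) over the chosen items.

    subset selects which items contribute: "above" (scores > baseline),
    "below" (scores < baseline), or "all". N is always the total item count,
    so the "above" and "below" deviations add up to the "all" deviation.
    """
    if subset == "above":
        items = [s for s in scores if s > baseline]
    elif subset == "below":
        items = [s for s in scores if s < baseline]
    elif subset == "all":
        items = list(scores)
    else:
        raise ValueError(f"unknown subset: {subset!r}")
    deviation = sum(abs(s - baseline) for s in items) / len(scores)
    return 1.0 - deviation


# Hypothetical sub-item accuracies, not results from the paper.
example_scores = [0.35, 0.15, 0.25]
for subset in ("above", "below", "all"):
    print(subset, random_guess_risk(example_scores, subset=subset))
```

Under this assumed form, a value close to 1 means the sub-item accuracies sit near chance level, while lower values indicate predictions that deviate more from random guessing.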
It is also crucial to employ a broader and more diverse set of benchmarks, such as in Table 6. Additionally, for a detailed understanding, we have cataloged the complete MMLU results for every sub-item in Table 12. This offers a lens into the knowledge assimilated by the pretrained models within each sub-domain on this comprehensive benchmark.
3 Training Loss
Fig. 3 presents the training loss curves for various data combinations, from which several insights can be observed: 1) While DC-6 demonstrated the highest average accuracy in our quantitative evaluations, its training loss was also the most substantial. This suggests that a lower training loss doesn’t necessarily correlate directly with superior model performance. 2) DC-4, with a considerable portion of its data coming from code domain, exhibited the lowest training loss. This implies that as the amount of code in training increases, the training loss diminishes. 3) The training loss values for other combinations appeared to be relatively consistent with one another.
Application: Large Batch-size Training on 7B
Our 7B large batch size (LBS) training dataset is primarily based on SlimPajama. However, to obtain a sufficient proportion of web text, we have incorporated additional web data from the Commoncrawl corpus in RedPajama. We have also adjusted the proportions of various data sources in line with our 1.3B model training. For instance, we elevate the sampling frequency of Github and Wikipedia and increase the diversity of data sources by adding S2orc and Stack-Markdown following prior work, as detailed in Table 8. It’s crucial to understand that our primary focus is not solely on achieving the best performance. Instead, we place a higher emphasis on optimizing data combinations and ensuring the convergence of training large language models with large batch sizes. Consequently, we continue to utilize the SlimPajama/RedPajama Commoncrawl instead of the higher-quality RefinedWeb.
2 7B Model Training Configurations
Architecture. For the 7B model training, we adopt the MPT architecture with a maximum sequence length of 2,048. We use Triton with Flash Attention as the self-attention implementation. Alibi is enabled to make the model more flexible for input length extrapolation. The model’s total number of parameters is 6.7B.
Tokenizer. The tokenizer used for 7B training is an adapted GPT-NeoX-20b tokenizer. Following prior work, the model’s vocabulary size is adjusted to 50,432 to improve MFU (model FLOPs utilization) and to leave a few tokens available for use in subsequent training.
Optimizer. We employ the AdamW optimizer to train our models, adopting these specific hyper-parameters: $\beta_1$ set at 0.9 and $\beta_2$ at 0.95. We adopt a learning rate schedule that traces a cosine pattern, concluding with a learning rate that is 10% of its maximum value. Along with this, we use a multi-stage weight decay scheduler as described in Sec. 6.4, cap the gradient with a clipping value of 1.0, and use a warmup spanning 2,000 steps.
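A cosine schedule of the kind described above can be sketched as follows. The 2,000-step warmup and the 10% floor come from the text; the linear warmup shape, the peak value, and the total number of steps in the example are assumptions.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_steps=2000, final_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to final_ratio * peak_lr."""
    if step < warmup_steps:
        # Warmup shape is an assumption: ramp linearly up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    progress = min(progress, 1.0)
    min_lr = final_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values for illustration only.
for step in (0, 2000, 6000, 10000):
    print(step, cosine_lr(step, total_steps=10000, peak_lr=1.0))
```

Halfway through the decay phase the rate sits midway between the peak and the floor, which is the main visible difference from a linear decay only near the two ends of training.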
System and platform. For our 7B model training with a large batch size, we use 232 NVIDIA A100 GPUs (80G). We employ llm-foundry as the training platform. We use FSDP with activation checkpointing enabled to save memory consumption. We also use the automatic mixed precision of bf16 in training.
3 Fast Training with Large Batch-size
Large batch training allows a larger learning rate, leading to a faster convergence of large models. Also, utilizing a larger batch size can optimize hardware resource usage to make training procedures more efficient. Additionally, fewer batches are required, which further accelerates the training process. As shown in Table 9, our large batch training scheme achieves much higher throughput and mfu than LLaMA and MPT with fewer total training GPU hours. Overall, in a convex optimization framework, leveraging a larger portion of the dataset typically leads to enhanced results. However, for most large deep models that involve non-convex optimizations, the precise nature of the loss landscape remains elusive, making the scenario more intricate. Many prior works have noticed that training with larger batches often results in overfitting compared to those using smaller batch sizes for the same network. When utilizing large batch training, there is a propensity for the model to become stuck or even gravitate towards potential saddle points within the loss landscape. While large batch training methods often focus on the nearest relative minima they encounter, networks trained with smaller batches usually navigate the loss landscape more thoroughly before committing to an optimal minimum. The minima reached through large batch training can be distinctly different from those achieved with smaller batch training methods. In the following, we introduce an approach to mitigate overfitting when training large language models in a large batch-size scheme.
4 Progressive Training on Weight Decay
Prior work proposed applying dropout only in the early stages of training and deactivating it in subsequent phases. Models that incorporate this early dropout strategy tend to exhibit reduced final training loss compared to models that do not use dropout. In contrast to this, our approach emphasizes the role of weight decay during large model training. We introduce a novel training strategy for large language models, wherein the training process is segmented into various stages. Within each stage, a distinct weight decay is applied to the model to serve specific objectives. We’ve termed this approach Progressive Training on Weight Decay (PTWD). Owing to this methodology, our model, even when trained with a large batch size and very few iterations, achieves smooth convergence. As illustrated in Fig. 4, our training strategy consists of three distinct phases. Initially, we disable weight decay by setting it to zero and allow the model to train until full convergence is achieved. The model can usually reach a lower loss level within this stage than it would with weight decay, even if it slightly overfits. Following this, in the second phase, we introduce a substantial weight decay, with a value of 0.5 in our experiments, to suppress the overfitting. Once the loss values stabilize, we transition to the third phase, wherein a standard weight decay of 0.1 is applied, a value consistent with the training of many other LLMs. Intriguingly, each phase spontaneously converges within roughly 1/3 of the total training budget, ensuring effective allocation of the training budget throughout the process.
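The three-stage schedule can be sketched as a step-dependent lookup. The weight decay values (0.0, 0.5, 0.1) come from the text. The fixed boundaries at one third and two thirds of training are a simplifying assumption based on the observation that each phase takes roughly 1/3 of the budget; in the description above, transitions are triggered by convergence rather than by a fixed step count.

```python
# Weight decay per stage: no decay, strong decay, then the standard value.
PTWD_STAGES = (0.0, 0.5, 0.1)


def ptwd_weight_decay(step, total_steps):
    """Return the weight decay for this step, assuming three equal-length stages."""
    stage = min(3 * step // total_steps, len(PTWD_STAGES) - 1)
    return PTWD_STAGES[stage]


# Hypothetical training length for illustration only.
TOTAL_STEPS = 900
for step in (0, 299, 300, 599, 600, 899):
    print(step, ptwd_weight_decay(step, TOTAL_STEPS))
```

In a training loop, the returned value would be written into the optimizer's weight decay setting at the start of each step, so the optimizer state carries over unchanged between stages.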
5 Results of Pre-training and Instruction Tuning
The results from our pretraining and subsequent instruction tuning on the ShareGPT dataset are presented in Table 10. Notably, after instruction tuning, there is a significant enhancement in MMLU and TruthfulQA metrics. In contrast, the performance on ARC and HellaSwag decreases slightly. On the whole, the average accuracy improves substantially following instruction tuning. More evaluation results on the pretrained LBS model are provided in Table 6.
Related Work
RedPajama aims to develop open-source large language models and begins by replicating the LLaMA training dataset, which boasts over 1.2 trillion tokens. This collaborative effort involves entities such as Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and the MILA Québec AI Institute. SlimPajama is a highly deduplicated, multi-source, open-source dataset tailored for training large language models. This dataset emerged by refining and eliminating duplicates from the whole 1.2T token RedPajama dataset. Through meticulous filtering of subpar data and repetitive content, it reduced the dataset size by 49.6%, scaling it down from 1.2T to 627B tokens. SlimPajama provides superior quality and computational efficiency for training tasks compared with the original RedPajama dataset. Other efforts have also been made in this direction to construct diverse datasets, such as the Pile. It is an English text corpus of 825 GiB, which is designed for the training of large-scale language models with increased training dataset diversity to improve general cross-domain knowledge and downstream generalization capability. It contains a combination of 22 distinct, high-quality subsets. These subsets incorporate both pre-existing and freshly curated data, with a significant portion sourced from scholarly or professional domains.
2 Data Processing and Optimization Approaches
There have been several advancements in data processing and optimization. The seminal method of importance sampling stands out as a Monte Carlo approach designed to evaluate attributes of a particular distribution, even when the samples are drawn from a distribution that differs from the one under exploration. SlimPajama’s deduplication mechanism is an adaptation of importance sampling, incorporating a heuristic that values unique data points. Recently, several data selection frameworks have been introduced, inspired by the concept of importance sampling. Among them, DSIR presents a framework for the data selection challenge by aiming to choose a subset from a large, unlabeled raw dataset that aligns with a specific target distribution, given a set of unlabeled target examples. It builds upon the traditional importance resampling method, adapting it for data selection in large-scale models. DSIR operates as a scalable algorithm, determining importance weights within a reduced feature space and then selecting data based on these importance resampling weights. Another study delves into the relationship between error scaling and dataset size. Its theoretical exploration suggests that by using a robust data pruning metric, which prioritizes which training examples to remove, the proposed method can surpass traditional power law scaling, potentially reaching exponential scaling for pruned dataset sizes.
3 Data Combination for Training Large Language Models
The training of large language models, such as GPT and BERT, requires significant amounts of data to capture and generalize over the vast intricacies of human language. As a result, researchers often combine data from various sources, such as web text, Github, Books, ArXiv, Wikipedia, etc. Several related approaches and difficulties have been explored in the context of data combination for training large language models. (1) Concatenation of diverse datasets: One of the simplest methods for combining data is to concatenate various corpora, covering diverse topics, styles, and sources. This ensures that the model gets a broad view of the language. (2) WebText and similar corpora: For OpenAI’s GPT-2, a dataset called WebText was curated by scraping content from the internet. This kind of data provides a rich mix of formal, informal, factual, and opinionated text, thus offering diverse training material. (3) Balancing and weighting: Simply combining data may lead to issues if one source is overrepresented. Prior studies have applied weights to different data portions or ensured that the combined dataset is balanced in terms of sources, styles, and other criteria. For instance, DoReMi first trains a small proxy model using group distributionally robust optimization across domains, generating domain weights (or mixture proportions) without relying on information from downstream tasks. These domain weights are then used to resample a dataset, on which a full-size model is trained. (4) Multimodal Training: Combining text with other data forms, like images or sounds, can also enhance language model training, especially for tasks that require understanding across modalities.
4 Large Batch Training for Large Language Models
Large language models inherently possess a structure that supports parallelization, especially when optimized using techniques that allow for batch training. When computational resources permit, large batch sizes are favored to expedite the training of large models containing potentially millions or billions of parameters. At a fundamental level, larger batch sizes enhance the quality of each gradient update since they consider a larger portion of the dataset. Conversely, a smaller batch size means that model parameter updates are based on gradients derived from a limited dataset portion. This smaller dataset slice might not comprehensively capture the intricate relationships between features and labels. Therefore, it might seem that larger batch sizes consistently offer advantages in training. However, prior work pointed out that this perspective does not factor in the model’s capacity to generalize to new, unseen data, nor the intricate, non-convex optimization landscape of contemporary large models. In practice, multiple studies have demonstrated that while larger batch sizes might hasten convergence, they can impair a model’s generalization to new datasets, irrespective of the deep network type. This observed disparity has been named the Generalization Gap. A method to address this gap involves starting from a smaller batch size and gradually enlarging it as training advances. In our study, we explore this problem from a new angle: progressive weight decay training.
Conclusion
We have presented SlimPajama-DC, a comprehensive study on understanding the data domain weights and combinations for training large language models. Notably, SlimPajama-DC can operate on compact models, and its advantages can be seamlessly transferred to models that are several times larger. This leads to a remarkable acceleration in training on SlimPajama with the optimal sampling probabilities across domains for larger models. Through this, we aim to spark further exploration into data-centric methods to enhance the efficiency of large language model training.
Appendix
Appendix A Data Proportion Details
Appendix B MMLU
In this section, we provide the detailed item-by-item MMLU results, as shown in Table 12. It is interesting to notice that on some MMLU sub-domains, the results from our configured 1.3B models are even better than those of the GPT-3 175B and LLaMA2 7B models.