Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo

cs.CL

Introduction

Language models are now central to tackling myriad natural language processing tasks, including few-shot learning, summarization, question answering and more. Increasingly, the most powerful language models are built by a few organizations who withhold most model development details (Anthropic, 2023; OpenAI, 2023; Anil et al., 2023; Gemini Team et al., 2023). In particular, the composition of language model pretraining data is often vaguely stated, even in cases where the model itself is released for public use, such as LLaMA 2 (Touvron et al., 2023b). This hinders understanding of the effects of pretraining corpus composition on model capabilities and limitations, and therefore of the models themselves, with impacts on scientific progress as well as on the public who interfaces with these models. We instead target openness and transparency, releasing and documenting a dataset of three trillion tokens alongside tools to reproduce, scrutinize and expand on our work.

Our aim is to allow for more individuals and organizations to participate in language model research and development.

Data transparency helps developers and users of applications that rely on language models to make more informed decisions (Gebru et al., 2021). For example, increased prevalence of documents or terms in language model pretraining data has been linked to better performance on related tasks (Razeghi et al., 2022; Kandpal et al., 2023), and social biases in pretraining data (Feng et al., 2023; Navigli et al., 2023; Seshadri et al., 2023) may necessitate additional consideration in some domains.

Open pretraining data is necessary for analysis via empirical studies exploring how data composition influences model behavior, allowing the modeling community to interrogate and improve current data curation practices (Longpre et al., 2023; Gao, 2021; Elazar et al., 2023). Examples of this research include memorization (Carlini et al., 2022b; Chang et al., 2023), deduplication (Lee et al., 2022), adversarial attacks (Wallace et al., 2021), benchmark contamination (Magar and Schwartz, 2022), and training data attribution (Hammoudeh and Lowd, 2022; Grosse et al., 2023).

Access to data is required for successful development of open language models. For example, newer language models may offer functionality such as attribution of generations to pretraining data (Borgeaud et al., 2022).

To support broader participation and inquiry in these lines of research, we present Data for Open Language Models’ Appetite (Dolma), an open corpus of three trillion tokens designed to support language model pretraining research. Pretraining data mixes are often motivated by a desire to capture so-called “general-purpose” English. We source much of our data from sources similar to those present in past work, including a mix of web text from Common Crawl, scientific research from Semantic Scholar, code from GitHub, public domain books, social media posts from Reddit, and encyclopedic materials from Wikipedia. We compare our dataset to a variety of popular pretraining corpora that are available publicly, and find that Dolma offers a larger pool of tokens at comparable quality and with equally diverse data composition. Dolma has already been used to pretrain OLMo (Groeneveld et al., 2024), a family of state-of-the-art models designed to facilitate the science of language modeling.

In summary, our contributions are two-fold:

We release the Dolma Corpus, a diverse, multi-source collection of 3T tokens across 5B documents acquired from 7 different data sources that are (i) commonly seen in large-scale language model pretraining and (ii) accessible to the general public. Table 1 provides a high-level overview of the amount of data from each source.

We open source the Dolma Toolkit, a high-performance, portable tool designed to efficiently curate large datasets for language model pre-training. Through this toolkit, practitioners can reproduce our curation effort and develop their own data curation pipelines.

The remainder of this manuscript is organized as follows: we first describe the desiderata and design principles that guided the creation of Dolma (§2). We then document the methods applied to process the raw text (§3), including filters for language, “quality,” content filtering, and deduplication. Further processing was required to prepare Dolma for use as a pretraining corpus (§4), including benchmark decontamination and selecting a mixture rate. Throughout, we conduct ablation experiments, measuring domain fit through perplexity tracking and downstream performance on a set of twelve question-answering, common sense, and reasoning tasks. We conclude by discussing the process of releasing Dolma (§5).

Dolma Design Goals

To support large-scale LM pretraining research, we set four design requirements around openness, consistency with prior work, size, and risk mitigation. We discuss each in turn.

By matching data sources and methods used to create other language modeling corpora, to the extent they are known, we enable the broader research community to use our corpus and resulting model artifacts to study (and scrutinize) language models being developed today, even those developed behind closed doors. In this reproduction effort, we follow established practices (i.e., use data sources and techniques for preprocessing and filtering content that appears frequently across language modeling efforts) to the extent they are known, and defer to analysis, experimentation and educated guesses when best practice isn’t known or implementations differ in subtle ways. We note this reproduction effort does not seek to replicate specific language model pretraining data implementations. Instead, we reproduce a range of data curation themes. Notably, this also means scoping Dolma to English-only text to better leverage known curation practices and maximize generalizability of scientific work on Dolma to existing language models. Recognizing that this focus reinforces the assumption of English as the “default” language, we hope to expand Dolma to more languages in the future. We release our data curation tools to support such efforts. To illustrate the open-ended nature of this reproduction effort, we provide a detailed summary of known (and unknown) data curation practices for some of the largest proprietary (e.g., GPT-4 (OpenAI, 2023), PaLM 2 (Anil et al., 2023), Claude (Anthropic, 2023)) as well as open (e.g., OPT (Zhang, 2022), LLaMA (Touvron et al., 2023a), Llama 2 (Touvron et al., 2023b)) language models in Appendix §C.

Hoffmann et al. (2022) suggested that one can train compute-optimal models by maintaining a fixed ratio between language model size (in parameters) and minimum number of training tokens. Recent models that follow these “scaling laws,” such as LLaMA 2 (Touvron et al., 2023b), appear to show there is still room for performance improvement by increasing the number of training tokens. (See Figure 5 in Touvron et al. (2023b), in which loss has not converged even at 2T tokens.) As this is an active area of research, we aim for a sufficiently large corpus to allow further study of the relationship between model and dataset size: 2-3T tokens.

Lack of access to pretraining corpora alongside corresponding language models has been a major obstacle for the broader research community. Very few open models out of the hundreds released in recent years are released alongside their training data: T5 and C4 (Raffel et al., 2020), BLOOM and ROOTS (Leong et al., 2022; Piktus et al., 2023), GPT-J/GPT-NeoX/Pythia and Pile (Wang and Komatsuzaki, 2021; Black et al., 2022; Biderman et al., 2023; Gao et al., 2020), INCITE and RedPajama v1 (Together Computer, 2023b, c). However, limitations in these prior corpora have motivated the need for a new dataset such as Dolma:

C4 (Raffel et al., 2020), Pile (Gao et al., 2020), and Falcon (Almazrouei et al., 2023) are high-quality datasets with demonstrated use in training language models, but are unfortunately limited in scale. ROOTS (Piktus et al., 2023) is large and diverse but given its multilingual focus, its English-only portion is also too small to train English-only models.

RedPajama v2 (Together Computer, 2023a) meets our criteria of scale but doesn’t reflect representative distributions over sources of content commonly seen in curating the largest language models (e.g., scientific papers, code).

RedPajama v1 (Together Computer, 2023c) is most similar to our effort and a source of inspiration when designing Dolma. While RedPajama v1 was a reproduction of the LLaMA (Touvron et al., 2023a) training data, we have a broader reproduction target which required diving into data sources that RedPajama v1 did not pursue, including larger collections of scientific papers and conversational forums like Reddit.

In all, we expand on these works by creating the largest curated open pretraining corpus to date. We define openness to mean (i) sharing the data itself, which in turn informs our choice of data sources, and (ii) documenting the process used to curate it, including decisions made with justifications, and open-source implementations to allow others to reproduce our work and create new corpora. The resulting open-source high-performance toolkit enables researchers to implement their own data pipelines to either further refine Dolma or process their own datasets.

Curating a pretraining corpus may introduce risk to individuals, either by facilitating access to information that is present in the corpus, or by enabling training of harmful models. To minimize these risks while meeting our stated goals, we engaged with legal and ethics experts from within our organizations early in the project and evaluated data design decisions based on their feedback on a case-by-case basis. Broadly, we follow accepted practices when available (e.g., masking of certain personal identifiable information), and take a measured approach when diverging opinions exist in the literature (e.g., most effective approach to identify and remove toxic content). Further, we provide tools to request data removal (available at the following URL: forms.gle/FzpUXLJhE57JLJ3f8). As the landscape around data and AI is evolving, we do not claim that our decisions are correct. Nevertheless, we do believe in compromising on desired research artifact properties like model reproducibility, performance, and extensibility in cases of significant harm to individuals.

Even with these design goals to help scope our effort, there remain myriad decisions we must make when curating Dolma. Without a single clear recipe to follow from prior work, we rely on two principles to guide our decisions:

Use an evaluation suite, wisely. As part of the OLMo project (Groeneveld et al., 2024), we developed an evaluation suite (Groeneveld et al., 2023; details in Appendix D) to offer guidance during pretraining across a range of capabilities and tasks. Whenever possible, data decisions are made to improve its metrics. However, our evaluation suite is not perfect. For example, it cannot fully measure the effect of adding data sources that benefit models after instruction tuning. (For example, the effect of adding code to pretraining data cannot be fully measured until models are able to generate executable code. However, such capability is typically observed after models are further finetuned to follow instructions (Muennighoff et al., 2023a).) In these cases, we make sure that any one decision does not drastically decrease performance on any of the tasks in the suite.

Favor decisions that advance research directions of interest to our organization. Where the above principles do not offer guidance, we seek to build a corpus that will be most useful in research at academic or non-profit organizations like those of the authors. This does not necessarily mean maximizing benchmark performance; many desirable dataset interventions are at odds with each other. For example, we would like Dolma to support future investigations of the effect of pretraining on code; while our current evaluation suite is not properly designed to fully assess the impact of code data, we nevertheless include code in our corpus, to further research on this topic. Similarly, while previous research has suggested that removing .

Creating Dolma

Curation of pretraining data often requires defining complex pipelines that transform raw data from multiple sources into a single collection of cleaned, plain text documents. Such a pipeline should support acquisition of content from diverse sources (e.g., crawling, API ingestion, bulk processing), data cleanup through the use of filtering heuristics and content classifiers, and mixing into a final dataset (e.g., deduplication, up/down-sampling of sources).

In curating Dolma, we create a high-performance toolkit to facilitate efficient processing on hundreds of terabytes of text content. The toolkit is designed for high portability: it can run on any platform from consumer hardware (thus facilitating the development of new pipelines) to a distributed cluster environment (ideal for processing large datasets like Dolma). Through the curation of Dolma, we implemented commonly used cleanup and mixing steps that can be used to reproduce and curate similar datasets to Gopher, C4, and OpenWebText.

Using our toolkit, we develop and combine four kinds of data transformations that match Dolma desiderata we introduced in § 2:

Language filtering. To create our English-only corpus, we rely on scalable tools for automated language identification. Identification is performed using fastText’s (Joulin et al., 2016a) language ID model. Depending on the length of documents in each source, we either process the entire text at once or average the score of paragraphs. Documents with a sufficiently low English score are removed. (Keeping a low threshold can help mitigate inherent biases (Blodgett et al., 2016) that language detectors have against English dialects spoken by minoritized groups. Scores used for each source are reported in subsequent sections.) We do not perform any language identification on datasets that are distributed already pre-filtered to English-only documents. (These datasets may have been filtered to English content using other classifiers and thresholds.) We note that language filtering is never perfect, and multilingual data is never completely removed from pretraining corpora (Blevins and Zettlemoyer, 2022).

Quality filtering. It is common practice to remove text that is considered “low quality,” though there is no broad consensus about what this means or how best to operationalize this with automated tools. (The term “quality filter,” while widely used in literature, does not appropriately describe the outcome of filtering a dataset. Quality might be perceived as a comment on the informativeness, comprehensiveness, or other characteristics valued by humans. However, the filters used in Dolma and other language model efforts select text according to criteria that are inherently ideological (Gururangan et al., 2022).) For web sources, we follow recommendations in Gopher (Rae et al., 2021) and Falcon (Almazrouei et al., 2023) which suggest avoiding model-based quality filters like those used for LLaMA (Touvron et al., 2023a) and GPT-3 (Brown et al., 2020). Instead, we reimplemented and applied the heuristics that C4 (Raffel et al., 2020) and Gopher (Rae et al., 2021) used for processing Common Crawl. For other sources, we refer the reader to their corresponding sections as each required bespoke quality filtering strategies.

Content filtering. Besides removal of low quality, unnatural content, it is standard practice to filter toxic content from pretraining data to reduce risk of toxic generation (Anil et al., 2023; Rae et al., 2021; Thoppilan et al., 2022; Hoffmann et al., 2022; Longpre et al., 2023). We follow this practice and implement a mix of rules- and classifier-based toxicity filtering techniques depending on the source. Like in the case of “quality”, there is no single definition for “toxicity”; rather, specific definitions vary depending on task (Vidgen and Derczynski, 2020) and dataset curators’ social identities (Santy et al., 2023); annotators’ beliefs also influence toxic language detection (Sap et al., 2021). Using models to identify toxic content remains challenging (Welbl et al., 2021; Markov et al., 2023a), and existing methods have been shown to discriminate against minoritized groups (Xu et al., 2021). Large pretraining corpora have also been shown to include personal identifiable information (PII; Elazar et al., 2023), which models are able to reproduce at inference time (Carlini et al., 2022a; Chen et al., 2023b). In Dolma, we identify content for removal through a fastText classifier trained on Jigsaw Toxic Comments (cjadams et al., 2017) and a series of regular expressions targeting PII categories from Subramani et al. (2023); Elazar et al. (2023).

Deduplication. Deduplication of pretraining corpora has been shown to be an effective technique to improve token efficiency during model training (Lee et al., 2022; Abbas et al., 2023; Tirumala et al., 2023). In preparing Dolma, we use a combination of URL, document, and paragraph-level deduplication. We achieve linear-time deduplication through the use of a Bloom filter (Bloom, 1970). We perform this deduplication across files from the same subset (e.g., deduplicate all documents in the web subset), but not across sources (e.g., do not check if any web document also appears in the code subset).

In the remainder of this section, we provide a detailed explanation of how the steps above are implemented for each data source shown in Table 1. To support our decisions, we leverage two tools. First, we inspect the output of our pipelines using the WIMBD tools (Elazar et al., 2023). This approach allows us to efficiently spot issues without having to train any models.

Then, we conduct data ablations using a 1 billion parameter decoder-only model trained up to 150 billion tokens; we provide a detailed description of our experimental setup in § D.1. Through these ablations, we can compare the outcome of our data pipelines on our evaluation suite. The evaluation suite is comprised of 18 domains on which we measure perplexity to estimate language fit (Magnusson et al., 2023; described in § D.2), as well as 7 downstream tasks on which we evaluate question answering, reasoning, and commonsense capabilities of resulting models (described in § D.3). For the remainder of this section, we present a subset of results on the evaluation suite; we include all our experimental results in Appendix K. When making decisions, we prioritize interventions that optimize metrics in downstream tasks over language fit.

The web subset of Dolma was derived from Common Crawl (commoncrawl.org). Common Crawl is a collection of over 250 billion pages that were crawled since 2007. It is organized in snapshots, each corresponding to a full crawl over its seed URLs. In November 2023, there were 89 snapshots. Dolma was curated from 25 snapshots collected between 2020-05 and 2023-06. We use just enough snapshots to meet the volume goal described in § 2: at least 2T tokens.

Following data curation practices used to develop LLaMA (Touvron et al., 2023a), our web pipeline leverages CCNet (Wenzek et al., 2020b) to perform language filtering and initial content deduplication. This tool was also used for the Common Crawl subset of RedPajama v1 (Together Computer, 2023c) and RedPajama v2 (Together Computer, 2023a). CCNet processes each web page with a fastText language identification model (https://fasttext.cc/docs/en/language-identification.html) to determine the primary language for each document; we keep all pages with an English document score greater than or equal to 0.5 (this removed 61.7% of web pages by size). Further, CCNet identifies and removes very common paragraphs by grouping shards in each snapshot into small sets and removing duplicated paragraphs in each. This step removed approximately 70% of paragraphs, primarily consisting of headers and navigation elements. Overall, the CCNet pipeline filters out 84.2% of the content in Common Crawl, from 175.1 TB to 27.7 TB. More details are provided in Appendix J.4.
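The thresholding step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CCNet implementation: `score_fn` is a hypothetical callable standing in for a language identification model (such as a wrapper around fastText) that returns an English score in [0, 1], and the function names are ours. The optional paragraph averaging mirrors the strategy described for sources with longer documents.

```python
from typing import Callable, List


def english_score(text: str, score_fn: Callable[[str], float],
                  by_paragraph: bool = False) -> float:
    """Return an English score in [0, 1] for one document.

    score_fn is a hypothetical wrapper around a language-ID model.
    """
    if not by_paragraph:
        return score_fn(text)
    # Average the per-paragraph scores, skipping blank paragraphs.
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return 0.0
    return sum(score_fn(p) for p in paragraphs) / len(paragraphs)


def filter_english(docs: List[str], score_fn: Callable[[str], float],
                   threshold: float = 0.5,
                   by_paragraph: bool = False) -> List[str]:
    """Keep documents whose English score is >= threshold."""
    return [d for d in docs
            if english_score(d, score_fn, by_paragraph) >= threshold]
```

Passing the scorer as an argument keeps the sketch independent of any particular model file; in practice one would plug in a real language-ID model here.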

1.2 Quality Filtering

Web crawled data requires significant cleanup before it can be used for language model pretraining. This step removes artifacts introduced by the conversion from HTML to plain text (e.g., page headers, ill-formatted text) and discards pages that do not contain enough “prose-like” text (e.g., repeated text, short segments). First, CCNet natively provides a quality filter using KenLM (Heafield, 2011) perplexity to group documents into buckets based on Wikipedia-likeness; these buckets are often interpreted as high (21.9%), medium (28.5%), or low (49.6%) quality content. However, per arguments posed in Rae et al. (2021) and Almazrouei et al. (2023) against model-based quality filters, as well as our own manual inspections of content distributed between these buckets, we opted not to use these CCNet quality scores. Instead, in Dolma, we achieve quality filtering by combining heuristics introduced by Gopher (Rae et al., 2021) and C4 (Raffel et al., 2020). Specifically, we keep all the Gopher rules (henceforth, Gopher All) and keep a single heuristic from C4 designed to remove paragraphs that do not end in punctuation (C4 NoPunc; as opposed to C4 All). A detailed description of the filtering rules is provided in Appendix J.4.

Ablation results shown in Figure 2 validate our filtering strategy: we find that C4 NoPunc on its own outperforms both C4 All as well as Gopher All on both perplexity and downstream tasks. Finally, combining Gopher All + C4 NoPunc offers the best performance. In all, the Gopher rules tagged 15.23% of UTF-8 characters for removal, while the C4 rule tagged 22.73% of characters for removal. When comparing our heuristics against CCNet’s quality scores, the remaining documents after filtering fall into CCNet buckets of high (22.8%), medium (26.2%) and low (51.0%) quality, revealing very little correlation between model and heuristic-based quality filters.
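To make the C4 NoPunc rule concrete, the sketch below tags paragraphs that do not end in punctuation and reports the fraction of characters tagged. It is an illustration only: the set of terminal punctuation marks and the function name are assumptions of this sketch, and the exact rule used in Dolma is the one described in Appendix J.4.

```python
from typing import Tuple

# Assumed set of terminal punctuation marks; the real rule may differ.
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "\u201d")


def apply_no_punc(text: str) -> Tuple[str, float]:
    """Drop paragraphs that do not end in terminal punctuation.

    Returns the kept text and the fraction of characters tagged for
    removal (newline separators are not counted).
    """
    paragraphs = text.split("\n")
    kept = [p for p in paragraphs
            if p.rstrip().endswith(TERMINAL_PUNCTUATION)]
    total = sum(len(p) for p in paragraphs)
    removed = total - sum(len(p) for p in kept)
    fraction = removed / total if total else 0.0
    return "\n".join(kept), fraction
```

On a page made of a navigation bar, one sentence, and a trailing "Read more" link, only the sentence survives, which is the intended effect on HTML-to-text residue.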

Using the tool from Elazar et al. (2023), we inspect our filtered dataset for occurrences of repeated $n$-grams. Despite filtering using Gopher and C4 rules, we still found undesirable texts such as repeated sequences of ‘-’ 100 times, occurring over 60 million times, or repeated sequences of ‘bla’, occurring 19.1 million times (see Table 2). Based on this, we implement $n$-gram heuristics to identify and remove documents containing these sequences; specifically, we remove any repeated sequence longer than 100 UTF-8 characters. While this only removed 0.003% of the total characters in the dataset, removal of these documents can prevent loss spikes during training, as was empirically found in Scao et al. (2022) (more information at github.com/bigscience-workshop/bigscience/blob/master/train/tr8-104B-wide/chronicles.md). We also note that this was a fairly conservative heuristic that left many repeated sequences remaining in the dataset; we found from manual inspection of these sequences that they often served as webpage layout elements as opposed to parsing irregularities.
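One way to detect such sequences is sketched below. This is not the heuristic implemented in the Dolma toolkit: the regular expression, the 20-character cap on the repeated unit, and the function name are assumptions chosen for illustration. The sketch returns character spans so that the caller can decide whether to drop the span or the whole document.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: a short unit (1 to 20 characters) repeated
# back-to-back at least twice.
_REPEAT = re.compile(r"(.{1,20}?)\1+", re.DOTALL)


def find_long_repeats(text: str, min_chars: int = 100) -> List[Tuple[int, int]]:
    """Return (start, end) spans of repeated runs longer than min_chars."""
    return [(m.start(), m.end())
            for m in _REPEAT.finditer(text)
            if m.end() - m.start() > min_chars]
```

Backreference-based matching can be slow on very large inputs, so a production pipeline would likely use a dedicated scanner; the sketch is only meant to pin down what "a repeated sequence longer than 100 characters" means.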

1.3 Content Filtering

Data sampled from the internet may contain harmful or toxic content (Matic et al., 2020; Luccioni and Viviano, 2021; Birhane et al., 2023a, b). As highlighted in § 2, we filter Dolma to reduce harms that might arise from training language models on toxic content. We used the Jigsaw Toxic Comments dataset (cjadams et al., 2017), which contains forum comments tagged with (multilabel) categories “toxic”, “severe toxic”, “threat”, “insult”, “obscene”, and/or “identity hate” alongside unlabeled comments, to train two fastText classifiers—a binary “hate” detector and a binary “NSFW” detector:

For our “hate” detector, we group all unlabeled comments and “obscene”-only comments as negatives and treat the remaining comments as positives.

For our “NSFW” detector, we take all comments tagged as “obscene” as positives and treat all remaining comments as negatives. It is important to note this detector only filters toxic content that mentions sexual or obscene topics, not sexual content in general.
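The two label assignments above can be written down directly. In this sketch, each comment is represented by the set of Jigsaw categories it was tagged with; the identifier spellings (e.g., `severe_toxic`) and the helper names are assumptions of the sketch. The `__label__` prefix is the default label marker expected by fastText's supervised training format.

```python
from typing import Set


def hate_label(tags: Set[str]) -> int:
    """1 = positive, 0 = negative for the binary "hate" detector.

    Negatives are unlabeled comments and comments tagged only "obscene".
    """
    return 0 if tags <= {"obscene"} else 1


def nsfw_label(tags: Set[str]) -> int:
    """1 = positive, 0 = negative for the binary "NSFW" detector.

    Positives are all comments tagged "obscene".
    """
    return 1 if "obscene" in tags else 0


def to_fasttext_line(text: str, label: int, name: str) -> str:
    """Format one training example; `name` is a hypothetical label stem."""
    flat = " ".join(text.split())  # fastText expects one example per line
    return f"__label__{name}_{label} {flat}"
```

Note that a comment tagged both "obscene" and "insult" is a positive for both detectors, while an "obscene"-only comment is a positive for "NSFW" and a negative for "hate".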

We run both models on Common Crawl sentences (identified using the BlingFire sentence splitter (Microsoft, 2019)) with a filtering threshold of 0.40 based on manual threshold tuning. We chose our threshold seeking a balance between (1) maximizing precision and recall from inspecting predicted toxic sentences on a single snapshot of Common Crawl, and (2) avoiding excessive data removal. For example, the “hate” and “NSFW” detectors filter out 34.9% and 29.1% of tokens from Common Crawl at thresholds of 0.0004 and 0.00017, respectively. We always remove just the span that has been tagged as toxic, not the full document. We make both of these models available publicly (“NSFW” fastText tagger and “hate” fastText tagger).

In Figure 3, we compare the effect of two different thresholds for the “hate” and “NSFW” detectors. The “High Threshold” configurations remove less content, but generally yield higher perplexity on the evaluation set and lower downstream performance. The “Low Threshold” configurations generally have higher performance, but remove more units of text (34.9% vs. 7.3% and 29.1% vs. 5.5% of UTF-8 characters for “hate” and “NSFW”, respectively). Because lower thresholds might lead to false positives, and improved performance can be achieved by combining content filters with quality and deduplication filters, we use the “High Threshold” versions of the “hate” and “NSFW” filters, removing any sentence with a score greater than or equal to 0.4.
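Sentence-level removal at a fixed threshold can be sketched as follows. The regular-expression sentence splitter is a simplification standing in for BlingFire, and `score_fn` is a hypothetical callable returning a classifier's toxicity score for a sentence; neither is the actual Dolma implementation.

```python
import re
from typing import Callable

# Simplified splitter: break after ., ! or ? followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def remove_toxic_sentences(text: str, score_fn: Callable[[str], float],
                           threshold: float = 0.4) -> str:
    """Drop sentences scoring >= threshold; keep the rest of the document."""
    sentences = _SENTENCE_BOUNDARY.split(text)
    kept = [s for s in sentences if score_fn(s) < threshold]
    return " ".join(kept)
```

Only the flagged sentences are dropped, so a single toxic sentence does not cause the surrounding document to be discarded.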

Data sampled from the internet can also leak personal identifiable information (PII) of users (Luccioni and Viviano, 2021; Subramani et al., 2023); such PII is abundant in large-scale datasets (Elazar et al., 2023).

PII detection can be accomplished using model-based tools (Dernoncourt et al., 2017; Microsoft, 2018; Hathurusinghe et al., 2021; Lison et al., 2021; Lukas et al., 2023; Mazzarino et al., 2023) or rule-based approaches (Aura et al., 2006; Elazar et al., 2023). The former generally offer better performance, while the latter are faster.

The size of Dolma makes it impractical to use model-based tools; instead, we rely on carefully crafted regular expressions. Following the findings of Subramani et al. (2023), we tag three kinds of PII that can be detected with sufficient accuracy: email addresses (regex: [.\s@,?!;:)(]*([^\s@]+@[^\s@,?!;:)(]+?)[.\s@,?!;:)(]?[\s\n\r]), phone numbers (regex: \s+\(?(\d{3})\)?[-\. ]*(\d{3})[-. ]?(\d{4})), and IP addresses (regex: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})). Once spans are tagged, we employ different processing strategies based on their density in each document:

5 or fewer PII spans detected: we replace all spans on a page with special tokens |||EMAIL_ADDRESS|||, |||PHONE_NUMBER|||, and |||IP_ADDRESS||| for email addresses, phone numbers, and IP addresses respectively. (When training models on Dolma, we add these special tokens to the tokenizer vocabulary. For all results shown in this paper, we use allenai/gpt-neox-olmo-dolma-v1_5.) In total, we find 0.02% of documents in the 25 Common Crawl snapshots match this filter.

6 or more PII spans detected: we remove any document that contains 6 or more matching PII spans. We take this approach because pages containing abundant phone numbers and email addresses are likely to pose a greater risk of disclosing other PII classes. 0.001% of documents in the 25 Common Crawl snapshots match this filter.
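The density-based policy can be sketched as follows. The patterns below are deliberately simplified stand-ins, not the regular expressions used in Dolma, and the function and constant names are ours; the sketch only illustrates the "mask up to 5 spans, drop at 6 or more" decision.

```python
import re
from typing import Optional

# Simplified, illustrative patterns (assumptions of this sketch).
PII_PATTERN = re.compile(
    r"(?P<EMAIL_ADDRESS>[^\s@]+@[^\s@]+\.[^\s@]+)"
    r"|(?P<IP_ADDRESS>(?:\d{1,3}\.){3}\d{1,3})"
    r"|(?P<PHONE_NUMBER>\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4})"
)

MAX_MASKED_SPANS = 5  # documents with more spans are removed entirely


def process_pii(text: str) -> Optional[str]:
    """Mask PII spans, or return None if the document should be removed."""
    matches = list(PII_PATTERN.finditer(text))
    if len(matches) > MAX_MASKED_SPANS:
        return None
    # m.lastgroup is the name of the alternative that matched.
    return PII_PATTERN.sub(lambda m: f"|||{m.lastgroup}|||", text)
```

Using a single alternation with named groups means matches never overlap, so each span is replaced by exactly one special token.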

In Figure 4, we show results of an experiment designed to quantify the impact of our PII strategy. Overall, we find that, in both language modeling and downstream tasks, PII removal and masking has no discernible effect on model performance.

1.4 Deduplication

Recent efforts indicate that the deduplication of data leads to language models that train more efficiently (Lee et al., 2022). Following this principle, we deduplicate data in the web pipeline. We perform three stages of deduplication:

(i) Exact URL deduplication: mark pages that share the same URL. No normalization is performed. This filter is primarily intended to remove pages that have been crawled multiple times. Overall, it removes 53.2% of documents in the 25 snapshots used to create Dolma. URL deduplication is commonly used as the first stage for web crawls thanks to its computational efficiency (Agarwal et al., 2009; Koppula et al., 2010; Penedo et al., 2023).

(ii) Exact document deduplication: mark pages that contain the same text. No punctuation or whitespace is removed. Empty documents count as duplicates. Overall, it removes an additional 14.9% of documents after URL deduplication.

(iii) Exact paragraph deduplication: mark identical paragraphs across pages as duplicates. We keep the definition of this unit consistent with previous filters: a paragraph is a span of text separated by the newline UTF-8 character “\n”. Overall, this filter tags 18.7% of documents in the URL-deduplicated set as repeated.

This multi-stage approach is designed to increase efficiency: stages (i) and (ii) are designed to remove copies of the same item (identical pages might have multiple URLs, such as in the case of the same news article being included in multiple online newspapers), and thus can be executed before any content or quality filtering, reducing the number of pages to process. In contrast, stage (iii) removes repeated content that appears on different pages (such as the same byline appearing under all articles written by the same author), thus altering portions of pages and potentially disrupting content analysis. All stages use a Bloom filter (Bloom, 1970) data structure for efficient content deduplication.
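As an illustration of how a Bloom filter supports single-pass exact deduplication, consider the minimal sketch below. It is not the Dolma toolkit's implementation: the class, its default sizes, and the double-hashing scheme built on SHA-256 are assumptions made for this example. The same `add_if_new` call could be applied to URLs or whole documents for the first two stages.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """Minimal Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, item: str) -> bool:
        """Record item; return True if it was (probably) unseen."""
        is_new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                is_new = True
                self.bits[byte] |= 1 << bit
        return is_new


def dedupe_paragraphs(docs: List[str], bloom: BloomFilter) -> List[str]:
    """Keep only the first occurrence of each paragraph across documents."""
    result = []
    for doc in docs:
        kept = []
        for paragraph in doc.split("\n"):
            if bloom.add_if_new(paragraph):
                kept.append(paragraph)
        result.append("\n".join(kept))
    return result
```

A Bloom filter never misses a true duplicate but can occasionally flag a new item as already seen; the false-positive rate is controlled by the number of bits and hash functions, which trades memory for accuracy.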

1.5 Putting It All Together

To summarize, the Dolma web pipeline transforms the output of CCNet by first performing URL and document-level deduplication, followed by quality filtering (Gopher, C4 NoPunc), content filtering (toxic content, PII), and, finally, paragraph-level deduplication. But what is the combined outcome of the filtering?

In Figure 5, we show the compounding effect of the stages of the pipeline. We find that the combination of the three stages achieves the best performance on downstream tasks, while content filtering slightly hurts language fit on the C4 100 domains subset. As stated in § 2, we leverage downstream evaluation tasks to make decisions; thus we use all steps in the pipeline when creating Dolma.

We use the tool from Elazar et al. (2023) to inspect the final data composition in Figure 6. In particular, we analyze web domain, year, and language distributions.

We note that Dolma contains documents from a broad set of internet domains, mostly from 2020, 2022, and 2021. The most common internet domains in Dolma, per token, are patents.google.com, followed by www.nature.com and www.frontiersin.org. In fact, similar to other corpora reported in Elazar et al. (2023), 63.6% of Dolma’s web documents are from ‘.com’ sites (followed then by ‘.org’ and ‘.co.uk’ sites). Finally, as all language identification tools are imperfect, we summarize what languages are remaining post English-only filtering: We find the most common language after English is not well identified (‘un’) with 0.86% of the documents, followed by 0.06% of the documents identified as Chinese.

In order to further understand how filters described in § 3.1.2 and § 3.1.3 interact with each other, we perform a correlation analysis on a subset of documents sampled from our pipeline.

The correlation among the documents flagged for removal by our Common Crawl filters is depicted in Figure 7. We find that correlations are generally low, thus our filters select fairly different documents and are not redundant. There is some positive correlation between our PII (Personal Identifiable Information) filters and filters removing hate speech. This is likely because hate speech is often directed at people. The Gopher filtering rules correlate negatively with our deduplication, especially for the high-perplexity tail part of our data. This is due to the Gopher rules removing many high-perplexity documents such as random strings, which are not caught by deduplication due to their randomness. As these random strings likely do not contribute to a better understanding of language, it is important to filter them out and thus rely on filters beyond deduplication.
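A minimal sketch of this kind of analysis follows. Each filter is represented as a binary flag per document, and the Pearson correlation between every pair of flags is computed (for binary variables this is the phi coefficient). The flag values below are invented toy data, not results from the paper, and the filter names are placeholders.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant flag has no defined correlation
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical): 1 means the filter flagged the document for removal.
flags = {
    "pii":    [1, 0, 0, 1, 0, 0, 1, 0],
    "hate":   [1, 0, 0, 1, 0, 0, 0, 0],
    "gopher": [0, 1, 0, 0, 1, 0, 0, 1],
    "dedup":  [0, 0, 1, 0, 0, 1, 0, 0],
}

correlations = {
    (a, b): pearson(flags[a], flags[b]) for a, b in combinations(flags, 2)
}
```

Values near zero indicate that two filters flag largely different documents, which is the sense in which the filters above are described as non-redundant.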

2 Code Pipeline

We derive the code subset of Dolma from The Stack (Kocetkov et al., 2022), a collection of permissively-licensed GitHub repositories. We use the near-deduplicated version as a starting point, thus removing the need to perform deduplication ourselves. The raw version of this dataset was collected in March 2023. We filter data-heavy documents by removing files with extensions such as JSON and CSV.

2.2 Quality Filtering

We apply heuristics derived from the RedPajama v1 (Together Computer, 2023c) and StarCoder (Li et al., 2023) datasets. The former consists of rules to remove repetitive file preambles, such as license statements (we keep this information in the metadata associated with each document in Dolma), and documents with excessively long lines or mostly numerical content. Overall, RedPajama Rules (RPJ) are designed to remove files that are mostly data or generated through templates. To further select high quality code snippets, we leverage rules from the StarCoder pipeline; these heuristics filter GitHub repositories with few or no stars, files with too few or too many comments, and HTML files with low code-to-text ratio. For a detailed description of these rules, see § J.4.

In Figure 9, we present a comparison between RedPajama (RPJ) and StarCoder rules. In our ablations we find that, compared to RPJ rules alone, RPJ and StarCoder combined lead to lower perplexity on code datasets (e.g., HumanEval; Chen et al., 2021b), more stable perplexity during training on non-code test sets (e.g., C4 100 Domains subset of Paloma; Magnusson et al., 2023), and improved downstream performance (e.g., HellaSwag; Zellers et al., 2019). Therefore, we chose to use this combination when creating the final mix for Dolma.

2.3 Content Filtering

We apply the same filtering rules from the web pipeline (§ 3.1) to mask personal identifiable information (PII). Documents with more than 5 PII instances are removed from Dolma. In all other instances, emails, phone numbers, and IP addresses are masked using special tokens.
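The remove-or-mask rule can be sketched as follows. The regular expressions and the placeholder token strings are simplified assumptions for illustration and are not the detectors used in Dolma; only the threshold of 5 instances comes from the text above.

```python
import re

# Hypothetical mask tokens and deliberately simple patterns (assumptions).
PII_PATTERNS = {
    "|||EMAIL_ADDRESS|||": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "|||IP_ADDRESS|||": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "|||PHONE_NUMBER|||": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}
MAX_PII_INSTANCES = 5  # documents with more matches than this are dropped


def filter_pii(text):
    """Return None to drop the document, else the text with PII masked."""
    total = sum(len(p.findall(text)) for p in PII_PATTERNS.values())
    if total > MAX_PII_INSTANCES:
        return None
    for token, pattern in PII_PATTERNS.items():
        text = pattern.sub(token, text)
    return text


masked = filter_pii("Contact alice@example.com or 555-123-4567 from 192.168.0.1.")
dropped = filter_pii(" ".join(f"user{i}@example.com" for i in range(6)))
```

Counting matches before substitution keeps the threshold check independent of the order in which the patterns are applied.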

We also remove code secrets or personal information. To do so, we use the detect-secrets (Yelp, 2013) library and remove any documents with a match.

2.4 Deduplication

We used the already-deduplicated version of The Stack published by Kocetkov et al. (2022); their approach uses the pipeline first introduced by Allal et al. (2023), which uses MinHash (Broder, 2002) and Locality-Sensitive Hashing to find similar documents.
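For readers unfamiliar with the technique, the sketch below illustrates the general idea of MinHash signatures combined with banded locality-sensitive hashing. It is not the pipeline of Allal et al. (2023): the signature length, band count, word 3-gram shingling, and use of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # signature length (illustrative)
BANDS = 16     # LSH bands; each band covers NUM_PERM // BANDS signature rows


def shingles(text, n=3):
    """Set of word n-grams; a very short text becomes a single shingle."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def _hash(item, seed):
    # One salted 64-bit hash per seed stands in for a random permutation.
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash(items):
    """MinHash signature: the minimum hash value under each seed."""
    return [min(_hash(s, seed) for s in items) for seed in range(NUM_PERM)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(docs):
    """Documents sharing at least one identical band become candidate pairs."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(shingles(text))
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical files: "a" and "b" are exact copies, "c" is unrelated.
docs = {
    "a": "def add(x, y): return x + y",
    "b": "def add(x, y): return x + y",
    "c": "SELECT name FROM users WHERE id = 1",
}
pairs = candidate_pairs(docs)
```

Exact copies always collide in every band. Near-duplicates collide with a probability that grows with their Jaccard similarity, which is what makes the banding step a cheap pre-filter before any pairwise comparison.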

3 Conversational Forums Pipeline

The conversational subset of Dolma was derived from the Pushshift Reddit dataset (Baumgartner et al., 2020b), a large collection of forum conversations collected through Reddit’s data API and distributed by the Pushshift project. We derive the conversational subset in Dolma from 378M posts from Reddit, from December 2005 until March 2023. We include both submissions—initial message in conversations on Reddit—and comments—replies to messages—in the dataset. We treat all submissions and comments as independent documents without any structure or connection to the thread they appear in; in our evaluation, this simplified representation yields better performance on downstream tasks. A discussion of this trade-off is presented in Appendix E.

For consistency, we use the same strategy as the web pipeline to filter non-English content. In particular, we keep submissions and comments with an English score greater than 0.5.

3.2 Quality Filtering

Conversational forum data must be adequately cleaned to remove content that is too short, repetitive, or is negatively ranked by the community it was submitted to. We use the pipeline introduced by Henderson et al. (2019) to facilitate cleanup of submissions and comments using Google Dataflow (https://cloud.google.com/dataflow). We remove comments shorter than 500 characters, and submissions shorter than 400 characters (qualitative inspection of the data suggested that submissions are of higher quality than comments; thus, we use a more permissive minimum length). We also remove documents over 40,000 characters in length.

We remove comments with fewer than 3 votes (the total votes for each document are obtained by computing the difference between positive votes, also known as “upvotes”, and negative votes, or “downvotes”), as lower scores are associated with comments that are deeply nested in a conversational thread (Weninger et al., 2013) or content that is more likely to result in emotionally charged discourse (Davis and Graham, 2021). Votes have been used as a signal in constructing the WebText (Radford et al., 2019) and OpenWebText (Peterson, 2020) corpora. We discard documents that have been deleted by their authors or removed by moderators; further, documents that have been labeled by their authors as “over 18” were also removed. We exclude any document originating from any of the 26,123 banned and not-safe-for-work subreddits we curated. The list is available at https://github.com/allenai/dolma/blob/main/sources/reddit/atomic_content_v5/subreddit_blocklist.txt. It was obtained by merging several sources that tracked banned subreddits (mostly from posts on Reddit itself). We also measured the fraction of posts within a subreddit tagged as NSFW, and blocked the subreddit when this fraction exceeded 10%.
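The quality rules above can be collected into a single predicate, sketched below. The `Post` record and its field names are hypothetical and do not correspond to the Pushshift schema or to the Dataflow pipeline; only the numeric thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical record layout, for illustration only.
    kind: str            # "comment" or "submission"
    text: str
    score: int           # upvotes minus downvotes
    subreddit: str
    over_18: bool = False
    deleted: bool = False  # deleted by author or removed by moderators


MIN_CHARS = {"comment": 500, "submission": 400}
MAX_CHARS = 40_000
MIN_COMMENT_SCORE = 3


def keep_post(post, blocklist):
    """Return True if the post passes every quality rule."""
    length = len(post.text)
    if length < MIN_CHARS[post.kind] or length > MAX_CHARS:
        return False
    if post.kind == "comment" and post.score < MIN_COMMENT_SCORE:
        return False
    if post.deleted or post.over_18:
        return False
    if post.subreddit.lower() in blocklist:
        return False
    return True


# Hypothetical examples.
blocklist = {"bannedsub"}
good = Post("comment", "x" * 600, score=10, subreddit="AskScience")
short = Post("comment", "x" * 450, score=10, subreddit="AskScience")
submission = Post("submission", "x" * 450, score=0, subreddit="AskScience")
results = [keep_post(p, blocklist) for p in (good, short, submission)]
```

In this sketch the vote threshold is applied to comments only, following the wording above; a 450-character text is therefore rejected as a comment but accepted as a submission.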

3.3 Content Filtering

We apply the same filtering rules used in the web pipeline (§ 3.1.3) to remove toxic content and mask PII. Unlike in the case of the web pipeline, we fully remove a document if any part of it is tagged as toxic. We employ this strategy because content from Reddit is shorter in length, thus it is more likely that a single sentence classified as toxic is a strong indication of the entire document being toxic as well.

3.4 Deduplication

We employ the same strategy used in the web pipeline (§ 3.1.4). Since submissions and comments are shorter than web documents, we only deduplicate at the document level. This strategy is useful to reduce the incidence of “copypasta” (blocks of text that are often repeated across many comments and subreddits for comedic effect) and other repetitive information.

4 Other Data Sources

In this section, we briefly summarize additional high-quality sources that were used to derive Dolma. For more details on collection and processing, see Appendix § J.3 and § J.4.

Similarly to LLaMA (Touvron et al., 2023a), we include documents from C4 (Raffel et al., 2020) in the Dolma dataset. We further refine this data by reprocessing it through our web pipeline to remove long, repeated sequences (§ 3.1.2) and duplicates (§ 3.1.4). Finally, we also perform PII masking as described in § 3.1.3.

The PeS2o dataset (Soldaini and Lo, 2023) is a collection of approximately 40 million open-access academic papers that have been cleaned, filtered, and formatted for pre-training of language models. It is derived from the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al., 2020). As this dataset has been created for language modeling purposes, we use it as-is.

Project Gutenberg is a repository of over 70 thousand public domain books. We collected Project Gutenberg’s archive in April 2023. We use the same fastText-based language identification model to identify English language books and include them in Dolma. More details in our Data Sheet § J.

This dataset was derived from March 2023 Wikimedia dumps. We use the “English” and “Simple” editions of Wikipedia and Wikibooks as the base for the Encyclopedic subset of Dolma. Sources were processed using WikiExtractor (github.com/attardi/wikiextractor, v. 3.0.7, commit prefix 8f1b434). We remove any document with 25 or fewer UTF-8-segmented words, as we found shorter pages to either be the result of short, templated pages (e.g., pages containing only a few words and an information box) or XML parsing errors.

Training a Language Model on Dolma

As a final validation step of the Dolma pipeline, we train, evaluate and release a decoder-only, autoregressive language model which we call Olmo-1b. In this section, we discuss additional dataset curation decisions specific to model training. In § 4.1, we present an approach to remove benchmark tasks from Dolma (i.e., decontamination). Then, in § 4.2, we discuss considerations when combining (i.e., mixing) the various document subsets in Dolma to obtain the final pretraining corpus. Finally, in § 4.3, we present experimental results of the resulting Olmo-1b model. Olmo-1b uses the GPT-NeoX tokenizer (Black et al., 2022), which we found to be well suited for Dolma; we present results supporting our decision in Appendix F.

In this section we experiment with approaches to remove benchmark contamination from pretraining and select which is ultimately used in Olmo-1b. Large-scale language datasets contain copies of benchmarks that are commonly used to evaluate language models (Dodge et al., 2021; Yang et al., 2023; Elazar et al., 2023). The impact of such contamination is currently debated. For example, Lee et al. (2022) showed that removing duplicates of validation data from C4 pretraining increases perplexity on the previously duplicated validation data. Meanwhile, work examining post-hoc performance difference between contaminated and uncontaminated downstream data finds no consistent positive or negative impact (Chowdhery et al., 2022; Brown et al., 2020; OpenAI, 2023). To start, we focus on the removal of perplexity benchmark contamination, and we measure the extent of downstream task contamination. We experiment with removing contamination with respect to an early version of Paloma (Magnusson et al., 2023), a benchmark of 585 text domains designed to evaluate language model fit to diverse sources. This selection of perplexity evaluations is detailed in Appendix D.

Using the paragraph deduplication tools described in § 3.1.4, we mark any paragraph in Dolma as contaminated if (i) it is longer than 13 Unicode-segmented tokens (like Elazar et al. (2023), we only consider paragraphs of sufficient length to avoid false positive matches) and (ii) it appears in any of the documents in Paloma. In preliminary experiments on decontaminating C4 (Raffel et al., 2020) against an early version of Paloma, we compare the paragraph-based decontamination technique described above with exact-matching whole documents. Results show that document-based decontamination yields a lower matching rate, with only 1 of 12 subsets having greater than 1% contaminated documents (the C4 100 Domains subset, which is directly constructed from C4). However, when considering paragraph-based decontamination, 6 of 12 perplexity tasks have greater than 1% of documents contaminated. Since the latter better reflects expected contamination rates, we chose it for the remainder of this section.

Lastly, we consider two ways of removing contamination. In preliminary experiments on C4, we find that removing just the contaminated paragraphs by excluding them from documents removes 0.01% of tokens, while removing whole documents with any contamination removes 0.02% of tokens. In either case 0.01% of documents are affected. Given that each has a relatively small impact, we opt for removing full documents to avoid disrupting reading order, though this does bias towards removing longer documents.
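The paragraph-matching rule and the two removal strategies can be sketched as follows. This is a simplified stand-in rather than the actual tooling: it uses an exact Python set instead of a Bloom filter, whitespace splitting instead of Unicode segmentation, and newline-delimited paragraphs; all function names are hypothetical.

```python
MIN_TOKENS = 13  # paragraphs must be longer than this to be considered


def tokens(paragraph):
    # Whitespace splitting is an assumption standing in for Unicode segmentation.
    return paragraph.split()


def build_index(eval_docs):
    """Collect every sufficiently long paragraph from the evaluation documents."""
    return {
        p.strip()
        for doc in eval_docs
        for p in doc.split("\n")
        if len(tokens(p)) > MIN_TOKENS
    }


def contaminated_paragraphs(doc, index):
    """Paragraphs of doc that are long enough and appear in the evaluation index."""
    return [
        p for p in doc.split("\n")
        if len(tokens(p)) > MIN_TOKENS and p.strip() in index
    ]


def remove_paragraphs(doc, index):
    """Strategy 1: drop only the contaminated paragraphs."""
    bad = set(contaminated_paragraphs(doc, index))
    return "\n".join(p for p in doc.split("\n") if p not in bad)


def remove_documents(docs, index):
    """Strategy 2: drop any document containing a contaminated paragraph."""
    return [d for d in docs if not contaminated_paragraphs(d, index)]


# Hypothetical evaluation and training data.
eval_par = ("the quick brown fox jumps over the lazy dog "
            "while the cat sleeps on the warm mat")
eval_docs = [eval_par + "\nsee you later"]
train_docs = [
    "An original opening paragraph.\n" + eval_par,
    "see you later\nNothing copied here.",
]
index = build_index(eval_docs)
cleaned_first = remove_paragraphs(train_docs[0], index)
kept_docs = remove_documents(train_docs, index)
```

Note that the short paragraph shared by both sets is never flagged, because it falls below the length threshold; this is the false-positive guard described above.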

To assess the risk of our decontamination approach, we train two 1B parameter models on a 221B token subset of RedPajama v1 (Together Computer, 2023c), the corpus most similar to Dolma’s intended composition at the time of experimenting. (This experiment uses the setup described in Appendix D, including model configuration, optimizer, and evaluation setup.) The first model is trained on RedPajama v1 as-is, while the second uses the same corpus after the paragraph-matching, document-removal decontamination approach described above. On this subset, our decontamination approach removes 2.17% of Unicode tokens and 0.66% of documents. In Table 3 we show that differences in perplexity and downstream task performance are minimal and do not trend consistently positive or negative. For perplexity, 7 sources degrade and 6 improve; for downstream tasks, 5 degrade and 4 improve. The largest degradation in a perplexity source is 22.0 to 22.3 on Penn Tree Bank. The largest degradation in a downstream task is a drop of 1.5% accuracy on SCIQ to 84.8%. In conclusion, results show no consistent evidence of performance degradation with decontamination.

As our experiments have derisked our approach for removing benchmark contamination, we apply it to our model trained on Dolma. The finalized approach for removing overlap with Paloma is detailed in Magnusson et al. (2023). It applies the steps discussed in this section with the addition of a filter that ignores overlaps consisting of only punctuation, spaces, and emoji. These types of tokens can be arbitrarily repeated in text formatting, leading to common n-grams greater than our 13-gram threshold. On the final Dolma corpus used to train Olmo-1b, our approach finds less than 0.001% characters in training data contaminated, and removes fewer than 0.02% of documents.

We measure data contamination in Dolma. We follow the same setup as WIMBD (Elazar et al., 2023) and compute the percentage of instances from tasks with two or more inputs (e.g., natural language inference) that can be found in a single document. This serves as an upper bound of exact-match contamination in Dolma. We consider 82 datasets from PromptSource (Bach et al., 2022), and report the datasets for which at least 5% of the test set can be found in Dolma. We report the results in Figure 11.
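The measurement can be illustrated with a short sketch: an instance counts as contaminated when all of its input fields occur verbatim in the same document. The function name and the toy data are hypothetical, and the linear scan over documents is only workable at toy scale; a corpus of Dolma’s size needs an index.

```python
def contamination_rate(instances, documents):
    """Percentage of instances whose inputs all occur in one single document."""
    hits = sum(
        any(all(field in doc for field in instance) for doc in documents)
        for instance in instances
    )
    return 100 * hits / len(instances)


# Hypothetical two-input instances (e.g., premise and hypothesis).
instances = [
    ("A man is playing guitar.", "A person makes music."),
    ("The sky is blue.", "It is raining."),
]
documents = [
    "Example pair: A man is playing guitar. / A person makes music.",
    "Weather report: The sky is blue.",
]
rate = contamination_rate(instances, documents)
```

Only the first instance has both inputs in a single document, so half of this toy test set would be counted as contaminated.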

Results indicate that a portion of the datasets in PromptSource appears in Dolma. Six datasets are completely contaminated (100%): the Winograd Schema Challenge (Levesque et al., 2012), Sick (Marelli et al., 2014), AX from GLUE (Wang et al., 2018), SemEval (specifically, Task 1 from 2014), COPA from SuperGLUE (Roemmele et al., 2011), and AXb (the diagnostic task) from SuperGLUE (Wang et al., 2019). In addition, other datasets are mostly contaminated, with over 90% of their test sets appearing in Dolma documents: OpenAI HumanEval (Chen et al., 2021a), WIC from SuperGLUE (Pilehvar and Camacho-Collados, 2019), ESNLI (Camburu et al., 2018), and SNLI (Bowman et al., 2015). We note that the contaminated datasets have been excluded from the downstream tasks we use for model evaluation (cf. Appendix D).

2 Strategies for Subsets Mixing and Upsampling with Dolma

Like the pretraining corpora of nearly every large-scale language model, Dolma is a multi-source dataset. Training on Dolma thus requires a mixing strategy that determines how much data from each source to include, and potentially which sources to upsample. Like other multi-source corpora (e.g., ROOTS (Laurençon et al., 2023), the Pile (Gao et al., 2020), RedPajama v1 (Together Computer, 2023c)), Dolma does not prescribe a single mixing strategy. (RedPajama v1 was a reproduction of the multi-source corpus used in LLaMA (Touvron et al., 2023a); RedPajama v2 (Together Computer, 2023a) focuses solely on Common Crawl and is thus single-source.) We refer the reader to Rae et al. (2021) for an example of how one might programmatically search over mixing configurations to maximize performance. Here, we perform mixing experiments as an opportunity to answer some research questions about how different data sources interact. We use the same ablation setup described in § 3.

It is common practice for language models to be pretrained on some amount of code, even if code generation is not the intended task. Some research has suggested that mixing code into training over plain text documents improves performance on reasoning tasks (Madaan et al., 2022). We investigate whether this observation holds for models trained on Dolma and, if so, how much code is needed.

We create three mixtures from the C4 and Stack subsets containing 0%, 5% and 15% of code data. On each, we train a 1B model. We evaluate these models on three different reasoning tasks: bAbI (Weston et al., 2015), WebNLG (Gardent et al., 2017) and GSM8k (Cobbe et al., 2021). For the first two tasks, we follow the experimental setup of Muennighoff et al. (2023b) and evaluate each model in an ICL setup with a changing number of demonstrations (0-5) across 5 random seeds. Muennighoff et al. (2023b) show that adding code to pre-training data improves ICL performance on bAbI and WebNLG and they suggest that code improves long-range state-tracking capabilities. Our experiments, as shown in Table 4, corroborate these findings: while the C4-only model fails on all bAbI tasks, adding code improves performance, with a similar trend for WebNLG.
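As a small worked example of how such mixtures translate into per-source token budgets, the sketch below splits a fixed budget between the two subsets. The total budget and the assumption that percentages are measured in tokens are illustrative choices, not details stated in the paper.

```python
def mixture_budgets(total_tokens, code_percent):
    """Split a token budget between C4 and Stack for a given code percentage."""
    if not 0 <= code_percent <= 100:
        raise ValueError("code_percent must be between 0 and 100")
    code = total_tokens * code_percent // 100  # integer math avoids float drift
    return {"c4": total_tokens - code, "stack": code}


TOTAL_TOKENS = 1_000_000  # hypothetical budget, for illustration only
mixtures = {pct: mixture_budgets(TOTAL_TOKENS, pct) for pct in (0, 5, 15)}
```

Holding the total fixed across the three mixtures means any difference between the resulting models reflects the code fraction rather than the amount of training data.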

On the more difficult GSM8k benchmark, all models failed to get any correct answer in an ICL setup, and even when fine-tuning the models on the entire training set. However, we find that by fine-tuning on program-aided output, where questions are solved by writing Python snippets as described in Gao et al. (2022), code models outperform the C4-only model. These results show that models pre-trained on code can leverage code generation to answer challenging reasoning tasks even when the original task does not directly involve code.

While Dolma does not prescribe a specific source mixture, we analyze some commonly used strategies and compare their effect using the Paloma evaluation suite (Magnusson et al., 2023). We did not include any social data in these mixes as it was not ready at the time of this experiment. Specifically, we present and evaluate four possible data mixtures in Table 5.

We show results of mixtures in Figure 12. Overall, we observe that the different mixtures have an effect on the ability of resulting models to capture specific subdomains. All mixtures show similar perplexity scores on pages sampled from 100 domains from C4 (Figure 12, left), indicating their general effectiveness at modeling web documents. On the other hand, we note how models struggle to model specialized domains unless they are exposed to them. As an example, a model trained on the Web-only mix struggles to represent data in the code domain (Figure 12, center, HumanEval). Finally, we use results on the S2ORC subset of M2D2, which consists of academic papers, to illustrate how different data mixtures affect perplexity. As is the case with code, the Web-only model exhibits higher perplexity due to domain mismatch. On the other hand, models trained on Reference+ and Gopher-like mixes achieve lower perplexity than the model trained on the Naïve mix, due to more in-domain content. However, we note that, despite significant differences in the amount of academic papers between Reference+ and Gopher-like (4.9% vs 24.2%), they achieve nearly identical results, suggesting that even a relatively small percentage of in-domain data is sufficient to achieve good domain fit.

3 Evaluating Olmo-1b

In Table 6 we compare Olmo-1b with other 1B models. Note that while parameter count is matched here, only TinyLlama has been trained for a comparable number of tokens, while Pythia 1B is trained for nearly 10 times fewer tokens and the data composition of StableLM2 is unknown. Nevertheless, we find that Olmo-1b performs better on average than the most comparable model, TinyLlama, outperforming it in 4 out of 8 tasks. Though zero-shot evaluations of downstream tasks are often challenging for these relatively small 1B models, the performance for all the tasks on all the models is above naive random performance. Further details about the downstream tasks are included in Appendix D.

In Figure 13 we assess how the Dolma mix that we use to train Olmo-1b compares to other popular pretraining corpora in terms of perplexity of models where all other variables than pretraining data are controlled. In particular we fix the number of tokens each model is trained on to 150B, so that data scale and differences in learning rate schedule do not confound with the effect from data composition that we intend to study. This analysis uses the 1B baselines from Paloma and evaluates Paloma’s highest-level metric, which computes perplexity over the combination of test sets from 11 data sources. Other more fine-grained perplexity results comparing these baselines are available in Magnusson et al. (2023). The present analysis excludes sources that are not publicly available, involve fringe or toxic text, or that consist of code data not supported by the benchmark decontamination approach we use. This leaves C4 (Raffel et al., 2020), mC4-en (Chung et al., 2023), Wikitext 103 (Merity et al., 2016), Penn Treebank (Marcus et al., 1999; Nunes, 2020), RedPajama (Together Computer, 2023c), Falcon-RefinedWeb (Penedo et al., 2023), Dolma (this work), M2D2 S2ORC (Reid et al., 2022), M2D2 Wikipedia (Reid et al., 2022), C4 100 domains (Chronopoulou et al., 2022), and Dolma 100 Subreddits (this work).

Our controlled perplexity analysis reveals the importance of including non-Common Crawl data from diverse curated sources. The metric that we use from Paloma surfaces how models fit to more heterogeneous data, because it samples marked domains from each source equally rather than by their unequal proportions in the source. Intuitively, the baseline trained on the Pile is well fit to such data as that pretraining corpus is mostly sourced from just such smaller, hand-picked sources. But as we wish to scale the total number of tokens in a corpus, the challenge becomes how to integrate more available Common Crawl data without losing sample efficiency on diverse evaluations such as this Paloma metric. In this case we see that the Dolma baseline nearly matches the performance curve of the Pile baseline even though the fraction of Common Crawl data included is more than 4 times greater.

Releasing Dolma

We recognize that any dataset derived from large web crawls will contain factually-incorrect information, toxic language, hate speech, PII, and other types of harmful content. While we have made an effort to curate this dataset taking this into consideration, we believe risk mitigation is best approached from multiple directions, including careful consideration of licenses and access controls.

While most datasets we used were curated with copyright and licensing in mind (e.g., open access papers in peS2o (Soldaini and Lo, 2023), open source repositories in the Stack (Kocetkov et al., 2022)) or were already permissively licensed (e.g., Wikipedia is released under a Creative Commons license), we recognize that large web crawls will also contain copyrighted material. Yet, given current tools, it is not possible to reliably or scalably detect copyrighted materials in a corpus of this size. Our decision to release Dolma publicly factors in several considerations, including that all our data sources were publicly available and already being used in large-scale language model pretraining (both open and closed). We refer the reader to our public position on AI and fair use (Farhadi et al., 2023).

We recognize that the legal and ethical landscape of AI is changing rapidly, and we plan to revisit our choices as new information becomes available.

References

Appendix A Acknowledgements

Dolma would not have been possible without the support of many individuals and institutions. The experimental components of this work were made possible through a partnership with AMD and CSC, enabling use of the LUMI supercomputer. We thank Jonathan Frankle, Cody Blakeney, Matthew Leavitt and Daniel King and the rest of the MosaicML team for sharing findings from experiments on preliminary versions of our data. We thank Vitaliy Chiley for messaging us on Twitter with a suggestion for resolving a random number generator bug that was affecting our data shuffling. We thank Erfan Al-Hossami, Shayne Longpre, and Gregory Yauney for sharing findings from their own large-scale pretraining data experiments. We thank Ce Zhang and Maurice Weber of Together AI for thoughtful discussion on open datasets and data distribution format. We thank Stella Biderman and Aviya Skowron for discussions around data licensing and data processing framework. We thank our teammates at AI2 Nicole DeCario, Matt Latzke, Darrell Plessas, Kelsey MacMillan, Carissa Schoenick, Sam Skjonsberg, and Michael Schmitz for their help with the website, design, internal and external communications, budgeting, and other activities that supported smooth progress on this project. Finally, we also express gratitude for the helpful discussions and feedback from our teammates at AI2 and close collaborators, including Prithviraj (Raj) Ammanabrolu, Maria Antoniak, Chris Callison-Burch, Peter Clark, Pradeep Dasigi, Nicole DeCario, Doug Downey, Ali Farhadi, Suchin Gururangan, Sydney Levine, Maarten Sap, Ludwig Schmidt, Will Smith, Yulia Tsvetkov, and Daniel S. Weld.

Appendix B Author Contributions

Dolma would not be possible without the help of our many teammates and collaborators. Weekly project meetings, messaging apps and documentation were accessible to anyone at AI2. Major decisions about Dolma were often made in these channels, with the exception of certain topics (e.g., legal, funding). While many were involved in the Dolma effort (see Acknowledgements §A), the authors of this paper were those who owned and delivered a critical piece of the puzzle. We detail their contributions below (authors in alphabetical order):

Contributors to data acquisition and source-specific data processing include Akshita Bhagia, Dirk Groeneveld, Rodney Kinney, Kyle Lo, Dustin Schwenk, and Luca Soldaini. Everyone contributed to literature review on available sources and best practices and decisions around sources to pursue. Akshita Bhagia, Rodney Kinney, Dustin Schwenk, and Luca Soldaini handled the bulk of data acquisition and processing and ablation experiments with 1B models for source-specific design decisions. Kyle Lo and Luca Soldaini handled discussions with legal to inform our choice of sources.

Contributors to infrastructure and tooling include Russell Authur, Dirk Groeneveld, Rodney Kinney, Kyle Lo, and Luca Soldaini. Rodney Kinney, Kyle Lo, and Luca Soldaini designed and implemented the shared toolkit used for processing our corpus at scale. Dirk Groeneveld wrote the Bloom filter for deduplication and decontamination. Russell Authur wrote a toolkit for acquisition and storage of Common Crawl data.

Contributors to source-agnostic data processing include Khyathi Chandu, Yanai Elazar, Rodney Kinney, Kyle Lo, Xinxi Lyu, Ian Magnusson, Aakanksha Naik, Abhilasha Ravichander, Zejiang Shen, and Luca Soldaini. Khyathi Chandu and Aakanksha Naik developed the toxic text filter. Kyle Lo and Xinxi Lyu helped evaluate it. Luca Soldaini developed the language filtering approach. Rodney Kinney, Zejiang Shen, and Luca Soldaini developed the “quality” filter. Yanai Elazar identified repeating $n$-gram sequences. Abhilasha Ravichander, Kyle Lo, and Luca Soldaini developed the PII filter. Jesse Dodge and Ian Magnusson developed the evaluation set decontamination approach.

Contributors to ablation experiments include Iz Beltagy, Akshita Bhagia, Jesse Dodge, Dirk Groeneveld, Rodney Kinney, Kyle Lo, Ian Magnusson, Matthew Peters, Kyle Richardson, Dustin Schwenk, Luca Soldaini, Nishant Subramani, Oyvind Tafjord, and Pete Walsh. This work included designing and prioritizing experiments given compute constraints, implementing and running the 1B model experiments, and interpreting results. In particular, Oyvind Tafjord’s work on the evaluation toolkit and Pete Walsh’s work on the model implementation were critical.

Contributors to posthoc experiments and analysis on the final Dolma artifacts. Ben Bogin led the probing experiments on 1B model weights to assess impact of differing code mixtures with support from Kyle Lo and Niklas Muennighoff. Yanai Elazar ran the data analysis tool to summarize and document Dolma’s composition. Valentin Hofmann led the tokenization fertility analysis with support from Kyle Lo. Ananya Harsh Jha and Ian Magnusson performed experiments training and evaluating baseline 1B models on other open datasets with support from Luca Soldaini. Sachin Kumar and Jacob Morrison performed analysis of systematic issues in our choice of language identification and toxicity classifiers with support from Kyle Lo. Niklas Muennighoff led analysis of correlation between different filters employed on Common Crawl data with support from Kyle Lo and Luca Soldaini.

Contributors to licensing and release policy include David Atkinson, Jesse Dodge, Jennifer Dumas, Nathan Lambert, Kyle Lo, Crystal Nam, and Luca Soldaini. David Atkinson, Jesse Dodge, Jennifer Dumas, and Crystal Nam led the bulk of this, including research into data licenses, risk-level determination for pretraining data, and defining the release policy. Kyle Lo and Luca Soldaini provided feedback throughout this process and handled technical details needed for the release. Nathan Lambert provided feedback on release process and handled the actual release strategy, particularly around external communication.

All of the contributors above helped with documentation and writing of their respective components. In particular, Li Lucy provided an extensive literature review of language models, open corpora and pretraining corpus creation practices. Emma Strubell gave valuable feedback on our manuscript. Nathan Lambert helped with feedback on the blog post and other forms of external-facing communication about Dolma.

Hannaneh Hajishirzi, Noah Smith, and Luke Zettlemoyer advised on the project, including broad strategy, writing, recruiting and providing resources. As OLMo project leads, Iz Beltagy, Jesse Dodge, and Dirk Groeneveld helped with visibility and coordination with other critical OLMo project workstreams. Notably, we credit Noah Smith for coming up with the name Dolma.

Finally, Kyle Lo and Luca Soldaini led the overall Dolma project and were involved in all aspects, including project management, planning and design, discussions with legal and ethics committees, data and compute partnerships, infrastructure, tooling, implementation, experiments, writing/documentation, etc.

Appendix C Details about pretraining data behind largest LMs

We provide a high-level overview of the pretraining data curation practices (or lack of reporting thereof) of the largest LMs to illustrate the need for clear documentation and transparency around dataset curation.

Touvron et al. (2023b) provide limited information on pretraining data used for Llama 2; we summarize what we could gather from their manuscript’s Sections 2.1, 4.1, and A.6:

Data provenance. N/A aside from they avoided using Meta user data.

PII. Reported as excluded data from certain websites known to contain high volumes of PII, though what these sites are was not disclosed.

Toxicity. Not explicitly discussed, but appears to not have performed toxicity filtering, opting instead to handle toxic text generation in a later training stage. They do report results from a post hoc analysis in which they used a HateBERT (Caselli et al., 2021) classifier finetuned on ToxiGen (Hartvigsen et al., 2022) to score each document line (and averaged to produce a document-level score).

Language ID. Not stated as used in pretraining data curation, but they provide a post hoc analysis of the pretraining dataset using fastText Language ID with a 0.5 threshold for detected language. We assume this is likely the same protocol they used for pretraining data curation as it is also seen in the CCNet library (Wenzek et al., 2020a), which was used for Llama (Touvron et al., 2023a).

Decontamination. They provide extensive reporting on their deduplication method, which relies on a modified version of the ngram deduplication tool from Lee et al. (2022).

Other. Reported upsampling certain sources, but without further details. They also report a similar analysis as in PaLM 2 (Anil et al., 2023) on aggregate statistics about demographic identities and English pronouns.

C.2 PaLM 2 (Anil et al., 2023)

Anil et al. (2023) provide limited information on the pretraining data used for PaLM 2; we summarize what we could gather from their manuscript’s Sections 3 and D1:

Corpus size. Unreported other than that it is larger than what was used to train PaLM (Chowdhery et al., 2022).

Data provenance. Unreported other than they use web documents, books, code, mathematics, and conversational data.

PII. Reported as performed filtering, but without further details.

Toxicity. Toxic text identified using Perspective API but lacking details needed for reproduction (i.e., text unit, threshold). No details on removal. They did report tackling toxicity through the use of control tokens, but do not provide enough details on this method.

Language ID. Reports the most frequent languages included as well as their frequencies. Lacking details needed for reproduction (i.e., text unit, tools used, threshold).

Quality. Reported as performed filtering, but without further details.

Deduplication. Reported as performed filtering, but without further details.

Other. Anil et al. (2023) report aggregated statistics of how often certain demographic identities are represented (or not) in the data. Such statistics include identities (e.g., American) or English pronouns. These were identified using tools such as KnowYourData or those available on GoogleCloud, but the manuscript lacks specifics necessary for reproduction.

C.3 GPT-4 (OpenAI, 2023)

OpenAI (2023) provides limited information on the pretraining data used for GPT-4; we summarize what we could gather from their manuscript’s Section 2, Appendix C and D, footnotes 5, 6, 10 and 27, and Sections 1.1 and 3.1 in the System Card:

Data provenance. N/A aside from reporting that (1) data was sourced from both the Internet as well as third-party providers, (2) data was sourced mainly before September 2021 with trace amounts of more recent data, and (3) they included GSM-8K (Cobbe et al., 2021) as a tiny fraction of the total pretraining mix.

Toxicity. Removed documents that violate their usage policies from pretraining, including “erotic content,” using a combination of lexicon-based heuristics and bespoke classifiers following Markov et al. (2023b).

Language ID. N/A aside from reporting that the majority of pretraining data is in English.

Decontamination. No discussion of decontamination procedures, but instead reported post-hoc statistics measuring extent of contamination on professional and academic exams, as well as several academic benchmarks. Method for identifying contamination based on exact substring match (after removing whitespaces) of a test example against a pretraining data example. They reported some contamination with BIG-Bench (Srivastava et al., 2023).

Other. There are myriad works performing “data archeology” on GPT-4, that is, attempting to glean information about the pretraining data used in GPT-4 through probes for memorization. For example, Chang et al. (2023) show GPT-4 can generate sequences from copyrighted books. We do not attempt to survey all of these investigative works.
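The contamination check summarized under Decontamination above (exact substring match after removing whitespace) can be sketched as follows. This is a minimal illustration of the idea as described here, not OpenAI's implementation; the function names are hypothetical, and details such as how test examples are sampled or truncated are not covered.

```python
import re
from typing import Iterable


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character from the text."""
    return re.sub(r"\s+", "", text)


def is_contaminated(test_example: str, pretraining_docs: Iterable[str]) -> bool:
    """Return True if the whitespace-stripped test example occurs verbatim
    inside any whitespace-stripped pretraining document."""
    needle = strip_whitespace(test_example)
    if not needle:
        # An empty example would trivially match everything; treat as clean.
        return False
    return any(needle in strip_whitespace(doc) for doc in pretraining_docs)
```

Because whitespace is removed on both sides, differences in spacing or line breaks between a benchmark item and a web copy of it do not hide a match; any other edit (e.g., changed punctuation) does.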

C.4 Claude (Anthropic, 2023)

Unfortunately, we know next to nothing about the pretraining data used for Claude.

C.5 LLaMA (Touvron et al., 2023a)

Touvron et al. (2023a) provides some information on pretraining data used for training LLaMA; we summarize what we could gather from their manuscript’s Section 2.1.

Data provenance. LLaMA used data with known provenance, including five shards of CommonCrawl between 2017 and 2020, C4 (Raffel et al., 2020), GitHub code from Google BigQuery public datasets (restricted to Apache, BSD and MIT licenses), Wikipedia dumps from June to August 2022, Project Gutenberg books, Books3 from The Pile (Gao et al., 2020), LaTeX files from arXiv, and StackExchange pages.

Toxicity. N/A. Reports evaluation on the RealToxicityPrompts (Gehman et al., 2020) benchmark.

Language ID. Reports use of the CCNet library (Wenzek et al., 2020b), which employs fastText (Joulin et al., 2016a) classifiers to remove non-English text (below a 0.5 threshold). No additional language ID reported for C4, GitHub, Books, arXiv, and StackExchange sets. For Wikipedia, reported restriction of pages to those using Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.

Quality. Reports use of the CCNet library (Wenzek et al., 2020b) to remove low-quality content from CommonCrawl; CCNet uses KenLM (Heafield, 2011), an $n$-gram language model, to score perplexity of text as a measure of similarity to Wikipedia text. They do not report their chosen threshold for filtering. They also report use of a linear model trained to classify pages as Wikipedia Reference-like or not. They also report light heuristic filtering of boilerplate content for GitHub and Wikipedia subsets.

Deduplication. Reports use of the CCNet library (Wenzek et al., 2020b) to identify duplicated lines for Common Crawl texts, file-level exact match deduplication for GitHub code, and deduplicating books with over 90% content overlap for the Gutenberg and Books3 subsets.

Mixture. The manuscript reports a mixture of 67% CommonCrawl, 15% C4, 4.5% GitHub, 4.5% Wikipedia, 4.5% Books, 2.5% arXiv, and 2.0% StackExchange. Model training was a single epoch over this mixture except for an upsampling of Wikipedia and Books (2 epochs).

C.6 OPT (Zhang, 2022)

From Zhang (2022)’s manuscript and provided datasheet (Gebru et al., 2021), we summarize the following:

The OPT model was trained on 180B tokens from data sources with known provenance: the datasets used for RoBERTa (Liu et al., 2019), a subset of the Pile (Gao et al., 2020), and the Pushshift Reddit Dataset (Baumgartner et al., 2020a) as processed by Roller et al. (2021). They made several notable changes to these sources:

RoBERTa. Zhang (2022) updated the CC-News collection up to September 2021.

Pile. Zhang (2022) restricted to the following collections: CommonCrawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO and Wikipedia. Zhang (2022) reports omission of other Pile subsets due to gradient norm spikes at the 1B model scale.

Pushshift Reddit. Zhang (2022) restricted to only the longest chain of comments in each thread, an operation that reportedly reduced the dataset by 66%.

Zhang (2022) also describes: (1) deduplication using MinHashLSH (Rajaraman and Ullman, 2011) with a Jaccard similarity threshold of 0.95, and (2) language ID filtering to English-only text, though they do not describe the method used.

They do not discuss whether they do (or do not) perform any processing for PII, toxicity, quality, or decontamination.
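To make the 0.95 Jaccard threshold mentioned above concrete, the sketch below computes exact Jaccard similarity over word shingles and a MinHash estimate of it. It is an illustration only: the shingle size, the hash function, and the number of permutations are assumptions (the source does not describe them), and the LSH banding step that makes MinHashLSH scale to large corpora is omitted.

```python
import hashlib
from typing import List, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Word n-gram shingles of a document (n=3 is an assumed choice)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items: Set[str], num_perm: int = 128) -> List[int]:
    """MinHash signature: for each seeded hash, keep the minimum over items."""
    if not items:
        raise ValueError("cannot compute a signature for an empty set")
    signature = []
    for seed in range(num_perm):
        prefix = str(seed).encode("utf-8") + b":"
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(prefix + item.encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.95) -> bool:
    """Flag a pair whose estimated similarity meets the threshold."""
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    return estimate_jaccard(sig_a, sig_b) >= threshold
```

A threshold as high as 0.95 only flags documents that share nearly all of their shingles, so lightly edited copies are caught while loosely related texts are not.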

Appendix D Experimental Setup

For all data ablations described in this section, we train a 1B parameter model on up to 150B tokens. This is in line with similar model sizes that have been used for ablations in prior work (Le Scao et al., 2022). Each model is a decoder-only transformer model with 16 layers, 16 attention heads, and 2048 dimensionality. We use ALiBi positional embeddings (Ofir Press et al., 2021), SwiGLU activation (Shazeer, 2020), and mixed precision; model context size is set to $2048$ tokens. We use EleutherAI’s GPT NeoX tokenizer (Black et al., 2022). The model is trained using the LionW optimizer (Chen et al., 2023a) with $1\text{e-}4$ peak learning rate, warm-up of $2000$ steps, cosine decay, and $1\text{e-}2$ weight decay. Batch size was set to $1024$ . While we set our max number of steps to 95k (which is approximately 200B tokens), we conclude our experiments at 150B tokens.

We use 64 AMD Instinct MI250X accelerators. Each MI250X accelerator contains two logical nodes; therefore, from the point of view of our training code, our experiments ran on 128 compute units grouped in 16 nodes. For each logical unit, we use a micro-batch size of 8. We implement our experiments using the OLMo codebase.
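The token budget quoted above follows from simple arithmetic, sketched below. The calculation assumes that the batch size counts sequences and that every sequence fills the full 2048-token context; the variable names are illustrative.

```python
import math

BATCH_SIZE = 1024        # sequences per optimizer step
CONTEXT_LENGTH = 2048    # tokens per sequence
MAX_STEPS = 95_000       # configured maximum number of steps
COMPUTE_UNITS = 128      # logical units seen by the training code
MICRO_BATCH = 8          # sequences per logical unit per forward pass

# Tokens consumed by one optimizer step and by the full schedule.
tokens_per_step = BATCH_SIZE * CONTEXT_LENGTH
max_tokens = MAX_STEPS * tokens_per_step

# Number of steps needed to reach the 150B tokens at which ablations stop.
steps_for_150b = math.ceil(150_000_000_000 / tokens_per_step)

# Sequences processed across all units in a single forward/backward pass.
sequences_per_pass = COMPUTE_UNITS * MICRO_BATCH

print(f"tokens per step:      {tokens_per_step:,}")
print(f"tokens at max steps:  {max_tokens:,}")
print(f"steps for 150B:       {steps_for_150b:,}")
print(f"sequences per pass:   {sequences_per_pass}")
```

Under these assumptions, 95k steps correspond to about 199B tokens, and 128 units with a micro-batch of 8 yield exactly 1024 sequences per pass, which matches the stated batch size.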

D.2 Perplexity Evaluation Suite

During training, we keep track of perplexity using an early version of the Paloma benchmark (Magnusson et al., 2023). Unless otherwise noted, references to Paloma indicate this early version. This version of Paloma was derived from the following datasets:

C4 (Raffel et al., 2020; Dodge et al., 2021): Standard contemporary LM pretraining corpus automatically filtered from the April 2019 Common Crawl scrape.

mC4 (Xue et al., 2020); English subset: the English language portion of a pretraining corpus automatically filtered from 71 Common Crawl scrapes.

Pile (Gao et al., 2020), validation set: widely-used language modeling pretraining corpus; contains documents curated from multiple sources including several non-web sources.

WikiText 103 (Merity et al., 2016): a standard collection of verified “Good” and “Featured” articles on Wikipedia.

Penn Tree Bank (Marcus et al., 1994): widely-used NLP corpus derived from Wall Street Journal articles.

M2D2 (Reid et al., 2022), S2ORC subset: papers from Semantic Scholar (Lo et al., 2020) grouped by hierarchical academic field categories.

M2D2 (Reid et al., 2022), Wiki subset: Wikipedia articles grouped by hierarchical categories in the Wikipedia ontology.

C4 100 domains (Chronopoulou et al., 2022): balanced samples of the top 100 domains in C4.

Gab (Zannettou et al., 2018): data from 2016-2018 from an alt-right, free-speech-oriented social media platform that has been shown to contain more hate speech than mainstream platforms.

ICE (Greenbaum, 1991): English from around the world curated by local experts, with subsets for Canada, East Africa, Hong Kong, India, Ireland, Jamaica, Philippines, Singapore, and the USA.

Twitter AAE (Blodgett et al., 2016): balanced sets of tweets labeled as African American or white-aligned English.

Manosphere (Ribeiro et al., 2021): sample of 9 forums where a set of related masculinist ideologies developed over the past decade.

4chan (Papasavva et al., 2020): data from 2016-2019 politics subsection of an anonymity-focused forum shown to contain high rates of toxic content.

In some experiments we use the finalized version of Paloma released in Magnusson et al. (2023). This contains evaluation data sampled from the following additional datasets:

Dolma (this work), uniform sample: A sample of 8,358 documents from the Dolma corpus across all of its subsets (13 from books, 1,642 from Common Crawl web pages, 4,545 Reddit submissions, 450 scientific articles, 1,708 Wikipedia and Wikibooks entries).

RedPajama v1 (Together Computer, 2023a): 1 trillion tokens replication of the LLaMA 1 (Touvron et al., 2023a) pretraining corpus.

Falcon RefinedWeb (Penedo et al., 2023): A corpus of English sampled from all Common Crawl scrapes until June 2023, more aggressively filtered and deduplicated than C4 and mC4-en.

Dolma 100 Subreddits (this work): Balanced samples of the top 100 subreddits by number of posts, sourced from the Dolma Reddit subset.

Dolma 100 Programming Languages (this work): Balanced samples of the top 100 programming languages by number of tokens, sourced from the Dolma Stack subset.

D.3 Downstream Evaluation Suite

We also evaluate models on the following downstream task datasets using the Catwalk framework (Groeneveld et al., 2023):

AI2 Reasoning Challenge (Clark et al., 2018): A science question-answering dataset broken into easy and challenge subsets. Only the easy subset was used in online evaluations. The challenge subset was, however, included in offline evaluations.

BoolQ (Clark et al., 2019): A reading comprehension dataset consisting of naturally occurring yes/no boolean questions and background contexts.

HellaSwag (Zellers et al., 2019): A multiple-choice question-answering dataset that tests situational understanding and commonsense.

OpenBookQA (Mihaylov et al., 2018): A multiple-choice question-answering dataset modeled on open-book science exams.

Physical Interaction: Question Answering (PIQA) (Bisk et al., 2019): A multiple-choice question-answering dataset that focuses on physical commonsense and naive physics.

SciQ (Welbl et al., 2017): A crowdsourced multiple-choice question-answering dataset consisting of everyday questions about physics, chemistry and biology, among other areas of science.

WinoGrande (Sakaguchi et al., 2019): A dataset of pronoun resolution problems involving various forms of commonsense. Modeled after the Winograd challenge of Levesque et al. (2012).

D.4 Training Setup for Olmo-1b

For Olmo-1b, we follow the experimental setup outlined for dataset ablation experiments in Appendix D, with the following differences:

We set the max number of steps to 739,328 (which is roughly 3.1T tokens).

We double the batch size to $2048$ and do so by scaling up to 256 compute units (double what we used for data ablations).

Due to instabilities we found in the LionW optimizer, we switched to using AdamW.

Appendix E Construction of Conversational Threads in Forums Data

Content comes from Reddit’s data API in two separate but linked forms: submissions and comments. Submissions are either "link posts" to external content (e.g. news articles, blogs, or even multimedia content) or "self posts" (submissions written by the poster meant to initiate a discussion thread on a topic). Comments are user replies to either the initiating post (top level comments) or to another user’s comment. Posts, top-level comments, and replies to comments form a nested conversational thread with a submission post at its root and comments branching out into multiple possible dialogue trees.

The tree-like structure of Reddit threads allows for multiple possible data formats depending on how the various components of a thread are combined.

We investigate three formats for their potential as LM pretraining data:

Atomic content. This simple format treats all comments and submissions as independent documents without any structure or connection to the thread they appear in.

Partial threads. This format assembles comments from the same thread into a structured, multi-round dialogue between users. Submissions are left as separate documents. Assembled dialogues are limited to a maximum parent depth, and the resulting documents are only snippets of their originating thread (which is spread across several documents).

Full threads. This complex format combines a given submission and all of its child comments into a single document encompassing an entire thread. Code-like indentation is used to indicate the depth of a comment in the thread’s hierarchy.

We experimentally evaluated these strategies for assembling documents in Figure 14. We found that, for language modeling purposes, treating comments and submissions as atomic units leads to better downstream performance compared to partial and full threads. We hypothesize that the more complex formatting required to handle dialogues might introduce undesirable content for language modeling, such as short and repeated comments. We leave the study of better formatting for forum content for language modeling to future work.
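The difference between the atomic and full-thread formats described above can be illustrated with a small sketch. The data classes, the four-space indentation, and the traversal order are assumptions made for illustration; they are not the data structures or formatting used in the Dolma pipeline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Comment:
    text: str
    replies: List["Comment"] = field(default_factory=list)


@dataclass
class Submission:
    text: str
    comments: List[Comment] = field(default_factory=list)


def atomic_documents(submission: Submission) -> List[str]:
    """Atomic content: the submission and every comment become separate
    documents, with no trace of the thread structure."""
    docs = [submission.text]
    stack = list(reversed(submission.comments))
    while stack:
        comment = stack.pop()
        docs.append(comment.text)
        stack.extend(reversed(comment.replies))
    return docs


def full_thread_document(submission: Submission, indent: str = "    ") -> str:
    """Full thread: one document per submission, with code-like indentation
    marking the depth of each comment."""
    lines = [submission.text]

    def walk(comment: Comment, depth: int) -> None:
        lines.append(indent * depth + comment.text)
        for reply in comment.replies:
            walk(reply, depth + 1)

    for top_level in submission.comments:
        walk(top_level, 1)
    return "\n".join(lines)


# Hypothetical thread used only to show the two output shapes.
thread = Submission("post", [Comment("a", [Comment("a1")]), Comment("b")])
print(atomic_documents(thread))
print(full_thread_document(thread))
```

The atomic format yields four independent documents for this toy thread, while the full-thread format yields a single indented document.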

Appendix F Tokenization Analysis

The first step of processing text with LMs is tokenization, i.e., mapping the text to a sequence of tokens with corresponding input embeddings (Sennrich et al., 2016; Kudo, 2018; Kudo and Richardson, 2018). Recently, there has been a growing interest in the question of how well LM tokenizers fit different data sources (e.g., data in different languages; Ahia et al., 2023; Petrov et al., 2023). Inspired by this emerging line of work, we conduct an explorative analysis of the GPTNeoX tokenizer (Black et al., 2022) applied to Dolma, which provides a first picture of how challenging the different data sources comprised by Dolma are for current LM tokenizers.

We start by taking a global look at the tokenizer’s fit to Dolma. Out of the 50,280 tokens in the tokenizer vocabulary, 50,057 are present in the tokenized text of Dolma. In other words, 223 tokens are never used, amounting to roughly 0.4% of the tokenizer vocabulary. The 223 tokens mostly consist of combinations of whitespace characters (e.g., “\n\n ”, two newline characters followed by two blank space characters). Note that when training an LM with the examined tokenizer on Dolma, the input embeddings corresponding to these tokens would not be updated. In terms of the count distribution of tokens, we find that tokens with smaller IDs tend to have higher counts in Dolma (see Figure 15(a)), which is also reflected by a strong Spearman’s correlation between (i) the ranking of tokens based on their counts in Dolma and (ii) the token IDs ( $r=$ 0.638, $p<$ 0.001). Given how the tokenizer was trained (Sennrich et al., 2016; Black et al., 2022), smaller IDs correspond to byte pairs merged earlier and hence tokens occurring more frequently in the tokenizer training data. Overall, these results suggest a good fit of the GPTNeoX tokenizer to Dolma.

Does the tokenizer fit all data sources included in Dolma equally well? To examine this question, we analyze fertility, which is defined as the average number of tokens per word generated by a tokenizer (Acs, 2019; Scao et al., 2022), in our case measured on a specific data source. We find that fertility is similar for most data sources, ranging between 1.15 (conversational forum subset) and 1.28 (books subset), with the exception of the code subset, which has a substantially higher fertility of 2.45 (see Figure 15(b)). This means that the costs of processing the code subset — be they computational or financial in nature (Petrov et al., 2023) — are more than twice as high compared to the other data sources.

What causes this discrepancy? We find that in the code subset (which mostly contains code), words are often preceded by whitespace characters other than a blank space (e.g., newline, tab, return). Crucially, while a blank space before a word is tokenized as part of that word (e.g., I love you $\rightarrow$ “I”, “ love”, “ you”), other whitespace characters yield separate tokens (e.g., I\tlove\tyou $\rightarrow$ “I”, “\t”, “love”, “\t”, “you”). This can also be seen by plotting the relative frequency of tokens representing whitespace characters by data source, which is one order of magnitude higher for The Stack compared to most other data sources (see Figure 15(c)). When training LMs on The Stack (or code more generally), it thus might be advisable to add special tokens to the tokenizer (e.g., “\nif”; Hong et al., 2021). It is important to notice that this observation applies to most tokenizers in use today (e.g., the tokenizer used by GPT-4), which tend to lack tokens such as “\nif”.
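The effect of whitespace on fertility can be reproduced with a toy example. The tokenizer below is a hypothetical stand-in, not the GPTNeoX tokenizer: it only mimics the whitespace behavior described above (a blank space attaches to the following word, any other whitespace character becomes its own token) and never splits words into subwords.

```python
import re
from typing import Callable, List


def toy_tokenize(text: str) -> List[str]:
    """Toy tokenizer: an optional single blank space is merged with the word
    that follows it; every other whitespace character is a separate token."""
    return re.findall(r" ?\S+|\s", text)


def fertility(text: str, tokenize: Callable[[str], List[str]] = toy_tokenize) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


print(toy_tokenize("I love you"), fertility("I love you"))
print(toy_tokenize("I\tlove\tyou"), fertility("I\tlove\tyou"))
```

With blank spaces the three words map to three tokens, whereas the tab-separated variant needs five tokens for the same three words, which is the mechanism that raises fertility on code.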

Appendix G Evaluating Language Identification

To analyze the impact of the fastText language identification classifier, we ran an external audit on the International Corpus of English (ICE) (Kirk and Nelson, 2018), a dataset containing spoken and written English from nine countries around the world. We ran our language ID tool on all documents in the ICE dataset to estimate how many documents from each region would have been erroneously filtered. The ground truth in this analysis is that every document is in English, and should be classified as such. Interestingly, we found that our fairly permissive threshold (keeping documents with at least a 0.5 score for English) correctly identified every document in ICE as English, regardless of its region of origin.
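The audit described above amounts to computing, per region, the share of documents whose English score falls below the threshold. A minimal sketch follows; the record format and the example scores are hypothetical, and the scores would in practice come from the language identification model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def erroneously_filtered_rate(
    records: Iterable[Tuple[str, float]], threshold: float = 0.5
) -> Dict[str, float]:
    """Given (region, english_score) pairs for documents known to be English,
    return the fraction per region that a score < threshold filter would drop."""
    totals: Dict[str, int] = defaultdict(int)
    dropped: Dict[str, int] = defaultdict(int)
    for region, score in records:
        totals[region] += 1
        if score < threshold:
            dropped[region] += 1
    return {region: dropped[region] / totals[region] for region in totals}


# Made-up scores, only to show the shape of the output.
example = [("Canada", 0.98), ("India", 0.91), ("India", 0.42)]
print(erroneously_filtered_rate(example))
```

A rate of zero for every region corresponds to the outcome reported above for ICE.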

Appendix H Evaluating Toxicity Classification

To measure dialectal biases in the Jigsaw toxicity classifier, we analyze its proclivity to predict English variations spoken in different countries as toxic. Starting with the unfiltered Reddit corpus, we create a dataset of comments from location-based subreddits, filtering for country-specific subreddits with more than 50K comments. This dataset serves as a crude proxy for different dialects of English, assuming most commenters live in the respective locations and speak the variation. We further assume the fraction of actually toxic comments in each of these subreddits to be roughly the same. We compute the toxicity score for each comment in this dataset using the Jigsaw classifier and report the percentage of comments marked as toxic against different classifier thresholds in Figure 17. For all thresholds, for any two locations, we find <5% difference in the fraction of comments marked as toxic, suggesting little to no bias. Further, we plot the distribution of toxicity scores for comments in each subreddit and find that scores assigned to the comments often fall at the extremes (close to 0 or close to 1), suggesting that any reasonable threshold (between 0.1 and 0.9) to predict toxicity will lead to similar outcomes.
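The comparison across locations can be sketched as a threshold sweep over per-subreddit score lists. The code below is illustrative: the scores are invented, the subreddit names are placeholders, and treating a score at or above the threshold as "marked toxic" is an assumption.

```python
from typing import Dict, List


def fraction_toxic(scores: List[float], threshold: float) -> float:
    """Share of comments whose toxicity score is at or above the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


def max_pairwise_gap(scores_by_location: Dict[str, List[float]], threshold: float) -> float:
    """Largest difference in toxic fraction between any two locations.
    The maximum over all pairs equals max(fraction) - min(fraction)."""
    fractions = [fraction_toxic(s, threshold) for s in scores_by_location.values()]
    return max(fractions) - min(fractions)


# Invented scores clustered near 0 and 1, as observed for the real classifier.
toy_scores = {
    "location_a": [0.05, 0.02, 0.95, 0.01],
    "location_b": [0.03, 0.97, 0.04, 0.02],
}
for threshold in (0.1, 0.5, 0.9):
    print(threshold, max_pairwise_gap(toy_scores, threshold))
```

Because the toy scores sit at the extremes, every threshold between 0.1 and 0.9 marks the same comments as toxic, mirroring the observation above.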

Appendix I Analysis of Filters for Code Pipeline

In Table LABEL:tab:stackfilters, we display the number of documents flagged by our two groups of filters for The Stack, as well as their correlation. We find that the RedPajama v1 filters flag significantly more documents than the StarCoder ones for most languages. However, for Java, JavaScript and Python, our filters derived from StarCoder flag a very large number of documents. This is because, for these languages, the StarCoder-derived group contains an additional code-to-text ratio filter that is not employed for other languages. The two groups of filters generally have low correlation, with the exception of a few languages, such as txl, where they are perfectly correlated.

Appendix J Data Sheet

Dolma was created with the primary purpose of training AI2’s autoregressive language model OLMo. It is a mixture of documents from multiple data sources. Documents have been transformed using a combination of rule-based and statistical tools to extract textual content, remove layout information, and filter for English content.

Dolma contains data sourced from different domains. In particular, it contains a mixture of text obtained from a web scrape, scientific content extracted from academic PDFs and its associated metadata, code over a variety of programming languages, reference material from Wikipedia and Wikibooks, as well as public domain books from Project Gutenberg.

We expect this dataset to be useful to train other language models, either in its current form or through further filtering and combining it with other datasets.

Beside language model training, this dataset could be used to study interaction between pretraining corpora and models trained on them. For example, one could study provenance of generations from the model, or perform further corpus analysis.

Specific subsets of Dolma could be used to train domain-specific models. For example, the code subset could be used to train an AI programming assistant.

Due to the myriad transformations applied to the original source materials to derive our dataset, we believe it is ill-suited as a replacement for users seeking to directly consume the original content. We refer users of our dataset to our license and terms on the HuggingFace Hub huggingface.co/datasets/allenai/dolma which detail any use restrictions.

No model trained on this dataset has been publicly released yet.

All individuals who are responsible for this dataset are employed by the Allen Institute for AI. Similarly, computing resources are provided by AI2.

Compute for the OLMo project is provided by AMD and CSC, using GPUs on the LUMI supercomputer.

J.2 Dataset Composition

Instances are plain-text spans of English text or computer code. Each instance was obtained by processing web pages (which might include news, documents, forums, etc.), academic articles, computer code from GitHub, encyclopedic content from Wikipedia, or Project Gutenberg books.

Metadata for subsets of Dolma could be used to reconstruct relationships between items:

Common Crawl. Each document uses the URL of the web page from which it was extracted as its identifier; therefore, it can be used to identify relationships between documents.

C4. The URL of each web page from which documents were extracted is included as metadata; therefore, it can be used to identify relationships between documents.

Reddit. The originating subreddits and thread ids of documents are included in the metadata.

peS2o. The id of each document is the Semantic Scholar Corpus ID of its corresponding manuscript. Metadata for each manuscript can be obtained using the Semantic Scholar APIs (Kinney et al., 2023).

The Stack. The name of the GitHub repository each document belongs to is included as metadata.

Project Gutenberg. The title of each book is included as the first line of each document.

Wikipedia, Wikibooks. For both, metadata includes the URL of the page from which the content was extracted. Structure and connections between documents can be recovered through the URL.

Summary statistics are reported in Table 1.

For each source, raw data is not available directly but could be recovered using source-specific methods:

Common Crawl. We obtain data from common crawl snapshots from 2020-05 to 2023-06. WARC files from Common Crawl can be intersected with Dolma ids to recover original HTML files.

C4. We obtained this corpus from the HuggingFace Hub https://huggingface.co/datasets/allenai/c4. In turn, documents in C4 have been derived from a Common Crawl snapshot for 04/2019. URLs in C4 can be used to recover HTML files.

Reddit. The complete set of monthly data dumps used in this work are no longer distributed by Pushshift, however they can still be obtained through torrents and some public web archives.

peS2o. peS2o is derived from S2ORC (Lo et al., 2020). Original parsed documents can be obtained by extracting documents in S2ORC that share the same ID with peS2o. Further, metadata in S2ORC can be used to obtain the original PDF.

The Stack (deduplicated). The filename and repository name, both available in metadata, can be used to recover original file contents.

Project Gutenberg. The title of each book is the first line of each document.

Wikipedia, Wikibooks. For both, metadata includes the URL of the page from which the content was extracted. Structure and connections between documents can be recovered through the URL.

There are no labels associated with instances. Many text instances were likely created by people or groups of people, but in the vast majority of cases authorship information is unavailable, let alone subpopulation metadata. We leave aggregation and reporting of these statistics to future work.

The data are derived from the web and the original resources may not persist over time. However, each source represents an archival snapshot of that data that should remain fixed and available:

Common Crawl. The Common Crawl data is available on Amazon S3 as part of the Amazon Web Services’ Open Data Sponsorship program and can be freely downloaded (https://commoncrawl.org/the-data/get-started/). We followed the Common Crawl terms of use (https://commoncrawl.org/terms-of-use/).

C4. This corpus can be obtained from the HuggingFace Hub (https://huggingface.co/datasets/allenai/c4) and is released under ODC-By 1.0 (Open Data Commons, 2010).

Reddit. Pushshift no longer distributes this dataset due to changes to the Reddit API’s terms. Unofficial copies of the data might be available through torrents and some public web archives. Pushshift data dumps inherit (https://www.reddit.com/r/pushshift/comments/d6luj5/comment/f0ugpqp) the terms of use of the Reddit API at the time of their collection (March 2023).

peS2o. peS2o is derived from S2ORC (Lo et al., 2020). S2ORC is released through the Semantic Scholar Public API (https://www.semanticscholar.org/product/api) under ODC-By 1.0 (Open Data Commons, 2010).

The Stack (deduplicated). The corpus is available on the HuggingFace Hub https://huggingface.co/datasets/bigcode/the-stack-dedup and consists of code released under a variety of permissive licenses. More details including terms of use for hosting or sharing the corpus are provided in the datacard at the link above.

Project Gutenberg. Project Gutenberg consists of books that are not protected under U.S. copyright law. The corpus is available at gutenberg.org.

Wikipedia, Wikibooks. Wikimedia data dumps are freely available (https://dumps.wikimedia.org) and released under the CC BY-SA 4.0 license (Creative Commons, 2013).

No. A separate evaluation suite that Dolma has been decontaminated against will be released at a later date. Downstream users of this dataset could use any alternative evaluation suite.

A forthcoming manuscript will detail ablations and other experiments that have been conducted to guide the creation of this dataset.

J.3 Data Collection Process

Data acquisition for each subset was performed as follows:

Common Crawl. Snapshots were downloaded from Common Crawl’s official S3 bucket (s3://commoncrawl/) using the cc_net pipeline (Wenzek et al., 2020b). Data was obtained between March 17th and March 27th, 2023.

C4. We clone C4 from the HuggingFace Hub (https://huggingface.co/datasets/allenai/c4) using Git with the Git-LFS extension. Repository cloned on May 24th, 2023.

Reddit. Reddit was acquired in the form of monthly data dumps of comments and submissions collected and distributed by the Pushshift project (https://files.pushshift.io/reddit/submissions/ and https://files.pushshift.io/reddit/comments/). We used the complete set of 422 publicly available dumps (208 comments, 214 submissions) spanning a period from 06/2005–03/2023. The majority of dumps were acquired in March 2023, with the last dumps downloaded in May 2023.

peS2o. We clone peS2o from the HuggingFace Hub (https://huggingface.co/datasets/allenai/peS2o) using Git with the Git-LFS extension. We use peS2o V2. Repository cloned on June 30th, 2023.

The Stack (deduplicated). We clone The Stack (deduplicated) from the HuggingFace Hub (https://huggingface.co/datasets/bigcode/the-stack-dedup) using Git with the Git-LFS extension. Repository cloned on May 28th, 2023.

Project Gutenberg. Data was downloaded directly from gutenberg.org. We used GutenbergPy (Angelescu, Radu, 2013) to extract books. Website accessed on April 3rd, 2023.

Wikipedia, Wikibooks. Dumps were downloaded from Wikimedia’s website (https://dumps.wikimedia.org). We use the dump from March 20th, 2023.

Data was collected and postprocessed by full-time employees at the Allen Institute for AI. No instances in this dataset are manually annotated.

Any metadata associated with each instance was obtained directly from each source.

Sampling for each subset was performed as follows:

Common Crawl. Common Crawl is not a representative sample of the web. Summary statistics about Common Crawl are reported through the cc-crawl-statistics (Common Crawl, 2016) project, available at commoncrawl.github.io/cc-crawl-statistics. Dolma uses Common Crawl snapshots from 2020-05 to 2023-06. (Common Crawl snapshots follow the naming convention xxxx-yy, where xxxx is the year the snapshot was finalized and yy is the week, ranging from 01 to 52.)

Reddit. We use all available Reddit content from 06/2005–03/2023.

The Stack (deduplicated). We use The Stack (deduplicated) in its entirety.

Project Gutenberg. We process all Gutenberg books.

Wikipedia, Wikibooks. We use the English and Simple subset of Wikipedia and Wikibooks in their entirety.

Common Crawl is the only source we did not use in its entirety. We use only about a quarter of all snapshots available. This amount was deemed sufficient for the goal of the OLMo project (train an autoregressive language model with up to 70 billion parameters) given the amount of compute we have available. We decided to use the 24 most recent Common Crawl snapshots.

Not that we are aware of, although a negligible portion of Common Crawl data could have been lost due to network issues with S3 storage. When accessing Common Crawl, we implemented retry mechanisms, but some copies could have failed after exceeding the retry limit.

J.4 Data Preprocessing

All data sources are filtered using FastText language identification models (Joulin et al., 2016a, b) with an English threshold of 0.5.

For the Common Crawl and C4 subsets, we use the following filters (Figure 1) that substantially modify the original data. Note that data might be tagged for removal by one or more filters.

Only Common Crawl, as part of their distribution pipeline: Linearize all HTML into plain text files (WET file generation, see https://commoncrawl.org/get-started);

Only Common Crawl, as part of the CCNet pipeline: We remove frequently occurring paragraphs in Common Crawl by identifying repeated paragraphs on small subsets of each snapshot. This step gets rid of headers that are shared across many pages, such as navigational headers. Removal is operationalized as follows: given the shards $1,\ldots,n,\ldots,N$ that each snapshot comprises, group shards in sets $S=\{n-k,\ldots,n\}$; then, remove exact duplicates of paragraphs in $S$. Paragraphs are defined as newline-separated slices of documents, and compared using their SHA1 hashes. We choose $k$ such that each set is at most 20GB (approximately 70% of paragraphs removed). This is a slight modification of the original CCNet pipeline, where $k$ is chosen so that each set is 2% of a snapshot. We chose to use a fixed shard size, rather than a percentage of the corpus, because a fixed size is more predictable in terms of resource usage, leading to less error-prone code. Conceptually, it is equivalent to putting a threshold on the absolute probability of a paragraph occurring;

Only Common Crawl, deduplication by URL: We deduplicate pages by URL (53% of duplicates removed);

Language identification: remove all documents with an English score lower than 0.5, as determined by FastText language identification models (Joulin et al., 2016a, b) (removed 61.69% of web pages by size);

Quality filter: Remove documents with more than half of their lines not ending in “.”, “?”, “!”, or “"” (22.73% of characters tagged for removal). Note on terminology: the term “quality filter”, while widely used in the literature, does not appropriately describe the outcome of filtering a dataset. Quality might be perceived as a comment on the informativeness, comprehensiveness, or other characteristics valued by humans. However, the filters used in Dolma and other language model efforts select text according to criteria that are inherently ideological (Gururangan et al., 2022);

Quality filter: Remove any document that fails at least one of the Gopher rules (Rae et al., 2021), i.e., that meets any of the conditions below (15.23% of characters tagged for removal):

Fraction of characters in most common ngram greater than a threshold (for bigrams, 0.20; for trigrams, 0.18; for 4-grams, 0.16)

Fraction of characters in duplicate ngrams greater than a threshold (for 5-grams, 0.15; for 6-grams, 0.14; for 7-grams, 0.13; for 8-grams, 0.12; for 9-grams, 0.11; for 10-grams, 0.10)

Contains fewer than 50 or more than 100K words

Median word length is less than 3 or greater than 10

Fraction of words with alpha character less than 0.80

Contains fewer than 2 of a set of required words (“the”, “be”, “to”, “of”, “and”, “that”, “have”, “with”)

Fraction of lines in document starting with bullet point greater than 0.90

Fraction of lines in document ending with ellipsis greater than 0.30

Fraction of lines in document that are duplicated greater than 0.30

Fraction of characters in duplicated lines greater than 0.30

Quality filter: Remove any document that contains a token or sequence of tokens repeating over 100 times, using allenai/gpt-neox-olmo-dolma-v1_5 to obtain tokens (0.003% of characters tagged for removal);

Content filter: Remove sentences that get ranked as toxic by a FastText classifier (score above $0.4$). We train a bigram classifier on the Jigsaw dataset (cjadams et al., 2017) (1.01% of data tagged for removal);

Content filter: Mask Personally Identifiable Information (PII) using regular expressions that identify emails, phone numbers, and IP addresses; pages containing 6 or more PII instances are completely removed from the corpus (0.05% tagged for masking, 0.11% tagged for removal);

Exact document deduplication: documents are considered duplicates if they contain the same text. No punctuation or whitespace is removed. Empty documents count as duplicates (14.9% of documents tagged for removal).

Only Common Crawl, deduplication by paragraph: We deduplicate the web subset at a paragraph level using a Bloom filter (19.1% of UTF-8 characters tagged for removal).
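To make the rule-based quality filters in the list above more concrete, the following is a minimal Python sketch of the terminal-punctuation filter and of a subset of the Gopher rules (word count, median word length, alphabetic words, required words, bullet lines, ellipsis lines). It is an illustration under stated assumptions, not the code in the Dolma toolkit: the function names, whitespace word splitting, the bullet characters, the handling of blank lines, and the treatment of empty documents are all hypothetical choices.

```python
import statistics
import string

TERMINAL_PUNCTUATION = (".", "?", "!", '"')
REQUIRED_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
# Assumption: the text does not list which characters count as bullet points.
BULLET_PREFIXES = ("-", "*", "•")


def _non_empty_lines(text):
    # Assumption: blank lines are ignored when computing line-level fractions.
    return [line.strip() for line in text.split("\n") if line.strip()]


def fails_terminal_punctuation(text):
    """True if more than half of the lines do not end in terminal punctuation."""
    lines = _non_empty_lines(text)
    if not lines:
        return True  # assumption: empty documents are tagged as well
    missing = sum(1 for line in lines if not line.endswith(TERMINAL_PUNCTUATION))
    return missing / len(lines) > 0.5


def gopher_violations(text):
    """Return the names of the implemented Gopher-style rules that the text violates."""
    words = text.split()  # assumption: words are whitespace-separated
    lines = _non_empty_lines(text)
    violations = []
    if len(words) < 50 or len(words) > 100_000:
        violations.append("word_count")
    if words:
        median_length = statistics.median(len(word) for word in words)
        if median_length < 3 or median_length > 10:
            violations.append("median_word_length")
        alpha_words = sum(1 for word in words if any(c.isalpha() for c in word))
        if alpha_words / len(words) < 0.80:
            violations.append("alpha_words")
    normalized = {word.strip(string.punctuation).lower() for word in words}
    if len(REQUIRED_WORDS & normalized) < 2:
        violations.append("required_words")
    if lines:
        bullets = sum(1 for line in lines if line.startswith(BULLET_PREFIXES))
        if bullets / len(lines) > 0.90:
            violations.append("bullet_lines")
        ellipses = sum(1 for line in lines if line.endswith(("...", "…")))
        if ellipses / len(lines) > 0.30:
            violations.append("ellipsis_lines")
    return violations
```

Under this sketch, a document would be tagged for removal if `fails_terminal_punctuation` returns `True` or if `gopher_violations` returns a non-empty list. The n-gram and duplicated-line rules are omitted for brevity.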

For the Reddit subset, we use the following filters that substantially reduce the original data.

Language identification: remove all documents with an English score lower than 0.5, as determined by a FastText language identification model.

Quality filter: Remove comments and submissions shorter than 500 characters in length.

Quality filter: Remove user comments with fewer than three upvotes (Reddit users vote on the quality of submissions and comments).

Content filter: Remove comments and submissions from banned, toxic, or NSFW subreddits.

Content filter: Remove sentences that get ranked as toxic or as hate speech by a FastText classifier (score above $0.4$).

Content filter: Mask Personally Identifiable Information (PII) using regular expressions that identify emails, phone numbers, and IP addresses.

Deduplication: We deduplicate comments and submissions (jointly) at a paragraph level using a Bloom filter.
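Paragraph-level deduplication with a Bloom filter, used for both the web and Reddit subsets, can be sketched as follows. This is a simplified, single-process illustration rather than the implementation used for Dolma: the class, the filter size, the number of hash functions, and the double-hashing scheme based on SHA-256 are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; size and hash count are illustrative assumptions."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, item):
        """Insert item and return True if it was (probably) not seen before."""
        is_new = False
        for position in self._positions(item):
            byte_index, bit_index = divmod(position, 8)
            if not self.bits[byte_index] & (1 << bit_index):
                is_new = True
                self.bits[byte_index] |= 1 << bit_index
        return is_new


def dedupe_paragraphs(documents, bloom):
    """Yield each document with previously seen paragraphs removed."""
    for document in documents:
        kept = [
            paragraph
            for paragraph in document.split("\n")
            # Blank paragraphs are kept as-is and never added to the filter.
            if not paragraph.strip() or bloom.add_if_new(paragraph)
        ]
        yield "\n".join(kept)
```

A Bloom filter trades a small false-positive rate (a new paragraph occasionally treated as already seen) for constant memory, which is what makes this approach practical at corpus scale.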

For the code subset derived from The Stack (deduplicated), we use the following filters (Figure 8):

Language filter: Removed files associated with the following programming languages:

Data or numerical content: csv, json, json5, jsonld, jsoniq, svg

Quality filter: Removed copyright statements in code files from the document preamble (code license and provenance are still tracked in metadata);

Quality filter: Removed documents matching any of the RedPajama v1 (Together Computer, 2023c) code filters (41.49% of data tagged for removal):

Proportion of alpha-numeric characters < 0.25.

Ratio of alphabetical characters to number of tokens < 1.5 (tokens counted using a whitespace tokenizer).

Quality filter: Removed documents matching any of the following Starcoder filters (Li et al., 2023):

Java, Javascript, Python code-to-comment ratio <= 0.01 or > 0.8.
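The two RedPajama-style rules listed above (proportion of alpha-numeric characters and ratio of alphabetical characters to whitespace tokens) can be sketched as follows. The function name and the handling of empty files are assumptions for this illustration; the code-to-comment rule is not covered because its exact definition is not given here.

```python
def fails_redpajama_code_rules(source):
    """True if a code file matches either of the two rules listed above."""
    if not source:
        return True  # assumption: empty files are tagged for removal
    # Rule 1: proportion of alpha-numeric characters < 0.25.
    alnum_fraction = sum(1 for c in source if c.isalnum()) / len(source)
    # Rule 2: alphabetical characters per whitespace token < 1.5.
    tokens = source.split()
    alpha_chars = sum(1 for c in source if c.isalpha())
    alpha_per_token = alpha_chars / len(tokens) if tokens else 0.0
    return alnum_fraction < 0.25 or alpha_per_token < 1.5
```

For instance, a short Python function passes both rules, while a file consisting only of digits or brackets is tagged.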

For the Wikipedia and Wikibooks subsets, we remove pages that contain fewer than 25 UTF-8 words.

Language identification: for each paragraph (defined as newline-separated spans of text), we use FastText to perform language identification. Then, we compute the document language score by averaging the scores of all paragraphs. If a document has a language score lower than $0.5$, it is discarded;

Quality filter: we remove pages that contain fewer than 25 UTF-8 words;

Quality filter: Remove any document that contains a token or sequence of tokens repeating over 100 times (tokens obtained with allenai/gpt-neox-olmo-dolma-v1_5).
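The paragraph-averaged language score and the minimum-length rule for Wikipedia and Wikibooks can be sketched as below. The paragraph scorer is passed in as a function so that a FastText model could be plugged in; `toy_english_score` is only a hypothetical stand-in used to keep the example self-contained, and all function names are assumptions.

```python
def toy_english_score(paragraph):
    """Stand-in scorer (NOT FastText): fraction of letters that are ASCII."""
    letters = [c for c in paragraph if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if c.isascii()) / len(letters)


def document_english_score(text, score_paragraph):
    """Average the per-paragraph scores over newline-separated paragraphs."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    if not paragraphs:
        return 0.0  # assumption: documents without text get the lowest score
    return sum(score_paragraph(p) for p in paragraphs) / len(paragraphs)


def keep_wiki_page(text, score_paragraph, min_score=0.5, min_words=25):
    """Keep a page with at least min_words words and an average score >= min_score."""
    if len(text.split()) < min_words:  # assumption: whitespace-separated words
        return False
    return document_english_score(text, score_paragraph) >= min_score
```

Averaging over paragraphs means that a mostly English page with a few non-English passages can still be kept, whereas a single score over the whole text would depend on how the classifier handles mixed input.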

For the peS2o subset, we remove any document that contains a token or sequence of tokens repeating over 100 times (tokens obtained with allenai/gpt-neox-olmo-dolma-v1_5).
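One way to detect a token or sequence of tokens repeating over 100 times is sketched below. It assumes that "repeating" means consecutive repetition, caps the length of the repeated sequence at a hypothetical `max_period`, and operates on any list of tokens; in the example usage, whitespace splitting stands in for the tokenizer named above.

```python
def max_consecutive_repeats(tokens, period):
    """Largest number of back-to-back copies of any length-`period` sequence."""
    if not tokens:
        return 0
    best = streak = 0
    for i in range(period, len(tokens)):
        # A match with the token one period earlier extends the current run.
        streak = streak + 1 if tokens[i] == tokens[i - period] else 0
        best = max(best, streak)
    # A run of `best` matches corresponds to best // period extra full copies.
    return best // period + 1


def has_excessive_repetition(tokens, max_repeats=100, max_period=10):
    """True if some sequence of up to max_period tokens repeats over max_repeats times."""
    return any(
        max_consecutive_repeats(tokens, period) > max_repeats
        for period in range(1, max_period + 1)
    )
```

For example, `has_excessive_repetition(("spam " * 101).split())` is `True`, while the same text with 100 repetitions is not tagged.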

For Dolma versions 1.0 and 1.5, we perform decontamination for all subsets of Dolma. In particular, we remove paragraphs that are shared with documents in the Paloma evaluation suite (Magnusson et al., 2023). Overall, only 0.003% of our dataset is removed due to contamination with this evaluation set. Dolma version 1.6 is not decontaminated.
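Paragraph-level decontamination of this kind can be sketched as follows. The example assumes exact matching of whitespace-stripped, newline-separated paragraphs held in an in-memory set; the function names are hypothetical, and the matching procedure actually used for Dolma may differ.

```python
def build_eval_paragraph_index(eval_documents):
    """Collect every non-empty paragraph that appears in the evaluation documents."""
    return {
        paragraph.strip()
        for document in eval_documents
        for paragraph in document.split("\n")
        if paragraph.strip()
    }


def decontaminate(document, eval_paragraphs):
    """Drop paragraphs found in the evaluation index; also return removed characters."""
    kept = []
    removed_chars = 0
    for paragraph in document.split("\n"):
        if paragraph.strip() in eval_paragraphs:
            removed_chars += len(paragraph)
        else:
            kept.append(paragraph)
    return "\n".join(kept), removed_chars
```

Summing `removed_chars` over a corpus and dividing by its total size gives a removal rate comparable in spirit to the percentage reported above.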

Raw data is available for all subsets except Common Crawl. Due to space constraints, we only keep a linearized version of Common Crawl snapshots, filtered by language ID as described above.

Raw data is not available for download outside the Allen Institute for AI. Interested individuals may contact the authors of this manuscript if they require access to raw data.

Yes, all preprocessing software is available on GitHub at github.com/allenai/dolma and on PyPI (https://pypi.org/project/dolma/).

J.5 Dataset Distribution

Dolma is distributed via the HuggingFace Hub, which offers access via the datasets (Lhoest et al., 2021) Python package, direct download, and Git using the Git-LFS extension. Additionally, a copy is stored on the cloud storage of the Allen Institute for AI.

The dataset is available now. This manuscript serves as a reference for the dataset.

Information about the license associated with Dolma is available on its release page on the HuggingFace Hub: huggingface.co/datasets/allenai/dolma.

The dataset is distributed for free. Users should verify any restrictions on its release page on the HuggingFace Hub: huggingface.co/datasets/allenai/dolma.

J.6 Dataset Maintenance

The Allen Institute for AI maintains the dataset. For support questions, users are invited to open an issue on GitHub (https://github.com/allenai/dolma/issues) or on the community tab of the dataset page (https://huggingface.co/datasets/allenai/dolma/discussions), the former being preferred over the latter. Any other inquiry should be sent to ai2-info@allenai.org.

The dataset will be updated on an as-needed basis by maintainers at the Allen Institute for AI. Newer versions of the dataset will be labeled accordingly. The latest version of the dataset, as well as a changelog, will be made available starting from the first revision.

Users should keep track of the version of the dataset in use. Information about the latest version of Dolma is available on its release page on the HuggingFace Hub: huggingface.co/datasets/allenai/dolma. Dolma users should cite this manuscript when using this data.

Creation and distribution of derivatives is described above. If contributors want to contribute their improvements back to future Dolma releases, they should contact the corresponding authors of this manuscript.

J.7 Legal & Ethical Considerations

Subsets of Dolma derived from web data are likely created by people or groups of people; however, authorship information is often unavailable.

Authors were not directly informed about the data collection. For encyclopedic and web content, logs of web servers will contain records of spiders run by Common Crawl. For academic content, the peS2o subset (Soldaini and Lo, 2023) is derived from manuscripts that are licensed for permissive distribution by their authors. Reddit content was acquired through a public API adherent to terms of service; individual authors of Reddit posts were not contacted directly. Finally, the Allen Institute for AI did not contact Project Gutenberg.

Due to the nature of and size of Dolma, it is impossible to determine which obligations, if any, are appropriate.

The OLMo project includes an ethics committee composed of members internal and external to the Allen Institute for AI. Plans for the creation of Dolma were reviewed with the committee, and we incorporated their recommendations.

Following practices established in similar efforts, no consent was collected from individuals who might be represented in the dataset. We make available a form (https://forms.gle/q4BNUUxUxKwKkfdT6) for individuals who wish to be removed from the dataset.

Dolma contains text instances that have been derived from web pages crawled by Common Crawl. Content might contain sensitive information, including personal or financial information that users of the web chose to post publicly online. This data is taken only from public places, so the same data is or has been accessible via browsing the web. We have measured a variety of types of personal information, and built tools specifically to remove some types of sensitive information, and through our license we restrict what users can do with this data.

We recommend that individuals submit a request through our form (https://forms.gle/q4BNUUxUxKwKkfdT6) if they wish their information to be removed.

Dolma is not a representative sample of any of its sources. It might underrepresent or overrepresent some communities on the internet; further, papers in the peS2o subset are skewed towards STEM disciplines; books in the Gutenberg library are mostly from the public domain (at the time of publication, books published before 1927); finally, the English and Simple subsets of Wikipedia and Wikibooks might be biased towards events and people from the global north.

We did not attempt to alter the distribution of social groups in Dolma. Large-scale interventions to correct societal biases in large datasets remain challenging, and are left to future work.

This dataset contains text that was derived from web pages scraped by Common Crawl. For much of that data, it is not possible to identify the authors. In many instances, creators purposely choose to post anonymously online, so aiming to infer authorship can be ethically fraught. We provide access to our data, and encourage any creators who would like to have data from or about them removed to reach out.

We created this dataset in aggregate, not separately identifying any individual’s content or information. We took reasonable steps to remove types of personal information that were possible to reliably detect. We restrict who has access to the data, and we release this under a license that prohibits uses that might be deemed discriminatory. We also provide an avenue for any person to contact us to have text from or about them removed from our corpus (https://forms.gle/q4BNUUxUxKwKkfdT6).

This dataset contains text that was derived from web pages scraped by Common Crawl. Therefore, it can contain text posted on public websites by creators on the internet. If an author publicly posted personal information or offensive content, it could be included in this dataset. We took reasonable steps to remove types of personal information that were possible to reliably detect. We also removed documents that contained sentences that were classified as being toxic.