CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen

cs.CL cs.AI

Introduction

Large language models (LLMs) have fundamentally transformed research and applications of natural language processing (NLP), significantly advancing the state-of-the-art performance for numerous tasks and revealing new emergent abilities Brown et al. (2020); Wei et al. (2022). Based on the transformer architecture Vaswani et al. (2017), three major variants of LLMs have been explored in the literature: the encoder-only models to encode input texts into representation vectors, e.g., BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019); the decoder-only models to generate texts, e.g., GPT Radford et al. (2019); Brown et al. (2020); and the encoder-decoder models to perform sequence-to-sequence generation, e.g., BART Lewis et al. (2020) and T5 Raffel et al. (2020). The remarkable capabilities of LLMs have primarily been propelled by the ever-expanding scale of model sizes and training datasets, which have been deemed essential for achieving optimal performance by the scaling laws Hernandez et al. (2022). For instance, beginning with the BERT model, which had a mere few hundred million parameters Devlin et al. (2019), recent GPT-based models have been expanded to encompass hundreds of billions of parameters Shoeybi et al. (2019); Scao et al. (2022); Lieber et al. (2021); Chowdhery et al. (2022). Similarly, the training datasets for LLMs have grown exponentially, evolving from a modest 13GB of text data from Wikipedia and books used for BERT Devlin et al. (2019); Liu et al. (2019) to terabytes of data for the latest models, such as Falcon Penedo et al. (2023), MPT MosaicML (2023), LLaMa Touvron et al. (2023), PolyLM Wei et al. (2023) and ChatGPT (https://openai.com/blog/chatgpt).

As the field keeps progressing rapidly, pre-trained LLMs have typically been released to the public to foster further research and advancements. These models are obtainable either through commercial APIs, as illustrated by ChatGPT and GPT-4, or via open-source initiatives, exemplified by Falcon and LLaMa. Nevertheless, in contrast to the public accessibility of LLMs, the training datasets that underpin the state-of-the-art models have mostly remained closely guarded secrets, even in the case of open-source LLMs such as BLOOM, LLaMa, MPT, and Falcon. For example, Falcon Penedo et al. (2023) and BLOOM Scao et al. (2022) only provide a glimpse of their complete training data, whereas MPT’s, LLaMa’s and PolyLM’s datasets Touvron et al. (2023); Wei et al. (2023) remain inaccessible to the public. On one hand, the lack of transparency has impeded in-depth analysis and comprehension of LLMs, hindering crucial research into attributing and addressing fundamental issues stemming from the training data, such as hallucinations, biases, and toxic content Tamkin et al. (2021); Weidinger et al. (2021); Kenton et al. (2021); Bommasani et al. (2021). On the other hand, concealing the training data restricts the development of LLMs to a select few stakeholders with ample resources, thereby constraining the democratization and benefits of the technology and exacerbating its biases within broader society.

To attain transparency and democratization for LLMs, it is thus crucial to create large-scale and high-quality datasets for training high-performing LLMs while ensuring their public accessibility to foster deeper research and advancements. In the realm of LLMs, high-quality training datasets are often crafted through the application of extensive data cleaning and deduplication processes, aimed at eliminating noisy and redundant content from vast text collections Allamanis (2018); Penedo et al. (2023). To this end, there have been recent efforts from the community to develop such open-source datasets for LLMs, such as RedPajama with 1.21T tokens Computer (2023), SlimPajama (https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama) with 627B tokens, and AI2 Dolma (https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64) with 3T tokens. However, most of the existing open-source datasets for LLMs are tailored for the English language, which hinders the utilization and performance of the resulting LLMs when applied to non-English languages, particularly those with limited linguistic resources Bang et al. (2023); Lai et al. (2023). This emphasis on English also restricts the capacity of open-source datasets to comprehensively tackle the research challenges and democratization concerns of LLMs across the diverse spectrum of over 7,000 languages spoken worldwide.

Simultaneously, some multilingual datasets have been developed and made available, providing text data for multiple languages. Nevertheless, their quality and scale fall short of meeting the requirements for training high-performing LLMs. Specifically, the multilingual text dataset sourced from Wikipedia, while of high quality, is regarded as relatively small when it comes to training LLMs Conneau et al. (2020). The OSCAR datasets Ortiz Suárez et al. (2019); Ortiz Suárez et al. (2020); Abadji et al. (2021, 2022) (https://oscar-project.org) extract text data from CommonCrawl (CC) for more than 160 languages. However, these datasets lack document-level deduplication (i.e., removing similar documents in the dataset), leading to the inclusion of redundant information and impairing the performance of generative LLMs Lee et al. (2022). Similarly, the mC4 Xue et al. (2021), CCAligned Conneau et al. (2020), WikiMatrix Schwenk et al. (2021), and ParaCrawl Bañón et al. (2020) datasets altogether support over 100 languages but suffer from less accurate language identification, introducing noise into the data Kreutzer et al. (2022). These datasets are also not deduplicated at fuzzy and document levels, e.g., via MinHash Broder (1997). Additionally, the CC100 dataset Wenzek et al. (2020); Conneau et al. (2020), employed in training the multilingual XLM-RoBERTa model across 100 languages, only considers the snapshots of CC in 2018, constraining its size and the availability of up-to-date information to train high-performing LLMs.

To address the aforementioned issues for open-source datasets, our work introduces a novel multilingual dataset, called CulturaX, for training LLMs in 167 languages. CulturaX merges the latest iteration of mC4 (version 3.1.0) with all available OSCAR corpora up to the current year, encompassing distributions 20.19, 21.09, 22.01, and 23.01. This amalgamation results in a large multilingual dataset, comprising 27 TB of text data with 6.3 trillion tokens and offering the most up-to-date data for LLM development. More than half of our dataset is dedicated to non-English languages to significantly boost the data size and enhance the feasibility of training models in multilingual scenarios. Importantly, CulturaX is extensively cleaned and deduplicated at the document level to produce the highest quality to train LLMs for multiple languages. In particular, our data cleaning process includes a comprehensive pipeline designed to eliminate low-quality data. This involves removing noisy text, non-linguistic content, toxic data, incorrect language identification, and more. Our data cleaning pipeline employs a variant of the Interquartile Range (IQR) method Dekking et al. (2007) to select appropriate thresholds for various dataset metrics (e.g., stopword ratios, data perplexity, and language identification scores), which can be used to filter noisy outliers for the dataset. As such, we leverage the percentiles of the distributions computed over large samples of data to effectively guide the threshold selection process for each filtering metric and language. Finally, we perform extensive deduplication for the data of the languages within our datasets based on the near deduplication method MinHashLSH Broder (1997); Leskovec et al. (2020) and URLs, leading to high-quality data to train multilingual LLMs. Our dataset will be fully available to the public to promote further research and development for multilingual learning. 
To our knowledge, CulturaX is the largest open-source multilingual dataset to date that is deeply cleaned and deduplicated for LLM and NLP applications.

Multilingual Dataset Creation

To develop a multilingual public dataset for LLMs, our strategy is to combine mC4 Xue et al. (2021) and OSCAR Ortiz Suárez et al. (2019); Abadji et al. (2021, 2022), the two largest multilingual datasets at our disposal. We then process the data with an extensive pipeline, involving two major steps of cleaning and deduplication, to produce an enormous and high-quality dataset for multilingual LLMs.

mC4 is a multilingual document-level dataset, originally created to train the multilingual encoder-decoder model mT5 Xue et al. (2021) for 101 languages. This dataset is extracted from 71 monthly snapshots from CC by removing pages with fewer than three long lines (line length filter), pages with bad words, and duplicated lines across documents. Language identification for the pages in mC4 is done by the cld3 tool Botha et al. (2017) (https://github.com/google/cld3), which is a small feed-forward network Xue et al. (2021). Any pages with a language confidence below 0.95% are excluded. mC4 is deduplicated with exact match at the document level; however, fuzzy document-level deduplication is not performed. We utilize the latest version of mC4 (version 3.1.0, https://huggingface.co/datasets/mc4) prepared by AllenAI in this work.

A notable aspect of our dataset pertains to the web-based origin of our selected datasets, mC4 and OSCAR, extracted from CC. This differs from certain previous work Radford et al. (2019); MosaicML (2023); Touvron et al. (2023) that has also relied on curated datasets like The Pile Gao et al. (2020) and BookCorpus Zhu et al. (2015) to train LLMs, presuming their higher overall quality. However, in the context of multilingual settings, we argue that web-scraped datasets can be a more suitable approach, as curated datasets of superior quality might not be available for various languages. Our strategy of using web-scraped data facilitates efficient data collection across multiple languages, contributing to enhanced training data scales. Furthermore, recent studies have demonstrated the effectiveness of cleaning web-scraped data to yield state-of-the-art LLMs Raffel et al. (2020); Almazrouei et al. (2023). In total, the combination of mC4 and OSCAR provides us 13.5B documents for further processing. Figure 1 illustrates the distribution of the document counts for mC4 and the four available versions of OSCAR in our initial dataset.

Given the combination of the mC4 and OSCAR datasets, we first perform a comprehensive data cleaning procedure to remove noisy and bad content from the data, including language identification, URL-based filtering, metric-based cleaning, and document refinement.

Language Identification: A particular issue concerns the use of two different language identification tools, i.e., cld3 and FastText, for mC4 and OSCAR (respectively). It has been shown in previous studies that cld3 is significantly worse than FastText, causing substantially more language detection errors for mC4 Kreutzer et al. (2022). In fact, compared to several other language detectors, FastText has demonstrated state-of-the-art performance over benchmark datasets (https://modelpredict.com/language-identification-survey). To this end, our first data cleaning step involves applying FastText to re-predict the languages for the documents in mC4. Documents whose predicted languages are different from the provided ones in mC4 will be removed from the dataset. The rationale is to avoid documents that are confusing for the language detectors cld3 and FastText, thus potentially introducing noise for the data. Finally, to ensure the highest quality, we remove data for any language found in mC4 but not supported by FastText.
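The agreement check described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the record layout (`text`, `lang`), the function name `filter_by_language_agreement`, and the `predict_lang` callable are hypothetical. In practice `predict_lang` would wrap a FastText language-identification model and return its top label and probability.

```python
from typing import Callable, Dict, Iterable, Iterator, Set, Tuple


def filter_by_language_agreement(
    docs: Iterable[Dict[str, str]],
    predict_lang: Callable[[str], Tuple[str, float]],
    supported_langs: Set[str],
) -> Iterator[Dict[str, object]]:
    """Keep documents whose provided language matches the re-predicted one.

    `docs` are hypothetical records with a "text" field and the language
    label originally provided by the source dataset in "lang".
    """
    for doc in docs:
        # Drop languages that the second detector does not support at all.
        if doc["lang"] not in supported_langs:
            continue
        predicted, confidence = predict_lang(doc["text"])
        # Drop documents on which the two detectors disagree.
        if predicted != doc["lang"]:
            continue
        # Keep the confidence so it can serve as a filtering metric later.
        yield {**doc, "lang_conf": confidence}
```

Storing the confidence alongside each retained document is a design choice of this sketch; it avoids a second detector pass when the language identification confidence is later used as a filtering metric.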

URL-based Filtering: In the next step, we aim to eliminate pages from the known toxic and harmful sources to reduce relevant risks from our data. In particular, we leverage the latest UT1 blacklist of URLs and domains provided by the University of Toulouse to support Internet use regulation for administrators at schools. This list involves sites from different topics, including pornography, gambling, and hacking, that should be discarded for LLM training. Updated two to three times per week, the blacklist involves more than 3.7M records that are contributed by both humans and robots (e.g., search engines, known addresses and indexes) Abadji et al. (2022). As such, we remove any page from our dataset whose associated URL matches a site in the blacklist. This step is helpful for our dataset as the blacklist was not previously employed for the mC4 dataset. In addition, although OSCAR has already used this blacklist for data cleaning, our approach incorporates the most up-to-date information from the list, which might not be available for the current distributions of OSCAR.
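A minimal sketch of blacklist matching is given below. The function name `is_blacklisted`, the split into `blocked_domains` and `blocked_urls`, and the rule that a blocked domain also blocks its subdomains are assumptions made for illustration; the exact matching rules applied to the UT1 list are not specified in the text.

```python
from urllib.parse import urlparse
from typing import Set


def is_blacklisted(url: str, blocked_domains: Set[str], blocked_urls: Set[str]) -> bool:
    """Return True if the URL matches a blocked URL or a blocked (parent) domain."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()

    # Exact URL match, compared without scheme and without a trailing slash.
    normalized = host + parsed.path.rstrip("/")
    if normalized in blocked_urls:
        return True

    # Domain match: test the host and every parent domain suffix,
    # e.g. "a.b.example" -> "a.b.example", "b.example", "example".
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocked_domains for i in range(len(parts)))
```

For example, with `blocked_domains = {"bad.example"}`, the page `http://sub.bad.example/page` is rejected while `https://notbad.example/` is kept, because matching is done on whole domain labels rather than on substrings.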

Metric-based Cleaning: To enhance the dataset’s quality, motivated by the data processing pipeline from the BigScience’s ROOTS corpus for BLOOM Laurençon et al. (2022); Scao et al. (2022), we further utilize the distributions for various dataset metrics to identify and filter outlying documents. Each metric provides a singular value for every document within the dataset, quantifying specific attributes such as number_words, stopword_ratios, and perplexity_score for each document. For each metric and its range of possible values within the dataset, a threshold will be determined to partition the range into two zones: a normal range and an abnormal range. The abnormal range is designated for documents exhibiting metric values significantly deviating from the norm, classifying them as outliers/noises, and consequently, these outliers are removed from our dataset. As such, we employ a comprehensive array of dataset metrics, which will be collectively employed to refine our dataset, as outlined below:

The last four metrics are suggested by the OSCAR dataset while the others are inherited from the BigScience ROOTS corpus’s pipeline to process OSCAR data. For the perplexity score, following the BigScience ROOTS corpus, we train a SentencePiece tokenizer Kudo (2018) and 5-gram Kneser-Ney language models as provided in the KenLM library Heafield (2011) using the 20230501 dumps of Wikipedia. Documents displaying high perplexity scores based on these KenLM models are considered notably different from Wikipedia articles. This indicates a level of noise that will be excluded from our dataset Wenzek et al. (2020). The tokenizer will also be used to obtain the number of words/tokens in the documents for our metrics. We publicly release our KenLM models on HuggingFace (https://huggingface.co/uonlp/kenlm) to facilitate future exploration.

Repeated information (e.g., words, paragraphs) can appear in the web-curated data due to crawling errors and low-quality sources, causing detrimental consequences for training LLMs Holtzman et al. (2019). The character and word repetition ratios are thus designed to avoid documents with excessively repeated information. High frequencies of special characters, stop words, or flagged words can indicate noisy and low-quality documents. We thus utilize the stop word and flagged word lists for different languages to compute their ratios for document removal. In addition to the stop word and flagged word lists provided by BigScience ROOTS for their 13 languages, we further collect dictionaries for these types of words for other languages. We prioritize the lists that have been shared on personal GitHub accounts for various languages, as these are often crafted by native speakers and exhibit higher quality. Moreover, lower language identification confidence might also suggest noisy language structures for the data. For each document in the dataset, we thus obtain a language identification confidence via the probability that FastText assigns to its corresponding language to aid data filtering. Finally, for the short line-based criteria, we implement a threshold of 100 characters to classify lines as short, as used by OSCAR. Documents with excessive occurrence of short lines will not be retained in our dataset.
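To make the document-level metrics concrete, the sketch below computes three of them with simplified definitions. The function names, whitespace tokenization, and the n-gram-based repetition formula are assumptions for illustration and may differ from the exact definitions used by BigScience ROOTS or OSCAR; only the 100-character short-line cutoff comes from the text.

```python
from collections import Counter
from typing import Set


def stopword_ratio(text: str, stopwords: Set[str]) -> float:
    """Fraction of whitespace-separated words that are stop words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)


def short_line_ratio(text: str, max_chars: int = 100) -> float:
    """Fraction of non-empty lines shorter than `max_chars` characters."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(len(line) < max_chars for line in lines) / len(lines)


def word_repetition_ratio(text: str, n: int = 2) -> float:
    """Fraction of word n-gram occurrences whose n-gram appears more than once."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

Each function returns a single value per document, which is the form required by the thresholding step: a document is then kept or removed by comparing that value against a per-language threshold.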

Threshold Selection: Given the set of dataset metrics, an important question concerns the selection of appropriate thresholds for each metric and language to generate high-quality multilingual data. In the BigScience ROOTS project Laurençon et al. (2022), this selection process is carried out by native speakers of 13 languages. The resulting thresholds are employed for the rest of their 46 languages. The project offers a visualization interface that indexes a sample of a few thousand documents per language, enabling users to monitor data statistics as they adjust thresholds for the metrics. However, this process cannot be easily extended to different languages due to the requirement of experienced native speakers, which incurs significant costs. Furthermore, the limited sample sizes hinder the representativeness of the chosen thresholds for the full datasets. In our analysis, we observe that some selected thresholds for certain languages within BigScience ROOTS almost fall outside the value ranges for the entire dataset, leading to the deactivation of the corresponding metrics.

To address these issues, we leverage a variant of the Interquartile Range (IQR) method Dekking et al. (2007) to select appropriate thresholds for the filtering metrics for our dataset. For each metric and language, we generate a distribution of its possible values across the entire dataset for the language. There is an exception for languages with substantial amounts of data, such as Spanish and Russian, where only 25% of the data is used to calculate these distributions. Afterward, we compute the $Q_{1}$-th and $Q_{3}$-th percentiles of the distribution ($Q_{1}<Q_{3}$) and use them for the thresholds for our filtering metrics. In particular, the lower $Q_{1}$-th percentile will be chosen for the metrics that favor high values (e.g., language identification confidence), while metrics favoring low values (e.g., perplexity scores and document length) will utilize the upper $Q_{3}$-th percentile. We investigate different values for $(Q_{1},Q_{3})$, considering $(25,75)$, $(20,80)$, $(15,85)$, $(10,90)$, and $(5,95)$. The selection of $Q_{1}=10$ and $Q_{3}=90$ has achieved the best data quality for a sample of languages in our examination.

It is worth emphasizing that the utilization of percentiles for threshold selection enables our approach to efficiently draw upon more extensive data samples for each language compared to those employed in the BigScience ROOTS project. This results in more reliable thresholds for the full datasets over different languages. Specifically, concerning the large languages where only a 25% data sample is employed to compute the value distribution for a metric, we observe that the proportion of discarded data to the entire dataset closely aligns with that of the data sample when applying the same selected filtering threshold. This underscores the representativeness of the thresholds selected through our methodology. Finally, once the thresholds for the metrics in a given language have been determined, we will eliminate any document that surpasses a metric’s threshold and enters the unfavorable range of the data.
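The percentile-based threshold selection can be sketched as below. The helper names (`percentile`, `select_threshold`, `keep_document`) and the linear-interpolation percentile are assumptions of this sketch; the text specifies only that the $Q_{1}$-th percentile is used for metrics that favor high values and the $Q_{3}$-th percentile for metrics that favor low values, with $(Q_{1},Q_{3})=(10,90)$.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0 <= q <= 100) using linear interpolation."""
    if not values:
        raise ValueError("cannot compute a percentile of an empty sample")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def select_threshold(values: Sequence[float], favors_high: bool,
                     q1: float = 10, q3: float = 90) -> float:
    """Pick the lower percentile for metrics that favor high values,
    and the upper percentile for metrics that favor low values."""
    return percentile(values, q1) if favors_high else percentile(values, q3)


def keep_document(value: float, threshold: float, favors_high: bool) -> bool:
    """A document is kept if its metric value lies on the favorable side."""
    return value >= threshold if favors_high else value <= threshold
```

With this scheme, each metric taken in isolation discards roughly 10% of the sampled documents for a language; the overall removal rate is higher because a document is dropped as soon as any one metric falls in its unfavorable range.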

Document Refinement: The previous cleaning steps are done at the dataset level, aiming to remove low-quality documents from the dataset. In this step, we further clean the retained documents to improve the quality. It is important to note that our prior metric-based filtering step plays a vital role in eliminating highly noisy documents, which, in turn, streamlines the process of developing effective document cleaning rules during this step. Notably, since the documents from mC4 and OSCAR are extracted from HTML pages crawled from the Internet, a significant portion of them may carry crawling and extraction errors, including long JavaScript lines and extraneous content. Consequently, filtering out these documents greatly simplifies our task of designing rules to clean the documents within our dataset.

As such, for each document, we eliminate its noisy or irrelevant portions via a series of operations. First, we remove any short lines located at the end of each document, as these lines typically contain footer details or unhelpful information from the websites. Second, we eliminate the lines that contain words from our list of JavaScript (JS) keywords (e.g., “

Data Deduplication

Despite thorough data cleaning, the remaining dataset might still contain a substantial amount of repeated data due to various reasons, including information being reposted on the web, multiple references to the same articles, boilerplate content, and plagiarism. The duplicated data can thus cause memorization and significantly hinder generalization for LLMs Lee et al. (2022); Hernandez et al. (2022). Although expensive, data deduplication is thus considered a crucial step to guarantee the highest quality of data for training LLMs. To this end, we undertake a comprehensive deduplication procedure for our dataset, utilizing MinHash Broder (1997) and URLs. This deduplication process is carried out independently for each language. Furthermore, we restrict deduplication to languages that retain over 100K documents following our data cleaning procedures (i.e., $51.5\%$ of our languages), aiming to promote smaller languages within our dataset.

MinHash Deduplication: For each language’s dataset, we first apply the MinHashLSH method Leskovec et al. (2020) to filter similar documents in the dataset. MinHashLSH is a near deduplication technique based on MinHash Broder (1997) with multiple hash functions for $n$-grams and the Jaccard similarity. Locality-Sensitive Hashing (LSH) is incorporated to improve efficiency by focusing on document pairs that are most likely similar. We leverage a variant of the Spark implementation of MinHashLSH in the text-dedup repo (https://github.com/ChenghaoMou/text-dedup/tree/main), employing $5$-grams and a threshold of $0.8$ to determine similar documents for the Jaccard similarity. Running MinHashLSH for each language’s dataset, especially for languages with the largest data volumes like English, Russian, Spanish, and Chinese, represents the most computationally expensive operation in our dataset creation effort.
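The following single-machine sketch illustrates the idea behind MinHashLSH; it is not the Spark implementation from text-dedup. Word-level 5-gram shingles, salted SHA-1 hashes standing in for random permutations, 128 hash functions split into 32 bands, and exact Jaccard verification of candidate pairs are all assumptions of this sketch. Only the $5$-gram size and the $0.8$ threshold come from the text.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def shingles(text: str, n: int = 5) -> Set[str]:
    """Set of word-level n-grams (the whole text if it has fewer than n words)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set: Set[str], num_perm: int = 128) -> List[int]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def candidate_pairs(signatures: Dict[str, List[int]], bands: int = 32) -> Set[Tuple[str, str]]:
    """LSH banding: documents sharing an identical band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs


def near_duplicates(docs: Dict[str, str], threshold: float = 0.8, n: int = 5,
                    num_perm: int = 128, bands: int = 32) -> Set[Tuple[str, str]]:
    """Return candidate pairs whose exact Jaccard similarity reaches the threshold."""
    sets = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    sets = {doc_id: s for doc_id, s in sets.items() if s}  # skip empty documents
    if not sets:
        return set()
    sigs = {doc_id: minhash_signature(s, num_perm) for doc_id, s in sets.items()}
    return {(a, b) for a, b in candidate_pairs(sigs, bands)
            if jaccard(sets[a], sets[b]) >= threshold}
```

With $b$ bands of $r$ rows each, two documents with Jaccard similarity $s$ become candidates with probability $1-(1-s^{r})^{b}$, so the band layout controls how sharply the method separates pairs above and below the threshold. Verifying candidates with the exact Jaccard similarity, as done here, trades extra computation for fewer false positives; large-scale implementations may instead rely on the signatures alone.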

URL-based Deduplication: Finally, we eliminate all documents that share identical URLs with other documents in the dataset. This step is necessary to address situations where various versions of the same articles are linked to identical URLs but have been updated or modified during the publication process, effectively bypassing the near deduplication step. Some URLs for the articles in CC might only display their general domains due to crawling errors. To enhance accuracy, we refrain from removing URLs that only include their general domains.
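A minimal sketch of the URL-based step is shown below. The record layout (`url`, `text`), the function names, and the choice to retain the first document seen for each URL are assumptions of this sketch; the text does not state which copy, if any, is kept. Documents whose URL is only a bare domain are exempt, as described above.

```python
from urllib.parse import urlparse
from typing import Dict, Iterable, List


def is_bare_domain(url: str) -> bool:
    """True if the URL carries no path or query beyond the domain itself."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    return parsed.path in ("", "/") and not parsed.query


def dedup_by_url(docs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop later documents that share an identical, non-bare-domain URL."""
    seen = set()
    kept = []
    for doc in docs:
        url = doc.get("url", "")
        # Bare-domain (or missing) URLs are unreliable keys, so never dedup on them.
        if not url or is_bare_domain(url):
            kept.append(doc)
            continue
        if url in seen:
            continue
        seen.add(url)
        kept.append(doc)
    return kept
```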

We utilize 600 AWS c5.24xlarge EC2 instances to preprocess and deduplicate our multilingual dataset. Each instance is equipped with 96 CPU cores, 192GB of memory, and 1TB of disk space. The disk space can be used to replace memory when necessary (e.g., for data deduplication).

Data Analysis and Experiments

After completing all the cleaning and deduplication steps, our final dataset comprises 6.3 trillion tokens spanning 167 languages. Table 1 provides an overview of the number of documents and tokens for the top 42 languages in CulturaX following each processing stage. As can be seen, our data-cleaning pipeline can substantially reduce the number of documents in the original mC4 and OSCAR datasets for each language. The total number of removed documents accounts for 46.48% of our initial documents, suggesting the effectiveness of our approaches to filter noisy information for multilingual datasets.

Related Work

Compared to other NLP tasks, language models can be trained with unlabeled data, enabling efficient data collection to produce gigantic scales for the training data. There are two primary types of data commonly used for training LLMs: curated data and web crawl data. Curated data typically consists of well-written and well-formatted text from targeted sources and domains, e.g., Wikipedia articles, books, newswire articles, and scientific papers, as used for the “The Pile” Gao et al. (2020) and “BookCorpus” Zhu et al. (2015) datasets. In contrast, web crawl data encompasses text gathered from a wide array of sources across the internet, varying significantly in terms of format and writing styles, e.g., blogs, social media posts, news articles, and advertisements. CommonCrawl (CC) is a widely-used web crawl repository that has collected petabytes of data over the Internet for 12 years. Curated data is frequently considered to possess higher quality, which has resulted in its preference for training early LLMs, e.g., BERT Devlin et al. (2019) and GPT-2 Radford et al. (2019). However, as the demand for larger models has grown, web crawl data has gained more attention as it contributes a substantial portion to the training data of recent LLMs, e.g., RoBERTa Liu et al. (2019), BART Lewis et al. (2020), T5 Raffel et al. (2020), GPT-3 Rae et al. (2021), LLaMa Touvron et al. (2023), MPT MosaicML (2023), and Falcon Almazrouei et al. (2023). As such, different extractions of CC have been produced to train such LLMs, including C4 Raffel et al. (2020), CC-News Nagel, and STORIES Trinh and Le (2018).

Regarding the accessibility of training data, datasets used to train early LLMs are often made available to the public Devlin et al. (2019); Raffel et al. (2020). However, in the case of the most recent state-of-the-art (SOTA) generative LLMs, their training datasets are not released fully, potentially due to commercial interests. This applies not only to proprietary models like ChatGPT and GPT-4 but also to models that claim to be open-source models such as LLaMa, MPT, Falcon, and BLOOM Scao et al. (2022). To address the transparency issue with existing LLMs, recent efforts have been made to replicate and release the training datasets for the state-of-the-art LLMs, i.e., RedPajama Computer (2023), SlimPajama, and AI2 Dolma. The key distinctions for these datasets concern their large-scale text data that has been meticulously cleaned and document-level deduplicated to ensure high quality for training LLMs. Nonetheless, a common drawback of these open-source datasets is that they remain predominantly focused on English data, offering limited data for other languages.

To obtain a multilingual large-scale dataset for training LLMs, it is more convenient to exploit web-scraped datasets such as CC to enable efficient data collection with up-to-date information in multiple languages. In addition, to ensure high quality for high-performing LLMs, it is necessary to extensively clean and deduplicate the multilingual data to avoid noisy and irrelevant content, e.g., low-quality machine-generated text and adult content Trinh and Le (2018); Kreutzer et al. (2022); Raffel et al. (2020). As such, a typical data processing pipeline to generate high-quality datasets can involve multiple steps, as demonstrated by FastText Joulin et al. (2016), CC-Net Wenzek et al. (2020), the BigScience ROOTS corpus for the BLOOM models Laurençon et al. (2022); Scao et al. (2022), the RefinedWeb dataset for the Falcon model Penedo et al. (2023); Almazrouei et al. (2023), and the dataset to train the LLaMa models Touvron et al. (2023). The first step in such pipelines is language identification, which assigns data to their corresponding languages Joulin et al. (2016). The next steps feature various dataset-specific rules and heuristics to filter undesirable content according to the ratios of special characters, short lines, and bad words, among others Grave et al. (2018); Laurençon et al. (2022). The data can also be filtered via lightweight models, e.g., via the KenLM language models Heafield (2011), to avoid noisy documents Wenzek et al. (2020). Finally, data deduplication should be performed to remove similar or repeated information Laurençon et al. (2022); Penedo et al. (2023). An important step in this regard involves fuzzy deduplication at document level, e.g., via MinHash Broder (1997), to eliminate similar documents, thus mitigating memorization and improving the generalization for resulting LLMs Lee et al. (2022).

To this end, while there are multilingual open-source datasets with text data in multiple languages, such as mC4 Xue et al. (2021), OSCAR Ortiz Suárez et al. (2019), CC100 Wenzek et al. (2020); Conneau et al. (2020), and the BigScience ROOTS corpus Laurençon et al. (2022), their quality and scale do not meet the requirements for effectively training LLMs, particularly generative models such as GPT. For example, as highlighted in the introduction, both mC4 and OSCAR lack fuzzy deduplication for the data at the document level. mC4 also suffers from poorer language identification due to the use of cld3. BigScience ROOTS only provides a small data sample for 46 languages while CC100 does not have information beyond 2018. Our dataset CulturaX thus comprehensively addresses the issues for the existing datasets, offering a multilingual, open-source, and large-scale dataset with readily usable and high-quality data to train LLMs.

Conclusion