CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen
Introduction
Large language models (LLMs) have fundamentally transformed research and applications of natural language processing (NLP), significantly advancing the state-of-the-art performance for numerous tasks and revealing new emergent abilities Brown et al. (2020); Wei et al. (2022). Based on the transformer architecture Vaswani et al. (2017), three major variants of LLMs have been explored in the literature: the encoder-only models to encode input texts into representation vectors, e.g., BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019); the decoder-only models to generate texts, e.g., GPT Radford et al. (2019); Brown et al. (2020); and the encoder-decoder models to perform sequence-to-sequence generation, e.g., BART Lewis et al. (2020) and T5 Raffel et al. (2020). The remarkable capabilities of LLMs have primarily been propelled by the ever-expanding scale of model sizes and training datasets, which have been deemed essential for achieving optimal performance by the scaling laws Hernandez et al. (2022). For instance, beginning with the BERT model, which had a mere few hundred million parameters Devlin et al. (2019), recent GPT-based models have been expanded to encompass hundreds of billions of parameters Shoeybi et al. (2019); Scao et al. (2022); Lieber et al. (2021); Chowdhery et al. (2022). Similarly, the training datasets for LLMs have grown exponentially, evolving from a modest 13GB of text data from Wikipedia and books used for BERT Devlin et al. (2019); Liu et al. (2019) to terabytes of data for the latest models, such as Falcon Penedo et al. (2023), MPT MosaicML (2023), LLaMa Touvron et al. (2023), PolyLM Wei et al. (2023) and ChatGPT (https://openai.com/blog/chatgpt).
As the field keeps progressing rapidly, pre-trained LLMs have typically been released to the public to foster further research and advancements. These models are obtainable either through commercial APIs, as illustrated by ChatGPT and GPT-4, or via open-source initiatives, exemplified by Falcon and LLaMa. Nevertheless, in contrast to the public accessibility of LLMs, the training datasets that underpin the state-of-the-art models have mostly remained closely guarded secrets, even in the case of open-source LLMs such as BLOOM, LLaMa, MPT, and Falcon. For example, Falcon Penedo et al. (2023) and BLOOM Scao et al. (2022) only provide a glimpse of their complete training data, whereas MPT’s, LLaMa’s and PolyLM’s datasets Touvron et al. (2023); Wei et al. (2023) remain inaccessible to the public. On one hand, the lack of transparency has impeded in-depth analysis and comprehension of LLMs, hindering crucial research into attributing and addressing fundamental issues stemming from the training data, such as hallucinations, biases, and toxic content Tamkin et al. (2021); Weidinger et al. (2021); Kenton et al. (2021); Bommasani et al. (2021). On the other hand, concealing the training data restricts the development of LLMs to a select few stakeholders with ample resources, thereby constraining the democratization and benefits of the technology and exacerbating its biases within broader society.
To attain transparency and democratization for LLMs, it is thus crucial to create large-scale and high-quality datasets for training high-performing LLMs while ensuring their public accessibility to foster deeper research and advancements. In the realm of LLMs, high-quality training datasets are often crafted through the application of extensive data cleaning and deduplication processes, aimed at eliminating noisy and redundant content from vast text collections Allamanis (2018); Penedo et al. (2023). To this end, there have been recent efforts from the community to develop such open-source datasets for LLMs, such as RedPajama with 1.21T tokens Computer (2023), SlimPajama (https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama) with 627B tokens, and AI2 Dolma (https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64) with 3T tokens. However, most of the existing open-source datasets for LLMs are tailored for the English language, which hinders the utilization and performance of the resulting LLMs when applied to non-English languages, particularly those with limited linguistic resources Bang et al. (2023); Lai et al. (2023). This emphasis on English also restricts the capacity of open-source datasets to comprehensively tackle the research challenges and democratization concerns of LLMs across the diverse spectrum of over 7,000 languages spoken worldwide.
Simultaneously, some multilingual datasets have been developed and made available, providing text data for multiple languages. Nevertheless, their quality and scale fall short of meeting the requirements for training high-performing LLMs. Specifically, the multilingual text dataset sourced from Wikipedia, while of high quality, is regarded as relatively small when it comes to training LLMs Conneau et al. (2020). The OSCAR datasets Ortiz Suárez et al. (2019); Ortiz Suárez et al. (2020); Abadji et al. (2021, 2022) (https://oscar-project.org) extract text data from CommonCrawl (CC) for more than 160 languages. However, these datasets lack document-level deduplication (i.e., removing similar documents in the dataset), leading to the inclusion of redundant information and impairing the performance of generative LLMs Lee et al. (2022). Similarly, the mC4 Xue et al. (2021), CCAligned Conneau et al. (2020), WikiMatrix Schwenk et al. (2021), and ParaCrawl Bañón et al. (2020) datasets altogether support over 100 languages but suffer from less accurate language identification, introducing noise into the data Kreutzer et al. (2022). These datasets are also not deduplicated at fuzzy and document levels, e.g., via MinHash Broder (1997). Additionally, the CC100 dataset Wenzek et al. (2020); Conneau et al. (2020), employed in training the multilingual XLM-RoBERTa model across 100 languages, only considers the snapshots of CC in 2018, constraining its size and the availability of up-to-date information to train high-performing LLMs.
To address the aforementioned issues for open-source datasets, our work introduces a novel multilingual dataset, called CulturaX, for training LLMs in 167 languages. CulturaX merges the latest iteration of mC4 (version 3.1.0) with all available OSCAR corpora up to the current year, encompassing distributions 20.19, 21.09, 22.01, and 23.01. This amalgamation results in a large multilingual dataset, comprising 27 TB of text data with 6.3 trillion tokens and offering the most up-to-date data for LLM development. More than half of our dataset is dedicated to non-English languages to significantly boost the data size and enhance the feasibility of training models in multilingual scenarios. Importantly, CulturaX is extensively cleaned and deduplicated at the document level to yield high-quality data for training LLMs in multiple languages. In particular, our data cleaning process includes a comprehensive pipeline designed to eliminate low-quality data. This involves removing noisy text, non-linguistic content, toxic data, documents with incorrect language identification, and more. Our data cleaning pipeline employs a variant of the Interquartile Range (IQR) method Dekking et al. (2007) to select appropriate thresholds for various dataset metrics (e.g., stopword ratios, data perplexity, and language identification scores), which can be used to filter noisy outliers for the dataset. As such, we leverage the percentiles of the distributions computed over large samples of data to effectively guide the threshold selection process for each filtering metric and language. Finally, we perform extensive deduplication for the data of each language in our dataset based on the near-deduplication method MinHashLSH Broder (1997); Leskovec et al. (2020) and URLs, leading to high-quality data to train multilingual LLMs. Our dataset will be fully available to the public to promote further research and development for multilingual learning.
To our knowledge, CulturaX is the largest open-source multilingual dataset to date that is deeply cleaned and deduplicated for LLM and NLP applications.
Multilingual Dataset Creation
To develop a multilingual public dataset for LLMs, our strategy is to combine mC4 Xue et al. (2021) and OSCAR Ortiz Suárez et al. (2019); Abadji et al. (2021, 2022), the two largest multilingual datasets at our disposal. We then process the data with an extensive pipeline, involving two major steps of cleaning and deduplication, to produce an enormous and high-quality dataset for multilingual LLMs.
mC4 is a multilingual document-level dataset, originally created to train the multilingual encoder-decoder model mT5 Xue et al. (2021) for 101 languages. This dataset is extracted from 71 monthly snapshots from CC by removing pages with fewer than three long lines (line length filter), pages with bad words, and duplicated lines across documents. Language identification for the pages in mC4 is done by the cld3 tool Botha et al. (2017) (https://github.com/google/cld3), which is a small feed-forward network Xue et al. (2021). Any pages with a language confidence below 0.95% are excluded. mC4 is deduplicated with exact match at the document level; however, fuzzy document-level deduplication is not performed. We utilize the latest version of mC4 (version 3.1.0, https://huggingface.co/datasets/mc4) prepared by AllenAI in this work.
A notable aspect of our dataset pertains to the web-based origin of our selected datasets, mC4 and OSCAR, extracted from CC. This differs from certain previous work Radford et al. (2019); MosaicML (2023); Touvron et al. (2023) that has also relied on curated datasets like The Pile Gao et al. (2020) and BookCorpus Zhu et al. (2015) to train LLMs, presuming their higher overall quality. However, in the context of multilingual settings, we argue that web-scraped datasets can be a more suitable approach, as curated datasets of superior quality might not be available for various languages. Our strategy of using web-scraped data facilitates efficient data collection across multiple languages, contributing to enhanced training data scales. Furthermore, recent studies have demonstrated the effectiveness of cleaning web-scraped data to yield state-of-the-art LLMs Raffel et al. (2020); Almazrouei et al. (2023). In total, the combination of mC4 and OSCAR provides us with 13.5B documents for further processing. Figure 1 illustrates the distribution of the document counts for mC4 and the four available versions of OSCAR in our initial dataset.
Given the combination of the mC4 and OSCAR datasets, we first perform a comprehensive data cleaning procedure to remove noisy and bad content from the data, including language identification, URL-based filtering, metric-based cleaning, and document refinement.
Language Identification: A particular issue concerns the use of two different language identification tools, i.e., cld3 and FastText, for mC4 and OSCAR (respectively). It has been shown in previous studies that cld3 is significantly worse than FastText, causing substantially more language detection errors for mC4 Kreutzer et al. (2022). In fact, compared to several other language detectors, FastText has demonstrated state-of-the-art performance on benchmark datasets (https://modelpredict.com/language-identification-survey). To this end, our first data cleaning step involves applying FastText to re-predict the languages for the documents in mC4. Documents whose predicted languages are different from the provided ones in mC4 will be removed from the dataset. The rationale is to avoid documents that are confusing for the language detectors cld3 and FastText and thus potentially introduce noise into the data. Finally, to ensure the highest quality, we remove data for any language found in mC4 but not supported by FastText.
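The agreement rule described above can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: the record layout (`text` and `lang` keys), the function names, and `toy_predict` are hypothetical, and in practice `predict_lang` would wrap a real language identifier such as FastText.

```python
from typing import Callable, Iterable, Iterator, Set


def filter_by_language_agreement(
    docs: Iterable[dict],
    predict_lang: Callable[[str], str],
    supported_langs: Set[str],
) -> Iterator[dict]:
    """Yield documents whose source label is supported and re-confirmed."""
    for doc in docs:
        if doc["lang"] not in supported_langs:
            continue  # language not covered by the second detector: drop
        if predict_lang(doc["text"]) != doc["lang"]:
            continue  # the two detectors disagree: treat as noisy and drop
        yield doc


def toy_predict(text: str) -> str:
    """Toy stand-in for a real language identifier (illustration only)."""
    return "vi" if "không" in text else "en"


docs = [
    {"text": "Tôi không biết", "lang": "vi"},  # kept: both labels agree
    {"text": "I do not know", "lang": "vi"},   # dropped: labels disagree
    {"text": "Hello there", "lang": "xx"},     # dropped: unsupported label
]
kept = list(filter_by_language_agreement(docs, toy_predict, {"vi", "en"}))
```

Passing the detector in as a callable keeps the rule itself independent of any particular model file or library version.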
URL-based Filtering: In the next step, we aim to eliminate pages from known toxic and harmful sources to reduce the associated risks in our data. In particular, we leverage the latest UT1 blacklist of URLs and domains provided by the University of Toulouse to support Internet use regulation for administrators at schools. This list covers sites from different topics, including pornography, gambling, and hacking, that should be discarded for LLM training. Updated two to three times per week, the blacklist contains more than 3.7M records that are contributed by both humans and robots (e.g., search engines, known addresses and indexes) Abadji et al. (2022). As such, we remove any page from our dataset whose associated URL matches a site in the blacklist. This step is helpful for our dataset as the blacklist was not previously employed for the mC4 dataset. In addition, although OSCAR has already used this blacklist for data cleaning, our approach incorporates the most up-to-date information from the list, which might not be available for the current distributions of OSCAR.
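A blacklist lookup of this kind might look like the sketch below. The matching policy is an assumption: the text only says that a page is removed when its URL matches a site in the list, so treating subdomains of a blocked domain as blocked is a choice made here for illustration. The function names and example entries are hypothetical, and URLs are assumed to carry a scheme such as `https://`.

```python
from typing import List, Set
from urllib.parse import urlsplit


def host_suffixes(url: str) -> List[str]:
    """Return the host and its parent domains, most specific first."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # "a.b.example.com" -> ["a.b.example.com", "b.example.com", "example.com"]
    return [".".join(labels[i:]) for i in range(max(len(labels) - 1, 1))]


def is_blacklisted(url: str, blocked_domains: Set[str], blocked_urls: Set[str]) -> bool:
    """True if the exact URL, its host, or a parent domain is blocked."""
    if url in blocked_urls:
        return True
    return any(suffix in blocked_domains for suffix in host_suffixes(url))


# Hypothetical blacklist entries, for illustration only.
blocked_domains = {"casino.example"}
blocked_urls = {"https://forum.example/bad-thread"}

pages = [
    "https://www.casino.example/play",    # blocked through its parent domain
    "https://news.example/casino",        # kept: only the path mentions it
    "https://forum.example/bad-thread",   # blocked as an exact URL
]
kept_pages = [p for p in pages if not is_blacklisted(p, blocked_domains, blocked_urls)]
```

Storing the entries in sets keeps each lookup constant-time, which matters when a list with millions of records is checked against billions of pages.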
Metric-based Cleaning: To enhance the dataset’s quality, motivated by the data processing pipeline from the BigScience’s ROOTS corpus for BLOOM Laurençon et al. (2022); Scao et al. (2022), we further utilize the distributions for various dataset metrics to identify and filter outlying documents. Each metric provides a singular value for every document within the dataset, quantifying specific attributes such as number_words, stopword_ratios, and perplexity_score for each document. For each metric and its range of possible values within the dataset, a threshold will be determined to partition the range into two zones: a normal range and an abnormal range. The abnormal range is designated for documents exhibiting metric values significantly deviating from the norm, classifying them as outliers/noises, and consequently, these outliers are removed from our dataset. As such, we employ a comprehensive array of dataset metrics, which will be collectively employed to refine our dataset, as outlined below:
The last four metrics are suggested by the OSCAR dataset while the others are inherited from the BigScience ROOTS corpus’s pipeline to process OSCAR data. For the perplexity score, following the BigScience ROOTS corpus, we train a SentencePiece tokenizer Kudo (2018) and 5-gram Kneser-Ney language models as provided in the KenLM library Heafield (2011) using the 20230501 dumps of Wikipedia. Documents displaying high perplexity scores based on these KenLM models are considered notably different from Wikipedia articles. This indicates a level of noise that will be excluded from our dataset Wenzek et al. (2020). The tokenizer will also be used to obtain the number of words/tokens in the documents for our metrics. We publicly release our KenLM models on HuggingFace (https://huggingface.co/uonlp/kenlm) to facilitate future exploration.
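For intuition on how a perplexity score relates to language-model probabilities, the sketch below converts a total log10 probability into a per-token perplexity. It assumes the model reports base-10 log probabilities summed over the scored tokens, which is our reading of how KenLM-style n-gram models are usually used; the function name is hypothetical, and the exact token count (for example, whether an end-of-sentence token is included) may differ from the authors' setup.

```python
def perplexity_from_log10(total_log10_prob: float, num_tokens: int) -> float:
    """Per-token perplexity from a summed base-10 log probability."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    # Perplexity is the inverse geometric mean of the token probabilities.
    return 10.0 ** (-total_log10_prob / num_tokens)


# Four tokens, each assigned probability 0.1: the total log10 probability is -4.
example_ppl = perplexity_from_log10(-4.0, 4)
```

A document whose tokens are each ten times less likely under the Wikipedia-trained model ends up with a perplexity ten times higher, which is what makes the score usable as an outlier signal.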
Repeated information (e.g., words, paragraphs) can appear in the web-curated data due to crawling errors and low-quality sources, causing detrimental consequences for training LLMs Holtzman et al. (2019). The character and word repetition ratios are thus designed to avoid documents with excessively repeated information. High frequencies of special characters, stop words, or flagged words can indicate noisy and low-quality documents. We thus utilize the stop word and flagged word lists for different languages to compute their ratios for document removal. In addition to the stop word and flagged word lists provided by BigScience ROOTS for their 13 languages, we further collect dictionaries for these types of words for other languages. We prioritize the lists that have been shared on personal GitHub accounts for various languages, as these are often crafted by native speakers and exhibit higher quality. Moreover, lower language identification confidence might also suggest noisy language structures for the data. For each document in the dataset, we thus obtain a language identification confidence via the probability that FastText assigns to its corresponding language to aid data filtering. Finally, for the short line-based criteria, we implement a threshold of 100 characters to classify lines as short, as used by OSCAR. Documents with excessive occurrence of short lines will not be retained in our dataset.
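Two of these per-document metrics can be sketched as below. The exact definitions are assumptions: words are split on whitespace here, whereas the text obtains word counts from a SentencePiece tokenizer, and a line is counted as short when it has at most 100 characters. The function names and the tiny stop word list are hypothetical.

```python
from typing import Set


def stopword_ratio(text: str, stopwords: Set[str]) -> float:
    """Fraction of whitespace-separated words found in the stop word list."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in stopwords for word in words) / len(words)


def short_line_ratio(text: str, max_short_len: int = 100) -> float:
    """Fraction of non-empty lines with at most `max_short_len` characters."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(len(line) <= max_short_len for line in lines) / len(lines)


sw_ratio = stopword_ratio("the cat and the dog", {"the", "and"})  # 3 of 5 words
sl_ratio = short_line_ratio("menu\n" + "x" * 150)                  # 1 of 2 lines
```

Each function returns a single number per document, so the values can be collected across a language's corpus to form the distributions used for threshold selection.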
Threshold Selection: Given the set of dataset metrics, an important question concerns the selection of appropriate thresholds for each metric and language to generate high-quality multilingual data. In the BigScience ROOTS project Laurençon et al. (2022), this selection process is carried out by native speakers of 13 languages. The resulting thresholds are employed for the rest of their 46 languages. The project offers a visualization interface that indexes a sample of a few thousand documents per language, enabling users to monitor data statistics as they adjust thresholds for the metrics. However, this process cannot be easily extended to different languages due to the requirement of experienced native speakers, which incurs significant costs. Furthermore, the limited sample sizes hinder the representativeness of the chosen thresholds for the full datasets. In our analysis, we observe that some selected thresholds for certain languages within BigScience ROOTS almost fall outside the value ranges for the entire dataset, leading to the deactivation of the corresponding metrics.
To address these issues, we leverage a variant of the Interquartile Range (IQR) method Dekking et al. (2007) to select appropriate thresholds for the filtering metrics for our dataset. For each metric and language, we generate a distribution of its possible values across the entire dataset for the language. There is an exception for languages with substantial amounts of data, such as Spanish and Russian, where only 25% of the data is used to calculate these distributions. Afterward, we compute the $Q_1$-th and $Q_3$-th percentiles of the distribution ($Q_1 < Q_3$) and use them as the thresholds for our filtering metrics. In particular, the lower $Q_1$-th percentile will be chosen for the metrics that favor high values (e.g., language identification confidence), while metrics favoring low values (e.g., perplexity scores and document length) will utilize the upper $Q_3$-th percentile. We investigate five candidate settings for $(Q_1, Q_3)$ and adopt the one that achieved the best data quality for a sample of languages in our examination.
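The percentile-based selection can be illustrated with a short sketch. The interpolation scheme, the function names, and the percentile pair 25/75 (the classic quartiles) are assumptions chosen for illustration and are not taken from the text.

```python
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation, q in [0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def select_threshold(
    values: Sequence[float], favors_high: bool, lower_pct: float, upper_pct: float
) -> float:
    """Pick the lower percentile for metrics that favor high values, else the upper one."""
    return percentile(values, lower_pct if favors_high else upper_pct)


# Toy metric distribution: the integers 0..100.
metric_values = list(range(101))
min_confidence = select_threshold(metric_values, True, 25, 75)   # lower cut-off
max_perplexity = select_threshold(metric_values, False, 25, 75)  # upper cut-off
```

Because only one tail is cut per metric, a metric that favors high values never discards documents for scoring too well, and vice versa.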
It is worth emphasizing that the utilization of percentiles for threshold selection enables our approach to efficiently draw upon more extensive data samples for each language compared to those employed in the BigScience ROOTS project. This results in more reliable thresholds for the full datasets over different languages. Specifically, concerning the large languages where only a 25% data sample is employed to compute the value distribution for a metric, we observe that the proportion of discarded data to the entire dataset closely aligns with that of the data sample when applying the same selected filtering threshold. This underscores the representativeness of the thresholds selected through our methodology. Finally, once the thresholds for the metrics in a given language have been determined, we will eliminate any document that surpasses a metric’s threshold and enters the unfavorable range of the data.
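Once thresholds are fixed, the removal rule amounts to a conjunction over metrics: a document survives only if no metric falls in its unfavorable range. The sketch below shows one way to express this; the `Rule` type, the metric names, and the threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Rule:
    metric: str        # key into the per-document metric dictionary
    threshold: float   # cut-off selected for this metric and language
    favors_high: bool  # True if larger values indicate better documents


def passes(metrics: Dict[str, float], rules: Iterable[Rule]) -> bool:
    """True if no metric lies in the unfavorable range of its rule."""
    for rule in rules:
        value = metrics[rule.metric]
        if rule.favors_high and value < rule.threshold:
            return False  # too low for a metric that favors high values
        if not rule.favors_high and value > rule.threshold:
            return False  # too high for a metric that favors low values
    return True


# Hypothetical rules and documents, for illustration only.
rules = [Rule("lang_conf", 0.8, True), Rule("perplexity", 500.0, False)]
clean_doc = {"lang_conf": 0.9, "perplexity": 300.0}
noisy_doc = {"lang_conf": 0.9, "perplexity": 900.0}
results = [passes(clean_doc, rules), passes(noisy_doc, rules)]
```

Comparing the fraction of documents rejected on a sample with the fraction rejected on the full data, as described above, is a simple check that thresholds estimated from the sample generalize.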
Document Refinement: The previous cleaning steps are done at the dataset level, aiming to remove low-quality documents from the dataset. In this step, we further clean the retained documents to improve the quality. It is important to note that our prior metric-based filtering step plays a vital role in eliminating highly noisy documents, which, in turn, streamlines the process of developing effective document cleaning rules during this step. Notably, since the documents from mC4 and OSCAR are extracted from HTML pages crawled from the Internet, a significant portion of them may carry crawling and extraction errors, including long JavaScript lines and extraneous content. Consequently, filtering out these documents greatly simplifies our task of designing rules to clean the documents within our dataset.
As such, for each document, we eliminate its noisy or irrelevant portions via a series of operations. First, we remove any short lines located at the end of each document, as these lines typically contain footer details or unhelpful information from the websites. Second, we eliminate the lines that contain words from our list of JavaScript (JS) keywords (e.g., “