MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat
Introduction
The availability of large multilingual corpora has accelerated the progress of multilingual natural language processing (NLP) models. However, most publicly available general-domain multilingual corpora contain 100-200 languages, with some datasets containing more languages in specific domains such as religious content, children’s books or dialects.
A common approach to creating such datasets is to mine language-specific data from general web crawls such as CommonCrawl. We simply take this approach and scale it. We train a document-level LangID model on 498 languages to annotate CommonCrawl at the document level, and obtain a 5-trillion-token, document-level monolingual dataset.
However, such web-scale corpora are known to be noisy and contain undesirable content, with their multilingual partitions often having their own specific issues such as unusable text, misaligned and mislabeled/ambiguously labeled data. To mitigate this, we manually audit our data. Based on our findings, we discard 79 of the languages from our preliminary dataset, rename or combine several languages and apply additional preprocessing steps. Finally, to validate the efficacy of our dataset, we train multilingual machine translation models of various sizes up to 10.7B parameters, as well as an 8B decoder-only model, and then evaluate these models on highly multilingual translation evaluation sets.
In Section 2, we describe the creation and composition of MADLAD-400, and discuss the results of the audit. Then, in Section 3, we describe the parallel data we collect using publicly available sources to train the multilingual machine translation models described in Section 4.1. In Section 4, we describe the training process of the multilingual machine translation models and 8B decoder-only model, and then evaluate these models on highly multilingual translation datasets. In Section 5 we describe our tests for memorization in the multilingual models that we release and discuss preliminary results. Finally, we discuss the limitations of this work and directions for future work.
MADLAD-400
The process we follow to create MADLAD-400 is similar to that of other large-scale web corpora. First, we collect as large a dataset of unlabeled web text as possible. More specifically, we use all available snapshots of CommonCrawl (https://commoncrawl.org/) as of August 20, 2022. After some preliminary data cleaning, we use a highly multilingual LangID model to provide document-level annotations (Section 2.2). Finally, we conduct a self-audit (Section 2.4), or quality review, of this preliminary dataset partitioned by language, and design filters to remove noisy content. When appropriate, we correct language names and remove languages from the preliminary dataset. We note that building MADLAD-400 was an iterative process, and that while we describe one major quality review in depth, we conducted several stages of filtering. To reflect this, we describe the preprocessing steps and improvements made in chronological order.
We release two versions of this dataset: a 5 trillion token noisy dataset, which is the dataset obtained before applying document-level LangID and the final filters, and a 3 trillion token clean dataset, which has a variety of filters applied based on our self-audit, though it naturally has a fair amount of noise itself. Each dataset is released in both a document-level form and a sentence-level form. Some overall statistics for these dataset versions are given in Table 2, with a graph visualizing the distribution of sizes (number of tokens) across languages in Figure 1. The final version of MADLAD-400 has 419 languages, with a varied geographic distribution, as seen in Table 2.
We carry out a few preliminary preprocessing steps on the web-crawled corpus: first, we deduplicate lines across documents. Then, we filter out all pages that do not contain at least 3 lines of 200 or more characters (as done by Xue et al.). We also use other commonly used filtering heuristics such as removing lines containing the word “Javascript” and removing pages that contain “lorem ipsum” and curly brackets “{” (as done by Raffel et al.).
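The preliminary steps above can be sketched roughly as follows. This is a minimal illustration rather than the pipeline actually used: the function name `clean_pages`, the case-insensitive matching, the order in which the heuristics run, and the choice to drop a page when it contains either “lorem ipsum” or a curly bracket are all assumptions.

```python
from typing import Iterable, Iterator

MIN_LONG_LINES = 3    # from the text: at least 3 lines ...
MIN_LINE_CHARS = 200  # ... of 200 or more characters


def clean_pages(pages: Iterable[str]) -> Iterator[str]:
    """Yield cleaned pages; `pages` holds one string per web page."""
    seen_lines = set()  # lines already emitted by any earlier page
    for page in pages:
        # 1) Deduplicate lines across documents.
        unique = []
        for line in page.splitlines():
            line = line.strip()
            if line and line not in seen_lines:
                seen_lines.add(line)
                unique.append(line)
        # 2) Require enough long lines on the page.
        long_lines = sum(len(line) >= MIN_LINE_CHARS for line in unique)
        if long_lines < MIN_LONG_LINES:
            continue
        # 3) Drop lines mentioning "Javascript" (case-insensitive: assumption).
        kept = [line for line in unique if "javascript" not in line.lower()]
        text = "\n".join(kept)
        # 4) Drop pages with placeholder text or curly brackets.
        if "lorem ipsum" in text.lower() or "{" in text:
            continue
        yield text
```

Because the set of seen lines grows with the corpus, a web-scale version would need a sharded or approximate structure; the in-memory set here is only for illustration.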
2.2 Language Identification (LangID)
We train a Semi-Supervised LangID model (SSLID) on 500 languages, following the recipe introduced by Caswell et al. We then filter the corpus on document-level LangID, which was taken to be the majority sentence-level LangID prediction. The resulting dataset is MADLAD-400-noisy. Additional details on these LangID models are in Appendix A.1.
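Taking the document-level label as the majority sentence-level prediction can be sketched as below. The `sentence_langid` callable stands in for the SSLID model and is hypothetical; how ties are broken is not stated in the text, so the tie behaviour here (first label seen wins) is an assumption.

```python
from collections import Counter
from typing import Callable, List


def document_langid(sentences: List[str],
                    sentence_langid: Callable[[str], str]) -> str:
    """Return the most frequent sentence-level LangID label in a document."""
    if not sentences:
        raise ValueError("document has no sentences")
    votes = Counter(sentence_langid(s) for s in sentences)
    # most_common keeps first-seen order among equal counts (assumed tie rule).
    return votes.most_common(1)[0][0]
```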
2.3 Filtering Out Questionable Content
To assess the quality of this preliminary dataset, we inspected 20 sentences each from a subset of 30 languages in our dataset. Based on our observations, we introduced a score, pct_questionable. The pct_questionable score is simply the percentage of sentences in the input document that were “questionable”. A sentence was considered questionable if any of the following were true:
Document consistency: Sentence-level LangID does not match the document-level LangID.
List Case: Over 50% of the tokens began with a capital letter (we apply this filter only if the sentence has at least 12 tokens).
Abnormal Lengths: The sentence has under 20 characters or over 500 characters. We note that this is a bad heuristic for ideographic languages (http://www.grcdi.nl/dqglossary/ideographic%20language.html).
Technical Characters: Over 20% of the characters in the sentence match [0-9{}+/()>].
Cursed Regexes: The sentence matched a “cursed regex”. These are a heuristic set of substrings and regexes that we found accounted for a significant amount of questionable content in the data samples we observed. They are described in depth in Appendix A.2.
We removed all documents with a pct_questionable score greater than 20%. Furthermore, we removed any document with under 5 sentences.
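The scoring and thresholding described above can be sketched as follows. The `sentence_langid` callable is a hypothetical stand-in for the LangID model, whitespace tokenization is an assumption, and the single entry in `CURSED_REGEXES` is only a placeholder (the real list is the one referred to in Appendix A.2).

```python
import re
from typing import Callable, List

TECHNICAL_CHARS = re.compile(r"[0-9{}+/()>]")
# Placeholder only: the actual cursed regexes are described in Appendix A.2.
CURSED_REGEXES = [re.compile(r"\|\s*$")]


def is_questionable(sentence: str, doc_lang: str,
                    sentence_langid: Callable[[str], str]) -> bool:
    tokens = sentence.split()  # whitespace tokenization (assumption)
    # Document consistency: sentence label must match the document label.
    if sentence_langid(sentence) != doc_lang:
        return True
    # List Case: applied only to sentences with at least 12 tokens.
    if len(tokens) >= 12:
        capitalized = sum(t[0].isupper() for t in tokens)
        if capitalized > 0.5 * len(tokens):
            return True
    # Abnormal Lengths.
    if len(sentence) < 20 or len(sentence) > 500:
        return True
    # Technical Characters: over 20% of characters match the class.
    if len(TECHNICAL_CHARS.findall(sentence)) > 0.2 * len(sentence):
        return True
    # Cursed Regexes.
    return any(rx.search(sentence) for rx in CURSED_REGEXES)


def pct_questionable(sentences: List[str], doc_lang: str,
                     sentence_langid: Callable[[str], str]) -> float:
    if not sentences:
        return 100.0
    flagged = sum(is_questionable(s, doc_lang, sentence_langid)
                  for s in sentences)
    return 100.0 * flagged / len(sentences)


def keep_document(sentences: List[str], doc_lang: str,
                  sentence_langid: Callable[[str], str]) -> bool:
    # Drop documents with under 5 sentences or a score above 20%.
    return (len(sentences) >= 5
            and pct_questionable(sentences, doc_lang, sentence_langid) <= 20.0)
```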
2.4 Self-Audit (Quality Review)
After filtering out generally lower-quality content with the approach described above, we performed a self-audit of every corpus in this dataset, following Kreutzer et al. The aim of our self-audit was to correct any remaining systematic issues by either applying additional filters, renaming/merging language codes, or completely removing the language from the dataset. Although we do not speak most of the 498 languages, we were able to give high-level comments on the general quality. For each language, we inspected a sample of 20 documents. This task was evenly divided between the first two authors based in part on which scripts they could read. We used the following guidelines:
If the dataset is mostly plausibly in-language text, we can keep it. For unknown languages, search the web for a few sentences and look at the website and URL for language clues.
If dataset is noisy but the noise looks filterable, leave a note of how to filter it.
If the dataset is very noisy and does not look possible to filter, mark it for removal.
Optionally, leave a note that may be helpful for downstream users, e.g. if the dataset is 100% Bible.
We made the decision to include languages that looked noisy, but omit any language that was majority noise, or only had 20 or fewer docs. While this is not a high quality bar, we hope it still has the potential to be useful to the research community, given that foundation models have demonstrated the potential to learn distributions for very few examples. The motivation for not releasing “nonsense” or tiny datasets is to avoid giving a false sense of how multilingual the dataset is (“Representation washing”), as recommended by Quality at a Glance.
Of the 498 languages that we obtained LangID annotations for, we decided to omit 79 languages, bringing the final number of languages in MADLAD-400 to 419. Based on the self-audit, we also expanded the filters (particularly the cursed regexes), and made changes as described in Sections 2.5 and 2.6. We detail statistics for these languages in Appendix A.4.
For transparency, we provide full results of the self-audit in Appendix A.4. In Table 3, we provide an overview of the issues surfaced through this self-audit. We find that a significant fraction of languages contain mostly or entirely religious documents, while other issues include misrendered text, pornographic content, and boilerplate.
2.5 Additional Filters
Based on the results of the self-audit, we apply three additional filters.
Many languages using Brahmic Abugida (South and Southeast Asian scripts like Devanagari, Khmer, etc.) use some variant on the virama character (https://en.wikipedia.org/wiki/Virama). We found that such languages in MADLAD-400-noisy had incorrectly encoded viramas: for example, was rendered as , where the middle character is a detached virama. Therefore, for the languages bn, my, pa, gu, or, ta, te, kn, ml, si, th, tl, mn, lo, bo, km, hi, mr, ne, gom, as, jv, dv, bho, dz, hne, ks_Deva, mag, mni, shn, yue, zh, ja, kjg, mnw, ksw, rki, mtr, mwr and xnr, we applied a special filtering/correction step: we removed all extraneous spaces before virama characters. We provide the pseudocode and list of virama characters in Appendix A.2.
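A minimal sketch of the virama correction, assuming it amounts to deleting whitespace that immediately precedes a virama. The code points below are an illustrative subset of Unicode virama-like signs chosen for this example; they are not the list from Appendix A.2, and the function name `fix_viramas` is hypothetical.

```python
import re

# Illustrative subset of virama-like code points (not the paper's list).
VIRAMAS = (
    "\u094d"  # Devanagari sign virama
    "\u09cd"  # Bengali sign virama
    "\u0a4d"  # Gurmukhi sign virama
    "\u0acd"  # Gujarati sign virama
    "\u0b4d"  # Oriya sign virama
    "\u0bcd"  # Tamil sign virama
    "\u0c4d"  # Telugu sign virama
    "\u0ccd"  # Kannada sign virama
    "\u0d4d"  # Malayalam sign virama
    "\u0dca"  # Sinhala sign al-lakuna
    "\u1039"  # Myanmar sign virama
    "\u17d2"  # Khmer sign coeng
)

# One or more spaces, tabs or no-break spaces directly before a virama.
_SPACE_BEFORE_VIRAMA = re.compile(r"[ \t\u00a0]+(?=[" + VIRAMAS + "])")


def fix_viramas(text: str) -> str:
    """Remove extraneous spaces that detach a virama from its consonant."""
    return _SPACE_BEFORE_VIRAMA.sub("", text)
```

For example, `fix_viramas("\u0915 \u094d\u0937")` rejoins the Devanagari consonant, virama and following consonant into a single conjunct sequence.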
We found that languages using Myanmar script like my and mnw appeared to have the same issues with virama characters that still remained after applying the virama correction. This was because a large fraction of Myanmar script data on the internet is Zawgyi encoded data, which appears to have the rendering issues described above if rendered in Unicode. Therefore, we used an open-source Zawgyi detector (https://github.com/google/myanmar-tools) to convert the encoding of documents with more than a 50% probability of being Zawgyi encoded into standard Unicode encoding.
The Mandarin (zh) data in CommonCrawl had a particular issue with pornographic content. We combed through the data and developed a list of strings likely to be present in pornographic content, and filtered out all documents containing the strings in the blocklist. This resulted in a 17% reduction in the number of documents and a 56% reduction in file size. We list these strings in Appendix A.2.
2.6 Correcting Other Systematic Issues
Based on various specific notes from the self-audit, we made a variety of changes. Five datasets were found to be in the wrong language, and were renamed or merged into the correct dataset. Six languages that looked suspicious were reviewed by native speakers of those or related languages; some of these were discarded, and some were merged into the correct dataset. Finally, we removed all languages with fewer than 20 documents. Details can be seen in Appendix A.3.
Parallel Data
To train the machine translation (MT) models described in Section 4.1, we also collect a dataset composed of publicly available datasets coming from various data sources. A full list of the data sources and associated language pairs is in Appendix A.5. The final dataset has 156 languages across 4.1B sentence pairs and 4124 language pairs total. In the rest of the paper, we refer to the input sentence to an MT model as the “source side” and the reference/output sentence as the “target side”.
We describe the data preprocessing steps taken below. We find that a significant amount of data is filtered out, with the amount of data available for 396 of 4.1k language pairs reducing by more than .
We deduplicate sentence pairs that are an exact match on both the source and target.
We observed the same issues described in Section 2.5, and used the same filters for sentence pairs where either the source language or target language belonged to the list of languages in Section 2.5.
We tested the unmatched toxicity filters described by NLLBTeam et al., but found them ultimately unusable for our purposes in most cases. For the languages ace, am, ar, az, bg, bm, bn, bs, cs, din, en, es, fa, fr, ga, gl, ha, hi, id, it, kk, ko, ml, ms, my, nl, no, nus, prs, ru, scn, sd, so, sv, tg, th, tt, ur, uz and zh, more than 3% of documents were marked as having unmatched toxicity. On closer inspection, we found that while zh and ko had a lot of pornographic content that was removed by the filtering process, for most other languages the filter removed sentences that had homonyms of non-toxic words. Similarly, languages like id, ur, tg, fa and no had data from Tanzil (Qur’an dataset), but the toxicity word lists contained words such as kafir, mercy and purity, that are not normally considered toxic content for our purpose of filtering the dataset using wordlists.
We removed all sentences that have more than 75% overlap between the source and target side. To avoid filtering out valid entity translations, we only applied this filter on sentences longer than 5 tokens. In addition, we remove sentence pairs whose source length to target length ratio falls outside of . We omitted this filter for the following, which are mainly non-whitespace languages: zh, ja, ko, km, my, lo, th, wuu, shn, zh_tw, zh_cn, iu, simple, dz, kr_Arab, din, nus and mi.
We removed all sentences that are less than 50% in-script for both the source and target language. For instance, if the sentence was supposed to be in kaa (Cyrillic script) but was 70% in the Latin script, we removed it.
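Three of the steps listed above (exact-pair deduplication, the source/target overlap filter, and the in-script filter) can be sketched as follows. Several details are assumptions made for illustration: overlap is measured as the fraction of whitespace-separated source tokens that also occur in the target, a script is detected from the prefix of each letter's Unicode character name, and a pair is dropped if either side is under 50% in-script. The length-ratio filter is omitted because its bounds are not given in the text above.

```python
import unicodedata
from typing import Iterable, Iterator, Tuple


def in_script_fraction(text: str, script: str) -> float:
    """Fraction of letters whose Unicode name starts with `script`,
    e.g. "LATIN" or "CYRILLIC" (name-prefix heuristic: assumption)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(c, "").startswith(script) for c in letters)
    return hits / len(letters)


def token_overlap(source: str, target: str) -> float:
    """Fraction of source tokens that also appear in the target."""
    src_tokens = source.split()
    tgt_tokens = set(target.split())
    if not src_tokens:
        return 0.0
    return sum(t in tgt_tokens for t in src_tokens) / len(src_tokens)


def filter_pairs(pairs: Iterable[Tuple[str, str]],
                 src_script: str, tgt_script: str) -> Iterator[Tuple[str, str]]:
    seen = set()
    for source, target in pairs:
        # Exact-match deduplication on both source and target.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        # Overlap filter, applied only to sentences longer than 5 tokens.
        if len(source.split()) > 5 and token_overlap(source, target) > 0.75:
            continue
        # In-script filter (dropping on either side is an assumption).
        if (in_script_fraction(source, src_script) < 0.5
                or in_script_fraction(target, tgt_script) < 0.5):
            continue
        yield source, target
```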
3.2 Self-Audit (Quality Review)
Similar to the self-audit done for MADLAD-400, we conducted a review of the data sources that compose the parallel data we collected to verify the quality of this data. We collected 20 source-target pairs from each language, and assessed the data for the presence of offensive content, porn, and whether the data seemed to be of the correct language pair and whether the target sentence seemed to be a plausible translation. Since we did not have access to native speakers of all 157 languages, the latter was primarily based on guesses. In Appendix A.5 we provide full details of the instructions we provided to auditors, the results of the self-audit and any changes made to the dataset.
3.3 A Note on Language Codes
As observed by Kreutzer et al., the datasets used to create the parallel data (and MADLAD-400) use a variety of different language codes. We use the BCP-47 standard, which specifies the 2-letter ISO 639-1 code when applicable, and otherwise the ISO 639-3 code. Script tags and region tags are omitted when they are defined as the default value by CLDR (https://cldr.unicode.org/), and otherwise included. For example, ks refers to Kashmiri in Nastaliq/Arabic script (CLDR default), whereas ks_Deva refers to Kashmiri in Devanagari. A detailed investigation of codes in MADLAD-400 can be found in Appendix A.3.
3.4 Multiway Data
We create additional multiway data by applying the n-gram matching method () from Freitag and Firat to the processed dataset. Using this, and the publicly available data, we obtain 11.9B sentences across a total of 20742 language pairs. Full details may be found in Appendix A.7.
Experiments
We validate our data by training encoder-decoder machine translation models in Section 4.1 and decoder-only language models in Section 4.2, and test them on several translation benchmarks.
We train models of various sizes: a 3B-parameter, 32-layer model, a 7.2B-parameter, 48-layer model and a 10.7B-parameter, 32-layer model. Here and elsewhere, ‘X-layer’ means X encoder layers and also X decoder layers, for a total of 2X layers. We share all parameters of the model across language pairs, and use a SentencePiece model with 256k tokens shared on both the encoder and decoder side. Each input sentence has a <2xx> token prepended to the source sentence to indicate the target language.
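As a small illustration of the target-language tag, the input string could be built as below. The helper name `add_target_tag` and the single space between the tag and the sentence are assumptions; the text only states that a <2xx> token is prepended.

```python
def add_target_tag(source: str, target_lang: str) -> str:
    """Prepend a <2xx> token naming the target language code."""
    return f"<2{target_lang}> {source}"
```

With this helper, a request to translate into Telugu would be encoded as `add_target_tag("How are you?", "te")`, giving `<2te> How are you?`.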
We use both supervised parallel data with a machine translation objective and the monolingual MADLAD-400 dataset with a MASS-style objective to train this model. Each of these objectives is sampled with a 50% probability. Within each task, we use the recently introduced UniMax sampling strategy to sample languages from our imbalanced dataset with a threshold of epochs for any particular language. We also explored back-translation by randomly sampling 2M monolingual samples (or the total number of samples for that given language) for each language and translating them to/from English using the 3B model. Following Bapna et al. (§3.5), we filter the back-translated data in a variety of ways. For a natural target and a back-translated source, we filter by round-trip ChrF to discourage hallucinations (threshold of 0.32), by ChrF between source and target to discourage copying (threshold of 0.30), by the length ratio of source to target (asymmetric bounds of (0.45, 1.6)), and by LangID prediction of the source. We then finetune the 7.2B model for a steps by randomly mixing the original and the back-translated data with a combining ratio of 1:1. We list specific architecture and training details of these models in Appendix A.8.
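The four back-translation filters can be sketched as a single predicate over precomputed scores. This is an interpretation, not the authors' code: it assumes ChrF is on a 0 to 1 scale, that pairs are kept when round-trip ChrF is at least 0.32 and source/target ChrF is at most 0.30, and that the length ratio is measured in characters. The `BackTranslatedPair` container and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BackTranslatedPair:
    source: str              # back-translated (synthetic) source
    target: str              # natural target
    round_trip_chrf: float   # ChrF of target vs. its round-trip translation
    copy_chrf: float         # ChrF between source and target
    source_langid: str       # LangID prediction for the source
    expected_source_lang: str


def keep_pair(pair: BackTranslatedPair,
              min_round_trip: float = 0.32,
              max_copy: float = 0.30,
              ratio_bounds: Tuple[float, float] = (0.45, 1.6)) -> bool:
    # Low round-trip ChrF suggests a hallucinated translation.
    if pair.round_trip_chrf < min_round_trip:
        return False
    # High source/target ChrF suggests the source was merely copied.
    if pair.copy_chrf > max_copy:
        return False
    # Asymmetric bounds on the source-to-target length ratio.
    if not pair.target:
        return False
    ratio = len(pair.source) / len(pair.target)
    if not (ratio_bounds[0] <= ratio <= ratio_bounds[1]):
        return False
    # The source must be identified as the intended language.
    return pair.source_langid == pair.expected_source_lang
```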
4.2 Zero-shot Translation with Language Models
Given recent interest in the efficacy of unsupervised translation using large language models, we explore training language models solely on the monolingual data. We follow the same training schedule and model configurations from Garcia et al. In particular, we consider 8B decoder-only models, following the same model hyperparameters as previous work. We train these models using a variant of the UL2 objective adapted for decoder-only models, and use the same configuration as previous work. We provide additional details in Appendix A.8.
4.3 Evaluation
We use the sacreBLEU implementation of BLEU (sacreBLEU signature: BLEU+case.mixed+lang.).
We use the 15 WMT languages frequently used to evaluate multilingual machine translation models by Siddhant et al., Kim et al., Kudugunta et al., NLLBTeam et al.: cs, de, es, fi, fr, gu, hi, kk, lv, lt, ro, ru, et, tr and zh.
We evaluate on the languages in the Flores-200 dataset that overlap with the languages available in either MADLAD-400 or the parallel data described in Section 3. We list these languages in Appendix A.9. For non-English-centric pairs, we evaluate on a 272 language pair subset of the 40k language pairs possible due to computational constraints. We evaluate on all language pairs possible using the following languages as either source or target language: en, fr, cs, zh, et, mr, eu, cy, so, ckb, or, yo, ny, ti, ln, fon and ss. We obtained this set of languages by selecting every language by number of tokens in MADLAD-400 (clean), starting with French (fr). Noticing that this had no Indian languages, we shifted af and fo (both close dialects of HRLS) down one index to mr and or, respectively. Finally, we noticed that this initial list had supervised and unsupervised languages, but didn’t have a good representative of a “slightly supervised language”, that is, one with a small but extant amount of parallel data. Therefore, we added yo to the list, which has the least parallel data of any supervised language. This resulting subset of languages also contains a nice variety of scripts: Latin, Chinese, Devanagari, Arabic, Odia, and Ethiopic scripts.
We evaluate on the languages in the recently introduced NTREX dataset .
Finally, we evaluate on the languages in Gatones, the in-house, 38-language eval set used in and the Gatitos paper . Again, we take the subset of languages overlapping with the languages available in either MADLAD-400 or the parallel training data.
4.3.1 Few-shot evaluation for language modeling
We perform few-shot prompting to evaluate the language model with the following prompt:
[sl]: X_1\n[tl]: Y_1\n\n[sl]: X_2\n[tl]: Y_2\n\n…[sl]: X\n[tl]:
where [sl] and [tl] denote the source and target language names, expressed in English (for example, when translating a sentence from en to te, we use [sl]=English and [tl]=Telugu). X_i and Y_i are demonstration examples used for prompting, and X is the test input.
For each test example, we randomly sample demonstration examples, which is simple yet performs competitively with more complicated strategies . In particular, we randomly select examples from the dev split of each dataset. Since NTREX does not have a dev split, we randomly sample 1000 examples as the dev set and use the rest for test evaluation.
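Prompt construction with randomly sampled demonstrations can be sketched as below. The function names, the single space after each colon, and the use of a seeded `random.Random` are assumptions for illustration; the number of demonstrations is left as a parameter.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_demos(dev_pairs: Sequence[Tuple[str, str]], k: int,
                 seed: Optional[int] = None) -> List[Tuple[str, str]]:
    """Randomly pick k (source, target) demonstrations from the dev split."""
    rng = random.Random(seed)
    return rng.sample(list(dev_pairs), k)


def build_prompt(sl: str, tl: str,
                 demos: Sequence[Tuple[str, str]], test_input: str) -> str:
    """Format demonstrations and the test input; sl and tl are the
    English names of the source and target languages."""
    blocks = [f"{sl}: {x}\n{tl}: {y}" for x, y in demos]
    # The final block leaves the target side empty for the model to fill.
    blocks.append(f"{sl}: {test_input}\n{tl}:")
    return "\n\n".join(blocks)
```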
4.4 Results
In Tables 4 and 6 we present evaluation scores on the WMT datasets and NTREX datasets, which are evaluation sets in the news domain. We find that both the 7.2B parameter model and the 10.7B parameter model are competitive with the significantly larger NLLB-54B model on WMT. For the recent NTREX dataset, the only published results are small-scale results by Baziotis et al.
In Table 5 we find that on Flores-200, our model is within 3.8 chrf of the 54B parameter NLLB model, while on xxyy pairs the 10.7B model is behind by 6.5 chrf. This is likely due to a combination of factors, including using a significantly smaller model (5x smaller), domain differences, and back-translated data. Similarly, in Table 7, we find that the 10.7B parameter model is within 5.7 chrf of the scores reported by Bapna et al. Again, it is very difficult to compare their results to ours; their two largest advantages are 1) iterative back-translation, and 2) access to much larger in-house text data. In Table 8, we display the results for when we finetune the 7.2B parameter model on backtranslated data. While this setup is very likely sub-optimal, we see that back-translation greatly improves en2xx translation (by 3.0 chrf, in the case of Flores-200) in most cases. We note that the results we present are merely baselines to demonstrate the utility of MADLAD-400, and hope that future work builds upon these experiments by applying improved modeling techniques.
Finally, across all evaluation datasets, we find that while results on few-shot translation using the 8B language model increase with an increasing number of demonstrations, these results are still significantly weaker than the results of models trained on supervised data. We present per-language pair results on all datasets in Appendix A.10.
Training Data Extraction and Memorization
Generative models have been shown to regurgitate training data that may plagiarize, violate copyright assumptions, or infringe privacy. It can be difficult to assess and prevent these cases because such information may be paraphrased in ways that are difficult for automated systems to detect . Instead, existing literature measures memorization in generative models to estimate the propensity for disallowed outputs. Typically, this means prompting a language model with some prefix of length and comparing generated outputs of length with the training data to see if they are ‘novel’ or if the generation is simply a regurgitation of its training data . In the multilingual setting this may present new risks because tail languages may be more vulnerable to memorization .
While memorization has been well-studied for language models, assessing the extent of memorization is difficult within translation settings. This is primarily because translation has a significantly smaller space of valid outputs, as opposed to many possible continuations for language modeling. This presents some difficulty in extending common memorization tests for language generation to translation. As an illustrative example, consider the case of translating to the same target language as the source ("translate_copy"). Performing a standard training data extraction attack would test if the generation matches the continuation. However, success would not indicate training data extraction as the adversary would have already had access to it (though membership inference may be possible). Thus, we modify the standard framework for testing memorization to better identify additional leaked data.
We define memorization in translate_copy to be when the model outputs any generation with length that matches the continuation; then, captures the additional bits. In cases where the source and target language are different ("translate_diff"), performing a similar test would require knowledge of which part of the continuation exactly corresponded to the prompt. Given that such an alignment is not easily obtained, we instead use the relative token lengths between the continuation and the prompt to choose an appropriate size of . For example, if at training time the continuation for the target language was larger, we set where captures the additional bits. For each of translate_copy and translate_diff, we sample sequences for each language and choose . We then perform both a verbatim match of the generation with the continuation and an approximate match requiring Levenshtein similarity similar to .
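The two matching criteria (verbatim and approximate) can be sketched generically as follows. The comparison length and the similarity threshold are not recoverable from the text above, so both are left as required parameters; normalizing edit distance by the longer sequence is also an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: Sequence, b: Sequence) -> float:
    """1 minus edit distance normalized by the longer sequence."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def verbatim_match(generation: Sequence, continuation: Sequence,
                   length: int) -> bool:
    """True if the first `length` tokens of both sequences are identical."""
    return (len(continuation) >= length
            and list(generation[:length]) == list(continuation[:length]))


def approximate_match(generation: Sequence, continuation: Sequence,
                      length: int, min_similarity: float) -> bool:
    """True if the first `length` tokens are similar enough."""
    return similarity(generation[:length],
                      continuation[:length]) >= min_similarity
```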
We show the per-language and average training data extraction rates, for both the translate_copy and translate_diff settings in Figure 2, with set to test for tokens of additional information leakage. We find that translation models can memorize and regurgitate their training data, even beyond what is contained in the prompt. We also observe that some lower resource languages may exhibit higher memorization rates; however, we observe no strong correlation between the resource level and the level of memorization. In the translate_diff tests, we observe much lower memorization; we hypothesize this may be due to the higher difficulty of the task. Even though many languages have nontrivial memorization, we found that many languages exhibited no memorization across the samples tested (257/370 for translate_copy and 130/146 for translate_diff). We also present results for approximate memorization in Appendix A.12, which show that translation models may also paraphrase memorizations, leading to even higher memorization rates.
Our preliminary experiments show that memorization can exist in the translation setting. However, capturing when memorization is intended or beneficial versus undesired is still an open question. To aid future research in this direction, we design and include “canaries”: carefully crafted data designed to be outliers to the natural training distribution that can be used to analyze memorization. Canaries enable studying memorization in the multilingual and machine translation settings by measuring the capability to extract canaries added to the training set. As with Anil et al., our canaries are designed to share characteristics with the natural training data so as to better ground memorization evaluation in practical risks. The canaries are also designed to be outliers to assess varying degrees of risk. To ensure similarity with natural data, canaries are generated by sampling and then randomly modifying real data in a manner similar to , where each source of randomness defines the canary type. In total, we generate canaries across both the monolingual MADLAD-400 dataset and the parallel data ( of the training data). The methodology for each canary type and the exact distribution of canaries are detailed in Appendix A.11.
Related Work
Extensive work has been done to mine general purpose datasets for multilingual machine translation and language modeling. Xue et al. introduce mC4, a general web domain corpus on 101 languages to train mT5, a pretrained language model for downstream NLP tasks. Similarly, Conneau et al. introduce CC-100, later extended to CC100-XL by Lin et al. The OSCAR corpus is also a mined dataset that supports 166 languages and the ROOTS corpus is a compiled dataset that contains 46 natural languages. Glot500-C covers 511 languages; however, it is not clear how many of these languages consist solely of religious texts. Bapna et al. create an internal dataset on 1500+ languages, while NLLBTeam et al. mine a dataset from CommonCrawl and ParaCrawl. Recently, Leong et al. created a 350+ language dataset from children’s books.
In addition, there have been efforts to get better represented corpora and models for languages often underrepresented in general multilingual corpora: Serengeti introduces a dataset and associated model trained on 517 African languages and language varieties, while IndicTrans2 introduces a machine translation model for the 22 scheduled languages in India.
Limitations
While we used thorough self-audits to guide the creation of MADLAD-400, we note that most audits were conducted by non-speakers of the languages in MADLAD-400; as a result, many types of noise, like machine-generated or disfluent content, could not be detected. Moreover, toxicity detectors, classifiers and filters that work reliably for all the 419 languages in MADLAD-400 do not exist, limiting the extent to which we can clean and document the dataset. It is possible that issues still remain, so we encourage users to report issues that will be listed on the project Github page (https://github.com/google-research/google-research/tree/master/madlad_400). This paucity extends to the availability of multilingual evaluation sets for these languages: we could only evaluate our models on 204 of the languages in MADLAD-400. Additionally, even though decoder-only models are often evaluated on NLP tasks that are not necessarily machine translation, we did not conduct such evaluations, since most available benchmarks cover only 30-50 languages, of which most are not tail languages (which form the focus of MADLAD-400). We instead leave this to future work. Finally, during our self-audit we noted the skew of data on the long tail towards specific domains such as religious texts. We hope that these limitations motivate the creation of more language-specific corpora not captured by web crawls, and the development of language-specific data cleaning tools and practices.
Conclusion
Through MADLAD-400, we introduce a highly multilingual, general web-domain, document-level text dataset. We perform a self-audit of this dataset for quality on samples of all 498 languages, develop filters, and remove spurious datasets, for a total of 419 languages in the release. We carefully describe the dataset creation process, laying out the iterations of audits and improvements upon the preliminary dataset along with observations that guided our decisions. We hope that this encourages creators of large-scale pretraining datasets both to put in their due diligence for manually inspecting and dealing with data, and also to describe and publicize the process in a level of detail that is reproducible and insightful for downstream users. This increased visibility into the dataset creation cycle can in turn improve model development and enable responsible data use . Using MADLAD-400, we train and release large machine translation and general NLP models and evaluate them thoroughly. We hope that this further motivates work towards language technologies that are more inclusive of the rich language diversity housed by humanity.
Ethics Statement
Innovation in NLP technologies in English has been accelerated by training large scale deep learning models on massive web corpora. However, on the long tail of written languages in the world there is a lack of high quality general data sources, which impedes the progress of NLP tools for other languages. We hope that making an audited and cleaned corpus such as MADLAD-400 available mitigates this issue. While we extensively cleaned MADLAD-400, the extent to which we can preprocess this data is limited by the fact that not all languages have available tools for removing problematic content such as porn, toxic content, PII, copyrighted content or noise. We urge practitioners to carefully consider their target usecase before using MADLAD-400.
Acknowledgements
We would like to thank Wolfgang Macherey, Zoubin Ghahramani and Orevaoghene Ahia for their helpful comments on the draft. We would also like to thank Subramanian Venkateswaran for debugging the virama rendering issues, and Ali Dabirmoghaddam for his insight on data samples of various languages in MADLAD-400.
References
Appendix A
Following Language ID in the Wild, we trained a Transformer-Base Semi-Supervised LangID model (SSLID) on 498 languages. The training data is as described in Language ID in the Wild, with two differences: 1) the training data is sampled with a temperature of T=3 to reduce over-triggering on low-resource languages; and 2) the data is supplemented with web-crawled data from the same paper (which has already been through the various filters described therein). The purpose of adding this data is to increase robustness to web-domain text, and possibly to distill some of the filters used to create the web crawl. The languages chosen for this model were roughly the top 498 by number of sentences in the dataset reported by Language ID in the Wild. The complete list may be seen in Table LABEL:tab:madlad-full.
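As a rough illustration of what sampling with a temperature of T=3 means, the sketch below uses the common formulation in which each language is drawn with probability proportional to its share of the data raised to the power 1/T. This formulation, the function name and the example counts are assumptions made for illustration; the exact procedure is the one described in Language ID in the Wild.

```python
def temperature_sampling_probs(counts, temperature=3.0):
    """Return per-language probabilities p_i proportional to (n_i / N) ** (1 / T)."""
    total = sum(counts.values())
    scaled = {
        lang: (n / total) ** (1.0 / temperature) for lang, n in counts.items()
    }
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Hypothetical sentence counts, for illustration only.
counts = {"en": 1_000_000, "sw": 10_000, "kl": 100}
probs = temperature_sampling_probs(counts, temperature=3.0)
```

With T=1 this reduces to sampling in proportion to the raw counts; larger values of T flatten the distribution across languages.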
A.2 Filtering Details
Following is the list of cursed substrings that we used to filter the monolingual data. Here are a few general notes about these strings:
low quality sentences ending in the pipe character were very common. (Note: this was not Devanagari-script text using a Danda.)
The last few regexes are meant to match A N T S P E A K, List Case, and weirdly regular text (for instance, lists of shipping labels or country codes)
Here is the complete list of cursed substrings and cursed regexes, along with the function used for filtering:
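As an illustration of the mechanism only, and not the actual list or function used for MADLAD-400, a filter of this kind could be sketched as follows. The substrings and regexes below are hypothetical stand-ins chosen to mirror the notes above (sentences ending in a pipe character, A N T S P E A K); the function name is likewise an assumption.

```python
import re

# Hypothetical stand-ins: NOT the strings or regexes used for MADLAD-400.
CURSED_SUBSTRINGS = ["lorem ipsum", "click here to login"]
CURSED_REGEXES = [
    # Sentence ending in a pipe character (optionally followed by whitespace).
    re.compile(r"\|\s*$"),
    # A N T S P E A K: a whole sentence of single characters separated by spaces.
    re.compile(r"^(?:\S ){4,}\S$"),
]


def is_cursed(sentence: str) -> bool:
    """Return True if the sentence matches any cursed substring or regex."""
    lowered = sentence.lower()
    if any(sub in lowered for sub in CURSED_SUBSTRINGS):
        return True
    return any(rx.search(sentence) for rx in CURSED_REGEXES)


def filter_sentences(sentences):
    """Keep only the sentences that are not flagged as cursed."""
    return [s for s in sentences if not is_cursed(s)]
```

Substring checks are cheap and catch fixed boilerplate, while the regexes target structural patterns that cannot be written down as a single literal string.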
A.3 Other issues fixed after the self-audit
For a few languages, we had strong suspicions that the text was noisy or spurious, but were unable to ascertain the quality of the data. In these cases we asked a native speaker to audit the data. Based on their recommendations, we did the following:
zh, zh_Latn: This resulted in the special filters described below.
en_Arab, tly_IR: This data was found to be boilerplate, so we removed it.
For several languages, we found (mostly by checking URLs) that the corpora were in languages different from the LangID predictions. This led to the following changes:
dty renamed to zxx-xx-dtynoise, aka a “language” of noise. This is mainly mis-rendered PDFs and may have practical applications for denoising, or for decoding such garbled PDFs.
ss-SZ renamed to ss – this was a result of inconsistent data labels.
A.4 Monolingual Data Details
Notes from rounds 2 and 3 of the self-audit can be seen in Table LABEL:tab:audit-full. Some of these notes may refer to previous, less filtered versions of the data, especially those marked with an “r1” (meaning “round 1”). Some of them, however, do contain useful information about quirks or problems with the current dataset. The overall statistics of MADLAD-400 are in Table LABEL:tab:madlad-full.
A.5 Parallel Data Details
To create the dataset described in Section 3, we use the data sources described in Table 11. After preprocessing, we obtain a dataset with a total of 157 different languages and 4.1B sentence pairs that vary from en-es with 280.3M sentence pairs to zu-en with 7959 sentence pairs. The list of language pairs along with the associated data count is available along with the model checkpoints.
A.6 Language Codes
The specifics of the language code changes described in Section 4.1 that we made are as follows:
We use ak for Twi/Akan, rather than tw. This includes Fante.
Unfortunately, we use the macro code chm for Meadow Mari (instead of the correct mhr), and mrj for Hill Mari
By convention, we use no for Norwegian Bokmål, whereas some resources use nb
By convention we use ps for Pashto instead of pbt (Southern Pashto)
By convention, we use ms for Standard Malay, not zlm
By convention, we use sq for Albanian, and don’t distinguish dialects like Gheg (aln) and Tosk (als)
We use ber as the code for Tamazight, after consultation with Tamazight speakers opining that the dialect distinctions are not significant. Other resources use the individual codes like tzm and kab.
We use the macrocode qu for Quechua. In practice, this seems usually to be a mix of the Ayacucho and Cusco dialects. Other resources, like NLLB, may use the dialect code, e.g. quy for Ayacucho Chanka. The same is true for a few other macro codes, like ff (Macro code for Fulfulde, whereas other sources may use e.g. fuv.)
Really, there are notes that can be made about almost any code, from well-accepted conventions like zh for Mandarin to many dialectal notes, such as which variant of Hmong the hmn data really is. The notes above are made specifically for codes where we are aware of other data sources that use different conventions.
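For readers combining MADLAD-400 with other resources, the conventions listed above can be collected into a small lookup table. The sketch below only restates the codes mentioned in these notes; the names `ALTERNATE_TO_MADLAD` and `to_madlad_code` are hypothetical, and the mapping is not an exhaustive list of differences.

```python
# Codes used by some other resources -> code used in MADLAD-400,
# taken only from the notes above.
ALTERNATE_TO_MADLAD = {
    "tw": "ak",    # Twi/Akan (includes Fante)
    "mhr": "chm",  # Meadow Mari (chm is the macro code)
    "nb": "no",    # Norwegian Bokmål
    "pbt": "ps",   # Southern Pashto
    "zlm": "ms",   # Standard Malay
    "aln": "sq",   # Gheg Albanian
    "als": "sq",   # Tosk Albanian
    "tzm": "ber",  # Tamazight
    "kab": "ber",  # Tamazight
    "quy": "qu",   # Ayacucho Quechua
    "fuv": "ff",   # Fulfulde
}


def to_madlad_code(code: str) -> str:
    """Map an alternate code to the MADLAD-400 convention; pass others through."""
    return ALTERNATE_TO_MADLAD.get(code, code)
```

The mapping is many-to-one (for example, both aln and als map to sq), so it cannot be inverted to recover dialect-level codes.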
A.7 Multiway Data Details
On creating the multiway data described in Section 3.4, we obtain a dataset with 11.9B sentence pairs across 19.7k language pairs. In Table 12, we list the combined number of sentence pairs for each target language.
A.8 Model Training Details
We train models of various sizes: a 3B-parameter, 32-layer model, a 7.2B-parameter, 48-layer model and a 10.7B-parameter, 32-layer model. (Here and elsewhere, ‘X-layer’ means X encoder layers and also X decoder layers, for a total of 2X layers.) We describe the specifics of the model architecture in Table 13.
We share all parameters of the model across language pairs, and use a Sentence Piece Model (SPM) with 256k tokens shared on both the encoder and decoder side. We train the SPM model on up to 1M sentence samples of the sentences in the sentence-level version of MADLAD-400, supplemented by data from the languages in the parallel data used to train MT models when not available in MADLAD-400 with a temperature of and a character coverage of .
Each input sentence has a <2xx> token prepended to the source sentence to indicate the target language . We use both supervised parallel data with a machine translation objective and the monolingual MADLAD-400 dataset with a MASS-style objective to train this model. Each of these objectives is sampled with a 50% probability. Within each task, we use the recently introduced UniMax sampling strategy to sample languages from our imbalanced dataset with a threshold of epochs for any particular language.
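The input formatting and objective mixing described above can be sketched as follows. The function names, the space between the tag and the sentence, and the objective labels are assumptions made for illustration; the UniMax language sampling within each task is not shown.

```python
import random


def add_target_tag(source: str, target_lang: str) -> str:
    """Prepend a <2xx> token naming the target language to the source sentence."""
    # Separating the tag from the sentence with a space is an assumption.
    return f"<2{target_lang}> {source}"


def sample_objective(rng: random.Random) -> str:
    """Pick the translation or the MASS-style objective with 50% probability each."""
    return "translation" if rng.random() < 0.5 else "mass"


rng = random.Random(0)
tagged = add_target_tag("Hello, world.", "fr")
objective = sample_objective(rng)
```

Because the target language is carried by a token in the input, a single set of shared parameters can serve every language pair.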
We used a square root learning rate decay schedule over the total number of training steps, starting at 0.01 and ending at X, as well as the AdaFactor optimizer with factorized=False and 10k warmup steps. We note that for the 10.7B model we use a dropout probability of instead of in order to mitigate overfitting to low-resource languages.
We follow the same training schedule and model configurations from Garcia et al. . In particular, we consider 8B decoder-only models, following the same model hyperparameters as previous work . We train these models using a variant of the UL2 objective adapted for decoder-only models, and use the same configuration as previous work . We point the reader to these papers for a detailed overview of the training process, and include basic architectural details in Table 13. We use the same SPM trained for the MT models.
We use a square root learning rate decay schedule over 500k steps, starting at 0.01 and ending at 0.001, as well as the AdaFactor optimizer with factorized=False with 10k warmup steps. We describe the evaluation setup in Section 4.3.1.
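One schedule consistent with a starting value of 0.01 and 10k warmup steps is the inverse-square-root form 1/sqrt(max(step, warmup_steps)). The sketch below assumes that form; it is not confirmed as the exact schedule used here. Under this assumption the rate is about 0.0014 at 500k steps and reaches 0.001 at 1M steps, so the authors' schedule may differ in its details.

```python
import math


def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Assumed schedule: constant during warmup, then decaying as 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))


lr_start = inverse_sqrt_lr(0)           # held at 0.01 throughout warmup
lr_mid = inverse_sqrt_lr(500_000)       # roughly 0.0014
lr_late = inverse_sqrt_lr(1_000_000)    # 0.001
```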
A.9 Languages Evaluated
In Table 14, we list the languages for which we evaluate the models trained as described in Sections 4.1 and 4.2.
A.10 Results Details
In Tables 15, LABEL:tab:ntrex-full, LABEL:tab:ntl-full, LABEL:tab:flores-full and LABEL:tab:flores-direct we list the chrF and SacreBLEU scores by language pair for WMT, NTREX, Gatones, Flores-200 and Flores-200 (direct pairs) respectively, along with the model checkpoints.
A.11 Canaries
We design and generate different types of canaries for each dataset. We treat the MADLAD-400 dataset as a large unlabeled pretraining corpus and for each language with sufficient size ( samples), we generate no more than canaries per language, leading to a total of canaries across all languages. For parallel data, we design and generate different canaries specific to its usage for translation. We generate canaries in total for all target languages with samples. In both cases, we scale the proportion of canaries based on the size of each language to minimize their impact to utility in model training. Finally, we also generate “generic” canaries that share no resemblance to natural data. In total, we generate canaries ( of the training data).
Because MADLAD-400 contains only monolingual data, i.e., each example relates to only one language, we treat it as a large unlabeled pretraining corpus. Similar to Anil et al., we aim to generate canaries that share characteristics of the underlying training data but are still outliers. For this, we design three types of canaries: shuffle, interleave and training_prefix. interleave canaries can be viewed as the closest to natural data, where all tokens are sampled from the underlying distribution and most sequence-level correlations are kept intact. We generate these canaries by sampling two real documents from a language, and interspersing chunks of tokens in their same relative ordering. On the other hand, shuffle canaries can be viewed as the farthest from natural data, sharing only the average token frequencies of the language but with no sequence-level correlations. These are generated by sampling a real document and uniformly sampling its tokens without replacement. In addition, we also propose training_prefix canaries, which can be viewed as something in between. Here, each canary is generated by sampling tokens from a real sample and then completing the sequence with random tokens (i.e., taken uniformly with replacement from the vocabulary).
We take care to adjust the number and type of canaries based on the resources of the language, in order to minimize any harm to model utility. We group languages based on their relative size. Then, prior to generation, we fix a target canary rate based on this resource level; this determines the total number of canaries that can be added per language. We choose a smaller proportion of canaries (relative to the total number of sequences in the language) for lower-resource languages, based on the intuition that these languages can tolerate less noisy data before utility is altered. With the number of canaries fixed, we then choose how many times each canary will be repeated, as this has a significant impact on memorization. Note that repeating canaries may be beneficial for studying the relative vulnerability of repeated data, but it also reduces the total number of unique canaries that can be added within that language’s canary budget. We choose the distribution of repeats heuristically, aiming to maximize the support of the distribution while also ensuring that each bucket has enough unique canaries to achieve meaningful results. Finally, as the language size (and thus the canary budget) grows, we also introduce more canary types. We describe the full distribution of canaries generated in Table LABEL:tab:canary-distributions.
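The three canary types can be sketched over lists of token IDs as below. This is a minimal illustration, not the authors' implementation: the function names, the strict alternation of chunks, and the `chunk_size`, `prefix_len`, `total_len` and `vocab_size` parameters are assumptions.

```python
import random
from typing import List, Sequence


def shuffle_canary(doc: Sequence[int], rng: random.Random) -> List[int]:
    """Permute a real document's tokens (sampling without replacement)."""
    tokens = list(doc)
    rng.shuffle(tokens)
    return tokens


def interleave_canary(
    doc_a: Sequence[int], doc_b: Sequence[int], chunk_size: int
) -> List[int]:
    """Alternate chunks from two real documents, keeping each document's order."""
    out: List[int] = []
    for start in range(0, max(len(doc_a), len(doc_b)), chunk_size):
        out.extend(doc_a[start:start + chunk_size])
        out.extend(doc_b[start:start + chunk_size])
    return out


def training_prefix_canary(
    doc: Sequence[int],
    prefix_len: int,
    total_len: int,
    vocab_size: int,
    rng: random.Random,
) -> List[int]:
    """Keep a real prefix, then fill with tokens drawn uniformly from the vocabulary."""
    prefix = list(doc[:prefix_len])
    n_random = max(0, total_len - len(prefix))
    return prefix + [rng.randrange(vocab_size) for _ in range(n_random)]
```

In this sketch, shuffle preserves the token multiset of a single document, interleave preserves local order within each chunk, and training_prefix preserves only the opening span.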
A.11.2 Parallel Canaries
Unlike MADLAD-400, the parallel data consists of source-target pairs from different languages corresponding to the source and target languages for translation. This leads to new intricacies not present in the monolingual setting, e.g., because languages have different grammatical structure, it may be difficult to tailor modifications to both the inputs and outputs simultaneously that maintain linguistic structure.
Rather than design canaries for all combinations of language pairs, where many pairs may have insufficient resource levels to incorporate canaries, we instead focus on the multiway setting where languages are grouped by the target language for translation. To minimize impact on the source languages (which may include very low-resource languages), we opt to not use any real training data from the source as inputs to the model. Instead, we generate canary data following one of two methodologies for the source: random_prefix corresponds to cases where all canaries for a given target language share the same prefix of tokens but have unique uniformly random (w.r.t. the token vocabulary) suffixes following it and full_random canaries are analogous with no shared prefix. The shared prefix of random_prefix canaries is designed to align with cases where data may share common subsequences. For the targets, we either interleave or shuffle them as done in the MADLAD-400 case above (Appendix A.11.1), except interleaving in batches of 50. Taking the outer product of these options, we get four possible canaries, e.g., random_prefix_interleave and so forth. The resource level groups and distribution of canaries is the same as for the MADLAD-400 canaries described in Table LABEL:tab:canary-distributions; the mapping for parallel languages to these resource level groups is shown in Table LABEL:tab:parallel-canaries. In total, canaries are generated this way across all languages.
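The two source-side schemes and the resulting four canary types can be sketched as follows. The function names and the `prefix_len`, `suffix_len`, `length` and `vocab_size` parameters are assumptions, the shared prefix is assumed to be random as well, and uniqueness of the suffixes is not explicitly enforced in this sketch.

```python
import random
from typing import List


def random_tokens(n: int, vocab_size: int, rng: random.Random) -> List[int]:
    """Draw n token IDs uniformly at random from the vocabulary."""
    return [rng.randrange(vocab_size) for _ in range(n)]


def random_prefix_sources(
    num: int, prefix_len: int, suffix_len: int, vocab_size: int, rng: random.Random
) -> List[List[int]]:
    """Sources for one target language: one shared prefix, independent random suffixes."""
    prefix = random_tokens(prefix_len, vocab_size, rng)
    return [prefix + random_tokens(suffix_len, vocab_size, rng) for _ in range(num)]


def full_random_sources(
    num: int, length: int, vocab_size: int, rng: random.Random
) -> List[List[int]]:
    """Sources with no shared prefix: every token is drawn independently."""
    return [random_tokens(length, vocab_size, rng) for _ in range(num)]


# Outer product of source-side and target-side options.
CANARY_TYPES = [
    f"{source}_{target}"
    for source in ("random_prefix", "full_random")
    for target in ("interleave", "shuffle")
]
```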
Finally, we also design two additional types of canaries that use natural data from the source. Because this might impact utility more, we restrict these canaries to only the largest language pairs that have at least examples. The set of language pairs and languages satisfying this threshold are shown in Figure 3 and Figure 4. These two canary types are interleaved_both and interleaved_mislabeled_to. The former performs the same interleaving operation on the source and targets for a language pair, interleaving in batches of 50 tokens. The latter does the same, and additionally selects a new target language label, uniformly at random, from all qualifying high-resource languages. For each language pair listed in Figure 3, we generate 60 canaries in total, split evenly across the two canary types, and in the following distribution: 10 canaries are repeated once, 5 are repeated twice, and 2 are repeated 5 times. This gives a total of canaries across all language pairs. Combined with the canaries from the prior 4 canary types, there are a total of canaries.
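One reading of these numbers that is consistent with the stated total of 60 is that the repeat distribution applies to each of the two canary types separately, and that the total counts every inserted copy, including repeats. The short calculation below checks this reading; it is an interpretation, not a statement from the authors.

```python
# Number of copies of each canary -> number of unique canaries, per canary type.
REPEAT_DISTRIBUTION = {1: 10, 2: 5, 5: 2}
NUM_CANARY_TYPES = 2  # interleaved_both and interleaved_mislabeled_to

unique_per_type = sum(REPEAT_DISTRIBUTION.values())
copies_per_type = sum(reps * n for reps, n in REPEAT_DISTRIBUTION.items())
copies_per_pair = NUM_CANARY_TYPES * copies_per_type

print(unique_per_type, copies_per_type, copies_per_pair)  # prints: 17 30 60
```

Under this reading, each language pair receives 34 unique canaries (17 per type) that account for 60 inserted examples.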
A.11.3 Generic Canaries
Finally, we also designed and generated generic canaries. Unlike the canaries of the prior two sections (A.11.1 and A.11.2), these canaries share minimal resemblance to natural data. These canaries may be useful for understanding memorization of highly outlier data. Here, we generate monolingual canaries where the source and targets are the same. We propose two types of canaries: random_prefix and fully_random canaries. These are the same as the canaries described in Section A.11.2 but with the source matching the target. We generated canaries in total split evenly among 4 types of canaries: fully_random canaries and random_prefix canaries with shared prefixes of length 50, 100, and 200.
A.12 Additional Memorization Figures
A.13 Datasheet
For what purpose was the dataset created? (Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.) We create MADLAD-400 as a general-purpose monolingual document-level dataset covering 419 languages, with the aim of providing general training data for multilingual NLP tasks such as MT and language modeling. One of the goals of the Language Inclusivity Moonshot (LIM; https://blog.google/technology/ai/ways-ai-is-scaling-helpful/) is to scale our language support to 1,000 languages and to support speech recognition for 97% of the global population. A core objective associated with this goal is to open source data and models. Our expectation is that releasing MADLAD-400 will foster progress in language research, especially on medium- and low-resource languages. An estimated 1B+ people globally speak languages that are not covered by mainstream models at Google or externally.
Who created this dataset and on behalf of which entity? Sneha Kudugunta†, Isaac Caswell⋄, Biao Zhang†, Xavier Garcia†, Derrick Xin†, Aditya Kusupati⋄, Romi Stella†, Ankur Bapna†, Orhan Firat† (†Google DeepMind, ⋄Google Research)
Who funded the creation of the dataset? Google Research and Google DeepMind
What do the instances that comprise the dataset represent? Each instance is a preprocessed web-crawled document whose language we annotated using a LangID model described by . For the sentence-level version, we used a sentence splitter to split the documents into sentences and then deduplicated the resulting dataset.
How many instances are there in total? MADLAD-400 has 4.0B documents (100B sentences, or 2.8T tokens) in total across 419 languages, with the median language containing 1.7k documents (73k sentences, or 1.2M tokens).
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? (If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).) MADLAD-400 is created from CommonCrawl documents that have been annotated by language, filtered and preprocessed. To maintain high precision, we filtered out data aggressively, and may not have captured every document of a given language in CommonCrawl. Moreover, there may also be languages in CommonCrawl that we have not mined.
What data does each instance consist of? Each instance is raw text in either document form for the document level data, or in sentence form for the sentence level data.
Is there a label or target associated with each instance? If so, please provide a description. No.
Is any information missing from individual instances? No.
Are relationships between individual instances made explicit? No.
Are there recommended data splits (e.g., training, development/validation, testing)? No.
Are there any errors, sources of noise, or redundancies in the dataset? While we have taken extensive care to audit and filter MADLAD-400, there may still be documents annotated with the wrong language or documents of low quality.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? (If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.) Yes
Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? (If so, please provide a description.) Given that MADLAD-400 is a general web-crawled dataset it is possible that documents in the dataset may contain such information.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? (If so, please describe why.) Given that MADLAD-400 is a general web-crawled dataset, even after filtering, it is possible that there are documents containing offensive content, etc.
Does the dataset relate to people? It is likely that some documents in MADLAD-400 contain sentences referring to and describing people.
Does the dataset identify any subpopulations (e.g., by age, gender)? It is likely that some documents in MADLAD-400 contain sentences referring to and describing people of certain subpopulations.
Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? Yes, it is possible that their names are mentioned in certain documents.
Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? Given that MADLAD-400 is a general web-crawled dataset, even after filtering, it is possible that there are documents containing sensitive data.
How was the data associated with each instance acquired? Each instance was acquired by performing transformations on the documents in all available snapshots of CommonCrawl as of August 20, 2022.
What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? We annotated the CommonCrawl data using a LangID model trained using the procedure described by . Then, we manually inspected the data and then filtered or preprocessed the documents to create MADLAD-400.
If the dataset is a sample from a larger set, what was the sampling strategy? MADLAD-400 is a subset of CommonCrawl documents determined using LangID annotations and filtering/preprocessing steps.
Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? For the audit, the authors inspected the dataset. In some cases native speaker volunteers provided advice on the quality of the dataset.
Over what timeframe was the data collected? (Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.) We do not annotate timestamps. The version of CommonCrawl that we used has webcrawls ranging from 2008 to August 2022.
Were any ethical review processes conducted (e.g., by an institutional review board)? No.
Does the dataset relate to people? It is likely that some documents in MADLAD-400 contain sentences referring to and describing people.
Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)? We collected this data via webpages crawled by CommonCrawl.
Were the individuals in question notified about the data collection? No.
Did the individuals in question consent to the collection and use of their data? No.
If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? No.
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? No.
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? (If so, please provide a description. If not, you may skip the remainder of the questions in this section.) Various types of preprocessing were done: deduplication of 3 sentence spans, filtering substrings according to various heuristics associated with low quality, Virama encoding correction, converting Zawgyi encoding to Unicode encoding for Myanmar script characters and a Chinese pornographic content filter heuristic. In addition, 79 annotated language datasets were removed on inspection due to low quality or mislabeling.
Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? No, this raw data is hosted by CommonCrawl.
Is the software used to preprocess/clean/label the instances available? As of June 13, 2023, no.
Has the dataset been used for any tasks already? (If so, please provide a description.) MADLAD-400 has been used for MT and language modeling.
Is there a repository that links to any or all papers or systems that use the dataset? No
What (other) tasks could the dataset be used for? This dataset could be used as a general training dataset for any of the languages in MADLAD-400.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? (For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks) If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?) While steps have been taken to clean MADLAD-400, documents containing sensitive content about individuals or groups could affect the performance of some downstream NLP tasks. Moreover, when building applications for one or more given languages, we urge practitioners to assess the suitability of MADLAD-400 for their use case.
Are there tasks for which the dataset should not be used? (If so, please provide a description.) N/A.
How will the dataset be distributed (e.g., tarball on website, API, GitHub)? MADLAD-400 is made available through a GCP bucket.
When will the dataset be distributed? June 2023
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? (If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.) AI2 has made a version of this data available under the ODC-BY license. Users are also bound by the CommonCrawl terms of use in respect of the content contained in the dataset.
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? Users are bound by the CommonCrawl terms of use in respect of the content contained in the dataset.
Who is supporting/hosting/maintaining the dataset? An external organization, AI2, is hosting the dataset.
How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Sneha Kudugunta (snehakudugunta@google.com) or Isaac Caswell (icaswell@google.com) for questions about the dataset contents, or Dirk Groeneveld (dirkg@allenai.org) for questions related to the hosting of the dataset.
Is there an erratum? (If so, please provide a link or other access point.) https://github.com/google-research/google-research/tree/master/madlad_400
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? (If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?) There are no such plans, but major issues may be corrected when reported through email or the GitHub page (https://github.com/google-research/google-research/tree/master/madlad_400).
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? (If so, please describe these limits and explain how they will be enforced.) N/A
Will older versions of the dataset continue to be supported/hosted/maintained? (If so, please describe how. If not, please describe how its obsolescence will be communicated to users.) No
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? (If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.) A relatively unprocessed version of MADLAD-400, MADLAD-400-noisy, is made available for others to build upon using cleaning/preprocessing techniques better suited to their specific use cases.
A.14 Model Card
Canaries Datasheet
For what purpose was the dataset created? (Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.) We create these canaries with the goal of enabling the study of memorization in the multilingual and translation settings. Models can be trained on these canaries and their risk of memorization then assessed.
Who created this dataset and on behalf of which entity? Christopher A. Choquette-Choo†, Katherine Lee† (†Google DeepMind)
Who funded the creation of the dataset? Google DeepMind
What do the instances that comprise the dataset represent? Each instance is constructed from a MADLAD-400 sentence-level example. Careful modifications, e.g., shuffling the tokens, are used to make the sample an outlier relative to the natural distribution.
How many instances are there in total? There are in total.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? (If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).) The canary dataset itself is a subsampling from MADLAD-400. However, all described canaries are included in the release.
What data does each instance consist of? Each instance consists of the original text as well as the modified instance in tokens.
Is there a label or target associated with each instance? If so, please provide a description. No.
Is any information missing from individual instances? No.
Are relationships between individual instances made explicit? No.
Are there recommended data splits (e.g., training, development/validation, testing)? No.
Are there any errors, sources of noise, or redundancies in the dataset? Some canaries are duplicated for the purposes of studying repetition in memorization.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? (If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.) Yes
Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? (If so, please provide a description.) Possibly, since the underlying MADLAD-400 dataset may contain such data. However, the modifications used to generate the canaries reduce the chance of this.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? (If so, please describe why.) Possibly, since the underlying MADLAD-400 dataset may contain such data. However, the modifications used to generate the canaries reduce the chance of this.
Does the dataset relate to people? Possibly, since the underlying MADLAD-400 dataset may contain such data. However, the modifications used to generate the canaries reduce the chance of this.
Does the dataset identify any subpopulations (e.g., by age, gender)? Possibly, since the underlying MADLAD-400 dataset may contain such data. However, the modifications used to generate the canaries reduce the chance of this.
Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? Possibly, since the underlying MADLAD-400 dataset may contain such data. However, the modifications used to generate the canaries reduce the chance of this.
Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? Possibly, since the underlying MADLAD-400 dataset may contain such data. However, the modifications used to generate the canaries reduce the chance of this.
How was the data associated with each instance acquired? Each instance was acquired by performing transformations on the documents in all available snapshots of CommonCrawl as of August 20, 2022.
What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? We randomly subsampled MADLAD-400 and then applied random transformations to the sampled sentences (interleaving in batches of tokens or token-level shuffling).
If the dataset is a sample from a larger set, what was the sampling strategy? A predefined number of sentences were uniformly sampled from each language.
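The per-language uniform sampling described in this answer can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_per_language`, the dictionary layout, and the fixed seed are all assumptions made for illustration, and languages with fewer sentences than the requested number are simply taken whole.

```python
import random


def sample_per_language(sentences_by_lang, n_per_lang, seed=0):
    """Uniformly sample up to n_per_lang sentences from each language.

    sentences_by_lang: dict mapping a language code to a list of sentences
    (hypothetical layout, not the actual MADLAD-400 storage format).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = {}
    for lang in sorted(sentences_by_lang):
        sentences = sentences_by_lang[lang]
        k = min(n_per_lang, len(sentences))
        # random.sample draws k distinct items uniformly without replacement
        sample[lang] = rng.sample(sentences, k)
    return sample
```

Each language contributes the same number of sentences regardless of its size in the underlying corpus, so low-resource languages are represented on equal footing.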
Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? The authors created the canary dataset.
Over what timeframe was the data collected? (Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.) We do not annotate timestamps.
Were any ethical review processes conducted (e.g., by an institutional review board)? No.
Does the dataset relate to people? Possibly, since the underlying MADLAD-400 dataset may contain such data. However, the modifications used to generate the canaries reduce the chance of this.
Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)? Obtained second-hand through MADLAD-400, a CommonCrawl based dataset.
Were the individuals in question notified about the data collection? No.
Did the individuals in question consent to the collection and use of their data? No.
If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? No.
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? No.
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? (If so, please provide a description. If not, you may skip the remainder of the questions in this section.) Several types of processing were applied to sentences drawn from MADLAD-400. Sentences were either interleaved in batches of tokens or shuffled at the token level.
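The two transformations named in this answer can be sketched as follows. This is a hedged illustration only, since the preprocessing software is not available: whitespace tokenization, the function names `shuffle_canary` and `interleave_canary`, the choice to interleave exactly two sentences, and the `batch_size` parameter are all assumptions rather than details confirmed by the datasheet.

```python
import random


def shuffle_canary(sentence, rng):
    """Shuffle a sentence at the token level (whitespace tokens assumed)."""
    tokens = sentence.split()
    rng.shuffle(tokens)  # in-place uniform permutation of the tokens
    return " ".join(tokens)


def interleave_canary(first, second, batch_size):
    """Interleave two sentences in alternating batches of batch_size tokens."""
    a, b = first.split(), second.split()
    out = []
    for start in range(0, max(len(a), len(b)), batch_size):
        # Slices past the end of a shorter sentence are empty, so the
        # remaining tokens of the longer sentence are appended as-is.
        out.extend(a[start:start + batch_size])
        out.extend(b[start:start + batch_size])
    return " ".join(out)
```

For example, `interleave_canary("a b c d", "w x y z", 2)` yields `"a b w x c d y z"`. Both transformations keep the original tokens while breaking up the original sentence order, which is consistent with the datasheet's statement that the modifications reduce the chance of reproducing sensitive source text.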
Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? The raw MADLAD-400 data is saved.
Is the software used to preprocess/clean/label the instances available? As of June 13, 2023, no.
Has the dataset been used for any tasks already? (If so, please provide a description.) MADLAD-400 has been used for MT.
Is there a repository that links to any or all papers or systems that use the dataset? No
What (other) tasks could the dataset be used for? This dataset can be used to study memorization in broad language modelling scenarios.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? (For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?) We urge users to read the datasheet for MADLAD-400 to understand the underlying risks for the canaries.
Are there tasks for which the dataset should not be used? (If so, please provide a description.) N/A.
How will the dataset be distributed (e.g., tarball on website, API, GitHub)? MADLAD-400 is made available through a GCP bucket.
When will the dataset be distributed? June 2023
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? (If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.) AI2 has made a version of this data available under the ODC-BY license. Users are also bound by the CommonCrawl terms of use in respect of the content contained in the dataset.
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? Users are bound by the CommonCrawl terms of use in respect of the content contained in the dataset.
Who is supporting/hosting/maintaining the dataset? An external organization, AI2, is hosting the dataset.
How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Contact Christopher A. Choquette-Choo (cchoquette@google.com) for questions about the dataset contents, or Dirk Groeneveld (dirkg@allenai.org) for questions related to the hosting of the dataset.
Is there an erratum? (If so, please provide a link or other access point.) No
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? (If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?) There are no such plans, but major issues may be corrected when reported through email or the GitHub page.
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? (If so, please describe these limits and explain how they will be enforced.) N/A
Will older versions of the dataset continue to be supported/hosted/maintained? (If so, please describe how. If not, please describe how its obsolescence will be communicated to users.) No
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? (If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.) Others may build upon this dataset by similarly generating canaries from MADLAD-400, which is publicly available.