Towards a Cleaner Document-Oriented Multilingual Crawled Corpus
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
Introduction
The demand for large corpora has considerably increased in recent years with the advent of semi-supervised learning methods in Natural Language Processing (NLP), such as word embeddings [Mikolov et al., 2013, Pennington et al., 2014, Mikolov et al., 2018], contextualized word representations [Howard and Ruder, 2018, Peters et al., 2018, Devlin et al., 2019] and, more recently, very large generative language models like GPT-3, T5 and GPT-Neo [Raffel et al., 2020, Brown et al., 2020, Black et al., 2021]. While there have been some recent efforts to manually curate such corpora (https://bigscience.huggingface.co) [Gao et al., 2020], the common approach to collecting large amounts of raw textual data still relies primarily on crawled web text [Ortiz Suárez et al., 2019, Ortiz Suárez et al., 2020, Xue et al., 2021, El-Kishky et al., 2020, Esplà et al., 2019, Bañón et al., 2020, Gao et al., 2020]. Although some of the initial concerns about using crawled data [Trinh and Le, 2018, Radford et al., 2019] have been addressed in recent years [Ortiz Suárez et al., 2020, Martin et al., 2020], there are many concerns that still need to be tackled [Caswell et al., 2020], especially for multilingual data [Caswell et al., 2021].
In this demand for large raw textual corpora, we can observe a clear back and forth in the type of data used to pre-train these models. On the one hand, some authors have opted for highly curated or edited data like Wikipedia, such as ?) and ?) for static word embeddings, the 1B Word Benchmark [Chelba et al., 2014] for ELMo [Peters et al., 2018], and the BookCorpus [Zhu et al., 2015] and Wikipedia for BERT [Devlin et al., 2019]. On the other hand, projects like those of ?) or ?) used crawled data for the pre-training of fixed word embeddings; CamemBERT [Martin et al., 2020], a contextualized model for French, successfully used only crawled data for pre-training; and even large generative language models like T5 have successfully used mainly crawled data [Raffel et al., 2020]. We can of course also see examples of projects successfully using a mix of both manually curated and automatically crawled data, such as RoBERTa [Liu et al., 2019], XLNet [Yang et al., 2019] and GPT-Neo [Black et al., 2021, Gao et al., 2020]. However, no matter the chosen approach to building these large corpora, concerns have been expressed in every case, especially for the datasets used in very large generative language models [Bender et al., 2021], even when using manually edited resources like Wikipedia [Barera, 2020].
In this paper, which is part of the OSCAR project (https://oscar-corpus.com), or Open Super-large Crawled Aggregated coRpus [Ortiz Suárez et al., 2019, Ortiz Suárez et al., 2020, Abadji et al., 2021], we tackle some of the existing problems with OSCAR and its pipeline Ungoliant (https://github.com/oscar-corpus/ungoliant) pointed out by ?; ?), by completely shifting our language classification pipeline from line-level classification to document-level language classification. Moreover, we propose a new set of automatic annotations that we add to the document metadata after language classification and that we hope will help OSCAR users more easily determine which documents they would like to use.
The contributions of the paper are as follows:
A new, document oriented corpus that is comparable in total size and language size distribution with OSCAR 21.09,
A line filter that limits the damage done to document integrity, keeping contiguous lines and making documents human-readable and exploitable as documents,
Annotations that enable quality-related filtering, such as querying documents that meet certain length criteria, potentially increasing the quality of data for less data-hungry applications,
A deduplicated English corpus, as well as a line deduplication tool.
While we are aware that this set of improvements still does not address all the concerns expressed by ?) or ?), we believe that the newly proposed features, as well as the release of OSCAR 22.01, will be of use to the users of the OSCAR project, especially considering that maintaining an up-to-date, manually curated, large multilingual corpus remains a very expensive and time-consuming task.
Related Work
Crawled data, and more specifically Common Crawl (https://commoncrawl.org), has been extensively used for pre-training language representations and large generative language models in recent years. One of the first proposed pipelines to automatically classify Common Crawl by language was that of Grave et al. (2018), which classified Common Crawl entries at the line level using the FastText linear classifier [Joulin et al., 2016, Joulin et al., 2017]. However, even though FastText word embeddings were released for 157 different languages [Grave et al., 2018], the data itself was never released.
Later, Ortiz Suárez et al. (2019) reproduced and optimized the pipeline of Grave et al. (2018) and actually released the data, which became the first version of the OSCAR corpus (now referred to as OSCAR 2019). This pipeline was then rewritten and optimized by Abadji et al. (2021), who in turn released a second version of OSCAR (referred to as OSCAR 21.09) but, other than adding the metadata and using a more recent dump of Common Crawl, it remained virtually the same as the original one proposed by Ortiz Suárez et al. (2019). All three pipelines [Grave et al., 2018, Ortiz Suárez et al., 2019, Abadji et al., 2021] classified Common Crawl’s text at the line level, meaning that the apparent “documents” of OSCAR were actually just contiguous lines of text that were classified as being in the same language. This approach somewhat preserved the document integrity of monolingual entries in Common Crawl, but it completely destroyed the document integrity of multilingual entries.
Parallel to the development of OSCAR, there are also Multilingual C4 (mC4) [Xue et al., 2021] and CCNet [Wenzek et al., 2020], both of which are also derived from Common Crawl but propose pipelines that use document-level language classification as opposed to OSCAR’s line-level classification. Both the CCNet and mC4 pipelines proposed methods for filtering “undesired” data: CCNet used small language models trained on Wikipedia and based on the KenLM library [Heafield, 2011], while mC4 used a simple badword filter (https://github.com/LDNOOBW/).
Filtering
Previous OSCAR pipelines were line-oriented (where a line is defined as a string separated by \n), which meant that the finest filtering granularity was the line. Having a document-oriented corpus implies that:
We must try to keep the document integrity, by altering it in a way that does not completely destroy its coherence.
Operations on the document (filtering, identification, annotation) must take into account the document as a whole.
We aim to produce a corpus that is similar in size and quality to OSCAR 21.09, looking for a set of filters that limits the inclusion of short, noisy lines in documents, while keeping a sufficient quantity of data, especially for low- and mid-resource languages. Those filters either keep/discard a given document, or remove lines from the document body then keep it.
Similar to previous OSCAR pipelines, we use a length-based filter that discards short lines. However, we restrict the removal to contiguous sequences of short lines that are located either at the head or at the tail of the document. In the following document, only the lines preceded by an exclamation point would be kept.
The solution still has numerous drawbacks, especially when dealing with documents crawled from the internet, a source known to be extremely noisy and full of edge cases: Adding a long line at the very head and tail of the previous document would completely negate the benefits of the filter.
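The head/tail trimming described above can be sketched as follows. This is a minimal illustration rather than the Ungoliant implementation: the function name is hypothetical, length is measured in characters, and the 100-character bound is taken from the later remark that the first filter removes lines shorter than 100 characters.

```python
from typing import List

# Assumption: 100 characters, as mentioned later in the paper for the first filter.
MIN_LINE_LENGTH = 100


def trim_short_head_tail(lines: List[str], min_length: int = MIN_LINE_LENGTH) -> List[str]:
    """Drop contiguous runs of short lines at the head and tail of a document.

    Short lines sitting between two long lines are kept, so the remaining
    body stays contiguous.
    """
    start, end = 0, len(lines)
    # Skip short lines from the head.
    while start < end and len(lines[start]) < min_length:
        start += 1
    # Skip short lines from the tail.
    while end > start and len(lines[end - 1]) < min_length:
        end -= 1
    return lines[start:end]


long_line = "x" * 120
document = ["menu", "home", long_line, "short", long_line, "footer"]
trimmed = trim_short_head_tail(document)

# The drawback noted above: a long line at both ends shields everything in between.
shielded = [long_line, "menu", "home", long_line]
untouched = trim_short_head_tail(shielded)
```

In the first document the two leading short lines and the trailing one are removed while the short line in the middle survives; in the second, nothing is removed.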
2. Short lines proportion filter
In order to refine the filtering process, we use a count-based filter that separates the data in two bins: One for short lines and one for long lines. The filter then checks which bin is bigger, and filters out documents where the short lines bin is bigger.
This filter may limit the impact of documents that contain low-quality long lines at the head or tail, with a high number of short lines in between.
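A minimal sketch of this count-based filter is given below. The function name and the 100-character definition of a short line are assumptions made for illustration, and keeping documents on a tie is also an assumption, since the text only states that documents with a bigger short-line bin are discarded.

```python
from typing import List


def passes_short_line_proportion(lines: List[str], min_length: int = 100) -> bool:
    """Return True when the document should be kept.

    Lines are split into two bins by length; the document is discarded
    when the short-line bin holds more lines than the long-line bin.
    """
    short_count = sum(1 for line in lines if len(line) < min_length)
    long_count = len(lines) - short_count
    return short_count <= long_count


long_line = "x" * 120
mostly_long = [long_line, "nav", long_line, long_line]
mostly_short = [long_line, "a", "b", "c", long_line]
```

Here `mostly_long` is kept and `mostly_short` is discarded, even though the latter starts and ends with a long line.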
Identification
The backbone of the language identification process is similar to the one used in goclassy [Ortiz Suárez et al., 2019] for the generation of OSCAR 2019 and in Ungoliant [Abadji et al., 2021] for the generation of OSCAR 21.09. However, shifting to a document-oriented corpus (with a single top-level identification per document) requires inferring the document identification from the line identifications.
We define a document $d$ as a pair $d = (L_d, I_d)$, where $L_d$ is the set of lines (strings separated by \n) that constitute the document and $I_d$ is the set of languages identified by FastText for the document $d$. Note that since FastText identifies one language per line, we always have $|I_d| \le |L_d|$ for every document $d$. When FastText is not able to identify a language for a specific line, for instance because the confidence does not exceed the confidence threshold, we tag said line with the No Identification Language, which we simply denote by $\varnothing$. Furthermore, we define each line $l_i \in L_d$ as a triplet $l_i = (\lambda_i, c_i, s_i)$, where $\lambda_i$ is the language identified by FastText with the highest confidence for the line $l_i$, $c_i$ is said confidence and $s_i$ is the size in bytes of the line $l_i$. We also write $n = |L_d|$ for the number of lines, and we thus define the size of a document $d$ as
\[ s_d = \sum_{i=1}^{n} s_i \]
Moreover, for each identified language $\lambda \in I_d$ in a document $d$ containing $n$ lines, we define its size as
\[ s_\lambda = \sum_{\substack{i=1 \\ \lambda_i = \lambda}}^{n} s_i \]
Finally, for each language $\lambda$ we can also compute its overall weighted confidence $c_\lambda$ throughout the document as the following weighted mean:
\[ c_\lambda = \frac{1}{s_\lambda} \sum_{\substack{i=1 \\ \lambda_i = \lambda}}^{n} c_i \, s_i \]
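The three quantities introduced above (document size, per-language size and per-language weighted confidence) can be computed in a single pass over the lines. The sketch below is illustrative only: the `Line` tuple layout, the use of `None` for the No Identification Language and the choice of line size in bytes as the weight are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical line representation: (language or None, confidence, size in bytes).
Line = Tuple[Optional[str], float, int]


def document_size(lines: List[Line]) -> int:
    """Total size of the document: the sum of its line sizes."""
    return sum(size for _, _, size in lines)


def language_stats(lines: List[Line]) -> Dict[Optional[str], Tuple[int, float]]:
    """Map each language to (size, size-weighted mean confidence)."""
    sizes: Dict[Optional[str], int] = defaultdict(int)
    weighted: Dict[Optional[str], float] = defaultdict(float)
    for lang, confidence, size in lines:
        sizes[lang] += size
        weighted[lang] += confidence * size
    # Languages whose lines are all empty are skipped to avoid dividing by zero.
    return {lang: (sizes[lang], weighted[lang] / sizes[lang])
            for lang in sizes if sizes[lang] > 0}


example = [("fr", 0.9, 100), ("fr", 0.5, 300), ("en", 0.8, 100), (None, 1.0, 50)]
stats = language_stats(example)
```

In this example the long, low-confidence French line dominates the French average, which ends up closer to 0.5 than to 0.9.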
A document can contain lines in multiple languages for several reasons:
Identification mismatch, that can show up frequently, especially with languages that have significant vocabulary overlap (Czech and Slovak),
Crawl from a website where the interface is written in a language, and the body is written in another one,
Crawl from a translation page, where the same content is present in two (or more) different languages.
In these examples, we should aim to limit the presence of 1. and 2., while maximizing the presence of 3.: documents having a balanced set of lines per language. Thus, we decide to take a cautious approach, restricting the multilingual document identification test to the documents that:
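Next, we compute the proportion $p_\lambda$ of each language $\lambda$ in the document, defined as the size of the language divided by the size of the document:
\[ p_\lambda = \frac{s_\lambda}{s_d} \]
where $s_\lambda$ is the size of the language $\lambda$ and $s_d$ the size of the document, as defined above. This proportion is also computed for the no identification language.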
A document containing lines is identified as multilingual if and only if:
As an example, a document holding languages is multilingual if each language makes up at least of the document, and that there is at most of the document that is of unknown identification.
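The shape of this test can be sketched as below. The exact bounds are not reproduced here, so `min_share` and `max_unknown_share` are hypothetical parameters that the caller must supply; measuring proportions in bytes and the tuple layout of a line are likewise assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical line representation: (language or None, confidence, size in bytes).
Line = Tuple[Optional[str], float, int]


def is_multilingual(lines: List[Line], min_share: float, max_unknown_share: float) -> bool:
    """Check that every identified language holds a large enough share of the
    document and that the unidentified share stays below a bound."""
    total = sum(size for _, _, size in lines)
    if total == 0:
        return False
    sizes: Dict[Optional[str], int] = {}
    for lang, _, size in lines:
        sizes[lang] = sizes.get(lang, 0) + size
    unknown = sizes.pop(None, 0)
    if len(sizes) < 2:
        # A single identified language cannot make a multilingual document.
        return False
    if unknown / total > max_unknown_share:
        return False
    return all(size / total >= min_share for size in sizes.values())


balanced = [("en", 0.9, 400), ("fr", 0.9, 400), (None, 1.0, 200)]
unbalanced = [("en", 0.9, 900), ("fr", 0.9, 100)]
```

With illustrative bounds of 0.3 for both parameters, `balanced` passes the test and `unbalanced` does not, which matches the intent of favouring translation-like pages over pages with a foreign-language interface.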
2. Monolingual identification
We begin by identifying each line, keeping in memory the language identified, the confidence of the identification, and the size of the line. We keep track of lines that have not been identified with a special token, and a confidence of 1.
If the document does not pass the multilingual check, we then take the largest represented language, compute its overall confidence, and apply a minimum confidence threshold that is much lower than the one used by the previous pipelines. This is motivated by the following reason: document-based filtering removes documents containing lines that could have been kept by former pipelines, thus reducing the size of the generated data.
Using a lower threshold could help retain lower-quality documents that still contain high-confidence lines.
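A possible reading of the monolingual step is sketched below. It is not the Ungoliant code: the function name and tuple layout are hypothetical, `min_confidence` is left as a required argument because the threshold value is not reproduced here, and ignoring unidentified lines when picking the largest language is an assumption.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical line representation: (language or None, confidence, size in bytes).
Line = Tuple[Optional[str], float, int]


def identify_monolingual(lines: List[Line], min_confidence: float) -> Optional[Tuple[str, float]]:
    """Return (language, weighted confidence) for the largest language of the
    document, or None when its confidence is below the threshold."""
    sizes: Dict[str, int] = {}
    weighted: Dict[str, float] = {}
    for lang, confidence, size in lines:
        if lang is None:
            continue  # assumption: unidentified lines do not compete for the top spot
        sizes[lang] = sizes.get(lang, 0) + size
        weighted[lang] = weighted.get(lang, 0.0) + confidence * size
    if not sizes:
        return None
    top = max(sizes, key=lambda lang: sizes[lang])
    if sizes[top] == 0:
        return None
    confidence = weighted[top] / sizes[top]
    return (top, confidence) if confidence >= min_confidence else None


german_page = [("de", 0.9, 200), ("gsw", 0.7, 50), (None, 1.0, 30)]
result = identify_monolingual(german_page, min_confidence=0.5)  # 0.5 is purely illustrative
```

Note how the minority language (`gsw` here) is absorbed into the document-level label, which is the effect discussed later for low-resource corpora.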
Annotation
While the filtering and identification steps are lenient, using lower thresholds than the previous pipelines, we introduce annotations as non-destructive filters that enable more precise downstream filtering for corpus users, as well as a useful resource to quickly assess the quality of a corpus. Because annotations are non-destructive, they can be more aggressive than filters, and their results can in turn be used to refine the annotation criteria.
Numerous annotations are available, and each document can have several ones at the same time.
Some simple annotations are added when documents do not meet certain length requirements:
The document has a low number of lines, that is, fewer than 5 (tiny)
The document has a high number of short lines (short_sentences)
These annotations help spot potentially tiny documents, where the line structure or the document size could negatively influence training tasks.
A third annotation checks the occurrence of short lines at the start of the document, and adds a header annotation if it is the case, indicating that low-quality content could be present at the start of the document.
A fourth annotation named footer works in the same way on the tail of the document.
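The four length-based annotations can be sketched together as follows. Only the tiny bound (fewer than 5 lines) comes from the text; the 100-character definition of a short line, the `short_share` ratio and the `edge` window used for header and footer are hypothetical parameters chosen for illustration.

```python
from typing import List


def length_annotations(lines: List[str], short_len: int = 100, tiny_lines: int = 5,
                       short_share: float = 0.5, edge: int = 3) -> List[str]:
    """Return the length-related annotations that apply to a document."""
    annotations = []
    is_short = [len(line) < short_len for line in lines]
    if len(lines) < tiny_lines:
        annotations.append("tiny")
    if lines and sum(is_short) / len(lines) > short_share:
        annotations.append("short_sentences")
    # Short lines within the first/last `edge` lines hint at boilerplate.
    if any(is_short[:edge]):
        annotations.append("header")
    if any(is_short[-edge:]):
        annotations.append("footer")
    return annotations


long_line = "x" * 150
clean_doc = [long_line] * 10
header_doc = ["nav"] + [long_line] * 9
```

Since annotations are only labels, several can apply to the same document, and a document with an empty list corresponds to a clean one.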
2. Noise detection
Some documents make their way into the corpus while being extremely noisy or non-linguistic. As an example, source code can be found in English corpora because of the presence of English words in the source itself.
We use a filter that computes a ratio between letters and non-letters.
This filter is based on Unicode categories. We use categories Lu, Ll, Lt, Lm, Lo (Lu: uppercase letter, Ll: lowercase letter, Lt: titlecase letter, Lm: modifier letter, Lo: other letter) for letters, and we add categories Mn, Mc, Me (Mn: nonspacing mark, Mc: spacing mark, Me: enclosing mark) for accents and diacritics.
A noisy annotation is added if the ratio passes a certain threshold.
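The letter ratio can be computed with Python's standard `unicodedata` module, as in the sketch below. The direction of the test and the way the ratio is expressed (share of letter and mark characters among all characters, whitespace included) are assumptions, and `min_letter_share` is a hypothetical parameter since the threshold value is not reproduced here.

```python
import unicodedata

# Letter categories plus mark categories (accents and diacritics).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me"}


def letter_share(text: str) -> float:
    """Share of characters whose Unicode general category is a letter or a mark."""
    if not text:
        return 0.0
    letters = sum(1 for ch in text if unicodedata.category(ch) in LETTER_CATEGORIES)
    return letters / len(text)


def is_noisy(text: str, min_letter_share: float) -> bool:
    """Flag a document whose letter share falls below the given bound."""
    return letter_share(text) < min_letter_share


prose = "hello world"
code_like = "{}();==[]"
decomposed = "e\u0301"  # 'e' followed by a combining acute accent
```

Counting mark categories keeps decomposed accented text from being penalised, which matters for languages that rely heavily on diacritics.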
3. Adult documents
We use the UT1 blocklist (https://dsi.ut-capitole.fr/blacklists/) as a base for adult content filtering.
The UT1 blocklist is a collection of thematic blocklists (adult, gambling, blogs, …), usually used for internet access control in schools. The list is built and extended through both human and automated contributions (known indexes, search engines, exploration of already known addresses). The blocklist is updated two to three times a week by Fabrice Prigent.
Each folder contains URL and domain blocklists, enabling filtering of both websites that are centered around adult content, and websites hosting user-generated content that can be of adult nature (several social networks…).
The adult blocklist is comprised of roughly 3.7M records.
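Matching a document URL against such domain and URL lists can be sketched as follows. The function name and the in-memory sets are hypothetical, and the normalisation (lowercased host, scheme dropped, trailing slash removed, parent domains also checked) is an assumption about how entries would be compared, not a description of the pipeline.

```python
from typing import Set
from urllib.parse import urlsplit


def is_blocklisted(url: str, blocked_domains: Set[str], blocked_urls: Set[str]) -> bool:
    """Return True if the URL's domain (or a parent domain) or its
    host-plus-path form appears in the blocklists."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")
    # Check the full host and each parent domain, but never a bare top-level label.
    for i in range(len(labels) - 1):
        if ".".join(labels[i:]) in blocked_domains:
            return True
    # URL-level entries target specific pages on otherwise acceptable sites.
    return host + parts.path.rstrip("/") in blocked_urls


domains = {"adult.example"}
urls = {"social.example/user/nsfw"}
```

The domain list covers sites centered on adult content, while the URL list covers individual pages on sites that host user-generated content.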
Corpus
We apply the aforementioned pipeline to the November/December 2021 crawl dump of CommonCrawl. The result is a new corpus, OSCAR 22.01. While its structure is different from the previous OSCAR corpora (due to the choice of generating a document oriented corpus), we attempt to compare the two corpora, especially in terms of size and news-related topic presence and recall. We also evaluate the occurrence and pertinence of the annotations.
The data layout of OSCAR 22.01 may limit the relevance of raw size comparisons, since metadata are larger (annotations and line identifications were not present in previous OSCAR Corpora), and fused with textual data (metadata were distributed in separate files for OSCAR 21.09).
However, comparing the distribution of corpus sizes may help us ensure that the new corpus has a size distribution similar to the older one.
We compare the distribution of the corpus sizes between OSCAR 21.09 and OSCAR 22.01 in figure 1. We see that while the overall distribution is similar, the lower end of the distribution has more variance: the range shows more corpora at its bounds than at its center. We also plot the empirical cumulative distribution function, which helps to assess the distribution similarity between OSCAR 21.09 and OSCAR 22.01.
We also select three low-resource languages, three mid-resource languages and three high-resource languages and compare their content (that is, textual data excluding metadata) between OSCAR 22.01 and OSCAR 21.09. The comparison is shown in figure 2. While the overall sizes of these corpora have slightly decreased, the sizes of the mid- and high-resource languages are similar enough.
1.2. Size differences in low-resource languages
The smallest corpora exhibit important size changes. As an example, the Alemannic German corpus went from 7MB to 360KB between OSCAR 21.09 and OSCAR 22.01. This size decrease can be explained by the way the document identification works: by reasoning at the document level, documents containing a majority of lines identified as German and a minority of lines identified as Alemannic German will be identified as German documents, whereas previous OSCAR pipelines would have separated the lines and increased the size of the Alemannic German corpus.
By extracting the lines identified as Alemannic from the German corpus, we get around 30 MB of data, which could constitute an Alemannic corpus with a size comparable to the OSCAR 21.09 Alemannic corpus after confidence- and length-based filtering.
This situation can, in a way, help us investigate the cases of linguistic proximity, where languages have a lexical overlap: When a line identified as Alemannic German is found inside a document that has been identified as German:
Is the line in German and it is an identification error?
Is the line in Alemannic German, in a document that is in German? (ex: A German website related to the Alemannic German language)
Is the whole document in Alemannic German, and the identification classified the majority of Alemannic as German?
All three cases can arise, and they may help to improve the detection of a given language: case (1) reveals identification mismatches that could be used to improve identification after retraining, while case (3) lets a speaker of the language verify and state that the whole document is in Alemannic. The new data collected could in turn be used to improve language detection.
1.3. New themes
As OSCAR 22.01 is based on a November/December 2021 dump (compared to OSCAR 21.09, based on a February 2021 dump), the corpus should include data related to events that occurred after February 2021. We conduct a simple word search similar to the one conducted for the generation of OSCAR 21.09 [Abadji et al., 2021], using both old and new events, in order to give a rough idea of both the topicality and the memory of the corpus.
We see that the terms related to events predating February 2021 still occur in the corpus, with a lower count that remains in the same order of magnitude. We also count the occurrences of the term Omicron, related to the Omicron variant, and observe that the term has a higher count on the 21.09 sample.
1.4. Absence of deduplication
Contrary to OSCAR 21.09, we do not distribute a deduplicated version of the majority of OSCAR 22.01.
The line-level deduplication of documents would have destroyed the integrity of documents themselves, hampering human readability and even sequential sentence sense. We can imagine having forum discussions’ sense destroyed because of identical responses, or song lyrics being altered.
Moreover, the similarity-based document-level deduplication procedure is very costly in terms of computing power and time [Gao et al., 2020].
We choose to distribute a non-deduplicated version of OSCAR, while encouraging the use of deduplication in the context of training language models [Lee et al., 2021]. A line-level deduplication tool will be available as part of the OSCAR toolkit (https://github.com/oscar-corpus/oscar-tools). We will also distribute a deduplicated, line-oriented version of the English part of OSCAR 22.01, with a data layout similar to the OSCAR 21.09 corpora.
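To make the effect of line-level deduplication concrete, here is a minimal sketch that keeps the first occurrence of each line. It is a generic illustration and not the oscar-tools implementation; hashing stripped lines with SHA-1 is an assumption made to keep memory usage bounded by digest size rather than line length.

```python
import hashlib
from typing import Iterable, List


def dedup_lines(lines: Iterable[str]) -> List[str]:
    """Keep only the first occurrence of each line, preserving order."""
    seen = set()
    unique = []
    for line in lines:
        digest = hashlib.sha1(line.strip().encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(line)
    return unique


lyrics = ["verse one", "chorus", "verse two", "chorus"]
deduplicated = dedup_lines(lyrics)
```

The repeated chorus disappears from the output, which illustrates why this kind of deduplication alters song lyrics or forum threads when applied inside documents.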
2. Annotations
Annotations help us to infer the composition of the corpora: the tiny, short_sentences and especially noisy annotations may indicate documents of varying degrees of poor quality, with noisy being the worst.
Also, comparing corpora annotation distributions, especially related to their size, could highlight potentially very low quality corpora. This semi-automated quality checking process could be used to label corpora where data quality is bad.
We select 3 low-resource, 3 mid-resource and 3 high-resource languages and plot the number of documents per annotation, adding a total legend for the total document count and a clean legend for documents that do not have any annotation. We then plot the counts for each resource group using adapted scales.
We observe that the annotation distribution is similar for each resource group, but that the lower resourced languages have a higher proportion of documents annotated with short_sentences and tiny.
In order to better compare the resource groups, we display the annotation distribution in a heat map (figure 4). We notice important differences between the low-resource group and the mid- and high-resource groups. A very large proportion of the low-resource group is annotated as tiny while containing few documents annotated short_sentences, indicating the presence of long sentences within documents with a low number of sentences.
2.2. Multilinguality
The OSCAR 22.01 Corpus also contains a multilingual corpus, composed of documents holding lines in multiple languages. Each document contains at least 2 languages, and at most 5.
We check the co-occurrence of languages, highlighting the coupling of language tuples. These tuples may highlight either linguistic similarity (Czech and Slovak, Russian and Uzbek) and the resulting classification errors, or languages commonly found together in documents. Due to the number of languages and the sparsity of the data, we show the language pairs with a number of documents greater than 20 000 (Figure 5).
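The co-occurrence counts can be obtained by enumerating unordered language pairs per document, as in this sketch. The input format (one list of language codes per multilingual document) and the function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def language_pair_counts(documents: Iterable[List[str]]) -> Counter:
    """Count, for each unordered language pair, the number of documents
    in which both languages appear."""
    counts: Counter = Counter()
    for languages in documents:
        # Sorting makes (cs, sk) and (sk, cs) the same key.
        for pair in combinations(sorted(set(languages)), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(counts: Counter, min_documents: int) -> Dict[Tuple[str, str], int]:
    """Keep only the pairs found in more than `min_documents` documents."""
    return {pair: n for pair, n in counts.items() if n > min_documents}


docs = [["cs", "sk"], ["sk", "cs", "en"], ["en", "fr"]]
pair_counts = language_pair_counts(docs)
```

Calling `frequent_pairs(pair_counts, 20000)` on the real counts would reproduce the cut-off used for the figure.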
We also note the presence of English in a high number of documents. This could be explained by boilerplate content in web pages, such as menu headers or footers.
Using the clean annotation filter on the multilingual corpus may help to retrieve the highest quality multilingual documents.
2.3. Clean documents
We also look into documents that did not get annotated at all, and we find that these documents are usually of a high quality. However, their relative proportion in corpora may limit their usage.
We use a sample of the English corpus (183,497 documents, 1.3 GB) and compare the size of documents depending on the presence (or not) of annotations. The stacked counts are shown in figure 6.
We observe that clean document mean length is slightly shorter than non-clean ones. Also, we note that while the length standard deviation of clean documents seems to be shorter, the computation yields larger numbers, caused by outliers in the high end (Annotations: , Clean: ). By removing the top and bottom 5%, we get (Annotations: , Clean: ).
These results are not sufficient to draw conclusions about the intrinsic quality of the clean documents, but they may ease the study of the filters and help identify future filtering needs.
2.4. Adult documents
While very small in proportions, adult annotation documents highlight interesting facts.
The French sample contains 32,870 adult documents, out of 52,037,098.
We count how many documents coming from tetu.com are labeled as adult, in order to probe the possibility of finding LGBTQI+ content annotated as adult. We find 1063 documents, representing about 3.2% of the adult documents. This may imply that more LGBTQI+ content sites are present in the blocklist, thus increasing the ratio of LGBTQI+ content labeled as adult.
We take the first 100 adult documents of the French corpus and check whether they are properly classified.
true positives: documents that exhibit explicit sexual content geared towards pornography (pornographic websites, sexually explicit fiction),
false positives: documents that do not meet this criterion.
We separately count documents that are simultaneously non-explicit and from LGBTQI+ websites:
false positives belonging to LGBTQI+ websites.
While the majority of documents are properly classified as adult, numerous educational documents also appear. This type of document uses explicit language but features good document quality and a better representation of sexuality, one that is less offensive than the usual associations between sexually explicit content and hate speech [Luccioni and Viviano, 2021].
The false positives are, for the majority, websites that do not belong in the blocklist in the first place. We suppose that the addresses were previously used as adult websites.
2.5. Hard bounds problems
Several pipeline steps (especially annotators) work using hard thresholds. As an example, any document that has fewer than 5 lines is considered to be tiny. However, when exploring the data, we can see that there are a number of documents whose number of lines is close to the threshold and whose quality is similar to that of the documents labeled as tiny.
When plotting the distribution of clean and annotated corpus data, we notice that a very high number of documents are of a tiny size, which happens to be the minimum size for a document to be accepted, since the first filter removes lines that are shorter than 100 characters.
Discussion
We provide a new, document-oriented corpus that is comparable in size to OSCAR 21.09, keeps document integrity and is easier to filter thanks to annotations.
While the mid- and high-resource languages are of a similar size, several low-resource languages have seen an important decrease in size. We still have to check whether this size decrease comes with a quality increase, since previous low-resource OSCAR corpora sometimes exhibited extremely poor quality: many non-linguistic corpora were published and deemed unusable weeks or months after release.
We also note that documents of similar languages could have been merged into larger corpora, and we show that the German corpus holds around 30 MB of Alemannic that, with appropriate filtering, could be treated as an independent corpus. These cases of merging are also interesting to investigate, as they can explain identification mismatches and could, in turn, help to build better language identification models. More work has to be done in order to properly map the connection between low-resource languages and the mid- and high-resource languages potentially containing data in these languages.
2. Annotations
The selected annotations exhibit numerous caveats that have to be addressed in the future iterations of OSCAR generation pipelines.
The length-based annotations are widespread in the corpus, especially in mid- to high-resource languages (for example in Czech), highlighting the potential low quality of a high number of documents as well as the need to better characterize the nature of these line length discrepancies. Web crawls often contain boilerplate content extracted from headers, footers and sidebars, and these lines are present in the Common Crawl dumps. One solution would be to base the whole OSCAR generation pipeline on raw HTML files, at the risk of multiplying the computational cost and complexity of generating corpora.
The adult annotation, based on an adult URL blocklist, is present on a very limited set of documents. However, studies have shown that adult content was present in a previous version of OSCAR in a larger proportion than the one measured here [Caswell et al., 2021], hinting at poor performance of the blocklist-based adult content filtering approach. Moreover, we noticed that the blocklist contained websites covering LGBTQI+ related topics, which damages the representation of LGBTQI+ communities (association with adult content, filtering out of LGBTQI+ documents, which in turn could limit their representation in downstream tasks). Model-based approaches may help in improving the adult annotation, and should be the next step towards a better annotation of adult content [Luccioni and Viviano, 2021].
Bibliographical References
Appendix A Carbon Footprint
Taking into consideration recent concerns regarding the power consumption and carbon footprint of machine learning experiments [Schwartz et al., 2020, Bender et al., 2021], we report the power consumption and carbon footprint of the OSCAR generation, assuming the whole dump of Common Crawl has already been downloaded. We follow the approach of Strubell et al. (2019).
We use a single machine with 192 GB of RAM and two Intel Xeon Gold 5218 processors, each rated at 125 W (according to the Intel Xeon Gold 5218 specification). For the DRAM, we can use the work of ?) to estimate the total power draw of 192GB of RAM at around 20W. The total power draw of this setting adds up to around 270 W.
Having this information, we can now use the formula proposed by Strubell et al. (2019) in order to compute the total power required to pre-train one model from scratch:
\[ p_t = \frac{1.58 \, t \, (c \, p_c + p_r)}{1000} \]
where $t$ is the total time in hours, $c$ is the number of CPUs, $p_c$ is the average power draw (in Watts) from all CPU sockets and $p_r$ the average power draw from all DRAM sockets. We estimate the total power consumption by adding CPU and DRAM consumption, and then multiplying by the Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure. We use a PUE coefficient of 1.58, the 2018 global average for data centers [Strubell et al., 2019]. The total time to generate OSCAR 22.01 on this infrastructure was 42.6 hours. We use this information to compute the total power consumption of the OSCAR generation, which amounts to 0.4266kWh.
We can further estimate the CO2 emissions in kilograms of the OSCAR generation by multiplying the total power consumption by the average CO2 emissions per kWh in our region, which were 38.64g/kWh on average between the 3rd and the 5th of January 2022 (source: Rte - éCO2mix), the exact time at which the generation was run. Thus the total CO2 emissions in kg can be computed as:
\[ \mathrm{CO_2e} = 0.03864 \, p_t \]
where $p_t$ is the total power consumption in kWh. Thus total CO2 emissions amount to 0.01648kg or 16.48g.
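The arithmetic of this appendix can be retraced with the short sketch below, which uses only the figures given above (two CPUs at 125 W, 20 W of DRAM, a PUE of 1.58 and 38.64 g of CO2 per kWh). The function names are hypothetical. The reported 0.4266 figure coincides with the PUE-adjusted draw 1.58 × 270 W expressed in kW, that is, the energy used over one hour; the runtime is therefore kept as an explicit argument so that the energy over any number of hours can be derived in a separate step.

```python
PUE = 1.58  # Power Usage Effectiveness used in the text


def adjusted_draw_kw(n_cpus: int, cpu_watts: float, dram_watts: float, pue: float = PUE) -> float:
    """PUE-adjusted power draw of the machine, in kilowatts."""
    return pue * (n_cpus * cpu_watts + dram_watts) / 1000


def energy_kwh(draw_kw: float, hours: float) -> float:
    """Energy consumed when drawing `draw_kw` for `hours` hours."""
    return draw_kw * hours


def emissions_g(kwh: float, grams_per_kwh: float) -> float:
    """CO2 emissions in grams for a given energy and carbon intensity."""
    return kwh * grams_per_kwh


draw = adjusted_draw_kw(n_cpus=2, cpu_watts=125, dram_watts=20)
one_hour_energy = energy_kwh(draw, hours=1)
one_hour_emissions = emissions_g(one_hour_energy, grams_per_kwh=38.64)
full_run_energy = energy_kwh(draw, hours=42.6)
```

With these inputs, `one_hour_emissions` is about 16.48 g, in line with the figure reported above, while `full_run_energy` scales the same draw to the 42.6 hour runtime.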