FlauBERT: Unsupervised Language Model Pre-training for French

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, Didier Schwab

Introduction

A recent game-changing contribution in Natural Language Processing (NLP) was the introduction of deep unsupervised language representations pre-trained using only plain text corpora. Previous word embedding pre-training approaches, such as word2vec [Mikolov et al., 2013] or GloVe [Pennington et al., 2014], learn a single vector for each wordform. By contrast, these new models are trained to produce contextual embeddings: the output representation depends on the entire input sequence (e.g. each token instance has a vector representation that depends on its left and right context). Initially based on recurrent neural networks [Dai and Le, 2015, Ramachandran et al., 2017, Howard and Ruder, 2018, Peters et al., 2018], these models quickly converged towards the use of the Transformer [Vaswani et al., 2017], such as GPT [Radford et al., 2018], BERT [Devlin et al., 2019], XLNet [Yang et al., 2019b], RoBERTa [Liu et al., 2019]. Using these pre-trained models in a transfer learning fashion has shown to yield striking improvements across a wide range of NLP tasks. One can easily build state-of-the-art NLP systems thanks to the publicly available pre-trained weights, saving time, energy, and resources. As a consequence, unsupervised language model pre-training has become a de facto standard in NLP. This has been, however, mostly demonstrated for English even though multi-lingual or cross-lingual variants are also available, taking into account more than a hundred languages in a single model: mBERT [Devlin et al., 2019], XLM [Lample and Conneau, 2019], XLM-R [Conneau et al., 2019].

In this paper, we describe our methodology to build FlauBERT (French Language Understanding via Bidirectional Encoder Representations from Transformers), a French BERT model that outperforms multi-lingual/cross-lingual models in several downstream NLP tasks, under similar configurations. We learned of a similar project that resulted in a publication on arXiv [Martin et al., 2019]. However, we believe that these two works on French language models are complementary since the NLP tasks we addressed are different, as are the training corpora and preprocessing pipelines. We also point out that our models were trained using the CNRS (French National Centre for Scientific Research) public research computational infrastructure and did not receive any assistance from a private stakeholder. FlauBERT relies on freely available datasets and is made publicly available in different versions (https://github.com/getalp/Flaubert). For further reproducible experiments, we also provide the complete processing and training pipeline as well as a general benchmark for evaluating French NLP systems. This evaluation setup is similar to the popular GLUE benchmark [Wang et al., 2018], and is named FLUE (French Language Understanding Evaluation).

Related Work

Self-supervised pre-training on unlabeled text data was first proposed in the task of neural language modeling [Bengio et al., 2003, Collobert and Weston, 2008], where it was shown that a neural network trained to predict the next word from prior words can learn useful embedding representations, called word embeddings (each word is represented by a fixed vector). (Self-supervised learning is a special case of unsupervised learning where unlabeled data is used as a supervision signal.) These representations were shown to play an important role in NLP, yielding state-of-the-art performance on multiple tasks [Collobert et al., 2011], especially after the introduction of word2vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014], efficient and effective algorithms for learning word embeddings.

A major limitation of word embeddings is that a word can only have a single representation, even if it can have multiple meanings (e.g. depending on the context). Therefore, recent works have introduced a paradigm shift from context-free word embeddings to contextual embeddings: the output representation is a function of the entire input sequence, which allows encoding complex, high-level syntactic and semantic characteristics of words or sentences.

This line of research was started by Dai and Le (2015), who proposed pre-training representations via either an encoder-decoder language model or a sequence autoencoder. Ramachandran et al. (2017) showed that this approach can be applied to pre-training sequence-to-sequence models [Sutskever et al., 2014]. (It should be noted that learning contextual embeddings was also proposed in [McCann et al., 2017], but in a supervised fashion as they used annotated machine translation data.) These models, however, require a significant amount of in-domain data for the pre-training tasks. Peters et al. (2018, ELMo) and Howard and Ruder (2018, ULMFiT) were the first to demonstrate that leveraging huge general-domain text corpora in pre-training can lead to substantial improvements on downstream tasks. Both methods employ LSTM [Hochreiter and Schmidhuber, 1997] language models, but ULMFiT utilizes a regular multi-layer architecture, while ELMo adopts a bidirectional LSTM to build the final embedding for each input token from the concatenation of the left-to-right and right-to-left representations. Another fundamental difference lies in how each model can be tuned to different downstream tasks: ELMo delivers different word vectors that can be interpolated, whereas ULMFiT enables robust fine-tuning of the whole network w.r.t. the downstream tasks. The ability to fine-tune was shown to significantly boost performance, and thus this approach has been further developed in recent works such as MultiFiT [Eisenschlos et al., 2019] or, most prominently, Transformer-based [Vaswani et al., 2017] architectures: GPT [Radford et al., 2018], BERT [Devlin et al., 2019], XLNet [Yang et al., 2019b], XLM [Lample and Conneau, 2019], RoBERTa [Liu et al., 2019], ALBERT [Lan et al., 2019], T5 [Raffel et al., 2019]. These methods have, one after the other, established new state-of-the-art results on various NLP benchmarks, such as GLUE [Wang et al., 2018] or SQuAD [Rajpurkar et al., 2018], surpassing previous methods by a large margin.

2. Pre-trained Language Models Beyond English

Given the impact of pre-trained language models on NLP downstream tasks in English, several works have recently released pre-trained models for other languages. For instance, ELMo exists for Portuguese, Japanese, German and Basque (https://allennlp.org/elmo), while BERT and variants were specifically trained for simplified and traditional Chinese and German (https://deepset.ai/german-bert). A Portuguese version of MultiFiT is also available (https://github.com/piegu/language-models). Recently, more monolingual BERT-based models have been released, such as for Arabic [Antoun et al., 2020], Dutch [de Vries et al., 2019, Delobelle et al., 2020], Finnish [Virtanen et al., 2019], Italian [Polignano et al., 2019], Portuguese [Souza et al., 2019], Russian [Kuratov and Arkhipov, 2019], Spanish [Cañete et al., 2020], and Vietnamese [Nguyen and Nguyen, 2020]. For French, besides pre-trained language models using ULMFiT and MultiFiT configurations, CamemBERT [Martin et al., 2019] is a French BERT model concurrent to our work.

Another trend considers one model estimated for several languages with a shared vocabulary. The release of multilingual BERT for 104 languages pioneered this approach (https://github.com/google-research/bert). A recent extension of this work leverages parallel data to build a cross-lingual pre-trained version of LASER [Artetxe and Schwenk, 2019] for 93 languages, XLM [Lample and Conneau, 2019] and XLM-R [Conneau et al., 2019] for 100 languages.

3. Evaluation Protocol for French NLP Tasks

The existence of a multi-task evaluation benchmark such as GLUE [Wang et al., 2018] for English is highly beneficial to facilitate research in the language of interest. The GLUE benchmark has become a prominent framework to evaluate the performance of NLP models in English. The recent contributions based on pre-trained language models have led to remarkable performance across a wide range of Natural Language Understanding (NLU) tasks. The authors of GLUE have therefore introduced SuperGLUE [Wang et al., 2019a]: a new benchmark built on the principles of GLUE, including a more challenging and diverse set of tasks. A Chinese version of GLUE (https://github.com/chineseGLUE/chineseGLUE) has also been developed to evaluate model performance in Chinese NLP tasks. As of now, we have not learned of any such benchmark for French.

Building FlauBERT

In this section, we describe the training corpus, the text preprocessing pipeline, the model architecture and training configurations to build $\text{FlauBERT}_{\textsc{base}}$ and $\text{FlauBERT}_{\textsc{large}}$.

Our French text corpus consists of 24 sub-corpora gathered from different sources, covering diverse topics and writing styles, ranging from formal and well-written text (e.g. Wikipedia and books, http://www.gutenberg.org) to random text crawled from the Internet (e.g. Common Crawl, http://data.statmt.org/ngrams/deduped2017). The data were collected from three main sources: (1) monolingual data for French provided in WMT19 shared tasks [Li et al., 2019, 4 sub-corpora]; (2) French text corpora offered in the OPUS collection [Tiedemann, 2012, 8 sub-corpora]; and (3) datasets available in the Wikimedia projects [Meta, 2019, 8 sub-corpora].

We used the WikiExtractor tool (https://github.com/attardi/wikiextractor) to extract the text from Wikipedia. For the other sub-corpora, we either used our own tool to extract the text or downloaded them directly from their websites. The total size of the uncompressed text before preprocessing is 270 GB. More details can be found in Appendix A.1.

Data preprocessing

For all sub-corpora, we filtered out very short sentences as well as repetitive and non-meaningful content such as telephone/fax numbers, email addresses, etc. For Common Crawl, which is our largest sub-corpus with 215 GB of raw text, we applied aggressive cleaning to reduce its size to 43.4 GB. All the data were Unicode-normalized in a consistent way before being tokenized using Moses tokenizer [Koehn et al., 2007]. The resulting training corpus is 71 GB in size.
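The filtering and normalization steps described above can be sketched roughly as follows. This is a minimal illustration rather than the released pipeline: the regular expressions, the `MIN_TOKENS` threshold, and the choice of NFC as the normalization form are all assumptions made for the example, and tokenization with Moses is left out.

```python
import re
import unicodedata

# Hypothetical patterns and threshold: the exact rules are not given in the text.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"(?:\+?\d[\s.\-]?){8,}")  # 8+ digits, optionally separated
MIN_TOKENS = 3  # assumed cutoff for "very short" sentences


def clean_sentence(sentence):
    """Return the normalized sentence, or None if it should be filtered out."""
    # NFC is an assumption; the paper only says "Unicode-normalized".
    text = unicodedata.normalize("NFC", sentence).strip()
    if len(text.split()) < MIN_TOKENS:
        return None  # too short to be useful
    if EMAIL_RE.search(text) or PHONE_RE.search(text):
        return None  # contact details and similar non-meaningful content
    return text


def clean_corpus(sentences):
    """Apply clean_sentence to every sentence and keep only the survivors."""
    cleaned = (clean_sentence(s) for s in sentences)
    return [s for s in cleaned if s is not None]
```

A whitespace split stands in for real tokenization here; in practice the cutoff would more plausibly be applied to properly tokenized text.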

Our code for downloading and preprocessing data is made publicly available (https://github.com/getalp/Flaubert).

2. Models and Training Configurations

FlauBERT has the same model architecture as BERT [Devlin et al., 2019], which consists of a multi-layer bidirectional Transformer [Vaswani et al., 2017]. Following Devlin et al. (2019), we propose two model sizes:

$\text{FlauBERT}_{\textsc{base}}$: $L=12, H=768, A=12$,

$\text{FlauBERT}_{\textsc{large}}$: $L=24, H=1024, A=16$,

where $L$, $H$ and $A$ respectively denote the number of Transformer blocks, the hidden size, and the number of self-attention heads. As the Transformer has become quite standard, we refer to Vaswani et al. (2017) for further details.

Training objective and optimization

Pre-training of the original BERT [Devlin et al., 2019] consists of two supervised tasks: (1) a masked language model (MLM) that learns to predict randomly masked tokens; and (2) a next sentence prediction (NSP) task in which the model learns to predict whether B is the actual next sentence that follows A, given a pair of input sentences A,B.
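To make the MLM task concrete, the sketch below corrupts a token sequence in the way described for the original BERT: about 15% of positions are selected, and a selected token is replaced by a mask symbol 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The `<mask>` symbol and the toy `VOCAB` are placeholders chosen for the example, and whether FlauBERT uses exactly these rates is an assumption.

```python
import random

MASK = "<mask>"  # placeholder mask symbol
VOCAB = ["le", "chat", "dort", "sur", "canapé"]  # toy vocabulary for random replacement


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where no prediction is required."""
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)  # 80%: replace with the mask symbol
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: replace with a random token
            else:
                corrupted.append(tok)  # 10%: keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

The loss would then be computed only at positions whose label is not `None`.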

Devlin et al. (2019) observed that removing NSP significantly hurts performance on some downstream tasks. However, the opposite was shown in later studies, including Yang et al. (2019b, XLNet), Lample and Conneau (2019, XLM), and Liu et al. (2019, RoBERTa). (Liu et al. (2019) hypothesized that the original BERT implementation may only have removed the loss term while still retaining a bad input format, resulting in performance degradation.) Therefore, we only employed the MLM objective in FlauBERT.

To optimize this objective function, we followed Liu et al. (2019) and used the Adam optimizer [Kingma and Ba, 2014] with the following parameters:

$\text{FlauBERT}_{\textsc{base}}$: warmup steps of 24k, peak learning rate of $6 \times 10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.98$, $\epsilon=1 \times 10^{-6}$ and weight decay of 0.01.

$\text{FlauBERT}_{\textsc{large}}$: warmup steps of 30k, peak learning rate of $3 \times 10^{-4}$, $\beta_{1}=0.9$, $\beta_{2}=0.98$, $\epsilon=1 \times 10^{-6}$ and weight decay of 0.01.
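As an illustration of how the warmup steps and peak learning rate interact, the sketch below collects the two configurations and implements a schedule that rises linearly to the peak during warmup. The text does not specify what happens after warmup; the linear decay to zero and the `total_steps` argument are assumptions made for this example.

```python
# Optimizer settings as listed above, collected for reference.
BASE = dict(peak_lr=6e-4, warmup_steps=24_000, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01)
LARGE = dict(peak_lr=3e-4, warmup_steps=30_000, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01)


def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero.

    The decay shape and total_steps are assumptions, not taken from the paper.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

For instance, with the base settings and a hypothetical `total_steps=100_000`, the rate is half the peak at step 12,000 and reaches the peak at step 24,000.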

Training very deep Transformers is known to be susceptible to instability [Wang et al., 2019b, Nguyen and Salazar, 2019, Xu et al., 2019, Fan et al., 2019]. Not surprisingly, we also observed this difficulty when training $\text{FlauBERT}_{\textsc{large}}$ using the same configurations as $\text{BERT}_{\textsc{large}}$ and $\text{RoBERTa}_{\textsc{large}}$, where divergence happened at an early stage.

Several methods have been proposed to tackle this issue. For example, in an updated implementation of the Transformer [Vaswani et al., 2018], layer normalization is applied before each attention layer by default, rather than after each residual block as in the original implementation [Vaswani et al., 2017]. These configurations are called pre-norm and post-norm, respectively. It was observed by Vaswani et al. (2018), and again confirmed by later works e.g. [Wang et al., 2019b, Xu et al., 2019, Nguyen and Salazar, 2019], that pre-norm helps stabilize training. Recently, a regularization technique called stochastic depths [Huang et al., 2016] has been demonstrated to be very effective for training deep Transformers, by e.g. ?) and ?) who successfully trained architectures of more than 40 layers. The idea is to randomly drop a number of (attention) layers at each training step. Other techniques are also available, such as progressive training [Gong et al., 2019], or improving initialization [Zhang et al., 2019a, Xu et al., 2019] and normalization [Nguyen and Salazar, 2019].

For training $\text{FlauBERT}_{\textsc{large}}$, we employed pre-norm attention and stochastic depths for their simplicity. We found that these two techniques were sufficient for successful training. We set the rate of layer dropping to 0.2 in all the experiments.
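The two techniques can be combined in a few lines. The sketch below is a simplified, framework-free illustration: each block computes x + sublayer(LayerNorm(x)) (pre-norm), and during training each block is skipped entirely with probability `drop_rate`. The layer normalization has no learned parameters, the sublayers are arbitrary callables, and no rescaling is applied at inference time; all of these are simplifications assumed for the example.

```python
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def pre_norm_block(x, sublayer):
    """Pre-norm residual block: normalization is applied before the sublayer."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def forward(x, sublayers, drop_rate=0.2, training=True, rng=None):
    """Run a stack of pre-norm blocks, skipping each with probability drop_rate in training."""
    rng = rng or random.Random()
    for sublayer in sublayers:
        if training and rng.random() < drop_rate:
            continue  # dropped layer: the residual path passes x through unchanged
        x = pre_norm_block(x, sublayer)
    return x
```

Because every block is residual, dropping one leaves the hidden state unchanged, so the network stays well defined whichever subset of layers is active at a given step.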

Other training details

A vocabulary of 50K sub-word units is built using the Byte Pair Encoding (BPE) algorithm [Sennrich et al., 2016]. The only difference between our work and RoBERTa is that the training data are preprocessed and tokenized using a basic tokenizer for French [Koehn et al., 2007, Moses], as in XLM [Lample and Conneau, 2019], before the application of BPE. We use fastBPE (https://github.com/glample/fastBPE), a very efficient implementation, to extract the BPE units and encode the corpora.
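For readers unfamiliar with BPE, the following toy sketch shows the core of the merge-learning procedure: repeatedly count adjacent symbol pairs and merge the most frequent one. It is an illustration of the algorithm only, not of fastBPE's interface; the function names and the `</w>` end-of-word marker are choices made for this example.

```python
import collections
import re


def pair_counts(vocab):
    """Count adjacent symbol pairs over a {space-separated word: frequency} vocabulary."""
    counts = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts


def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily learn up to num_merges merge operations; return (merges, final vocab)."""
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair (first one wins ties)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

On a toy vocabulary such as `{"c h a t </w>": 5, "c h a t s </w>": 3, "c h i e n </w>": 2}`, the first learned merges progressively assemble the frequent unit `chat`.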

$\text{FlauBERT}_{\textsc{base}}$ is trained on 32 Nvidia V100 GPUs in 410 hours and $\text{FlauBERT}_{\textsc{large}}$ is trained on 128 GPUs in 390 hours, both with an effective batch size of 8192 sequences.

Finally, we summarize the differences between FlauBERT and BERT, RoBERTa, CamemBERT in Table 1.

FLUE

In this section, we compile a set of existing French language tasks to form an evaluation benchmark for French NLP that we call FLUE (French Language Understanding Evaluation). We select datasets that vary in domain, level of difficulty, degree of formality, and amount of training samples. Three out of six tasks (Text Classification, Paraphrase, Natural Language Inference) are from cross-lingual datasets since we also aim to provide results from a monolingual pre-trained model to facilitate future studies of cross-lingual models, which have been drawing much research interest recently.

Table 2 gives an overview of the datasets, including their domains and training/development/test splits. The details are presented in the next subsections.

The Cross Lingual Sentiment CLS [Prettenhofer and Stein, 2010] dataset consists of Amazon reviews for three product categories: books, DVD, and music in four languages: English, French, German, and Japanese. Each sample contains a review text and the associated rating from 1 to 5 stars. Following ?) and ?), ratings with 3 stars are removed. Positive reviews have ratings higher than 3 and negative reviews are those rated lower than 3. There is one train and test set for each product category. The train and test sets are balanced, including around 1 000 positive and 1 000 negative reviews for a total of 2 000 reviews in each dataset. We take the French portion to create the binary text classification task in FLUE and report the accuracy on the test set.
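The labeling rule above is simple enough to state as code. The following sketch, with function names chosen for the example, maps star ratings to binary labels, discards 3-star reviews, and computes the accuracy metric reported for this task.

```python
def rating_to_label(stars):
    """Map a 1-5 star rating to a binary label; 3-star reviews are discarded (None)."""
    if stars == 3:
        return None
    return "positive" if stars > 3 else "negative"


def build_examples(reviews):
    """Turn (text, stars) pairs into (text, label) pairs, dropping 3-star reviews."""
    labeled = ((text, rating_to_label(stars)) for text, stars in reviews)
    return [(text, label) for text, label in labeled if label is not None]


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```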

2. Paraphrasing

The Cross-lingual Adversarial Dataset for Paraphrase Identification PAWS-X [Yang et al., 2019a] is the extension of the Paraphrase Adversaries from Word Scrambling PAWS [Zhang et al., 2019b] for English to six other languages: French, Spanish, German, Chinese, Japanese and Korean. PAWS comprises English paraphrase identification pairs from Wikipedia and Quora in which the two sentences in a pair have a high lexical overlap ratio, generated by LM-based word scrambling and back translation followed by human judgement. The paraphrasing task is to identify whether the sentences in these pairs are semantically equivalent or not. Similar to previous approaches to create multilingual corpora, Yang et al. (2019a) used machine translation to create the training set for each target language in PAWS-X from the English training set in PAWS. The development and test sets for each language are translated by human translators. We take the related datasets for French to perform the paraphrasing task and report the accuracy on the test set.

3. Natural Language Inference

The Cross-lingual NLI (XNLI) corpus [Conneau et al., 2018] extends the development and test sets of the Multi-Genre Natural Language Inference corpus [Williams et al., 2018, MultiNLI] to 15 languages. The development and test sets for each language consist of 7 500 human-annotated examples, making up a total of 112 500 sentence pairs annotated with the labels entailment, contradiction, or neutral. Each sentence pair includes a premise ($p$) and a hypothesis ($h$). The Natural Language Inference (NLI) task, also known as recognizing textual entailment (RTE), is to determine whether $p$ entails, contradicts or neither entails nor contradicts $h$. We take the French part of the XNLI corpus to form the development and test sets for the NLI task in FLUE. The train set is obtained from the machine translated version to French provided in XNLI. Following ?), we report the test accuracy.

4. Parsing and Part-of-Speech Tagging

Syntactic parsing consists in assigning a tree structure to a sentence in natural language. We perform parsing on the French Treebank [Abeillé et al., 2003], a collection of sentences extracted from French daily newspaper Le Monde, and manually annotated with both constituency and dependency syntactic trees and part-of-speech tags. Specifically, we use the version of the corpus instantiated for the SPMRL 2013 shared task and described by ?). This version is provided with a standard split representing 14 759 sentences for the training corpus, and respectively 1 235 and 2 541 sentences for the development and evaluation sets.

5. Word Sense Disambiguation Tasks

Word Sense Disambiguation (WSD) is a classification task which aims to predict the sense of words in a given context according to a specific sense inventory. We used two French WSD tasks: the FrenchSemEval task [Segonne et al., 2019], which targets verbs only, and a modified version of the French part of the Multilingual WSD task of SemEval 2013 [Navigli et al., 2013], which targets nouns.

We ran sense disambiguation experiments focused on French verbs using FrenchSemEval [Segonne et al., 2019, FSE], an evaluation dataset in which verb occurrences were manually sense annotated with the sense inventory of Wiktionary, a collaboratively edited open-source dictionary. FSE includes both the evaluation data and the sense inventory. The evaluation data consists of 3 199 manual annotations among a selection of 66 verbs, which makes roughly 50 sense annotated occurrences per verb. The sense inventory provided in FSE is a Wiktionary dump (04-20-2018) openly available via Dbnary [Sérasset, 2012]. For a given sense of a target key, the sense inventory offers a definition along with one or more examples. For this task, we considered the examples of the sense inventory as training examples and tested our model on the evaluation dataset.

Noun Sense Disambiguation

We propose a new challenging task for the WSD of French, based on the French part of the Multilingual WSD task of SemEval 2013 [Navigli et al., 2013], which targets nouns only. We adapted the task to use the WordNet 3.0 sense inventory [Miller, 1995] instead of BabelNet [Navigli and Ponzetto, 2010], by converting the sense keys to WordNet 3.0 if a mapping exists in BabelNet, and removing them otherwise.

The result of the conversion process is an evaluation corpus composed of 306 sentences and 1 445 French nouns annotated with WordNet sense keys, and manually verified.

For the training data, we followed the method proposed by ?), and translated the SemCor [Miller et al., 1993] and the WordNet Gloss Corpus (the set of WordNet glosses, semi-automatically sense annotated, which has been released as part of WordNet since version 3.0) into French, using the best English-French Machine Translation system of the fairseq toolkit (https://github.com/pytorch/fairseq) [Ott et al., 2019]. Finally, we aligned the WordNet sense annotation from the source English words to the translated French words, using the alignment provided by the MT system.

We rely on WordNet sense keys instead of the original BabelNet annotations for the following two reasons. First, WordNet is a resource that is entirely manually verified, and widely used in WSD research [Navigli, 2009]. Second, there is already a large quantity of sense annotated data based on the sense inventory of WordNet [Vial et al., 2018] that we can use for the training of our system.

We publicly release both our training data and the evaluation data in the UFSAC format [Vial et al., 2018] (https://zenodo.org/record/3549806).

Experiments and Results

In this section, we present FlauBERT fine-tuning results on the FLUE benchmark. We compare the performance of FlauBERT with Multilingual BERT [Devlin et al., 2019, mBERT] and CamemBERT [Martin et al., 2019] on all tasks. In addition, for each task we also include the best non-BERT model for comparison. We made use of the open source libraries [Lample and Conneau, 2019, XLM] and [Wolf et al., 2019, Transformers] in some of the experiments.

We followed the standard fine-tuning process of BERT [Devlin et al., 2019]. The input is a degenerate text-$\varnothing$ pair. The classification head is composed of the following layers, in order: dropout, linear, $\tanh$ activation, dropout, and linear. The output dimensions of the linear layers are respectively equal to the hidden size of the Transformer and the number of classes (which is 2 in this case as the task is binary classification). The dropout rate was set to 0.1.
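The layer order of the classification head can be written out explicitly. The sketch below is a dependency-free illustration operating on a single input vector; the parameter names (`w1`, `b1`, `w2`, `b2`) and the use of inverted dropout are assumptions of the example, and a real implementation would use a deep learning framework.

```python
import math
import random


def linear(x, weight, bias):
    """Compute weight @ x + bias for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def dropout(x, rate, training, rng):
    """Inverted dropout: zero entries with probability `rate` and rescale the rest."""
    if not training:
        return list(x)
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in x]


def classification_head(x, params, rate=0.1, training=False, rng=None):
    """dropout -> linear (hidden size) -> tanh -> dropout -> linear (number of classes)."""
    rng = rng or random.Random()
    h = dropout(x, rate, training, rng)
    h = [math.tanh(v) for v in linear(h, params["w1"], params["b1"])]
    h = dropout(h, rate, training, rng)
    return linear(h, params["w2"], params["b2"])  # one logit per class
```

Here `x` stands for the Transformer's output representation of the input sequence, and the returned list holds one logit per class (two for this binary task).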

We trained for 30 epochs using a batch size of 16 while performing a grid search over 4 different learning rates: $1 \times 10^{-5}$, $5 \times 10^{-5}$, $1 \times 10^{-6}$, and $5 \times 10^{-6}$. A random split of 20% of the training data was used as validation set, and the best performing model on this set was then chosen for evaluation on the test set.
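The model selection procedure can be sketched as a small grid search over the four learning rates with a held-out 20% validation split. In this sketch, `train_and_evaluate` is a hypothetical callable standing in for a full fine-tuning run that returns validation accuracy; the fixed seed is likewise an assumption.

```python
import random

LEARNING_RATES = [1e-5, 5e-5, 1e-6, 5e-6]


def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Shuffle the examples and split them into (train, validation) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def grid_search(examples, train_and_evaluate, learning_rates=LEARNING_RATES):
    """Return (best_lr, best_score) according to validation accuracy.

    train_and_evaluate(train, validation, lr) is a hypothetical callable that
    fine-tunes a model and returns its accuracy on the validation split.
    """
    train, validation = split_train_validation(examples)
    scores = {lr: train_and_evaluate(train, validation, lr) for lr in learning_rates}
    best_lr = max(scores, key=scores.get)
    return best_lr, scores[best_lr]
```

Only the configuration selected on the validation split is then evaluated on the test set, so the test data plays no role in choosing the learning rate.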

Results

Table 3 presents the final accuracy on the test set for each model. The results highlight the importance of a monolingual French model for text classification: both CamemBERT and FlauBERT outperform mBERT by a large margin. $\text{FlauBERT}_{\textsc{base}}$ performs moderately better than CamemBERT on the books dataset, while its results on the two remaining datasets of DVD and music are lower than those of CamemBERT. $\text{FlauBERT}_{\textsc{large}}$ achieves the best results in all categories.

2. Paraphrasing

The setup for this task is almost identical to the previous one, except that: (1) the input sequence is now a pair of sentences A,B; and (2) the hyper-parameter search is performed on the development data set (i.e. no validation split is needed).

Results

The final accuracy for each model is reported in Table 4. One can observe that the monolingual French models perform only slightly better than the multilingual model mBERT, which could be attributed to the characteristics of the PAWS-X dataset. Containing samples with high lexical overlap ratio, this dataset has been proved to be an effective measure of model sensitivity to word order and syntactic structure [Yang et al., 2019a]. A multilingual model such as mBERT, therefore, could capture these features as well as a monolingual model.

3. Natural Language Inference

As this task was also considered in [Martin et al., 2019, CamemBERT], we replicate the same experimental setup here for a fair comparison. Similar to paraphrasing, the model input of this task is also a pair of sentences. The classification head, however, consists of only one dropout layer followed by one linear layer.

Results

We report the final accuracy for each model in Table 5. The results confirm the superiority of the French models compared to the multilingual model mBERT on this task. $\text{FlauBERT}_{\textsc{large}}$ performs moderately better than CamemBERT. Both of them clearly outperform $\text{XLM-R}_{\textsc{base}}$, but cannot surpass $\text{XLM-R}_{\textsc{large}}$.

4. Constituency Parsing and POS Tagging

We use the parser described by ?) and ?). It is an openly available (https://github.com/nikitakit/self-attentive-parser) chart parser based on a self-attentive encoder. We compare (i) a model without any pre-trained parameters, (ii) a model that additionally uses and fine-tunes fastText (https://fasttext.cc/) pre-trained embeddings, (iii) models based on pre-trained language models: mBERT, CamemBERT, and FlauBERT. We use the default hyperparameters from ?) for the first two settings and the hyperparameters from ?) when using pre-trained language models, except for $\text{FlauBERT}_{\textsc{large}}$. For this last model, we use a different learning rate (0.00001) and batch size (8), and ignore training sentences longer than 100 tokens, due to memory limitations. We jointly perform part-of-speech (POS) tagging based on the same input as the parser, in a multitask setting. For each setting we perform training 3 times with different random seeds and select the best model according to development F-score.

For final evaluation, we use the evaluation tool provided by the SPMRL shared task organizers (http://pauillac.inria.fr/~seddah/evalb_spmrl2013.tar.gz) and report labelled F-score, the standard metric for constituency parsing evaluation, as well as POS tagging accuracy.

Results

We report constituency parsing results in Table 6. Without pre-training, we replicate the result from ?). FastText pre-trained embeddings do not bring improvement over this already strong model. When using pre-trained language models, we observe that CamemBERT, with its language-specific training, improves over mBERT by 0.9 absolute F1. $\text{FlauBERT}_{\textsc{base}}$ outperforms CamemBERT by 0.7 absolute F1 on the test set and obtains the best published results on the task for a single model. Regarding POS tagging, all large-scale pre-trained language models obtain similar results (98.1-98.2), and outperform models without pre-training or with fastText embeddings (97.5-97.7). $\text{FlauBERT}_{\textsc{large}}$ provides a marginal improvement on the development set, and fails to reach $\text{FlauBERT}_{\textsc{base}}$ results on the test set.

In order to assess whether FlauBERT and CamemBERT are complementary for this task, we evaluate an ensemble of both models (last line in Table 6). The ensemble model improves by 0.4 absolute F1 over FlauBERT on the development set and 0.2 on the test set, obtaining the highest result for the task. This result suggests that both pre-trained language models are complementary and have their own strengths and weaknesses.

5. Dependency parsing

We use our own reimplementation of the parsing model of ?) with maximum spanning tree decoding adapted to handle several input sources such as BERT representations. The model does not perform part of speech tagging but uses the predicted tags provided by the SPMRL shared task organizers.

Our word representations are a concatenation of word embeddings and tag embeddings learned together with the model parameters on the French Treebank data itself, and at most one of (fastText, CamemBERT, $\text{FlauBERT}_{\textsc{base}}$, $\text{FlauBERT}_{\textsc{large}}$, mBERT) word vector. As ?), we use word and tag dropout ($d=0.5$) on word and tag embeddings but without dropout on BERT representations. We performed a fairly comprehensive grid search on hyperparameters for each model tested.

Results

The results are reported in Table 7. The best published results in this shared task [Constant et al., 2013] involved an ensemble of parsers with additional resources for modelling multi-word expressions (MWE), typical of the French Treebank annotations. The monolingual French BERT models (CamemBERT, FlauBERT) perform better and set the new state of the art on this dataset with a single parser and without specific modelling for MWEs. One can observe that both FlauBERT models perform marginally better than CamemBERT, while all of them outperform mBERT by a large margin.

6. Word Sense Disambiguation

Disambiguation was performed with the same WSD supervised method used by ?). First, we compute sense vector representations from examples found in the Wiktionary sense inventory: given a sense $s$ and its corresponding examples, we compute the vector representation of $s$ by averaging the vector representations of its examples. Then, we tag each test instance with the sense whose representation is the closest based on cosine similarity. We used the contextual embeddings output by FlauBERT as vector representations for any given instance (from the sense inventory or the test data) of a target word. We proceeded the same way with mBERT and CamemBERT for comparison. We also compared our model with a simpler context vector representation called averaged word embeddings (AWE), which consists in representing the context of a target word by averaging its surrounding words in a given window size. We experimented with AWE using fastText word embeddings with a window of size 5. We report results in Table 8. BERT-based models set the new state of the art on this task, with the best results achieved by CamemBERT and $\text{FlauBERT}_{\textsc{large}}$.
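The nearest-sense procedure described above can be sketched in a few functions. Here the vectors are plain Python lists standing in for contextual embeddings of target-word instances; how those embeddings are extracted from the language model is outside the scope of this example, and the function names are illustrative.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_sense_vectors(examples_by_sense):
    """Map each sense to the average embedding of its sense-inventory examples."""
    return {sense: mean_vector(vecs) for sense, vecs in examples_by_sense.items()}


def disambiguate(instance_vector, sense_vectors):
    """Return the sense whose vector is closest to the instance by cosine similarity."""
    return max(sense_vectors, key=lambda s: cosine(instance_vector, sense_vectors[s]))
```

Since no parameters are trained beyond the pre-trained encoder, differences between models on this task mainly reflect the quality of their contextual representations.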

Noun Sense Disambiguation

We implemented a neural classifier similar to the classifier presented by ?). This classifier forwards the output of a pre-trained language model to a stack of 6 trained Transformer encoder layers and predicts the synset of every input word through softmax. The only difference between our model and ?) is that we chose the same hyper-parameters as $\text{FlauBERT}_{\textsc{base}}$ for the $d_{ff}$ and the number of attention heads of the Transformer layers (more precisely, $d_{ff}=3072$ and $A=12$).

At prediction time, we take the synset ID which has the maximum value along the softmax layer (no filter on the lemma of the target is performed). We trained 8 models for every experiment, and we report the mean results, the standard deviation of the individual models, and also the result of an ensemble of models, which averages the output of the softmax layer. Finally, we compared FlauBERT with CamemBERT, mBERT, fastText and with no input embeddings. We report the results in Table 9. On this task and with these settings, we first observe an advantage for mBERT over both CamemBERT and $\text{FlauBERT}_{\textsc{base}}$. We think that it might be due to the fact that the training corpora we used are machine translated from English to French, so the multilingual nature of mBERT probably makes it better suited for the task. Comparing CamemBERT to $\text{FlauBERT}_{\textsc{base}}$, we see a small improvement in the former model, and we think that this might be due to the difference in the sizes of pre-training corpora. Finally, with our $\text{FlauBERT}_{\textsc{large}}$ model, we obtain the best scores on the task, achieving more than 1 point above mBERT.
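The reporting protocol (mean and standard deviation over individual models, plus an ensemble that averages softmax outputs) can be illustrated as follows. The function names are chosen for the example, and the use of the sample standard deviation is an assumption, as the text does not say which estimator was used.

```python
import statistics


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def ensemble_predict(distributions):
    """Average the softmax outputs of several models and return the winning class index."""
    averaged = [sum(col) / len(distributions) for col in zip(*distributions)]
    return argmax(averaged)


def summarize(scores):
    """Mean and sample standard deviation of per-model scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

Averaging probabilities is not the same as majority voting: for the distributions `[0.6, 0.4]`, `[0.1, 0.9]` and `[0.55, 0.45]`, two of the three models prefer class 0, yet the averaged distribution favours class 1 because the second model is much more confident.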

Conclusion

We present and release FlauBERT, a pre-trained language model for French. FlauBERT was trained on a multiple-source corpus and achieved state-of-the-art results on a number of French NLP tasks, surpassing multi-lingual/cross-lingual models. FlauBERT is competitive with CamemBERT [Martin et al., 2019] – another pre-trained language model for French – despite being trained on roughly half as much text data. To make the pipeline entirely reproducible, we not only release the preprocessing and training scripts together with FlauBERT, but also provide a general benchmark for evaluating French NLP systems (FLUE). FlauBERT is also now supported by Hugging Face’s transformers library (https://huggingface.co/transformers/).

Acknowledgements

This work benefited from the ‘Grand Challenge Jean Zay’ program and was also partially supported by MIAI@Grenoble-Alpes (ANR-19-P3IA-0003).

We thank Guillaume Lample and Alexis Conneau for their active technical support on using the XLM code.

A Appendix

Table 10 presents the statistics of all sub-corpora in our training corpus. We give the description of each sub-corpus below.

We used four corpora provided in the WMT19 shared task [Li et al., 2019] (http://www.statmt.org/wmt19/translation-task.html).

Common Crawl includes text crawled from billions of pages on the internet.

News Crawl contains crawled news collected from 2007 to 2018.

EuroParl consists of text extracted from the proceedings of the European Parliament.

News Commentary consists of text from the news-commentary crawl.

Datasets from OPUS

OPUS (http://opus.nlpl.eu) is a growing resource of freely accessible monolingual and parallel corpora [Tiedemann, 2012]. We collected the following French monolingual datasets from OPUS.

OpenSubtitles comprises translated movie and TV subtitles.

EU Bookshop includes publications from the European institutions.

MultiUN comprises documents from the United Nations.

GIGA consists of newswire text and was made available in the WMT10 shared task (https://www.statmt.org/wmt10/).

DGT contains translation memories provided by the Joint Research Center.

Global Voices encompasses news stories from the website Global Voices.

TED Talks includes subtitles from TED talk videos (https://www.ted.com).

Euconst consists of text from the European constitution.

Wikimedia database

This includes Wikipedia, Wiktionary, Wikiversity, etc. The content is built collaboratively by volunteers around the world (https://dumps.wikimedia.org/other/cirrussearch/current/).

Wikipedia is a free online encyclopedia including high-quality text covering a wide range of topics.

Wikisource includes source texts in the public domain.

Wiktionary is an open-source dictionary of words, phrases, etc.

Wikiversity comprises learning resources, learning projects, and research.

Wikiquote consists of sourced quotations from notable people and creative works.

Wikivoyage includes information about travelling.

Project Gutenberg

This popular dataset contains free ebooks of different genres which are mostly the world’s older classic works of literature for which copyright has expired.

EnronSent

This dataset is provided by [Styler, 2011] and is a part of the Enron Email Dataset (https://www.cs.cmu.edu/~enron/), a massive dataset containing 500K messages from senior management executives at the Enron Corporation.

PCT

This sub-corpus contains patent documents collected and maintained internally by the GETALP team (http://lig-getalp.imag.fr/en/home/).

Le Monde

This sub-corpus is also collected and maintained internally by the GETALP team and consists of articles from Le Monde (https://www.lemonde.fr) from 1987 to 2003.

Bibliographical References
