MASS: Masked Sequence to Sequence Pre-training for Language Generation

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu

Introduction

Pre-training and fine-tuning are widely used when target tasks are low- or zero-resource in terms of training data, while pre-training has plenty of data (Girshick et al., 2014; Szegedy et al., 2015; Ouyang et al., 2015; Dai & Le, 2015; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). For example, in computer vision, models are usually pre-trained on the large-scale ImageNet dataset and then fine-tuned on downstream tasks like object detection (Szegedy et al., 2015; Ouyang et al., 2015) or image segmentation (Girshick et al., 2014). Recently, pre-training methods such as ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018) and BERT (Devlin et al., 2018) have attracted a lot of attention in natural language processing, and achieved state-of-the-art accuracy in multiple language understanding tasks such as sentiment classification (Socher et al., 2013), natural language inference (Bowman et al., 2015), named entity recognition (Tjong Kim Sang & De Meulder, 2003) and SQuAD question answering (Rajpurkar et al., 2016), which usually have limited supervised data. Among the pre-training methods mentioned above, BERT is the most prominent: it pre-trains bidirectional encoder representations on a large monolingual corpus through masked language modeling and next sentence prediction.

Different from language understanding, language generation aims to generate natural language sentences conditioned on some inputs, including tasks like neural machine translation (NMT) (Cho et al., 2014; Bahdanau et al., 2015a; Vaswani et al., 2017), text summarization (Ayana et al., 2016; Suzuki & Nagata, 2017; Gehring et al., 2017) and conversational response generation (Shang et al., 2015; Vinyals & Le, 2015). Language generation tasks are usually data-hungry, and many of them are low-resource or even zero-resource in terms of training data. Directly applying a BERT-like pre-training method to these natural language generation tasks is not feasible, since BERT is designed for language understanding tasks, which are usually handled by just one encoder or decoder. Therefore, how to design pre-training methods for language generation tasks (which usually adopt the encoder-decoder based sequence to sequence learning framework) is of great potential and importance.

In this paper, inspired by BERT, we propose a novel objective for pre-training: MAsked Sequence to Sequence learning (MASS) for language generation. MASS is based on the sequence to sequence learning framework: its encoder takes a sentence with a masked fragment (several consecutive tokens) as input, and its decoder predicts this masked fragment conditioned on the encoder representations. Unlike BERT or a language model that pre-trains only the encoder or decoder, MASS is carefully designed to pre-train the encoder and decoder jointly in two steps: 1) By predicting the fragment of the sentence that is masked on the encoder side, MASS can force the encoder to understand the meaning of the unmasked tokens, in order to predict the masked tokens on the decoder side; 2) By masking the input tokens of the decoder that are unmasked on the source side, MASS can force the decoder to rely more on the source representation rather than on the previous tokens on the target side for next token prediction, better facilitating the joint training between encoder and decoder.

MASS just needs to pre-train one model and then fine-tune on a variety of downstream tasks. We use Transformer as the basic sequence to sequence model and pre-train on the WMT monolingual corpus (the monolingual data for each language is downloaded from http://www.statmt.org/wmt16/translation-task.html), and then fine-tune on three different language generation tasks including NMT, text summarization and conversational response generation. Considering that the downstream tasks cover cross-lingual tasks like NMT, we pre-train one model on multiple languages. We explore the low-resource setting for all three tasks, and also consider unsupervised NMT, which is a purely zero-resource setting. For NMT, the experiments are conducted on WMT14 English-French, WMT16 English-German and WMT16 English-Romanian datasets. For unsupervised NMT, we directly fine-tune the pre-trained model on monolingual data with back-translation loss (Lample et al., 2018), instead of using additional denoising auto-encoder loss as in Lample et al. (2018). For low-resource NMT, we fine-tune our model on limited bilingual data. For the other two tasks, we conduct experiments on: 1) the Gigaword corpus for abstractive text summarization; 2) the Cornell Movie Dialog corpus for conversational response generation. Our method achieves improvements on all these tasks as well as in both the zero- and low-resource settings, demonstrating that our method is effective and applicable to a wide range of sequence generation tasks.

The contributions of this work are listed as follows: 1) We propose MASS, a masked sequence to sequence pre-training method for language generation; 2) We apply MASS on a variety of language generation tasks including NMT, text summarization and conversational response generation, and achieve significant improvements, demonstrating the effectiveness of our proposed method. Specifically, we achieve state-of-the-art BLEU scores for unsupervised NMT on two language pairs, English-French and English-German, and outperform the previous unsupervised NMT method (Lample & Conneau, 2019) by more than 4 points on English-French and 1 point on French-English in terms of BLEU score, even beating the early attention-based supervised model (Bahdanau et al., 2015b).

Related Work

There is a large body of work on sequence to sequence learning and on pre-training for natural language processing. We briefly review several popular approaches in this section.

Sequence to sequence learning (Cho et al., 2014; Bahdanau et al., 2015a; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017) is a challenging task in artificial intelligence, and covers a variety of language generation applications such as NMT (Cho et al., 2014; Bahdanau et al., 2015a; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017; Tan et al., 2019; Artetxe et al., 2017; Lample et al., 2017, 2018; He et al., 2018; Hassan et al., 2018; Song et al., 2018; Shen et al., 2018), text summarization (Ayana et al., 2016; Suzuki & Nagata, 2017; Gehring et al., 2017), question answering (Yuan et al., 2017; Fedus et al., 2018) and conversational response generation (Shang et al., 2015; Vinyals & Le, 2015).

Sequence to sequence learning has attracted much attention in recent years due to the advance of deep learning. However, many language generation tasks such as NMT lack paired data but have plenty of unpaired data. Therefore, pre-training on unpaired data and fine-tuning with small-scale paired data will be helpful for these tasks, which is exactly the focus of this work.

Pre-training for NLP tasks

Pre-training has been widely used in NLP tasks to learn better language representations. Previous works mostly focus on natural language understanding tasks, and can be classified into feature-based approaches and fine-tuning approaches. Feature-based approaches mainly leverage pre-training to provide language representations and features to the downstream tasks, which include word-level representations (Brown et al., 1992; Ando & Zhang, 2005; Blitzer et al., 2006; Collobert & Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) and sentence-level representations (Kiros et al., 2015; Logeswaran & Lee, 2018; Le & Mikolov, 2014), as well as context-sensitive features from the NMT model (McCann et al., 2017) and ELMo (Peters et al., 2018). Fine-tuning approaches mainly pre-train a model on a language modeling objective and then fine-tune the model on the downstream tasks with supervised data (Dai & Le, 2015; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). Specifically, Devlin et al. (2018) proposed BERT based on masked language modeling and next sentence prediction and achieved state-of-the-art accuracy on multiple language understanding tasks in the GLUE benchmark (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016).

There are also some works pre-training the encoder-decoder model for language generation. Dai & Le (2015) and Ramachandran et al. (2016) leverage a language model or auto-encoder to pre-train the encoder and decoder. Their improvements, although observed, are limited and not as general and significant as the pre-training methods (e.g., BERT) for language understanding. Zhang & Zong (2016) designed a sentence reordering task for pre-training, but only for the encoder part of the encoder-decoder model. Zoph et al. (2016) and Firat et al. (2016) pre-train the model on similar rich-resource language pairs and fine-tune it on the target language pair, which relies on supervised data on other language pairs. Recently, XLM (Lample & Conneau, 2019) pre-trained BERT-like models both for the encoder and decoder, and achieved the previous state-of-the-art results on unsupervised machine translation. However, the encoder and decoder in XLM are pre-trained separately and the encoder-decoder attention mechanism cannot be pre-trained, which is sub-optimal for sequence to sequence based language generation tasks.

Different from previous works, our proposed MASS is carefully designed to pre-train both the encoder and decoder jointly using only unlabeled data, and can be applied to most language generation tasks.

MASS

In this section, we first introduce the basic framework of sequence to sequence learning, and then propose MASS (MAsked Sequence to Sequence pre-training). We then discuss the differences between MASS and previous pre-training methods including the masked language modeling in BERT and standard language modeling.

We denote $(x,y)\in(\mathcal{X},\mathcal{Y})$ as a sentence pair, where $x=(x_{1},x_{2},...,x_{m})$ is the source sentence with $m$ tokens, and $y=(y_{1},y_{2},...,y_{n})$ is the target sentence with $n$ tokens, and $\mathcal{X}$ and $\mathcal{Y}$ are the source and target domains. A sequence to sequence model learns the parameter $\theta$ to estimate the conditional probability $P(y|x;\theta)$, and usually uses log likelihood as the objective function: $L(\theta;(\mathcal{X},\mathcal{Y}))=\sum_{(x,y)\in(\mathcal{X},\mathcal{Y})}\log P(y|x;\theta)$. The conditional probability $P(y|x;\theta)$ can be further factorized according to the chain rule: $P(y|x;\theta)=\prod_{t=1}^{n}P(y_{t}|y_{<t},x;\theta)$, where $y_{<t}$ denotes the preceding tokens before position $t$.
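As a small illustration of the chain-rule factorization, the sketch below sums per-token log-probabilities to obtain $\log P(y|x;\theta)$. The function name and the per-step probabilities are hypothetical stand-ins for the output of a trained model, not values from our experiments.

```python
import math

def sequence_log_prob(step_probs):
    """Return log P(y|x) as the sum over t of log P(y_t | y_<t, x)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-step probabilities P(y_t | y_<t, x) for a 3-token target.
step_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(step_probs)

# Exponentiating the summed logs recovers the product of the step probabilities.
product = math.exp(log_p)
```

Working in log space avoids numerical underflow when the target sentence is long, which is why the objective is stated as a log likelihood.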

A major approach to sequence to sequence learning is the encoder-decoder framework: the encoder reads the source sequence and generates a set of representations; the decoder estimates the conditional probability of each target token given the source representations and its preceding tokens. An attention mechanism (Bahdanau et al., 2015a) is further introduced between the encoder and decoder to find which source representation to focus on when predicting the current token.

Masked Sequence to Sequence Pre-training

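MASS pre-trains a sequence to sequence model by predicting the sentence fragment $x^{u:v}$ taking the masked sequence $x^{\setminus u:v}$ as input, where $x^{u:v}$ denotes the tokens of an unpaired sentence $x\in\mathcal{X}$ from position $u$ to position $v$ and $x^{\setminus u:v}$ denotes $x$ with that fragment masked. We also use the log likelihood as the objective function: $L(\theta;\mathcal{X})=\frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}}\log P(x^{u:v}|x^{\setminus u:v};\theta)=\frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}}\log\prod_{t=u}^{v}P(x^{u:v}_{t}|x^{u:v}_{<t},x^{\setminus u:v};\theta)$.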

Actually, the masked language modeling in BERT (Devlin et al., 2018) and the standard language modeling (Bengio et al., 2003; Mikolov et al., 2010) in GPT (Radford et al., 2018) can be viewed as special cases of MASS. We have an important hyperparameter $k$ , which denotes the length of the masked fragment of the sentence. Our method with different $k$ values can cover the special cases that are related to previous pre-training methods, as shown in Table 1.

When $k=1$, the masked fragment in the source sentence contains only one token, and the decoder predicts this token without any tokens as input but conditioned on the unmasked source tokens, as shown in Figure 2(a). It becomes the masked language modeling as used in BERT. One may argue that the model structure is a little bit different from the masked language model. However, since all the input tokens of the decoder are masked, the decoder is itself like a non-linear classifier, analogous to the softmax matrix used in BERT. In this case, the conditional probability is $P(x^{u}|x^{\setminus u};\theta)$ and $u$ is the position of the masked token, which is exactly the formulation of masked language modeling used in BERT. (One may argue that the masked language modeling in BERT randomly masks multiple tokens rather than just one token at a time. However, the key idea behind masked language modeling in BERT is to leverage bidirectional context information. Masking multiple tokens at a time is mainly for training speedup.)

When $k=m$ where $m$ is the number of tokens in sentence $x$ , all the tokens on the encoder side are masked and the decoder needs to predict all tokens given previous tokens, as shown in Figure 2(b). The conditional probability is $P(x^{1:m}|x^{\setminus 1:m};\theta)$ , and it becomes the standard language modeling in GPT, conditioned on null information from the encoder as all the tokens in the encoder side are masked.
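To make the masking scheme and its two special cases concrete, the following is a minimal sketch that builds the encoder input, decoder input and prediction target for a fragment $x^{u:v}$. It rests on several assumptions: the function name `mass_example` and the mask symbol `[M]` are illustrative choices, positions are 1-indexed and inclusive, and the decoder input is shown only for the fragment positions (the remaining decoder positions, which are masked, are omitted).

```python
MASK = "[M]"  # placeholder symbol; the exact mask token is an assumption

def mass_example(tokens, u, v, mask=MASK):
    """Build (encoder_input, decoder_input, target) for the fragment u..v.

    Positions are 1-indexed and inclusive, matching x^{u:v} in the text.
    """
    m = len(tokens)
    if not (1 <= u <= v <= m):
        raise ValueError("need 1 <= u <= v <= len(tokens)")
    lo, hi = u - 1, v  # Python slice bounds of the fragment
    target = tokens[lo:hi]
    # The encoder sees the sentence with the fragment replaced by mask symbols.
    encoder_input = tokens[:lo] + [mask] * (hi - lo) + tokens[hi:]
    # The decoder sees only earlier fragment tokens (shifted right by one),
    # so its first step has no real token as input.
    decoder_input = [mask] + target[:-1]
    return encoder_input, decoder_input, target

sentence = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]

general = mass_example(sentence, 3, 6)    # a fragment with k = 4
bert_like = mass_example(sentence, 4, 4)  # k = 1: one masked token
lm_like = mass_example(sentence, 1, 8)    # k = m: whole sentence masked
```

With $k=1$ the decoder input is a single mask symbol, so the prediction depends only on the unmasked source tokens; with $k=m$ the encoder input carries no token information and the decoder predicts each token from its predecessors, as in a standard language model.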

Discussions

MASS is a pre-training method for language generation. While its special cases are related to the previous methods including the standard language modeling in GPT and the masked language modeling in BERT, it is different from these methods in general.

Standard language modeling has long been used for pre-training, and the most prominent ones are the recently proposed ELMo (Peters et al., 2018) and OpenAI GPT (Radford et al., 2018). BERT introduces two pre-training tasks (masked language modeling and next sentence prediction) for natural language understanding, and uses one encoder to extract the representation for a single sentence or a pair of sentences. Both standard language modeling and BERT can just pre-train the encoder or decoder separately. While achieving promising results on language understanding tasks, they are not suitable for language generation tasks which typically leverage an encoder-decoder framework for conditional sequence generation.

Experiments and Results

In this section, we describe the experimental details of MASS pre-training and fine-tuning on a variety of language generation tasks, including NMT, text summarization and conversational response generation.

We choose Transformer (Vaswani et al., 2017) as the basic model structure, which consists of a 6-layer encoder and a 6-layer decoder with 1024 embedding/hidden size and 4096 feed-forward filter size. For the neural machine translation task, we pre-train our model on the monolingual data of the source and target languages. We conduct experiments on three language pairs: English-French, English-German, and English-Romanian. For other language generation tasks, including text summarization and conversational response generation, we pre-train the model with only English monolingual data. To distinguish between the source and target languages in the neural machine translation task, we add a language embedding to each token of the input sentence for the encoder and decoder, which is also learnt end-to-end. We implement our method based on the codebase of XLM (https://github.com/facebookresearch/XLM).

Datasets

We use all of the monolingual data from the WMT News Crawl datasets, which cover 190M, 62M and 270M sentences from 2007 to 2017 for English, French and German respectively. (While we choose the WMT monolingual data in the current setting, pre-training on Wikipedia data is also feasible.) We also include a low-resource language, Romanian, in the pre-training stage, to verify the effectiveness of MASS pre-trained with low-resource monolingual data. We use all of the available Romanian sentences from the News Crawl dataset and augment them with WMT16 data, which results in 2.9M sentences. We remove sentences with length over 175. For each task, we jointly learn 60,000 sub-word units with Byte-Pair Encoding (Sennrich et al., 2016) between source and target languages.
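For readers unfamiliar with Byte-Pair Encoding, the sketch below shows the core merge-learning loop on a toy word list. It is a simplified illustration and not the tooling used in our experiments: the function name `learn_bpe`, the end-of-word marker `</w>` and the toy corpus are assumptions, and in a joint setting the word list would simply be drawn from the source and target languages together.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of words."""
    # Each word is a tuple of symbols; "</w>" marks the end of a word.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair counted first
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Toy corpus; a real run would use the full joint training text.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, 3)
```

On this toy corpus the first merges build up the frequent suffix "est", which illustrates how sub-word units shared across many words emerge from pair frequencies.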

Pre-Training Details

To verify the effectiveness of MASS, we fine-tune the pre-trained model on three language generation tasks: NMT, text summarization and conversational response generation. We explore the low-resource setting on these tasks, where we leverage only a small amount of training data for fine-tuning to simulate the low-resource scenario. For NMT, we mainly investigate the zero-resource (unsupervised) setting, as unsupervised NMT has become a challenging task in recent years (Artetxe et al., 2017; Lample et al., 2017, 2018).

Fine-Tuning on NMT

In this section, we first describe the experiments on the unsupervised NMT, and then introduce the experiments on low-resource NMT.

For unsupervised NMT, there is no bilingual data to fine-tune the pre-trained model. Therefore, we leverage the monolingual data that is also used in the pre-training stage. Different from Artetxe et al. (2017); Lample et al. (2017, 2018); Leng et al. (2019), we just use back-translation to generate pseudo bilingual data for training, without using a denoising auto-encoder (MASS is better than the denoising auto-encoder, as we will show in Table 3). During fine-tuning, we use the Adam optimizer (Kingma & Ba, 2015) with initial learning rate $10^{-4}$, and the batch size is set as 2000 tokens for each GPU. During evaluation, we calculate the BLEU score with multi-bleu.pl (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) on newstest2014 for English-French, and newstest2016 for English-German and English-Romanian.
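The generation of pseudo bilingual data by back-translation can be sketched as follows. This is only a schematic illustration: `back_translate` and `toy_fr_to_en` are hypothetical names, and the toy translator stands in for decoding with the current model, which is not shown here.

```python
def back_translate(sentences, translate):
    """Pair each real sentence with a synthetic source produced by `translate`.

    `translate` is a hypothetical callable standing in for the current model
    translating from the sentences' language into the other language.
    """
    return [(translate(s), s) for s in sentences]

def toy_fr_to_en(sentence):
    # Toy stand-in: a real system would decode with the NMT model here.
    return "<en> " + sentence

french = ["bonjour le monde", "merci beaucoup"]

# Each pair is (synthetic English source, real French target) and can be
# used as training data for the English-to-French direction.
pseudo_pairs = back_translate(french, toy_fr_to_en)
```

The same procedure applied to English monolingual data with a model translating in the opposite direction yields pseudo pairs for the French-to-English direction.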

Results on Unsupervised NMT

Our results are shown in Table 2. On all 6 translation directions, our method outperforms all of the previous results, including the methods without pre-training (Lample et al., 2018) and with pre-training (Lample & Conneau, 2019). XLM (Lample & Conneau, 2019) is the previous state-of-the-art method, which leverages BERT-like pre-training in the encoder and decoder and covers several pre-training methods: masked language model (MLM) and causal language model (CLM). Our method still outperforms XLM by 4.1 BLEU points on en-fr.

Compared with Other Pre-training Methods

We also compare MASS with the previous pre-training methods for language generation tasks. The first baseline is BERT+LM, which uses masked language modeling in BERT to pre-train the encoder and standard language modeling to pre-train the decoder. The second baseline is DAE, which simply uses a denoising auto-encoder (Vincent et al., 2008) to pre-train the encoder and decoder. We pre-train the model with BERT+LM and DAE, and fine-tune on the unsupervised translation pairs with the same fine-tuning strategy as XLM (i.e., DAE loss + back-translation). These methods are also configured with the 6-layer Transformer setting.

As shown in Table 3, BERT+LM achieves a higher BLEU score than DAE, and MASS outperforms both BERT+LM and DAE on all the unsupervised translation pairs. While DAE usually leverages some denoising methods like randomly masking tokens or swapping adjacent tokens, the decoder can still easily learn to copy the unmasked tokens through encoder-decoder attention. (The popular encoder-decoder based model structures (Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017) all adopt residual connections (He et al., 2016). Therefore, the token generation in the top layer of the decoder side can directly depend on the token embedding in the encoder side through residual connection and attention.) On the other hand, the decoder in DAE takes the full sentence as the input, which is enough to predict the next token like the language model, and is not forced to extract additional useful representation from the encoder.

Experiments on Low-Resource NMT

In the low-resource NMT setting, we sample 10K, 100K and 1M paired sentences from the bilingual training data of WMT14 English-French, WMT16 English-German and WMT16 English-Romanian, to explore the performance of our method in different low-resource scenarios. We use the same BPE codes learned in the pre-training stage to tokenize the training sentence pairs. We fine-tune the pre-trained model on the paired data for 20,000 steps with the Adam optimizer, and the learning rate is set as $10^{-4}$. We choose the best model according to the accuracy on the development set. We report the BLEU scores on the same test sets used in the unsupervised setting. As shown in Figure 3, MASS outperforms the baseline models that are trained only on the bilingual data without any pre-training on all six translation directions, demonstrating the effectiveness of our method in the low-resource scenarios.
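Simulating a low-resource scenario by subsampling can be sketched as below. The helper name `sample_pairs`, the fixed seed and the toy corpus are assumptions for illustration; uniform sampling without replacement is also an assumption, since the exact sampling procedure is not specified above.

```python
import random

def sample_pairs(pairs, size, seed=0):
    """Draw `size` sentence pairs uniformly without replacement."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(pairs, size)

# Toy stand-in for a bilingual corpus of (source, target) sentence pairs.
corpus = [(f"src {i}", f"tgt {i}") for i in range(1000)]

# Scaled-down analogue of drawing subsets of several sizes from the same data.
subsets = {size: sample_pairs(corpus, size) for size in (10, 100)}
```

Keeping each source sentence attached to its target inside one tuple ensures that sampling never breaks the alignment between the two sides.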

Fine-Tuning on Text Summarization

Text summarization is the task of creating a short and fluent summary of a long text document, which is a typical sequence generation task. We fine-tune the pre-trained model on the text summarization task with different scales (10K, 100K, 1M and 3.8M) of training data from the Gigaword corpus (Graff et al., 2003) (https://github.com/harvardnlp/sent-summary), which consists of a total of 3.8M article-title pairs in English. We take the article as the encoder input and the title as the decoder input for fine-tuning. We report the F1 score of ROUGE-1, ROUGE-2 and ROUGE-L on the Gigaword test set during evaluation. We use beam search with a beam size of 5 for inference.
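As a reminder of what the ROUGE-N F1 score measures, the sketch below computes it from clipped n-gram overlap between a candidate and a reference. It is a simplified illustration rather than the official ROUGE toolkit (it omits stemming, multiple references and ROUGE-L), and the function name `rouge_n_f1` and the example sentences are assumptions.

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """F1 of clipped n-gram overlap between two token lists."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Multiset intersection clips each n-gram count to the smaller of the two.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

candidate = "the cat sat".split()
reference = "the cat sat on the mat".split()

rouge1 = rouge_n_f1(candidate, reference, n=1)
rouge2 = rouge_n_f1(candidate, reference, n=2)
```

Here every candidate unigram appears in the reference (precision 1) but only half of the reference unigrams are covered (recall 0.5), so the F1 score balances the two.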

Results

Our results are illustrated in Figure 4. We compare MASS with the model that is trained only on the paired data without any pre-training. MASS consistently outperforms the baseline on different scales of fine-tuning data (more than 10 ROUGE points gain on 10K data and 5 ROUGE points gain on 100K data), which demonstrates that MASS is effective in low-resource scenarios with different scales of training data on this task.

Compared with Other Pre-Training Methods

We further compare MASS with the pre-training methods of BERT+LM and DAE described in Section 4.2, with 3.8M data on the text summarization task. As shown in Table 4, MASS consistently outperforms the two pre-training methods on the three ROUGE scores.

Fine-Tuning on Conversational Response Generation

Conversational response generation generates a flexible response for the conversation (Shang et al., 2015; Vinyals & Le, 2015). We conduct experiments on the Cornell movie dialog corpus (Danescu-Niculescu-Mizil & Lee, 2011) (https://github.com/suriyadeepan/datasets/tree/master/seq2seq/cornell_movie_corpus), which contains 140K conversation pairs. We randomly sample 10K/20K pairs as the validation/test set and the remaining data is used for training. We adopt the same optimization hyperparameters from the pre-training stage for fine-tuning. We report the results with perplexity (PPL) following Vinyals & Le (2015).
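Perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from per-token natural-log probabilities; the function name `perplexity` and the example values are hypothetical and only illustrate the formula.

```python
import math

def perplexity(token_log_probs):
    """PPL = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical case: the model assigns probability 1/4 to each of 5 tokens.
log_probs = [math.log(0.25)] * 5
ppl = perplexity(log_probs)
```

A model that spreads its probability uniformly over four choices at every step has a perplexity of 4, so lower PPL indicates that the model is less uncertain about the reference response.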

Results

We compare MASS with the baseline that is trained on the available data pairs. We conduct experiments on the 10K pairs (randomly chosen) and the whole 110K pairs, and show the results in Table 5. MASS achieves lower PPL than the baseline on both the 10K and 110K data.

Compared with Other Pre-Training Methods

We also compare MASS with the pre-training methods of BERT+LM and DAE on conversational response generation. As shown in Table 5, MASS consistently outperforms the two pre-training methods with lower PPL on 10K and 110K training data respectively.

Analysis of MASS

The length of the masked fragment $k$ is an important hyperparameter of MASS and we have varied $k$ in Section 3.2 to cover the special cases of masked language modeling in BERT and standard language modeling. In this section, we study the performance of MASS with different $k$, where we choose $k$ from 10% to 90% of the sentence length $m$ with a step size of 10%, plus $k=1$ and $k=m$.
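The grid of fragment lengths studied here can be enumerated as in the sketch below. The function name `candidate_fragment_lengths` is hypothetical, and rounding each percentage of $m$ to the nearest integer (with a minimum of one token) is an assumption, since the rounding rule is not stated above.

```python
def candidate_fragment_lengths(m):
    """Candidate k values: 10%..90% of m in 10% steps, plus k=1 and k=m."""
    ks = {1, m}
    for pct in range(10, 100, 10):
        # Integer arithmetic that rounds half up; the rounding rule is an assumption.
        k = (m * pct + 50) // 100
        ks.add(max(1, k))
    return sorted(ks)

# For a 20-token sentence the grid is 1, then 2..18 in steps of 2, then 20.
grid = candidate_fragment_lengths(20)
```

For short sentences several percentages collapse to the same value of $k$, which is why the candidates are collected in a set before sorting.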

We observe both the performance of MASS after pre-training and the performance after fine-tuning on several language generation tasks, including unsupervised English-French translation, text summarization and conversational response generation. We first show the perplexity (PPL) of the pre-trained model on English and French with different $k$. We choose the English and French sentences from newstest2013 of WMT En-Fr as the validation set, and plot the PPL in Figure 5(a) (English) and 5(b) (French). It can be seen that the pre-trained model achieves the best validation PPL when $k$ is between $50\%$ and $70\%$ of the sentence length $m$. We then observe the performance on fine-tuning tasks. We show the curve of the validation BLEU scores on unsupervised En-Fr translation in Figure 5(c), the validation ROUGE scores on text summarization in Figure 5(d), and the validation PPL on conversational response generation in Figure 5(e). It can be seen that MASS achieves the best performance on these downstream tasks when $k$ is nearly $50\%$ of the sentence length $m$. Therefore, we set $k=50\%$ of $m$ for MASS in our experiments.

Actually, $k=50\%$ of $m$ is a good balance between the encoder and decoder. Too few valid tokens on the encoder side or on the decoder side will bias the model to concentrate more on the other side, which is not suitable for language generation tasks that typically leverage the encoder-decoder framework to extract the sentence representation in the encoder, as well as to model and generate the sentence in the decoder. The extreme cases are $k=1$ (masked language modeling in BERT) and $k=m$ (standard language modeling), as illustrated in Figure 2. Neither $k=1$ nor $k=m$ can achieve good performance on the downstream language generation tasks, as shown in Figure 5.

Ablation Study of MASS

Conclusion

In this work, we have proposed MASS: masked sequence to sequence pre-training for language generation tasks, which reconstructs a sentence fragment given the remaining part of the sentence in the encoder-decoder framework. MASS just needs to pre-train one model and then fine-tune on multiple language generation tasks such as neural machine translation, text summarization and conversational response generation. Through experiments on the three tasks above and a total of eight datasets, MASS achieved significant improvements over the baselines without pre-training or with other pre-training methods. More specifically, MASS achieved state-of-the-art BLEU scores for unsupervised NMT on three language pairs, outperforming the previous state of the art by more than 4 BLEU points on English-French.

For future work, we will apply MASS to more language generation tasks such as sentence paraphrasing, text style transfer and post editing, as well as other sequence generation tasks (Ren et al., 2019). We will also investigate more of the theoretical and empirical analysis on our masked sequence to sequence pre-training method.

Acknowledgements

This work was partially supported by the National Key Research and Development Program of China under Grant 2018YFB1004904. We thank Yichong Leng, Weicong Chen, Yi Zhuang, Hao Sun and Yi Ren for the further development on the work of MASS. We also thank the anonymous reviewers for their valuable comments on our paper.
