HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization

Xingxing Zhang, Furu Wei, Ming Zhou

Introduction

Automatic document summarization is the task of rewriting a document into a shorter form while still retaining its important content. Over the years, many paradigms for document summarization have been explored (see Nenkova and McKeown (2011) for an overview). The two most popular among them are extractive approaches and abstractive approaches. As the names imply, extractive approaches generate summaries by extracting parts of the original document (usually sentences), while abstractive methods may generate new words or phrases which are not in the original document.

Extractive summarization is usually modeled as a sentence ranking problem with length constraints (e.g., a maximum number of words or sentences). The top-ranked sentences (under the constraints) are selected as the summary. Early attempts mostly leverage manually engineered features Filatova and Hatzivassiloglou (2004a). Based on these sparse features, sentences are selected using a classifier or a regression model. Later, the feature engineering part of this paradigm was replaced with neural networks. Cheng and Lapata (2016) propose a hierarchical long short-term memory network (LSTM; Hochreiter and Schmidhuber 1997) to encode a document and then use another LSTM to predict binary labels for each sentence in the document. This architecture has been widely adopted recently Nallapati et al. (2017); Narayan et al. (2018); Zhang et al. (2018). Our model also employs a hierarchical document encoder, but we adopt a hierarchical transformer Vaswani et al. (2017) rather than a hierarchical LSTM, because recent studies Vaswani et al. (2017); Devlin et al. (2018) show that the transformer model performs better than the LSTM in many tasks.

Abstractive models did not attract much attention until recently. They are mostly based on sequence to sequence (seq2seq) models Bahdanau et al. (2015), where a document is viewed as a sequence and its summary is viewed as another sequence. Although seq2seq based summarizers can be equipped with a copy mechanism Gu et al. (2016); See et al. (2017), a coverage model See et al. (2017) and reinforcement learning Paulus et al. (2017), there is still no guarantee that the generated summaries are grammatical and convey the same meaning as the original document. In this respect, extractive models appear more reliable than their abstractive counterparts.

However, extractive models require sentence level labels, which are not included in most summarization datasets (most datasets only contain document-summary pairs). Sentence labels are usually obtained by rule-based methods (e.g., maximizing the ROUGE score between a set of sentences and reference summaries) and may not be accurate. Extractive models proposed recently Cheng and Lapata (2016); Nallapati et al. (2017) employ hierarchical document encoders and even have neural decoders, which are complex. Training such complex neural models with inaccurate binary labels is challenging. We observed in our initial experiments on one of our datasets that our extractive model (see Section 3.3 for details) overfits to the training set quickly after the second epoch, which indicates the training set may not be fully utilized. Inspired by the recent pre-training work in natural language processing Peters et al. (2018); Radford et al. (2018); Devlin et al. (2018), our solution to this problem is to first pre-train the “complex” part (i.e., the hierarchical encoder) of the extractive model on unlabeled data and then learn to classify sentences with our model initialized from the pre-trained encoder. In this paper, we propose Hibert, which stands for HIerarchical Bidirectional Encoder Representations from Transformers. We design an unsupervised method to pre-train Hibert for document modeling. We apply the pre-trained Hibert to the task of document summarization and achieve state-of-the-art performance on both the CNN/Dailymail and New York Times datasets.

Related Work

In this section, we introduce work on extractive summarization, abstractive summarization and pre-trained natural language processing models. For a more comprehensive review of summarization, we refer the interested readers to Nenkova and McKeown (2011) and Mani (2001).

Extractive Summarization

Extractive summarization aims to select important sentences (sometimes other textual units such as elementary discourse units (EDUs)) from a document as its summary. It is usually modeled as a sentence ranking problem by using the scores from classifiers Kupiec et al. (1995), sequential labeling models Conroy and O’Leary (2001) as well as integer linear programming Woodsend and Lapata (2010). Early work with these models mostly leverages human engineered features such as sentence position and length Radev et al. (2004), word frequency Nenkova et al. (2006) and event features Filatova and Hatzivassiloglou (2004b).

Following the very successful application of neural networks to a wide range of NLP tasks, the manually engineered features (for document encoding) are replaced with hierarchical LSTMs/CNNs and the sequence labeling (or classification) model is replaced with an LSTM decoder Cheng and Lapata (2016); Nallapati et al. (2017). This architecture is widely adopted in recent neural extractive models and has been extended with reinforcement learning Narayan et al. (2018); Dong et al. (2018), latent variable models Zhang et al. (2018), joint scoring Zhou et al. (2018) and iterative document representation Chen et al. (2018). Recently, transformer networks Vaswani et al. (2017) have achieved good performance in machine translation Vaswani et al. (2017) and a range of NLP tasks Devlin et al. (2018); Radford et al. (2018). Different from the extractive models above, we adopt a hierarchical Transformer for document encoding and also propose a method to pre-train the document encoder.

Abstractive Summarization

Abstractive summarization aims to generate the summary of a document by rewriting it. Most recent abstractive models Nallapati et al. (2016) are based on neural sequence to sequence learning Bahdanau et al. (2015); Sutskever et al. (2014). However, the summaries generated by these models cannot be controlled (i.e., their meanings can be quite different from the original and contents can be repeated). Therefore, the copy mechanism Gu et al. (2016), the coverage model See et al. (2017) and reinforcement learning that optimizes ROUGE Paulus et al. (2017) were introduced. These problems are alleviated but not solved. There is also an interesting line of work combining extractive and abstractive summarization with reinforcement learning Chen and Bansal (2018), fused attention Hsu et al. (2018) and bottom-up attention Gehrmann et al. (2018). Our model, which is a very good extractive model, can be used as the sentence extraction component in these models and may potentially improve their performance.

Pre-trained NLP Models

Most model pre-training methods in NLP leverage the natural ordering of text. For example, word2vec uses the surrounding words within a fixed size window to predict the word in the middle with a log bilinear model. The resulting word embedding table can be used in other downstream tasks. There are other word embedding pre-training methods using similar techniques Pennington et al. (2014); Bojanowski et al. (2017). Peters et al. (2018) and Radford et al. (2018) find that even a sentence encoder (not just word embeddings) can be pre-trained with language model objectives (i.e., predicting the next or previous word). The language model objective is unidirectional, while many tasks can leverage the context in both directions. Therefore, Devlin et al. (2018) propose the naturally bidirectional masked language model objective (i.e., masking several words with a special token in a sentence and then predicting them). All the methods above aim to pre-train word embeddings or sentence encoders, while our method aims to pre-train hierarchical document encoders (i.e., hierarchical transformers), which are important in summarization.

Model

In this section, we present our model Hibert. We first introduce how documents are represented in Hibert. We then describe our method to pre-train Hibert and finally move on to the application of Hibert to summarization.

Let $\mathcal{D}=(S_{1},S_{2},\dots,S_{|\mathcal{D}|})$ denote a document, where $S_{i}=(w_{1}^{i},w_{2}^{i},\dots,w_{|S_{i}|}^{i})$ is a sentence in $\mathcal{D}$ and $w_{j}^{i}$ a word in $S_{i}$ . Note that following common practice in the natural language processing literature, $w_{|S_{i}|}^{i}$ is an artificial EOS (End Of Sentence) token. To obtain the representation of $\mathcal{D}$ , we use two encoders: a sentence encoder to transform each sentence in $\mathcal{D}$ to a vector and a document encoder to learn sentence representations given their surrounding sentences as context. Both the sentence encoder and document encoder are based on the Transformer encoder described in Vaswani et al. (2017). As shown in Figure 1, they are nested in a hierarchical fashion. A transformer encoder usually has multiple layers and each layer is composed of a multi-head self attentive sub-layer followed by a feed-forward sub-layer with residual connections He et al. (2016) and layer normalizations Ba et al. (2016). For more details of the Transformer encoder, we refer the interested readers to Vaswani et al. (2017). To learn the representation of $S_{i}$ , $S_{i}=(w_{1}^{i},w_{2}^{i},\dots,w_{|S_{i}|}^{i})$ is first mapped into continuous space

$$\mathbf{E}_{i}=(\mathbf{e}_{1}^{i},\mathbf{e}_{2}^{i},\dots,\mathbf{e}_{|S_{i}|}^{i}),\qquad\mathbf{e}_{j}^{i}=e(w_{j}^{i})+\mathbf{p}_{j}\tag{1}$$

where $e(w_{j}^{i})$ and $\mathbf{p}_{j}$ are the word and positional embeddings of $w_{j}^{i}$ , respectively. The word embedding matrix is randomly initialized and we adopt the sine-cosine positional embedding Vaswani et al. (2017) (we use the sine-cosine embedding because it works well and does not introduce additional trainable parameters). Then the sentence encoder (a Transformer) transforms $\mathbf{E}_{i}$ into a list of hidden representations $(\mathbf{h}_{1}^{i},\mathbf{h}_{2}^{i},\dots,\mathbf{h}_{|S_{i}|}^{i})$ . We take the last hidden representation $\mathbf{h}_{|S_{i}|}^{i}$ (i.e., the representation at the EOS token) as the representation of sentence $S_{i}$ . Similar to the representation of each word in $S_{i}$ , we also take the sentence position into account. The final representation of $S_{i}$ is

$$\hat{\mathbf{h}}_{i}=\mathbf{h}_{|S_{i}|}^{i}+\mathbf{p}_{i}\tag{2}$$

Note that words and sentences share the same positional embedding matrix.
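As an illustration, the sine-cosine positional embedding and its use for the sentence representation can be sketched in a few lines of plain Python. This is a minimal sketch, not our implementation: the function names `sinusoidal_position` and `sentence_representation` are hypothetical, vectors are plain lists, and the position index is used exactly as passed in (whether positions start at 0 or 1 is an assumption left to the caller). Because the embedding is a fixed function of the position, the same function serves both word positions and sentence positions.

```python
import math

def sinusoidal_position(pos, dim):
    """Sine-cosine positional embedding (Vaswani et al., 2017) for one position."""
    vec = []
    for k in range(dim):
        # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / dim).
        angle = pos / (10000 ** ((k - k % 2) / dim))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec

def sentence_representation(hidden_states, sentence_index):
    """Add the positional embedding of the sentence index to the EOS hidden state.

    `hidden_states` is the list of per-token vectors produced by a sentence
    encoder (hypothetical input here); the last one is the EOS representation.
    """
    eos_state = hidden_states[-1]
    pos = sinusoidal_position(sentence_index, len(eos_state))
    return [h + p for h, p in zip(eos_state, pos)]
```

The embedding has no trainable parameters, so reusing it at the sentence level does not change the size of the model.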

In analogy to the sentence encoder, as shown in Figure 1, the document encoder is yet another Transformer, but it is applied at the sentence level. After running the Transformer on a sequence of sentence representations $(\hat{\mathbf{h}}_{1},\hat{\mathbf{h}}_{2},\dots,\hat{\mathbf{h}}_{|\mathcal{D}|})$ , we obtain the context sensitive sentence representations $(\mathbf{d}_{1},\mathbf{d}_{2},\dots,\mathbf{d}_{|\mathcal{D}|})$ . This completes the encoding of a document with the hierarchical bidirectional transformer encoder Hibert. Note that in previous work, document representations are also learned with hierarchical models, but each level of the hierarchy is a Recurrent Neural Network Nallapati et al. (2017); Zhou et al. (2018) or a Convolutional Neural Network Cheng and Lapata (2016). We choose the Transformer because it outperforms CNNs and RNNs in machine translation Vaswani et al. (2017), semantic role labeling Strubell et al. (2018) and other NLP tasks Devlin et al. (2018). In the next section we will introduce how we train Hibert with an unsupervised training objective.

3.2 Pre-training

Most recent neural encoders used in NLP (e.g., RNNs, CNNs or Transformers) can be pre-trained by predicting a word in a sentence (or a text span) using other words within the same sentence (or span). For example, ELMo Peters et al. (2018) and OpenAI-GPT Radford et al. (2018) predict a word using all words on its left (or right), while word2vec Mikolov et al. (2013) predicts one word with its surrounding words in a fixed window and BERT Devlin et al. (2018) predicts (masked) missing words in a sentence given all the other words.

All the models above learn the representation of a sentence, where the basic units are words. Hibert aims to learn the representation of a document, where the basic units are sentences. Therefore, a natural way of pre-training a document level model (e.g., Hibert) is to predict a sentence (or sentences) instead of a word (or words). We could predict a sentence in a document with all the sentences on its left (or right) as in a (document level) language model. However, in summarization, context in both directions is available. We therefore opt to predict a sentence using all sentences on both its left and right.

Specifically, suppose $\mathcal{D}=(S_{1},S_{2},\dots,S_{|\mathcal{D}|})$ is a document, where $S_{i}=(w_{1}^{i},w_{2}^{i},\dots,w_{|S_{i}|}^{i})$ is a sentence in it. We randomly select 15% of the sentences in $\mathcal{D}$ and mask them. Then, we predict these masked sentences. The prediction task here is similar to the Cloze task Taylor (1953); Devlin et al. (2018), but the missing part is a sentence. However, at test time the input document is not masked. To help our model adapt to documents without masks, we do not always mask the selected sentences. Once a sentence is selected (as one of the 15% selected sentences), we transform it with one of the three methods below. We will use an example to demonstrate the transformation. Suppose we have the following document and the second sentence is selected (multiple sentences might be selected in a document, but in this example there is only one):

William Shakespeare is a poet . He died in 1616 . He is regarded as the greatest writer .

In 80% of the cases, we mask the selected sentence (i.e., we replace each word in the sentence with a mask token [MASK]). The document above becomes William Shakespeare is a poet . [MASK] [MASK] [MASK] [MASK] [MASK] He is regarded as the greatest writer . (where “He died in 1616 . ” is masked).

In 10% of the cases, we keep the selected sentence as it is. This strategy is to simulate the input document during test time (with no masked sentences).

In the remaining 10% of cases, we replace the selected sentence with a random sentence. In this case, the document after transformation is William Shakespeare is a poet . Birds can fly . He is regarded as the greatest writer . The second sentence is replaced with “Birds can fly .” This strategy is intended to add some noise during training and make the model more robust.
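The three transformations above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than our actual preprocessing code: the names `transform_selected` and `mask_document` are hypothetical, sentences are lists of tokens, `random_pool` is an assumed collection of sentences to draw replacements from, and rounding 15% to a whole number of sentences with at least one selected per document is an assumption.

```python
import random

MASK = "[MASK]"

def transform_selected(sentence, random_pool, rng):
    """Apply one of the three transformations to a selected sentence."""
    r = rng.random()
    if r < 0.8:
        # 80% of cases: replace every token with the mask token.
        return [MASK] * len(sentence)
    if r < 0.9:
        # 10% of cases: keep the sentence as it is.
        return list(sentence)
    # Remaining 10%: replace with a random sentence.
    return list(rng.choice(random_pool))

def mask_document(doc, random_pool, rng, rate=0.15):
    """Select about `rate` of the sentences and transform each selected one.

    Returns the transformed document and the sorted indices of the selected
    sentences, whose original text is the prediction target.
    """
    # Assumption: round to a whole number and select at least one sentence.
    n_select = max(1, round(rate * len(doc)))
    selected = sorted(rng.sample(range(len(doc)), n_select))
    out = [list(s) for s in doc]
    for k in selected:
        out[k] = transform_selected(doc[k], random_pool, rng)
    return out, selected
```

For example, `mask_document(doc, pool, random.Random(0))` returns a transformed copy of `doc` together with the indices of the selected sentences; the original sentences at those indices serve as prediction targets regardless of which transformation was applied.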

Sentence Prediction

Then we include the information of $\widetilde{\mathcal{D}}$ by addition:

Note that the transformer decoder can have multiple layers by applying Equation (3) to (5) multiple times and we only show the computation of one layer for simplicity.

The probability of $w_{j}^{k}$ given $w_{0}^{k},\dots,w_{j-1}^{k}$ and $\widetilde{\mathcal{D}}$ is:

Finally the probability of all masked sentences $\mathcal{M}$ given $\widetilde{\mathcal{D}}$ is

The model above can be trained by minimizing the negative log-likelihood of all masked sentences given their paired documents. In theory, we can have an unlimited amount of training data for Hibert, since it can be generated automatically from (unlabeled) documents. Therefore, we can first train Hibert on a large amount of data and then apply it to downstream tasks. In the next section, we will introduce its application to document summarization.
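To make the training objective concrete, the following block writes out the likelihood of the masked sentences and the resulting loss. It is a sketch that assumes the standard autoregressive (chain-rule) factorization over the words of each masked sentence, with $w_{0}^{k}$ read as a start-of-sentence symbol; the loss symbol $\mathcal{L}_{\text{pre-train}}$ is introduced here only for illustration.

```latex
p(\mathcal{M} \mid \widetilde{\mathcal{D}})
  = \prod_{S_k \in \mathcal{M}} \; \prod_{j=1}^{|S_k|}
    p\left(w_j^k \mid w_0^k, \dots, w_{j-1}^k, \widetilde{\mathcal{D}}\right),
\qquad
\mathcal{L}_{\text{pre-train}}
  = -\sum_{(\mathcal{M},\, \widetilde{\mathcal{D}})}
    \log p(\mathcal{M} \mid \widetilde{\mathcal{D}})
```

The sum runs over all pairs of a transformed document and its set of masked sentences in the training data.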

3.3 Extractive Summarization

Extractive summarization selects the most important sentences in a document as its summary. In this section, summarization is modeled as a sequence labeling problem. Specifically, a document is viewed as a sequence of sentences and a summarization model is expected to assign a True or False label to each sentence, where True means this sentence should be included in the summary. In the following, we will introduce the details of our summarization model based on Hibert.

Let $\mathcal{D}=(S_{1},S_{2},\dots,S_{|\mathcal{D}|})$ denote a document and $Y=(y_{1},y_{2},\dots,y_{|\mathcal{D}|})$ its sentence labels (methods for obtaining these labels are in Section 4.1). As shown in Figure 2, we first apply the hierarchical bidirectional transformer encoder Hibert to $\mathcal{D}$ , which yields the context dependent representations for all sentences $(\mathbf{d}_{1},\mathbf{d}_{2},\dots,\mathbf{d}_{|\mathcal{D}|})$ . The probability of the label of $S_{i}$ can be estimated using an additional linear projection and a softmax:

$$p(y_{i}\mid\mathcal{D})=\text{softmax}(\mathbf{W}^{S}\,\mathbf{d}_{i})\tag{8}$$

where $\mathbf{W}^{S}$ denotes the parameter matrix of the linear projection.
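The linear projection followed by a softmax can be illustrated with a short, dependency-free sketch. The function names are hypothetical, the projection has no bias term, and the ordering of the two output classes as (False, True) is an assumption made only for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_probabilities(d_i, weights):
    """Project a sentence vector d_i to two logits and normalize them.

    `weights` has one row per label; row 0 is assumed to score False and
    row 1 to score True.
    """
    logits = [sum(w * x for w, x in zip(row, d_i)) for row in weights]
    return softmax(logits)
```

With all-zero weights the two labels are equally likely; training the projection jointly with the encoder moves probability mass toward the True label for sentences that belong in the summary.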

Experiments

In this section we assess the performance of our model on the document summarization task. We first introduce the datasets we used for pre-training and for the summarization task, and give implementation details of our model. We then compare our model against multiple previous models.

We conducted our summarization experiments on the non-anonymized version of the CNN/Dailymail (CNNDM) dataset Hermann et al. (2015); See et al. (2017), and the New York Times dataset Durrett et al. (2016); Xu and Durrett (2019). For the CNNDM dataset, we preprocessed the dataset using the scripts from the authors of See et al. (2017) (publicly available at https://github.com/abisee/cnn-dailymail ). The resulting dataset contains 287,226 documents with summaries for training, 13,368 for validation and 11,490 for test. Following Xu and Durrett (2019); Durrett et al. (2016), we created the NYT50 dataset by removing the documents whose summaries are shorter than 50 words from the New York Times dataset. We used the same training/validation/test splits as in Xu and Durrett (2019), which contain 137,778 documents for training, 17,222 for validation and 17,223 for test. To create sentence level labels for extractive summarization, we used a strategy similar to Nallapati et al. (2017). We label the subset of sentences in a document that maximizes ROUGE Lin (2004) (against the human summary) as True and all other sentences as False.
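One way to approximate the label creation step is a greedy search, sketched below. Several parts are assumptions made for illustration and are not specified above: the greedy procedure itself, the cap of three selected sentences (`max_sentences`), and the use of a simple unigram-overlap F1 as a stand-in for the ROUGE script. The function names are hypothetical.

```python
from collections import Counter

def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1 F1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

def greedy_oracle_labels(sentences, reference, max_sentences=3):
    """Greedily add the sentence that most improves the score; stop when none does."""
    selected, best = [], 0.0
    while len(selected) < max_sentences:
        scored = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the candidate subset in document order.
            tokens = [t for k in sorted(selected + [i]) for t in sentences[k]]
            scored.append((unigram_f1(tokens, reference), i))
        if not scored:
            break
        # Highest score wins; ties go to the earliest sentence.
        score, idx = max(scored, key=lambda s: (s[0], -s[1]))
        if score <= best:
            break
        selected.append(idx)
        best = score
    return [i in selected for i in range(len(sentences))]
```

A greedy search does not guarantee the globally best subset, which is one reason why labels obtained this way may be inaccurate.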

To pre-train our document model Hibert without supervision (see Section 3.2 for details), we created the GIGA-CM dataset (6,626,842 documents and 2,854 million words in total), which includes 6,339,616 documents sampled from the English Gigaword dataset ( https://catalog.ldc.upenn.edu/LDC2012T21 ) and the training split of the CNNDM dataset. We used the validation set of CNNDM as the validation set of GIGA-CM as well. As in See et al. (2017), documents and summaries in CNNDM, NYT50 and GIGA-CM are all segmented and tokenized using the Stanford CoreNLP toolkit Manning et al. (2014). To reduce the vocabulary size, we applied byte pair encoding (BPE; Sennrich et al. 2016) to all of our datasets. To limit memory consumption during training, we limit the length of each sentence to 50 words (the 51st word onwards is removed) and split documents with more than 30 sentences into smaller documents, each containing at most 30 sentences.
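The length limits can be expressed as a small helper. This is a sketch in which the function name `truncate_and_split` is hypothetical and documents are assumed to be split into consecutive chunks, so that only the last chunk may be shorter than the limit.

```python
def truncate_and_split(doc, max_tokens=50, max_sentences=30):
    """Truncate each sentence to `max_tokens` tokens and split the document
    into consecutive chunks of at most `max_sentences` sentences."""
    truncated = [sentence[:max_tokens] for sentence in doc]
    return [truncated[i:i + max_sentences]
            for i in range(0, len(truncated), max_sentences)]
```

For instance, a document with 65 sentences yields three smaller documents with 30, 30 and 5 sentences.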

4.2 Implementation Details

Our model is trained in three stages: two pre-training stages and one finetuning stage. The first stage is open-domain pre-training, in which we train Hibert with the pre-training objective (Section 3.2) on the GIGA-CM dataset. In the second stage, we perform in-domain pre-training on the CNNDM (or NYT50) dataset, still with the same pre-training objective. In the final stage, we finetune Hibert in the summarization model (Section 3.3) to predict extractive sentence labels on CNNDM (or NYT50).

The sizes of the sentence and document level Transformers as well as the Transformer decoder in Hibert are the same. Let $L$ denote the number of layers in the Transformer, $H$ the hidden size and $A$ the number of attention heads. As in Vaswani et al. (2017); Devlin et al. (2018), the hidden size of the feedforward sublayer is $4H$ . We mainly trained two model sizes: $\text{\sc Hibert}_{S}$ ( $L=6$ , $H=512$ and $A=8$ ) and $\text{\sc Hibert}_{M}$ ( $L=6$ , $H=768$ and $A=12$ ). We trained both $\text{\sc Hibert}_{S}$ and $\text{\sc Hibert}_{M}$ on a single machine with 8 Nvidia Tesla V100 GPUs with a batch size of 256 documents. We optimized our models using Adam with a learning rate of 1e-4, $\beta_{1}=0.9$ , $\beta_{2}=0.999$ , an L2 norm of 0.01, learning rate warmup over 10,000 steps and learning rate decay afterwards using the strategies in Vaswani et al. (2017). The dropout rate in all layers is 0.1. In the pre-training stages, we trained our models until the validation perplexities stopped decreasing significantly (around 45 epochs on the GIGA-CM dataset and 100 to 200 epochs on CNNDM and NYT50). Training $\text{\sc Hibert}_{M}$ for one epoch on the GIGA-CM dataset takes approximately 20 hours.
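A learning rate schedule consistent with this description can be sketched as below. The exact functional form is an assumption: the sketch takes the warmup and inverse-square-root decay shape from Vaswani et al. (2017) and rescales it so that the peak equals the stated learning rate. The function name `learning_rate` is hypothetical.

```python
import math

def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000):
    """Linear warmup to `peak_lr`, then inverse-square-root decay.

    Assumption: the Vaswani et al. (2017) schedule, rescaled so that the
    maximum (reached at step == warmup_steps) equals `peak_lr`.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return peak_lr * min(step / warmup_steps, math.sqrt(warmup_steps / step))
```

Under this assumption the rate rises linearly to 1e-4 at step 10,000 and falls back to 5e-5 by step 40,000.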

During the fine-tuning stage, our models can be trained on a single GPU. The hyper-parameters are almost identical to those in the pre-training stages, except that the learning rate is 5e-5, the batch size is 32, the number of warmup steps is 4,000 and we train our models for 5 epochs. During inference, we rank sentences using $p(y_{i}|\mathcal{D})$ (Equation (8)) and choose the top $K$ sentences as the summary, where $K$ is tuned on the validation set.
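The inference step amounts to a top-$K$ selection over the predicted probabilities, as in the following sketch. The function name `select_summary` is hypothetical, and returning the chosen sentences in their original document order is an assumption, since the order of the selected sentences is not specified above.

```python
def select_summary(sentences, true_probs, k):
    """Pick the k sentences with the highest p(y_i = True | D).

    Assumption: the chosen sentences are emitted in document order.
    """
    ranked = sorted(range(len(sentences)),
                    key=lambda i: true_probs[i], reverse=True)
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]
```

Tuning $K$ then consists of calling this function with several values of `k` on the validation set and keeping the value with the best ROUGE score.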

4.3 Evaluations

We evaluated the quality of summaries from different systems automatically using ROUGE Lin (2004). We report the full length F1 based ROUGE-1, ROUGE-2 and ROUGE-L on the CNNDM and NYT50 datasets. We compute ROUGE scores using the ROUGE-1.5.5.pl script.

Additionally, we evaluated the generated summaries by eliciting human judgments. Following Cheng and Lapata (2016); Narayan et al. (2018), we randomly sampled 20 documents from the CNNDM test set. Participants were presented with a document and a list of summaries produced by different systems. We asked subjects to rank these summaries (ties allowed) by taking informativeness (does the summary capture the important information from the document?) and fluency (is the summary grammatical?) into account. Each document is annotated by three different subjects.

4.4 Results

Our main results on the CNNDM dataset are shown in Table 1, with abstractive models in the top block and extractive models in the bottom block. Pointer+Coverage See et al. (2017), Abstract-ML+RL Paulus et al. (2017) and DCA Celikyilmaz et al. (2018) are all sequence to sequence learning based models, extended with copy and coverage modeling, reinforcement learning and deep communicating agents, respectively. SentRewrite Chen and Bansal (2018) and InconsisLoss Hsu et al. (2018) both try to decompose word-by-word summary generation into sentence selection from the document and “sentence” level summarization (or compression). Bottom-Up Gehrmann et al. (2018) generates summaries by combining a word prediction model with the decoder attention model. The extractive models are usually based on hierarchical encoders (SummaRuNNer; Nallapati et al. 2017 and NeuSum; Cheng and Lapata 2016). They have been extended with reinforcement learning (Refresh; Narayan et al. 2018 and BanditSum; Dong et al. 2018), Maximal Marginal Relevance (NeuSum-MMR; Zhou et al. 2018), latent variable modeling (LatentSum; Zhang et al. 2018) and syntactic compression (JECS; Xu and Durrett 2019). Lead3 is a baseline which simply selects the first three sentences. Our model $\text{\sc Hibert}_{S}$ (in-domain), which only uses one pre-training stage on the in-domain CNNDM training set, outperforms all of them, and the differences are all significant with a 95% confidence interval (estimated with the ROUGE script). Note that pre-training $\text{\sc Hibert}_{S}$ (in-domain) is very fast: it only takes around 30 minutes for one epoch on the CNNDM training set. Our models with two pre-training stages ( $\text{\sc Hibert}_{S}$ ) or a larger size ( $\text{\sc Hibert}_{M}$ ) perform even better, and $\text{\sc Hibert}_{M}$ outperforms BERT by 0.5 ROUGE (the difference is significant according to the ROUGE script).

We also implemented two baselines. One is the hierarchical transformer summarization model (HeriTransformer; described in Section 3.3) without pre-training. Note that the setting for HeriTransformer is ( $L=4$ , $H=300$ and $A=4$ ); we tried deeper and larger models but obtained inferior results, which may indicate that training large or deep models on this dataset without a good initialization is challenging. We can see that pre-training (details in Section 3.2) leads to a +1.25 ROUGE improvement. The other baseline is based on a pre-trained BERT Devlin et al. (2018) (adapted from the implementation at https://github.com/huggingface/pytorch-pretrained-BERT ) and finetuned on the CNNDM dataset. We used the $\text{BERT}_{\text{base}}$ model because our V100 GPU with 16GB of memory cannot fit $\text{BERT}_{\text{large}}$ for the summarization task even with a batch size of 1. The positional embedding of BERT supports input lengths of up to 512 words; we therefore split documents with more than 10 sentences into multiple blocks of 10 sentences each. We use 10 sentences per block because, with a maximum sentence length of 50, $50\times 10<512$ (the maximum length supported by BERT); the last block of a document may have fewer than 10 sentences. We feed each block (the BOS and EOS tokens of each sentence are replaced with [CLS] and [SEP] tokens) into BERT and use the representation at the [CLS] token to classify each sentence. Our model $\text{\sc Hibert}_{S}$ outperforms BERT by 0.4 to 0.5 ROUGE despite having only half the number of model parameters ( $\text{\sc Hibert}_{S}$ 54.6M vs. BERT 110M).

Results on the NYT50 dataset show similar trends (see Table 2). EXTRACTION is an extractive model based on a hierarchical LSTM and we use the numbers reported by Xu and Durrett (2019). The improvement of $\text{\sc Hibert}_{M}$ over the baseline without pre-training (HeriTransformer) becomes 2.0 ROUGE. $\text{\sc Hibert}_{S}$ (in-domain), $\text{\sc Hibert}_{M}$ (in-domain), $\text{\sc Hibert}_{S}$ and $\text{\sc Hibert}_{M}$ all outperform BERT significantly according to the ROUGE script.

We also conducted a human evaluation with 20 randomly sampled documents from the CNNDM test set. We compared our model $\text{\sc Hibert}_{M}$ against Lead3, DCA, Latent, BERT and the human reference (Human); the outputs of DCA and Latent were obtained via email. We asked the subjects to rank the outputs of these systems from best to worst. As shown in Table 4, the output of $\text{\sc Hibert}_{M}$ is selected as the best in 30% of cases and we obtained a lower mean rank than all systems except for Human. We also converted the rank numbers into ratings (rank $i$ to $7-i$ ) and applied Student's $t$ -test on the ratings. $\text{\sc Hibert}_{M}$ is significantly different from all systems in comparison ( $p<0.05$ ), which indicates that our model still lags behind Human but is better than all other systems.
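The conversion from ranks to ratings is a simple reversal of the scale, sketched below for the six outputs compared here. The function names and the example ranks in the usage note are hypothetical and are not taken from our annotation data.

```python
from statistics import mean

def ranks_to_ratings(ranks, n_systems=6):
    """Map rank i (1 = best) to the rating (n_systems + 1) - i.

    With six compared outputs this is the 7 - i conversion.
    """
    return [n_systems + 1 - r for r in ranks]

def mean_rank(ranks):
    """Average rank of one system over all annotated documents."""
    return mean(ranks)
```

For example, a system ranked 1, 6 and 3 on three documents receives the ratings 6, 1 and 4, so that a higher rating corresponds to a better summary.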

As mentioned earlier, our pre-training includes two stages. The first stage is the open-domain pre-training stage on the GIGA-CM dataset and the following stage is the in-domain pre-training on the CNNDM (or NYT50) dataset. As shown in Table 3, we pre-trained $\text{\sc Hibert}_{S}$ using only the open-domain stage (Open-Domain), only the in-domain stage (In-Domain) or both stages (Open+In-Domain) and applied it to the CNNDM summarization task. Results on the validation set of CNNDM indicate that the two-stage pre-training process is necessary.

Conclusions

The core part of a neural extractive summarization model is the hierarchical document encoder. We proposed a method to pre-train document level hierarchical bidirectional transformer encoders on unlabeled data. Even when we only pre-train hierarchical transformers on the training sets of summarization datasets with our proposed objective, applying the pre-trained hierarchical transformers to extractive summarization models already leads to a substantial improvement in summarization performance. Adding the large open-domain dataset to pre-training leads to even better performance. In the future, we plan to apply our models to other tasks that also require hierarchical document encodings (e.g., document question answering). We are also interested in improving the architectures of hierarchical document encoders and designing other objectives to train hierarchical transformers.