Fine-tune BERT for Extractive Summarization

Yang Liu

Introduction

Single-document summarization is the task of automatically generating a shorter version of a document while retaining its most important information. The task has received much attention in the natural language processing community due to its potential for various information access applications. Examples include tools which digest textual content (e.g., news, social media, reviews), answer questions, or provide recommendations.

The task is often divided into two paradigms, abstractive summarization and extractive summarization. In abstractive summarization, target summaries contain words or phrases that were not in the original text and usually require various text rewriting operations to generate, while extractive approaches form summaries by copying and concatenating the most important spans (usually sentences) in a document. In this paper, we focus on extractive summarization.

Although many neural models have been proposed for extractive summarization recently (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2018; Dong et al., 2018; Zhang et al., 2018; Zhou et al., 2018), the improvement on automatic metrics like ROUGE has reached a bottleneck due to the complexity of the task. We argue that BERT (Devlin et al., 2018), with its pre-training on a huge dataset and its powerful architecture for learning complex features, can further boost the performance of extractive summarization.

We design several variants of BERT for the extractive summarization task and report their results on the CNN/Dailymail and NYT datasets. We find that a flat architecture with inter-sentence Transformer layers performs best, achieving state-of-the-art results on this task.

Methodology

Let $d$ denote a document containing several sentences $[sent_{1},sent_{2},\cdots,sent_{m}]$, where $sent_{i}$ is the $i$-th sentence in the document. Extractive summarization can be defined as the task of assigning a label $y_{i}\in\{0,1\}$ to each $sent_{i}$, indicating whether the sentence should be included in the summary. It is assumed that summary sentences represent the most important content of the document.

To use BERT for extractive summarization, we require it to output a representation for each sentence. However, since BERT is trained as a masked-language model, its output vectors are grounded to tokens instead of sentences. Moreover, although BERT has segmentation embeddings for indicating different sentences, it only has two labels (sentence A or sentence B), whereas extractive summarization involves many sentences. Therefore, we modify the input sequence and embeddings of BERT to make extracting summaries possible.

As illustrated in Figure 1, we insert a [CLS] token before each sentence and a [SEP] token after each sentence. In vanilla BERT, the [CLS] token is used as a symbol to aggregate features from one sentence or a pair of sentences. We modify the model by using multiple [CLS] symbols, each collecting features for the sentence that follows it.
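The input construction described above can be sketched in a few lines. This is a minimal illustration that assumes sentences are already tokenized into word pieces; the function name `build_input` and the returned list of [CLS] positions are hypothetical conveniences, not part of the original implementation.

```python
from typing import List, Tuple


def build_input(sentences: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Wrap each tokenized sentence as [CLS] ... [SEP] and record [CLS] positions."""
    tokens: List[str] = []
    cls_positions: List[int] = []
    for sent in sentences:
        cls_positions.append(len(tokens))  # index of this sentence's [CLS] token
        tokens.append("[CLS]")
        tokens.extend(sent)
        tokens.append("[SEP]")
    return tokens, cls_positions


tokens, cls_positions = build_input([["the", "cat", "sat"], ["it", "slept"]])
print(tokens)
print(cls_positions)
```

The recorded positions are what a model would later use to pick out one output vector per sentence.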

Interval Segment Embeddings

We use interval segment embeddings to distinguish multiple sentences within a document. For $sent_{i}$ we assign a segment embedding $E_{A}$ or $E_{B}$ depending on whether $i$ is odd or even. For example, for $[sent_{1},sent_{2},sent_{3},sent_{4},sent_{5}]$ we assign $[E_{A},E_{B},E_{A},E_{B},E_{A}]$.
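As a small sketch of this alternation, the function below produces one segment id per token, using 0 for $E_{A}$ and 1 for $E_{B}$. The function name and the convention that each sentence length already includes its [CLS] and [SEP] tokens are assumptions made for illustration.

```python
from typing import List


def interval_segment_ids(sentence_lengths: List[int]) -> List[int]:
    """Return one segment id per token: 0 (E_A) for odd i, 1 (E_B) for even i."""
    ids: List[int] = []
    for i, length in enumerate(sentence_lengths, start=1):
        segment = (i + 1) % 2  # i odd -> 0, i even -> 1
        ids.extend([segment] * length)
    return ids


# Three sentences of 2, 3 and 1 tokens respectively.
print(interval_segment_ids([2, 3, 1]))
```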

The vector $T_{i}$, which is the vector of the $i$-th [CLS] symbol from the top BERT layer, is used as the representation for $sent_{i}$.

Fine-tuning with Summarization Layers

After obtaining the sentence vectors from BERT, we build several summarization-specific layers stacked on top of the BERT outputs, to capture document-level features for extracting summaries. For each sentence $sent_{i}$, we calculate the final predicted score $\hat{Y}_{i}$. The loss of the whole model is the binary cross-entropy of $\hat{Y}_{i}$ against the gold label $Y_{i}$. These summarization layers are jointly fine-tuned with BERT.
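To make the training objective concrete, here is a minimal pure-Python sketch of a binary cross-entropy loss over the sentences of one document. Averaging over sentences and clamping scores with a small `eps` are assumptions for numerical convenience; the text does not state how the loss is reduced.

```python
import math
from typing import Sequence


def binary_cross_entropy(scores: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy of predicted scores against 0/1 gold labels."""
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)


# Uninformative scores of 0.5 give a loss of log(2) per sentence.
print(binary_cross_entropy([0.5, 0.5], [1, 0]))
```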

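Simple Classifier

As in the original BERT paper, the Simple Classifier adds only a linear layer on top of the BERT outputs and uses a sigmoid function to obtain the predicted score:

$$\hat{Y}_{i} = \sigma(W_{o} T_{i} + b_{o})$$

where $\sigma$ is the sigmoid function, and $W_{o}$ and $b_{o}$ denote the weight and bias of the linear layer.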

Inter-sentence Transformer

Instead of a simple sigmoid classifier, Inter-sentence Transformer applies more Transformer layers only on sentence representations, extracting document-level features focusing on summarization tasks from the BERT outputs:
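As a sketch, assuming each layer follows the standard Transformer encoder layer of Vaswani et al. (2017), with a residual connection and layer normalization around each sub-layer, the $l$-th layer can be written as:

$$
\begin{aligned}
\tilde{h}^{l} &= \mathrm{LN}\big(h^{l-1} + \mathrm{MHAtt}(h^{l-1})\big) \\
h^{l} &= \mathrm{LN}\big(\tilde{h}^{l} + \mathrm{FFN}(\tilde{h}^{l})\big)
\end{aligned}
$$

Here $\mathrm{LN}$ denotes layer normalization, $\mathrm{MHAtt}$ multi-head self-attention over the sentence vectors, and $\mathrm{FFN}$ a position-wise feed-forward network. These names, and the assumption that the input $h^{0}$ is the sequence of BERT sentence vectors $T$ with positional embeddings added, are illustrative notation rather than details confirmed by the text.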

The final output layer is still a sigmoid classifier:

$$\hat{Y}_{i} = \sigma(W_{o} h_{i}^{L} + b_{o})$$

where $h_{i}^{L}$ is the vector for $sent_{i}$ from the top layer (the $L$-th layer) of the Transformer, and $W_{o}$ and $b_{o}$ denote the weight and bias of the output layer. In experiments, we implemented Transformers with $L=1,2,3$ and found that the Transformer with $2$ layers performs best.

Recurrent Neural Network

Although the Transformer model has achieved great results on several tasks, there is evidence that Recurrent Neural Networks still have their advantages, especially when combined with techniques from the Transformer (Chen et al., 2018). Therefore, we apply an LSTM layer over the BERT outputs to learn summarization-specific features.

To stabilize training, per-gate layer normalization (Ba et al., 2016) is applied within each LSTM cell. At time step $i$, the input to the LSTM layer is the BERT output $T_{i}$, and the output is calculated as:
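As a sketch, assuming the cell follows the layer-normalized LSTM formulation of Ba et al. (2016), with separate normalization of the recurrent and input projections and biases omitted for brevity, the update at step $i$ can be written as:

$$
\begin{aligned}
\begin{pmatrix} F_{i} \\ I_{i} \\ O_{i} \\ G_{i} \end{pmatrix} &= \mathrm{LN}_{h}(W_{h} h_{i-1}) + \mathrm{LN}_{x}(W_{x} T_{i}) \\
C_{i} &= \sigma(F_{i}) \odot C_{i-1} + \sigma(I_{i}) \odot \tanh(G_{i}) \\
h_{i} &= \sigma(O_{i}) \odot \tanh\big(\mathrm{LN}_{c}(C_{i})\big)
\end{aligned}
$$

Here $F_{i}$, $I_{i}$ and $O_{i}$ are the forget, input and output gates, $G_{i}$ is the candidate update, $C_{i}$ is the memory cell and $h_{i}$ is the output vector. The symbols $W_{h}$, $W_{x}$ and the three normalization operators $\mathrm{LN}_{h}$, $\mathrm{LN}_{x}$, $\mathrm{LN}_{c}$ are assumed notation.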

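The final output layer is also a sigmoid classifier:

$$\hat{Y}_{i} = \sigma(W_{o} h_{i} + b_{o})$$

where $h_{i}$ is the LSTM output at time step $i$, and $W_{o}$ and $b_{o}$ denote the weight and bias of the output layer.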

Experiments

In this section we present our implementation, describe the summarization datasets and our evaluation protocol, and analyze our results.

We use PyTorch, OpenNMT (Klein et al., 2017) and the 'bert-base-uncased' version of BERT (https://github.com/huggingface/pytorch-pretrained-BERT) to implement the model. BERT and the summarization layers are jointly fine-tuned. Adam with $\beta_{1}=0.9$, $\beta_{2}=0.999$ is used for fine-tuning. The learning rate schedule follows Vaswani et al. (2017), with warm-up over the first 10,000 steps:
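The general shape of the Vaswani et al. (2017) schedule can be sketched as follows: the rate grows linearly during warm-up and then decays with the inverse square root of the step. The multiplicative constant is exposed here as a hypothetical `scale` parameter, since its value is an assumption of this sketch; only the 10,000 warm-up steps come from the text.

```python
def warmup_learning_rate(step: int, warmup: int = 10_000, scale: float = 1.0) -> float:
    """scale * min(step^-0.5, step * warmup^-1.5), defined for step >= 1."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return scale * min(step ** -0.5, step * warmup ** -1.5)


# The rate rises until step == warmup, then decays.
for s in (1_000, 5_000, 10_000, 40_000):
    print(s, warmup_learning_rate(s))
```

With this form the peak is reached exactly at the end of warm-up, where both arguments of `min` coincide.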

All models are trained for 50,000 steps on 3 GPUs (GTX 1080 Ti) with gradient accumulation every two steps, which makes the batch size approximately equal to $36$. Model checkpoints are saved and evaluated on the validation set every 1,000 steps. We select the top-3 checkpoints based on their evaluation losses on the validation set, and report the averaged results on the test set.
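The checkpoint selection step can be illustrated with a short sketch. The helper names and all numbers in the example are made up for illustration; only the rule (keep the three checkpoints with the lowest validation loss, then average their test scores) comes from the text.

```python
from typing import Dict, List


def top_k_checkpoints(val_losses: Dict[int, float], k: int = 3) -> List[int]:
    """Return the training steps of the k checkpoints with the lowest validation loss."""
    return sorted(val_losses, key=lambda step: val_losses[step])[:k]


def average_score(test_scores: Dict[int, float], steps: List[int]) -> float:
    """Average a test metric over the selected checkpoints."""
    return sum(test_scores[s] for s in steps) / len(steps)


# Hypothetical validation losses and test scores keyed by training step.
val_losses = {1000: 0.9, 2000: 0.5, 3000: 0.6, 4000: 0.4}
test_scores = {1000: 10.0, 2000: 30.0, 3000: 20.0, 4000: 40.0}
best = top_k_checkpoints(val_losses)
print(best, average_score(test_scores, best))
```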

When predicting summaries for a new document, we first use the models to obtain the score for each sentence. We then rank these sentences by their scores from highest to lowest, and select the top-3 sentences as the summary.

During prediction, Trigram Blocking is used to reduce redundancy. Given the selected summary $S$ and a candidate sentence $c$, we skip $c$ if there exists a trigram overlap between $c$ and $S$. This is similar to Maximal Marginal Relevance (MMR; Carbonell and Goldstein, 1998) but much simpler.
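The ranking and Trigram Blocking steps can be sketched together as below. This is a minimal illustration over pre-tokenized sentences; the function names are hypothetical, and returning the selected indices in score order (rather than document order) is an assumption.

```python
from typing import List, Sequence, Set, Tuple


def trigrams(tokens: Sequence[str]) -> Set[Tuple[str, ...]]:
    """All contiguous word trigrams of a sentence (empty if fewer than 3 tokens)."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_summary(sentences: List[List[str]], scores: List[float],
                   max_sentences: int = 3) -> List[int]:
    """Pick top-scoring sentences, skipping any that shares a trigram with the summary."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    selected: List[int] = []
    seen: Set[Tuple[str, ...]] = set()
    for i in ranked:
        candidate = trigrams(sentences[i])
        if candidate & seen:  # trigram overlap with the summary so far: skip c
            continue
        selected.append(i)
        seen |= candidate
        if len(selected) == max_sentences:
            break
    return selected


docs = [s.split() for s in [
    "the cat sat on the mat",
    "the cat sat on a chair",
    "dogs bark loudly at night",
    "birds sing",
]]
print(select_summary(docs, [0.9, 0.8, 0.7, 0.6]))
```

In the example, the second sentence is skipped because it shares the trigram "the cat sat" with the first, so a lower-scored sentence takes its place.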

Summarization Datasets

We evaluated on two benchmark datasets, namely the CNN/DailyMail news highlights dataset (Hermann et al., 2015) and the New York Times Annotated Corpus (NYT; Sandhaus, 2008). The CNN/DailyMail dataset contains news articles and associated highlights, i.e., a few bullet points giving a brief overview of the article. We used the standard splits of Hermann et al. (2015) for training, validation, and testing (90,266/1,220/1,093 CNN documents and 196,961/12,148/10,397 DailyMail documents). We did not anonymize entities. We first split sentences with CoreNLP and then pre-processed the dataset following the methods in See et al. (2017).

The NYT dataset contains 110,540 articles with abstractive summaries. Following Durrett et al. (2016), we split these into 100,834 training and 9,706 test examples, based on date of publication (the test set is all articles published on January 1, 2007 or later). We took 4,000 examples from the training set as the validation set. We also followed their filtering procedure: documents with summaries shorter than 50 words were removed from the raw dataset. The filtered test set (NYT50) includes 3,452 test examples. We first split sentences with CoreNLP and then pre-processed the dataset following the methods in Durrett et al. (2016).

Both datasets contain abstractive gold summaries, which are not readily suited to training extractive summarization models. A greedy algorithm was used to generate an oracle summary for each document. The algorithm greedily selects the sentences that maximize the ROUGE scores as the oracle sentences. We assigned label 1 to sentences selected in the oracle summary and 0 otherwise.
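A minimal sketch of such a greedy oracle is given below. It uses a plain unigram-overlap F1 as a stand-in for ROUGE, caps the oracle at three sentences, and stops as soon as no remaining sentence improves the score; all three choices, along with the function names, are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def unigram_f1(candidate: List[str], reference: List[str]) -> float:
    """Unigram-overlap F1, used here as a simple stand-in for a ROUGE score."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences: List[List[str]], reference: List[str],
                  max_sentences: int = 3) -> List[int]:
    """Greedily add the sentence that most improves the score; stop when none does."""
    selected: List[int] = []
    best = 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            tokens = [t for j in sorted(selected + [i]) for t in sentences[j]]
            score = unigram_f1(tokens, reference)
            if score > best:
                best, best_idx = score, i
        if best_idx is None:  # no candidate improves the current score
            break
        selected.append(best_idx)
    return sorted(selected)


def oracle_labels(num_sentences: int, selected: List[int]) -> List[int]:
    """Label 1 for oracle sentences and 0 otherwise."""
    return [1 if i in selected else 0 for i in range(num_sentences)]


sents = [["a", "b", "c"], ["x", "y", "z"], ["d", "e"]]
gold = ["a", "b", "c", "d", "e"]
chosen = greedy_oracle(sents, gold)
print(chosen, oracle_labels(len(sents), chosen))
```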

Experimental Results

The experimental results on the CNN/Dailymail dataset are shown in Table 1. For comparison, we implement a non-pretrained Transformer baseline which uses the same architecture as BERT, but with fewer parameters. It is randomly initialized and trained only on the summarization task. The Transformer baseline has 6 layers, a hidden size of $512$ and a feed-forward filter size of $2048$. The model is trained with the same settings as Vaswani et al. (2017). We also compare our model with several previously proposed systems.

Lead is an extractive baseline which uses the first three sentences of the document as a summary.

Refresh (Narayan et al., 2018) is an extractive summarization system trained by globally optimizing the ROUGE metric with reinforcement learning.

Neusum (Zhou et al., 2018) is the state-of-the-art extractive system that jointly scores and selects sentences.

Pgn (See et al., 2017) is the Pointer Generator Network, an abstractive summarization system based on an encoder-decoder architecture.

Dca (Celikyilmaz et al., 2018) is Deep Communicating Agents, a state-of-the-art abstractive summarization system that uses multiple agents to represent the document and a hierarchical attention mechanism over the agents for decoding.

As shown in the table, all BERT-based models outperformed previous state-of-the-art models by a large margin. Bertsum with Transformer achieved the best performance on all three metrics. Bertsum with LSTM shows no clear difference in summarization performance compared to the Classifier model.

Ablation studies are conducted to show the contribution of different components of Bertsum. The results are shown in Table 2. Interval segments increase the performance of the base model. Trigram blocking greatly improves the summarization results. This is consistent with previous conclusions that a sequential extractive decoder helps to generate more informative summaries. However, here we use trigram blocking as a simple but robust alternative.

The experimental results on the NYT dataset are shown in Table 3. Unlike for CNN/Dailymail, we use the limited-length recall evaluation, following Durrett et al. (2016). We truncate the predicted summaries to the lengths of the gold summaries and evaluate summarization quality with ROUGE Recall. The compared baselines are (1) First-$k$ words, a simple baseline that extracts the first $k$ words of the input article; (2) Full, the best-performing extractive model in Durrett et al. (2016); and (3) Deep Reinforced (Paulus et al., 2018), an abstractive model using reinforcement learning and an encoder-decoder structure. Bertsum+Classifier achieves state-of-the-art results on this dataset.
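The limited-length recall protocol can be sketched as follows. The example truncates the prediction to the gold summary's length in words and then computes unigram recall as a simplified stand-in for ROUGE-1 Recall (no stemming or stopword handling). Measuring length in words and the function name are assumptions.

```python
from collections import Counter
from typing import List


def limited_length_recall(predicted: List[str], gold: List[str]) -> float:
    """Truncate the prediction to the gold length, then compute unigram recall."""
    if not gold:
        return 0.0
    truncated = predicted[:len(gold)]
    overlap = sum((Counter(truncated) & Counter(gold)).values())
    return overlap / len(gold)


pred = "a b c d e f".split()
ref = "a c x".split()
print(limited_length_recall(pred, ref))
```

Truncation keeps a system from gaining recall simply by producing a longer summary than the reference.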

Conclusion

In this paper, we explored how to use BERT for extractive summarization. We proposed the Bertsum model and tried several summarization layers that can be applied with BERT. We ran experiments on two large-scale datasets and found that Bertsum with inter-sentence Transformer layers achieves the best performance.