Ranking Sentences for Extractive Summarization with Reinforcement Learning
Shashi Narayan, Shay B. Cohen, Mirella Lapata
Introduction
Automatic summarization has enjoyed wide popularity in natural language processing due to its potential for various information access applications. Examples include tools which aid users navigate and digest web content (e.g., news, social media, product reviews), question answering, and personalized recommendation engines. Single document summarization — the task of producing a shorter version of a document while preserving its information content — is perhaps the most basic of summarization tasks that have been identified over the years (see Nenkova and McKeown, 2011 for a comprehensive overview).
Modern approaches to single document summarization are data-driven, taking advantage of the success of neural network architectures and their ability to learn continuous features without recourse to preprocessing tools or linguistic annotations. Abstractive summarization involves various text rewriting operations (e.g., substitution, deletion, reordering) and has been recently framed as a sequence-to-sequence problem Sutskever et al. (2014). Central in most approaches Rush et al. (2015); Chen et al. (2016); Nallapati et al. (2016); See et al. (2017); Tan and Wan (2017); Paulus et al. (2017) is an encoder-decoder architecture modeled by recurrent neural networks. The encoder reads the source sequence into a list of continuous-space representations from which the decoder generates the target sequence. An attention mechanism Bahdanau et al. (2015) is often used to locate the region of focus during decoding.
Extractive systems create a summary by identifying (and subsequently concatenating) the most important sentences in a document. A few recent approaches Cheng and Lapata (2016); Nallapati et al. (2017); Narayan et al. (2017); Yasunaga et al. (2017) conceptualize extractive summarization as a sequence labeling task in which each label specifies whether each document sentence should be included in the summary. Existing models rely on recurrent neural networks to derive a meaning representation of the document which is then used to label each sentence, taking the previously labeled sentences into account. These models are typically trained using cross-entropy loss in order to maximize the likelihood of the ground-truth labels and do not necessarily learn to rank sentences based on their importance due to the absence of a ranking-based objective. Another discrepancy comes from the mismatch between the learning objective and the evaluation criterion, namely ROUGE Lin and Hovy (2003), which takes the entire summary into account.
In this paper we argue that cross-entropy training is not optimal for extractive summarization. Models trained this way are prone to generating verbose summaries with unnecessarily long sentences and redundant information. We propose to overcome these difficulties by globally optimizing the ROUGE evaluation metric and learning to rank sentences for summary generation through a reinforcement learning objective. Similar to previous work Cheng and Lapata (2016); Narayan et al. (2017); Nallapati et al. (2017), our neural summarization model consists of a hierarchical document encoder and a hierarchical sentence extractor. During training, it combines the maximum-likelihood cross-entropy loss with rewards from policy gradient reinforcement learning to directly optimize the evaluation metric relevant for the summarization task. We show that this global optimization framework renders extractive models better at discriminating among sentences for the final summary; a sentence is ranked high for selection if it often occurs in high scoring summaries.
We report results on the CNN and DailyMail news highlights datasets Hermann et al. (2015) which have been recently used as testbeds for the evaluation of neural summarization systems. Experimental results show that when evaluated automatically (in terms of ROUGE), our model outperforms state-of-the-art extractive and abstractive systems. We also conduct two human evaluations in order to assess (a) which type of summary participants prefer (we compare extractive and abstractive systems) and (b) how much key information from the document is preserved in the summary (we ask participants to answer questions pertaining to the content in the document by reading system summaries). Both evaluations overwhelmingly show that human subjects find our summaries more informative and complete.
Our contributions in this work are three-fold: a novel application of reinforcement learning to sentence ranking for extractive summarization; corroborated by analysis and empirical results showing that cross-entropy training is not well-suited to the summarization task; and large scale user studies following two evaluation paradigms which demonstrate that state-of-the-art abstractive systems lag behind extractive ones when the latter are globally trained.
Summarization as Sentence Ranking
Given a document D consisting of a sequence of sentences , an extractive summarizer aims to produce a summary by selecting sentences from D (where ). For each sentence , we predict a label (where means that should be included in the summary) and assign a score quantifying ’s relevance to the summary. The model learns to assign when sentence is more relevant than . Model parameters are denoted by . We estimate using a neural network model and assemble a summary by selecting sentences with top scores.
Our architecture resembles those previously proposed in the literature Cheng and Lapata (2016); Nallapati et al. (2017); Narayan et al. (2017). The main components include a sentence encoder, a document encoder, and a sentence extractor (see the left block of Figure 1) which we describe in more detail below.
A core component of our model is a convolutional sentence encoder which encodes sentences into continuous representations. In recent years, CNNs have proven useful for various NLP tasks Collobert et al. (2011); Kim (2014); Kalchbrenner et al. (2014); Zhang et al. (2015); Lei et al. (2015); Kim et al. (2016); Cheng and Lapata (2016) because of their effectiveness in identifying salient patterns in the input Xu et al. (2015). In the case of summarization, CNNs can identify named-entities and events that correlate with the gold summary.
We use temporal narrow convolution by applying a kernel filter of width to a window of words in sentence to produce a new feature. This filter is applied to each possible window of words in to produce a feature map where is the sentence length. We then apply max-pooling over time over the feature map and take the maximum value as the feature corresponding to this particular filter . We use multiple kernels of various sizes and each kernel multiple times to construct the representation of a sentence. In Figure 1, kernels of size (red) and (blue) are applied three times each. Max-pooling over time yields two feature lists and . The final sentence embeddings have six dimensions.
Document Encoder
The document encoder composes a sequence of sentences to obtain a document representation. We use a recurrent neural network with Long Short-Term Memory (LSTM) cells to avoid the vanishing gradient problem when training long sequences Hochreiter and Schmidhuber (1997). Given a document D consisting of a sequence of sentences , we follow common practice and feed sentences in reverse order Sutskever et al. (2014); Li et al. (2015); Filippova et al. (2015); Narayan et al. (2017). This way we make sure that the network also considers the top sentences of the document which are particularly important for summarization Rush et al. (2015); Nallapati et al. (2016).
Sentence Extractor
Our sentence extractor sequentially labels each sentence in a document with (relevant for the summary) or (otherwise). It is implemented with another RNN with LSTM cells and a softmax layer. At time , it reads sentence and makes a binary prediction, conditioned on the document representation (obtained from the document encoder) and the previously labeled sentences. This way, the sentence extractor is able to identify locally and globally important sentences within the document. We rank the sentences in a document D by , the confidence scores assigned by the softmax layer of the sentence extractor.
We learn to rank sentences by training our network in a reinforcement learning framework, directly optimizing the final evaluation metric, namely ROUGE Lin and Hovy (2003). Before we describe our training algorithm, we elaborate on why the maximum-likelihood cross-entropy objective could be deficient for ranking sentences for summarization (Section 3). Then, we define our reinforcement learning objective in Section 4 and show that our new way of training allows the model to better discriminate amongst sentences, i.e., a sentence is ranked higher for selection if it often occurs in high scoring summaries.
The Pitfalls of Cross-Entropy Loss
Previous work optimizes summarization models by maximizing , the likelihood of the ground-truth labels y = for sentences , given document D and model parameters . This objective can be achieved by minimizing the cross-entropy loss at each decoding step:
Cross-entropy training leads to two kinds of discrepancies in the model. The first discrepancy comes from the disconnect between the task definition and the training objective. While MLE in Equation (1) aims to maximize the likelihood of the ground-truth labels, the model is (a) expected to rank sentences to generate a summary and (b) evaluated using ROUGE at test time. The second discrepancy comes from the reliance on ground-truth labels. Document collections for training summarization systems do not naturally contain labels indicating which sentences should be extracted. Instead, they are typically accompanied by abstractive summaries from which sentence-level labels are extrapolated. Cheng and Lapata (2016) follow Woodsend and Lapata (2010) in adopting a rule-based method which assigns labels to each sentence in the document individually based on their semantic correspondence with the gold summary (see the fourth column in Table 1). An alternative method Svore et al. (2007); Cao et al. (2016); Nallapati et al. (2017) identifies the set of sentences which collectively gives the highest ROUGE with respect to the gold summary. Sentences in this set are labeled with 1 and 0 otherwise (see the column 5 in Table 1).
Labeling sentences individually often generates too many positive labels causing the model to overfit the data. For example, the document in Table 1 has 12 positively labeled sentences out of 31 in total (only first 10 are shown). Collective labels present a better alternative since they only pertain to the few sentences deemed most suitable to form the summary. However, a model trained with cross-entropy loss on collective labels will underfit the data as it will only maximize probabilities for sentences in this set (e.g., sentences in Table 1) and ignore all other sentences. We found that there are many candidate summaries with high ROUGE scores which could be considered during training.
Table 1 (last column) shows candidate summaries ranked according to the mean of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. Interestingly, multiple top ranked summaries have reasonably high ROUGE scores. For example, the average ROUGE for the summaries ranked second (0,13), third (11,13), and fourth (0,1,13) is 57.5%, 57.2%, and 57.1%, and all top 16 summaries have ROUGE scores more or equal to 50%. A few sentences are indicative of important content and appear frequently in the summaries: sentence 13 occurs in all summaries except one, while sentence 0 appears in several summaries too. Also note that summaries (11,13) and (1,13) yield better ROUGE scores compared to longer summaries, and may be as informative, yet more concise, alternatives.
These discrepancies render the model less efficient at ranking sentences for the summarization task. Instead of maximizing the likelihood of the ground-truth labels, we could train the model to predict the individual ROUGE score for each sentence in the document and then select the top sentences with highest scores. But sentences with individual ROUGE scores do not necessarily lead to a high scoring summary, e.g., they may convey overlapping content and form verbose and redundant summaries. For example, sentence 3, despite having a high individual ROUGE score (35.6%), does not occur in any of the top 5 summaries. We next explain how we address these issues using reinforcement learning.
Sentence Ranking with Reinforcement Learning
Reinforcement learning Sutton and Barto (1998) has been proposed as a way of training sequence-to-sequence generation models in order to directly optimize the metric used at test time, e.g., BLEU or ROUGE Ranzato et al. (2015). We adapt reinforcement learning to our formulation of extractive summarization to rank sentences for summary generation. We propose an objective function that combines the maximum-likelihood cross-entropy loss with rewards from policy gradient reinforcement learning to globally optimize ROUGE. Our training algorithm allows to explore the space of possible summaries, making our model more robust to unseen data. As a result, reinforcement learning helps extractive summarization in two ways: (a) it directly optimizes the evaluation metric instead of maximizing the likelihood of the ground-truth labels and (b) it makes our model better at discriminating among sentences; a sentence is ranked high for selection if it often occurs in high scoring summaries.
We cast the neural summarization model introduced in Figure 1 in the Reinforcement Learning paradigm Sutton and Barto (1998). Accordingly, the model can be viewed as an “agent” which interacts with an “environment” consisting of documents. At first, the agent is initialized randomly, it reads document D and predicts a relevance score for each sentence using “policy” , where are model parameters. Once the agent is done reading the document, a summary with labels is sampled out of the ranked sentences. The agent is then given a “reward” commensurate with how well the extract resembles the gold-standard summary. Specifically, as reward function we use mean F1 of ROUGE-1, ROUGE-2, and ROUGE-L. Unigram and bigram overlap (ROUGE-1 and ROUGE-2) are meant to assess informativeness, whereas the longest common subsequence (ROUGE-L) is meant to assess fluency. We update the agent using the REINFORCE algorithm Williams (1992) which aims to minimize the negative expected reward:
where, stands for . REINFORCE is based on the observation that the expected gradient of a non-differentiable reward function (ROUGE, in our case) can be computed as follows:
While MLE in Equation (1) aims to maximize the likelihood of the training data, the objective in Equation (2) learns to discriminate among sentences with respect to how often they occur in high scoring summaries.
2 Training with High Probability Samples
Computing the expectation term in Equation (3) is prohibitive, since there is a large number of possible extracts. In practice, we approximate the expected gradient using a single sample from for each training example in a batch:
Experimental Setup
In this section we present our experimental setup for assessing the performance of our model which we call Refresh as a shorthand for REinFoRcement Learning-based Extractive Summarization. We describe our datasets, discuss implementation details, our evaluation protocol, and the systems used for comparison.
We evaluated our models on the CNN and DailyMail news highlights datasets Hermann et al. (2015). We used the standard splits of Hermann et al. (2015) for training, validation, and testing (90,266/1,220/1,093 documents for CNN and 196,961/12,148/10,397 for DailyMail). We did not anonymize entities or lower case tokens. We followed previous studies Cheng and Lapata (2016); Nallapati et al. (2016, 2017); See et al. (2017); Tan and Wan (2017) in assuming that the “story highlights” associated with each article are gold-standard abstractive summaries. During training we use these to generate high scoring extracts and to estimate rewards for them, but during testing, they are used as reference summaries to evaluate our models.
Implementation Details
We used the One Billion Word Benchmark corpus Chelba et al. (2013) to train word embeddings with the skip-gram model Mikolov et al. (2013) using context window size 6, negative sampling size 10, and hierarchical softmax 1. Known words were initialized with pre-trained embeddings of size 200. Embeddings for unknown words were initialized to zero, but estimated during training. Sentences were padded with zeros to a length of 100. For the sentence encoder, we used a list of kernels of widths 1 to 7, each with output channel size of 50 Kim et al. (2016). The sentence embedding size in our model was 350.
For the recurrent neural network component in the document encoder and sentence extractor, we used a single-layered LSTM network with size 600. All input documents were padded with zeros to a maximum document length of 120. We performed minibatch cross-entropy training with a batch size of 20 documents for 20 training epochs. It took around 12 hrs on a single GPU to train. After each epoch, we evaluated our model on the validation set and chose the best performing model for the test set. During training we used the Adam optimizer Kingma and Ba (2015) with initial learning rate . Our system is implemented in TensorFlow Abadi et al. (2015).
Evaluation
We evaluated summarization quality using F1 ROUGE Lin and Hovy (2003). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) as a means of assessing informativeness and the longest common subsequence (ROUGE-L) as a means of assessing fluency.We used pyrouge, a Python package, to compute all ROUGE scores with parameters “-a -c 95 -m -n 4 -w 1.2.” We compared Refresh against a baseline which simply selects the first leading sentences from each document (Lead) and two neural models similar to ours (see left block in Figure 1), both trained with cross-entropy loss. Cheng and Lapata (2016) train on individual labels, while Nallapati et al. (2017) use collective labels. We also compared our model against the abstractive systems of Chen et al. (2016), Nallapati et al. (2016), See et al. (2017), and Tan and Wan (2017).Cheng and Lapata (2016) report ROUGE recall scores on the DailyMail dataset only. We used their code (https://github.com/cheng6076/NeuralSum) to produce ROUGE F1 scores on both CNN and DailyMail datasets. For other systems, all results are taken from their papers.
In addition to ROUGE which can be misleading when used as the only means to assess the informativeness of summaries Schluter (2017), we also evaluated system output by eliciting human judgments in two ways. In our first experiment, participants were presented with a news article and summaries generated by three systems: the Lead baseline, abstracts from See et al. (2017), and extracts from Refresh. We also included the human-authored highlights.We are grateful to Abigail See for providing us with the output of her system. We did not include output from Nallapati et al. (2017), Chen et al. (2016), Nallapati et al. (2016), or Tan and Wan (2017) in our human evaluation study, as these models are trained on a named-entity anonymized version of the CNN and DailyMail datasets, and as result produce summaries which are not comparable to ours. We did not include extracts from Cheng and Lapata (2016) either as they were significantly inferior to Lead (see Table 2). Participants read the articles and were asked to rank the summaries from best (1) to worst (4) in order of informativeness (does the summary capture important information in the article?) and fluency (is the summary written in well-formed English?). We did not allow any ties. We randomly selected 10 articles from the CNN test set and 10 from the DailyMail test set. The study was completed by five participants, all native or proficient English speakers. Each participant was presented with the 20 articles. The order of summaries to rank was randomized per article and the order of articles per participant. Examples of summaries our subjects ranked are shown in Figure 2.
Our second experiment assessed the degree to which our model retains key information from the document following a question-answering (QA) paradigm which has been previously used to evaluate summary quality and text compression Morris et al. (1992); Mani et al. (2002); Clarke and Lapata (2010). We created a set of questions based on the gold summary under the assumption that it highlights the most important document content. We then examined whether participants were able to answer these questions by reading system summaries alone without access to the article. The more questions a system can answer, the better it is at summarizing the document as a whole.
We worked on the same 20 documents used in our first elicitation study. We wrote multiple fact-based question-answer pairs for each gold summary without looking at the document. Questions were formulated so as to not reveal answers to subsequent questions. We created 71 questions in total varying from two to six questions per gold summary. Example questions are given in Figure 2. Participants read the summary and answered all associated questions as best they could without access to the original document or the gold summary. Subjects were shown summaries from three systems: the Lead baseline, the abstractive system of See et al. (2017), and Refresh. Five participants answered questions for each summary. We used the same scoring mechanism from Clarke and Lapata (2010), i.e., a correct answer was marked with a score of one, partially correct answers with a score of 0.5, and zero otherwise. The final score for a system is the average of all its question scores. Answers were elicited using Amazon’s Mechanical Turk crowdsourcing platform. We uploaded data in batches (one system at a time) on Mechanical Turk to ensure that same participant does not evaluate summaries from different systems on the same set of questions.
Results
We report results using automatic metrics in Table 2. The top part of the table compares Refresh against related extractive systems. The bottom part reports the performance of abstractive systems. We present three variants of Lead, one is computed by ourselves and the other two are reported in Nallapati et al. (2017) and See et al. (2017). Note that they vary slightly due to differences in the preprocessing of the data. We report results on the CNN and DailyMail datasets and their combination (CNNDailyMail).
The results in Table 2 show that Refresh is superior to our Lead baseline and extractive systems across datasets and metrics. It outperforms the extractive system of Cheng and Lapata (2016) which is trained on individual labels. Refresh is not directly comparable with Nallapati et al. (2017) as they generate anonymized summaries. Their system lags behind their Lead baseline on ROUGE-L on the CNN+DailyMail dataset (35.5% vs 35.3%). Also note that their model is trained on collective labels and has a significant lead over Cheng and Lapata (2016). As discussed in Section 3 cross-entropy training on individual labels tends to overgenerate positive labels leading to less informative and verbose summaries.
Extractive vs Abstractive Systems
Our automatic evaluation results further demonstrate that Refresh is superior to abstractive systems Chen et al. (2016); Nallapati et al. (2016); See et al. (2017); Tan and Wan (2017) which are all variants of an encoder-decoder architecture Sutskever et al. (2014). Despite being more faithful to the actual summarization task (hand-written summaries combine several pieces of information from the original document), abstractive systems lag behind the Lead baseline. Tan and Wan (2017) present a graph-based neural model, which manages to outperform Lead on ROUGE-1 but falters when higher order ROUGE scores are used. Amongst abstractive systems See et al. (2017) perform best. Interestingly, their system is mostly extractive, exhibiting a small degree of rewriting; it copies more than 35% of the sentences in the source document, 85% of 4-grams, 90% of 3-grams, 95% of bigrams, and 99% of unigrams.
Human Evaluation: System Ranking
Table 3 shows, proportionally, how often participants ranked each system, 1st, 2nd, and so on. Perhaps unsurprisingly human-authored summaries are considered best (and ranked 1st 39% of the time). Refresh is ranked 2nd best followed by Lead and See et al. (2017) which are mostly ranked in 3rd and 4th places. We carried out pairwise comparisons between all models in Table 3 to assess whether system differences are statistically significant. There is no significant difference between Lead and See et al. (2017), and Refresh and Gold (using a one-way ANOVA with post-hoc Tukey HSD tests; ). All other differences are statistically significant.
Human Evaluation: Question Answering
The results of our QA evaluation are shown in the last column of Table 3. Based on summaries generated by Refresh, participants can answer 66.34% of questions correctly. Summaries produced by Lead and the abstractive system of See et al. (2017) provide answers for 36.33% and 28.73% of the questions, respectively. Differences between systems are all statistically significant () with the exception of Lead and See et al. (2017).
Although the QA results in Table 3 follow the same pattern as ROUGE in Table 2, differences among systems are now greatly amplified. QA-based evaluation is more focused and a closer reflection of users’ information need (i.e., to find out what the article is about), whereas ROUGE simply captures surface similarity (i.e., -gram overlap) between output summaries and their references. Interestingly, Lead is considered better than See et al. (2017) in the QA evaluation, whereas we find the opposite when participants are asked to rank systems. We hypothesize that Lead is indeed more informative than See et al. (2017) but humans prefer shorter summaries. The average length of Lead summaries is 105.7 words compared to 61.6 for See et al. (2017).
Related Work
Traditional summarization methods manually define features to rank sentences for their salience in order to identify the most important sentences in a document or set of documents Kupiec et al. (1995); Mani (2001); Radev et al. (2004); Filatova and Hatzivassiloglou (2004); Nenkova et al. (2006); Spärck Jones (2007). A vast majority of these methods learn to score each sentence independently Barzilay and Elhadad (1997); Teufel and Moens (1997); Erkan and Radev (2004); Mihalcea and Tarau (2004); Shen et al. (2007); Schilder and Kondadadi (2008); Wan (2010) and a summary is generated by selecting top-scored sentences in a way that is not incorporated into the learning process. Summary quality can be improved heuristically, Yih et al. (2007), via max-margin methods Carbonell and Goldstein (1998); Li et al. (2009), or integer-linear programming Woodsend and Lapata (2010); Berg-Kirkpatrick et al. (2011); Woodsend and Lapata (2012); Almeida and Martins (2013); Parveen et al. (2015).
Recent deep learning methods Kågebäck et al. (2014); Yin and Pei (2015); Cheng and Lapata (2016); Nallapati et al. (2017) learn continuous features without any linguistic preprocessing (e.g., named entities). Like traditional methods, these approaches also suffer from the mismatch between the learning objective and the evaluation criterion (e.g., ROUGE) used at the test time. In comparison, our neural model globally optimizes the ROUGE evaluation metric through a reinforcement learning objective: sentences are highly ranked if they occur in highly scoring summaries.
Reinforcement learning has been previously used in the context of traditional multi-document summarization as a means of selecting a sentence or a subset of sentences from a document cluster. Ryang and Abekawa (2012) cast the sentence selection task as a search problem. Their agent observes a state (e.g., a candidate summary), executes an action (a transition operation that produces a new state selecting a not-yet-selected sentence), and then receives a delayed reward based on . Follow-on work Rioux et al. (2014) extends this approach by employing ROUGE as part of the reward function, while Henß et al. (2015) further experiment with -learning. Mollá-Aliod (2017) has adapt this approach to query-focused summarization. Our model differs from these approaches both in application and formulation. We focus solely on extractive summarization, in our case states are documents (not summaries) and actions are relevance scores which lead to sentence ranking (not sentence-to-sentence transitions). Rather than employing reinforcement learning for sentence selection, our algorithm performs sentence ranking using ROUGE as the reward function.
The REINFORCE algorithm Williams (1992) has been shown to improve encoder-decoder text-rewriting systems by allowing to directly optimize a non-differentiable objective Ranzato et al. (2015); Li et al. (2016); Paulus et al. (2017) or to inject task-specific constraints Zhang and Lapata (2017); Nogueira and Cho (2017). However, we are not aware of any attempts to use reinforcement learning for training a sentence ranker in the context of extractive summarization.
Conclusions
In this work we developed an extractive summarization model which is globally trained by optimizing the ROUGE evaluation metric. Our training algorithm explores the space of candidate summaries while learning to optimize a reward function which is relevant for the task at hand. Experimental results show that reinforcement learning offers a great means to steer our model towards generating informative, fluent, and concise summaries outperforming state-of-the-art extractive and abstractive systems on the CNN and DailyMail datasets. In the future we would like to focus on smaller discourse units Mann and Thompson (1988) rather than individual sentences, modeling compression and extraction jointly.
We gratefully acknowledge the support of the European Research Council (Lapata; award number 681760), the European Union under the Horizon 2020 SUMMA project (Narayan, Cohen; grant agreement 688139), and Huawei Technologies (Cohen). The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract FA8650-17-C-9118. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.