Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

Shashi Narayan, Shay B. Cohen, Mirella Lapata

Introduction

Automatic summarization is one of the central problems in Natural Language Processing (NLP) posing several challenges relating to understanding (i.e., identifying important content) and generation (i.e., aggregating and rewording the identified content into a summary). Of the many summarization paradigms that have been identified over the years (see Mani, 2001 and Nenkova and McKeown, 2011 for a comprehensive overview), single-document summarization has consistently attracted attention Cheng and Lapata (2016); Durrett et al. (2016); Nallapati et al. (2016, 2017); See et al. (2017); Tan and Wan (2017); Narayan et al. (2017); Fan et al. (2017); Paulus et al. (2018); Pasunuru and Bansal (2018); Celikyilmaz et al. (2018); Narayan et al. (2018a, b).

Neural approaches to NLP and their ability to learn continuous features without recourse to pre-processing tools or linguistic annotations have driven the development of large-scale document summarization datasets Sandhaus (2008); Hermann et al. (2015); Grusky et al. (2018). However, these datasets often favor extractive models which create a summary by identifying (and subsequently concatenating) the most important sentences in a document Cheng and Lapata (2016); Nallapati et al. (2017); Narayan et al. (2018b). Abstractive approaches, despite being more faithful to the actual summarization task, either lag behind extractive ones or are mostly extractive, exhibiting a small degree of abstraction See et al. (2017); Tan and Wan (2017); Paulus et al. (2018); Pasunuru and Bansal (2018); Celikyilmaz et al. (2018).

In this paper we introduce extreme summarization, a new single-document summarization task which is not amenable to extractive strategies and requires an abstractive modeling approach. The idea is to create a short, one-sentence news summary answering the question “What is the article about?”. An example of a document and its extreme summary are shown in Figure 1. As can be seen, the summary is very different from a headline whose aim is to encourage readers to read the story; it draws on information interspersed in various parts of the document (not only the beginning) and displays multiple levels of abstraction including paraphrasing, fusion, synthesis, and inference. We build a dataset for the proposed task by harvesting online articles from the British Broadcasting Corporation (BBC) that often include a first-sentence summary.

We further propose a novel deep learning model which we argue is well-suited to the extreme summarization task. Unlike most existing abstractive approaches Rush et al. (2015); Chen et al. (2016); Nallapati et al. (2016); See et al. (2017); Tan and Wan (2017); Paulus et al. (2018); Pasunuru and Bansal (2018); Celikyilmaz et al. (2018) which rely on an encoder-decoder architecture modeled by recurrent neural networks (RNNs), we present a topic-conditioned neural model which is based entirely on convolutional neural networks Gehring et al. (2017b). Convolution layers capture long-range dependencies between words in the document more effectively than RNNs, allowing the model to perform document-level inference, abstraction, and paraphrasing. Our convolutional encoder associates each word with a topic vector capturing whether it is representative of the document’s content, while our convolutional decoder conditions each word prediction on a document topic vector.

Experimental results show that when evaluated automatically (in terms of ROUGE) our topic-aware convolutional model outperforms an oracle extractive system and state-of-the-art RNN-based abstractive systems. We also conduct two human evaluations in order to assess (a) which type of summary participants prefer and (b) how much key information from the document is preserved in the summary. Both evaluations overwhelmingly show that human subjects find our summaries more informative and complete. Our contributions in this work are three-fold: a new single-document summarization dataset that encourages the development of abstractive systems; analysis and empirical results showing that extractive approaches are not well-suited to the extreme summarization task; and a novel topic-aware convolutional sequence-to-sequence model for abstractive summarization.

The XSum Dataset

Our extreme summarization dataset (which we call XSum) consists of BBC articles and accompanying single sentence summaries. Specifically, each article is prefaced with an introductory sentence (aka summary) which is professionally written, typically by the author of the article. The summary bears the HTML class “story-body__introduction,” and can be easily identified and extracted from the main text body (see Figure 1 for an example summary-article pair).

We followed the methodology proposed in Hermann et al. (2015) to create a large-scale dataset for extreme summarization. Specifically, we collected 226,711 Wayback archived BBC articles ranging over almost a decade (2010 to 2017) and covering a wide variety of domains (e.g., News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts). Each article comes with a unique identifier in its URL, which we used to randomly split the dataset into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets. Table 1 compares XSum with the CNN, DailyMail, and NY Times benchmarks. As can be seen, XSum contains a substantial number of training instances, similar to DailyMail; documents and summaries in XSum are shorter in relation to other datasets but the vocabulary size is sufficiently large, comparable to CNN.
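The identifier-based split can be sketched as follows. This is a minimal illustration, not the procedure used to build XSum: the function name `split_by_id`, the fixed seed, and the synthetic identifiers are all assumptions introduced here.

```python
import random

def split_by_id(article_ids, train=0.90, valid=0.05, seed=0):
    """Shuffle unique article IDs reproducibly and cut them into three sets."""
    ids = sorted(set(article_ids))        # sort first so the result ignores input order
    random.Random(seed).shuffle(ids)      # seeded shuffle (the seed is an assumption)
    n_train = int(len(ids) * train)
    n_valid = int(len(ids) * valid)
    return (ids[:n_train],
            ids[n_train:n_train + n_valid],
            ids[n_train + n_valid:])

# Hypothetical identifiers standing in for the numeric IDs found in article URLs.
ids = [str(10_000_000 + i) for i in range(1000)]
train_ids, valid_ids, test_ids = split_by_id(ids)
print(len(train_ids), len(valid_ids), len(test_ids))
```

Splitting on the identifier rather than on article text keeps each article in exactly one partition, even if the same article was archived more than once.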

Table 2 provides empirical analysis supporting our claim that XSum is less biased toward extractive methods compared to other summarization datasets. We report the percentage of novel $n$-grams in the target gold summaries that do not appear in their source documents. There are 36% novel unigrams in the XSum reference summaries compared to 17% in CNN, 17% in DailyMail, and 23% in NY Times. This indicates that XSum summaries are more abstractive. The proportion of novel constructions grows for larger $n$-grams across datasets; however, the growth is much steeper in XSum, whose summaries exhibit approximately 83% novel bigrams, 96% novel trigrams, and 98% novel 4-grams (comparison datasets display around 47–55% new bigrams, 58–72% new trigrams, and 63–80% novel 4-grams).
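The novel $n$-gram statistic can be illustrated with a small sketch. It assumes whitespace tokenization, lowercasing, and counting over distinct $n$-grams; the counting convention behind Table 2 may differ, and the function names and example texts are hypothetical.

```python
def ngrams(tokens, n):
    """Set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(document, summary, n):
    """Percentage of distinct summary n-grams that never occur in the document."""
    doc = ngrams(document.lower().split(), n)
    summ = ngrams(summary.lower().split(), n)
    if not summ:
        return 0.0
    return 100.0 * len(summ - doc) / len(summ)

# Toy example (not taken from XSum).
document = "the cat sat on the mat while the dog slept"
summary = "a cat slept on the mat"
for n in (1, 2):
    print(n, round(novel_ngram_pct(document, summary, n), 1))
```

In the toy example only one of six summary unigrams is new, but three of five bigrams are. This mirrors the tendency for novelty to rise with $n$.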

We further evaluated two extractive methods on these datasets. lead is often used as a strong lower bound for news summarization Nenkova (2005) and creates a summary by selecting the first few sentences or words in the document. We extracted the first 3 sentences for CNN documents and the first 4 sentences for DailyMail Narayan et al. (2018b). Following previous work Durrett et al. (2016); Paulus et al. (2018), we obtained lead summaries based on the first 100 words for NY Times documents. For XSum, we selected the first sentence in the document (excluding the one-line summary) to generate the lead. Our second method, ext-oracle, can be viewed as an upper bound for extractive models Nallapati et al. (2017); Narayan et al. (2018b). It creates an oracle summary by selecting the best possible set of sentences in the document that gives the highest ROUGE Lin and Hovy (2003) with respect to the gold summary. For XSum, we simply selected the single-best sentence in the document as summary.
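The two extractive baselines for the single-sentence XSum setting can be sketched as below. The scorer `rouge1_f1` is a simplified unigram-overlap F1 used here as an assumed stand-in for the official ROUGE toolkit, and the sentences are invented for illustration.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1; a simplified stand-in for ROUGE-1, not the official scorer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lead(sentences):
    """lead baseline for XSum: the first sentence of the document."""
    return sentences[0]

def ext_oracle(sentences, gold):
    """ext-oracle for XSum: the single sentence scoring highest against the gold summary."""
    return max(sentences, key=lambda s: rouge1_f1(s, gold))

# Toy document (hypothetical), already split into sentences.
sentences = [
    "heavy rain fell overnight",
    "the bridge was closed after flooding damaged its supports",
    "drivers faced long delays",
]
gold = "flooding forces closure of damaged bridge"
print(lead(sentences))
print(ext_oracle(sentences, gold))
```

The oracle needs the gold summary at selection time, which is why it serves as an upper bound for extractive systems rather than as a usable system.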

Table 2 reports the performance of the two extractive methods using ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) with the gold summaries as reference. The lead baseline performs extremely well on CNN, DailyMail and NY Times, confirming that they are biased towards extractive methods. ext-oracle further shows that improved sentence selection would bring further performance gains to extractive approaches. Abstractive systems trained on these datasets often have a hard time beating the lead, let alone ext-oracle, or display a low degree of novelty in their summaries See et al. (2017); Tan and Wan (2017); Paulus et al. (2018); Pasunuru and Bansal (2018); Celikyilmaz et al. (2018). Interestingly, lead and ext-oracle perform poorly on XSum, underscoring the fact that it is less biased towards extractive methods.

In line with our findings, Grusky et al. (2018) have recently reported similar extractive biases in existing datasets. They constructed a new dataset called “Newsroom” which demonstrates a high diversity of summarization styles. XSum is not diverse: it focuses on a single news outlet (i.e., BBC) and a uniform summarization style (i.e., a single sentence). However, it is sufficiently large for neural network training and we hope it will spur further research towards the development of abstractive summarization models.

Convolutional Sequence-to-Sequence Learning for Summarization

Unlike tasks like machine translation and paraphrase generation where there is often a one-to-one semantic correspondence between source and target words, document summarization must distill the content of the document into a few important facts. This is even more challenging for our task, where the compression ratio is extremely high, and pertinent content can be easily missed.

Recently, a convolutional alternative to sequence modeling has been proposed showing promise for machine translation Gehring et al. (2017a, b) and story generation Fan et al. (2018). We believe that convolutional architectures are attractive for our summarization task for at least two reasons. Firstly, contrary to recurrent networks which view the input as a chain structure, convolutional networks can be stacked to represent large context sizes. Secondly, hierarchical features can be extracted over larger and larger contents, allowing long-range dependencies to be represented efficiently through shorter paths.

Our model builds on the work of Gehring et al. (2017b) who develop an encoder-decoder architecture for machine translation with an attention mechanism Sukhbaatar et al. (2015) based exclusively on deep convolutional networks. We adapt this model to our summarization task by allowing it to recognize pertinent content (i.e., by foregrounding salient words in the document). In particular, we improve the convolutional encoder by associating each word with a vector representing topic salience, and the convolutional decoder by conditioning each word prediction on the document topic vector.

At the core of our model is a simple convolutional block structure that computes intermediate states based on a fixed number of input elements. Our convolutional encoder (shown at the top of Figure 2) applies this unit across the document. We repeat these operations in a stacked fashion to get a multi-layer hierarchical representation over the input document where words at closer distances interact at lower layers while distant words interact at higher layers. The interaction between words through hierarchical layers effectively captures long-range dependencies.
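How quickly stacked blocks widen the context each word can see can be worked out with a short sketch. It assumes plain stride-1 convolutions without dilation, for which the receptive field is $1 + L(k-1)$ for $L$ layers of kernel width $k$. The function names and the example values of $L$ and $k$ are illustrative and are not the settings reported in this paper.

```python
def receptive_field(num_layers, kernel_width):
    """Input positions visible to one top-layer state (stride 1, no dilation)."""
    return 1 + num_layers * (kernel_width - 1)

def layers_needed(span, kernel_width):
    """Smallest number of stacked layers whose receptive field covers `span` tokens."""
    layers = 0
    while receptive_field(layers, kernel_width) < span:
        layers += 1
    return layers

# Hypothetical configurations, for illustration only.
for layers in (1, 2, 4, 8):
    print(layers, receptive_field(layers, kernel_width=3))
print(layers_needed(25, kernel_width=5))
```

Two words that are $d$ positions apart therefore interact after roughly $d/(k-1)$ layers, whereas a recurrent chain passes information through $d$ sequential steps.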

Analogously, our convolutional decoder (shown at the bottom of Figure 2) uses the multi-layer convolutional structure to build a hierarchical representation over what has been predicted so far. Each layer on the decoder side determines useful source context by attending to the encoder representation before it passes its output to the next layer. This way the model remembers which words it previously attended to and applies multi-hop attention (shown at the middle of Figure 2) per time step. The output of the top layer is passed to a softmax classifier to predict a distribution over the target vocabulary.

Our model assumes access to word and document topic distributions. These can be obtained by any topic model; we use Latent Dirichlet Allocation (LDA; Blei et al. 2003) in our experiments and pass the distributions obtained from LDA directly to the network as additional input. This allows us to take advantage of topic modeling without interfering with the computational advantages of the convolutional architecture. The idea of capturing document-level semantic information has been previously explored for recurrent neural networks Mikolov and Zweig (2012); Ghosh et al. (2016); Dieng et al. (2017); however, we are not aware of any existing convolutional models that do so.

Topic Sensitive Embeddings

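During encoding, each word $w_{i}$ of the input document is represented as

$$e_{i}=[(x_{i}+p_{i});(t^{\prime}_{i}\otimes t_{D})],$$

where $x_{i}$ and $p_{i}$ are the word and position embeddings of $w_{i}$, $[\cdot\,;\cdot]$ denotes concatenation, and $\otimes$ denotes point-wise multiplication. The topic distribution $t^{\prime}_{i}$ of word $w_{i}$ essentially captures how topical the word is in itself (local context), whereas the topic distribution $t_{D}$ represents the overall theme of the document (global context). The encoder essentially enriches the context of the word with its topical relevance to the document.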

For every output prediction, the decoder estimates representation $\mathbf{g}=(g_{1},\ldots,g_{n})$ for previously predicted words $(w^{\prime}_{1},\ldots,w^{\prime}_{n})$ where $g_{i}$ is:

$$g_{i}=[(x^{\prime}_{i}+p^{\prime}_{i});t_{D}].$$

$x^{\prime}_{i}$ and $p^{\prime}_{i}$ are word and position embeddings of previously predicted word $w^{\prime}_{i}$, and $t_{D}$ is the topic distribution of the input document. Note that the decoder does not use the topic distribution of $w^{\prime}_{i}$, as computing it on the fly would be expensive. However, every word prediction is conditioned on the topic of the document, encouraging the summary to have the same theme as the document.
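The construction of topic-sensitive inputs can be illustrated in plain Python. The sketch assumes that word and position embeddings are summed and that the topic component is concatenated to the result. The helper names, vector sizes, and numbers are hypothetical, and hand-written lists stand in for learned embeddings and LDA output.

```python
def encoder_input(x_i, p_i, t_word_i, t_doc):
    """Encoder side: (word + position) embedding, concatenated with the
    point-wise product of the word and document topic distributions."""
    summed = [a + b for a, b in zip(x_i, p_i)]
    topical = [a * b for a, b in zip(t_word_i, t_doc)]
    return summed + topical            # list concatenation

def decoder_input(x_prev, p_prev, t_doc):
    """Decoder side: (word + position) embedding of a previously predicted
    word, concatenated with the document topic distribution only."""
    summed = [a + b for a, b in zip(x_prev, p_prev)]
    return summed + list(t_doc)

# Toy values: 2-dimensional embeddings and 3 topics (all hypothetical).
t_doc = [0.5, 0.4, 0.1]
e_i = encoder_input([0.2, -0.1], [0.1, 0.3], [0.7, 0.2, 0.1], t_doc)
g_i = decoder_input([0.4, 0.0], [0.1, 0.1], t_doc)
print(len(e_i), len(g_i))
```

Both vectors have the same length (embedding size plus number of topics), so encoder and decoder blocks can share one input dimensionality.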

Multi-layer Convolutional Structure

Multi-hop Attention

The attention mechanism described here performs multiple attention “hops” per time step and considers which words have been previously attended to. It is therefore different from single-step attention in recurrent neural networks Bahdanau et al. (2015), where the attention and weighted sum are computed over $\mathbf{z^{u}}$ only.

We use layer normalization and weight initialization to stabilize learning.

Our topic-enhanced model calibrates long-range dependencies with globally salient content. As a result, it provides a better alternative to vanilla convolutional sequence models Gehring et al. (2017b) and RNN-based summarization models See et al. (2017) for capturing cross-document inferences and paraphrasing. At the same time it retains the computational advantages of convolutional models. Each convolution block operates over a fixed-size window of the input sequence, allowing for simultaneous encoding of the input and ease in learning due to the fixed number of non-linearities and transformations for words in the input sequence.

Experimental Setup

In this section we present our experimental setup for assessing the performance of our Topic-aware Convolutional Sequence to Sequence model which we call T-ConvS2S for short. We discuss implementation details and present the systems used for comparison with our approach.

We report results with various systems which were all trained on the XSum dataset to generate a one-line summary given an input news article. We compared T-ConvS2S against three extractive systems: a baseline which randomly selects a sentence from the input document (random), a baseline which simply selects the leading sentence from the document (lead), and an oracle which selects a single-best sentence in each document (ext-oracle). The latter is often used as an upper bound for extractive methods. We also compared our model against the RNN-based abstractive systems introduced by See et al. (2017). In particular, we experimented with an attention-based sequence to sequence model (Seq2Seq), a pointer-generator model which can copy words from the source text (PtGen), and a pointer-generator model with a coverage mechanism to keep track of words that have been summarized (PtGen-Covg). Finally, we compared our model against the vanilla convolution sequence to sequence model (ConvS2S) of Gehring et al. (2017b).

Model Parameters and Optimization

We did not anonymize entities but worked on a lowercased version of the XSum dataset. During training and at test time the input document was truncated to 400 tokens and the length of the summary limited to 90 tokens.

The LDA model Blei et al. (2003) was trained on XSum documents (training portion). We therefore obtained for each word a probability distribution over topics which we used to estimate $\mathbf{t^{\prime}}$ ; the topic distribution $t_{D}$ can be inferred for any new document, at training and test time. We explored several LDA configurations on held-out data, and obtained best results with 512 topics. Table 3 shows some of the topics learned by the LDA model.

For Seq2Seq, PtGen and PtGen-Covg, we used the best settings reported on the CNN and DailyMail data See et al. (2017). We used the code available at https://github.com/abisee/pointer-generator. All three models had 256 dimensional hidden states and 128 dimensional word embeddings. They were trained using Adagrad Duchi et al. (2011) with learning rate 0.15 and an initial accumulator value of 0.1. We used gradient clipping with a maximum gradient norm of 2, but did not use any form of regularization. We used the loss on the validation set to implement early stopping.

For ConvS2S (we used the code available at https://github.com/facebookresearch/fairseq-py) and T-ConvS2S, we used 512 dimensional hidden states and 512 dimensional word and position embeddings. We trained our convolutional models with Nesterov’s accelerated gradient method Sutskever et al. (2013) using a momentum value of 0.99 and renormalized gradients if their norm exceeded 0.1 Pascanu et al. (2013). We used a learning rate of 0.10 and once the validation perplexity stopped improving, we reduced the learning rate by an order of magnitude after each epoch until it fell below $10^{-4}$. We also applied a dropout of 0.2 to the embeddings, the decoder outputs and the input of the convolutional blocks. Gradients were normalized by the number of non-padding tokens per mini-batch. We also used weight normalization for all layers except for lookup tables.
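The learning-rate annealing described above can be sketched as a small simulation. This is an illustrative reading of the schedule, not the fairseq implementation: the function name `lr_schedule`, the assumption that training stops once the rate drops below the threshold, and the perplexity values are all hypothetical.

```python
def lr_schedule(val_perplexities, lr=0.10, min_lr=1e-4, shrink=10.0):
    """Return the learning rate used in each epoch.

    Once validation perplexity fails to improve, the rate is divided by
    `shrink` after every epoch; training is assumed to stop as soon as
    the rate falls below `min_lr`.
    """
    history = []
    best = float("inf")
    annealing = False
    for ppl in val_perplexities:
        history.append(lr)          # rate used for this epoch
        if ppl < best:
            best = ppl
        else:
            annealing = True        # perplexity stopped improving
        if annealing:
            lr = lr / shrink
            if lr < min_lr:
                break               # assumed stopping criterion
    return history

# Hypothetical validation perplexities, one per epoch.
ppls = [30.0, 25.0, 24.0, 24.5, 24.2, 24.1, 24.0, 24.0, 24.0, 24.0]
print(lr_schedule(ppls))
```

In this reading, annealing is never undone once it starts, so the number of remaining epochs is bounded by how many tenfold reductions fit between the initial rate and the threshold.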

All neural models, including ours and those based on RNNs See et al. (2017) had a vocabulary of 50,000 words and were trained on a single Nvidia M40 GPU with a batch size of 32 sentences. Summaries at test time were obtained using beam search (with beam size 10).

Results

We report results using automatic metrics in Table 4. We evaluated summarization quality using F1 ROUGE Lin and Hovy (2003). Unigram and bigram overlap (ROUGE-1 and ROUGE-2) are a proxy for assessing informativeness and the longest common subsequence (ROUGE-L) represents fluency. We used pyrouge to compute all ROUGE scores, with parameters “-a -c 95 -m -n 4 -w 1.2”.
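To make the longest-common-subsequence idea behind ROUGE-L concrete, here is a simplified sentence-level sketch. It assumes whitespace tokenization and lowercasing and omits the stemming and other options of the official toolkit, so its numbers will not match pyrouge output. The function names and example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Sentence-level LCS-based F1; a simplified stand-in for ROUGE-L."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Toy pair (not from XSum): the LCS is "the bridge closed after flooding".
print(rouge_l_f1("the bridge was closed after flooding",
                 "the bridge closed after heavy flooding"))
```

Because the subsequence must preserve word order but not adjacency, this score rewards summaries that keep the reference's ordering even when words are inserted or dropped.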

On the XSum dataset, Seq2Seq outperforms the lead and random baselines by a large margin. PtGen, a Seq2Seq model with a “copying” mechanism outperforms ext-oracle, a “perfect” extractive system on ROUGE-2 and ROUGE-L. This is in sharp contrast to the performance of these models on CNN/DailyMail See et al. (2017) and Newsroom datasets Grusky et al. (2018), where they fail to outperform the lead. The result provides further evidence that XSum is a good testbed for abstractive summarization. PtGen-Covg, the best performing abstractive system on the CNN/DailyMail datasets, does not do well. We believe that the coverage mechanism is more useful when generating multi-line summaries and is basically redundant for extreme summarization.

ConvS2S, the convolutional variant of Seq2Seq, significantly outperforms all RNN-based abstractive systems. We hypothesize that its superior performance stems from the ability to better represent document content (i.e., by capturing long-range dependencies). Table 4 shows several variants of T-ConvS2S including an encoder network enriched with information about how topical a word is on its own (enc$_{t^{\prime}}$) or in the document (enc$_{(t^{\prime},t_{D})}$). We also experimented with various decoders by conditioning every prediction on the topic of the document, basically encouraging the summary to be in the same theme as the document (dec$_{t_{D}}$) or letting the decoder decide the theme of the summary. Interestingly, all four T-ConvS2S variants outperform ConvS2S. T-ConvS2S performs best when both encoder and decoder are constrained by the document topic (enc$_{(t^{\prime},t_{D})}$, dec$_{t_{D}}$). In the remainder of the paper, we refer to this variant as T-ConvS2S.

We further assessed the extent to which various models are able to perform rewriting by generating genuinely abstractive summaries. Table 5 shows the proportion of novel $n$-grams for lead, ext-oracle, PtGen, ConvS2S, and T-ConvS2S. As can be seen, the convolutional models exhibit the highest proportion of novel $n$-grams. We should also point out that the summaries being evaluated have on average comparable lengths; the summaries generated by PtGen contain 22.57 words, those generated by ConvS2S and T-ConvS2S have 20.07 and 20.22 words, respectively, while gold summaries are the longest with 23.26 words. Interestingly, PtGen trained on XSum only copies 4% of 4-grams in the source document, 10% of trigrams, 27% of bigrams, and 73% of unigrams. This is in sharp contrast to PtGen trained on CNN/DailyMail exhibiting mostly extractive patterns; it copies more than 85% of 4-grams in the source document, 90% of trigrams, 95% of bigrams, and 99% of unigrams See et al. (2017). This result further strengthens our hypothesis that XSum is a good testbed for abstractive methods.

Human Evaluation

In addition to automatic evaluation using ROUGE which can be misleading when used as the only means to assess the informativeness of summaries Schluter (2017), we also evaluated system output by eliciting human judgments in two ways.

In our first experiment, participants were asked to compare summaries produced from the ext-oracle baseline, PtGen, the best performing system of See et al. (2017), ConvS2S, our topic-aware model T-ConvS2S, and the human-authored gold summary (gold). We did not include extracts from the lead as they were significantly inferior to other models.

The study was conducted on the Amazon Mechanical Turk platform using Best-Worst Scaling (BWS; Louviere and Woodworth 1991; Louviere et al. 2015), a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales Kiritchenko and Mohammad (2017). Participants were presented with a document and summaries generated from two out of five systems and were asked to decide which summary was better and which one was worse in order of informativeness (does the summary capture important information in the document?) and fluency (is the summary written in well-formed English?). Examples of system summaries are shown in Table 6. We randomly selected 50 documents from the XSum test set and compared all possible combinations of two out of five systems for each document. We collected judgments from three different participants for each comparison. The order of summaries was randomized per document and the order of documents per participant.

The score of a system was computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst. The scores range from -1 (worst) to 1 (best) and are shown in Table 7. Perhaps unsurprisingly, human-authored summaries were considered best, whereas T-ConvS2S was ranked 2nd followed by ext-oracle and ConvS2S. PtGen was ranked worst with the lowest score of $-0.218$. We carried out pairwise comparisons between all models to assess whether system differences are statistically significant. gold is significantly different from all other systems and T-ConvS2S is significantly different from ConvS2S and PtGen (using a one-way ANOVA with posthoc Tukey HSD tests; $p<0.01$). All other differences are not statistically significant.
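The Best-Worst Scaling score and the size of the study can be sketched as follows. The helper `bws_scores` and the toy judgments over systems "A", "B" and "C" are hypothetical, and the sketch normalizes by the number of times a system was shown, which is one reasonable reading of "percentage of times".

```python
from collections import Counter
from itertools import combinations

def bws_scores(judgments, systems):
    """judgments: iterable of (best_system, worst_system) pairs, one per comparison.
    Score = fraction of appearances chosen best minus fraction chosen worst."""
    best, worst, seen = Counter(), Counter(), Counter()
    for b, w in judgments:
        best[b] += 1
        worst[w] += 1
        seen[b] += 1
        seen[w] += 1
    return {s: (best[s] - worst[s]) / seen[s] if seen[s] else 0.0 for s in systems}

# Size of the elicitation study: all pairs of five systems, 50 documents, 3 judges each.
systems = ["ext-oracle", "PtGen", "ConvS2S", "T-ConvS2S", "gold"]
pairs = list(combinations(systems, 2))
n_judgments = len(pairs) * 50 * 3
print(len(pairs), n_judgments)

# Toy judgments (invented; not the study's data).
toy = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
print(bws_scores(toy, ["A", "B", "C"]))
```

A system that is always preferred scores 1, one that always loses scores -1, and a system chosen best and worst equally often scores 0.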

For our second experiment we used a question-answering (QA) paradigm Clarke and Lapata (2010); Narayan et al. (2018b) to assess the degree to which the models retain key information from the document. We used the same 50 documents as in our first elicitation study. We wrote two fact-based questions per document, just by reading the summary, under the assumption that it highlights the most important content of the news article. Questions were formulated so as not to reveal answers to subsequent questions. We created 100 questions in total (see Table 6 for examples). Participants read the output summaries and answered the questions as best they could without access to the document or the gold summary. The more questions can be answered, the better the corresponding system is at summarizing the document as a whole. Five participants answered questions for each summary.

We followed the scoring mechanism introduced in Clarke and Lapata (2010). A correct answer was marked with a score of one, partially correct answers with a score of 0.5, and zero otherwise. The final score for a system is the average of all its question scores. Answers again were elicited using Amazon’s Mechanical Turk crowdsourcing platform. We uploaded the data in batches (one system at a time) to ensure that the same participant does not evaluate summaries from different systems on the same set of questions.
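The scoring rule can be written down directly. The label names and the example answers below are hypothetical, and the sketch returns the average as a fraction rather than as a percentage.

```python
# Score assigned to each answer label (label strings are an assumption of this sketch).
ANSWER_SCORES = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}

def qa_score(labels):
    """Average score over all judged answers for one system."""
    if not labels:
        return 0.0
    return sum(ANSWER_SCORES[label] for label in labels) / len(labels)

# Toy example: four judged answers for one system (invented data).
print(qa_score(["correct", "partial", "wrong", "correct"]))
```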

Table 7 shows the results of the QA evaluation. Based on summaries generated by T-ConvS2S, participants can answer $46.05\%$ of the questions correctly. Summaries generated by ConvS2S, PtGen and ext-oracle provide answers to $30.90\%$, $21.40\%$, and $15.70\%$ of the questions, respectively. Pairwise differences between systems are all statistically significant ($p<0.01$) with the exception of PtGen and ext-oracle. ext-oracle performs poorly on both QA and rating evaluations. The examples in Table 6 indicate that ext-oracle is often misled by selecting a sentence with the highest ROUGE (against the gold summary), but ROUGE itself does not ensure that the summary retains the most important information from the document. The QA evaluation further emphasizes that in order for the summary to be felicitous, information needs to be embedded in the appropriate context. For example, ConvS2S and PtGen will fail to answer the question “Who has resigned?” (see Table 6 second block) despite containing the correct answer “Dick Advocaat” due to the wrong context. T-ConvS2S is able to extract important entities from the document with the right theme.

Conclusions

In this paper we introduced the task of “extreme summarization” together with a large-scale dataset which pushes the boundaries of abstractive methods. Experimental evaluation revealed that models which have abstractive capabilities do better on this task and that high-level document knowledge in terms of topics and long-range dependencies is critical for recognizing pertinent content and generating informative summaries. In the future, we would like to create more linguistically-aware encoders and decoders incorporating co-reference and entity linking.

We gratefully acknowledge the support of the European Research Council (Lapata; award number 681760), the European Union under the Horizon 2020 SUMMA project (Narayan, Cohen; grant agreement 688139), and Huawei Technologies (Cohen).