A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, Nazli Goharian

Introduction

Existing large-scale summarization datasets consist of relatively short documents. For example, articles in the CNN/Daily Mail dataset Hermann et al. (2015) are on average about 600 words long. Similarly, existing neural summarization models have focused on summarizing sentences and short documents. In this work, we propose a model for effective abstractive summarization of longer documents. Scientific papers are an example of documents that are significantly longer than news articles (see Table 1). They also follow a standard discourse structure describing the problem, methodology, experiments/results, and finally conclusions Suppe (1998).

Most summarization works in the literature focus on extractive summarization. Examples of prominent approaches include frequency-based methods Vanderwende et al. (2007), graph-based methods Erkan and Radev (2004), topic modeling Steinberger and Jezek (2004), and neural models Nallapati et al. (2017). Abstractive summarization is an alternative approach where the generated summary may contain novel words and phrases and is more similar to how humans summarize documents Jing (2002). Recently, neural methods have led to encouraging results in abstractive summarization Nallapati et al. (2016); See et al. (2017); Paulus et al. (2017); Li et al. (2017). These approaches employ a general framework of sequence-to-sequence (seq2seq) models Sutskever et al. (2014) where the document is fed to an encoder network and another (recurrent) network learns to decode the summary. While promising, these methods focus on summarizing news articles which are relatively short. Many other document types, however, are longer and structured. Seq2seq models tend to struggle with longer sequences because at each decoding step, the decoder needs to learn to construct a context vector capturing relevant information from all the tokens in the source sequence Shao et al. (2017).

Our main contribution is an abstractive model for summarizing scientific papers, which are an example of long-form structured document types. Our model includes a hierarchical encoder that captures the discourse structure of the document and a discourse-aware decoder that generates the summary. Our decoder attends to different discourse sections, which allows the model to represent important information from the source more accurately and results in a better context vector. We also introduce two large-scale datasets of long and structured scientific papers obtained from arXiv and PubMed to support both training and evaluating models on the task of long document summarization. Evaluation results show that our method outperforms state-of-the-art summarization models. Data and code are available at https://github.com/acohan/long-summarization.

Background

In the seq2seq framework for abstractive summarization, an input document $\mathbf{x}$ is encoded using a Recurrent Neural Network (RNN) with $\mathbf{h}_{i}^{(e)}$ being the hidden state of the encoder at timestep $i$ . The last step of the encoder is fed as input to another RNN which decodes the output one token at a time. Given an input document along with the corresponding ground-truth summary $\mathbf{y}$ , the model is trained to output a summary $\hat{\mathbf{y}}$ that is close to $\mathbf{y}$ . The output at timestep $t$ is predicted using the decoder input $\mathbf{x}^{\prime}_{t}$ , decoder hidden state $\mathbf{h}_{t}^{(d)}$ , and some information about the input sequence. This framework is the general seq2seq framework employed in many generation tasks including machine translation Sutskever et al. (2014); Bahdanau et al. (2014) and summarization Nallapati et al. (2016); Chopra et al. (2016).

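The attention mechanism maps the decoder state and the encoder states to an output vector, which is a weighted sum of the encoder states and is called the context vector Bahdanau et al. (2014). Incorporating this context vector at each decoding timestep (attentive decoding) has proven effective in seq2seq models. Formally, the context vector $\mathbf{c}_{t}$ is defined as: $\mathbf{c}_{t}=\sum_{i=1}^{N}\alpha^{(t)}_{i}\mathbf{h}_{i}^{(e)}$ where $\alpha^{(t)}_{i}$ are the attention weights calculated as follows:

$$\alpha^{(t)}_{i}=\operatorname{softmax}\big(\operatorname{score}(\mathbf{h}_{i}^{(e)},\mathbf{h}_{t-1}^{(d)})\big) \tag{1}$$

$$\operatorname{score}(\mathbf{h}_{i}^{(e)},\mathbf{h}_{t-1}^{(d)})=\mathbf{v}_{a}^{\top}\tanh\big(\operatorname{linear}(\mathbf{h}_{i}^{(e)},\mathbf{h}_{t-1}^{(d)})\big) \tag{2}$$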

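where $\mathbf{v}_{a}$ is a weight vector and $\operatorname{linear}$ is a linear mapping function. I.e.,

$$\operatorname{linear}(\mathbf{x}_{1},\mathbf{x}_{2})=\mathbf{W}_{1}\mathbf{x}_{1}+\mathbf{W}_{2}\mathbf{x}_{2}+\mathbf{b}$$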

where $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are weight matrices and $\mathbf{b}$ is the bias vector.
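As an illustration of the attentive decoding step described above, the following is a minimal sketch in plain Python, not the released implementation. The function names (`softmax`, `matvec`, `score`, `attend`), the toy dimensions, and the random initialization are assumptions made for the example; the score is the additive form with a weight vector, a tanh nonlinearity, and a linear mapping of the encoder and decoder states.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def linear(x1, x2, W1, W2, b):
    """Linear mapping W1 x1 + W2 x2 + b."""
    return [p + q + r for p, q, r in zip(matvec(W1, x1), matvec(W2, x2), b)]


def score(h_enc, h_dec, W1, W2, b, v_a):
    """Additive attention score: v_a . tanh(linear(h_enc, h_dec))."""
    hidden = [math.tanh(u) for u in linear(h_enc, h_dec, W1, W2, b)]
    return sum(v * u for v, u in zip(v_a, hidden))


def attend(enc_states, h_dec, W1, W2, b, v_a):
    """Return attention weights and the context vector (weighted sum of encoder states)."""
    alpha = softmax([score(h, h_dec, W1, W2, b, v_a) for h in enc_states])
    dim = len(enc_states[0])
    context = [sum(a * h[k] for a, h in zip(alpha, enc_states)) for k in range(dim)]
    return alpha, context


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
enc_states = rand_matrix(5, 4, rng)                  # 5 encoder states of size 4
h_dec = [rng.uniform(-1.0, 1.0) for _ in range(3)]   # previous decoder state of size 3
W1, W2 = rand_matrix(6, 4, rng), rand_matrix(6, 3, rng)
b = [0.0] * 6
v_a = [rng.uniform(-1.0, 1.0) for _ in range(6)]

alpha, context = attend(enc_states, h_dec, W1, W2, b, v_a)
print(alpha, context)
```

The weights `alpha` form a probability distribution over encoder positions, so the context vector is a convex combination of the encoder states.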

Model

We now describe our discourse-aware summarization model (shown in Figure 1).

Our encoder extends the RNN encoder to a hierarchical RNN that captures the document discourse structure. We first encode each discourse section and then encode the document. Formally, we encode the document as a vector $\mathbf{d}$ according to the following:

$$\mathbf{d}=\operatorname{RNN}_{doc}\big(\{\mathbf{h}_{1}^{(s)},\ldots,\mathbf{h}_{N}^{(s)}\}\big)$$

$\operatorname{RNN}(.)$ denotes a function which is a recurrent neural network whose output is the final state of the network encoding the entire sequence. $N$ is the number of sections in the document and $\mathbf{h}_{j}^{(s)}$ is the representation of section $j$ in the document, which consists of a sequence of tokens:

$$\mathbf{h}_{j}^{(s)}=\operatorname{RNN}_{sec}\big(\{\mathbf{x}_{(j,1)},\ldots,\mathbf{x}_{(j,M)}\}\big)$$

where $\mathbf{x}_{(j,i)}$ are dense embeddings corresponding to the tokens $w_{(j,i)}$ and $M$ is the maximum section length. The parameters of $\operatorname{RNN}_{sec}$ are shared for all the discourse sections. We use a single layer bidirectional LSTM (following the LSTM formulation of Graves et al. (2013)) for both $\operatorname{RNN}_{doc}$ and $\operatorname{RNN}_{sec}$; further extension to multilayer LSTMs is straightforward. We combine the forward and backward LSTM states to a single state using a simple feed-forward network:

$$\mathbf{h}=\operatorname{relu}\big(\mathbf{W}[\overrightarrow{\mathbf{h}},\overleftarrow{\mathbf{h}}]+\mathbf{b}\big)$$

where $[\cdot,\cdot]$ shows the concatenation operation. Throughout, when we mention the RNN (LSTM) state, we are referring to this combined state of both forward and backward RNNs (LSTMs).
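The two-level encoding can be sketched as follows. This is a simplified illustration under stated assumptions: a unidirectional tanh RNN stands in for the bidirectional LSTM, the helper names (`rnn_final_state`, `encode_document`, `rand_matrix`) and the toy sizes are hypothetical, and the forward/backward combination step is omitted. What the sketch does preserve is the structure: one section-level encoder whose parameters are shared across sections, followed by a document-level encoder over the section representations.

```python
import math
import random


def rnn_final_state(inputs, W_in, W_rec):
    """Toy tanh RNN (stand-in for an LSTM); returns the final hidden state."""
    hidden = [0.0] * len(W_rec)
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(W_in[k], x))
                + sum(w * hj for w, hj in zip(W_rec[k], hidden))
            )
            for k in range(len(hidden))
        ]
    return hidden


def encode_document(sections, sec_params, doc_params):
    """Encode each section with shared parameters, then encode the section vectors."""
    section_vectors = [rnn_final_state(tokens, *sec_params) for tokens in sections]
    doc_vector = rnn_final_state(section_vectors, *doc_params)
    return section_vectors, doc_vector


def rand_matrix(rows, cols, rng):
    """Hypothetical helper: random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
emb_dim, hid_dim = 3, 4
sec_params = (rand_matrix(hid_dim, emb_dim, rng), rand_matrix(hid_dim, hid_dim, rng))
doc_params = (rand_matrix(hid_dim, hid_dim, rng), rand_matrix(hid_dim, hid_dim, rng))

# Two sections holding 5 and 2 token embeddings respectively.
sections = [rand_matrix(5, emb_dim, rng), rand_matrix(2, emb_dim, rng)]
section_vectors, doc_vector = encode_document(sections, sec_params, doc_params)
print(len(section_vectors), doc_vector)
```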

Discourse-aware decoder

When humans summarize a long structured document, depending on the domain and the nature of the document, they write about important points from different discourse sections of the document. For example, scientific paper abstracts typically include the description of the problem, discussion of the methods, and finally results and conclusions Suppe (1998). Motivated by this observation, we propose a discourse-aware attention method. Intuitively, at each decoding timestep, in addition to the words in the document, we also attend to the relevant discourse section (the “section attention” block in Figure 1). Then we use the discourse-related information to modify the word-level attention function. Specifically, the context vector representing the source document is:

$$\mathbf{c}_{t}=\sum_{j=1}^{N}\sum_{i=1}^{M}\alpha_{(j,i)}^{(t)}\mathbf{h}_{(j,i)}^{(e)}$$

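where $\mathbf{h}_{(j,i)}^{(e)}$ shows the encoder state of word $i$ in discourse section $j$ and $\alpha_{(j,i)}^{(t)}$ shows the corresponding attention weight to that encoder state. The scalar weights $\alpha_{(j,i)}^{(t)}$ are obtained according to:

$$\alpha_{(j,i)}^{(t)}=\operatorname{softmax}\big(\beta_{j}^{(t)}\operatorname{score}(\mathbf{h}_{(j,i)}^{(e)},\mathbf{h}_{t-1}^{(d)})\big)$$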

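The $\operatorname{score}$ function is the additive attention function (Equation 2) and the weights $\beta^{(t)}_{j}$ are updated according to:

$$\beta_{j}^{(t)}=\operatorname{softmax}\big(\operatorname{score}(\mathbf{h}_{j}^{(s)},\mathbf{h}_{t-1}^{(d)})\big)$$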

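At each timestep $t$, the decoder state $\mathbf{h}_{t}^{(d)}$ and the context vector $\mathbf{c}_{t}$ are used to estimate the probability distribution of the next word $y_{t}$:

$$p(y_{t}\mid y_{1:t-1})=\operatorname{softmax}\big(\mathbf{V}^{\top}\operatorname{linear}(\mathbf{h}_{t}^{(d)},\mathbf{c}_{t})\big)$$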

where $\mathbf{V}$ is a vocabulary weight matrix and $\operatorname{softmax}$ is over the entire vocabulary.
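The interaction between section-level and word-level attention can be illustrated with a small sketch. It assumes that the raw scores have already been computed by the additive score function, that each word score is multiplied by its section weight, and that the word-level softmax runs over all words in the document; the function name `discourse_attention` and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def discourse_attention(word_scores, section_scores, enc_states):
    """Scale word scores by section weights, normalize, and build the context vector.

    word_scores[j][i]  : raw score of word i in section j
    section_scores[j]  : raw score of section j
    enc_states[j][i]   : encoder state (list of floats) of word i in section j
    """
    beta = softmax(section_scores)
    # Assumption: the softmax is taken over every word in the document.
    flat = [beta[j] * s for j, sec in enumerate(word_scores) for s in sec]
    flat_alpha = softmax(flat)

    alpha, start = [], 0
    for sec in word_scores:
        alpha.append(flat_alpha[start:start + len(sec)])
        start += len(sec)

    dim = len(enc_states[0][0])
    context = [0.0] * dim
    for a_sec, h_sec in zip(alpha, enc_states):
        for a, h in zip(a_sec, h_sec):
            for k in range(dim):
                context[k] += a * h[k]
    return beta, alpha, context


# Two sections of two words each; all word scores are equal, so any
# difference in the final weights comes from the section attention alone.
word_scores = [[1.0, 1.0], [1.0, 1.0]]
section_scores = [2.0, 0.0]
enc_states = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]

beta, alpha, context = discourse_attention(word_scores, section_scores, enc_states)
print(beta, alpha, context)
```

In this toy input the first section receives the larger section weight, so its words dominate the context vector even though the word-level scores are identical.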

Copying from source

There has been a surge of recent works in sequence learning tasks to address the problem of unknown token prediction by allowing the model to occasionally copy words directly from source instead of generating a new token Gu et al. (2016); See et al. (2017); Paulus et al. (2017); Wiseman et al. (2017). Following these works, we add an additional binary variable $z_{t}$ to the decoder, indicating generating a word from vocabulary ($z_{t}=0$) or copying a word from the source ($z_{t}=1$). The probability is learnt during training according to the following equation:

$$p(z_{t}=1\mid y_{1:t-1})=\sigma\big(\operatorname{linear}(\mathbf{h}_{t}^{(d)},\mathbf{c}_{t},\mathbf{x}^{\prime}_{t})\big)$$

Then the next word $y_{t}$ is generated according to:

$$p(y_{t}\mid y_{1:t-1})=\sum_{z}p(y_{t},z_{t}=z\mid y_{1:t-1}),\quad z\in\{0,1\}$$
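A minimal sketch of how a copy switch can combine the two word sources is given below. It follows the pointer-generator style of mixing used in the cited works and is an assumption rather than the exact formulation of this model: the copy probability of a word is taken to be the total attention on its occurrences in the source, and the function name `mix_copy_and_generate` and the example values are hypothetical.

```python
def mix_copy_and_generate(p_gen_vocab, attention, source_tokens, p_copy):
    """Marginalize over the switch z_t.

    p_gen_vocab   : dict mapping vocabulary words to generation probabilities
    attention     : attention weight for each source token
    source_tokens : source tokens aligned with `attention`
    p_copy        : probability of copying, p(z_t = 1)
    """
    # z_t = 0: generate from the vocabulary.
    final = {word: (1.0 - p_copy) * p for word, p in p_gen_vocab.items()}
    # z_t = 1: copy from the source (assumed proportional to attention).
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + p_copy * weight
    return final


p_gen_vocab = {"the": 0.5, "model": 0.3, "<unk>": 0.2}
source_tokens = ["lexrank", "model", "lexrank"]   # "lexrank" is out of vocabulary
attention = [0.6, 0.1, 0.3]

final = mix_copy_and_generate(p_gen_vocab, attention, source_tokens, p_copy=0.4)
print(final)
```

The out-of-vocabulary token `lexrank` receives probability mass only through the copy path, which is how copying addresses unknown token prediction.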

Decoder coverage

The coverage implicitly includes information about the attended document discourse sections. We incorporate the decoder coverage as an additional input to the attention function:
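The idea of feeding coverage into the attention function can be sketched as follows. The sketch makes several assumptions that are not stated in the text: coverage is taken to be the running sum of past attention weights (as in See et al. (2017)), states are reduced to precomputed scalar features, and coverage enters the score through a single hypothetical weight `w_cov`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_coverage(enc_feats, dec_feat, coverage, w_cov):
    """Attention whose score takes the accumulated coverage as an extra input.

    enc_feats : precomputed scalar encoder features, one per source position
    dec_feat  : precomputed scalar decoder feature
    coverage  : attention mass each position has received so far
    w_cov     : hypothetical scalar weight on the coverage input
    """
    scores = [math.tanh(e + dec_feat + w_cov * c) for e, c in zip(enc_feats, coverage)]
    return softmax(scores)


enc_feats = [2.0, 0.0, 0.0]
coverage = [0.0, 0.0, 0.0]
history = []
for _ in range(3):
    alpha = attention_with_coverage(enc_feats, 0.0, coverage, w_cov=-3.0)
    history.append(alpha)
    # Assumed coverage update: accumulate the attention just produced.
    coverage = [c + a for c, a in zip(coverage, alpha)]

print(history)
print(coverage)
```

With a negative `w_cov`, positions that have already received attention score lower at later steps, which discourages the decoder from attending to the same content repeatedly.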

Related work

Neural abstractive summarization models have been studied in the past Rush et al. (2015); Chopra et al. (2016); Nallapati et al. (2016) and later extended by source copying Miao and Blunsom (2016); See et al. (2017), reinforcement learning Paulus et al. (2017), and sentence salience information Li et al. (2017). One model variant of Nallapati et al. (2016) is related to our model in using sentence-level information in attention. However, our model is different as it contains a hierarchical encoder, uses discourse sections in the decoding step, and has a coverage mechanism. Similarly, Ling and Rush (2017) proposed a coarse-to-fine attention model that uses hard attention to find the text chunks of importance and then attends only to words in that chunk. In contrast, we consider all the discourse sections using soft attention. The closest models to ours are those of See et al. (2017) and Paulus et al. (2017), who used a joint pointer-generator network for summarization. However, our model extends theirs by (i) a hierarchical encoder for modeling long documents and (ii) a discourse-aware decoder that captures the information flow from all discourse sections of the document. Finally, in a recent work, Liu et al. (2018) proposed a model based on the transformer network Vaswani et al. (2017) for abstractive generation of Wikipedia articles. However, their focus is on multi-document summarization.

Our datasets are obtained from scientific papers. Scientific document summarization has recently received extended attention Qazvinian et al. (2013); Cohan and Goharian (2015, 2017b, 2017a). In contrast to ours, existing approaches are extractive and rely on external information such as citations, which may not be available for all papers.

Data

Seq2seq models typically have a large number of parameters and thus they require large training data with ground truth summaries. Researchers have constructed such training data from news articles (e.g., CNN, Daily Mail and New York Times articles), where the abstracts or highlights of news articles are considered as ground truth summaries Nallapati et al. (2016); Paulus et al. (2017). However, news articles are relatively short and not suitable for the task of long-form document summarization. Following these works, we take scientific papers as an example of long documents with discourse information, where their abstracts can be used as ground-truth summaries. We introduce two datasets collected from scientific repositories, arXiv.org and PubMed.com.

The choice of scientific papers for our dataset is motivated by the fact that scientific papers are examples of long documents that follow a standard discourse structure and they already come with ground truth summaries, making it possible to train supervised neural models. We follow existing work in constructing large-scale summarization datasets that take news article abstracts as ground truth.

We remove the documents that are excessively long (e.g., theses) or too short (e.g., tutorial announcements), or that do not have an abstract or discourse structure. We use the level-1 section headings as the discourse information. For arXiv, we use the LaTeX files and convert them to plain text using Pandoc (https://pandoc.org) to preserve the discourse section information. We remove figures and tables using regular expressions to only preserve the textual information. We also normalize math formulas and citation markers with special tokens. We analyze the document section names and identify the most common concluding section names (e.g., conclusion, concluding remarks, summary). We keep only the sections up to and including the conclusion section of the document and remove the sections after it.
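The truncation at the concluding section can be sketched as follows. The pattern, the function name `keep_up_to_conclusion`, and the example headings are illustrative assumptions; the actual list of concluding section names used for the datasets is not reproduced here.

```python
import re

# Assumed pattern: optional section numbering, then a common concluding heading.
CONCLUSION_PATTERN = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?(?:conclusions?|concluding remarks|summary)\b",
    re.IGNORECASE,
)


def keep_up_to_conclusion(sections):
    """Keep sections up to and including the first concluding section.

    sections: list of (heading, text) pairs in document order.
    If no concluding heading is found, all sections are kept.
    """
    kept = []
    for heading, text in sections:
        kept.append((heading, text))
        if CONCLUSION_PATTERN.match(heading):
            break
    return kept


paper = [
    ("Introduction", "..."),
    ("Methods", "..."),
    ("Conclusions and future work", "..."),
    ("Acknowledgements", "..."),
    ("Appendix", "..."),
]
kept_headings = [heading for heading, _ in keep_up_to_conclusion(paper)]
print(kept_headings)
```

A simple prefix match like this one would also stop at a heading such as "Summary of notation", so a real pipeline would likely need a curated list of heading names.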

The statistics of our datasets are shown in Table 1. In our datasets, both document and summary lengths are significantly larger than those of the existing large-scale summarization datasets. We retain about 3% (5%) of PubMed (arXiv) as validation data and about another 3% (5%) for test; the rest is used for training.

Experiments

Similar to the majority of published research in the summarization literature Chopra et al. (2016); Nallapati et al. (2016); See et al. (2017), evaluation was done using the Rouge automatic summarization evaluation metric Lin (2004) with full-length F-1 Rouge scores. We lowercase all tokens and perform sentence and word tokenization using spaCy Honnibal and Johnson (2015).
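For readers unfamiliar with the metric, the core of a Rouge-1 F-1 score is a unigram-overlap computation, sketched below. This is a simplified illustration and not the official Rouge toolkit: it assumes whitespace tokenization (the experiments use spaCy) and omits options such as stemming; the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F-1 between a candidate and a reference summary."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


candidate = "The model summarizes long documents"
reference = "The model summarizes long scientific papers"
f1 = rouge1_f1(candidate, reference)
print(round(f1, 4))
```

Here four unigrams overlap, giving a precision of 4/5, a recall of 4/6, and an F-1 of 8/11.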

Implementation details

We use TensorFlow 1.4 for implementing our models. We use the hyperparameters suggested by See et al. (2017). In particular, we use two bidirectional LSTMs with cell size of 256 and embedding dimensions of 128. Embeddings are trained from scratch and we did not find any gain using pre-trained embeddings. The vocabulary size is constrained to 50,000; using a larger vocabulary size did not result in any improvement. We use mini-batches of size 16 and we limit the document length to 2000 tokens, the section length to 500 tokens, and the number of sections to 4. We use batch-padding and dynamic unrolling to handle variable sequence lengths in LSTMs. Training is done using the Adagrad optimizer with learning rate 0.15 and an initial accumulator value of 0.1. The maximum decoder size is 210 tokens, which is in line with the average abstract length in our datasets. We first train the model without coverage and add it in the last two epochs to help the model converge faster. We train the models on NVIDIA Titan X Pascal GPUs. Training is performed for about 10 epochs and each training step takes about 3.2 seconds. We use beam search at decoding time with beam size of 4. We train the abstractive baselines for about 250K iterations as suggested by their authors.
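To make the role of the two optimizer settings concrete, the following sketch shows a single Adagrad update in plain Python with the learning rate and initial accumulator value quoted above. It illustrates the standard update rule and is not the training code; the function name `adagrad_step` and the example parameter and gradient values are hypothetical.

```python
import math


def adagrad_step(params, grads, accum, learning_rate=0.15):
    """Apply one in-place Adagrad update.

    Each parameter has its own accumulator of squared gradients, so
    frequently updated parameters take progressively smaller steps.
    """
    for k in range(len(params)):
        accum[k] += grads[k] ** 2
        params[k] -= learning_rate * grads[k] / math.sqrt(accum[k])


params = [1.0, -2.0]
grads = [0.3, 0.0]
accum = [0.1, 0.1]   # initial accumulator value; keeps the denominator positive

adagrad_step(params, grads, accum)
print(params, accum)
```

A strictly positive initial accumulator avoids division by zero for parameters whose gradient is zero, as for the second parameter in this example.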

Comparison

We compare our method with several well-known extractive baselines as well as state-of-the-art abstractive models using their open-sourced implementations, when available; we follow the same training setup described in the corresponding papers. The compared methods are: LexRank Erkan and Radev (2004), SumBasic Vanderwende et al. (2007), LSA Steinberger and Jezek (2004), Attn-Seq2Seq Nallapati et al. (2016); Chopra et al. (2016), and Pntr-Gen-Seq2Seq See et al. (2017). The first three are extractive models and the last two are abstractive. Pntr-Gen-Seq2Seq extends Attn-Seq2Seq by using a joint pointer network during decoding. For Pntr-Gen-Seq2Seq we use their reported hyperparameters to ensure that the result differences are not due to hyperparameter tuning.

Results

Our main results are shown in Tables 2 and 3. Our model significantly outperforms the state-of-the-art abstractive methods, showing its effectiveness on both datasets. We observe that our Rouge-1 score is about 4 and 3 points higher than that of the abstractive model Pntr-Gen-Seq2Seq on the arXiv and PubMed datasets, respectively, providing a significant improvement. Our method also outperforms most of the extractive methods except for LexRank in one of the Rouge scores. We note that since extractive methods copy salient sentences from the document, it is usually easier for them to achieve higher Rouge scores.

Figure 2 illustrates the effectiveness of our model extensions in capturing various discourse information from the papers. It can be observed that the state-of-the-art Pntr-Gen-Seq2Seq model generates a summary that mostly focuses on introducing the problem, whereas our model generates a summary that includes more information about the methodology and impacts of the target paper. This indicates that the context vector in our model compared with Pntr-Gen-Seq2Seq is better able to capture important information from the source by attending to various discourse sections.

Conclusions and future work

This work was the first attempt at addressing neural abstractive summarization of single, long documents. We presented a neural sequence-to-sequence model that is able to effectively summarize long and structured documents such as scientific papers. While our results are encouraging, there is still much room for improvement for this challenging task; our new datasets can help the community to further explore this problem.

We note that, following the convention in summarization research, our quantitative evaluation is performed using the Rouge automatic metric. While Rouge is an effective evaluation framework, it does not capture nuances in the coherence or coverage of the summaries. It is non-trivial to evaluate such qualities, especially for long document summarization; future work can design expert human evaluations to explore these nuances.

Acknowledgements

We thank the three anonymous reviewers for their comments and suggestions.