Neural Latent Extractive Document Summarization

Xingxing Zhang, Mirella Lapata, Furu Wei, Ming Zhou

Introduction

Document summarization aims to automatically rewrite a document into a shorter version while retaining its most important content. Of the many summarization paradigms that have been identified over the years (see Mani 2001 and Nenkova and McKeown 2011 for comprehensive overviews), two have consistently attracted attention: extractive approaches generate summaries by copying parts of the source document (usually whole sentences), while abstractive methods may generate new words or phrases which are not in the document.

A great deal of previous work has focused on extractive summarization, which is usually modeled as a sentence ranking or binary classification problem (i.e., sentences which are top ranked or predicted as True are selected as summaries). Early attempts mostly leverage human-engineered features Filatova and Hatzivassiloglou (2004) coupled with binary classifiers Kupiec et al. (1995), hidden Markov models Conroy and O’Leary (2001), graph-based methods Mihalcea (2005), and integer linear programming Woodsend and Lapata (2010).

The successful application of neural network models to a variety of NLP tasks and the availability of large scale summarization datasets Hermann et al. (2015); Nallapati et al. (2016) have provided strong impetus to develop data-driven approaches which take advantage of continuous-space representations. Cheng and Lapata (2016) propose a hierarchical long short-term memory network (LSTM; Hochreiter and Schmidhuber 1997) to learn context dependent sentence representations for a document and then use yet another LSTM decoder to predict a binary label for each sentence. Nallapati et al. (2017) adopt a similar approach but differ in their neural architecture for sentence encoding and the features used during label prediction, while Narayan et al. (2018) equip the same architecture with a training algorithm based on reinforcement learning. Abstractive models Nallapati et al. (2016); See et al. (2017); Paulus et al. (2017) are based on sequence-to-sequence learning Sutskever et al. (2014); Bahdanau et al. (2015). However, most of them underperform or are on par with the baseline of simply selecting the leading sentences in the document as summaries (but see Paulus et al. 2017 and Celikyilmaz et al. 2018 for exceptions).

Although seemingly more successful than their abstractive counterparts, extractive models require sentence-level labels, which are not included in most summarization datasets (only document and gold summary pairs are available). Sentence labels are usually obtained by rule-based methods Cheng and Lapata (2016) or by maximizing the Rouge score Lin (2004) between a subset of sentences and the human written summaries Nallapati et al. (2017). These methods do not fully exploit the human summaries: they only create True/False labels, which might be suboptimal. In this paper we propose a latent variable extractive model and view labels of sentences in a document as binary latent variables (i.e., zeros and ones). Instead of maximizing the likelihood of “gold” standard labels, the latent model directly maximizes the likelihood of human summaries given selected sentences. Experiments on the CNN/Dailymail dataset Hermann et al. (2015) show that our latent extractive model improves upon a strong extractive baseline trained on rule-based labels and also performs competitively with several recent models.

Model

We first introduce the neural extractive summarization model on which our latent model is based. We then describe a sentence compression model which is used in our latent model and finally move on to present the latent model itself.

In extractive summarization, a subset of sentences in a document is selected as its summary. We model this problem as an instance of sequence labeling. Specifically, a document is viewed as a sequence of sentences and the model is expected to predict a True or False label for each sentence, where True indicates that the sentence should be included in the summary. It is assumed that during training sentences and their labels in each document are given (methods for obtaining these labels are discussed in Section 3).

As shown in the lower part of Figure 1, our extractive model has three parts: a sentence encoder to convert each sentence into a vector, a document encoder to learn sentence representations given surrounding sentences as context, and a document decoder to predict sentence labels based on representations learned by the document encoder. Let $\mathcal{D}=(S_{1},S_{2},\dots,S_{|\mathcal{D}|})$ denote a document and $S_{i}=(w_{1}^{i},w_{2}^{i},\dots,w_{|S_{i}|}^{i})$ a sentence in $\mathcal{D}$ (where $w_{j}^{i}$ is a word in $S_{i}$ ). Let $Y=(y_{1},\dots,y_{|\mathcal{D}|})$ denote sentence labels. The sentence encoder first transforms $S_{i}$ into a list of hidden states $(\mathbf{h}_{1}^{i},\mathbf{h}_{2}^{i},\dots,\mathbf{h}_{|S_{i}|}^{i})$ using a Bidirectional Long Short-Term Memory Network (Bi-LSTM; Hochreiter and Schmidhuber 1997; Schuster and Paliwal 1997). Then, the sentence encoder yields $\mathbf{v}_{i}$ , the representation of $S_{i}$ , by averaging these hidden states (also see Figure 1):
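Written out with the symbols defined above, the mean-pooling step would take the following form. This is a reconstruction from the description (an average over the Bi-LSTM hidden states), and the equation number is inferred from later cross-references:

$$\mathbf{v}_{i}=\frac{1}{|S_{i}|}\sum_{j=1}^{|S_{i}|}\mathbf{h}_{j}^{i} \qquad (1)$$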

In analogy to the sentence encoder, the document encoder is another Bi-LSTM, but it is applied at the sentence level. After running the Bi-LSTM on a sequence of sentence representations $(\mathbf{v}_{1},\mathbf{v}_{2},\dots,\mathbf{v}_{|\mathcal{D}|})$, we obtain context dependent sentence representations $(\mathbf{h}^{E}_{1},\mathbf{h}^{E}_{2},\dots,\mathbf{h}^{E}_{|\mathcal{D}|})$.

The document decoder is also an LSTM which predicts sentence labels. At each time step, it takes the context dependent sentence representation of $S_{i}$ produced by the document encoder as well as the prediction in the previous time step:
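A plausible form of the decoder step, consistent with the description and with the conditioning notation $p(z_{i}|z_{1:i-1},\mathbf{h}^{D}_{i-1})$ used later, is sketched below. The label embedding $e(\cdot)$ and the projection matrices $\mathbf{W}_{e}$ and $\mathbf{W}_{o}$ are assumed symbols that the surrounding text does not define, and the equation numbers are inferred from later cross-references:

$$\mathbf{h}^{D}_{i}=\text{LSTM}\big(\mathbf{h}^{D}_{i-1},\,[\mathbf{W}_{e}\,e(y_{i-1});\,\mathbf{h}^{E}_{i}]\big) \qquad (2)$$

$$p(y_{i}|y_{1:i-1},\mathbf{h}^{D}_{i-1})=\text{softmax}(\mathbf{W}_{o}\,\mathbf{h}^{D}_{i}) \qquad (3)$$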

The model described above is usually trained by minimizing the negative log-likelihood of sentence labels in training documents; it is almost identical to Cheng and Lapata (2016) except that we use a word-level long short-term memory network coupled with mean pooling to learn sentence representations, while they use a convolutional neural network coupled with max pooling Kim et al. (2016).

2.2 Sentence Compression

We train a sentence compression model to map a sentence selected by the extractive model to a sentence in the summary. The model can be used to evaluate the quality of a selected sentence with respect to the summary (i.e., the degree to which it is similar) or rewrite an extracted sentence according to the style of the summary.

For our compression model we adopt a standard attention-based sequence-to-sequence architecture Bahdanau et al. (2015); Rush et al. (2015). The training set for this model is generated from the same summarization dataset used to train the extractive model. Let $\mathcal{D}=(S_{1},S_{2},\dots,S_{|\mathcal{D}|})$ denote a document and $\mathcal{H}=(H_{1},H_{2},\dots,H_{|\mathcal{H}|})$ its summary. We view each sentence $H_{i}$ in the summary as a target sentence and assume that its corresponding source is the sentence in $\mathcal{D}$ most similar to it. We measure the similarity between source sentences and candidate targets using Rouge, i.e., $S_{j}=\operatorname*{argmax}_{S_{j}}\text{ROUGE}(S_{j},H_{i})$, and $\langle S_{j},H_{i}\rangle$ is a training instance for the compression model. The probability of a sentence $\hat{H_{i}}$ being the compression of $\hat{S_{j}}$ (i.e., $p_{s2s}(\hat{H_{i}}|\hat{S_{j}})$) can be estimated with a trained compression model.
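The pairing procedure can be sketched as follows. This is a minimal illustration and not the authors' code: `unigram_f1` is a hypothetical, crude stand-in for Rouge (clipped unigram overlap), and `build_compression_pairs` is an assumed helper name.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Crude stand-in for ROUGE-1 F1 (assumption): clipped unigram overlap of two token lists."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def build_compression_pairs(document, summary, similarity=unigram_f1):
    """Pair each summary sentence H_i with the document sentence S_j most similar to it."""
    pairs = []
    for target in summary:
        # argmax over document sentences of similarity(S_j, H_i)
        source = max(document, key=lambda sent: similarity(sent, target))
        pairs.append((source, target))
    return pairs


document = [
    "the cat sat on the mat in the hall".split(),
    "stocks fell sharply on monday after the report".split(),
    "the weather was mild".split(),
]
summary = ["stocks fell on monday".split(), "cat sat on mat".split()]
pairs = build_compression_pairs(document, summary)
```

Each element of `pairs` is a (source, target) tuple that would serve as one training instance for the sequence-to-sequence compression model.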

2.3 Latent Extractive Summarization

Training the extractive model described in Section 2.1 requires sentence-level labels, which are obtained heuristically Cheng and Lapata (2016); Nallapati et al. (2017). Our latent variable model views the labels of sentences in a document as binary latent variables (i.e., zeros and ones) and uses sentences with activated latent variables (i.e., ones) to infer gold summaries. The latent variables are predicted with an extractive model and the loss during training comes from gold summaries directly.

Let $\mathcal{D}=(S_{1},S_{2},\dots,S_{|\mathcal{D}|})$ denote a document and $\mathcal{H}=(H_{1},H_{2},\dots,H_{|\mathcal{H}|})$ its human summary ($H_{k}$ is a sentence in $\mathcal{H}$). We assume that there is a latent variable $z_{i}\in\{0,1\}$ for each sentence $S_{i}$ indicating whether $S_{i}$ should be selected, where $z_{i}=1$ entails it should. We use the extractive model from Section 2.1 to produce probability distributions for latent variables (see Equation (3)) and obtain them by sampling $z_{i}\sim p(z_{i}|z_{1:i-1},\mathbf{h}^{D}_{i-1})$ (see Figure 1). $\mathcal{C}=\{S_{i}|z_{i}=1\}$, the set of sentences whose latent variables are equal to one, is our current extractive summary. Without loss of generality, we denote $\mathcal{C}=(C_{1},\dots,C_{|\mathcal{C}|})$. Then, we estimate how likely it is to infer the human summary $\mathcal{H}$ from $\mathcal{C}$. We estimate the likelihood of summary sentence $H_{l}$ given document sentence $C_{k}$ with the compression model introduced in Section 2.2 and calculate the normalized probability $s_{kl}$ (we also experimented with unnormalized probabilities, i.e., excluding the $\exp$ in Equation (4), but obtained inferior results):
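One form consistent with this description is given below. It assumes that "normalized" refers to normalizing the log-probability by the length $|H_{l}|$ of the summary sentence before exponentiating; that normalizer is an assumption, while the $\exp$ is confirmed by the reference to Equation (4):

$$s_{kl}=\exp\left(\frac{\log p_{s2s}(H_{l}|C_{k})}{|H_{l}|}\right) \qquad (4)$$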

The score $R_{p}$ measures the extent to which $\mathcal{H}$ can be inferred from $\mathcal{C}$ :
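A form consistent with the description that follows, in which each selected sentence $C_{k}$ keeps only its best-matching summary sentence and the result is averaged over $\mathcal{C}$, would be (the averaging over $|\mathcal{C}|$ is an assumption):

$$R_{p}(\mathcal{C},\mathcal{H})=\frac{1}{|\mathcal{C}|}\sum_{k=1}^{|\mathcal{C}|}\max_{l=1}^{|\mathcal{H}|}s_{kl} \qquad (5)$$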

For simplicity, we assume one document sentence can only find one summary sentence to explain it. Therefore, for all $H_{l}$ , we only retain the most evident $s_{kl}$ . $R_{p}(\mathcal{C},\mathcal{H})$ can be viewed as the “precision” of document sentences with regard to summary sentences. Analogously, we also define $R_{r}$ , which indicates the extent to which $\mathcal{H}$ can be covered by $\mathcal{C}$ :
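By symmetry with the precision-like score, the recall-like score would take the maximum over selected sentences for each summary sentence and average over $\mathcal{H}$ (again a reconstruction, with the averaging over $|\mathcal{H}|$ assumed):

$$R_{r}(\mathcal{C},\mathcal{H})=\frac{1}{|\mathcal{H}|}\sum_{l=1}^{|\mathcal{H}|}\max_{k=1}^{|\mathcal{C}|}s_{kl} \qquad (6)$$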

$R_{r}(\mathcal{C},\mathcal{H})$ can be viewed as the “recall” of document sentences with regard to summary sentences. The final score $R(\mathcal{C},\mathcal{H})$ is the weighted sum of the two:
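Assuming the weighted sum has the form $R(\mathcal{C},\mathcal{H})=\alpha\,R_{p}(\mathcal{C},\mathcal{H})+(1-\alpha)\,R_{r}(\mathcal{C},\mathcal{H})$ (Equation (7); which term carries $\alpha$ is an assumption), the three scores can be computed from a matrix of $s_{kl}$ values as in the sketch below. The function name and the zero reward for an empty selection are illustrative choices, not taken from the paper.

```python
def precision_recall_reward(scores, alpha=0.5):
    """Compute (R_p, R_r, R) from scores[k][l] = s_kl.

    Rows index selected sentences C_k, columns index summary sentences H_l.
    Returning zeros for an empty selection is an assumption of this sketch.
    """
    if not scores or not scores[0]:
        return 0.0, 0.0, 0.0
    num_selected = len(scores)
    num_summary = len(scores[0])
    # "Precision": every selected sentence keeps its best-matching summary sentence.
    r_p = sum(max(row) for row in scores) / num_selected
    # "Recall": every summary sentence keeps its best-matching selected sentence.
    r_r = sum(
        max(scores[k][l] for k in range(num_selected)) for l in range(num_summary)
    ) / num_summary
    # Weighted sum of the two (assumed placement of alpha).
    reward = alpha * r_p + (1 - alpha) * r_r
    return r_p, r_r, reward


scores = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.1],
]
r_p, r_r, reward = precision_recall_reward(scores, alpha=0.5)
```

In this toy matrix the third selected sentence matches nothing well, which lowers the precision-like score without affecting the recall-like one.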

Our use of the terms “precision” and “recall” is reminiscent of relevance and coverage in other summarization work Carbonell and Goldstein (1998); Lin and Bilmes (2010); See et al. (2017).

We train the model by minimizing the negative expected $R(\mathcal{C},\mathcal{H})$ :
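Following the wording above, the objective would be the negative expectation of the reward under the distribution over latent variables (a reconstruction; the equation number is inferred):

$$\mathcal{L}(\theta)=-\mathbb{E}_{(z_{1},\dots,z_{|\mathcal{D}|})\sim p(\cdot|\mathcal{D})}\big[R(\mathcal{C},\mathcal{H})\big] \qquad (8)$$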

where $p(\cdot|\mathcal{D})$ is the distribution produced by the neural extractive model (see Equation (3)). Unfortunately, computing the expectation term is prohibitive, since the possible latent variable combinations are exponential. In practice, we approximate this expectation with a single sample from the distribution of $p(\cdot|\mathcal{D})$ . We use the REINFORCE algorithm Williams (1992) to approximate the gradient of $\mathcal{L}(\theta)$ :
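With a single sample and no variance reduction, the standard REINFORCE estimate of this gradient would read as follows. This is the textbook score-function form written with the document's symbols, and the published equation may differ in detail (for instance by subtracting a baseline from the reward):

$$\nabla\mathcal{L}(\theta)\approx-\sum_{i=1}^{|\mathcal{D}|}\nabla\log p(z_{i}|z_{1:i-1},\mathbf{h}^{D}_{i-1})\,R(\mathcal{C},\mathcal{H}) \qquad (9)$$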

Note that the model described above can be viewed as a reinforcement learning model, where $R(\mathcal{C},\mathcal{H})$ is the reward. To reduce the variance of gradients, we also introduce a baseline linear regression model $b_{i}$ Ranzato et al. (2016) to estimate the expected value of $R(\mathcal{C},\mathcal{H})$; $b_{i}$ is trained by minimizing the mean squared error between its prediction and $R(\mathcal{C},\mathcal{H})$. To avoid random label sequences during sampling, we use a pre-trained extractive model to initialize our latent model.
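The effect of the baseline on a single-sample update can be illustrated with a simplified sketch. It treats each sentence's selection probability as a fixed Bernoulli parameter, whereas the actual model conditions each step on previous samples through the decoder LSTM; the function names and the per-logit view of the gradient are assumptions made for illustration.

```python
import random


def sample_latents(probs, rng):
    """Draw z_i ~ Bernoulli(p_i) for every sentence (simplified: p_i held fixed)."""
    return [1 if rng.random() < p else 0 for p in probs]


def reinforce_logit_grads(probs, z, reward, baselines):
    """Single-sample REINFORCE gradient of the loss -E[R] w.r.t. each sentence's logit.

    For a Bernoulli with p = sigmoid(logit), d log p(z) / d logit = z - p,
    so the per-sentence estimate is -(R - b_i) * (z_i - p_i).
    """
    return [-(reward - b) * (zi - p) for p, zi, b in zip(probs, z, baselines)]


def baseline_squared_error(reward, baselines):
    """Mean squared error between the baseline predictions b_i and the reward R."""
    return sum((b - reward) ** 2 for b in baselines) / len(baselines)


rng = random.Random(0)
probs = [0.9, 0.2, 0.6]
sampled = sample_latents(probs, rng)

# Fixed example values so the arithmetic is easy to follow.
z = [1, 0, 1]
grads = reinforce_logit_grads(probs, z, reward=0.7, baselines=[0.5, 0.5, 0.5])
mse = baseline_squared_error(0.7, [0.5, 0.5, 0.5])
```

When the reward exceeds the baseline, the gradient pushes the logits of the sampled decisions in the direction that makes them more likely; when it falls below, the direction reverses.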

Experiments

We conducted experiments on the CNN/Dailymail dataset Hermann et al. (2015); See et al. (2017). We followed the same pre-processing steps as in See et al. (2017). The resulting dataset contains 287,226 document-summary pairs for training, 13,368 for validation and 11,490 for test. To create sentence level labels, we used a strategy similar to Nallapati et al. (2017). We label the subset of sentences in a document that maximizes Rouge (against the human summary) as True and all other sentences as False. Using the method described in Section 2.2, we created a compression dataset with 1,045,492 sentence pairs for training, 53,434 for validation and 43,382 for testing. We evaluated our models using full length F1 Rouge Lin (2004) and the official ROUGE-1.5.5.pl script. We report Rouge-1, Rouge-2, and Rouge-L.
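One common way to approximate the Rouge-maximizing subset is greedy selection, sketched below. Whether the labels here were produced greedily is an assumption, and `unigram_f1` is again a hypothetical stand-in for Rouge rather than the official script.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Crude stand-in for ROUGE-1 F1 (assumption): clipped unigram overlap of two token lists."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle_labels(document, summary_tokens, score=unigram_f1):
    """Greedily add the sentence that most improves the score; stop when none helps."""
    selected = []
    best = 0.0
    while True:
        best_idx = None
        for idx in range(len(document)):
            if idx in selected:
                continue
            # Score the candidate subset with its sentences in document order.
            candidate = [tok for i in sorted(selected + [idx]) for tok in document[i]]
            value = score(candidate, summary_tokens)
            if value > best:
                best, best_idx = value, idx
        if best_idx is None:
            break
        selected.append(best_idx)
    # True for selected sentences, False for all others.
    return [idx in selected for idx in range(len(document))]


document = [
    "police arrested two men".split(),
    "the weather was mild".split(),
    "both men were charged with theft".split(),
]
summary_tokens = "two men were arrested and charged with theft".split()
labels = greedy_oracle_labels(document, summary_tokens)
```

Here the first and third sentences are labeled True, since adding the unrelated second sentence would only lower the overlap score.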

Implementation

We trained our extractive model on an Nvidia K80 GPU card with a batch size of 32. Model parameters were uniformly initialized to $[-\frac{1}{\sqrt{c}},\frac{1}{\sqrt{c}}]$ ( $c$ is the number of columns in a weight matrix). We used Adam Kingma and Ba (2014) to optimize our models with a learning rate of 0.001, $\beta_{1}=0.9$ , and $\beta_{2}=0.999$ . We trained our extractive model for 10 epochs and selected the model with the highest Rouge on the validation set. We rescaled the gradient when its norm exceeded 5 Pascanu et al. (2013) and regularized all LSTMs with a dropout rate of 0.3 Srivastava et al. (2014); Zaremba et al. (2014). We also applied word dropout Iyyer et al. (2015) at rate 0.2. We set the hidden unit size $d=300$ for both word-level and sentence-level LSTMs and all LSTMs had one layer. We used 300 dimensional pre-trained FastText vectors Joulin et al. (2017) to initialize our word embeddings. The latent model was initialized from the extractive model (thus both models have the same size) and we set the weight in Equation (7) to $\alpha=0.5$ . The latent model was trained with SGD, with learning rate 0.01 for 5 epochs. During inference, for both extractive and latent models, we rank sentences with $p(y_{i}=\text{\tt True}|y_{1:i-1},\mathcal{D})$ and select the top three as summary (see also Equation (3)).
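Three of the settings above (uniform initialization, gradient norm rescaling, and top-three selection at inference) can be sketched in plain Python. The function names are hypothetical, gradients are represented as flat lists for simplicity, and returning the selected indices in document order is an assumption of this sketch.

```python
import math
import random


def uniform_init(rows, cols, rng):
    """Sample a rows x cols weight matrix uniformly from [-1/sqrt(cols), 1/sqrt(cols)]."""
    bound = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


def rescale_gradient(grad, max_norm=5.0):
    """Scale a flat gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


def select_top_k(true_probs, k=3):
    """Rank sentences by p(y_i = True | ...) and return the top-k indices in document order."""
    ranked = sorted(range(len(true_probs)), key=lambda i: true_probs[i], reverse=True)
    return sorted(ranked[:k])


weights = uniform_init(4, 300, random.Random(0))
clipped = rescale_gradient([6.0, 8.0], max_norm=5.0)
summary_indices = select_top_k([0.1, 0.9, 0.3, 0.8, 0.05], k=3)
```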

Comparison Systems

We compared our model against Lead3, which selects the first three leading sentences in a document as the summary, and a variety of abstractive and extractive models. Abstractive models include a sequence-to-sequence architecture (Nallapati et al. 2016; abstract), its pointer generator variant (See et al. 2017; pointer+coverage), and two reinforcement learning-based models (Paulus et al. 2017; abstract-RL and abstract-ML+RL). We also compared our approach against an extractive model based on hierarchical recurrent neural networks (Nallapati et al. 2017; SummaRuNNer), the model described in Section 2.1 (extract) which encodes sentences using LSTMs, a variant which employs CNNs instead (Cheng and Lapata 2016; extract-cnn), as well as a similar system based on reinforcement learning (Narayan et al. 2018; Refresh).

Results

As shown in Table 1, our extractive model extract outperforms Lead3 by a wide margin. extract also outperforms previously published extractive models (i.e., SummaRuNNer, extract-cnn, and Refresh). However, note that SummaRuNNer generates anonymized summaries Nallapati et al. (2017) while our models generate non-anonymized ones, and therefore the results of extract and SummaRuNNer are not strictly comparable (also note that Lead3 results are different in Table 1). Nevertheless, extract exceeds Lead3 by $+0.75$ Rouge-2 points and $+0.57$ in terms of Rouge-L, while SummaRuNNer exceeds Lead3 by $+0.50$ Rouge-2 points and is worse by $-0.20$ points in terms of Rouge-L. We thus conclude that extract is better when evaluated with Rouge-2 and Rouge-L. extract outperforms all abstractive models except for abstract-RL. Rouge-2 is lower for abstract-RL, which is more competitive when evaluated against Rouge-1 and Rouge-L.

Our latent variable model (latent; Section 2.3) outperforms extract, which is itself a strong baseline; this indicates that training with a loss directly based on gold summaries is useful. Differences among Lead3, extract, and latent are all significant with a 0.95 confidence interval (estimated with the Rouge script). Interestingly, when applying the compression model from Section 2.2 to the output of our latent model (latent+compress), performance drops considerably. This may be because the compression model is a sentence level model and it removes phrases that are important for creating the document-level summaries.

Conclusions

We proposed a latent variable extractive summarization model which leverages human summaries directly with the help of a sentence compression model. Experimental results show that the proposed model can indeed improve over a strong extractive model while application of the compression model to the output of our extractive system leads to inferior output. In the future, we plan to explore ways to train compression models tailored to our summarization task.

Acknowledgments

We thank the EMNLP reviewers for their valuable feedback and Qingyu Zhou for preprocessing the CNN/Dailymail dataset. We gratefully acknowledge the financial support of the European Research Council (award number 681760; Lapata).