Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting

Yen-Chun Chen, Mohit Bansal

Introduction

The task of document summarization has two main paradigms: extractive and abstractive. The former directly chooses and outputs the salient sentences (or phrases) in the original document (Jing and McKeown, 2000; Knight and Marcu, 2000; Martins and Smith, 2009; Berg-Kirkpatrick et al., 2011). The latter abstractive approach involves rewriting the summary (Banko et al., 2000; Zajic et al., 2004), and has seen substantial recent gains due to neural sequence-to-sequence models (Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017; Paulus et al., 2018). Abstractive models can be more concise by performing generation from scratch, but they suffer from slow and inaccurate encoding of very long documents, because the attention model must look at all encoded words (in long paragraphs) to decode each generated summary word (slowly, one by one). Abstractive models also suffer from redundancy (repetitions), especially when generating multi-sentence summaries.

To address both these issues and combine the advantages of both paradigms, we propose a hybrid extractive-abstractive architecture, with policy-based reinforcement learning (RL) to bridge the two networks. Similar to how humans summarize long documents, our model first uses an extractor agent to select salient sentences or highlights, and then employs an abstractor network to rewrite (i.e., compress and paraphrase) each of these extracted sentences. To overcome the non-differentiable behavior of our extractor and train on available document-summary pairs without saliency labels, we use actor-critic policy gradient with sentence-level metric rewards to connect these two neural networks and to learn sentence saliency. We also avoid common language fluency issues (Paulus et al., 2018) by preventing the policy gradients from affecting the abstractive summarizer’s word-level training, which is supported by our human evaluation study. Our sentence-level reinforcement learning takes into account the word-sentence hierarchy, which better models the language structure and makes parallelization possible. Our extractor combines reinforcement learning and pointer networks, inspired by the attempt of Bello et al. (2017) to solve the Traveling Salesman Problem. Our abstractor is a simple encoder-aligner-decoder model (with copying) and is trained on pseudo document-summary sentence pairs obtained via simple automatic matching criteria.

Thus, our method incorporates the abstractive paradigm’s advantages of concisely rewriting sentences and generating novel words from the full vocabulary, yet it adopts intermediate extractive behavior to improve the overall model’s quality, speed, and stability. Instead of encoding and attending to every word in the long input document sequentially, our model adopts a human-inspired coarse-to-fine approach that first extracts all the salient sentences and then decodes (rewrites) them in parallel. This also avoids almost all redundancy issues because the model has already chosen non-redundant salient sentences to abstractively summarize (although adding an optional final reranker component gives additional gains by removing the few remaining across-sentence repetitions).

Empirically, our approach is the new state-of-the-art on all ROUGE metrics (Lin, 2004) as well as on METEOR (Denkowski and Lavie, 2014) on the CNN/Daily Mail dataset, achieving statistically significant improvements over previous models that use complex long-encoder, copy, and coverage mechanisms (See et al., 2017). The test-only DUC-2002 improvement also shows that our model generalizes better than this strong abstractive system. In addition, we surpass the popular lead-3 baseline on all ROUGE scores with an abstractive model. Moreover, our sentence-level abstractive rewriting module also produces substantially more (3x) novel $N$-grams that are not seen in the input document, as compared to the strong flat-structured model of See et al. (2017). This empirically justifies that our RL-guided extractor has learned sentence saliency, rather than benefiting from simply copying longer sentences. We also show that our model maintains the same level of fluency as a conventional RNN-based model because the reward does not leak to our abstractor’s word-level training. Finally, our model’s training is 4x faster and its inference is more than 20x faster than the previous state-of-the-art. The optional final reranker gives further improvements while maintaining a 7x speedup.

Overall, our contribution is threefold. First, we propose a novel sentence-level RL technique for the well-known task of abstractive summarization, effectively utilizing the word-then-sentence hierarchical structure without annotated matching sentence-pairs between the document and ground truth summary. Next, our model achieves the new state-of-the-art on all metrics of multiple versions of a popular summarization dataset (as well as a test-only dataset) both extractively and abstractively, without loss in language fluency (also demonstrated via human evaluation and abstractiveness scores). Finally, our parallel decoding results in a significant 10-20x speed-up over the previous best neural abstractive summarization system with even better accuracy. We are releasing our code, best pretrained models, as well as output summaries, to promote future research: https://github.com/ChenRocks/fast_abs_rl

Model

The extractor agent is designed to model $f$ , which can be thought of as extracting salient sentences from the document. We exploit a hierarchical neural model to learn the sentence representations of the document and a ‘selection network’ to extract sentences based on their representations.

We use a temporal convolutional model (Kim, 2014) to compute $r_{j}$, the representation of each individual sentence in the document (details in supplementary). To further incorporate global context of the document and capture the long-range semantic dependency between sentences, a bidirectional LSTM-RNN (Hochreiter and Schmidhuber, 1997; Schuster et al., 1997) is applied on the convolutional output. This enables learning a strong representation, denoted as $h_{j}$ for the $j$-th sentence in the document, that takes into account the context of all previous and future sentences in the same document.

2.1.2 Sentence Selection

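Next, to select the extracted sentences based on the above sentence representations, we add another LSTM-RNN to train a Pointer Network (Vinyals et al., 2015), to extract sentences recurrently. We calculate the extraction probability by:

$$u^{t}_{j}=\begin{cases}v_{p}^{\top}\tanh(W_{p1}h_{j}+W_{p2}e_{t}) & \text{if } j_{t}\neq j_{k}\ \forall k<t\\ -\infty & \text{otherwise}\end{cases} \qquad (1)$$

$$P(j_{t}|j_{1},\dots,j_{t-1})=\text{softmax}(u^{t}) \qquad (2)$$

where $e_{t}$’s are the output of the glimpse operation (Vinyals et al., 2016):

$$a^{t}_{j}=v_{g}^{\top}\tanh(W_{g1}h_{j}+W_{g2}z_{t}) \qquad (3)$$

$$\alpha^{t}=\text{softmax}(a^{t}) \qquad (4)$$

$$e_{t}=\sum_{j}\alpha^{t}_{j}W_{g1}h_{j} \qquad (5)$$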

In Eqn. 3, $z_{t}$ is the output of the added LSTM-RNN (shown in green in Fig. 1), which is referred to as the decoder. All the $W$’s and $v$’s are trainable parameters. At each time step $t$, the decoder performs a 2-hop attention mechanism: it first attends to the $h_{j}$’s to get a context vector $e_{t}$ and then attends to the $h_{j}$’s again for the extraction probabilities. (Note that we force-zero the extraction probability of already extracted sentences so as to prevent the model from repeating document sentences and suffering from redundancy. This is non-differentiable and hence only done in RL training.) This model is essentially classifying all sentences of the document at each extraction step. An illustration of the whole extractor is shown in Fig. 1.
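The 2-hop attention described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the released implementation: the parameter names (`W_g1`, `W_g2`, `v_g`, `W_p1`, `W_p2`, `v_p`), the dictionary layout, and the helper functions are assumptions chosen for readability, and the additive (tanh) scoring form is assumed for both hops.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_score(v, W_a, a, W_b, b):
    """Additive attention score v^T tanh(W_a a + W_b b)."""
    pa, pb = matvec(W_a, a), matvec(W_b, b)
    return sum(vi * math.tanh(x + y) for vi, x, y in zip(v, pa, pb))

def extraction_step(h, z_t, p, extracted):
    """One decoder step over sentence vectors h, given decoder output z_t.

    `p` is a dict of (hypothetical) parameter matrices and vectors;
    `extracted` holds indices of sentences that were already selected.
    """
    # Hop 1 (glimpse): attend over the h_j using z_t to build a context vector e_t.
    alpha = softmax([additive_score(p["v_g"], p["W_g1"], h_j, p["W_g2"], z_t)
                     for h_j in h])
    proj = [matvec(p["W_g1"], h_j) for h_j in h]
    e_t = [sum(a * row[i] for a, row in zip(alpha, proj))
           for i in range(len(proj[0]))]
    # Hop 2 (pointer): score every sentence against e_t; mask earlier picks.
    u = [float("-inf") if j in extracted
         else additive_score(p["v_p"], p["W_p1"], h_j, p["W_p2"], e_t)
         for j, h_j in enumerate(h)]
    return softmax(u)
```

Masking with negative infinity before the softmax is what drives the probability of already-extracted sentences to exactly zero, so the remaining probabilities still sum to one.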

2.2 Abstractor Network

The abstractor network approximates $g$, which compresses and paraphrases an extracted document sentence to a concise summary sentence. We use the standard encoder-aligner-decoder (Bahdanau et al., 2015; Luong et al., 2015). We add the copy mechanism to help directly copy some out-of-vocabulary (OOV) words (See et al., 2017). (We use the term ‘copy mechanism’, originally named pointer-generator, in order to avoid confusion with the pointer network of Vinyals et al. (2015).) For more details, please refer to the supplementary.

Learning

Given that our extractor performs a non-differentiable hard extraction, we apply standard policy gradient methods to bridge the back-propagation and form an end-to-end trainable (stochastic) computation graph. However, simply starting from a randomly initialized network to train the whole model in an end-to-end fashion is infeasible. When randomly initialized, the extractor would often select sentences that are not relevant, so it would be difficult for the abstractor to learn to abstractively rewrite. On the other hand, without a well-trained abstractor the extractor would get a noisy reward, which leads to a bad estimate of the policy gradient and a sub-optimal policy. We hence propose first optimizing each sub-module separately using maximum-likelihood (ML) objectives: train the extractor to select salient sentences (fit $f$) and the abstractor to generate shortened summaries (fit $g$). Finally, RL is applied to train the full model end-to-end (fit $h$).

Extractor Training: In Sec. 2.1.2, we have formulated our sentence selection as classification. However, most of the summarization datasets are end-to-end document-summary pairs without extraction (saliency) labels for each sentence. Hence, we propose a simple similarity method to provide a ‘proxy’ target label for the extractor. Similar to the extractive model of Nallapati et al. (2017), for each ground-truth summary sentence, we find the most similar document sentence $d_{j_{t}}$ by:

$$j_{t}=\text{argmax}_{i}(\text{ROUGE-L}_{recall}(d_{i},s_{t})) \qquad (6)$$

(Nallapati et al. (2017) selected sentences greedily to maximize the global summary-level ROUGE, whereas we match exactly one document sentence to each ground-truth (GT) summary sentence based on the individual sentence-level score.)

Given these proxy training labels, the extractor is then trained to minimize the cross-entropy loss.
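The proxy-labeling step can be illustrated with a short sketch. It assumes ROUGE-L recall (longest common subsequence length divided by the reference length) as the sentence-level similarity, operates on pre-tokenized sentences, and omits stemming; the function names are hypothetical and the official ROUGE toolkit may differ in details.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """Fraction of the reference covered by the LCS with the candidate."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)

def proxy_labels(doc_sents, summary_sents):
    """For each summary sentence, the index of the best-matching document sentence."""
    labels = []
    for s in summary_sents:
        scores = [rouge_l_recall(d, s) for d in doc_sents]
        labels.append(max(range(len(doc_sents)), key=scores.__getitem__))
    return labels
```

Each summary sentence yields exactly one label, so the extractor's cross-entropy targets form a sequence as long as the ground-truth summary. Ties are resolved in favor of the earliest document sentence, which is a choice made for this sketch only.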

Abstractor Training: For the abstractor training, we create training pairs by taking each summary sentence and pairing it with its extracted document sentence (based on Eqn. 6). The network is trained as a usual sequence-to-sequence model to minimize the cross-entropy loss $L(\theta_{abs})=-\frac{1}{M}\sum_{m=1}^{M}\log P_{\theta_{abs}}(w_{m}|w_{1:m-1})$ of the decoder language model at each generation step, where $\theta_{abs}$ is the set of trainable parameters of the abstractor and $w_{m}$ is the $m^{th}$ generated word.

3.2 Reinforce-Guided Extraction

Here we explain how policy gradient techniques are applied to optimize the whole model. To make the extractor an RL agent, we can formulate a Markov Decision Process (MDP): at each extraction step $t$, the agent observes the current state $c_{t}=(D,d_{j_{t-1}})$, samples an action $j_{t}\sim\pi_{\theta_{a},\omega}(c_{t},j)=P(j)$ from Eqn. 2 to extract a document sentence, and receives a reward

$$r(t+1)=\text{ROUGE-L}_{F_{1}}(g(d_{j_{t}}),s_{t}) \qquad (7)$$

after the abstractor summarizes the extracted sentence $d_{j_{t}}$. (Strictly speaking, this is a Partially Observable Markov Decision Process (POMDP). We approximate it as an MDP by assuming that the RNN hidden state contains all past information.) (In Eqn. 6, we use ROUGE-recall because we want the extracted sentence to contain as much information as possible for rewriting. Nevertheless, for Eqn. 7, ROUGE-$F_{1}$ is more suitable because the abstractor $g$ is supposed to rewrite the extracted sentence $d$ to be as concise as the ground truth $s$.) We denote the trainable parameters of the extractor agent by $\theta=\{\theta_{a},\omega\}$ for the decoder and hierarchical encoder respectively. We can then train the extractor with policy-based RL. We illustrate this process in Fig. 2.

The vanilla policy gradient algorithm, REINFORCE (Williams, 1992), is known for high variance. To mitigate this problem, we add a critic network with trainable parameters $\theta_{c}$ to predict the state-value function $V^{\pi_{\theta_{a},\omega}}(c)$. The predicted value of the critic, $b_{\theta_{c},\omega}(c)$, is called the ‘baseline’, which is then used to estimate the advantage function $A^{\pi_{\theta}}(c,j)=Q^{\pi_{\theta_{a},\omega}}(c,j)-V^{\pi_{\theta_{a},\omega}}(c)$, because the total return $R_{t}$ is an estimate of the action-value function $Q(c_{t},j_{t})$. Instead of maximizing $Q(c_{t},j_{t})$ as done in REINFORCE, we maximize $A^{\pi_{\theta}}(c,j)$ with the following policy gradient:

$$\nabla_{\theta_{a},\omega}J(\theta_{a},\omega)=\mathbb{E}[\nabla_{\theta_{a},\omega}\log\pi_{\theta}(c,j)A^{\pi_{\theta}}(c,j)] \qquad (8)$$

The critic is trained to minimize the square loss: $L_{c}(\theta_{c},\omega)=(b_{\theta_{c},\omega}(c_{t})-R_{t})^{2}$. This is known as the Advantage Actor-Critic (A2C), a synchronous variant of A3C (Mnih et al., 2016). For more A2C details, please refer to the supplementary.
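To make the actor-critic update concrete, the sketch below computes the quantities for a single sampled extraction episode. It is an illustration under assumptions: the discount factor `gamma`, the averaging of the critic loss over steps, and all function names are choices made here, not details stated in the text. In an automatic differentiation framework the advantage would be treated as a constant when differentiating the actor loss.

```python
def returns_from_rewards(rewards, gamma=1.0):
    """Total (optionally discounted) return R_t for every step of an episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def a2c_losses(log_probs, baselines, rewards, gamma=1.0):
    """Actor and critic losses for one episode.

    log_probs: log pi(c_t, j_t) of each sampled action
    baselines: critic predictions b(c_t)
    rewards:   per-step rewards
    """
    returns = returns_from_rewards(rewards, gamma)
    # Advantage estimate: the return stands in for Q, the baseline for V.
    advantages = [R - b for R, b in zip(returns, baselines)]
    # Minimizing this loss follows the advantage-weighted policy gradient.
    actor_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages))
    # Squared error between the baseline and the observed return.
    critic_loss = sum((b - R) ** 2 for b, R in zip(baselines, returns)) / len(returns)
    return actor_loss, critic_loss, advantages
```

An action whose return exceeds the critic's prediction receives a positive advantage and is made more likely; an action that underperforms the baseline is made less likely.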

Intuitively, our RL training works as follows: If the extractor chooses a good sentence, after the abstractor rewrites it the ROUGE match would be high and thus the action is encouraged. If a bad sentence is chosen, though the abstractor still produces a compressed version of it, the summary would not match the ground truth and the low ROUGE score discourages this action. Our RL with a sentence-level agent is a novel attempt in neural summarization. We use RL as a saliency guide without altering the abstractor’s language model, while previous work applied RL at the word level, which could be prone to gaming the metric at the cost of language fluency. (During this RL training of the extractor, we keep the abstractor parameters fixed. Because the input sentences for the abstractor are extracted by an intermediate stochastic policy of the extractor, it is impossible to find the correct target summary for the abstractor to fit $g$ with the ML objective. Though it is possible to optimize the abstractor with RL, in our preliminary experiments we found that this does not improve the overall ROUGE, most likely because this RL optimizes at the sentence level and can add across-sentence redundancy. We achieve SotA results without this abstractor-level RL.)

In a typical RL setting like game playing, an episode is usually terminated by the environment. On the other hand, in text summarization, the agent does not know in advance how many summary sentences to produce for a given article (since the desired length varies for different downstream applications). We make an important yet simple, intuitive adaptation to solve this: adding a ‘stop’ action to the policy action space. In the RL training phase, we add another set of trainable parameters $v_{EOE}$ (EOE stands for ‘End-Of-Extraction’) with the same dimension as the sentence representation. The pointer-network decoder treats $v_{EOE}$ as one of the extraction candidates and hence naturally results in a stop action in the stochastic policy. We set the reward for the agent performing EOE to $\text{ROUGE-1}_{F_{1}}([\{g(d_{j_{t}})\}_{t}],[\{s_{t}\}_{t}])$; whereas for any extraneous, unwanted extraction step, the agent receives zero reward. The model is therefore encouraged to extract when there are still remaining ground-truth summary sentences (to accumulate intermediate reward), and learns to stop by optimizing a global ROUGE and avoiding extra extraction. (We use ROUGE-1 for the terminal reward because it is a better measure of bag-of-words information, i.e., whether all the important information has been generated, while ROUGE-L is used for the intermediate rewards since it is known to better measure language fluency within a local sentence.) Overall, this modification allows dynamic decisions on the number of sentences based on the input document, eliminates the need for tuning a fixed number of steps, and enables a data-driven adaptation for any specific dataset/application.
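The reward schedule with the stop action can be sketched as follows. This is a simplified illustration: the ROUGE functions are unstemmed token-level approximations, `abstract` is a placeholder for the abstractor $g$ (identity by default), and `episode_rewards` is a hypothetical helper rather than a function from the released code.

```python
from collections import Counter

def lcs_length(a, b):
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def f1(overlap, cand_len, ref_len):
    if overlap == 0:
        return 0.0
    p, r = overlap / cand_len, overlap / ref_len
    return 2 * p * r / (p + r)

def rouge_l_f1(cand, ref):
    return f1(lcs_length(cand, ref), len(cand), len(ref))

def rouge_1_f1(cand, ref):
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f1(overlap, len(cand), len(ref))

def episode_rewards(extracted, references, abstract=lambda sent: sent):
    """Rewards for the extraction steps plus the final EOE (stop) action.

    extracted:  document sentences picked before the agent chose EOE
    references: ground-truth summary sentences, in order
    """
    rewritten = [abstract(d) for d in extracted]
    rewards = []
    for t, y in enumerate(rewritten):
        # Extractions beyond the number of reference sentences earn nothing.
        rewards.append(rouge_l_f1(y, references[t]) if t < len(references) else 0.0)
    # Terminal reward for stopping: summary-level ROUGE-1 F1.
    flat_summary = [w for y in rewritten for w in y]
    flat_reference = [w for s in references for w in s]
    rewards.append(rouge_1_f1(flat_summary, flat_reference))
    return rewards
```

Because an extra extraction earns zero intermediate reward and usually lowers the precision term of the terminal score, stopping on time yields a higher total return than over-extracting.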

3.3 Repetition-Avoiding Reranking

Existing abstractive summarization systems on long documents suffer from generating repeating and redundant words and phrases. To mitigate this issue, See et al. (2017) propose the coverage mechanism and Paulus et al. (2018) incorporate tri-gram avoidance during beam-search at test-time. Our model without these already performs well because the summary sentences are generated from mutually exclusive document sentences, which naturally avoids redundancy. However, we do get a small further boost to the summary quality by removing a few ‘across-sentence’ repetitions, via a simple reranking strategy: At the sentence level, we apply the same beam-search tri-gram avoidance (Paulus et al., 2018). We keep all $k$ sentence candidates generated by beam search, where $k$ is the size of the beam. We then rerank all $k^{n}$ combinations of the $n$ generated summary sentence beams. The summaries are reranked by the number of repeated $N$-grams, the smaller the better. We also apply the diverse decoding algorithm described in Li et al. (2016) (which has almost no computation overhead) so that the above approach produces useful, diverse reranking lists. We show how much the redundancy affects the summarization task in Sec. 6.2.
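A brute-force version of the reranking step is sketched below. It enumerates every combination of per-sentence beam candidates and keeps the one with the fewest repeated $N$-grams. The value of $N$ is left as a parameter since the text does not fix it, and the function names are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def repeated_ngrams(sentences, n=3):
    """Number of n-gram occurrences beyond the first, over a whole summary."""
    counts = Counter()
    for sent in sentences:
        counts.update(tuple(sent[i:i + n]) for i in range(len(sent) - n + 1))
    return sum(c - 1 for c in counts.values() if c > 1)

def rerank(beams, n=3):
    """Pick one candidate per sentence so that the summary repeats least.

    beams: one list per summary sentence, each holding k candidate token
    lists ordered from best to worst beam score.
    """
    # product() enumerates all k^n combinations; min() keeps the first
    # combination with the fewest repeats, which favors higher-ranked beams.
    return min(product(*beams), key=lambda combo: repeated_ngrams(combo, n))
```

Exhaustive enumeration is only practical when the beam size and the number of summary sentences are both small, since the number of combinations grows as $k^{n}$.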

Related Work

Early summarization works mostly focused on extractive and compression based methods (Jing and McKeown, 2000; Knight and Marcu, 2000; Clarke and Lapata, 2010; Berg-Kirkpatrick et al., 2011; Filippova et al., 2015). Recent large-sized corpora attracted neural methods for abstractive summarization (Rush et al., 2015; Chopra et al., 2016). Some of the recent successes in neural abstractive models include hierarchical attention (Nallapati et al., 2016), coverage (Suzuki and Nagata, 2016; Chen et al., 2016; See et al., 2017), RL-based metric optimization (Paulus et al., 2018), graph-based attention (Tan et al., 2017), and the copy mechanism (Miao and Blunsom, 2016; Gu et al., 2016; See et al., 2017).

Our model shares some high-level intuition with extract-then-compress methods. Earlier attempts in this paradigm used Hidden Markov Models and rule-based systems (Jing and McKeown, 2000), statistical models based on parse trees (Knight and Marcu, 2000), and integer linear programming based methods (Martins and Smith, 2009; Gillick and Favre, 2009; Clarke and Lapata, 2010; Berg-Kirkpatrick et al., 2011). Recent approaches investigated discourse structures (Louis et al., 2010; Hirao et al., 2013; Kikuchi et al., 2014; Wang et al., 2015), graph cuts (Qian and Liu, 2013), and parse trees (Li et al., 2014; Bing et al., 2015). For neural models, Cheng and Lapata (2016) used a second neural net to select words from an extractor’s output. Our abstractor does not merely ‘compress’ the sentences but generatively produces novel words. Moreover, our RL bridges the extractor and the abstractor for end-to-end training.

Reinforcement learning has been used to optimize the non-differentiable metrics of language generation and to mitigate exposure bias (Ranzato et al., 2016; Bahdanau et al., 2017). Henß et al. (2015) use Q-learning based RL for extractive summarization. Paulus et al. (2018) use RL policy gradient methods for abstractive summarization, utilizing sequence-level metric rewards with curriculum learning (Ranzato et al., 2016) or weighted ML+RL mixed loss (Paulus et al., 2018) for stability and language fluency. We use sentence-level rewards to optimize the extractor while keeping our ML trained abstractor decoder fixed, so as to achieve the best of both worlds.

Training a neural network to use another fixed network has been investigated in machine translation for better decoding (Gu et al., 2017a) and real-time translation (Gu et al., 2017b). They used a fixed pretrained translator and applied policy gradient techniques to train another task-specific network. In question answering (QA), Choi et al. (2017) extract one sentence and then generate the answer from the sentence’s vector representation with RL bridging. Another recent work attempted a new coarse-to-fine attention approach on summarization (Ling and Rush, 2017) and found desired sharp focus properties for scaling to larger inputs (though without metric improvements). Very recently (concurrently), Narayan et al. (2018) use RL for ranking sentences in pure extraction-based summarization and Çelikyilmaz et al. (2018) investigate multiple communicating encoder agents to enhance the copying abstractive summarizer.

Finally, there are some loosely-related recent works: Zhou et al. (2017) proposed a selective gate to improve the attention in abstractive summarization. Tan et al. (2018) used an extract-then-synthesis approach on QA, where an extraction model predicts the important spans in the passage and then another synthesis model generates the final answer. Swayamdipta et al. (2017) attempted cascaded non-recurrent small networks on extractive QA, resulting in a scalable, parallelizable model. Fan et al. (2017) added controlling parameters to adapt the summary to length, style, and entity preferences. However, none of these used RL to bridge the non-differentiability of neural models.

Experimental Setup

Please refer to the supplementary for full training details (all hyperparameter tuning was performed on the validation set). We use the CNN/Daily Mail dataset (Hermann et al., 2015) modified for summarization (Nallapati et al., 2016). Because there are two versions of the dataset, original text and entity anonymized, we show results on both versions for a fair comparison to prior works. The experiment runs training and evaluation for each version separately. Although the two versions have been treated by the summarization community as two different datasets, we use the same hyper-parameter values for both to show the generalization of our model. We also show improvements on the DUC-2002 dataset in a test-only setup.

For all the datasets, we evaluate standard ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) on full-length $F_{1}$ (with stemming) following previous works (Nallapati et al., 2017; See et al., 2017; Paulus et al., 2018). Following See et al. (2017), we also evaluate on METEOR (Denkowski and Lavie, 2014) for a more thorough analysis.

5.2 Modular Extractive vs. Abstractive

Our hybrid approach is capable of both extractive and abstractive (i.e., rewriting every sentence) summarization. The extractor alone performs extractive summarization. To investigate the effect of the recurrent extractor (rnn-ext), we implement a feed-forward extractive baseline ff-ext (details in supplementary). It is also possible to apply RL to the extractor without using the abstractor (rnn-ext + RL). (In this case the abstractor function is $g(d)=d$.) Benefiting from the high modularity of our model, we can make our summarization system abstractive by simply applying the abstractor on the extracted sentences. Our abstractor rewrites each sentence and generates novel words from a large vocabulary, and hence every word in our overall summary is generated from scratch, which places our full model in the abstractive paradigm. (Note that the abstractive CNN/DM dataset does not include any human-annotated extraction label, and hence our models do not receive any direct extractive supervision.) We run experiments on the separately trained extractor/abstractor (ff-ext + abs, rnn-ext + abs) and the reinforced full model (rnn-ext + abs + RL) as well as the final reranking version (rnn-ext + abs + RL + rerank).

Results

For easier comparison, we show separate tables for the original-text vs. anonymized versions – Table 1 and Table 2, respectively. Overall, our model achieves strong improvements and the new state-of-the-art on both extractive and abstractive settings for both versions of the CNN/DM dataset (with some comparable results on the anonymized version). Moreover, Table 3 shows the generalization of our abstractive system to an out-of-domain test-only setup (DUC-2002), where our model achieves better scores than See et al. (2017).

In the extractive paradigm, we compare our model with the extractive model from Nallapati et al. (2017) and a strong lead-3 baseline. For producing our summary, we simply concatenate the extracted sentences from the extractors. From Table 1 and Table 2, we can see that our feed-forward extractor out-performs the lead-3 baseline, empirically showing that our hierarchical sentence encoding model is capable of extracting salient sentences. (The ff-ext model outperforms rnn-ext possibly because it does not predict sentence ordering; it is thus easier to optimize, and the n-gram based metrics do not consider sentence ordering. Also note that in our MDP formulation, we cannot apply RL on ff-ext due to its historyless nature. Even if applied naively, there is no means for the feed-forward model to learn the EOE described in Sec. 3.2.) The reinforced extractor performs the best, because of the ability to get the summary-level reward and the reduced train-test mismatch of feeding the previous extraction decision. The improvement over lead-3 is consistent across both tables. In Table 2, it outperforms the previous best neural extractive model (Nallapati et al., 2017). In Table 1, our model also outperforms a recent, concurrent sentence-ranking RL model by Narayan et al. (2018), showing that our pointer-network extractor and reward formulations are very effective when combined with A2C RL.

6.2 Abstractive Summarization

After applying the abstractor, the ff-ext based model still out-performs the rnn-ext model. Both combined models exceed the pointer-generator model (See et al., 2017) without coverage by a large margin for all metrics, showing the effectiveness of our 2-step hierarchical approach: our method naturally avoids repetition by extracting multiple sentences with different keypoints. (A trivial lead-3 + abs baseline obtains ROUGE of (37.37, 15.59, 34.82), which again confirms the importance of our reinforce-based sentence selection.)

Moreover, after applying reinforcement learning, our model performs better than the best model of See et al. (2017) and the best ML trained model of Paulus et al. (2018). Our reinforced model outperforms the ML trained rnn-ext + abs baseline with statistical significance of $p<0.01$ on all metrics for both versions of the dataset, indicating the effectiveness of the RL training. Also, rnn-ext + abs + RL is statistically significantly better than See et al. (2017) for all metrics with $p<0.01$. (We calculate statistical significance based on the bootstrap test (Noreen, 1989; Efron and Tibshirani, 1994) with 100K samples. The output of Paulus et al. (2018) is not available, so we could not test for statistical significance there.) In the supplementary, we show the learning curve of our RL training, where the average reward goes up quickly after the extractor learns the End-of-Extract action and then stabilizes. For all the above models, we use standard greedy decoding and find that it performs well.
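For readers unfamiliar with bootstrap significance testing, the sketch below shows one common paired variant. The text does not spell out the exact procedure used, so this should be read as an assumption-laden illustration: it resamples per-document score differences with replacement and reports the fraction of resamples in which system A fails to beat system B. The function name and the `seed` parameter are hypothetical.

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_samples=100_000, seed=0):
    """Approximate p-value for 'system A is better than system B'.

    scores_a, scores_b: per-document metric scores on the same test
    documents, in the same order.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_samples):
        # Resample the test set with replacement and compare total scores.
        sample = rng.choices(diffs, k=n)
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_samples
```

A returned value below 0.01 would correspond to the $p<0.01$ threshold quoted above, under the assumptions of this particular variant.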

Although the extract-then-abstract approach inherently will not generate repeating sentences like other neural decoders do, there might still be across-sentence redundancy because the abstractor is not aware of other extracted sentences when decoding one. Hence, we incorporate an optional reranking strategy described in Sec. 3.3. The improved ROUGE scores indicate that this successfully removes some remaining redundancies and hence produces more concise summaries. Our best abstractive model (rnn-ext + abs + RL + rerank) is clearly superior to that of See et al. (2017). We are comparable on R-1 and R-2 and achieve a 0.4-point improvement on R-L w.r.t. Paulus et al. (2018). (We do not list the scores of their pure RL model because they discussed its bad readability.) We also outperform the results of Fan et al. (2017) on both original and anonymized dataset versions. Several previous works have pointed out that extractive baselines are very difficult to beat (in terms of ROUGE) by an abstractive system (See et al., 2017; Nallapati et al., 2017). Note that our best model is one of the first abstractive models to outperform the lead-3 baseline on the original-text CNN/DM dataset. Our extractive experiment serves as a complementary analysis of the effect of RL with extractive systems.

6.3 Human Evaluation

We also conduct human evaluation to ensure robustness of our training procedure. We measure relevance and readability of the summaries. Relevance is based on the summary containing important, salient information from the input article, being correct by avoiding contradictory/unrelated information, and avoiding repeated/redundant information. Readability is based on the summary’s fluency, grammaticality, and coherence. To evaluate both these criteria, we design the following Amazon MTurk experiment: we randomly select 100 samples from the CNN/DM test set and ask the human testers (3 for each sample) to rank between summaries (for relevance and readability) produced by our model and that of See et al. (2017) (the models were anonymized and randomly shuffled), i.e., A is better, B is better, or both are equally good/bad. Following previous work, the input article and ground truth summaries are also shown to the human participants in addition to the two model summaries. (We selected human annotators that were located in the US, had an approval rate greater than 95%, and had at least 10,000 approved HITs on record.) From the results shown in Table 4, we can see that our model is better in both relevance and readability w.r.t. See et al. (2017).

6.4 Speed Comparison

Our two-stage extractive-abstractive hybrid model is not only the SotA on summary quality metrics, but more importantly also gives a significant speed-up in both train and test time over a strong neural abstractive system (See et al., 2017). (This is the only publicly available code with a pretrained model for neural summarization on which we could test the speed.)

Our full model is composed of an extremely fast extractor and a parallelizable abstractor, where the computation bottleneck is on the abstractor, which has to generate summaries with a large vocabulary from scratch. (The time needed for the extractor is negligible w.r.t. the abstractor because it does not require large matrix multiplication for generating every word. Moreover, with the convolutional encoder at the word level made parallelizable by the hierarchical rnn-ext, our model is scalable for very long documents.) The main advantage of our abstractor at decoding time is that we can first compute all the extracted sentences for the document, and then abstract every sentence concurrently (in parallel) to generate the overall summary. In Table 5, we show the substantial test-time speed-up of our model compared to See et al. (2017). (For details of the training speed-up, please see the supplementary.) We calculate the total decoding time for producing all summaries for the test set. (We time the model of See et al. (2017) using a beam size of 4, which was used for their best-reported scores. Without beam-search, it gets significantly worse ROUGE of (36.62, 15.12, 34.08), so we do not compare speed-ups w.r.t. that version.) Because the main test-time speed bottleneck of an RNN language generation model is that it is constrained to generate one word at a time, the total decoding time depends on the total number of words generated; we hence also report the decoded words per second for a fair comparison. Our model without reranking is extremely fast. From Table 5 we can see that we achieve a speed-up of 18x in time and 24x in word generation rate. Even after adding the (optional) reranker, we still maintain a 6-7x speed-up (and hence a user can choose to use the reranking component depending on their downstream application’s speed requirements). (Most of the recent neural abstractive summarization systems are of similar algorithmic complexity to that of See et al. (2017). The main differences, such as the training objective (ML vs. RL) and copying (soft/hard), have negligible test runtime compared to the slowest component: the long-summary attentional decoder’s sequential generation; and this is the component that we substantially speed up via our parallel sentence decoding with sentence-selection RL.)

Analysis

We compute an abstractiveness score (See et al., 2017) as the ratio of novel $n$-grams in the generated summary that are not present in the input document. The results are shown in Table 6: our model produces substantially more abstractive summaries than previous work. A potential reason for this is that when trained with individual sentence-pairs, the abstractor learns to drop more document words so as to write individual summary sentences as concisely as human-written ones, hence the improvement in multi-gram novelty.
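The novel $n$-gram ratio is simple to compute; a minimal sketch is given below. It assumes pre-tokenized inputs and counts every $n$-gram occurrence in the summary (not only distinct ones), which is one of several reasonable conventions; the function names are illustrative.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_ratio(summary, document, n):
    """Fraction of summary n-grams that never appear in the document."""
    summ = ngrams(summary, n)
    if not summ:
        return 0.0
    doc = set(ngrams(document, n))
    return sum(g not in doc for g in summ) / len(summ)
```

For example, with the document "the quick brown fox jumps over the lazy dog" and the summary "the quick fox jumps", no unigram is novel, one of the three bigrams ("quick fox") is novel, and both trigrams are novel.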

7.2 Qualitative Analysis on Output Examples

We show examples of how our best model selects sentences and then rewrites them. In the supplementary Fig. 4 and Fig. 5, we can see how the abstractor rewrites the extracted sentences concisely while keeping the mentioned facts. Adding the reranker makes the output more compact globally. We observe that when rewriting longer text, the abstractor would have many facts to choose from (Fig. 5 sentence 2) and this is where the reranker helps avoid redundancy across sentences.

Conclusion

We propose a novel sentence-level RL model for abstractive summarization, which makes the model aware of the word-sentence hierarchy. Our model achieves the new state-of-the-art on both CNN/DM versions as well as better generalization on test-only DUC-2002, along with a significant speed-up in training and decoding.

Acknowledgments

We thank the anonymous reviewers for their helpful comments. This work was supported by a Google Faculty Research Award, a Bloomberg Data Science Research Grant, an IBM Faculty Award, and NVidia GPU awards.

References

Supplementary Materials

Appendix A Model Details

Here we describe the convolutional sentence representation used in Sec. 2.1.1. We use the temporal convolutional model proposed by Kim (2014) to compute the representation of every individual sentence in the document. First, the words are converted to distributed vector representations by a learned word embedding matrix $W_{emb}$. The sequence of word vectors from each sentence is then fed through 1-D single-layer convolution filters with various window sizes (3, 4, 5) to capture the temporal dependencies of nearby words, followed by a ReLU non-linear activation and max-over-time pooling. The convolutional representation $r_{j}$ for the $j$-th sentence is then obtained by concatenating the pooled outputs of all filter window sizes.
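The pipeline above (embedding lookup, convolutions with windows 3, 4 and 5, ReLU, max-over-time pooling, concatenation) can be sketched in plain Python. The dimensions, the random weights, the omission of bias terms, and the zero-padding of sentences shorter than a window are all assumptions made for illustration.

```python
import random


def conv_sentence_repr(word_ids, emb, filters):
    """emb: vocab x dim embedding table.
    filters: {window: [kernel, ...]}, each kernel a window x dim weight grid."""
    dim = len(emb[0])
    words = [emb[w] for w in word_ids]  # embedding lookup
    features = []
    for window in sorted(filters):
        # Assumption: zero-pad so that at least one window position exists.
        pad = [[0.0] * dim for _ in range(max(0, window - len(words)))]
        padded = words + pad
        for kernel in filters[window]:
            best = 0.0  # ReLU outputs are >= 0, so 0 is a valid floor
            for start in range(len(padded) - window + 1):
                act = sum(kernel[k][c] * padded[start + k][c]
                          for k in range(window) for c in range(dim))
                best = max(best, act)  # max over ReLU(act) across time
            features.append(best)
    return features  # concatenation over all window sizes


random.seed(0)
DIM, VOCAB, FILTERS_PER_WINDOW = 4, 10, 2  # illustrative sizes only
emb = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
filters = {w: [[[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(w)]
               for _ in range(FILTERS_PER_WINDOW)] for w in (3, 4, 5)}
r_j = conv_sentence_repr([1, 5, 2, 7, 3, 9], emb, filters)
```

The length of `r_j` is the total number of filters across window sizes, independent of the sentence length, which is what lets sentences of different lengths share one downstream network.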

A.2 Abstractor

In this section we discuss the architecture choices for our abstractor network in Sec. 2.2. At a high level, it is a sequence-to-sequence model with attention and a copy mechanism (but no coverage). Note that the abstractor network is a separate neural network from the extractor agent, without any form of parameter sharing.

We use a standard encoder-aligner-decoder model (Bahdanau et al., 2015; Luong et al., 2015) with the bilinear multiplicative attention function (Luong et al., 2015), $f_{att}(h_{i},z_{j})=h_{i}^{\top}W_{attn}z_{j}$, for the context vector $e_{j}$. We share the source and target embedding matrix $W_{emb}$ as well as the output projection matrix, as in Inan et al. (2017), Press and Wolf (2017), and Paulus et al. (2018).
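As a small worked example of the bilinear score $h_{i}^{\top}W_{attn}z_{j}$, the sketch below computes the scores for one decoder state, normalizes them, and forms a context vector. The softmax normalization and the context vector as an attention-weighted sum of encoder states follow common practice and are assumptions here, as are the toy values.

```python
import math


def bilinear_attention(H, W, z):
    """H: encoder states h_i (each of length d_h); W: d_h x d_z matrix;
    z: one decoder state of length d_z. Returns (weights, context)."""
    # W z, computed once and reused for every encoder state
    Wz = [sum(W[r][c] * z[c] for c in range(len(z))) for r in range(len(W))]
    # score_i = h_i^T (W z)
    scores = [sum(h[r] * Wz[r] for r in range(len(Wz))) for h in H]
    # numerically stable softmax over source positions
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # context vector: attention-weighted sum of encoder states
    context = [sum(a * h[r] for a, h in zip(weights, H))
               for r in range(len(H[0]))]
    return weights, context


H = [[1.0, 0.0], [0.0, 1.0]]
W_attn = [[1.0, 0.0], [0.0, 1.0]]  # identity, for an easy-to-check example
z_j = [2.0, 0.0]
weights, context = bilinear_attention(H, W_attn, z_j)
```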

We add the copying mechanism as in See et al. (2017) to extend the decoder to predict over the extended vocabulary of words in the input document. A copy probability $p_{copy}=\sigma(v_{\hat{z}}^{\top}\hat{z}_{j}+v_{s}^{\top}z_{j}+v_{w}^{\top}w_{j}+b)$ is calculated with learnable parameters $v$'s and $b$, and is then used to compute a weighted sum of the probabilities over the source vocabulary and the predefined vocabulary. At test time, an OOV prediction is replaced by the document word with the highest attention score.
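The weighted sum and the test-time OOV replacement can be illustrated as below. The sketch assumes that $p_{copy}$ weights the attention-based copy distribution and $1-p_{copy}$ weights the fixed-vocabulary distribution; the `<unk>` token name and both function names are hypothetical.

```python
def mix_copy_distribution(p_vocab, attention, source_tokens, p_copy):
    """Combine a fixed-vocabulary distribution with a copy distribution.

    p_vocab: {word: probability} over the predefined vocabulary.
    attention: one weight per source position (sums to 1).
    """
    final = {w: (1.0 - p_copy) * p for w, p in p_vocab.items()}
    for weight, token in zip(attention, source_tokens):
        # Source words, including out-of-vocabulary ones, receive copy mass.
        final[token] = final.get(token, 0.0) + p_copy * weight
    return final


def replace_oov(predicted, attention, source_tokens, unk="<unk>"):
    """Swap an OOV prediction for the most-attended document word."""
    if predicted != unk:
        return predicted
    best = max(range(len(source_tokens)), key=lambda i: attention[i])
    return source_tokens[best]


p_vocab = {"the": 0.5, "cat": 0.3, "<unk>": 0.2}
attention = [0.1, 0.7, 0.2]
source = ["the", "Zorro", "cat"]
final = mix_copy_distribution(p_vocab, attention, source, p_copy=0.4)
```

With these toy numbers, the out-of-vocabulary source word `Zorro` receives probability 0.4 × 0.7 = 0.28 even though the fixed vocabulary cannot produce it.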

A.3 Actor-Critic Policy Gradient

Here we discuss the details of the actor-critic policy gradient training. Given the MDP formulation described in Sec. 3.2, the return (total discounted future reward) is

for each recurrent step $t$. To learn an optimal policy $\pi^{*}$ that maximizes the state-value function:

we will make use of the action-value function

We then apply the policy gradient theorem and substitute the action-value function with its Monte-Carlo sample:

which runs a single episode and gets the return (estimate of action-value function) by sampling from the policy $\pi_{\theta}$ , where $N_{s}$ is the total number of sentences the agent extracts. This gradient update is also known as the REINFORCE algorithm (Williams, 1992).
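For reference, the textbook forms of the quantities named in this subsection can be written with the symbols used here ($c$ for the state, $j$ for the extracted sentence, $N_{s}$ for the number of extraction steps). The reward indexing and the exact normalization are assumptions and may differ from the numbered equations referred to in the text.

```latex
% Discounted return from step t (reward indexing is an assumption)
R_t = \sum_{k=0}^{N_s - t} \gamma^{k}\, r(t + k)

% State-value and action-value functions under a policy \pi
V^{\pi}(c) = \mathbb{E}_{\pi}\left[ R_t \mid c_t = c \right], \qquad
Q^{\pi}(c, j) = \mathbb{E}_{\pi}\left[ R_t \mid c_t = c,\; j_t = j \right]

% Policy gradient theorem, then a single-episode Monte-Carlo estimate
\nabla_{\theta} J(\theta)
  = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(c, j)\, Q^{\pi_{\theta}}(c, j) \right]
  \approx \frac{1}{N_s} \sum_{t=1}^{N_s} \nabla_{\theta} \log \pi_{\theta}(c_t, j_t)\, R_t
```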

The vanilla REINFORCE algorithm is known for high variance. To mitigate this problem, we add a critic network with trainable parameters $\theta_{c}$ that has the same structure as the pointer network’s decoder (described in Sec. 2.1.2), except that the final output layer is changed to regress the state-value function $V^{\pi_{\theta_{a},\omega}}(c)$. The predicted value $b_{\theta_{c},\omega}(c)$ is called the baseline and is subtracted from the action-value function to estimate the advantage

where $\theta=\{\theta_{a},\theta_{c},\omega\}$ denotes the set of all trainable parameters. The new policy gradient for our extractor can be estimated by substituting the advantage for the action-value function in Eqn. 10 and then using Monte-Carlo samples (using $R_{t}$ to estimate $Q$):

We found that updating with mini-batches of episodes and standardizing $R_{t}$ over all time steps and all episodes within the batch helps convergence.
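The bookkeeping behind this estimate (discounted returns, batch-wise standardization of $R_{t}$, and subtraction of the baseline) can be sketched as follows. The function names are hypothetical, the baseline values would come from the critic in practice, and the order in which standardization and baseline subtraction are combined is left open because it is not specified above.

```python
import math


def discounted_returns(rewards, gamma=0.95):
    """R_t = r_t + gamma * R_{t+1}, computed backwards over one episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def standardize(batch_returns, eps=1e-8):
    """Zero-mean, unit-variance returns over all steps of all episodes."""
    flat = [r for episode in batch_returns for r in episode]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((r - mean) ** 2 for r in flat) / len(flat))
    return [[(r - mean) / (std + eps) for r in episode]
            for episode in batch_returns]


def advantages(returns, baselines):
    """Advantage estimate: Monte-Carlo return minus the critic's baseline."""
    return [r - b for r, b in zip(returns, baselines)]


episode = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
batch = standardize([episode, discounted_returns([0.0, 1.0], gamma=0.5)])
adv = advantages(episode[:2], [1.0, 1.0])  # baselines are made-up values
```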

Here we also show an interesting finding on the effect of adding the EOE action. In Fig. 3, we can see that the average reward is low in the beginning but quickly goes up after the agent picks up the EOE action. The low initial reward arises because the agent does not choose the EOE action and hence keeps getting zero rewards when extracting extra sentences, which lowers the average.

A.4 Sentence Selection Baseline ff-ext

In this subsection, we describe the detailed network structure of the feed-forward extractor baseline (ff-ext). Following the hierarchical sentence representation described in Sec. 2.1.1, if we add another assumption that there exists a sequence $j_{i1},j_{i2},\dots,j_{iN_{s}}$ where $j_{i1}<j_{i2}<\cdots<j_{iN_{s}}$ such that

i.e., the extracted document sentences are summarized in their original order, we could apply the following feed-forward structure for sentence selection. We first learn a document representation by

where $N_{d}$ and $N_{s}$ denote the number of sentences in the document $x$ and the summary $y$, respectively. We then compute the extraction probability:

for each sentence in the document. Assuming we have the ground-truth extraction labels $j_{1},\dots,j_{N_{s}}$, the above formulation treats sentence selection as a sequence of binary classification problems, where the $W$s and $b$s are trainable parameters. We can therefore train the sentence selection network end-to-end with a cross-entropy loss.

At test time, the feed-forward extractor chooses the top-k sentences and then concatenates them in their original order in the document. Note that we still refer to this network as the feed-forward extractor (ff-ext) to distinguish it from the pointer network extractor (rnn-ext), even though it contains recurrent structure.
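The test-time selection rule (take the top-k sentences by extraction probability, then restore document order) is simple enough to state in a few lines. The function name and the probabilities are illustrative only.

```python
def select_top_k(probabilities, k):
    """Indices of the k most probable sentences, returned in document order."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda j: probabilities[j], reverse=True)
    return sorted(ranked[:k])  # re-sort by position, not by score


sentences = ["s0", "s1", "s2", "s3"]
probs = [0.1, 0.9, 0.3, 0.8]
chosen = select_top_k(probs, k=2)
extracted = [sentences[j] for j in chosen]
```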

Appendix B Training Details

We use the CNN/Daily Mail dataset, first proposed by Hermann et al. (2015) for the reading comprehension task. This dataset has been modified for summarization by Nallapati et al. (2017). It differs from the earlier Gigaword dataset (Rush et al., 2015) in the length of the text: both documents and summaries in CNN/Daily Mail are much longer. The standard split of the dataset contains 287,227 documents for training, 13,368 documents for validation, and 11,490 for testing. Note that the original release of this dataset by Hermann et al. (2015) is an anonymized version, where the named entities are anonymized and treated as a single word in the evaluation n-gram matching. On the other hand, See et al. (2017) proposed to use the non-anonymized, original-text version of the dataset. For a fair comparison to prior work, we show results on both versions of the dataset. The experiments run training and evaluation for each version separately (but we transfer the same tuned hyperparameters from the original to the anonymized version).

The DUC-2002 dataset contains 567 document-summary pairs for single-document summarization. Due to its small size, we utilize it in a test-only setup: we directly use the model trained on CNN/Daily Mail (original text) to summarize the DUC documents, testing the generalization/transfer of our models. The results of See et al. (2017) on DUC are obtained by running their publicly available pretrained model. We evaluate the results using the official ROUGE F1 script.

B.2 Hyperparameter Details

All hyper-parameters are tuned on the validation set of the original-text version of CNN/DM. We use mini-batches of 32 samples for all training. The Adam optimizer (Kingma and Ba, 2014) is used with learning rate $0.001$ for ML and $0.0001$ for RL training (other hyper-parameters at their defaults). We apply gradient clipping (Pascanu et al., 2013) using a 2-norm of $2.0$. We do not use any regularization technique except early stopping. We also found that halving the learning rate whenever the validation loss stops decreasing speeds up convergence. For RL training, we use $\gamma=0.95$ for the discount factor in Eqn. 9. We first train the abstractor and extractors separately until convergence with maximum-likelihood objectives, then apply RL training on the trained sub-modules.

We use single-layer LSTM-RNNs with $256$ hidden units for all models. The initial states of the RNN are learned for our extractor agent. For the abstractor network, we learn a linear mapping to transform the encoder final states to the decoder initial states. We also train a word2vec (Mikolov et al., 2013) embedding of 128 dimensions on the same corpus to initialize the embedding matrix for all maximum-likelihood trained models, and the embedding matrix is updated during training. We set the vocabulary to the $30000$ most common words in the training set. To save memory in training, we truncate the input article sentences to a maximum length of $100$ tokens and summary sentences to $30$ tokens (note that this is counted at the sentence level for our abstractor training). We use all possible sentence pairs within every summary without limit. At test time, the length of the input is not limited and the generation limit remains $30$ tokens for the abstractor. For all non-RL models, the number of sentences to extract is tuned on the validation set.

For the reranking (see Sec. 3.3), we set $N=2$ (bi-gram) and $k=5$ (beam size). Because the size of the reranking list is exponential in the number of sentences $n$ of the generated summary, we pruned the beam so as to allow completion (of dev-set summarization) in a reasonable amount of time, as follows: for $n\leq 5$, we use our standard beam size of $k=5$, but for larger $n$ values, we use gradually reduced $k$ values: $(6,4),(7-8,3),(9+,2)$ for $(n,k)$. The diversity ratio of the diverse beam search (Li et al., 2016) is set to $1.0$.
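The beam-pruning schedule for the reranker can be written as a small lookup. The sketch also counts candidate summaries as $k^{n}$, which assumes one of $k$ beam candidates is picked independently for each of the $n$ sentences; that reading of "exponential" is an assumption, and both function names are hypothetical.

```python
def beam_size_for(n, default_k=5):
    """Beam size k as a function of the number of summary sentences n."""
    if n <= 5:
        return default_k
    if n == 6:
        return 4
    if n <= 8:  # n is 7 or 8
        return 3
    return 2    # n is 9 or more


def num_candidates(n, k):
    """Size of the reranking list when each of n sentences has k candidates."""
    return k ** n


# Candidate counts with and without pruning for an 8-sentence summary.
unpruned = num_candidates(8, 5)
pruned = num_candidates(8, beam_size_for(8))
```

For an 8-sentence summary, pruning from $k=5$ to $k=3$ shrinks the list from 390,625 to 6,561 candidates under this counting assumption.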

B.3 Training Speed

It took a total of $19.71$ hours to train our model: 4.15 hours for the abstractor and 15.56 hours for the RL training. (Extractor ML training can be run at the same time as abstractor training and takes approximately 1.5 hours.) On the other hand, See et al. (2017) reported more than 78 hours of training. The training speed gain is mainly from the shortened input/target pairs of our abstractor model. Since our encoder-aligner-decoder structure operates on sentence pairs, it trains much faster than the pointer-generator model (See et al., 2017), which operates on document-summary pairs.

We also report here the speed of training our abstractor as time per training update. For a fair comparison, we use their publicly available code and run training (without the coverage mechanism) on our machine. The vocabulary size, embedding dimension, and number of RNN hidden units are set to the same values as in our model. We set their maximum encoder and decoder steps to 400 and 100 respectively, as reported in their paper. Our abstractor requires only $0.54$ seconds per update, while See et al. (2017) needs $3.42$ seconds. For all our speed experiments we use K40 GPUs (similar to See et al. (2017)). The reduced sequence length gives us an advantage of about 6x. Also, the model proposed by See et al. (2017) needs careful scheduling of the sentence lengths.
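The figures quoted above can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and the overall training-time ratio is only a lower bound because the 78-hour figure is reported as "more than".

```python
abstractor_hours = 4.15
rl_hours = 15.56
total_hours = abstractor_hours + rl_hours  # reported total: 19.71

ours_sec_per_update = 0.54
pointer_generator_sec_per_update = 3.42
per_update_speedup = pointer_generator_sec_per_update / ours_sec_per_update

# Lower bound on the end-to-end training-time ratio (78 hours is a floor).
min_training_ratio = 78 / total_hours

print(f"total training time: {total_hours:.2f} h")
print(f"per-update speed-up: {per_update_speedup:.2f}x")
print(f"training-time ratio: at least {min_training_ratio:.2f}x")
```

The per-update ratio is roughly 6.3, consistent with the 6x advantage stated in the text.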

Appendix C Generation Samples

Please see Fig. 4 and Fig. 5 for the output examples (see the discussion of these examples in Sec. 7.2).