GSum: A General Framework for Guided Neural Abstractive Summarization

Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, Graham Neubig

Introduction

Modern techniques for text summarization generally can be categorized as either extractive methods Nallapati et al. (2017); Narayan et al. (2018b); Zhou et al. (2018), which identify the most suitable words or sentences from the input document and concatenate them to form a summary, or abstractive methods Rush et al. (2015); Chopra et al. (2016); Nallapati et al. (2016); Paulus et al. (2018), which generate summaries freely and are able to produce novel words and sentences. Compared with extractive algorithms, abstractive algorithms are more flexible, making them more likely to produce fluent and coherent summaries. However, the unconstrained nature of abstractive summarization can also result in problems. First, it can result in unfaithful summaries Kryściński et al. (2019), containing factual errors as well as hallucinated content. Second, it can be difficult to control the content of summaries; it is hard to pick in advance which aspects of the original content an abstractive system may touch upon. To address the issues, we propose methods for guided neural abstractive summarization: methods that provide various types of guidance signals that 1) constrain the summary so that the output content will deviate less from the source document; 2) allow for controllability through provision of user-specified inputs.

There have been some previous methods for guiding neural abstractive summarization models. For example, Kikuchi et al. (2016) specify the length of abstractive summaries, Li et al. (2018) provide models with keywords to prevent the model from missing key information, and Cao et al. (2018) propose models that retrieve and reference relevant summaries from the training set. While these methods have demonstrated improvements in summarization quality and controllability, each focuses on one particular type of guidance – it remains unclear which is better and whether they are complementary to each other.

In this paper, we propose a general and extensible guided summarization framework that can take different kinds of external guidance as input. Like most recent summarization models, our model is based on neural encoder-decoders, instantiated with contextualized pretrained language models, including BERT Devlin et al. (2019) and BART Lewis et al. (2020). With this as a strong starting point, we make modifications allowing the model to attend to both the source documents and the guidance signals when generating outputs. As shown in Figure 1, we can provide automatically extracted or user-specified guidance to the model during test time to constrain the model output. At training time, to encourage the model to pay close attention to the guidance, we propose to use an oracle to select informative guidance signals – a simple modification that nonetheless proved essential in effective learning of our guided summarization models. Using this framework, we investigate four types of guidance signals: (1) highlighted sentences in the source document, (2) keywords, (3) salient relational triples in the form of (subject, relation, object), and (4) retrieved summaries.

We evaluate our methods on 6 popular summarization benchmarks. Our best model, using highlighted sentences as guidance, achieves state-of-the-art performance on 4 out of the 6 datasets, including 1.28/0.79/1.13 ROUGE-1/2/L improvements over the previous state-of-the-art model on the widely-used CNN/DM dataset. In addition, we perform in-depth analyses of different guidance signals and demonstrate that they are complementary to each other, in that there is potential to aggregate their outputs and obtain further improvements. An analysis of the results also reveals that our guided models can generate more faithful summaries and more novel words. Finally, we demonstrate that we can control the output by providing user-specified guidance signals, with different provided signals resulting in qualitatively different summaries.

Background and Related Work

Neural abstractive summarization typically takes a source document $\mathbf{x}$ consisting of multiple sentences $x_{1},\cdots,x_{|\mathbf{x}|}$, runs them through an encoder to generate representations, and passes them to a decoder that outputs the summary $\mathbf{y}$ one target word at a time. Model parameters $\theta$ are trained to maximize the conditional likelihood of the outputs in a parallel training corpus $\langle\mathcal{X},\mathcal{Y}\rangle$:

$$\arg\max_{\theta}\sum_{\langle\mathbf{x}^{i},\mathbf{y}^{i}\rangle\in\langle\mathcal{X},\mathcal{Y}\rangle}\log p(\mathbf{y}^{i}\mid\mathbf{x}^{i};\theta).$$

Several techniques have been proposed to improve the model architecture. For example, models of copying (Gu et al., 2016; See et al., 2017; Gehrmann et al., 2018) allow words to be copied directly from the input to the output, and models of coverage discourage the model from generating repetitive words See et al. (2017).

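Guidance can be defined as some variety of signal $\mathbf{g}$ that is fed into the model in addition to the source document $\mathbf{x}$:

$$\arg\max_{\theta}\sum_{\langle\mathbf{x}^{i},\mathbf{y}^{i},\mathbf{g}^{i}\rangle\in\langle\mathcal{X},\mathcal{Y},\mathcal{G}\rangle}\log p(\mathbf{y}^{i}\mid\mathbf{x}^{i},\mathbf{g}^{i};\theta).$$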

Within this overall framework, the types of information that go into $\mathbf{g}$ and the method for incorporating this information into the model may vary. While there are early attempts at non-neural guided models Owczarzak and Dang (2010); Genest and Lapalme (2012), here we focus on neural approaches and summarize recent work in Table 1. For example, Li et al. (2018) first generate a set of keywords, which are then incorporated into the generation process by an attention mechanism. Cao et al. (2018) propose to search the training corpus and retrieve the datapoint $\langle\mathbf{x}^{j},\mathbf{y}^{j}\rangle$ whose input document $\mathbf{x}^{j}$ is most relevant to the current input $\mathbf{x}$, and treat $\mathbf{y}^{j}$ as a candidate template to guide the summarization process. In addition, Jin et al. (2020) and Zhu et al. (2020) extract relational triples in the form of (subject, relation, object) from source documents and represent them with graph neural networks. The decoders then attend to the extracted relations to generate faithful summaries. Concurrent work by Saito et al. (2020) proposes to extract keywords or highlighted sentences using saliency models and feed them to summarization models.

There are also works on controlling the summary length (Kikuchi et al., 2016; Liu et al., 2018b) and styles (Fan et al., 2018) by explicitly feeding the desired features to the model. In addition, Liu et al. (2018a) and Chen and Bansal (2018) follow a two-stage paradigm, in which a subset of the source document $\{x_{i_{1}},\cdots,x_{i_{n}}\}$ will first be selected by a pretrained extractor as highlighted sentences and then be fed into the model encoder in the second stage with the rest of the text discarded.

Methods

Figure 2 illustrates the general framework of our proposed method. We feed both the source documents and various types of guidance signals to the model. Specifically, we experiment with guidance signals including highlighted sentences, keywords, relations, and retrieved summaries, although the framework is general and could be expanded to other varieties of guidance as well.

We adopt the Transformer model (Vaswani et al., 2017) as our backbone architecture, instantiated with BERT or BART, which can be separated into the encoder and decoder components.

Our model has two encoders, encoding the input source document and guidance signals respectively.

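Similar to the Transformer model, each of our encoders is composed of $N_{enc}+1$ layers, with each encoding layer containing both a self-attention block and a feed-forward block:

$$\mathbf{x}=\text{LN}(\mathbf{x}+\text{SelfAttn}(\mathbf{x})),$$
$$\mathbf{x}=\text{LN}(\mathbf{x}+\text{FeedForward}(\mathbf{x})),$$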

where LN denotes layer normalization. Note the source document and guidance signal do not interact with each other during encoding.

We share the parameters of the bottom $N_{enc}$ layers and the word embedding layers between the two encoders, because 1) this can reduce the computation and memory requirements; 2) we conjecture that the differences between source documents and guidance signals should be high-level, which are captured at top layers of the encoders.
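The sharing scheme can be pictured with a small sketch. It is only an illustration under simplifying assumptions: `EncoderLayer` is a hypothetical placeholder that carries a name rather than real weights, and `build_encoders` is not taken from any released code. Sharing is modeled by putting the same layer objects in both stacks.

```python
class EncoderLayer:
    """Hypothetical placeholder for one Transformer encoder layer."""

    def __init__(self, name):
        self.name = name


def build_encoders(n_shared):
    """Return (source_encoder, guidance_encoder) as lists of layers.

    The bottom n_shared layers are the same objects in both lists, so any
    update to them affects both encoders; only the top layer is separate.
    """
    shared = [EncoderLayer(f"shared_{i}") for i in range(n_shared)]
    source_encoder = shared + [EncoderLayer("source_top")]
    guidance_encoder = shared + [EncoderLayer("guidance_top")]
    return source_encoder, guidance_encoder


source_enc, guidance_enc = build_encoders(n_shared=12)
distinct_layers = {id(layer) for layer in source_enc + guidance_enc}
print(len(source_enc), len(guidance_enc), len(distinct_layers))
```

With `n_shared=12`, each encoder has 13 layers but only 14 distinct layers exist in total, rather than the 26 that two fully separate encoders would need.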

Decoder

Different from the standard Transformer, our decoder has to attend to both the source document and guidance signal instead of just one input.

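Concretely, our decoder is composed of $N_{dec}$ identical layers, with each layer containing four blocks. After the self-attention block, the decoder first attends to the guidance signals and generates the corresponding representations, so that the guidance signal informs the decoder which part of the source document should be focused on. Then, the decoder attends to the whole source document based on the guidance-aware representations. Finally, the output representation is fed into the feed-forward block:

$$\mathbf{y}=\text{LN}(\mathbf{y}+\text{SelfAttn}(\mathbf{y})),$$
$$\mathbf{y}=\text{LN}(\mathbf{y}+\text{CrossAttn}(\mathbf{y},\mathbf{g})),$$
$$\mathbf{y}=\text{LN}(\mathbf{y}+\text{CrossAttn}(\mathbf{y},\mathbf{x})),$$
$$\mathbf{y}=\text{LN}(\mathbf{y}+\text{FeedForward}(\mathbf{y})).$$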

Ideally, the second cross-attention block allows the model to fill in the details of the input guidance signal, such as finding the name of an entity by searching through co-reference chains.
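The order of the four decoder blocks (self-attention, cross-attention over the guidance, cross-attention over the source, feed-forward) can be sketched in plain Python. This is a deliberately stripped-down illustration, not the model itself: attention is single-head with identity projections, the causal mask is omitted, and `feed_forward` is a hypothetical stand-in (a bare ReLU) for the learned feed-forward block. All function names are assumptions made for this sketch.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned scale)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def attend(queries, memory):
    """Single-head scaled dot-product attention with identity projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * m[j] for w, m in zip(weights, memory)) / total for j in range(dim)]
        )
    return outputs


def residual_norm(states, update):
    """Apply LN(states + update) position by position."""
    return [layer_norm([a + b for a, b in zip(s, u)]) for s, u in zip(states, update)]


def feed_forward(states):
    """Hypothetical stand-in for the learned feed-forward block."""
    return [[max(0.0, v) for v in row] for row in states]


def decoder_layer(y, guidance, source):
    y = residual_norm(y, attend(y, y))          # self-attention (mask omitted)
    y = residual_norm(y, attend(y, guidance))   # first cross-attention: guidance
    y = residual_norm(y, attend(y, source))     # second cross-attention: source
    y = residual_norm(y, feed_forward(y))       # feed-forward block
    return y
```

The point of the sketch is the data flow: the representation that queries the source document has already been updated by the guidance, which is how the guidance can steer which parts of the source receive attention.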

Choices of Guidance Signals

Before delving into the specifics of the types of guidance signal we used, we first note an important detail in training our model. At test time, there are two ways we can define the guidance signal: 1) manual definition where an interested user defines the guidance signal $\mathbf{g}$ by hand, and 2) automatic prediction where an automated system is used to infer the guidance signal $\mathbf{g}$ from input $\mathbf{x}$ . We demonstrate results for both in experiments.

At training time, it is often prohibitively expensive to obtain manual guidance. Hence, we focus on two ways of generating guidance: 1) automatic prediction using $\mathbf{x}$ as detailed above, and 2) oracle extraction, where we use both $\mathbf{x}$ and $\mathbf{y}$ to deduce a value $\mathbf{g}$ that is most likely useful in generating $\mathbf{y}$.

Theoretically, automatic prediction has the advantage of matching the training and testing conditions of a system that will also receive automatic predictions at test time. However, as we will show in experiments, the use of oracle guidance has a large advantage of generating guidance signals that are highly informative, thus encouraging the model to pay more attention to them at test time.

With this in mind, we describe the four varieties of guidance signal we experiment with, along with their automatic and oracle extraction methods.

The success of extractive approaches has demonstrated that we can extract a subset of sentences $\{x_{i_{1}},\cdots,x_{i_{n}}\}$ from the source document and concatenate them to form a summary. Inspired by this, we explicitly inform our model which subset of source sentences should be highlighted using extractive models.

We perform oracle extraction using a greedy search algorithm (Nallapati et al., 2017; Liu and Lapata, 2019) to find a set of sentences in the source document that have the highest ROUGE scores with the reference (detailed in Appendix A) and treat these as our guidance $\mathbf{g}$. At test time, we use pretrained extractive summarization models (BertExt (Liu and Lapata, 2019) or MatchSum (Zhong et al., 2020) in our experiments) to perform automatic prediction.

If we select full sentences, they may contain unnecessary information that does not occur in an actual summary, which could distract the model from focusing on the desired aspects of the input. Therefore, we also try to feed our model with a set of individual keywords $\{w_{1},\ldots,w_{n}\}$ from the source document.

For oracle extraction, we first use the greedy search algorithm mentioned above to select a subset of input sentences, then use TextRank (Mihalcea and Tarau, 2004) to extract keywords from these sentences. We also filter out the keywords that are not in the target summary. The remaining keywords are then fed to our models. For automatic prediction, we use another neural model (BertAbs (Liu and Lapata, 2019) in the experiments) to predict the keywords in the target summary.
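The final filtering step of the keyword oracle is simple enough to sketch. The snippet below assumes that candidate keywords have already been produced upstream (for example by TextRank, which is not reimplemented here), that keywords are single tokens, and that whitespace tokenization with lowercasing is good enough for matching. The function name `filter_keywords` is hypothetical.

```python
def filter_keywords(candidates, target_summary):
    """Keep candidate keywords that occur in the target summary.

    Matching is case-insensitive on whitespace tokens; duplicates are
    dropped while the original order of the candidates is preserved.
    """
    target_vocab = set(target_summary.lower().split())
    kept, seen = [], set()
    for keyword in candidates:
        key = keyword.lower()
        if key in target_vocab and key not in seen:
            kept.append(keyword)
            seen.add(key)
    return kept


candidates = ["Obama", "Hawaii", "senator", "Hawaii"]
print(filter_keywords(candidates, "Obama was born in Hawaii"))
# ['Obama', 'Hawaii']
```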

Relations are typically represented in the form of relational triples, with each triple containing a subject, a relation, and an object. For example, the sentence "Barack Obama was born in Hawaii" yields the triple (Barack Obama, was born in, Hawaii).

For oracle extraction, we first use Stanford OpenIE (Angeli et al., 2015) to extract relational triples from the source document. Similar to how we select highlighted sentences, we then greedily select a set of relations that have the highest ROUGE score with the reference, which are then flattened and treated as guidance. For automatic prediction, we use another neural model (similarly, BertAbs) to predict the relation triples on the target side.

Intuitively, gold summaries of documents similar to the input can provide a reference point to guide the summarization. Therefore, we also try to retrieve relevant summaries from the training data $\langle\mathcal{X},\mathcal{Y}\rangle$.

For oracle extraction, we directly retrieve five datapoints $\{\langle\mathbf{x}_{1},\mathbf{y}_{1}\rangle,\ldots,\langle\mathbf{x}_{5},\mathbf{y}_{5}\rangle\}$ from the training data whose summaries $\mathbf{y}_{i}$ are most similar to the target summary $\mathbf{y}$ using Elasticsearch (https://github.com/elastic/elasticsearch). For automatic prediction at test time, we instead retrieve five datapoints whose source documents $\mathbf{x}_{i}$ are most similar to each input source document $\mathbf{x}$.
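The difference between the two retrieval modes is which field of a training pair is matched against the query. The sketch below makes that explicit. It does not call Elasticsearch: a Jaccard token overlap is swapped in as a stand-in similarity so the example stays self-contained, and all function names are hypothetical.

```python
def jaccard(text_a, text_b):
    """Stand-in similarity: Jaccard overlap of lowercased whitespace tokens."""
    tokens_a, tokens_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def retrieve_summaries(training_pairs, query, field, k=5):
    """Return the summaries of the k pairs whose `field` best matches the query.

    training_pairs is a list of (document, summary) tuples; field is 0 to
    match against documents and 1 to match against summaries.
    """
    ranked = sorted(training_pairs, key=lambda pair: jaccard(pair[field], query), reverse=True)
    return [summary for _, summary in ranked[:k]]


def oracle_guidance(training_pairs, target_summary, k=5):
    # Oracle mode: compare training summaries with the gold target summary.
    return retrieve_summaries(training_pairs, target_summary, field=1, k=k)


def automatic_guidance(training_pairs, source_document, k=5):
    # Automatic mode: compare training documents with the input document.
    return retrieve_summaries(training_pairs, source_document, field=0, k=k)


pairs = [
    ("storm hits the coast", "storm damages coastal towns"),
    ("team wins the final", "local team wins championship"),
    ("coast guard rescues sailors", "sailors rescued near coast"),
]
print(oracle_guidance(pairs, "storm damages towns on the coast", k=1))
print(automatic_guidance(pairs, "coast guard rescues two sailors", k=1))
```

Only the oracle mode needs the reference summary, which is why it is available at training time but not at test time.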

Experiments

We experiment on 6 datasets (statistics in Table 2):

Reddit (Kim et al., 2019) is a highly abstractive dataset and we use its TIFU-long version.

XSum (Narayan et al., 2018a) is an abstractive dataset that contains one-sentence summaries of online articles from BBC.

CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) is a widely-used summarization dataset consisting of news articles and associated highlights as summaries. We use its non-anonymized version.

WikiHow (Koupaee and Wang, 2018) is extracted from an online knowledge base and requires a high level of abstraction.

NYT (Sandhaus, 2008) is a dataset that consists of news articles and their associated summaries (https://catalog.ldc.upenn.edu/LDC2008T19). We follow Kedzie et al. (2018) to preprocess and split the dataset.

PubMed (Cohan et al., 2018) is relatively extractive and is collected from scientific papers.

Baselines

Our baselines include the following models:

BertExt (Liu and Lapata, 2019) is an extractive model whose parameters are initialized with BERT (Devlin et al., 2019).

BertAbs (Liu and Lapata, 2019) is an abstractive model whose encoder is initialized with BERT and trained with a different optimizer than its decoder.

MatchSum (Zhong et al., 2020) is an extractive model that reranks the candidate summaries produced by BertExt and achieves state-of-the-art extractive results on various summarization datasets.

BART (Lewis et al., 2020) is a state-of-the-art abstractive summarization model pretrained with a denoising autoencoding objective.

Implementation Details

We build our models based on both BertAbs and BART, and follow their hyperparameter settings to train our summarizers. For our model built on BertAbs, there are 13 encoding layers, with the top layer randomly initialized and separately trained between the two encoders. For our model built on BART, there are 24 encoding layers, with the top layer initialized with pretrained parameters yet separately trained between the two encoders. The first cross-attention block of the decoder is randomly initialized whereas the second cross-attention block is initialized with pretrained parameters. BertAbs is used to predict guidance signals of relations and keywords during test time. Unless otherwise stated, we use oracle extractions at training time.

Main Results

We first compare different kinds of guidance signals on the CNN/DM dataset using BertAbs, then evaluate the best guidance on the other five datasets using both BertAbs and BART.

As shown in Table 3, if we feed the model with automatically constructed signals, feeding either highlighted sentences or keywords outperforms the abstractive summarization baseline by a large margin. In particular, feeding highlighted sentences outperforms the best baseline by more than 1 ROUGE-L point. Using relations or retrieved summaries as guidance does not improve over the baseline, likely because it is hard to predict these signals at test time.

If we use an oracle to select the guidance signals, all varieties of guidance can improve the baseline performance significantly, with the best-performing model achieving a ROUGE-1 score of 55.18. The results indicate that 1) the model performance has the potential to be further improved given a better guidance prediction model; 2) the model does learn to depend on the guidance signals.

We then build our model on the state-of-the-art model, using highlighted sentences as guidance since they achieve the best performance on CNN/DM. First, we build our model on BART and train it with oracle-extracted highlighted sentences as guidance. Then, we use MatchSum to predict the guidance at test time. From Table 4, we can see that our model achieves improvements of over 1 ROUGE-1/L point compared with the state-of-the-art models, indicating the effectiveness of the proposed methods.

We report the performance of the highlighted sentence model on all the other five datasets in Table 5. Generally, the model works better when the dataset is more extractive. For abstractive datasets such as Reddit and XSum, our model cannot achieve performance increases when the abstractive summarization baseline is already rather strong. For extractive datasets such as PubMed and NYT, on the other hand, our model can achieve some improvements over the baselines even though the abstractive baseline outperforms the extractive oracle model in some cases.

Analysis

We perform extensive analyses on CNN/DM to gain insights into our (BERT-based) models. Unless otherwise stated, we use oracle extractions at training time and automatic prediction at test time.

While we sometimes provide information extracted from the source document as guidance signals, it is unclear whether the model will over-fit to and regurgitate this guidance, or still generate novel expressions. To measure this, we count the number of novel $n$-grams in the output summaries, namely $n$-grams that do not appear in the source document. As shown in Figure 3, all of our guided models in fact generate more novel $n$-grams than the baseline, likely because at training time the model is trained to compress and paraphrase the extracted information from the source document into the gold summary. In addition, our models cover more novel $n$-grams that are in the gold reference than the baseline. The results indicate that our guided models can indeed generate novel expressions, and are not referencing the input guidance too strongly.
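The novel $n$-gram measurement can be illustrated with a short sketch. It assumes lowercased whitespace tokenization and reports both a raw count and a ratio, since the text does not state which normalization is plotted. The helper names are hypothetical.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngrams(source, summary, n):
    """Return the summary n-grams that never occur in the source document."""
    source_set = set(ngrams(source.lower().split(), n))
    return [gram for gram in ngrams(summary.lower().split(), n) if gram not in source_set]


def novel_ngram_ratio(source, summary, n):
    """Fraction of summary n-grams that are novel with respect to the source."""
    total = len(ngrams(summary.lower().split(), n))
    if total == 0:
        return 0.0
    return len(novel_ngrams(source, summary, n)) / total


source = "the cat sat on the mat"
summary = "the cat slept on the mat"
print(novel_ngrams(source, summary, 2))       # [('cat', 'slept'), ('slept', 'on')]
print(novel_ngram_ratio(source, summary, 2))  # 0.4
```

The same helpers can be pointed at the gold reference instead of the system output to check how many of the reference's novel $n$-grams a system recovers.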

While some guidance signals achieve worse performance than others, it is still possible to aggregate their outputs and obtain better performance if their outputs are diverse and they complement each other. To verify this hypothesis, we select the best output of the four guidance signals for each test datapoint and investigate whether aggregating these best outputs achieves better performance.

Concretely, for each test input, we perform an oracle experiment where we compute the ROUGE score of each output of the four guidance signals and pick the best one. As shown in Table 6, despite the fact that the highlighted sentence signal achieves the best overall performance, it still underperforms one of the other three varieties of guidance more than 60% of the time. In addition, by aggregating their best outputs together, we can achieve ROUGE-1/L scores of 48.30/45.15, which significantly outperform any single guided model. Further, we try to aggregate these guidance signals in a pairwise manner, and Table 7 demonstrates that the guidance signals are complementary to one another to some extent. Thus, we can safely conclude that each type of guidance signal has its own merits, and one promising direction is to utilize a system combination method such as Hong et al. (2015) to aggregate the results together.
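The oracle aggregation amounts to a per-example argmax over systems. The sketch below shows that selection step only. It substitutes a simple unigram F1 for ROUGE so that it runs without external packages, and the names `unigram_f1` and `oracle_aggregate` are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Stand-in for ROUGE: F1 over clipped unigram overlap."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def oracle_aggregate(outputs_by_signal, references, score_fn=unigram_f1):
    """For each test example, pick the guidance signal whose output scores best.

    outputs_by_signal maps a signal name to its list of outputs, aligned
    with references. Returns a list of (signal_name, output) pairs.
    """
    chosen = []
    for i, reference in enumerate(references):
        best = max(outputs_by_signal, key=lambda name: score_fn(outputs_by_signal[name][i], reference))
        chosen.append((best, outputs_by_signal[best][i]))
    return chosen


outputs = {
    "sentences": ["the cat sat", "rain fell"],
    "keywords": ["a cat", "heavy rain fell today"],
}
references = ["the cat sat down", "heavy rain fell today"]
print(oracle_aggregate(outputs, references))
```

Because the selection uses the reference, this is an upper bound on what a system combination method could reach, not a deployable procedure.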

It is also of interest what effect this guidance has on the model outputs qualitatively. We sample several generated outputs (Table 8) and find that different provided signals can result in different outputs. In particular, for our sentence-guided model, providing the model with "by running tissue paper over his son seth makes him sleep" enables the model to generate the exact same sentence, and when the model is fed with "one grateful viewer of the video commented…", it generates "one viewer commented…". The examples demonstrate that our model can generate summaries mostly faithful to the guidance signals while also performing abstraction.

We also evaluate whether our generated summaries are faithful to the source document. We randomly sample 100 datapoints from the test set and ask 3 people from Amazon Mechanical Turk to evaluate their factual correctness. Each person gives a score between 1 and 3, with 3 being perfectly faithful to the source document. Table 9 shows that our guided model can generate more faithful summaries compared with the baseline.

As mentioned previously, we use an oracle to select guidance signals during training. In this part, we investigate if we can provide automatically constructed guidance to the model during training as well. Table 10 shows that this methodology will lead to significantly worse performance. We conjecture that this is because when the relevancy between guidance and reference is weakened, the model will not learn to depend on the guidance signals and thus the model will be reduced to the original abstractive summarization baseline.

Conclusion

We propose a general framework for guided neural summarization, using which we investigate four types of guidance signals and achieve state-of-the-art performance on various popular datasets. We demonstrate the complementarity of the four guidance signals, and find that our models can generate more novel words and more faithful summaries. We also show that we can control the output by providing user-specified guidance signals.

Given the generality of our framework, this opens the possibility for several future research directions including 1) developing strategies to ensemble models under different guidance signals; 2) incorporating sophisticated techniques such as copy or coverage over the source document, the guidance signal, or both; and 3) experimenting with other kinds of guidance signals such as salient elementary discourse units.

Acknowledgements

We thank Shruti Rijhwani, Yiran Chen, Jiacheng Xu and anonymous reviewers for valuable feedback and helpful suggestions. This work was supported in part by a grant under the Northrop Grumman SOTERIA project and the Air Force Research Laboratory under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government.

Appendix A Greedy Selection Algorithm

Algorithm 1 demonstrates how we use an oracle to select a subset of source sentences that have the highest ROUGE scores with the reference summary. We use a similar algorithm to select the relation triples as well. Concretely, we flatten each relational triple $(s,r,o)$ by concatenating its elements together and treat each concatenated text as a source sentence, then use Algorithm 1 to select the relation triples greedily.
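As a rough illustration of the greedy procedure described above, the sketch below repeatedly adds the sentence that most improves the score of the selected set against the reference, and stops when no sentence helps or a size limit is reached. It is not a transcription of Algorithm 1: a unigram F1 stands in for ROUGE, the `max_sentences` limit and lowercased whitespace tokenization are assumptions, and all function names are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Stand-in for ROUGE: F1 over clipped unigram overlap."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_select(sentences, reference, max_sentences=3):
    """Greedily pick sentence indices that maximize the score with the reference."""
    reference_tokens = reference.lower().split()
    selected, best_score = [], 0.0
    for _ in range(max_sentences):
        best_index = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the candidate set in document order.
            tokens = []
            for j in sorted(selected + [i]):
                tokens.extend(sentences[j].lower().split())
            score = unigram_f1(tokens, reference_tokens)
            if score > best_score:
                best_score, best_index = score, i
        if best_index is None:  # no remaining sentence improves the score
            break
        selected.append(best_index)
    return sorted(selected)


def flatten_triple(triple):
    """Concatenate (subject, relation, object) into one pseudo-sentence."""
    return " ".join(triple)


sentences = [
    "the cat sat on the mat",
    "stocks fell sharply today",
    "a dog barked at the cat",
]
print(greedy_select(sentences, "the cat sat on the mat while a dog barked"))  # [0, 2]

triples = [("Barack Obama", "was born in", "Hawaii"), ("Obama", "visited", "Paris")]
flat = [flatten_triple(t) for t in triples]
print(greedy_select(flat, "Barack Obama was born in Hawaii"))  # [0]
```

The second call shows the reuse for relations: each triple is flattened into text and then treated exactly like a source sentence.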

Appendix B Analysis

We perform more analysis on CNN/DM in this section. Unless otherwise stated, we use oracle extractions at training time and BertAbs as our base model.

In addition to the qualitative results in the main paper, we also perform a quantitative analysis to demonstrate the controllability of our models.

The quantitative results in Table 3 of the main text already demonstrate to some extent that we can control the model with guidance signals, as guidance signals of better quality can lead to better summaries. To further demonstrate this, we randomly sample guidance signals multiple times and plot the correlation between guidance quality and output quality in Figure 4. We can clearly see that there is a strong correlation between these two variables, indicating the controllability of our model.

In addition, we try to divide each test reference summary into two halves, then use oracle extraction to obtain guidance signals for both of these two halves and feed them to the model. Table 11 shows that feeding incompatible guidance signals can lead to degraded performance, which further demonstrates that we can control the summary through provision of user-specified inputs.

B.2 Semantic Similarity

To evaluate the semantic similarities between our model outputs and the reference, we also compute METEOR scores (Banerjee and Lavie, 2005). As shown in Table 12, all of our guided models outperform BertAbs in terms of METEOR. However, it is surprising that BertExt achieves the best performance, possibly because METEOR has a tendency to favor long summaries.

B.3 Automatic Factual Correctness Evaluation

Besides human evaluation, we have also tried to use factCC (Kryściński et al., 2019; https://github.com/salesforce/factCC) to evaluate the factual correctness of our model outputs automatically. However, as shown in Figure 5, the factCC tool gives the gold reference an accuracy of only about 10%. Considering that our model is optimized towards the gold reference, the factCC score might not be a good indicator of whether there are factual errors in a generated summary.

B.4 Necessity of Using Oracles During Training

We have demonstrated in the main paper that it is necessary to use an oracle to select guidance signals during training for highlighted sentence models. In this part, we investigate whether this is true for the other three guidance signals as well. Table 13 shows that training with automatically constructed guidance leads to significantly worse performance for the other guidance signals as well, which further verifies our hypothesis that when the relevancy between guidance and reference is weakened, the model will not learn to depend on the guidance signals and thus will be reduced to the original abstractive summarization baseline.

B.5 Domain Adaptation

We also evaluate the performance of our highlighted sentence-guided models under domain adaptation settings, namely training a summarization model on one dataset and testing it on other datasets. As shown in Table 14, extractive models generally outperform abstractive ones under domain adaptation settings, and our model achieves better performance than the abstractive baselines. However, even though our model is given the sentences extracted by the extractive model, it still cannot outperform the extractive baselines. These negative results indicate that our model may still fail to fully depend on guidance signals when doing adaptation. Possible future directions include dropping out the input documents occasionally during training so that the model can learn to better condition on the guidance.