Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward
Luyang Huang, Lingfei Wu, Lu Wang
Introduction
Abstractive summarization aims to produce concise and informative summaries with the goal of promoting efficient information consumption and knowledge acquisition Luhn (1958). Significant progress has been made in this area by designing sequence-to-sequence-based neural models for single-document abstractive summarization Gehrmann et al. (2018); Liu et al. (2018); Liu and Lapata (2019). However, due to the limitations of model structure and word prediction-based learning objectives, these models frequently produce unfaithful content Cao et al. (2018) and near-extractive summaries See et al. (2017); Kryściński et al. (2018). These observations suggest that existing models lack semantic interpretation over the input, which is critical for summarization.
We argue that the generation of informative and succinct abstracts requires structured representation to facilitate the connection of relevant subjects, and the preservation of global context, e.g. entity interactions and topic flows. Take Fig. 1 as an example. Complex events related to the same entity may span multiple sentences, making them challenging for existing sequential models to capture. A graph representation, on the contrary, produces a structured summary and highlights the proximity of relevant concepts.
To this end, we present ASGARD, a framework for Abstractive Summarization with Graph-Augmentation and semantic-driven RewarD. Our code is available at https://github.com/luyang-huang96/GraphAugmentedSum. Under the encoder-decoder framework, we enhance the regular document encoder with a separate graph-structured encoder to maintain the global context and local characteristics of entities by using the outputs from an open information extraction (OpenIE) system.
Specifically, we experiment with two graph variants, one mainly capturing entities’ document-level interactions and the other reflecting such interactions within each paragraph plus topic shifts across paragraphs. Both graphs can capture interactions among entities that are positioned far from one another in the document and significantly reduce redundancy, as shown in Fig. 1. The document encoder and the graph encoder then cooperate during abstract generation, wherein the model is trained to identify salient content by aligning graphs with human summaries. Though structured representation has been studied before for summarization Fernandes et al. (2019), to the best of our knowledge, we are the first to utilize graph neural networks to explicitly encode entity-centered information for abstractive summary generation.
Moreover, we propose a novel multi-choice cloze reward to drive the model to acquire semantic understanding over the input. Concretely, we design cloze questions by removing pairwise entities that are connected with a predicate or co-occur in a human summary sentence, whereas prior work only considers single entities to construct questions Eyal et al. (2019). In tandem with our graph encoding of knowledge, the cloze reward further facilitates the acquisition of global entity interactions with reinforcement learning.
We carry out automatic and human evaluations on popular summarization datasets. Models based on ASGARD yield significantly better ROUGE scores Lin and Hovy (2003) than a variant without access to the knowledge graph on two popular news summarization datasets, New York Times corpus and CNN/Daily Mail dataset. Moreover, ASGARD models attain performance better than or comparable to others that are fine-tuned from large pretrained language models, including BERTSum Liu and Lapata (2019), UniLM Dong et al. (2019), and BART Lewis et al. (2019). Human judges further confirm that our models generate more informative summaries with fewer unfaithful errors than their counterparts without the graph encoder. Importantly, we find that automatic evaluation metrics only weakly correlate with these errors, implying that new evaluation methods are needed to better gauge summary quality.
The rest of the paper is organized as follows. We describe related work in the next section (§ 2). We then discuss the knowledge graph construction in § 3 and formulate our graph-augmented summarization framework in § 4. In § 5, we introduce reinforcement learning with cloze reward. Experiments and results are presented in § 6 and § 7. Finally, we conclude in § 8.
Related Work
Graph-Augmented Summarization and Generation. Graph structures have long been used for extractive summarization, such as in TextRank Mihalcea and Tarau (2004) and LexRank Erkan and Radev (2004). For neural models, Tan et al. (2017) design graph-based attention to identify important sentences. For generating abstractive summaries, Fernandes et al. (2019) enhance a sequence-based encoder with graph neural networks (GNNs) to consider token-level entity types; however, entity interactions are largely ignored. On multi-document summarization, Fan et al. (2019) demonstrate the usefulness of encoding a linearized knowledge graph from OpenIE outputs. In this work, we design a graph encoder, which improves upon Graph Attention Networks (GATs) Veličković et al. (2018), to capture the global context in a more effective manner.
Also related is the graph-to-sequence framework that has been adopted for text generation Song et al. (2018). Both Gated Graph Neural Networks (GGNNs) Beck et al. (2018) and Graph Convolutional Networks (GCNs) Damonte and Cohen (2019) are shown to be effective in generating sentences from AMR graphs. Since Graph Attention Networks can better handle sparse graphs, they are used by Koncel-Kedziorski et al. (2019) with a transformer model to create scientific paper abstracts from knowledge graphs. Here we use graphs in addition to a document encoder, both carrying complementary information for summarization.
Reinforcement Learning and QA Reward for Abstractive Summarization. As pointed out by Ranzato et al. (2016), word-level maximum likelihood training brings the problem of exposure bias. Recent work utilizes reinforcement learning to directly optimize the model to maximize the informativeness of summaries by using different forms of ROUGE scores Paulus et al. (2018); Chen and Bansal (2018); Sharma et al. (2019). However, ROUGE does not always distinguish good summaries from bad ones Novikova et al. (2017), and ignores entity interactions.
Since question answering (QA) has been used for summary evaluation Narayan et al. (2018), and is shown to correlate with human judgment of summary quality Eyal et al. (2019), QA-based rewards have been studied for summarization model training. Arumae and Liu (2019) demonstrate that using fill-in-the-blank questions by removing entities or root words leads to improved content selection. Scialom et al. (2019) consider a similar setup, but use both F1 score and QA system confidence as rewards in abstractive summarization. Previous work, however, mainly focuses on single entities or words in human-written summaries, thereby losing contexts and relations. Moreover, fill-in-the-blank questions in prior work give credit only when the answers exactly match the ground truths, thus causing inaccuracies for rephrased answers and discouraging abstract content generation. In contrast, we design a semantic-driven cloze reward by measuring how well a QA system can address multiple choice cloze questions, which better encode entity interactions and handle paraphrased answers.
Knowledge Graph Construction
To construct a knowledge graph from an input document, we utilize Stanford CoreNLP Manning et al. (2014) to first obtain outputs from coreference resolution and open information extraction (OpenIE) models Angeli et al. (2015). Note that we do not conduct global entity linking across documents. Next, we take the subject, predicate, object triples extracted by OpenIE and remove any triple whose argument (subject or object) has more than words. If two triples differ only by one argument, and the arguments overlap, we keep the longer triple.
We begin constructing the graph by treating subjects and objects as nodes connected by directed edges, with predicates as attributes. We further collapse coreferential mentions of the same entity into one node. With this, we can localize salient content related to each entity as well as make connections of spread-out entities through graph paths.
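The construction steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `build_graph`, the `coref` mapping format, and the `max_arg_words` parameter are hypothetical, the argument-length cutoff is left to the caller, and the step that keeps the longer of two overlapping triples is omitted.

```python
from collections import defaultdict

def build_graph(triples, coref, max_arg_words):
    """Build an entity graph from OpenIE-style triples.

    triples: list of (subject, predicate, object) strings.
    coref: dict mapping a mention to its canonical entity (hypothetical format).
    max_arg_words: argument-length cutoff, supplied by the caller.
    """
    def canon(mention):
        # Collapse coreferential mentions into one node.
        return coref.get(mention, mention)

    edges = defaultdict(set)  # (subject_node, object_node) -> set of predicates
    for subj, pred, obj in triples:
        if max(len(subj.split()), len(obj.split())) > max_arg_words:
            continue  # drop triples with an overly long argument
        edges[(canon(subj), canon(obj))].add(pred)
    nodes = {node for pair in edges for node in pair}
    return nodes, dict(edges)
```

Storing predicates as attributes on directed (subject, object) edges mirrors the description above; entities that are far apart in the text become reachable through shared nodes.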
Summarization Model
In this section, we describe our graph-augmented abstractive summarization framework, as displayed in Fig. 2. Our model takes as input a document, represented as a sequence of tokens, and a knowledge graph consisting of nodes. The document and the graph are separately consumed by a document encoder and a graph encoder, as presented in § 4.1. Importantly, we present two types of graphs: DocGraph, focusing on the global context, and SegGraph, which additionally captures topic shift. The summary decoder then generates an abstractive summary by attending to both the document and the graph (§ 4.2). In § 4.3, we formulate a maximum likelihood training objective which leverages the detection of salient nodes in the graph.
Document Encoder. We first feed the input to RoBERTa Liu et al. (2019) and take the last layer output as token embeddings. We then employ a single-layer bidirectional LSTM (BiLSTM) over token embeddings, producing an encoder hidden state at each time step.
Graph Encoder. Built on the graph constructed in § 3, we create nodes for predicates as done in previous graph-to-sequence work Beck et al. (2018) to reduce model parameters. Directed, unlabeled edges are added from subject to predicate, and from predicate to object. We further add reverse edges and self-loops to enhance the information flow, and this forms the graph consumed by the encoder.
Node Initialization. Each node often contains multiple mentions of an entity; we thus initialize the node representation by using the average embedding of its tokens. We leverage document encoder hidden states as the contextual representation of tokens. The number of mentions in the node is added as an extra encoding to the node representation, to signify entity salience.
Contextualized Node Encoding. Our graph encoder improves upon Graph Attention Networks (GATs) Veličković et al. (2018) by adding residual connections between layers as discussed in Koncel-Kedziorski et al. (2019). Each node is represented by a weighted average of its neighbors:
where denotes the concatenation of heads, each producing a vector of the same dimension as . We use in our experiments with two layers of GATs. denotes the neighbors of in graph . are trainable parameters.
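For orientation, one layer of the update described above can be written in a generic multi-head attention form with a residual connection. This is a sketch under assumptions rather than the paper's exact equation: the head count $N$, the neighbor set $\mathcal{N}(v_i)$, the matrices $\mathbf{W}_{0,n}, \mathbf{W}_{1,n}, \mathbf{W}_{2,n}$, and the dot-product scoring function are illustrative choices, and the head outputs are assumed to be sized (or projected) so that the residual sum is well defined.

```latex
\hat{\mathbf{v}}_{i} = \mathbf{v}_{i} + \big\Vert_{n=1}^{N} \sum_{v_{j} \in \mathcal{N}(v_{i})} \alpha^{n}_{i,j}\, \mathbf{W}_{0,n}\mathbf{v}_{j},
\qquad
\alpha^{n}_{i,j} = \operatorname{softmax}_{j}\!\big( (\mathbf{W}_{1,n}\mathbf{v}_{i})^{\top} (\mathbf{W}_{2,n}\mathbf{v}_{j}) \big)
```

Here the softmax is taken over the neighbors of $v_i$, so each head computes a weighted average of neighbor representations, and the residual term keeps the node's own initialization in the output.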
The graph encoder described above encodes document-level global context by merging entity mentions throughout the document and capturing their interactions with graph paths. It is henceforth denoted as DocGraph.
Encoder Extension to Capture Topic Shift (SegGraph). Modeling topic transitions and recurrences enables the identification of notable content, thus benefiting summarization Barzilay and Lee (2004). Since paragraphs naturally divide a document into different topic segments, we extend DocGraph by first encoding each paragraph as a subgraph using the same graph encoder, and then connecting all subgraphs with a BiLSTM. If two nodes in separate subgraphs refer to the same entity, they are initialized with the same embedding (as in the first occurrence). Concretely, we first apply max-pooling over all nodes in each subgraph from the outputs of the final GAT layer; the max-pooling results are then used as inputs for a BiLSTM to produce the final representation of each subgraph.
4.2 Summary Decoder
Our summary decoder uses a single-layer unidirectional LSTM with a hidden state at each decoding step; it generates summary tokens recurrently by jointly attending to the input document and the graph.
Attending the Graph. At each decoding step, we compute a graph context vector with the attention mechanism Bahdanau et al. (2014):
where the attention parameters are also trainable. We omit bias terms for simplicity.
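As a reference point, additive attention in the style of Bahdanau et al. (2014) over the encoded nodes can be sketched as follows. All symbols are assumptions introduced for illustration: $\mathbf{s}_t$ is the decoder hidden state, $\hat{\mathbf{v}}_i$ the encoded node, $\mathbf{u}_0, \mathbf{W}_3, \mathbf{W}_4$ the trainable parameters, and $\mathbf{c}^{v}_{t}$ the graph context vector.

```latex
a^{v}_{i,t} = \operatorname{softmax}_{i}\big( \mathbf{u}_{0}^{\top} \tanh( \mathbf{W}_{3}\mathbf{s}_{t} + \mathbf{W}_{4}\hat{\mathbf{v}}_{i} ) \big),
\qquad
\mathbf{c}^{v}_{t} = \sum_{i} a^{v}_{i,t}\, \hat{\mathbf{v}}_{i}
```

The context vector is thus a convex combination of node representations, with weights recomputed at every decoding step.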
Attending the Document. Similarly, the document context is computed over input tokens by additionally considering the graph context:
Token Prediction. Graph and document context vectors, treated as salient content summarized from both sources, are concatenated with the decoder hidden state to produce the vocabulary distribution:
We use weight-sharing between the input embedding matrix and the output projection matrix to allow reusing linguistic knowledge as proposed by Paulus et al. (2018). We further add a copy mechanism similar to See et al. (2017), with the copy probability computed as:
where denotes the embedding for the token predicted at step .
Modified Hierarchical Attention for SegGraph. As mentioned in § 4.1, SegGraph captures content salience by modeling topic shift across paragraphs. We thus seek to leverage paragraph-level importance to redistribute the node attentions, e.g., giving more attention to nodes in important paragraphs. In particular, we utilize hierarchical attention Hsu et al. (2018), where we first calculate attention over subgraphs as done in Eq. 3 by replacing the node representation with the subgraph representation.
We then combine subgraph attentions with the previously calculated attentions for nodes in the subgraph using scalar multiplication and renormalization over all nodes in the input. This results in new attention weights, which are used to obtain the graph context vector as done in Eq. 3 for SegGraph.
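The rescaling step can be illustrated with plain lists. This is a minimal sketch; the function name `redistribute_attention` and the index-based `node_to_subgraph` mapping are hypothetical conveniences, not names from the paper.

```python
def redistribute_attention(node_attn, subgraph_attn, node_to_subgraph):
    """Rescale node attentions by the attention of their subgraph, then renormalize.

    node_attn[i]: attention weight on node i (over all nodes in the input).
    subgraph_attn[j]: attention weight on subgraph (paragraph) j.
    node_to_subgraph[i]: index of the subgraph that contains node i.
    """
    scaled = [a * subgraph_attn[node_to_subgraph[i]] for i, a in enumerate(node_attn)]
    total = sum(scaled)
    # Renormalize over all nodes so the weights sum to one again.
    return [s / total for s in scaled]
```

With `node_attn = [0.5, 0.25, 0.25]`, `subgraph_attn = [0.8, 0.2]`, and the first two nodes in the first paragraph, the nodes in the highly attended paragraph gain weight while the node in the other paragraph loses weight.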
4.3 Training Objectives
We first consider a maximum likelihood (ML) training objective that minimizes the following loss:
where the loss is computed over pairs of documents and references from the training set, with respect to the model parameters.
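Node Salience Labeling. In addition to modeling local characteristics of nodes, we further enhance the model by adding an objective to label node salience, e.g., whether the entities in a node are mentioned in the reference summaries. We introduce a soft mask layer over each node before it is passed into the graph encoder, to signify its salience. This layer, serving as an information gate, predicts a real number in $[0, 1]$ for each node $\mathbf{v}_{i}$ and multiplies it with the node representation, i.e., $m_{i}\mathbf{v}_{i}$. For node $\mathbf{v}_{i}$, the mask is calculated as $\hat{m}_{i}=\textrm{sigmoid}(\mathbf{u}_{2}\mathbf{v}_{i})$. During training, the gold-standard mask $m_{i}$ is set to $1$ if the node is salient, and to $0$ otherwise. We add the following objective over all nodes in the dataset $D$: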
where the normalizer is the number of nodes in the dataset. Finally, the ML training objective combines the sequence loss above with this node salience loss.
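A natural instantiation of the node salience objective, given a sigmoid mask prediction $\hat{m}_{i}$, binary gold labels $m_{i}$, and normalization by the number of nodes, is the binary cross-entropy. The sketch below is an assumption in that spirit; the names $\mathcal{L}_{mask}$, $\mathcal{L}_{seq}$, $\mathcal{L}_{ml}$, and $N_{v}$ are introduced here for illustration, as is the unweighted sum in the combined objective.

```latex
\mathcal{L}_{mask} = -\frac{1}{N_{v}} \sum_{v_{i} \in D} \Big[ m_{i} \log \hat{m}_{i} + (1 - m_{i}) \log (1 - \hat{m}_{i}) \Big],
\qquad
\mathcal{L}_{ml} = \mathcal{L}_{seq} + \mathcal{L}_{mask}
```

Under this form, the gate is pushed toward $1$ for nodes whose entities appear in the reference and toward $0$ for the rest, which in turn scales how much each node contributes to the graph encoder.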
Reinforcement Learning with Cloze
After maximum likelihood training, we further design a multiple choice cloze reward for a second stage of reinforcement learning (RL), leading the model to generate more faithful and informative summaries.
For RL, we use a self-critical policy gradient algorithm Rennie et al. (2017). During training, two summaries are generated: first, a sampled summary, obtained by sampling tokens based on the probability distribution at each decoding step; and second, a baseline summary, which greedily selects the tokens of the highest probability at each step. The objective of RL is defined based on the rewards of the two summaries, as follows:
Our reward function uses the combination of ROUGE and the multiple choice cloze score introduced below. For ROUGE, it considers F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L calculated against the reference summary.
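The self-critical objective has a compact standard form (Rennie et al., 2017): the log-likelihood of the sampled summary is weighted by how much its reward exceeds that of the greedy baseline. The sketch below assumes scalar rewards and per-token log-probabilities are already available; the function name `self_critical_loss` is hypothetical, and the reward itself (ROUGE combined with the cloze score) is left to the caller.

```python
def self_critical_loss(sample_logprobs, reward_sample, reward_greedy):
    """Self-critical policy gradient loss for one training example.

    sample_logprobs: log-probabilities of the tokens in the sampled summary.
    reward_sample: reward of the sampled summary.
    reward_greedy: reward of the greedily decoded baseline summary.
    """
    advantage = reward_sample - reward_greedy
    # Minimizing this raises the likelihood of samples that beat the baseline
    # and lowers the likelihood of samples that fall below it.
    return -advantage * sum(sample_logprobs)
```

Using the greedy output as the baseline means no separate value network is needed, and the model is only rewarded for sampled summaries that improve on what it would produce at test time.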
Multiple Choice Cloze Reward. Here, we present a novel multiple choice cloze reward to work with our knowledge graph and guide the summarization model towards improved awareness of entity interactions. We treat the system-generated summary as context. We provide a set of questions automatically constructed from the corresponding reference summary written by a human. We separately train a question answering (QA) model to address the questions by reading the context. Intuitively, if the system summary shares salient information with the reference, the QA model will assign high probability to the correct answers. We use the average probability of the correct answers as our cloze reward. Below, we give details on how to construct the questions and candidate answers with examples shown in Fig. 3.
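The reward computation itself reduces to an average, as the following sketch shows. The function name `cloze_reward` and the list-based input format are hypothetical, and returning zero when no questions exist is an assumption made here for robustness.

```python
def cloze_reward(answer_probs, gold_indices):
    """Average probability assigned to the correct answers.

    answer_probs[q]: probabilities the QA model gives to the candidates of question q,
                     obtained by reading the system summary as context.
    gold_indices[q]: position of the correct candidate for question q.
    """
    if not answer_probs:
        return 0.0  # assumption: no questions means no cloze signal
    total = sum(probs[gold] for probs, gold in zip(answer_probs, gold_indices))
    return total / len(answer_probs)
```

For two questions whose correct answers receive probabilities 0.7 and 0.5, the reward is 0.6. Because probabilities rather than hard decisions are averaged, the reward varies smoothly as the summary changes.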
Question Construction. We run the OpenIE tool on human-written summaries, retaining triples with arguments not longer than words. For each triple of subject, predicate, object, we create two types of questions: (1) argument pair questions, by removing the subject and object, and (2) predicate questions, by removing the predicate.
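The two question types can be illustrated with simple string replacement. This sketch assumes that the triple's subject, predicate, and object appear verbatim in the summary sentence, which OpenIE output does not always guarantee; the function name `make_cloze_questions` and the blank marker are hypothetical.

```python
def make_cloze_questions(sentence, triple, blank="____"):
    """Create an argument pair question and a predicate question from one triple."""
    subj, pred, obj = triple
    # (1) Argument pair question: remove both the subject and the object.
    arg_question = sentence.replace(subj, blank, 1).replace(obj, blank, 1)
    # (2) Predicate question: remove the predicate.
    pred_question = sentence.replace(pred, blank, 1)
    return {
        "argument_pair": (arg_question, (subj, obj)),
        "predicate": (pred_question, pred),
    }
```

For the sentence "The company acquired a small startup in 2019." and the triple ("The company", "acquired", "a small startup"), the argument pair question blanks out both entities at once, so answering it requires knowing how the two relate.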
Candidate Answer Construction. Because fill-in-the-blank style cloze may incorrectly penalize QA systems with answers paraphrased from the ground-truth, we opt for a multiple choice cloze. We construct three candidate answers in addition to the gold-standard from the salient context, which are summary-worthy sentences selected from the input. Specifically, we use greedy search to select the best combination of sentences that maximizes ROUGE-2 F1 with reference to human summary. We further include a sentence in the salient context if it has a ROUGE-L recall greater than when compared with any sentence in the reference.
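The greedy selection of the salient context can be sketched as follows. This is an illustration under assumptions: `rouge2_f1` is a simplified bigram-overlap F1 rather than the official ROUGE implementation, bigrams are counted across sentence boundaries after concatenation, and the additional ROUGE-L recall filter is omitted.

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1 based on clipped bigram overlap."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def greedy_salient_context(sentences, reference):
    """Greedily add the sentence that most improves the score; stop when none does.

    sentences: list of token lists from the input document.
    reference: token list of the human summary.
    Returns the indices of the selected sentences in document order.
    """
    selected, best = [], 0.0
    while True:
        scored = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            tokens = [t for j in sorted(selected + [i]) for t in sentences[j]]
            scored.append((rouge2_f1(tokens, reference), i))
        if not scored:
            break
        score, idx = max(scored)
        if score <= best:
            break
        selected.append(idx)
        best = score
    return sorted(selected)
```

Greedy search avoids enumerating all sentence subsets, at the cost of possibly missing the globally best combination.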
We first select OpenIE triples from the salient context and filter out those that have any overlapping content word with the correct answer. For argument pair questions, we create one candidate answer by swapping the subject and the object (e.g. candidate B as in Fig. 3) and two candidates by replacing the subject or the object with another argument of the same role extracted from the salient context (e.g. candidates C and D). If not enough answers are created, we further consider randomly selecting sentences from the input. For predicate questions, we use predicates in other triples from the context as candidate answers. Among all candidates, we select the three that are able to construct the most fluent questions using perplexity predicted by BERT Devlin et al. (2019).
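Distractor generation for argument pair questions can be sketched as below. This covers only the swap and replace steps; filtering out triples that share content words with the correct answer, the fallback to randomly selected sentences, and the perplexity-based ranking are omitted. The function name `argument_pair_candidates` is hypothetical.

```python
def argument_pair_candidates(gold, context_triples):
    """Generate distractors for an argument pair question.

    gold: the (subject, object) pair removed from the question.
    context_triples: (subject, predicate, object) triples from the salient context.
    """
    subj, obj = gold
    candidates = [(obj, subj)]  # swap the subject and the object
    for other_subj, _, other_obj in context_triples:
        # Replace the subject with another subject from the context.
        if other_subj != subj and (other_subj, obj) not in candidates:
            candidates.append((other_subj, obj))
        # Replace the object with another object from the context.
        if other_obj != obj and (subj, other_obj) not in candidates:
            candidates.append((subj, other_obj))
    return candidates
```

Replacements keep the role of each argument, so distractors stay plausible and the QA model has to rely on the actual relation rather than on surface cues.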
In case reference summaries do not yield OpenIE triples, we create additional entity pair questions. We remove two co-occurring entities from the summary and create three candidate answers in the same way as described above.
QA Model. We fine-tune RoBERTa Liu et al. (2019) to build our QA model. We use the salient context described above as the context for training. We then concatenate the context, the question, and each of the four candidate answers, and pass the final [CLS] representation through a fully-connected layer, from which the answer is predicted.
Experimental Setups
Datasets. We experiment with two popular summarization datasets with summaries containing multiple sentences: the New York Times annotated corpus (NYT) Sandhaus (2008) and the CNN/Daily Mail dataset (CNN/DM) Hermann et al. (2015). We follow the preprocessing steps and experimental setups from prior work Paulus et al. (2018); See et al. (2017) for both datasets. For NYT, the training, validation, and test sets contain , , and samples. For CNN/DM, the numbers are , , and .
To train our cloze QA model for NYT, we construct question-answer pairs from human-written summaries in the training set based on the method described in § 5. On CNN/DM, we collect question-answer samples from the training set. For both datasets, we set aside samples as a validation set and samples as a test set. Our QA model achieves an accuracy of on NYT and on CNN.
Training Details and Parameters. We use the base version of RoBERTa model to extract token features for all experiments. We truncate input articles to (NYT) and (CNN/DM) BPEs. We employ LSTM models with -dimensional hidden states for the document encoder ( each direction) and the decoder. For the residual connection of the graph encoder, we use heads, each with a dimension of . For DocGraph training and inference, we prune isolated graphs with fewer than three nodes to increase robustness and reduce redundancy. We set , on NYT and , on CNN/DM after tuning on the validation set. For both datasets, we set . More details about parameters and graph statistics are in the Appendices.
Baselines and Comparisons. For both datasets, we include an extractive baseline Lead-3. We further add the following abstractive models for comparison: (1) a pointer-generator model with coverage See et al. (2017) (PointGen+cov); (2) a deep reinforcement learning-based model Paulus et al. (2018) (DeepReinforce); (3) a bottom-up model Gehrmann et al. (2018) (BottomUp); (4) a deep communicating agents-based summarization model Celikyilmaz et al. (2018) (DCA). We also report results from fine-tuning the BART model Lewis et al. (2019). In Lewis et al. (2019), fine-tuning is only performed on CNN/Daily Mail. We apply the same method for NYT.
For NYT, we add results by SENECA model Sharma et al. (2019) from our prior work, which previously achieved the best ROUGE-2.
On CNN/Daily Mail, we include comparisons of a two-stage fine-tuned model (first on an extractor, then on an abstractor) with BERT Liu and Lapata (2019) (BertSumExtAbs), and a unified pretrained language model for generation Dong et al. (2019) (UniLM).
In addition to ASGARD-doc and ASGARD-seg, which are trained with an ML objective, we report results trained with ROUGE as the reward, and with an additional cloze reward. Lastly, we consider a variant NoGraph by ablating the graph encoder.
Results
Results on NYT. As displayed in Table 1, our ASGARD-seg model trained with ROUGE and cloze rewards achieves better ROUGE scores Lin and Hovy (2003) than all other comparisons except the fine-tuned BART. However, our ASGARD-seg’s ROUGE-L score is comparable to BART. This indicates the effectiveness of our graph-augmented summarization framework.
Moreover, both our ASGARD-doc and ASGARD-seg models yield significantly higher ROUGE scores than the variant without the graph encoder (NoGraph). This demonstrates the benefit of using structured representation to encode entity interactions. Furthermore, both ASGARD-doc and ASGARD-seg with cloze reward obtain significantly higher scores compared to the models trained with ROUGE reward only. This signifies that our multi-choice cloze reward can guide better semantic interpretation of content, leading to the generation of more informative summaries. We also find that ASGARD-seg outperforms ASGARD-doc, indicating that ASGARD-seg better captures topic drift through multiple paragraphs.
Results on CNN/DM. We observe similar trends on the CNN/DM articles as shown in Table 2. Noticeably, ASGARD-doc trained with the combined ROUGE and cloze reward produces better ROUGE scores than BERTSumExtAbs and UniLM, which are carefully fine-tuned from large pretrained language models, and the numbers are also comparable to the fine-tuned BART.
Evaluation with Cloze Test. We further evaluate model-generated summaries with our proposed cloze test. Here, we report two scores in Fig. 4: the average probability of the correct answers output by our QA model, and its prediction accuracy. We first calculate one score per summary, then take the average over all summaries. We can see that our models with graph encoders perform better than the variant without it.
7.2 Human Evaluation
We further conduct human evaluation to analyze the informativeness and fluency of the generated summaries, as well as to investigate the unfaithful errors made by different models. We sample 100 articles from the NYT test set and hire three native or fluent speakers of English to rate summaries generated by our two systems, NoGraph and ASGARD-seg, along with outputs by BART and human-written summaries (presented in random order). After reading the articles, each judge scores summaries on a Likert scale from 1 (worst) to 5 (best) on informativeness (whether the summary covers important information from the input) and fluency (whether the summary is grammatically correct).
We consider three types of unfaithful errors: (i) hallucination error: creating content not present in the input, (ii) out-of-context error: generating facts without including required context or within incorrect context, and (iii) deletion or substitution error: mistakenly deleting or substituting subjects, objects, or clauses. We ask the annotators to label, for each type, whether a summary contains such errors. Detailed guidelines are in the Appendices.
From Table 3, we can see that our ASGARD-seg model obtains better scores in informativeness and fluency, compared to the variant without the graph encoder. This indicates the effectiveness of leveraging knowledge graph representation. Sample output summaries by our models can be found in Fig. 5. Meanwhile, the fine-tuned BART model produces outputs with informativeness and fluency similar to those of human-constructed summaries, suggesting a future direction of building our model on top of a large pretrained encoder-decoder model.
For unfaithful errors, we report the percentage of errors calculated by majority voting (i.e., more than one annotator votes as incorrect). First, we find that our ASGARD-seg model has an error pattern comparable to that of human summaries. Specifically, our graph-enhanced model produces significantly fewer out-of-context and deletion or substitution errors than the model without graph information. This implies that knowledge graph-enhanced models can improve summary faithfulness.
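The majority-vote aggregation can be made concrete with a short sketch. The function name `majority_error_rate` and the 0/1 label encoding (1 meaning the annotator marked an error) are assumptions for illustration.

```python
def majority_error_rate(votes):
    """Percentage of summaries flagged as erroneous by a majority of annotators.

    votes[s]: list of 0/1 labels for summary s, one per annotator (1 = error present).
    """
    flagged = sum(1 for labels in votes if sum(labels) > len(labels) / 2)
    return 100.0 * flagged / len(votes)
```

With three annotators, a summary counts as erroneous when at least two of them mark the error, which matches the "more than one annotator" criterion above.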
Interestingly, human-written summaries are also discerned to contain a nontrivial amount of hallucination errors. After inspection, we find that humans tend to leverage world knowledge to include content that is not covered by the articles. For instance, for an article discussing events in “Boston”, the human writer may describe them as happening in “Massachusetts” in the summary.
7.3 Analyzing Automatic Metrics and Summary Errors
We further plot the distributions of automatic evaluation scores regarding the three types of unfaithful errors based on majority voting in Fig. 6. First, summaries with out-of-context and deletion or substitution errors receive lower cloze and ROUGE scores overall.
Nevertheless, with regard to hallucination errors, we do not see such a pattern; there is even a slightly reversed relation with both cloze scores and ROUGE scores, wherein summaries with more hallucination errors tend to score higher. This echoes our previous observation that human summaries can be hallucinatory too, where world knowledge is used for writing the summaries. During human evaluation, we do not ask human judges to distinguish the source of hallucination errors, i.e. from world knowledge or out of fabrication, since this requires significant domain knowledge.
Furthermore, we find a weak correlation between the three variants of ROUGE scores and three types of errors, e.g., the minimum and the maximum values of Pearson’s are and . This suggests that new metrics should be designed to better gauge summary quality. We plan to study this direction in future work.
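For reference, Pearson's correlation between a metric score and a binary error label can be computed as in the sketch below. The function name `pearson_r` is hypothetical, and the inputs are assumed to have nonzero variance (otherwise the denominator is zero).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

Here `xs` would hold per-summary ROUGE scores and `ys` the majority-voted error labels; values near zero indicate that the metric carries little information about whether the error is present.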
Conclusion
We presented a novel knowledge graph-augmented abstractive summarization framework, along with a novel multiple choice cloze reward for reinforcement learning. Our models capture both local characteristics and global interactions of entities from the input, thus generating summaries of higher quality. In tandem with the graph representation, our cloze reward further improves summary content. Human evaluation further confirms that our graph-augmented models trained with the cloze reward produce more informative summaries and significantly reduce unfaithful errors.
Acknowledgements
This research is supported in part by National Science Foundation through Grant IIS-1813341, and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We thank the anonymous reviewers for their suggestions.
References
Appendix A Appendices
Statistics of Knowledge Graphs. We show the statistics of knowledge graphs on two datasets in Table 4. On each dataset, we construct a large graph with abundant relations for each article. Note that on CNN/DM we have more arguments but fewer predicates in a document than those on NYT. This indicates CNN/DM has fewer coreferred entities.
Training Details. We utilize Adam Kingma and Ba (2015) with a gradient clipping of and a batch size of for all models. During ML training, a learning rate of is used; during RL stage, it is reduced to Paulus et al. (2018).
We use the base version of BERT model Devlin et al. (2019) to select candidate answers and we fine-tune the base version of RoBERTa model Liu et al. (2019) to build our QA model. We take pretrained models from Wolf et al. (2019).
A.2 Human Evaluation Guideline
In our human evaluation, each human annotator is presented with 100 news articles. The annotators are asked to evaluate four summaries (in random order) for each article on two aspects (informativeness and fluency) on a scale of 1 to 5 (1 being very poor and 5 being very good). Furthermore, for unfaithfulness, we define three types of unfaithful errors and ask annotators to label whether summaries contain any type of error. Instructions in Table 5 are given to human judges.
Informativeness: Whether the summary provides enough and necessary content coverage from the input article.
Fluency: Whether the summary is free of obvious grammatically incorrect sentences (e.g., fragments, missing components) that make the text difficult to read.
Faithfulness: Whether the summary accords with the facts expressed in the source.