Enhancing Factual Consistency of Abstractive Summarization

Chenguang Zhu, William Hinthorn, Ruochen Xu, Qingkai Zeng, Michael Zeng, Xuedong Huang, Meng Jiang

Introduction

Text summarization models aim to produce an abridged version of long text while preserving salient information. Abstractive summarization is a type of such models that can freely generate summaries, with no constraint on the words or phrases used. This format is closer to human-edited summaries and is both flexible and informative. Thus, there are numerous approaches to produce abstractive summaries (See et al., 2017; Paulus et al., 2017; Dong et al., 2019; Gehrmann et al., 2018).

However, one prominent issue with abstractive summarization is factual inconsistency. It refers to the hallucination phenomenon in which the summary distorts or fabricates facts in the article. Recent studies show that up to 30% of the summaries generated by abstractive models contain such factual inconsistencies (Kryściński et al., 2019b; Falke et al., 2019), raising concerns about the credibility and usability of these systems. Table 1 shows an example article and excerpts of generated summaries. As shown, the article mentions that Real Madrid ace Gareth Bale scored twice and Cristiano Ronaldo scored five goals. However, both BottomUp (Gehrmann et al., 2018) and Seq2Seq wrongly state that Bale scored five goals. In comparison, our model FASum generates a summary that correctly reflects the facts in the article. As shown in Section 4.6.1, our model achieves higher factual consistency not merely by copying more from the article.

On the other hand, most existing abstractive summarization models apply a conditional language model to focus on the token-level accuracy of summaries, while neglecting semantic-level consistency between the summary and article. Therefore, the generated summaries are often high in token-level metrics like ROUGE (Lin, 2004) but lack factual consistency. In view of this, we argue that a robust abstractive summarization system must be equipped with factual knowledge to accurately summarize the article.

In this paper, we represent facts in the form of knowledge graphs. Although there are numerous efforts in building commonly applicable knowledge graphs such as ConceptNet (Speer et al., 2017), we find that these tools are more useful in conferring commonsense knowledge. In abstractive summarization for contents like news articles, many entities and relations are previously unseen. Plus, our goal is to produce summaries that do not conflict with the facts in the article. Thus, we propose to extract factual knowledge from the article itself.

We employ the information extraction (IE) tool OpenIE (Angeli et al., 2015) to extract facts from the article in the form of relational tuples: (subject, relation, object). This graph contains the facts in the article and is integrated in the summary generation process.

Then, we use a graph attention network (Veličković et al., 2017) to obtain the representation of each node, and fuse that into a transformer-based encoder-decoder architecture via attention. We denote this model as the Fact-Aware Summarization model, FASum.

In addition, to be generally applicable to all existing summarization systems, we propose a Factual Corrector model, FC, to help improve the factual consistency of any given summary. We frame the correction process as a seq2seq problem: the input is the original summary and the article, and the output is the corrected summary. FC has the same architecture as UniLM (Dong et al., 2019) and is initialized with weights from RoBERTa-Large (Liu et al., 2019). We finetune it as a denoising autoencoder. The training data is synthetically generated by randomly replacing entities in the ground-truth summary with wrong ones from the article. As shown in Table 2, FC makes three corrections, replacing the original wrong entities, which appear elsewhere in the article, with the right ones.

In the experiments, we leverage an independently trained BERT-based (Devlin et al., 2018) factual consistency evaluator (Kryściński et al., 2019b). Results show that on CNN/DailyMail, FASum obtains a 0.6% higher factual consistency score than UniLM (Dong et al., 2019) and a 3.9% higher score than BottomUp (Gehrmann et al., 2018). Moreover, after correction by FC, the factual score of summaries from BottomUp increases by 1.4% on CNN/DailyMail and 0.9% on XSum, and the score of summaries from TConvS2S increases by 3.1% on XSum. We also conduct human evaluation to verify the effectiveness of our models.

We further propose an easy-to-compute model-free metric, relation matching rate (RMR), to evaluate factual consistency given a summary and the article. This metric employs the extracted relations and does not require human-labelled summaries. Under this metric, we show that our models can help enhance the factual consistency of summaries.

Related Work

Abstractive text summarization has been intensively studied in recent literature. Rush et al. (2015) introduces an attention-based seq2seq model for abstractive sentence summarization. See et al. (2017) uses a copy-generate mechanism that can both produce words from the vocabulary via a generator and copy words from the article via a pointer. Paulus et al. (2017) leverages reinforcement learning to improve summarization quality. Gehrmann et al. (2018) uses a content selector to over-determine phrases in source documents, which helps constrain the model to likely phrases. Zhu et al. (2019) defines a pretraining scheme for summarization and produces a zero-shot abstractive summarization model. Dong et al. (2019) employs different masking techniques for both NLU and NLG tasks, resulting in the UniLM model. Lewis et al. (2019) employs denoising techniques to help generation tasks including summarization.

2.2 Fact-Aware Summarization

Entailment models have been used to evaluate and enhance factual consistency of summarization. Li et al. (2018) co-trains summarization and entailment and employs an entailment-aware decoder. Falke et al. (2019) proposes using off-the-shelf entailment models to rerank candidate summary sentences to boost factual consistency.

Zhang et al. (2019b) employs descriptor vectors to improve factual consistency in medical summarization. Cao et al. (2018) extracts relational information from the article and maps it to a sequence as an additional input to the encoder. Gunel et al. (2019) employs an entity-aware transformer structure for knowledge integration, and Matsumaru et al. (2020) improves factual consistency of generated headlines by filtering out training data with more factual errors. In comparison, our model utilizes the knowledge graph extracted from the article and fuses it into the generated text via neural graph computation.

To correct factual errors, Dong et al. (2020) uses pre-trained NLU models to rectify one or more wrong entities in the summary. Concurrent to our work, Cao et al. (2020) employs the generation model BART (Lewis et al., 2019) to produce corrected summaries.

Several approaches have been proposed to evaluate a summary’s factual consistency (Kryściński et al., 2019a; Goodrich et al., 2019; Maynez et al., 2020). Zhang et al. (2019a) employs BERT to compute similarity between pairs of words in the summary and article. Wang et al. (2020) and Durmus et al. (2020) use question answering accuracy to measure factual consistency. Kryściński et al. (2019b) applies various transformations to the summary to produce training data for a BERT-based classification model, FactCC, which shows a high correlation with human metrics. Therefore, we use FactCC as the factual evaluator in this paper.

Model

3.2 Fact-Aware Summarizer

We propose the Fact-Aware abstractive Summarizer, FASum. It utilizes the seq2seq architecture built upon transformers (Vaswani et al., 2017). In detail, the encoder produces contextualized embeddings of the article and the decoder attends to the encoder’s output to generate the summary.

To make the summarization model fact-aware, we extract, represent and integrate knowledge from the source article into the summary generation process, which is described in the following. The overall architecture of FASum is shown in Figure 1.

To extract important entity-relation information from the article, we employ the Stanford OpenIE tool (Angeli et al., 2015). The extracted knowledge is a list of tuples. Each tuple contains a subject (S), a relation (R) and an object (O), each as a segment of text from the article. In the experiments, there are on average 165.4 tuples extracted per article in CNN/DailyMail (Hermann et al., 2015) and 84.5 tuples in XSum (Narayan et al., 2018).

3.2.2 Knowledge Representation

We construct a knowledge graph to represent the information extracted by OpenIE. We apply the Levi transformation (Levi, 1942) to treat each entity and relation equally. In detail, for each tuple $(s,r,o)$, we create nodes $s$, $r$ and $o$, and add the edges $(s,r)$ and $(r,o)$. In this way, we obtain an undirected knowledge graph $G=(V,E)$, where each node $v\in V$ is associated with text $t(v)$. During training, this graph $G$ is constructed for each batch individually, i.e. there is no single large shared graph. One benefit is that the model can handle unseen entities and relations during inference.
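The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_levi_graph` is hypothetical, and it assumes that entity nodes with identical text are merged while each tuple gets its own relation node (the text does not say how duplicates are handled).

```python
def build_levi_graph(tuples):
    """Turn (subject, relation, object) tuples into an undirected graph
    in which subjects, relations and objects are all nodes.

    Assumptions (not specified in the paper): entities with the same text
    share one node; every tuple creates a fresh relation node.
    Returns (texts, edges), where texts[i] is t(v_i) and edges is a set of
    frozenset({i, j}) pairs.
    """
    texts = []        # node id -> associated text t(v)
    entity_ids = {}   # entity text -> node id
    edges = set()

    def entity_node(text):
        if text not in entity_ids:
            entity_ids[text] = len(texts)
            texts.append(text)
        return entity_ids[text]

    for s, r, o in tuples:
        si = entity_node(s)
        oi = entity_node(o)
        ri = len(texts)                  # new relation node for this tuple
        texts.append(r)
        edges.add(frozenset((si, ri)))   # edge between s and r
        edges.add(frozenset((ri, oi)))   # edge between r and o
    return texts, edges


texts, edges = build_levi_graph([
    ("Bale", "scored", "twice"),
    ("Ronaldo", "scored", "five goals"),
])
```

Because the graph is rebuilt from the tuples of each input, nothing in this sketch depends on a fixed entity vocabulary, which matches the stated benefit of handling unseen entities at inference time.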

We then employ a graph attention network (Veličković et al., 2017) to obtain embedding ${\boldsymbol{e}}_{j}$ for each node $v_{j}$ . The initial embedding of $v_{j}$ is given by the last hidden state of a bidirectional LSTM applied to $t(v_{j})$ . In the experiment, we employ 2 graph attention layers.

3.2.3 Knowledge Integration

The knowledge graph embedding is obtained in parallel with the encoder. Then, apart from the canonical cross-attention over the encoder’s outputs, each decoder block also computes cross-attention over the knowledge graph nodes’ embeddings. Here, the keys and values are $\{{\boldsymbol{e}}_{j}\}_{j=1}^{|V|}$, the final embeddings of the graph nodes, and the queries are $\{{\boldsymbol{s}}_{i}\}_{i=1}^{t}$, the decoder block’s representation of the first $t$ generated tokens.
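A standard dot-product attention consistent with these symbols can be written as follows. The score function and the names $\beta_{ij}$, $\alpha_{ij}$ and ${\boldsymbol{u}}_{i}$ are assumptions for illustration, and the sketch further assumes that ${\boldsymbol{s}}_{i}$ and ${\boldsymbol{e}}_{j}$ share the same dimension (otherwise a learned projection would be needed):

```latex
\beta_{ij} = {\boldsymbol{s}}_{i}^{T}{\boldsymbol{e}}_{j}, \qquad
\alpha_{ij} = \frac{\exp(\beta_{ij})}{\sum_{k=1}^{|V|}\exp(\beta_{ik})}, \qquad
{\boldsymbol{u}}_{i} = \sum_{j=1}^{|V|}\alpha_{ij}\,{\boldsymbol{e}}_{j}
```

Under this reading, ${\boldsymbol{u}}_{i}$ is the graph-aware context vector that the decoder block combines with ${\boldsymbol{s}}_{i}$ for the $i$-th position.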

3.2.4 Summary Generation

We denote the final output of the decoder as ${\boldsymbol{z}}_{1},...,{\boldsymbol{z}}_{t}$. To produce the next token $y_{t+1}$, we employ a linear layer ${\boldsymbol{W}}$ to project ${\boldsymbol{z}}_{t}$ to a vector of the same size as the dictionary. The predicted distribution of $y_{t+1}$ is then obtained by a softmax over this vector: ${\boldsymbol{p}}^{t+1}=\mbox{softmax}({\boldsymbol{W}}{\boldsymbol{z}}_{t})$.

During training, we use cross entropy as the loss function $\mathcal{L}(\theta)=-\sum_{t=1}^{n}{\boldsymbol{y}}_{t}^{T}\log({\boldsymbol{p}}^{t})$, where ${\boldsymbol{y}}_{t}$ is the one-hot vector for the $t$-th token, and $\theta$ represents the parameters in the network.
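The softmax and cross-entropy steps can be illustrated with a small pure-Python sketch. It assumes the projected vectors ${\boldsymbol{W}}{\boldsymbol{z}}_{t}$ are already available as lists of logits; the function names are hypothetical. Since ${\boldsymbol{y}}_{t}$ is one-hot, ${\boldsymbol{y}}_{t}^{T}\log({\boldsymbol{p}}^{t})$ reduces to the log-probability of the target token.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_step, target_ids):
    """Sum over time steps of -log p^t[target_t].

    logits_per_step[t] stands in for the projected decoder output at step t;
    target_ids[t] is the index of the reference token at step t.
    """
    loss = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        p = softmax(logits)
        loss -= math.log(p[target])
    return loss


# Two-word vocabulary, one step, uniform prediction: loss is log(2).
example_loss = cross_entropy([[0.0, 0.0]], [1])
```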

3.3 Fact Corrector

To better utilize existing summarization systems, we propose a Factual Corrector model, FC, to improve the factual consistency of any summary generated by abstractive systems. FC frames the correction process as a seq2seq problem: given an article and a candidate summary, the model generates a corrected summary with minimal changes to be more factually consistent with the article.

While FASum has a graph attention module in the transformer, preventing direct adaptation from pre-trained models, the FC model architecture adopts the design of the pre-trained model UniLM (Dong et al., 2019). We initialize the model weights from RoBERTa-Large (Liu et al., 2019). The finetuning process is similar to training a denoising autoencoder. We use back-translation and entity swap for synthetic data generation. For example, an entity in the ground-truth summary is randomly replaced with another entity of the same type from the article. This modified summary and the article are sent to the corrector to recover the original summary. In the experiments, we generated 3.0M seq2seq data samples in CNN/DailyMail and 551.0K samples in XSum for finetuning. We take 10K samples in each dataset for validation and use the rest for training.
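The entity-swap step can be sketched as below. This is an illustrative approximation only: the function name `entity_swap` is hypothetical, entity mentions and their types are assumed to come from an external named-entity recognizer that is not shown, and the sketch swaps a single mention per summary.

```python
import random


def entity_swap(summary, summary_entities, article_entities, rng=random):
    """Corrupt a ground-truth summary by replacing one entity with a
    different entity of the same type taken from the article.

    summary_entities / article_entities: lists of (text, type) pairs,
    assumed to be produced by an external NER tool.
    Returns the corrupted summary, or None if no valid swap exists.
    """
    candidates = []
    for text, etype in summary_entities:
        replacements = [t for t, ty in article_entities
                        if ty == etype and t != text]
        if text in summary and replacements:
            candidates.append((text, replacements))
    if not candidates:
        return None
    text, replacements = rng.choice(candidates)
    # Replace only the first occurrence so the change stays minimal.
    return summary.replace(text, rng.choice(replacements), 1)


noisy = entity_swap(
    "Ronaldo scored five goals",
    [("Ronaldo", "PERSON")],
    [("Ronaldo", "PERSON"), ("Bale", "PERSON"), ("Real Madrid", "ORG")],
    rng=random.Random(0),
)
# The training pair would then be (noisy summary + article) -> original summary.
```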

During inference, the candidate summary from any abstractive summarization system is concatenated with the article and sent to FC, which produces the corrected summary.

Experiments

We evaluate our model on the benchmark summarization datasets CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018). They contain 312K and 227K news articles and human-edited summaries respectively, covering different topics and various summarization styles.

4.2 Implementation Details

We use Huggingface’s (Wolf et al., 2019) implementation of the transformer in BART (Lewis et al., 2019). We also inherit their provided beam search hyperparameters for CNN/DailyMail and XSum. The minimum summary length is 56 and 11 for CNN/DailyMail and XSum, respectively. The number of beams is 4 for CNN/DailyMail and 6 for XSum.

In FASum, both the encoder and decoder have 10 layers with 10 attention heads each. Teacher forcing is used in training. We use Adam (Kingma and Ba, 2014) as the optimizer with a learning rate of 2e-4.

The bi-LSTM to produce the initial embedding of graph nodes has a hidden state of size 64 and the graph attention network (GAT) has 8 heads and a hidden state of size 50. The dropout rate is 0.6 in GAT and 0.1 elsewhere.

We use the subword tokenizer SentencePiece (Kudo and Richardson, 2018). The dictionary is shared across all the datasets. The vocabulary has a size of 32K and an embedding dimension of 720.

The correction model FC follows the UniLM (Dong et al., 2019) architecture and is initialized with weights from RoBERTa-Large (Liu et al., 2019). We finetune the model for 5 epochs with a learning rate of 1e-5, using linear warmup over the first one-fifth of the total steps followed by linear decay. During decoding, it uses beam search with a width of 2 and blocks tri-gram duplicates. The batch size during finetuning is 24. More details are presented in the Appendix.
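The warmup-then-decay schedule can be sketched as a plain function of the training step. The function name `lr_at_step` is hypothetical, and the sketch assumes that the rate starts at zero, peaks at 1e-5 at the end of warmup, and decays linearly to zero at the final step; the endpoints are not stated in the text.

```python
def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_fraction=0.2):
    """Linear warmup over the first `warmup_fraction` of training,
    then linear decay to zero (the zero endpoints are an assumption)."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# With 1000 total steps the peak is reached at step 200.
schedule = [lr_at_step(s, 1000) for s in (0, 100, 200, 600, 1000)]
```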

4.3 Metrics

To evaluate factual consistency, we re-implemented and trained the FactCC model (Kryściński et al., 2019b). The model outputs a score between 0 and 1, where a higher score indicates better consistency between the input article and summary. The training of FactCC is independent of our summarizer so no parameters are shared. More details are in the Appendix.

We also employ the standard ROUGE-1, ROUGE-2 and ROUGE-L metrics (Lin, 2004) to measure summary quality. These three metrics evaluate the accuracy on unigrams, bigrams and the longest common subsequence, respectively. We report the F1 ROUGE scores in all experiments. The ROUGE-L score on the validation set is used to pick the best model for both FASum and FC.

4.4 Baselines

The following abstractive summarization models are selected as baseline systems. TConvS2S (Narayan et al., 2018) is based on topic modeling and convolutional neural networks. BottomUp (Gehrmann et al., 2018) uses a bottom-up approach to generate summaries. UniLM (Dong et al., 2019) utilizes large-scale pretraining to produce state-of-the-art abstractive summaries. We train the baseline models when the predictions are not available in their open-source repositories.

4.5 Results

As shown in Table 3, our model FASum outperforms all baseline systems in factual consistency scores on CNN/DailyMail and is only behind UniLM on XSum. (We have put the code and the generated summaries of all models in the supplementary materials.) On CNN/DailyMail, FASum is 0.6% higher than UniLM and 3.9% higher than BottomUp in factual score. A statistical test shows that the lead is statistically significant, with a p-value smaller than 0.05. The higher factual score of UniLM among the baselines corroborates the finding in Maynez et al. (2020) that pre-trained models exhibit better factuality. Nevertheless, our proposed knowledge graph component helps the train-from-scratch FASum model excel in factual consistency.

We conduct an ablation study that removes the knowledge graph component from FASum, resulting in the Seq2Seq model. As shown, there is a clear drop in factual score: 2.8% on CNN/DailyMail and 0.9% on XSum. This indicates that the constructed knowledge graph helps increase the factual correctness of the generated summaries.

It’s worth noticing that the ROUGE metric does not always reflect the factual consistency, sometimes even showing an inverse relationship, a phenomenon observed in multiple studies (Kryściński et al., 2019a; Maynez et al., 2020). For instance, although BottomUp has 0.69 higher ROUGE-1 points than FASum in CNN/DailyMail, there are many factual errors in its summaries, as shown in the human evaluation. On the other hand, to make sure the improved factual correctness of our models is not achieved by simply copying insignificant information from the article, we conduct analysis on abstractiveness in Section 4.6.1 and human evaluation in Section 4.6.3.

Furthermore, the correction model FC can effectively enhance the factual consistency of summaries generated by various baseline models, especially when the original summary has a relatively low factual consistency. For instance, on CNN/DailyMail, the factual score of BottomUp increases by 1.4% after correction. On XSum, after correction, the factual scores increase by 0.2% to 3.1% for all baseline models. Interestingly, FC can also boost the factual consistency of our FASum model. Moreover, the correction has a rather small impact on the ROUGE score, and it can even improve the ROUGE scores of most models on the XSum dataset.

We check and find that FC only makes modest modifications necessary to the original summaries. For instance, FC modifies 48.3% of summaries generated by BottomUp in CNN/DailyMail. These modified summaries contain very few changed tokens: 94.4% of these corrected summaries contain 3 or fewer new tokens, while the summaries have on average 48.3 tokens.

In the appendix of supplementary materials, we show several examples of summaries given by FASum and corrected by FC to demonstrate the improved factual consistency of summarization.

4.6 Insights

Durmus et al. (2020) show that less abstractive summaries are more factually consistent with the article. Therefore, we inspect whether our models boost factual consistency simply by copying more portions of the article.

On XSum’s test set, we compute the ratio of novel n-grams in summaries, i.e. n-grams that do not appear in the article. Figure 2 shows that FASum achieves the ratio of novel n-grams closest to that of the reference summaries, and a higher ratio than BottomUp and UniLM. This demonstrates that FASum can produce highly abstractive summaries while ensuring factual consistency.
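The novel n-gram ratio can be computed as in the sketch below. It assumes whitespace-tokenized input and counts every n-gram occurrence in the summary (not unique types); the tokenization and counting convention used for Figure 2 are not specified, so both are assumptions.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(summary_tokens, article_tokens, n):
    """Fraction of summary n-grams that never occur in the article."""
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    article_set = set(ngrams(article_tokens, n))
    novel = sum(1 for g in summary_ngrams if g not in article_set)
    return novel / len(summary_ngrams)


article = "bale scored twice for real madrid".split()
summary = "bale scored for madrid".split()
unigram_ratio = novel_ngram_ratio(summary, article, 1)  # every word is copied
bigram_ratio = novel_ngram_ratio(summary, article, 2)   # two of three bigrams are new
```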

4.6.2 Relation Matching Rate

While the factual consistency evaluator FactCC (Kryściński et al., 2019b) is based on pre-trained models, it requires finetuning on articles and labelled summaries. Furthermore, we empirically find that the performance of FactCC degrades when it is finetuned on one summary dataset and used to evaluate models on another dataset.

Therefore, in this subsection, we design an easy-to-compute model-free factual consistency metric, which can be used when ground-truth summaries are not available.

As the relational tuples in the knowledge graph capture the factual information in the text, we compute the precision of extracted tuples in the summary. In detail, suppose the set of the relational tuples in the summary is $R_{s}=\{(s_{i},r_{i},o_{i})\}$ , and the set of the relational tuples in the article is $R_{a}$ . Then, each tuple in $R_{s}$ falls into one of the following three categories:

Correct hit (C): $(s_{i},r_{i},o_{i})\in R_{a}$ ;

Wrong hit (W): $(s_{i},r_{i},o_{i})\not\in R_{a}$ , but $\exists o^{\prime}\neq o_{i},(s_{i},r_{i},o^{\prime})\in R_{a}$ , or $\exists s^{\prime}\neq s_{i},(s^{\prime},r_{i},o_{i})\in R_{a}$ ;

We define two kinds of relation matching rate (RMR) to measure the ratio of correct hits:
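Taking the third category to be the remaining tuples of $R_{s}$, i.e. those that are neither correct nor wrong hits (labelled $M$, for miss, here as an assumed name), and writing $C$, $W$ and $M$ for the number of tuples in each category, one pair of definitions consistent with the description is the following. The factor of 100 is assumed from the percentage values reported for this metric:

```latex
\mbox{RMR}_{1} = 100 \times \frac{C}{C+W}, \qquad
\mbox{RMR}_{2} = 100 \times \frac{C}{C+W+M}
```

Under this reading, $\mbox{RMR}_{1}$ ignores summary tuples that have no counterpart in the article, while $\mbox{RMR}_{2}$ penalizes them.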

Note that this metric is different from the ratio of overlapping tuples proposed in Goodrich et al. (2019), where the ratio is computed between the ground-truth and the candidate summary. Since even the ground-truth summary may not cover all the salient information in the article, we choose to compare the knowledge tuples in the candidate summary directly against those in the article. An additional advantage of our metric is that it does not require ground-truth summaries to be available.
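The categorization and the two rates can be sketched in Python as follows. Several points are assumptions rather than details from the text: tuples are compared by exact string match, the third category is taken to be every summary tuple that is neither a correct nor a wrong hit, $\mbox{RMR}_{1}$ is computed as $100\times C/(C+W)$ and $\mbox{RMR}_{2}$ as $100\times C/(C+W+M)$, and the function name is hypothetical.

```python
def relation_matching_rates(summary_tuples, article_tuples):
    """Classify each summary tuple as correct, wrong or miss, and return
    (RMR_1, RMR_2) as percentages. Exact string matching is assumed."""
    article = set(article_tuples)
    subj_rel = {(s, r) for s, r, _ in article}   # (s, r) pairs seen in article
    rel_obj = {(r, o) for _, r, o in article}    # (r, o) pairs seen in article

    correct = wrong = miss = 0
    for s, r, o in set(summary_tuples):
        if (s, r, o) in article:
            correct += 1
        elif (s, r) in subj_rel or (r, o) in rel_obj:
            # Same subject and relation with another object,
            # or same relation and object with another subject.
            wrong += 1
        else:
            miss += 1

    hits = correct + wrong
    total = hits + miss
    rmr1 = 100.0 * correct / hits if hits else 0.0
    rmr2 = 100.0 * correct / total if total else 0.0
    return rmr1, rmr2


article_tuples = [("Bale", "scored", "twice"),
                  ("Ronaldo", "scored", "five goals")]
summary_tuples = [("Ronaldo", "scored", "five goals"),   # correct hit
                  ("Bale", "scored", "five goals"),      # wrong hit
                  ("Madrid", "won", "the match")]        # neither
rmr1, rmr2 = relation_matching_rates(summary_tuples, article_tuples)
```

In practice OpenIE output rarely matches verbatim across a summary and an article, so a real implementation would likely need some normalization of the tuple text before comparison.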

Table 4 displays the results for this metric on CNN/DailyMail’s test set. As shown, FASum achieves the highest precision of correct hits under both measures. There is a considerable boost from the knowledge graph (FASum vs Seq2Seq): 11.2% in $\mbox{RMR}_{1}$ and 13.8% in $\mbox{RMR}_{2}$. The correction from the FC model further improves the metric for both FASum and UniLM.

We also compute factual consistency via natural language inference models following Maynez et al. (2020). We use the BERT-Large model finetuned on the MNLI dataset (Williams et al., 2018) provided by fairseq (Ott et al., 2019). The model predicts the relationship between the article and summary to be one of the following: entailment, neutral and contradiction. We report the ratio of contradiction as predicted by the model in Table 4. As shown, FASum achieves the lowest ratio, and FC helps further reduce conflicting facts in generated summaries.

4.6.3 Human Evaluation

We conduct human evaluation on the factual consistency and informativeness of summaries. We randomly sample 100 articles from the test set of CNN/DailyMail. Then, each article and summary pair is labelled by 3 people from Amazon Mechanical Turk (AMT) to evaluate the factual consistency and informativeness. Each labeller gives a score in each category between 1 and 3 (3 being perfect). The kappa-ratio between reviewer scores is 0.32 for factual consistency and 0.28 for informativeness.

Here, factual consistency indicates whether the summary’s content is faithful with respect to the article; informativeness indicates how well the summary covers the salient information in the article.

As shown in Table 5, our model FASum achieves the highest factual consistency score, higher than UniLM and considerably outperforming BottomUp. We conduct a statistical test and find that our model’s improvement over UniLM is statistically significant, with a p-value smaller than 0.05 under a paired t-test. In terms of informativeness, our model is comparable with UniLM and outperforms BottomUp. Finally, without the knowledge graph component, the Seq2Seq model generates summaries with lower factual consistency and lower informativeness.

To assess the effectiveness of the correction model FC, we conduct a human evaluation of side-by-side summaries. In CNN/DailyMail, we randomly sample 100 articles where the summaries generated by BottomUp are modified by FC. Three labelers are asked whether the original or the corrected version is factually more correct. We collect all the feedback and compute the ratio of judgements for each case. To reduce bias, we randomly shuffle the two versions of summaries. We conduct a similar evaluation on UniLM.

As shown in Table 6, the corrected summaries are significantly more likely to be judged as more factually correct for both baseline models. For example, 42.3% of the judgements think the corrected summaries are factually more correct, 42.7% think the corrected version neither improves nor worsens the factual consistency, while only 15.0% think that the corrected version becomes worse than the original BottomUp summary. Therefore, FC can help boost the factual consistency of summaries from given systems.

Finally, to evaluate the quality of the relation matching rate (RMR), we compute the correlation coefficient $\gamma$ between the factual score given by human labelers and the RMR value. The result shows that $\gamma=0.43$, indicating an observable relationship between RMR and human evaluation results.

Conclusion

In this paper, we extract factual information from the article to be represented by a knowledge graph. We then integrate this factual knowledge into the process of producing summaries. The resulting model FASum enhances the ability to preserve facts during summarization, demonstrated by both automatic and human evaluation. We also present a correction model, FC, to rectify factual errors in candidate summaries. Furthermore, we propose an easy-to-compute model-free metric, relation matching rate, to measure factual consistency based on the overlapping ratio of relational tuples.

For future work, we plan to integrate knowledge graphs into pre-training for more accurate and factually consistent summarization. Moreover, we will combine the internally extracted knowledge graph with an external knowledge graph (e.g. ConceptNet) to enhance the commonsense capability of summarization models.

References

Appendix A Implementation details

For hyperparameter search, we tried 4 layers with 4 heads, 6 layers with 6 heads and 10 layers with 10 heads.

There are 108.3M parameters in the FASum model, and it takes 2 hours (CNN/DailyMail) / 0.5 hours (XSum) for 4 V100 GPUs to train 1 epoch. The batch size is set to 48 for both datasets.

On validation datasets, FASum achieves ROUGE-1 41.08%, ROUGE-2 18.35% and ROUGE-L 37.95% on CNN/DailyMail, and it achieves ROUGE-1 30.28%, ROUGE-2 10.09% and ROUGE-L 23.85% on XSum.

Appendix B Factual Consistency Evaluator

To automatically evaluate the factual consistency of a summary, we leverage the FactCC model (Kryściński et al., 2019b), which frames consistency evaluation as a binary classification problem, namely finding a function $f:(A,C)\longrightarrow[0,1]$, where $A$ is an article and $C$ is a summary sentence defined as a claim. $f(A,C)$ represents the probability that $C$ is factually correct with respect to the article $A$. If a summary $S$ is composed of multiple sentences $C_{1},...,C_{k}$, we define the factual score of $S$ as: $f(A,S)=\frac{1}{k}\sum_{i=1}^{k}f(A,C_{i})$.
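The sentence-level averaging can be sketched as below. The scorer passed in as `f` stands in for the trained FactCC classifier; the toy substring scorer used in the example is purely a placeholder and is not how FactCC works.

```python
def summary_factual_score(article, claims, f):
    """Average the per-sentence scores f(A, C_i) over the k claims
    C_1..C_k that make up a summary."""
    if not claims:
        raise ValueError("summary must contain at least one sentence")
    return sum(f(article, c) for c in claims) / len(claims)


def toy_scorer(article, claim):
    """Placeholder for a trained classifier: 1.0 if the claim appears
    verbatim in the article, else 0.0."""
    return 1.0 if claim in article else 0.0


article = "Bale scored twice and Ronaldo scored five goals"
claims = ["Bale scored twice", "Bale scored five goals"]
score = summary_factual_score(article, claims, toy_scorer)
```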

To generate training data, we adopt backtranslation as a paraphrasing tool. The ground-truth summary is translated into an intermediate language, including French, German, Chinese, Spanish and Russian, and then translated back to English. Together with the original summaries, these claims are used as positive training examples. We then apply entity swap, negation and pronoun swap to generate negative examples (Kryściński et al., 2019b).

Following Kryściński et al. (2019b), we finetune the BERT-Base model with the same hyperparameters that were used to finetune FactCC. We concatenate the article and the generated claim together with the special tokens [CLS] and [SEP]. The final embedding of [CLS] is used to compute the probability that the claim is entailed by the article content.

As shown in Table 7, on CNN/DailyMail, our reproduced model achieves better accuracy than that in Kryściński et al. (2019b) on the human-labelled sentence-pair-ordering data (Falke et al., 2019). Thus, we use this evaluator for all the factual consistency assessment tasks in this paper. We use the same setting to train another evaluator for the XSum dataset.

Appendix C Examples

Tables 8, 9 and 10 show examples of CNN/DailyMail articles and summaries generated by our model and several baseline systems. The factual errors in the summaries are marked in red, the correct facts in the summaries of FASum are marked in green, and the corresponding facts are marked in bold in the article.

As shown, while baseline systems like BottomUp and UniLM achieve high ROUGE scores, they are susceptible to factual errors. For instance, in Article 5, both BottomUp and Seq2Seq wrongly state that Rickie Fowler accused Alexis. In fact, Alexis, Rickie’s girlfriend, was accused by an online hater. In Article 1, UniLM mistakenly summarizes that Arsenal lost 4-1, whereas in fact Arsenal won 4-1 against Liverpool.

In comparison, our proposed fact-aware summarizer FASum could faithfully summarize the salient information in the article. And it can re-organize the phrasing instead of merely copying content from the article.

Table 11 and Table 12 show examples of CNN/DailyMail articles, summaries generated by BottomUp/UniLM and the versions corrected by FC. As shown, our correction model can identify the wrong entities and replace them with correct ones. For instance, in Article 1, BottomUp’s summary states that Raul Castro, who appears elsewhere in the article, is the President of Venezuela, while FC correctly replaces him with Nicolas Maduro. In Article 4, UniLM wrongly attributes the statement to Scott’s lawyer (probably because “Scott” appears closer to the statement in the article), while it was actually said by Slager’s lawyer. This error is corrected by FC.