Factual Error Correction for Abstractive Summarization Models

Meng Cao, Yue Dong, Jiapeng Wu, Jackie Chi Kit Cheung

Introduction

Self-supervised methods have achieved success in a wide range of NLP tasks, and automatic summarization is no exception (Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2019a; Shi et al., 2019; Fabbri et al., 2019). These state-of-the-art abstractive summarization models typically finetune pre-trained transformer-based models (Vaswani et al., 2017) on a summarization dataset. Despite significant improvements over previous methods in terms of automatic evaluation scores such as ROUGE (Lin, 2004), ensuring factual consistency of the generated summary with respect to the source remains challenging. For example, Cao et al. (2018) claim that about 30% of summaries generated by abstractive models contain factual errors, which greatly limits their practicality.

Different approaches have been proposed to detect or ensure the factual consistency of generated summaries, including using fact extraction or applying attention on fact triples (Cao et al., 2018; Zhang et al., 2019b; Goodrich et al., 2019), applying natural language inference or question answering models for consistency checking (Falke et al., 2019; Li et al., 2018; Wang et al., 2020), and training the model on artificial datasets (Kryściński et al., 2019). Most of these approaches either require a high-quality fact extraction model or focus only on factual consistency evaluation. Improving factual consistency by editing the inconsistent parts of generated summaries is a direction that has not been explored much.

In this work, we propose a model to improve the factual consistency of system summaries with post-editing correction (Table 1). Our model takes a draft summary that is generated by an abstractive summarization model and produces a corrected final summary, conditioned on the source document. In addition, our trained corrector can be used as an evaluation model for factual consistency of abstractive summaries, with the assumption that a generated summary is inconsistent if our corrector decides to make edits. To teach the model to correct errors, we train it with artificial data that has factual errors introduced using heuristics proposed by Kryściński et al. (2019).

The empirical results based on automatic and human evaluations indicate that our model not only corrects factual errors in summaries but is also a reliable factuality evaluation model. In a downstream setting where we apply the corrector to the output of an abstractive summarizer, we find that our corrector is able to accurately correct errors in the generated summaries. However, the overall recall on correcting factual errors in real system summaries remains low, suggesting that the errors introduced by heuristics have a different distribution than errors made by abstractive summarization systems.

Background and Related Work

Previous work on factual consistency in abstractive summarization can be divided into two categories: abstractive summarization models tailored towards factual consistency (Cao et al., 2018; Zhang et al., 2019b; Li et al., 2018), and evaluation models for factual consistency in abstractive summarization (Goodrich et al., 2019; Falke et al., 2019; Kryściński et al., 2019; Wang et al., 2020).

Cao et al. (2018) proposed a dual attention module in an abstractive summarizer that attends to both the source document and to relation triples extracted from the document. Zhang et al. (2019b) propose to improve their abstractive summarization model by optimizing fact scores defined in radiology reports with reinforcement learning methods. Li et al. (2018) jointly train their model’s encoder on summarization and NLI tasks. Guo et al. (2018) train an abstractive summarization system with the auxiliary tasks of question and entailment generation and show that their generated summaries are less likely to produce extraneous facts. Kumar and Cheung (2019) show that neural abstractive summarizers often assign higher posterior likelihood to perturbed contrastive summaries that are inconsistent with the source text than to human-written gold-standard ones. Concurrently with our work, Zhu et al. (2020) proposed a fact-aware summarization model that uses a knowledge graph. They use a pre-trained corrector module to modify generated summaries. Also concurrently with our work, Dong et al. (2020) propose factual correction models that leverage knowledge learned from question answering models via span selection. Their models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities.

In terms of evaluating abstractive summarization models for factual consistency, Goodrich et al. (2019) proposed a metric to check factual consistency by checking the overlapped fact triples between a source document and generated text on Wikidata. Falke et al. (2019) show that factual error detection is a difficult task on its own and that adapting entailment models for factual error detection does not offer the desired performance. Kryściński et al. (2019) finetune a BERT model on heuristically-created data with six types of rule-based text transformations for factual consistency checking. Wang et al. (2020) propose a framework for measuring inconsistencies in abstractive summarization by answering questions based on both generated summaries and documents.

Proposed Approach

In this section, we describe our procedure of introducing artificial errors in the datasets for training and propose our end-to-end error corrector model.

Inspired by a recent study of error types made by state-of-the-art summarization systems, we artificially created a weakly-supervised training dataset based on the text transformations proposed by Kryściński et al. (2019).

Given a source text $d$ and the reference summary $s$, we corrupt the reference summary into an inconsistent summary $s^{\prime}$ with a randomly sampled corruption rule (described below) with probability $\alpha$; otherwise, we keep $s^{\prime}=s$ with probability $1-\alpha$. We set $\alpha=0.3$ to match the factuality error rate in real abstractive summaries based on a recent study (Cao et al., 2018). The training data consists of triplets $(s^{\prime},s,d)$.
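The sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the `toy_corrupt` stand-in rule are hypothetical, and only the value $\alpha=0.3$ and the triplet layout $(s^{\prime},s,d)$ come from the text.

```python
import random

def toy_corrupt(summary, document, rng):
    # Hypothetical stand-in for a real corruption rule: it only marks the summary.
    return summary + " [corrupted]"

def make_training_triplet(summary, document, corrupt_fn, alpha=0.3, rng=None):
    """Return (s_prime, s, d), where s_prime is corrupted with probability alpha."""
    if rng is None:
        rng = random.Random()
    if rng.random() < alpha:
        s_prime = corrupt_fn(summary, document, rng)  # inconsistent summary
    else:
        s_prime = summary                             # kept clean: s' = s
    return s_prime, summary, document
```

Passing a seeded `random.Random` makes the corrupted/clean split reproducible across runs.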

Four types of errors are used to create the inconsistent summaries: Entity, Number, Date, and Pronoun errors. They are the most common types of errors in abstractive summaries based on our manual inspection of 100 abstractive system-generated summaries that are sampled from the dataset of Kryściński et al. (2019) (henceforth, the K2019 dataset). Unlike Kryściński et al. (2019), we corrupt the reference summary rather than sentences sampled from the source document.

In all four types of error construction, we use a swapping strategy to introduce errors. For Entity, Number, and Date swapping, one entity in the reference summary is selected and swapped with another random entity of the same type in the source document. All entities are extracted using a pre-trained NER model in spaCy (https://spacy.io/). For Pronoun swapping, one pronoun is extracted and swapped with another one of a matching syntactic case. Table 2 shows one example of a corruption.
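A rough sketch of the entity swapping strategy is given below. It assumes the entities have already been extracted as `(text, label)` pairs, for example by an NER model; the function name and the data layout are illustrative assumptions and not the authors' implementation.

```python
import random

def swap_entity(summary, summary_ents, document_ents, rng=None):
    """Swap one summary entity with a different same-type entity from the document.

    summary_ents and document_ents are lists of (text, label) pairs (assumed input).
    Returns the corrupted summary, or None if no valid swap exists.
    """
    if rng is None:
        rng = random.Random()
    candidates = []
    for text, label in summary_ents:
        # Same entity type, different surface form, taken from the source document.
        replacements = sorted({t for t, l in document_ents if l == label and t != text})
        if replacements and text in summary:
            candidates.append((text, replacements))
    if not candidates:
        return None
    text, replacements = rng.choice(candidates)
    # Replace only the first mention of the selected entity.
    return summary.replace(text, rng.choice(replacements), 1)
```

For instance, with `("Leo Woolfenden", "PERSON")` in the summary and `("Steve Woolfenden", "PERSON")` as the only other person in the document, the sketch turns "Leo Woolfenden, 3, was conscious." into "Steve Woolfenden, 3, was conscious."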

Training Objective and Models

With the artificial training data consisting of triplets $(s^{\prime},s,d)$, the goal of the corrector is to generate the correct summary $s$ based on the inconsistent summary $s^{\prime}$ and the source $d$. This can be expressed as a problem of maximizing the likelihood of $P(s|s^{\prime},d)$ in an encoder-decoder model. We concatenate $s^{\prime}$ and $d$ as input to the encoder ($s^{\prime}$ and $d$ are separated by a separation token) and train the decoder to generate $s$.
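The encoder input described above could be assembled as in the following sketch. The separator string `"</s>"` and the dictionary layout are assumptions made for illustration; the text only states that a separation token is placed between $s^{\prime}$ and $d$.

```python
def build_corrector_example(s_prime, document, target, sep_token="</s>"):
    """Build one (source, target) pair for the corrector.

    sep_token is an assumed placeholder for the separation token.
    """
    # Encoder input: possibly inconsistent summary, separator, source document.
    source = f"{s_prime} {sep_token} {document}"
    # Decoder target: the reference (consistent) summary s.
    return {"source": source, "target": target}
```

Placing $s^{\prime}$ first keeps the draft summary intact if a long source document has to be truncated to fit the encoder's input limit, though this ordering is a design choice of the sketch.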

We use BART (Lewis et al., 2019) as the basis of our summary corrector because of its demonstrated level of performance on conditional text generation tasks. BART is a sequence-to-sequence auto-regressive transformer model that is pre-trained as a denoising auto-encoder: given an input sentence that is corrupted by text infilling, token deletion, and other text transformations, BART is trained to output the original sentence. This pre-training task is similar to our summary correction task, in which we can regard the corrupted or generated summary as the noisy input, where the noise is the inconsistent content in the summary.

Experiments

We evaluate our model on two tasks: factual consistency checking and error correction.

For factual consistency checking, the model needs to classify each original input summary as consistent or inconsistent with respect to the source text. It is thus a binary classification task for which we report accuracy, as well as precision, recall, and F1.

We interpret the output of our corrector model as a classification decision as follows. If the corrector makes any change to the original input summary, we consider this to be a prediction of the inconsistent class. Otherwise, the corrector makes no change and we consider this a prediction of the consistent class.
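This decision rule and the reported metrics can be sketched as follows. The sketch assumes exact string comparison to detect "any change" and treats the inconsistent class as the positive class; both are assumptions, and the function names are illustrative.

```python
def predict_inconsistent(original, corrected):
    """Any edit by the corrector is read as a prediction of the inconsistent class."""
    return original != corrected

def binary_metrics(gold, pred):
    """Accuracy, precision, recall, and F1 with True = inconsistent (assumed positive)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Under this rule, a purely cosmetic edit (for example, a change in capitalization) also counts as a prediction of the inconsistent class.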

Error correction

For this task, the model must correct inconsistencies in the original summary (if any) with respect to the source text.

We define correction accuracy as the proportion of original summaries that are correctly changed by our corrector. On our artificial test set, an input summary is considered successfully corrected if the corrected summary matches the reference summary exactly. For the K2019 dataset, no reference corrections are available. We instead conducted a human evaluation to check the consistency of the corrected output. We read the original and corrected summaries as well as the source document to determine whether a summary is successfully corrected by our model.
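For the artificial test set, the exact-match criterion can be written as the short sketch below; the function name is an illustrative assumption, and no text normalization is applied since the text describes a strict exact match.

```python
def correction_accuracy(corrected, references):
    """Fraction of corrector outputs that match the reference summary exactly."""
    if len(corrected) != len(references):
        raise ValueError("corrected and references must have the same length")
    matches = sum(1 for c, r in zip(corrected, references) if c == r)
    return matches / len(references)
```

Because the match is exact, an output that is factually consistent but worded differently from the reference is still counted as a failure.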

Datasets

We use two datasets for our experiments. The first is the dataset of artificial corruptions described in Section 3.1, which we create by taking samples from the CNN/DailyMail dataset. There are in total 287,227 samples in the training set, and we corrupted 30% of them (85,583). This results in 16,858/35,113/13,408/20,204 date/entity/number/pronoun corrupted samples, respectively. We refer to the other 201,644 training samples as clean samples. We also create artificial validation and test sets for model selection and evaluation. In the test set, there are 5,780 corrupted samples and 5,710 clean samples.
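As a quick arithmetic check of these training-set counts, the per-type figures can be summed and compared with the totals quoted above. The variable names are illustrative; all numbers are taken from the paragraph.

```python
# Corrupted training samples per error type, as reported in the text.
per_type = {"date": 16858, "entity": 35113, "number": 13408, "pronoun": 20204}
total_samples = 287227

corrupted = sum(per_type.values())         # should equal the reported 85,583
clean = total_samples - corrupted          # should equal the reported 201,644
corrupted_fraction = corrupted / total_samples

print(corrupted, clean, round(corrupted_fraction, 3))
```

The corrupted fraction comes out slightly below 0.3, which is consistent with sampling each example for corruption with probability $\alpha=0.3$.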

The second dataset we use is the K2019 test set of Kryściński et al. (2019). This dataset contains 503 summaries generated by different recent neural abstractive summarizers, which have been manually labeled for whether they contain an inconsistency.

We evaluate our model on both datasets. We did not use baselines for the artificial test set since it is simply used as a check to demonstrate our model’s performance in the artificial setting. The more meaningful evaluations are on K2019 consistency checking and error correction.

Corrector Training Details

We use the BART implementation from fairseq (https://github.com/pytorch/fairseq/blob/master/examples/bart) as the basis of our corrector. The pre-trained BART model is fine-tuned on our training dataset for 10 epochs as described in Section 3.2. The learning rate is set to 3e-5. All our experiments are done on 4 NVIDIA Tesla V100 GPUs. The training process takes about 12 hours.

Results

Table 3 shows the consistency checking performance of our corrector model on our artificial test set. The high classification accuracy and F1 scores indicate that our model is able to identify these artificially injected errors.

For error correction, among the 5,780 corrupted summaries in the test set, 62.13% are corrected by the model to exactly match the reference summary. For the 5,710 clean summaries, the model made changes to 26.27% of them, which results in 73.73% correction accuracy on clean summaries. These results show that the model is able to correct the majority of the test samples even under our strict evaluation measure.

K2019

Table 4 shows the consistency checking results on the K2019 test set. Our model performs better than the BERT model and slightly worse than the FactCC model.

As for correction performance, Table 6 shows the results of our human evaluation. Among the 62 inconsistent summaries in the test set, the corrector model made changes to 19 summaries, of which 11 were successfully corrected and 7 remained inconsistent. For the remaining 441 consistent summaries in the test set, changes were made to 39 summaries, and the model changed the meaning of 5 samples. In summary, our model successfully corrects an inconsistent summary with 17.74% probability and corrupts a consistent one with 1.13% probability. Compared with the correction rate of 62.13% on the artificial test set, the much lower correction rate on the real test set suggests that there is still a gap between the two settings: the error types in the training set are not able to represent the diverse errors made by summarization systems.
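The two rates follow directly from the counts in this paragraph, as the short calculation below shows. The variable names are illustrative; the counts are those reported above.

```python
inconsistent_total = 62   # inconsistent summaries in the K2019 test set
corrected_ok = 11         # of those, successfully corrected
consistent_total = 441    # consistent summaries in the K2019 test set
meaning_changed = 5       # of those, corrupted by the corrector

correction_rate = 100 * corrected_ok / inconsistent_total
corruption_rate = 100 * meaning_changed / consistent_total

print(f"{correction_rate:.2f}% corrected, {corruption_rate:.2f}% corrupted")
```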

Output Analysis

Table 5 shows several input and output summaries of our corrector model together with the source document fragments. In the second example, the model correctly replaced 147 with 19, but was not able to correctly remove “including 142 students”, which is a larger modification to the original summary. More examples can be found in the Appendix.

Conclusions

In this paper, we proposed a novel approach to correct inconsistent content in summaries generated by abstractive summarization models. We train an end-to-end correction model with artificial examples created by corrupting reference summaries. Our model achieved promising performance on our artificial test set; on the manually annotated test set, it outperformed the BERT model and was slightly worse than FactCC. Our human evaluation indicates that our model is able to correct some factually inconsistent summaries generated by abstractive summarization models. However, low recall on the inconsistent summaries and false positive samples remain challenges.

Acknowledgments

This research was supported by the Canada CIFAR AI Chair program, the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de recherche du Québec – Nature et technologies (FRQNT). We would also like to thank Compute Canada for providing us computing resources.

References

Appendix A

Table 5 shows examples of generated summaries and outputs from our corrector. We put more examples here:

Article: (CNN)In case you needed a reminder that President Barack Obama isn’t running for office again, he just alienated not only Republicans, who have largely resented him from day one, but the progressive base of Democratic voters. Obama has argued with the progressive potentate Elizabeth Warren, calling her “wrong” on trade policy. (…) Original: Obama has argued with the progressive potentate elizabeth warren. (consistent) Corrected: President Obama has argued with the progressive potentate elizabeth warren. (consistent)

Article: (The Hollywood Reporter)The author of a 2006 novel has accused the “Avengers” director and “Cabin” director Drew Goddard of stealing his idea. (…) Gallagher is basing his claim on the works’ similar premises: Both feature a group of young people terrorized by monsters while staying at a cabin in what is revealed to be (spoiler alert) a horror-film scenario designed by mysterious operators. (…) Original: Gallagher is basing his claim on the works’ names and personalities. (inconsistent) Corrected: Peter Gallagher is basing his claim on the works’ names and personalities. (inconsistent)

Article: (CNN)Too little, too late. (…) After the story of the statue caught fire online this weekend, Poulin publicly apologized Monday for his “most unsettling sculpture” in a letter to The Hollywood Reporter. (…) Original: Poulin publicly apologized for his “most unsettling sculpture” in a letter to the hollywood reporter. (consistent) Corrected: Dave Poulin publicly apologized for his “most unsettling sculpture” in a letter to the hollywood reporter. (consistent)

Article: (CNN)If I had to describe the U.S.-Iranian relationship in one word it would be “overmatched”. (…) America is alienating some of our closest allies because of the Iran deal, and Iran is picking up new ones and bolstering relations with old ones who are growing more dependent because they see Iran’s power rising. (…) Original: Iran is alienating some of our closest allies because of the iran deal, and iran is picking up new ones. (inconsistent) Corrected: America is alienating some of our closest allies because of the Iran deal, and Iran is picking up new ones. (consistent)

Article: (…) McHenry quickly issued an apology, blaming the incident on a moment of intense frustration but admitting her mistake and accepting responsibility. (…) Original: Mchenry apologizes to the incident on a moment of intense frustration. (consistent) Corrected: Britt Mchenry apologizes to the incident on a moment of intense frustration. (consistent)

Article: Boston (CNN)When the bomb went off, Steve Woolfenden thought he was still standing. That was because, as he lay on the ground, he was still holding the handles of his son’s stroller. He pulled back the stroller’s cover and saw that his son, Leo, 3, was conscious but bleeding from the left side of his head. (…) Original: Steve woolfenden, 3, was conscious but bleeding from the left side of his head. (inconsistent) Corrected: Leo Woolfenden, 3, was conscious but bleeding from the left side of his head. (consistent)

Article: (CNN)Mercedes driver and F1 championship leader Lewis Hamilton stole pole position for Sunday’s Chinese Grand Prix from teammate and fierce rival Nico Rosberg in dramatic fashion. (…) He did, however, find time to congratulate fellow German driver Sebastian Vettel, who will start in third after the Ferrari driver surprisingly won the Malaysian GP two weeks ago. (…) Original: Sebastian vettel won the malaysian gp two weeks ago. (consistent) Corrected: Nico Rosberg won the malaysian gp two weeks ago. (inconsistent)

Article: (CNN)At least 21 people were killed during a shipwreck off the northern coast of Haiti, the country’s civil protection directorate told CNN on Thursday. (…) So far, 11 victims – eight men and three women – have been identified, Celestin said. (…) Original: The 21 people are three women and three women have been identified. (inconsistent) Corrected: The 21 people are eight women and three women have been identified. (inconsistent)

Article: (CNN)Oklahoma Gov. Mary Fallin signed a bill on Friday that would allow the state to perform executions with nitrogen gas if lethal injection is ruled unconstitutional or becomes unavailable. Nitrogen causes a quick loss of consciousness and then death from lack of oxygen, Fallin’s office said in a press release. (…) Original: Nitrogen causes a quick loss of consciousness and then death from lack of oxygen. (consistent) Corrected: Netherlands causes a quick loss of consciousness and then death from lack of oxygen. (inconsistent)