Multi-Fact Correction in Abstractive Text Summarization

Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, Jingjing Liu

Introduction

Informative text summarization aims to shorten a long piece of text while preserving its main message. Existing systems can be divided into two main types: extractive and abstractive. Extractive strategies directly copy text snippets from the source to form summaries, while abstractive strategies generate summaries containing novel sentences not found in the source. Although extractive strategies are simpler and less expensive, and can generate summaries that are more grammatically and semantically correct, abstractive strategies are becoming increasingly popular thanks to their flexibility, coherence and vocabulary diversity (Zhang et al., 2020a).

Recently, with the advent of Transformer-based models (Vaswani et al., 2017) pre-trained using self-supervised objectives on large text corpora (Devlin et al., 2019; Radford et al., 2018; Lewis et al., 2020; Raffel et al., 2020), abstractive summarization models are surpassing extractive ones on automatic evaluation metrics such as ROUGE (Lin, 2004). However, several studies (Falke et al., 2019; Goodrich et al., 2019; Kryściński et al., 2019; Wang et al., 2020; Durmus et al., 2020; Maynez et al., 2020) observe that despite high ROUGE scores, system-generated abstractive summaries are often factually inconsistent with respect to the source text. Factual inconsistency is a well-known problem for conditional text generation, which requires models to generate readable text that is faithful to the input document. Consequently, sequence-to-sequence generation models need to learn to balance signals between the source for faithfulness and the learned language modeling prior for fluency (Kryściński et al., 2019). These dual objectives render abstractive summarization models highly prone to hallucinating content that is factually inconsistent with the source documents (Maynez et al., 2020).

Prior work has pushed the frontier of guaranteeing factual consistency in abstractive summarization systems. Most of it focuses on proposing evaluation metrics that are specific to factual consistency, as multiple human evaluations have shown that ROUGE and BERTScore (Zhang et al., 2020b) correlate poorly with faithfulness (Kryściński et al., 2019; Maynez et al., 2020). These evaluation models range from fact triples (Goodrich et al., 2019), textual entailment predictions (Falke et al., 2019), and adversarially pre-trained classifiers (Kryściński et al., 2019) to question answering (QA) systems (Wang et al., 2020; Durmus et al., 2020). It is worth noting that QA-based evaluation metrics show surprisingly high correlations with human judgment on factuality (Wang et al., 2020), indicating that QA models are robust in capturing facts that can benefit summarization tasks.

On the other hand, some work focuses on model design to incorporate factual triples (Cao et al., 2018; Zhu et al., 2020) or textual entailment (Li et al., 2018; Falke et al., 2019) to boost factual consistency in generated summaries. Such models are effective in boosting factual scores, but often at the expense of significantly lowering the ROUGE scores of the generated summaries. This happens because the models struggle to balance generating pivotal content with retaining true facts, and often end up sacrificing informativeness for the sake of correctness. In addition, these models inherit the backbone of generative models that suffer from hallucination despite the regularization from complex knowledge graphs or text entailment signals.

In this work, we propose SpanFact, a suite of two neural-based factual correctors that improve summary factual correctness without sacrificing informativeness. To ensure the retention of semantic meaning in the original documents while keeping the syntactic structures generated by advanced summarization models, we focus on factual edits on entities only, a major source of hallucinated errors in abstractive summarization systems in practice (Kryściński et al., 2019; Maynez et al., 2020). The proposed model is inspired by the observation that fact-checking QA models are a reliable medium for assessing whether an entity should be included in a summary as a fact (Wang et al., 2020; Durmus et al., 2020). To our knowledge, we are the first to adapt QA knowledge to enhance abstractive summarization. Compared to sequential generation models that incorporate complex knowledge graphs and NLI mechanisms to boost factuality, our approach is lightweight and can be readily applied to any system-generated summaries without retraining the model. Empirical results on multiple summarization datasets show that the proposed approach significantly improves summarization quality over multiple factuality measures without sacrificing ROUGE scores.

Our contributions are summarized as follows. ( $i$ ) We propose SpanFact, a new factual correction framework that focuses on correcting erroneous facts in generated summaries and generalizes to any summarization system. ( $ii$ ) We propose two methods to solve the multi-fact correction problem with single or multi-span selection in an iterative or auto-regressive manner, respectively. ( $iii$ ) Experimental results on multiple summarization benchmarks demonstrate that our approach can significantly improve multiple factuality measurements without a huge drop in ROUGE scores.

Related Work

The general neural-based encoder-decoder structure for abstractive summarization was first proposed by Rush et al. (2015). Later work improves this structure with better encoders, such as LSTMs (Chopra et al., 2016) and GRUs (Nallapati et al., 2016), that are able to capture long-range dependencies, as well as with reinforcement learning methods that directly optimize summarization evaluation scores (Paulus et al., 2018). One drawback of the earlier neural-based summarization models is the inability to produce out-of-vocabulary words, as the model can only generate whole words based on a fixed vocabulary. See et al. (2017) propose a pointer-generator framework that can copy words directly from the source through a pointer network (Vinyals et al., 2015), in addition to the traditional sequence-to-sequence generation model.

Abstractive summarization started to shine with the advent of self-supervised algorithms, which allow deeper and more complicated neural networks such as Transformers (Vaswani et al., 2017) to learn diverse language priors from large-scale corpora. Models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018) and BART (Lewis et al., 2020) have achieved new state-of-the-art performance on abstractive summarization (Liu and Lapata, 2019; Lewis et al., 2020; Zhang et al., 2020a; Shi et al., 2019; Fabbri et al., 2019). These models often finetune pre-trained Transformers on supervised summarization datasets that contain pairs of source and summary.

However, encoder-decoder architectures widely used in abstractive summarization systems are inherently difficult to control and prone to hallucination (Vinyals and Le, 2015; Koehn and Knowles, 2017; Lee et al., 2018), and often lead to factual inconsistency: the system-generated summary is fluent but unfaithful to the source (Cao et al., 2018). Studies have shown that 8% to 30% of system-generated abstractive summaries have factual errors (Falke et al., 2019; Kryściński et al., 2019) that cannot be discovered by ROUGE scores. Recent studies have proposed new methods to ensure factual consistency in summarization. Cao et al. (2018) and Zhu et al. (2020) propose RNN-based and Transformer-based decoders, respectively, that attend to both the source and extracted knowledge triples. Li et al. (2018) propose an entailment-reward augmented maximum-likelihood training objective, and Falke et al. (2019) propose to rerank beam results based on entailment scores with respect to the source.

Our fact correction models are inherently different from these models, as we focus on post-correcting summaries generated by any model. Our models are trained with the objective of predicting masked entities identified for fact correction (Figure 1), and learn to fill in the entity masks of any system-generated summary with a single- or multi-span selection mechanism (Figure 2). The work most similar to ours was proposed concurrently by Meng et al. (2020), who fine-tune a BART (Lewis et al., 2020) model on distant supervision examples and use it as a post-editing model for factual error correction.

Multi-Fact Correction Models

In this section, we describe two models proposed for factual error correction: $(i)$ QA-span Fact Correction model, and $(ii)$ Auto-regressive Fact Correction model. As both methods rely on span selection with different masking and prediction strategies, we call them SpanFact collectively.

Let $(x,y)$ be a document-summary pair, where $x=(x_{1},\ldots,x_{M})$ is the source sequence with $M$ tokens, and $y=(y_{1},\ldots,y_{N})$ is the target sequence with $N$ tokens. An abstractive summarization model aims to model the conditional likelihood $p(y|x)$, which can be factorized into a product $p(y|x)=\prod_{t=1}^{N}p(y_{t}|y_{1,\ldots,t-1},x)$, where $y_{1,\ldots,t-1}$ denotes the tokens preceding position $t$. The conditional maximum-likelihood objective ideally requires summarization models to optimize not only for informativeness but also for correctness. In reality, however, this often fails, as the models have a high propensity for leaning towards informativeness rather than correctness (Li et al., 2018).

Suppose a summarization system generates a sequence of tokens $y^{\prime}=(y^{\prime}_{1},\ldots,y^{\prime}_{N})$ to form a summary. Our factual correction models aim to edit an informative-yet-incorrect summary into $y^{\prime\prime}=(y^{\prime\prime}_{1},\ldots,y^{\prime\prime}_{K})$ such that

$$f(x,y^{\prime\prime})>f(x,y^{\prime}), \tag{1}$$

where $f$ is a metric measuring factual consistency between the source and system summary.

3.2 Span Selection Dataset

Our fact correction models are inspired by the span selection task, which is often used in reading comprehension tasks such as question answering. Figure 1 shows examples of the span selection datasets we created for training our QA-span and auto-regressive fact correction models, respectively. The query is a reference summary masked with one or all entities, and the passage is the corresponding source document to be summarized. In this work, we use the SpaCy NER tagger (Honnibal and Montani, 2017) to identify entities for data construction. If an entity appears multiple times in the source document, we rank the occurrences based on the fuzzy string-matching scores (a variation of Levenshtein distance) between the query sentence and the source sentence containing the entity. Our models explicitly learn to predict the span of the masked entity rather than pointing to a specific token as in Pointer Network (Vinyals et al., 2015), because the original tokens and replaced tokens often have different lengths.
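The construction of one single-mask training example can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: entities are passed in directly rather than tagged with SpaCy, `difflib.SequenceMatcher` stands in for the Levenshtein-style fuzzy score, and the function names `similarity` and `build_example` are hypothetical.

```python
from difflib import SequenceMatcher

MASK = "[MASK]"


def similarity(a, b):
    """Fuzzy string similarity in [0, 1]; a stand-in for the Levenshtein-style score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_example(summary, entity, source_sentences):
    """Mask one entity in the reference summary and locate its answer span in the source."""
    query = summary.replace(entity, MASK, 1)
    # Source sentences that contain the entity verbatim are the candidate answers.
    candidates = [(i, s) for i, s in enumerate(source_sentences) if entity in s]
    if not candidates:
        # No matching span in the source: the caller decides how to label this case.
        return {"query": query, "answer": None}
    # Rank candidates by how closely the source sentence matches the query sentence.
    sent_idx, sent = max(candidates, key=lambda c: similarity(summary, c[1]))
    start = sent.index(entity)
    return {
        "query": query,
        "answer": entity,
        "sentence": sent_idx,
        "char_span": (start, start + len(entity)),
    }
```

When an entity occurs in several source sentences, the occurrence whose sentence best matches the summary sentence is chosen as the gold span, which mirrors the ranking step described above.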

Our QA-span fact correction model iteratively masks and replaces one entity at a time, while the auto-regressive model masks all the entities simultaneously and replaces them in an auto-regressive fashion from left to right. Figure 2 shows an overview of our models. Comparing the two models, the QA-span fact correction model works better when only a few errors exist in the draft summary, as the predictions for the individual masks are relatively independent of each other. On the other hand, the auto-regressive fact correction model starts with a skeleton summary that has all the entities masked, which is often more robust when summaries contain many factual errors.

3.3 QA-Span Fact Correction Model

In the iterative setting, our model aims to conduct entity correction by answering a query that contains only one mask at a time. Suppose a system summary has $T$ entities. At time step $i$ , we mask the $i$ -th entity and use this masked sequence as the query to our QA-span model. The prediction is placed into the masked slot in the query to generate an updated system summary to be used in the next step.
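The iterative procedure described above can be sketched as a simple loop. This is an illustration only: `predict_span` is a hypothetical callable standing in for the trained QA-span model, and entity spans are assumed to be given as sorted, non-overlapping token index pairs.

```python
MASK = "[MASK]"


def iterative_correct(summary_tokens, entity_spans, source, predict_span):
    """Mask and replace one entity at a time, from left to right.

    summary_tokens: tokens of the draft summary
    entity_spans:   sorted, non-overlapping (start, end) token indices, end exclusive
    predict_span:   hypothetical callable (query_tokens, source) -> list of tokens or None
    """
    tokens = list(summary_tokens)
    offset = 0  # a replacement may change the summary length
    for start, end in entity_spans:
        s, e = start + offset, end + offset
        original = tokens[s:e]
        # The query contains exactly one mask; earlier entities are already replaced.
        query = tokens[:s] + [MASK] + tokens[e:]
        predicted = predict_span(query, source)
        replacement = predicted if predicted else original
        tokens = tokens[:s] + replacement + tokens[e:]
        offset += len(replacement) - len(original)
    return tokens
```

Each step conditions on the summary as updated by the previous steps, so a replacement made at step $i$ becomes part of the context for step $i+1$.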

Given the source text $x$ and a masked query $q=(y^{\prime}_{1},\ldots,\texttt{[MASK]},\ldots,y^{\prime}_{m})$, our iterative correction model aims to predict the answer span via modeling $p(i=\textnormal{start})$ and $p(i=\textnormal{end})$. For span selection, we use the BertForQuestionAnswering model (https://github.com/huggingface/transformers), which adds two separate non-linear layers on top of Transformers as pointers to the start and end token position for the answer. We initialize the fact-correction model from a pre-trained BERT model (Devlin et al., 2019), and perform finetuning with the span selection datasets we created from the summarization datasets (Figure 1).

The input to the BERT model is a concatenation of two segments: the masked query $q$ and the source $x$, separated by special delimiter markers as ( $\texttt{[CLS]},q,\texttt{[SEP]},x$ ). Each token in the sequence is assigned three embeddings: token embedding, position embedding, and segmentation embedding (the segmentation embedding is used to distinguish the query, with the two special tokens [CLS] and [SEP], from the source in our models). These embeddings are summed into a single vector and fed to the multi-layer Transformer model:

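$$\tilde{\boldsymbol{h}}^{l}=\mathrm{LN}\big({\boldsymbol{h}}^{l-1}+\mathrm{MHAtt}({\boldsymbol{h}}^{l-1})\big) \tag{2}$$

$${\boldsymbol{h}}^{l}=\mathrm{LN}\big(\tilde{\boldsymbol{h}}^{l}+\mathrm{FFN}(\tilde{\boldsymbol{h}}^{l})\big) \tag{3}$$

where ${\boldsymbol{h}}^{0}$ are the input vectors, and $l$ represents the depth of stacked layers. LN, MHAtt and FFN are the layer normalization, multi-head attention and position-wise feed-forward operations (Vaswani et al., 2017). The top layer provides the hidden states for the input tokens with rich contextual information. The start (s) and end (e) of the answer span are predicted as: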

3.4 Auto-regressive Fact Correction Model

One disadvantage of the QA-style span-prediction strategy is that if the sequence contains too many factual errors, masking out one entity at a time may leave a highly erroneous skeleton summary to start with. The model might then be making predictions on top of wrong entities from later in the sequence. Masking one entity at a time is essentially a greedy local method that is prone to error accumulation. To alleviate this issue, we propose a new sequential fact correction model that handles errors in a more global manner with beam search. Specifically, we mask out all the entities simultaneously, and use a novel auto-regressive span-selection decoder to predict fillers for the multiple masks sequentially. By doing this, we assume dependency between the masks: the earlier predicted entities are used as corrected context for better predictions in the later steps.

Given a source text $x=(x_{1},\ldots,x_{n})$ and a draft summary $(y^{\prime}_{1},\ldots,y^{\prime}_{m})$, our model first masks out all the entities (with $T$ masks), and leaves a skeleton summary as the query $q=(y^{\prime}_{1},\ldots,\texttt{[MASK]}_{1},\ldots,\texttt{[MASK]}_{T},\ldots,y^{\prime}_{m})$. Then, we concatenate the query $q$ with the source $x$ (similar to Section 3.3) as inputs to the encoder. The inputs are fed into BERT to obtain contextual hidden representations.
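Building the skeleton query and the encoder input can be sketched as below. This is a minimal sketch under assumptions: the function name `build_skeleton_input` is hypothetical, the numbered mask strings only mirror the notation $\texttt{[MASK]}_{t}$ (a real tokenizer would typically use a single mask token and track positions), and the segment ids follow the convention that the query side, including [CLS] and [SEP], is segment 0.

```python
def build_skeleton_input(summary_tokens, entity_spans, source_tokens):
    """Mask every entity span and build the ([CLS], q, [SEP], x) encoder input.

    entity_spans: sorted, non-overlapping (start, end) token indices, end exclusive
    """
    query, prev_end = [], 0
    for t, (start, end) in enumerate(entity_spans, start=1):
        # Copy the tokens before the entity, then emit one numbered mask for it.
        query += summary_tokens[prev_end:start] + ["[MASK]_%d" % t]
        prev_end = end
    query += summary_tokens[prev_end:]

    tokens = ["[CLS]"] + query + ["[SEP]"] + source_tokens
    # Segment 0 covers the query with its two special tokens, segment 1 the source.
    segment_ids = [0] * (len(query) + 2) + [1] * len(source_tokens)
    return tokens, segment_ids
```

A multi-token entity collapses into a single mask, which is why the decoder later needs to select a span rather than a single token to fill it.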

The input ${\boldsymbol{z}}_{t}$ is then fed to the Transformer decoder (as in Eqn. (2) and (3)) to generate the decoder’s hidden state ${\boldsymbol{h}}^{\prime}_{t}$ at time step $t$. Based on ${\boldsymbol{h}}^{\prime}_{t}$, we use a two-pointer network to predict the start and end positions of the answer entity in the source (encoder’s hidden states). This is achieved with cross-attention of ${\boldsymbol{h}}^{\prime}_{t}$ w.r.t. the encoder’s hidden states, similar to Eqn. (4) and (5). This operation results in two distributions over the encoder’s hidden states for the start and end span positions. The final prediction of the start and end positions for mask $t$ is obtained by taking the argmax over the pointer position distributions (the argmax is used for selecting the start and end indexes for the answer span, and the softmax is used for computing the loss for back-propagation):

under the constraint that $p_{start}<p_{end}$ and $p_{end}-p_{start}<k$ .
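The constrained selection can be illustrated with a small exhaustive search. This is a sketch under an assumption not stated in the text: the span is scored by the product of its start and end probabilities, which is a common choice for span-selection models, and `best_span` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def best_span(start_logits, end_logits, k):
    """Return the (start, end) pair with the highest p(start) * p(end),
    subject to start < end and end - start < k. Returns None if no pair is valid."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best, best_score = None, -1.0
    for s, ps in enumerate(p_start):
        # Valid end positions are s + 1, ..., s + k - 1 (clipped to the sequence).
        for e in range(s + 1, min(s + k, len(p_end))):
            score = ps * p_end[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

The length bound $k$ keeps the search linear in the sequence length for a fixed $k$ and rules out implausibly long entity spans.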

Based on the start and end positions for the predicted entity, we can obtain the predicted entity representation at time step $t$ as the mean over the in-span encoder’s hidden states:

which is used as the input for the next step of decoding. It is worth noting that although the argmax operations in Eqn. (9) and (10) are non-differentiable, the model is trained based on the start and end positions of the ground-truth answer w.r.t. the start and end logits in Eqn. (4) and (5), which allows the gradient to back-propagate to the encoder. Meanwhile, the encoder’s hidden states used to compose ${\boldsymbol{s}}^{ent}_{i}$ in Eqn. (11) also carry the gradients. During inference, beam search is used to find the best sequence of predicted spans in the source to replace the masks.
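The entity representation described above, the mean over the in-span encoder hidden states, can be sketched in plain Python. The inclusive end index and the function name `entity_representation` are assumptions made for illustration; hidden states are represented as lists of floats rather than tensors.

```python
def entity_representation(encoder_states, start, end):
    """Average the encoder hidden states over positions start..end.

    encoder_states: list of equal-length float vectors, one per input token
    start, end:     predicted span boundaries (end treated as inclusive here)
    """
    span = encoder_states[start:end + 1]
    dim = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]
```

Because the pooled vector is built from the encoder states themselves, a tensor implementation of the same averaging lets gradients reach the encoder even though the index selection is not differentiable.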

Compared to the conventional Pointer Network (Vinyals et al., 2015; See et al., 2017) that only points to one token at a time, our sequential span selection decoder has the flexibility to replace a mask by any number of entity tokens, which is often required in summary factual correction.

Experiment

In this section, we present our results on using SpanFact for multiple summarization datasets.

Training data for our fact correction models are generated as described in Section 3.2 on CNN/DailyMail (Hermann et al., 2015), XSum (Narayan et al., 2018) and Gigaword (Graff et al., 2003; Rush et al., 2015). The statistics of these three datasets are provided in Table 2. During training, if an entity does not have a corresponding span in the source, we point the answer span to the [CLS] token. During inference, if the predicted answer span is the [CLS] token, we restore the original masked entity.
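The inference-time fallback can be sketched as follows. The sketch assumes that a [CLS] prediction is encoded as start = end = 0, as is conventional for unanswerable questions in SQuAD 2.0-style models; the helper name `fill_mask` and the inclusive end index are illustrative choices.

```python
CLS_INDEX = 0  # position of [CLS] in the encoder input


def fill_mask(original_entity, input_tokens, start, end):
    """Return the tokens that replace one mask.

    original_entity: tokens of the entity that was masked out
    input_tokens:    encoder input tokens, with [CLS] at position 0
    start, end:      predicted span boundaries (end inclusive)
    """
    if start == CLS_INDEX and end == CLS_INDEX:
        # The model found no supporting span: keep the original entity.
        return list(original_entity)
    return input_tokens[start:end + 1]
```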

Our fact correction models are implemented via the Huggingface Transformers library (Wolf et al., 2019) in PyTorch (Paszke et al., 2017). We initialize all encoder models with the checkpoint of an uncased, large BERT model pre-trained on English data and SQuAD for all experiments. Both source and target texts were tokenized with BERT’s sub-word tokenizer. The max sequence length is set to 512 for the encoder. We use a shallow Transformer decoder (L=2) for the auto-regressive span selection decoder, as the pre-trained BERT-large encoder is already robust at selecting the right spans in the single-span selection task with only two pointers (Section 3.3). The Transformer decoder has 1,024 hidden units and the feed-forward intermediate size for all layers is 4,096.

All models were finetuned on our span prediction data for 2 epochs with batch size 12. The AdamW optimizer (Loshchilov and Hutter, 2017) with $\epsilon=$ 1e-8 and an initial learning rate of 3e-5 is used for training. Our learning rate schedule follows a linear decay scheduler with warmup=10,000. During inference, we use beam search with $b=5$ and $k=10$ (the constraint on the distance between the start and end pointers). The best model checkpoints are chosen based on performance on the validation set. Experiments are conducted using 4 Quadro RTX 8000 GPUs with 48GB of memory.
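The learning rate schedule can be written out explicitly. This is a sketch of a standard linear warmup followed by linear decay, using the reported initial learning rate and warmup length; the total number of training steps is not given in the text, so `total_steps` is a free parameter, and the function name `linear_schedule` is hypothetical.

```python
def linear_schedule(step, total_steps, base_lr=3e-5, warmup_steps=10_000):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to base_lr over warmup_steps, then decays
    linearly to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```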

4.2 Evaluation Metrics

We use three automatic evaluation metrics to evaluate our models. The first is ROUGE (Lin, 2004), the standard summarization quality metric, which has high correlation with summary informativeness in the news domain (Kryściński et al., 2019).

Since ROUGE has been criticized for its poor correlation with factual consistency (Kryściński et al., 2019; Wang et al., 2020), we use two additional automatic metrics that specifically focus on factual consistency: FactCC (Kryściński et al., 2019) and QAGS (Wang et al., 2020). FactCC is a pre-trained binary classifier that evaluates the factuality of a system-generated summary by predicting whether it is consistent or inconsistent w.r.t. the source. This classifier was trained on adversarial examples obtained by heuristically injecting noise into reference summaries.

In addition, very recent work proposed QA-based models for factuality evaluation (Wang et al., 2020; Durmus et al., 2020; Maynez et al., 2020), and Wang et al. (2020) showed that their evaluation models have higher correlation with human judgements on factuality when compared with FactCC (Kryściński et al., 2019). We thus include our re-implementation of a question generation and question answering model (QGQA) following Wang et al. (2020) as an evaluation metric for factuality. (We were not able to obtain any of the QA evaluation models or code from Wang et al. (2020), Durmus et al. (2020) or Maynez et al. (2020), as the authors are still in the process of making the code public.) We used a pre-trained UniLM model for question generation (QG) and a BertForQuestionAnswering model for question answering (QA). The QG model is fine-tuned on NewsQA (Trischler et al., 2017) with the entity-answer conditional task (Wang et al., 2020), and the QA model is pre-trained on SQuAD 2.0 (Rajpurkar et al., 2018). This model generates a set of questions based on the system-generated summary, and then answers these questions using either the source or the summary to obtain two sets of answers. The answers are compared against each other using an answer-similarity metric (token-level F1), and the averaged similarity metric over all questions is used as the QGQA score. Answers generated from a highly faithful system summary should be similar to those generated from the source.
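The final scoring step of QGQA can be sketched as below, assuming the questions have already been generated and answered against both the source and the summary. Whitespace tokenization with lowercasing is a simplifying assumption, and the function names `token_f1` and `qgqa_score` are hypothetical.

```python
from collections import Counter


def token_f1(answer_a, answer_b):
    """Token-level F1 between two answer strings."""
    a, b = answer_a.lower().split(), answer_b.lower().split()
    if not a or not b:
        # Two empty answers agree; an empty and a non-empty answer do not.
        return float(a == b)
    overlap = sum((Counter(a) & Counter(b)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def qgqa_score(answers_from_source, answers_from_summary):
    """Average answer similarity over all generated questions."""
    pairs = list(zip(answers_from_source, answers_from_summary))
    return sum(token_f1(s, y) for s, y in pairs) / len(pairs)
```

A summary whose entities all agree with the source yields identical answer pairs and a score of 1, while a hallucinated entity lowers the F1 of every question whose answer depends on it.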

4.3 Baselines

We compare against the following abstractive summarization baselines. On CNNDM and XSum, we use BertSumAbs, BertSumExtAbs and TransformerAbs (Liu and Lapata, 2019). In addition, we also compare with Bottom-up (Gehrmann et al., 2018). On Gigaword, we use the pointer-generator (See et al., 2017) and the base and full GenParse models (Song et al., 2020) for comparison. For the factual correction baseline, we compare with the Two-encoder Pointer Generator (Split Encoder; https://github.com/darsh10/split_encoder_pointer_summarizer) (Shah et al., 2020), which employs a similar setting to ours for masking entities w.r.t. the source, and uses dual encoders to copy and generate from both the source and the masked query for fact update. Compared to our span selection models that can fill in the mask with any number of tokens, their models aim to regenerate the masked query based on the source. In other words, their decoder regenerates the whole sequence token by token with a pointer-generator, which inherits the backbone of generative models that suffer from hallucination.

4.4 Experimental Results

Tables 3, 4, and 5 summarize the results on the CNN/DailyMail, XSum and Gigaword datasets, respectively. Each block in the tables compares the original summarization model’s output with the corrected outputs obtained by our baseline and proposed models.

On CNN/DailyMail (Table 3), our correction models boost factual consistency measures (QGQA and FactCC) by large margins, with only small drops in ROUGE. This shows that our models can improve the correctness of system-generated summaries without sacrificing informativeness. When comparing our two proposed models, we observe that the QA-span model performs better than the auto-regressive model. This is expected, as CNN/DailyMail’s reference summaries tend to be more extractive (See et al., 2017), and summarization models tend to make few errors per summary (Narayan et al., 2018). Thus, the iterative procedure of the QA-span model is more robust, with high precision, as it has more correct context from the query and only minimal negative influence from other concurrent errors. This is also reflected in the high QGQA and FactCC scores across all the models we tested. Since QGQA and FactCC compare the system-generated summary against the source text, a high score means high semantic similarity between the system summary and the source.

On XSum (Table 4) and Gigaword (Table 5), both of our correction models boost factual consistency measures by large margins with a slight drop in ROUGE (-0.5 to -1.5) on average. This is still encouraging, as abstractive summarization models that use complex factual controlling components for generation often have drops of 5-10 ROUGE points (Zhu et al., 2020).

We also notice that the QGQA and FactCC scores of all summarization models are lower than those on CNN/DailyMail. The scores are especially low on XSum. This is likely due to the data construction protocol of XSum, where the first sentence of a source document is used as the summary and the remainder of the article is used as the source. As a result, many entities that appear in the reference summary never appear in the source, which may cause abstractive summarization models to hallucinate severely with many factual errors (Maynez et al., 2020). As the system summaries often contain many errors, our QA-span model, which relies on answering a single-mask query, often has the wrong context to condition on at each step, which negatively affects its performance. In contrast, the strategy of masking all the entities provides the auto-regressive model with a better query for entity replacement. We can observe in Table 4 that the auto-regressive model performs better than the QA-span model on XSum.

4.5 Human Evaluation

To provide qualitative analysis of the proposed models, we conduct a human evaluation with pairwise comparisons of CNN/DailyMail summaries enhanced by different correction strategies. We select three state-of-the-art abstractive summarization models as the backbones, and collect three sets of pairwise summaries for each setting: $(i)$ Original vs. QA-Span corrected; $(ii)$ Original vs. Auto-regressive corrected; $(iii)$ QA-Span corrected vs. Auto-regressive corrected. Nine sets of 50 randomly selected samples (450 samples in total) are labeled by Amazon Mechanical Turk (AMT) workers. For each pair (in anonymized order), three AMT annotators are asked to judge which summary is more factually correct based on the source document. As shown in Table 6, summaries from our two models are chosen more frequently as the factually correct one compared to the original. Between the two correction models, the preferences are comparable.

In addition, we also tested our fact correction models on the FactCC test set provided by Kryściński et al. (2019) and manually checked the outputs. Table 7 shows the results of the original summaries and the summaries corrected by our models in terms of automatic fact evaluation and our manual evaluation. Among 508 system-generated summary sentences, 62 were incorrect. The QA-span model correctly fixed 18 of these 62 sentences, and the auto-regressive model correctly fixed 16. Among the 446 sentences that are labeled as correct by the annotators in Kryściński et al. (2019), our two models made 3 and 4 wrong changes in the entities, respectively, while keeping most of the entities unchanged or replaced with equivalent entities. These counts exclude the cases where the model replaced a person’s full name with the last name or broke the fluency due to SpaCy NER errors.

Conclusion

We present SpanFact, a suite of two factual correction models that use span selection mechanisms to replace one or multiple entity masks at a time. SpanFact can be used for fact correction on any abstractive summaries. Empirical results show that our models improve the factuality of summaries generated by state-of-the-art abstractive summarization systems without a huge drop in ROUGE scores. For future work, we plan to apply our method to other types of spans, such as noun phrases, verbs, and clauses.

Acknowledgments

This research was supported in part by Microsoft Dynamics 365 AI Research and the Canada CIFAR AI Chair program. We would like to thank the reviewers for their valuable comments and special thanks to Yuwei Fang and other members of the Microsoft Dynamics 365 AI Research team for the feedback and suggestions.