FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

Esin Durmus, He He, Mona Diab

Introduction

Abstractive summarization models must aggregate salient content from the source document(s) and remain faithful, i.e. being factually consistent with information in the source documents. Neural abstractive models are effective at identifying salient content and producing fluent summaries See et al. (2017); Chen and Bansal (2018); Gehrmann et al. (2018). However, the generated summary may not always contain faithful information, which is vital for real-world applications.

Table 1 shows an example of unfaithful generation. Recent studies have shown that around 30% of generated summaries contain unfaithful information Cao et al. (2018); Falke et al. (2019a); Kryściński et al. (2019), especially when the sentence combines content from multiple source sentences Lebanoff et al. (2019).

In this paper, we address the problem of evaluating faithfulness of generated summaries given their source documents. Our key insight is that current models are limited by a trade-off between abstractiveness and faithfulness (Section 2). On a wide range of systems and two datasets with varying levels of abstractiveness (CNN/DM and XSum), we show that the number of unfaithful sentences (annotated by humans) increases as the summary becomes more abstractive (i.e. less overlap with the source document). Next, we investigate a diverse set of existing automatic evaluation metrics such as ROUGE, BERTScore Zhang et al. (2019a), and learned entailment models. We find that their correlations with human scores of faithfulness drop significantly on highly abstractive summaries, where deeper text understanding beyond surface similarity is needed.

Recently, question answering (QA) based automatic metrics have been proposed for evaluating content selection in summarization Eyal et al. (2019); Scialom et al. (2019); Chen et al. (2018). Specifically, cloze-style QA is used to evaluate whether important information in the source is recovered from the summary. Inspired by prior work, we use automatically generated QA pairs to represent information in the summary and validate it against the source. Concretely, we generate a set of “groundtruth” QA pairs from the summary, using a learned model that converts a declarative sentence and an answer span to a question (Section 3). Then, off-the-shelf reading comprehension models are evaluated on this set by extracting answer spans from the source documents. High accuracy means that the summary and the source document tend to produce the same answers, thus they are factually consistent with respect to the questions. Compared to prior approaches using cloze tests, our question generation approach enables evaluation with a broader range of QA models and answer types (e.g. extractive and generative), thus maximally taking advantage of progress in QA.

Among automatic metrics based on n-gram overlap, word embeddings, and language understanding models (relation extraction and entailment), FEQA has significantly higher correlation with human scores of faithfulness and is the only metric that correlates with human scores on highly abstractive summaries from XSum.

The Abstractiveness-Faithfulness Tradeoff

While extractive summarizers are largely faithful (since they copy sentences from the source document), current abstractive models struggle to produce faithful summaries without copying. Similar to Lebanoff et al. (2019), we observe that factual errors occur more frequently as models generate more abstractive summary sentences, i.e. less overlap with the source document. In this section, we analyze generated summaries along two dimensions: abstractiveness and faithfulness. Specifically, we aim to answer the following questions: (1) How to quantify abstractiveness of a summary? (2) Is abstractiveness encouraged more by the data or the model? (3) How does being abstractive affect faithfulness?

Abstractive summarization involves rephrasing important content into brief statements, ranging from minor editing of a source sentence to condensing multiple sentences in new words. Given a source document and a summary, we want to measure the level of abstractiveness of the summary.

Prior work measures abstractiveness by overlapped text spans between the summary and the document Grusky et al. (2018); Zhang et al. (2018), or indirectly by the effectiveness of extractive baselines such as lead-3 Nallapati et al. (2016a). While metrics such as extractive fragment coverage and density Grusky et al. (2018) provide a continuous measure of the level of abstractiveness, we define a more fine-grained categorization of abstractiveness by analyzing how each sentence in the summary is formed.

A more abstractive summary sentence aggregates content over a larger chunk of source text; consequently it must copy fewer words to maintain brevity. Therefore, we define the following abstractiveness types based on the amount of copying, e.g. copying a source sentence, one or more partial fragments from the source sentence, and individual words.

Sentence extraction: the summary sentence is exactly the same as one of the source sentences.

Span extraction: the summary sentence is a substring of one of the source sentences, e.g. “the plane was coming back from the NCAA final” is a span extracted from “the plane was coming back from the NCAA final, according to spokesman John Twork”.

Word extraction: the summary sentence is formed by a subset of the tokens in a source sentence, e.g. “Capybara Joejoe has almost 60,000 followers” is a result of deleting words in “Capybara Joejoe who lives in Las Vegas has almost 60,000 followers on Instagram”.

Perfect fusion_k: the summary sentence is constructed by piecing together the substrings from k (k > 1) source sentences in their original order, e.g. “Capybara Joejoe has almost 60,000 followers” is a perfect fusion of the sentences “Capybara Joejoe lives in Las vegas.” and “He has almost 60,000 followers on Instagram.”

To quantify the amount of abstractiveness of a set of summaries, we label each sentence with the first qualified type in the order above, if it fits one of these categories.

We then define the score of each type as the percentage of sentences labeled by that category. The types are ordered by increasing levels of abstractiveness. For example, a summary with higher fusion scores and lower extraction scores is considered more abstractive. In addition, we compute the percentage of novel n-grams that do not appear in the source document as another metric for abstractiveness.
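The labeling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the regex tokenizer, the function names, and the reading of "original order" as increasing source-sentence index are all assumptions made for the example.

```python
import re
from functools import lru_cache
from typing import List, Optional


def tokenize(text: str) -> List[str]:
    # Assumed tokenizer: lowercase alphanumeric runs, punctuation dropped.
    return re.findall(r"\w+", text.lower())


def is_contiguous(sub: List[str], seq: List[str]) -> bool:
    # True if `sub` occurs as a contiguous span of `seq`.
    n = len(sub)
    return n > 0 and any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))


def is_subsequence(sub: List[str], seq: List[str]) -> bool:
    # True if `sub` can be obtained from `seq` by deleting tokens.
    it = iter(seq)
    return len(sub) > 0 and all(tok in it for tok in sub)


def is_perfect_fusion(summ: List[str], sources: List[List[str]]) -> bool:
    # True if `summ` splits into >= 2 pieces, each a contiguous span of a
    # different source sentence, taken in increasing sentence order (assumed).
    n = len(summ)

    @lru_cache(maxsize=None)
    def cover(start: int, min_src: int) -> bool:
        if start == n:
            return True
        for j in range(min_src, len(sources)):
            for end in range(start + 1, n + 1):
                if not is_contiguous(summ[start:end], sources[j]):
                    break  # a longer piece cannot match either
                if cover(end, j + 1):
                    return True
        return False

    return any(
        is_contiguous(summ[:end], sources[j]) and cover(end, j + 1)
        for j in range(len(sources))
        for end in range(1, n)
    )


def abstractiveness_type(summary_sentence: str,
                         source_sentences: List[str]) -> Optional[str]:
    # Return the first qualifying type in the order defined in the text.
    summ = tokenize(summary_sentence)
    sources = [tokenize(s) for s in source_sentences]
    if any(summ == src for src in sources):
        return "sentence extraction"
    if any(is_contiguous(summ, src) for src in sources):
        return "span extraction"
    if any(is_subsequence(summ, src) for src in sources):
        return "word extraction"
    if is_perfect_fusion(summ, sources):
        return "perfect fusion"
    return None  # the sentence fits none of the categories


def novel_ngram_fraction(summary: str, document: str, n: int) -> float:
    # Fraction of summary n-grams that never appear in the document.
    summ, doc = tokenize(summary), tokenize(document)
    summ_ngrams = [tuple(summ[i:i + n]) for i in range(len(summ) - n + 1)]
    if not summ_ngrams:
        return 0.0
    doc_ngrams = {tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)}
    return sum(g not in doc_ngrams for g in summ_ngrams) / len(summ_ngrams)
```

Applied to the examples in the text, the Capybara sentence is labeled word extraction against the single long source sentence and perfect fusion against the two shorter ones. The score of a type is then the share of summary sentences that receive that label.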

2.2 Is abstractiveness from the model or the data?

Equipped with the metrics for abstractiveness above, we want to further understand how abstractive the generated summaries are, and whether the amount of abstractiveness is a result of the training data or the model. Therefore, we compute abstractiveness scores for both the reference summaries and summaries generated from a diverse set of models on two datasets.

We use the CNN/DailyMail Hermann et al. (2015); Nallapati et al. (2016b) (CNN/DM) and the XSum Narayan et al. (2018) datasets, which are both used for single-document news summarization tasks. CNN/DM consists of articles from the CNN and Daily Mail websites, where the summaries comprise highlights in bullet points. XSum consists of BBC articles, where the summaries comprise a single-sentence summary that is written as the opening introductory sentence for the article. XSum was released in particular to promote research on highly abstractive summarization systems. Appendix A provides statistics on CNN/DM and XSum datasets: they contain around 288k and 204k training examples, respectively; CNN/DM includes longer documents and summaries on average.

Most neural abstractive summarization models are based on sequence-to-sequence models. They differ in how summarization-specific operations such as copying/extraction are instantiated. We consider 5 prominent models and summarize their characteristics in Table 2. (We use state-of-the-art models proposed for each dataset at the time of writing.) Details of each model can be found in Appendix B. pgc See et al. (2017) uses the copy mechanism during decoding to allow extraction. FastRL Chen and Bansal (2018) and BottomUp Gehrmann et al. (2018) decouple extraction and abstractive generation by learning to select sentences and words respectively in the first step; this model has been shown to generate more abstractive summaries compared to pgc. Tconv Narayan et al. (2018) is initially designed for XSum, thus it does not include any explicit copying/extraction components and focuses on long text representation using convolutional neural networks. BertSum Liu and Lapata (2019) consists of a BERT-based encoder and a 6-layer Transformer decoder. It incorporates extraction implicitly by first fine-tuning the encoder on the extractive summarization task. (We use the BertSumExtAbs variation.)

Results. Our goal is to understand the level of abstractiveness of summaries generated by different models, and the influence on abstractiveness from the training data. Therefore, we analyzed summaries generated by the above models on CNN/DM and XSum. We computed the metrics described in Section 2.1 for both the generated summaries and the reference summaries on the test sets. The results are shown in Table 3.

First, CNN/DM is more extractive than XSum. Extraction scores of the reference summaries in CNN/DM show that almost half of the sentences are formed by deleting words in one of the source sentences. This shows that sentence compression Knight and Marcu (2002) is the main technique used for this dataset. In contrast, none of the summary sentences in XSum are formed by copying from a single source sentence. They are generated mostly by paraphrasing the input content, indicated by the large fraction of novel n-grams.

Second, training data has a larger influence on the abstractiveness of model outputs. Similar to Zhang et al. (2018), we find that models trained on CNN/DM are near-extractive. However, the same models trained on XSum are significantly more abstractive. In fact, none of the models produced any sentence that copies words/phrases from a single source sentence, which is consistent with characteristics of the reference summaries in XSum. The content is more often rephrased in novel words/phrases. However, on both datasets, current models struggle to achieve the same level of abstractiveness as the reference summaries, indicating that additional inductive bias is needed to condense multiple sentences by rephrasing.

Third, different models have different ways of doing extraction. When trained on CNN/DM, pgc generates the majority of sentences by copying complete source sentences, whereas FastRL, BottomUp and BertSum do simple compression by deletion more often. In addition, BottomUp does more fusion compared to pgc, FastRL and BertSum.

To understand faithfulness of current systems and its relation to abstractiveness, we crowd-sourced human annotations on the output of each model-dataset pair described in Section 2.2. Since a near-extractive sentence is very likely to be grammatical and faithful, we focus on more abstractive cases by excluding output sentences that are either an exact copy or a substring of one of the source sentences.

A key challenge to reliable human annotation is that the inter-annotator agreement on faithfulness is relatively low Lebanoff et al. (2019). Our pilot study shows that workers often do not agree on incoherent sentences, e.g. whether “Chelsea beat Chelsea 535-3 in the Premier League on Saturday.” is faithful or not. To standardize the annotation process, we design hierarchical questions to distinguish among failed generations that render a sentence meaningless, low-level grammatical errors that hardly affect semantic understanding, and faithfulness errors that convey incorrect (yet meaningful) information.

Figure 1 shows the decision tree of our human annotation steps. We first evaluate the grammaticality of generated sentences (independent from the source document). We show annotators a summary sentence and ask them to choose whether the given sentence is meaningful or nonsensical to determine if the given sentence is structurally and semantically sound. If the annotator can make sense of the sentence, we then ask whether it is grammatical or has minor grammaticality problems which a person can easily correct.

Next, for sentences labeled as meaningful in the first step, we ask workers whether they are faithful to the provided source document. In case the worker labels a sentence as unfaithful, we conduct a simple error analysis by asking them to indicate if the sentence contains information that is absent from or conflicting with the source document, which corresponds to hallucination and contradiction errors, respectively. More details about the annotation schema and guidelines are included in Appendix C. Next, we describe our human evaluation results.

For each dataset-model pair described in Section 2.2, we randomly sampled 1000 sentence-source pairs, eliminating output sentences that are either an exact copy or substring of a source sentence. We collected grammaticality annotations for these sentences from 5 annotators. We consider a sentence meaningful if at least 4 out of 5 annotators label it as meaningful in the first stage. We sampled 200 meaningful sentences randomly to collect annotations for faithfulness. Table 4 shows the results of the grammaticality and faithfulness human evaluations.
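The filtering step above (at least 4 of 5 "meaningful" votes, then a random sample of 200) could be sketched as follows. The label strings, the dictionary layout, and the fixed seed are assumptions for illustration only; the source does not describe how the annotations were stored or sampled.

```python
import random
from typing import Dict, List


def is_meaningful(labels: List[str], min_votes: int = 4) -> bool:
    # A sentence counts as meaningful if enough annotators said so.
    # The label string "meaningful" is a hypothetical encoding.
    return sum(label == "meaningful" for label in labels) >= min_votes


def sample_for_faithfulness(annotations: Dict[str, List[str]],
                            k: int = 200,
                            seed: int = 0) -> List[str]:
    # Keep meaningful sentences, then draw up to k of them at random.
    meaningful = sorted(
        sentence for sentence, labels in annotations.items()
        if is_meaningful(labels)
    )
    rng = random.Random(seed)  # fixed seed only for reproducibility
    return rng.sample(meaningful, min(k, len(meaningful)))
```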

Overall, outputs from all models are scored high on grammaticality with high inter-annotator agreement. However, on more abstractive summaries (i.e. when trained on XSum), the grammaticality scores drop significantly. One exception is BertSum, which maintains good performance on XSum and achieves the highest grammaticality score on both datasets. The majority of the sentences (>70%) identified as “meaningful” are annotated as “perfectly grammatical” for each model-dataset pair.

Near-extractive summaries generated from models trained on CNN/DM have significantly higher faithfulness scores than highly abstractive summaries from models trained on XSum. We find that pgc and Tconv have faithfulness errors in more than half of the sentences they generate when trained on XSum. Although BertSum generates fewer unfaithful sentences, it still suffers from a performance drop on XSum. Interestingly, human agreement on faithfulness is also lower for abstractive summaries from XSum. This suggests that faithfulness errors are harder to catch for humans as well in more abstractive settings. We further observe that conflicting information is more common among models trained on CNN/DM while hallucination is more common among models trained on XSum. Table 5 shows examples of meaningful but unfaithful sentences.

FEQA: Faithfulness Evaluation with Question Answering

Our analysis above shows that the number of unfaithful sentences increases significantly as more abstractive summaries are generated. Thus the key challenge to faithfulness evaluation is to verify highly abstractive sentences against the source document, where surface similarity matching would fail. If we have a good semantic representation of the sentence abstracting away its surface form (e.g. a list of facts about who did what to whom), we can simply compare the sentence representation to the document representation (e.g. check whether the fact list from the summary is a subset of the list from the document). Ideally, the representation should be domain-general and interpretable for easy error analysis.

Motivated by the fast progress in reading comprehension Chen (2018); Gao et al. (2018), we propose to use QA pairs as a generic meaning representation of sentences for faithfulness evaluation. Given a summary sentence, we produce a list of questions asking about key information in the sentence and their corresponding answers. To verify this information against the source, we use a QA model to predict answers from the document. The questions and the QA model thus extract comparable information from two pieces of text. More matched answers from the document imply a more faithful summary, since the information addressing these questions is consistent between the summary and the source document. Figure 2 shows the workflow of FEQA.

Prior work Eyal et al. (2019); Scialom et al. (2019) uses cloze tests as questions by masking entities. To go beyond cloze-style QA and leverage more recent extractive Rajpurkar et al. (2016) or even generative Alec et al. (2019) QA models, we generate natural language questions from the summary sentence automatically. Specifically, we mask important text spans in a sentence, including noun phrases extracted by a constituency parser Kitaev and Klein (2018) and named entities extracted by the Stanford CoreNLP NER model Finkel et al. (2005); Manning et al. (2014). We consider each span as the gold answer and generate its corresponding question by fine-tuning a pretrained BART language model Lewis et al. (2019). To train the question generator, we adapt the QA2D dataset Demszky et al. (2018). The input is a declarative sentence with masked answers and the output is a question. A training example might look like:

Since the transformation from declarative sentences to questions is almost rule-based without much paraphrasing, we expect the model to generalize to various domains.

Given the QA pairs generated from a summary sentence, we run off-the-shelf QA models to get answers to these questions from the source document. We then measure the average F1 score against the “gold” answers from the summary, which is our faithfulness score for the given sentence. This step does not have any constraint on the QA model. We experiment with the pretrained BERT-base model Devlin et al. (2019) fine-tuned on SQuAD-1.1 Rajpurkar et al. (2016) and SQuAD-2.0 Rajpurkar et al. (2018). Note that in the case of SQuAD-2.0, the model may be able to hypothesize that a question is unanswerable. This case is equivalent to getting an answer incorrect (i.e. unfaithful).
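The scoring step can be illustrated with a short sketch: a token-level F1 between the QA model's answer from the document and the gold answer from the summary, averaged over all QA pairs of a sentence. The names `token_f1`, `feqa_score`, and the `answer_fn` callable (standing in for any QA model, returning `None` for "unanswerable") are assumptions for this example, and the normalization is simpler than the one in the official SQuAD evaluation script.

```python
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def normalize(text: str) -> List[str]:
    # Simplified normalization: lowercase alphanumeric tokens only.
    return re.findall(r"\w+", text.lower())


def token_f1(prediction: str, gold: str) -> float:
    # Token-level F1 between a predicted and a gold answer span.
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def feqa_score(qa_pairs: List[Tuple[str, str]],
               document: str,
               answer_fn: Callable[[str, str], Optional[str]]) -> float:
    # Average F1 over (question, gold answer) pairs from one summary sentence.
    # `answer_fn(question, document)` is a placeholder for a QA model;
    # a None result ("unanswerable") is scored as an incorrect answer.
    if not qa_pairs:
        raise ValueError("no QA pairs were generated for this sentence")
    scores = []
    for question, gold in qa_pairs:
        prediction = answer_fn(question, document)
        scores.append(0.0 if prediction is None else token_f1(prediction, gold))
    return sum(scores) / len(scores)
```

Under this simplified normalization, an answer "Donald Trump" scored against the gold span "the President of the United States Donald Trump" gets an F1 of 0.4, which shows how partial span matches are penalized even when the answer is correct.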

Experiments

We aim to understand to what extent the proposed QA-based metric and existing metrics capture faithfulness of a summary. Given pairs of documents and summary sentences without reference summaries, we measure correlations between human-annotated faithfulness scores (Section 2.3) and scores computed using each metric described below.

A straightforward metric for faithfulness is the word overlap between the summary sentence and the document. We compute ROUGE (R) and BLEU (B) between the output sentence and each of the source sentences (i.e. taking the source sentence as the reference). (We report only BLEU-4 since it performed the best for CNN/DM and no variation of BLEU has significant correlation with faithfulness for XSum.) We then take the average score and the maximum score across all the source sentences. Since, according to our analysis, taking the average score consistently has higher correlation, we report only the correlation for the average.
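The aggregation over source sentences can be sketched as below. A unigram F-measure is used here as a simplified stand-in for ROUGE-1; it is not the official ROUGE or BLEU implementation, and the function names are assumptions.

```python
import re
from collections import Counter
from typing import List, Tuple


def unigram_f1(candidate: str, reference: str) -> float:
    # Simplified ROUGE-1-style F-measure on lowercase word tokens.
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def source_overlap(summary_sentence: str,
                   source_sentences: List[str]) -> Tuple[float, float]:
    # Score the summary sentence against every source sentence, treating
    # each source sentence as the reference; return (average, maximum).
    scores = [unigram_f1(summary_sentence, s) for s in source_sentences]
    return sum(scores) / len(scores), max(scores)
```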

Word embeddings extend word overlap-based metrics beyond exact match. Recently, BERTScore Zhang et al. (2019b) was proposed to compute the similarity between two sentences using contextual word embeddings from BERT. It has higher correlation with human judgements on image captioning and machine translation than word overlap based metrics. We compute BERTScore (BERTSc) between each source sentence and the summary sentence (https://github.com/Tiiiger/bert_score). To get the final score, we experiment with both the average and the maximum scores computed from each source sentence and the summary sentence. We report results using the maximum score since it has better performance.

In addition to QA, recent work has used relation extraction and textual entailment models for faithfulness evaluation Falke et al. (2019a); Goodrich et al. (2019). For the relation extraction metric (RE), we compute the precision for the relation triplets extracted from the summary sentence and the source document using an off-the-shelf model Angeli et al. (2015) from Stanford Open IE. For the textual entailment metric (ENT), we measure whether the summary sentence is entailed by the source using the pretrained ESIM model Chen et al. (2017) from AllenNLP Gardner et al. (2018).
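As a rough illustration of the RE metric, the precision of summary triplets against source triplets might be computed as follows. The triplets are assumed to be already extracted by some open IE system, and exact matching after lowercasing is an assumption of this sketch; the source does not specify how triplets are compared.

```python
from typing import Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def normalize_triples(triples: Iterable[Triple]) -> Set[Triple]:
    # Lowercase and strip each element so trivial differences are ignored.
    return {tuple(part.strip().lower() for part in t) for t in triples}


def triple_precision(summary_triples: Iterable[Triple],
                     source_triples: Iterable[Triple]) -> Optional[float]:
    # Fraction of summary triples that also appear among the source triples.
    # Returns None when the summary yields no triples (precision undefined).
    summ = normalize_triples(summary_triples)
    if not summ:
        return None
    src = normalize_triples(source_triples)
    return len(summ & src) / len(summ)
```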

4.2 Results

We first compute scores for each metric on document and output sentence pairs on both CNN/DM and XSum datasets (748 and 286 pairs respectively). We then compute Pearson and Spearman correlation coefficients between scores given by each metric and human-annotated scores. Table 7 includes correlation coefficients for the examples from CNN/DM and XSum, respectively. We observe that for both CNN/DM and XSum, the score of QA-based evaluation has a higher correlation with faithfulness than other metrics. Although word-overlap based metrics are correlated with faithfulness in more extractive settings (i.e. for CNN/DM), these metrics have no correlation with faithfulness in more abstractive settings (i.e. for XSum). We further notice that all the metrics have significantly lower correlation with human scores for XSum, suggesting that evaluating faithfulness is more difficult in highly abstractive settings; deeper understanding of the source and the summary sentence is necessary here.
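For reference, the two correlation coefficients can be computed as in the sketch below, which uses only the standard library. It is a plain textbook implementation written for illustration (Spearman is computed as Pearson on average ranks, with ties sharing their mean rank), not the code used to produce the reported numbers.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    # Pearson correlation; undefined (division by zero) if either input
    # is constant.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    # 1-based ranks; tied values receive the mean of their rank positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    # Spearman correlation as Pearson correlation of the rank vectors.
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold one metric's scores and `y` the human faithfulness scores for the same document-sentence pairs.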

Consistent with the findings of Falke et al. (2019b), the entailment metric does not have a significant correlation with faithfulness in most cases. These models fail to distinguish entailed (faithful) and non-entailed (unfaithful) summary sentences when both overlap largely with the source document, because models trained on current entailment datasets may rely on simple heuristics such as lexical overlap McCoy et al. (2019). Similarly, BERTScore tends to give higher scores when there are overlapping concepts between the sentences even though the content is not the same. See Table 6 for examples.

Current evaluation metrics for summarization produce a single measure of the overall quality of the summary. Typically, the output summary is compared against the reference summary in terms of n-gram overlap. These metrics mainly evaluate content selection, i.e. whether the content of the output is similar to the content of the reference. In contrast, to evaluate faithfulness, we compare the output summary against the source document. One natural question that follows is whether high content matching is sufficient for faithfulness. We compute the correlation coefficients between human-annotated faithfulness scores and ROUGE scores computed from the reference and the output sentence. As shown in Table 8, while there is a weak correlation between ROUGE scores of content selection and faithfulness on CNN/DM, the correlation is significantly lower than that of ROUGE scores of faithfulness (i.e. computed between the source and the output sentence). For XSum, there is no significant correlation between the content selection metrics and faithfulness. We provide unfaithful examples with high content selection scores in Appendix D.3. This suggests that content selection and faithfulness should be measured separately as opposed to using a unified score.

Table 9 shows examples for a faithful and an unfaithful output sentence and the corresponding QA pairs. Note that the QA system is able to capture common errors such as conflicting information in the output sentence. To measure the reliability of FEQA, we further perform a manual error analysis using 100 randomly sampled QA pairs. We observe that around 94% of generated questions are mostly grammatical and correct given the mask. For 78% of the questions, the QA system has the correct behaviour: it answers the question correctly if the sentence is faithful to the article, otherwise it produces “unanswerable” or an incorrect answer. The majority of the errors of the QA system occur because it either fails to detect unanswerable questions or produces “unanswerable” when an answer exists (14%). Moreover, when the article is long, the QA system tends to make more mistakes. Especially in more abstractive settings, the F1 score penalizes correct answers when the answer from the article does not exactly match the gold answer (e.g. “Donald Trump” vs. “the President of the United States Donald Trump”) (16%).

Related Work

Since the beginning of neural text generation, problems with repetition and generic responses have received lots of attention Sordoni et al. (2015); Li et al. (2016); Holtzman et al. (2019). Recently, more work has focused on semantic errors in model outputs, such as adequacy in machine translation Tu et al. (2017), faithfulness in summarization Cao et al. (2018), and consistency in dialogue Li et al. (2019). Our analysis of the abstractiveness-faithfulness tradeoff reveals an additional limitation of current models, and suggests that we need new inductive bias on how to summarize beyond copying.

Question answering is a broad format that subsumes many tasks Gardner et al. (2019). To the best of our knowledge, Mani et al. (1999) first use QA as an extrinsic evaluation for summarization: A good summary should answer key questions a reader might have about an article. Later, QA is incorporated in human evaluation where one person writes questions and another person answers them based on the summary Clarke and Lapata (2010); Liu and Lapata (2019). The closest to our work are recent efforts in automating this protocol, including rule-based approaches Chen et al. (2018) and cloze-test QA Eyal et al. (2019); Scialom et al. (2019). Our work is the first to apply automated question generation. While we focus on faithfulness, our QA-based metric is applicable to semantic comparison between any two pieces of text.

Automated NLG evaluation is challenging as it often requires deep understanding of the text. Although metrics based on word overlap with the reference text are commonly used, it is widely known that they do not correlate well with human judgments Novikova et al. (2017); Liu et al. (2016). Recently, more work has focused on model-based evaluation using discriminators Lowe et al. (2017); Hashimoto et al. (2019), entailment models Falke et al. (2019a), information extraction Wiseman et al. (2017); Goodrich et al. (2019), and question answering Chen et al. (2018); Eyal et al. (2019).

Conclusion

We investigate the faithfulness problem in neural abstractive summarization and propose a QA-based metric for evaluating summary faithfulness. We show that current models suffer from an inherent trade-off between abstractiveness and faithfulness. They are good at copying important source content, but tend to concatenate unrelated spans and hallucinate details when generating more abstractive sentences. A new inductive bias or additional supervision is needed for learning reliable models. While our QA-based metric correlates better with human judgment and is useful for model development, it is limited by the quality of the QA model. The final evaluation should still rely on human annotation or human-in-the-loop methods Chaganty et al. (2018).

References

Acknowledgement

We would like to thank Faisal Ladhak, the Lex and Comprehend groups at Amazon Web Services AI, and the anonymous reviewers for their feedback on this work.

Appendix A Summarization Datasets

All of our experiments are run on the CNN/DM and XSum datasets. We show basic statistics of the two datasets in Table 10.

Appendix B Summarization Models

The characteristics of each model used in our experiments are detailed below.

pgc uses the copy mechanism Vinyals et al. (2015) to allow copying words from the source. The adapted coverage mechanism Tu et al. (2016) is incorporated to alleviate repetition by keeping track of source words that have been summarized. This copy mechanism is widely adopted by subsequent models.

FastRL first uses an extractor agent to select salient sentences from the document, then condenses the extracted sentences using the Pointer-Generator summarizer.

BottomUp first selects words from the source document that are likely to appear in the summary, then generates using the Pointer-Generator model, where the copying mechanism is constrained to the previously selected words. It improves upon pgc by explicitly learning the selector to avoid copying long text spans.

Tconv is a convolutional neural network-based model conditioned on the topics of the article. It is shown to be effective in capturing long-range dependencies in the documents.

BertSum is a two-stage fine-tuning approach where the BERT-based encoder is first fine-tuned on the extractive summarization task and then on the abstractive summarization task with the decoder (denoted as BertSumExtAbs in the original paper).

Appendix C Details of Human Annotations

For grammaticality annotation, we present only the output sentence to the workers. We collect annotations from 5 workers for both of the tasks. For this task, given the output sentence, we provide workers the following guidelines:

First select whether the given sentence is “Nonsensical” or “Makes sense”.

If the given text is not a complete sentence, mark it as “Nonsensical”.

If you can understand the meaning of the sentence, despite grammaticality errors, and you are able to make sense of it, select “Makes sense”.

If you did not select “Nonsensical”, evaluate whether the sentence is “Grammatical” or “Has Minor Grammaticality Issues”.

C.2 Faithfulness Annotation Guidelines

We present workers both the source and the output sentence and provide the following guidelines:

If the information conveyed by the sentence is not expressed in the source, select “unfaithful”.

Avoid using general knowledge, and check if the sentence is consistent with the source.

If you select “unfaithful”, for the second part, select whether the information expressed by the sentence is not contained in the source or conflicting with the source.

Appendix D Additional Analysis

Sandals, £34, office.co.uk, luluguinness.com. (generated by pgc for CNN/DM)

He says easter triduum is a progression , although the word itself – triduum. (generated by FastRL for CNN/DM)

Chelsea beat Chelsea 535-3 in the Premier League on Saturday. (generated by FastRL for CNN/DM)

12 years a slave actress Lupita Woodley and oily vegetables. (generated by BottomUp for CNN/DM)

A judge in Japan has ordered a judge to order a woman who has absconded from Japan to Japan. (generated by pgc for XSum)

Stoke City moved up to third in the Premier League with victory over Stoke City at Stoke. (generated by Tconv for XSum)

Johnny Depp’s management group is suing his management group over his “lavish lifestyle”. (generated by BertSum for XSum)

D.2 Examples for meaningful but unfaithful sentences

Table 11 includes examples that are annotated as meaningful but unfaithful. The first three examples are picked from the models trained on CNN/DM, and the last three are from the models trained on XSum. We observe that the majority of sentences with faithfulness errors for the CNN/DM dataset are generated by incorrect concatenation (IC). The models fuse two sentences from the source and generate a new sentence that is not consistent with the context of the source. Within this category, however, the models make a wide range of mistakes such as copying the wrong entity, date, and quote.

For XSum, the faithfulness mistakes are mostly hallucinations. Models tend to hallucinate information (e.g. entities, events, date) that is not present in the source.

D.3 Examples for sentences with high content overlap with reference that are unfaithful

Although current summarization models are evaluated with respect to the content overlap between the reference and the output, these metrics do not necessarily provide any guarantees for the faithfulness of the output. Table 12 includes unfaithful examples whose content overlap scores are similar to those of faithful examples. We can see that although the output sentences include similar words and refer to similar topics, they include hallucinations and inaccurate information.

D.4 Limitations of the datasets

Since the CNN/DM and XSum datasets are automatically crawled, we find that there is noise in the data. For example, source documents can include phrases such as “click here for the latest news”. We further observe that the reference can carry information that is not in the source document, since some of these one-sentence highlights are written using additional world knowledge. Table 13 shows an example where the reference is unfaithful since it includes information that is not in the source (i.e. the fact that Ms. Wood’s first name is Leanne and that she is the Plaid Cymru leader).