How well do you know your summarization datasets?
Priyam Tejaswin, Dhruv Naik, Pengfei Liu
Introduction
The past few years have witnessed major breakthroughs and improvements in automatic summarization See et al. (2017); Celikyilmaz et al. (2018); Jadhav and Rajan (2018); Liu and Lapata (2019); Liu (2019); Dou et al. (2020); Yuan et al. (2021); Liu et al. (2021). Apart from the improvements in the summarization model architectures Zhang et al. (2019); Zhong et al. (2020), this growth has been aided by large-scale datasets Nallapati et al. (2016); Narayan et al. (2018a); Sharma et al. (2019) and automatic evaluation metrics Lin (2004); Zhao et al. (2019); Kryscinski et al. (2020) which are used for tuning hyperparameters and comparing models. While the reliability of these metrics has been explored extensively Peyrard (2019); Bhandari et al. (2020); Fabbri et al. (2020), few studies have focused on the underlying characteristics of different datasets, and how these impact model performance and metric reliability.
Datasets like CNN/DailyMail Nallapati et al. (2016), Gigaword Rush et al. (2015), XSum Narayan et al. (2018a), and many more Wang and Ling (2016); Koupaee and Wang (2018); Kim et al. (2019); Ganesan et al. (2010) were collected by scraping a large collection of web-pages. And for all the benefits this approach offers (seemingly infinite samples, diverse subjects, etc) there are some caveats:
We have no idea about the noise in the dataset. In the context of text summarization, noise could be an incomplete or irrelevant reference. At the moment, its quantity and its impact on performance are unknown.
What do we really know about the nature of samples in the dataset? Gigaword is a headline generation dataset with short sources and references. Does this imply a higher volume of simpler (i.e. more extractive) samples? The degree of summarization complexity, and its impact on model performance, are unknown.
Exploring these open questions is critical for two reasons: (1) Information about the noise could lead to more informed data collection and pre-processing methods: in a recent study, Kryscinski et al. (2019) quantified HTML artefacts in popular summarization datasets, and proposed ways to detect and remove them. (2) Awareness about the complexity could better explain model performance, metrics, and even lead to new model architectures. In the tasks of machine comprehension and question answering, Chen et al. (2016) and Yatskar (2019) manually inspected random samples and drew insights which led to new state-of-the-art models. Such analysis could also help researchers choose datasets and metrics more carefully.
In this study, we perform intrinsic and model-centric evaluation of three popular summarization datasets (Gigaword, CNN/DM and XSum). We are interested in answering the following questions:
For Q1 (the intrinsic properties of the datasets), we are interested in (1) identifying and quantifying the different types of “noise” that could occur and could penalize models, and (2) establishing whether samples have different levels of difficulty. Armed with this, we ask the following questions.
For Q2 a (performance on different classes), we would like to know (1) if, and how, the performance varies across the different types of samples discovered from Q1, and (2) if the performance is consistent across metrics.
Q2 b (metric reliability) is motivated in part by prior metric-analysis studies, where researchers have explored inter-metric agreement and alignment with human judgement under different conditions Peyrard (2019); Bhandari et al. (2020). Here we are more interested in knowing whether the metrics are more correlated with human judgement for simpler samples than for complex ones.
Large-scale automatic intrinsic dataset evaluation has been explored with some promising results Bommasani and Cardie (2020). However, these methods rely on heuristics like content-value, density and compression Grusky et al. (2018). We are interested in a more fine-grained, interpretable analysis that can only come from manual inspection, much like the analysis by Chen et al. (2016) and by Yatskar (2019). To that end, we first define a six-class typology: the first three classes cover types of data-noise and the last three cover varying degrees of summarization difficulty. We then proceed to answer the aforementioned research questions, and discuss our key observations which are summarized below:
(1) Datasets have distinct modalities: a mix of simpler samples (which we call Extractive) and complex ones (which we call Paraphrase and Inference). (2) Gigaword is majorly Extractive but suffers from data noise (45% of the targets have some key entity or fact that is absent from the source). (3) CNN/DM is relatively cleaner, and the authors’ attempts to create a more abstractive dataset seem to be successful compared with Gigaword (only 18% of samples are Extractive). (4) XSum has no Extractive samples, but also has the greatest fraction of noise: 54% of the test samples have key entities or facts missing from the source. (5) Within the datasets, the broad performance trends between the typology classes are consistent across all metrics: simpler samples score higher than complex ones. (6) Metric reliability is also complexity dependent: on CNN/DM the agreement with human judgement decreases as summarization complexity increases.
The remainder of the paper is organised as follows: in Section 2 we answer Q1, describe the three datasets, define the typology, and present results from the annotation. In Section 3 we explore Q2 a. and evaluate different models on a variety of metrics (automatic and human-judgement). In Section 4 we explore Q2 b. and investigate metric reliability. In Section 5 we share some learnings from our experience. We conclude with Section 7.
Evaluating the intrinsic properties of summarization datasets (Q1)
Among many summarization datasets, we choose the following:
Gigaword is a summarization dataset extracted from news articles Rush et al. (2015). We use the version most commonly used by summarization systems: https://github.com/harvardnlp/sent-summary.
CNN/DailyMail or “CNN/DM” question answering dataset Hermann et al. (2015); Nallapati et al. (2016) is commonly used for summarization. The dataset consists of online news articles paired with human-generated summaries. We use the non-anonymized data as in See et al. (2017).
XSum or “Extreme Summarization” Narayan et al. (2018a) was constructed from online news articles for highly abstractive summarization.
We consider these datasets because of their popularity, and the difference in the nature of samples. The latter enables a more comprehensive analysis; Table 1 captures the size of source and target documents along with the number of samples.
2 Typology Definition
The classes are defined below in order of priority. Some examples are in Table 2. Readers may refer to Appendix A for more examples.
Incomplete/Irrelevant: The target summary ends abruptly. Or the source and target are unrelated.
Entity Missing: The target summary contains entities (names, dates, events, etc) that are absent from the source.
Evidence Missing: The target summary is based on concepts which are absent from the source. However, the target is not Incomplete and all Entities are present.
Extractive: The target is constructed by copying tokens from the source, mostly in-order of their appearance. Minor modifications, like stemming and abbreviating, are permitted. Word substitutions, and additions, are limited to a few. No reasoning, conclusion or co-ref resolution is performed as part of the summarization. The complete context of the target should be present in the source.
Paraphrase: The majority of tokens in the target are substituted, or appear out of order, or both. There is no reasoning, conclusion or co-ref resolution. The complete context of the target should be present in the source.
Inference: A non-trivial “inference” activity has to be completed to construct the target: some reasoning, conclusion, or complex co-reference resolution. The complete context of the target should be present in the source.
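The priority ordering of the six classes can be illustrated with a small sketch. The class names follow the typology above; the `Label` enum, the `assign_label` helper, and the idea of recording a set of applicable classes per sample are illustrative assumptions, since the annotation in this study was performed manually.

```python
from enum import IntEnum


class Label(IntEnum):
    """Typology classes; a lower value means a higher priority."""
    INCOMPLETE_IRRELEVANT = 1
    ENTITY_MISSING = 2
    EVIDENCE_MISSING = 3
    EXTRACTIVE = 4
    PARAPHRASE = 5
    INFERENCE = 6


def assign_label(applicable):
    """Return the highest-priority class among those marked as applicable.

    `applicable` is a hypothetical set of Label values that a human
    annotator judged to hold for one (source, target) pair.
    """
    if not applicable:
        raise ValueError("at least one class must apply")
    return min(applicable)


# A target that paraphrases the source but also mentions an unseen entity
# receives the noise label, because noise classes come first in the ordering.
print(assign_label({Label.PARAPHRASE, Label.ENTITY_MISSING}).name)
```

Under this reading, a sample is assigned a complexity class only when none of the three noise classes applies.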
We annotate 200 samples from each dataset, on par with similar studies on intrinsic evaluation Chen et al. (2016); Cao et al. (2017). Two authors annotate samples independently. Annotations matched for 70%, 68% and 73% of Gigaword, CNN-DM and XSum samples, respectively. Disagreements were discussed between all authors before arriving at a consensus for the final label.
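As a minimal sketch of how a match rate of this kind could be computed, the snippet below measures the fraction of samples on which two annotators chose the same label. The function name `raw_agreement` and the toy labels are assumptions for illustration only; they are not the study's annotations.

```python
def raw_agreement(labels_a, labels_b):
    """Fraction of samples on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same samples")
    if not labels_a:
        raise ValueError("no samples to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Toy labels for four samples (made up for illustration).
annotator_1 = ["Extractive", "Paraphrase", "Inference", "Extractive"]
annotator_2 = ["Extractive", "Inference", "Inference", "Extractive"]
print(raw_agreement(annotator_1, annotator_2))
```

Raw agreement does not correct for chance agreement; it is shown here only because the text reports a simple match percentage.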
To the best of our knowledge, summarization datasets have not been manually analysed in this manner. A review of the most relevant summarization dataset analysis research shows that the most common form of intrinsic evaluation is to use surface-level heuristics. Most studies only cover a part of our typology, while almost all studies ignore the noise present in datasets.
Grusky et al. (2018); Bommasani and Cardie (2020); Zhong et al. (2019b) use similar forms of token-level coverage between the source and the reference to measure the extractiveness of the summary. In its simplest form, this is the ratio of the number of overlapping tokens to the reference length. In our definition of Extractive, we first set a meaningful, well-defined criterion, and then manually check for extractive references, while allowing for some relaxations.
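The simplest form of token-level coverage mentioned above can be sketched as follows. Whitespace tokenization and lowercasing are assumptions made for brevity, and `token_coverage` is a hypothetical helper; the cited works define their own, more refined variants.

```python
def token_coverage(source, reference):
    """Share of reference tokens that also occur somewhere in the source."""
    source_tokens = set(source.lower().split())
    reference_tokens = reference.lower().split()
    if not reference_tokens:
        return 0.0
    overlapping = sum(token in source_tokens for token in reference_tokens)
    return overlapping / len(reference_tokens)


source = "the cat sat on the mat"
reference = "the cat slept"
# Two of the three reference tokens ("the", "cat") appear in the source.
print(token_coverage(source, reference))
```

A value close to 1 only says that the tokens are shared; it says nothing about their order or about whether the reference is complete, which is why a manual check is used for the Extractive class.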
In most papers Grusky et al. (2018); Zhong et al. (2019b); Bommasani and Cardie (2020), the summarization complexity is defined by a compression ratio (usually the normalized word-count ratio of the source and reference). As a standalone metric, this does indeed capture the difficulty in replication. However, token rearrangement, substitution and reformulation are ignored in this measure of “complexity”. To combat this, we distinctly defined Paraphrase and Inference. By manually analysing samples, we are able to differentiate between the obviously simple Extractive samples, the relatively tougher Paraphrase samples and the most difficult Inference samples. Together these three offer a highly intuitive classification of samples. Part of the reason that the Machine Comprehension analysis by Chen et al. (2016) was so effective was the interpretability of their classes. We hope our analysis will also enable researchers to improve summarization models.
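To make the limitation concrete, the sketch below computes a plain word-count compression ratio and shows that it cannot separate a copied reference from a reworded one of the same length. The direction of the ratio (source length over reference length), the helper name `compression_ratio`, and the example sentences are assumptions for illustration.

```python
def compression_ratio(source, reference):
    """Word count of the source divided by word count of the reference."""
    n_source = len(source.split())
    n_reference = len(reference.split())
    if n_reference == 0:
        raise ValueError("reference must contain at least one word")
    return n_source / n_reference


source = "the central bank raised interest rates by a quarter point on tuesday"
copied = "central bank raised interest rates"      # tokens copied in order
reworded = "borrowing costs lifted by regulator"   # same length, reworded

# Both references have five words, so the ratio is identical for both.
print(compression_ratio(source, copied))
print(compression_ratio(source, reworded))
```

Both calls return the same value even though the second reference requires substitution and reformulation, which is the gap the Paraphrase and Inference classes are meant to capture.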
Prior works have not focused on quantifying the noise in popular datasets. Moreover, none of these metrics are designed to account for noise or factual inconsistencies. A high value for content compression might imply a high degree of summarization complexity. But this ignores the possibility that the source-reference pair is unrelated (like row 1 in Table 2). In addition, the manual analysis allows us to identify factual errors and co-ref errors.
This is not to say the typology is perfect and exhaustive. Limitations and possible extensions to our typology are discussed in Section 5.
3 Dataset Analysis
The distribution of classes in the datasets is in Figure 1. We have made the following key observations in our analysis of the labels.
Gigaword. 24.5% of summaries are Extractive, but 44.5% of samples belong to Entity Missing, Evidence Missing, or Incomplete. This is not unexpected considering the “headline” nature of the samples.
XSum. The authors (Narayan et al., 2018a) designed the dataset to be highly abstractive. This is reflected in the distribution: there were no Extractive samples in our analysis, suggesting a significantly higher level of difficulty. However, 55% of samples belong to the Entity Missing, Evidence Missing, or Incomplete classes. The remaining 45% belong to the Paraphrase and Inference categories. Since we found only two incomplete samples, this class is ignored in all further XSum analysis.
CNN/DM. The authors (Hermann et al., 2015) designed CNN/DM to be abstractive in nature, and this is reflected in the distribution: 64% of samples belong to the Paraphrase and Inference categories. Of the three, CNN/DM has the lowest fraction of factual and data noise: there are no Incomplete/Irrelevant samples, and only 18% of samples belong to Entity Missing and Evidence Missing.
The degree to which missing facts affect automatic evaluation varies. In some samples, one or two entities are missing (like Row 2 in Table 2), but in others multiple facts are missing. Empirical analysis of model performance for each class of samples is discussed in Section 3.
Performance on different classes (Q2 a)
In this section, we list the different models and metrics considered for analysis, and then describe how model performance varies across class labels.
We collect outputs from 7 systems for Gigaword: (1) Pegasus (Zhang et al., 2019), (2) Prophet (Qi et al., 2020) (Lewis et al., 2020), (3) UniLM Dong et al. (2019), (4) Biset Song et al. (2020), (5) ConCopy Wang et al. (2019), (6) PointerGenerator See et al. (2017), (7) PointerGeneratorCopying See et al. (2017).
For CNN/DM, we use the outputs of 11 top-performing summarization systems collected by Bhandari et al. (2020) (https://github.com/neulab/REALSumm): (1) HeterGraph (Wang et al., 2020), (2) MatchSumm (Lewis et al., 2020), (3) Refresh Narayan et al. (2018b), (4) TwoStageRL Song et al. (2020), (5) Neusumm Wang et al. (2019), (6) BottomUp Gehrmann et al. (2018), (7) SemSim Yoon et al. (2020), (8) UniLM Dong et al. (2019), (9) BartAbstractive Lewis et al. (2020), (10) BanditSumm Dong et al. (2018), (11) BartExtractive Lewis et al. (2020).
For XSum, we use the outputs of 9 different summarization systems: (1) ConvSeq2Seq (Gehring et al., 2017), (2) TConvS2S (Narayan et al., 2018a), (3) PointerGenerator (See et al., 2017), (4) Bart (Lewis et al., 2020), (5) PreSummExtractive (Liu and Lapata, 2019), (6) PreSummAbstractive (Liu and Lapata, 2019), (7) PreSummTransformer (Liu and Lapata, 2019), (8) LEAD (Nenkova, 2005), (9) ExtOracle (Nallapati et al., 2017).
2 Metrics for evaluation
Existing summarization systems are usually evaluated using automated metrics or manually using human judgments. We list popular automatic metrics explored in this work. Except for the last two, all outputs from every model are scored on the following metrics.
ROUGE-1/2/L (Lin, 2004) measure overlap of unigrams, bigrams and longest common subsequence, respectively. For ROUGE-1, 2 and L, we used the Python implementation: https://github.com/sebastianGehrmann/rouge-baselines.
BERTScore (BS) (Zhang et al., 2020) measures soft overlap between contextual BERT embeddings of tokens between the two texts. We used the code at github.com/Tiiiger/bert_score.
MoverScore (MS) (Zhao et al., 2019) applies a distance measure to contextualized BERT and ELMo word embeddings. We used a faster version of the code provided by the author at github.com/AIPHES/emnlp19-moverscore.
FactCC is introduced to measure the fact consistency between the generated summaries and source documents Kryscinski et al. (2020). Due to issues with the setup and training procedure, this metric was only used in the CNN/DM analysis.
Human Pyramid (HP) provides a robust technique for evaluating content selection by exhaustively obtaining a set of Semantic Content Units (SCUs) from a set of references, and then scoring system summaries on the number of SCUs that can be inferred Nenkova and Passonneau (2004). We use the scores shared by Bhandari et al. (2020) for the first 100 samples of CNN/DM subset.
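To illustrate what n-gram overlap means for the ROUGE-1 and ROUGE-2 entries above, here is a simplified sketch. It is not the implementation used in this study: it assumes whitespace tokenization and lowercasing, applies no stemming, and the helper names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified n-gram overlap: precision, recall and F1."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


candidate = "the cat sat"
reference = "the cat slept"
print(rouge_n(candidate, reference, n=1))  # two of three unigrams match
print(rouge_n(candidate, reference, n=2))  # one of two bigrams matches
```

Changing a single token lowers the bigram score more than the unigram score, since every bigram that touches the changed token stops matching.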
3 Model Performance
For each dataset, we group the samples by their labels. For all samples in a subset, the model response is scored using a metric. The mean of these sample scores returns a single subset-model-metric score, which is then averaged across all models in the subset, leaving us with a single subset-metric score. This is repeated for all (subset, metric) pairs. The results are captured in Figures 2, 3 and 4 for Gigaword, CNN/DM and XSum respectively. The last column in each group is the average score across all samples.
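The two-stage averaging described above can be sketched as follows. The record layout `(label, model, metric, score)`, the function name `subset_metric_scores`, and the toy numbers are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def subset_metric_scores(records):
    """Average sample scores per model, then average those means across models.

    `records` is an iterable of (label, model, metric, score) tuples, one per
    scored model output.
    """
    per_model = defaultdict(list)
    for label, model, metric, score in records:
        per_model[(label, metric, model)].append(score)

    # Stage 1: one subset-model-metric score per (label, metric, model).
    per_subset = defaultdict(list)
    for (label, metric, _model), scores in per_model.items():
        per_subset[(label, metric)].append(mean(scores))

    # Stage 2: one subset-metric score per (label, metric).
    return {key: mean(values) for key, values in per_subset.items()}


# Toy scores for two models on two Extractive samples (made up).
records = [
    ("Extractive", "model_a", "R1", 0.5),
    ("Extractive", "model_a", "R1", 0.7),
    ("Extractive", "model_b", "R1", 0.3),
    ("Extractive", "model_b", "R1", 0.5),
]
print(subset_metric_scores(records))
```

Averaging per model first keeps each model's contribution equal even if, in some setup, models were scored on different numbers of samples.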
Of the three datasets, only Gigaword contains Incomplete (or Irrelevant) samples. Across all metrics, the performance on this label is lowest, which is to be expected – high overlap will be rare if the source and target are unrelated or incomplete (like Row 1, Table 2). What’s alarming is the volume of such samples in Gigaword – if the distribution is the same for the training set, then the model is being trained on extremely noisy data (almost 14%). In addition, such samples needlessly penalise the model performance during evaluation.
The results for the Entity Missing and Evidence Missing subsets are a bit surprising. In Gigaword, the Entity Missing subset receives relatively higher scores than the Evidence Missing category. We attribute this to a combination of factors. Consider Row 2 in Table 2. Entities are missing, but token overlap is high (more than 50%), which explains the high R1 scores but low R2 scores. In our observations, the impact of missing facts and entities varies with the length of the target, as well as the number of entities.
When compared with Gigaword, samples with data quality issues (i.e. Incomplete/Irrelevant, Entity Missing and Evidence Missing samples) in CNN/DM and XSum get relatively higher scores. The reasons are similar to the Gigaword phenomenon discussed before. The average summary length of CNN/DM (54 tokens) is about 7 times that of Gigaword (8 tokens). As a result, with respect to the complete reference, one or two missing facts amount to a much smaller fraction of the reference in CNN/DM. The high overlap with the remainder leads to higher scores.
Automatic metrics only consider the token overlap (or “semantic distance”) between the target and the model output. While such metrics exhibit high correlation with human judgement, a low score does not necessarily imply an incorrect generation, as demonstrated by Freitag et al. (2020) for machine translation. Hence we check for factual correctness of model outputs using FactCC. The competitive FactCC scores on the first three categories in Figure 3 suggest that the outputs generated by the models are factually faithful, which points to issues with the metric reliability. We discuss this in Section 4.
3.2 Impact of Summarization Complexity
For the last three categories (Extractive, Paraphrase and Inference), Gigaword and CNN/DM exhibit a common trend: the highest performance, across all metrics, is on the Extractive subset, followed by Paraphrase samples, which are more difficult to reproduce. The lowest performance is on the Inference samples. However, concluding that the models perform poorly would be incorrect. The last three samples in Table 2 suggest that model outputs are coherent, logical and factually faithful. FactCC scores in Figure 3 also suggest the outputs are factually consistent.
For the Extractive, Paraphrase and Inference samples, the samples we manually observed (some of which are captured in Table 2) and the FactCC scores indicate a gap in the token-based metrics. However, we cannot fault the metrics entirely. If we had diverse target references for the same sources, some outputs would have found better matches and thus higher scores. In fact, we see that BERTScore (a more “semantically” oriented metric) is extremely competitive across all categories in all three datasets (Figures 2, 3, 4), suggesting the generations are similar to the references. These results lead us to believe that token-based summarization metrics might also suffer from a “summarization-ese” effect: the metrics could be biased towards simpler, more “extractive” references. Recently, Freitag et al. (2020) also arrived at the same conclusion for machine translation and BLEU Papineni et al. (2002).
In the next section, we continue to explore the reliability of these metrics.
Does the reliability of metrics change with data properties? (Q2 b)
For each document $d_i$, $i \in \{1, \dots, n\}$, in a dataset $D$, we have $J$ system outputs, where the outputs can come from different systems. Let $s_{ij}$, $j \in \{1, \dots, J\}$, be the $j$-th summary of the $i$-th document, and let $m$ be a specific metric (including human judgment).
Correlation is calculated for each document, among the different system outputs of that document, and the mean value is reported. Like other meta-evaluation studies, we consider the Pearson correlation and Spearman correlation as measures for the correlation function $K$. Due to space constraints we only show the Pearson plots for some critical results. More plots are available in Appendix A.1.
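The per-document correlation described above can be sketched as follows, using the Pearson correlation. The nested-list layout (one row per document, one column per system), the function names, and the choice to skip documents where a metric gives every system the same score are assumptions for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists, or None if undefined."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # constant scores: correlation is undefined
    return cov / sqrt(var_x * var_y)


def mean_document_correlation(scores_m1, scores_m2):
    """Correlate two metrics across systems within each document, then average.

    scores_m1[i][j] is the first metric's score for system j on document i;
    scores_m2 has the same layout for the second metric.
    """
    per_document = []
    for row_1, row_2 in zip(scores_m1, scores_m2):
        r = pearson(row_1, row_2)
        if r is not None:  # assumption: skip documents with undefined correlation
            per_document.append(r)
    if not per_document:
        raise ValueError("no document has a defined correlation")
    return mean(per_document)


# Two documents, three systems each (toy scores).
metric_1 = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
metric_2 = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
print(mean_document_correlation(metric_1, metric_2))
```

Restricting the rows to documents with a given typology label would give a per-class correlation of the kind compared across Extractive, Paraphrase and Inference samples.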
We present a pairwise correlation analysis of the automatic metrics in Figure 5 to understand metric agreement. We conjecture that a strong correlation between two vastly different metrics (say ROUGE and MoverScore) might show that the metrics are more reliable. Overall, we can see in Figure 5 that correlations between token-based metrics (ROUGE) and embedding-distance metrics (BERTScore, MoverScore) are lower in Gigaword, compared to CNN/DM and XSum. It is possible that the short summaries of Gigaword are leading to this; perhaps there isn’t enough context for BERTScore. However, we could not find any results in the original papers to support this claim.
We observe that the correlation is heavily sample dependent. In Figure 5, averaged across all samples, R1 and MoverScore have a Pearson correlation of about 0.68 in Gigaword. This increases to 0.82 for the Extractive samples in Figure 6-(a), which are the simplest to reproduce. As the complexity increases, the correlation scores decrease (in Paraphrase, and then in Inference). The trends for R2 and MoverScore are similar. This is also observed for CNN/DM: in Figure 6-(b), correlations for R1-MoverScore and R1-BERTScore drop from 0.9, 0.85 for Extractive samples to about 0.83, 0.72 for Paraphrase and Inference samples. This suggests that the inter-metric correlation is heavily sample dependent. We cannot comment on XSum, because we did not encounter any Extractive samples in that dataset.
For CNN/DM, we also compute the metric correlations with the human pyramid score (HP) in Figure 5 and Figure 6-(b). We observe the highest agreement with the human-judgement for the Extractive subset, and it is significantly lower in Paraphrase and Inference. This suggests that automatic metrics are more reliable when evaluating simpler examples, than complex ones.
Discussion
Limitations of the typology. Forcing samples to have a single label did limit our analysis. In retrospect, the typology could have allowed for two labels: one for quality, one for complexity. In XSum for instance most samples which were labelled Entity Missing could also be labelled Paraphrase and Inference. We also realise that the impact of positional-bias could be important. This has been explored by Zhong et al. (2019a, b), and we plan to include similar metrics in our future work.
Collecting better datasets. Our results suggest that current metrics are not equally reliable across all categories of samples. If the quality of the references cannot be controlled, then having a diverse set of references for the source is also advised. This will allow for multi-reference evaluation and could offset the “summarization-ese” issues.
Limits of the Pyramid Scores. At the moment, the Pyramid Scores (and judgement criteria in general) only compare the output to the gold-reference, assuming the latter is true. As we see from our analysis, ignoring the source is not the right approach, for references from the web could have quality issues. A modified judgement procedure that also accounts for the faithfulness of the gold-reference (perhaps by using automatic factuality metrics such as FactCC) might be better.
Architecture specific performance. In this study, we were interested in measuring the broader, averaged trends that summarization models exhibit. However, it would be interesting to see how specific architectural decisions impact individual model performance across different classes. We plan to explore this in the future.
“But what’s the best metric for my data?” Specifically for metrics, our objective was to empirically demonstrate that (a) datasets have different modalities, and (b) metrics are not equally reliable across these modalities. In this process, we also observed some results suggesting possible biases in certain token-based metrics, and a need for diverse reference sets. We’ll continue to explore this question.
Related Work
For the task of text summarization, the data analysis heuristics presented in Zhong et al. (2019a, b); Bommasani and Cardie (2020); Grusky et al. (2018) are most relevant to our work. Their analysis is focused on surface-level heuristics, which ignore all noise present in the data. This has been discussed in Sections 2.2.1 and 5. Researchers have also explored other dataset biases Jung et al. (2019); Zhong et al. (2019b); Chen et al. (2020). As discussed in Section 5, we plan to include this in our future work.
For metric reliability and meta-analysis, we build on correlation analysis presented in earlier works Peyrard (2019); Bhandari et al. (2020); Fabbri et al. (2020). The key difference and novelty is the introduction of our typology and measuring the impact of sample complexity on model performance and metric reliability. To the best of our knowledge, metrics and models have not been evaluated on such a typology. As the results in Sections 3 and 4 show, sample complexity is indeed critical for metric reliability.
Conclusion
In this study, we manually analysed 600 samples from three popular datasets, using a typology that captures data quality issues and varying degrees of sample-complexity. Our analysis of 27 summarization models reveals that the metric performance is heavily dependent on samples. On closer inspection, we found that the agreement of popular metrics also changes with the complexity, thus the scores might not reflect true model performance. This analysis also led to some suggestions for creating better summarization datasets and highlights some limitations of the current human-judgement procedures.
Acknowledgements
We thank Professor Graham Neubig, Yiran Chen and anonymous reviewers for valuable feedback and helpful suggestions. We also thank Kaiqiang Song for providing system outputs. This work was supported in part by a grant under the Northrop Grumman SOTERIA project and the Air Force Research Laboratory under agreement number FA8750-19-2-0200.
References
Appendix A Figures and Annotation Details
A.2 Annotation Details
Each sample is annotated by 2-3 annotators independently. Given the limited number of samples, and the laborious nature of the exercise, we chose not to select final labels based on majority vote. For all disagreements, annotators discussed their reasoning and came to a consensus for the final label. For 70% of Gigaword samples, 68% of CNN-DM samples, and 73% of XSum samples, the initial annotations were in agreement.