PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu
Introduction
Text summarization aims at generating accurate and concise summaries from input document(s). In contrast to extractive summarization which merely copies informative fragments from the input, abstractive summarization may generate novel words. A good abstractive summary covers principal information in the input and is linguistically fluent.
In abstractive summarization, sequence-to-sequence (Sutskever et al., 2014) has become a dominant framework using encoder-decoder architectures based on RNNs (Chung et al., 2014; Hochreiter & Schmidhuber, 1997) and more recently Transformers (Vaswani et al., 2017). Most prior work on neural abstractive summarization relied on large-scale, high-quality datasets of supervised document-summary pairs (Hermann et al., 2015) and achieved promising results (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017). In recent years, there has been increased interest in collecting new summarization datasets that have more abstractive summaries (Narayan et al., 2018), have longer documents (Cohan et al., 2018; Sharma et al., 2019), utilize multiple documents (Fabbri et al., 2019), and are sourced from diverse domains (Grusky et al., 2018; Koupaee & Wang, 2018; Kim et al., 2019; Kornilova & Eidelman, 2019; Zhang & Tetreault, 2019); however, there has been little work on systematic evaluation of models across these broad settings.
Contemporaneously, the adoption of Transformer models (Vaswani et al., 2017) pre-trained using self-supervised objectives on large text corpora (Radford et al., 2018a; Devlin et al., 2019) has improved performance on many NLP tasks (Wang et al., 2018; Rajpurkar et al., 2016).
Recent work leveraging such pre-training for Transformer-based sequence-to-sequence models (Dong et al., 2019; Song et al., 2019; Rothe et al., 2019; Lewis et al., 2019; Raffel et al., 2019) has extended the success to text generation, including abstractive summarization.
In this work, we study pre-training objectives specifically for abstractive text summarization and evaluate on 12 downstream datasets spanning news (Hermann et al., 2015; Narayan et al., 2018; Grusky et al., 2018; Rush et al., 2015; Fabbri et al., 2019), science (Cohan et al., 2018), short stories (Kim et al., 2019), instructions (Koupaee & Wang, 2018), emails (Zhang & Tetreault, 2019), patents (Sharma et al., 2019), and legislative bills (Kornilova & Eidelman, 2019). We find that masking whole sentences from a document and generating these gap-sentences from the rest of the document works well as a pre-training objective for downstream summarization tasks. In particular, choosing putatively important sentences outperforms lead or randomly selected ones. We hypothesize this objective is suitable for abstractive summarization as it closely resembles the downstream task, encouraging whole-document understanding and summary-like generation. We call this self-supervised objective Gap Sentences Generation (GSG). Using GSG to pre-train a Transformer encoder-decoder on large corpora of documents (Web and news articles) results in our method, Pre-training with Extracted Gap-sentences for Abstractive SUmmarization Sequence-to-sequence models, or PEGASUS.
With our best 568M parameter model trained on the recently introduced C4 (Raffel et al., 2019) corpus we equal or exceed state-of-the-art on the 12 summarization tasks we consider. We further push forward the state-of-the-art using a newly collected text corpus comprised of news-like articles we call HugeNews, including the highly competitive XSum and CNN/DailyMail summarization datasets.
Large-scale document-summary datasets are rare and in practice there is a mismatch between research datasets and real-world use-cases where collecting summaries is expensive; the most common setting is that of low-resource summarization. We simulate this setting and show that our model is able to adapt very quickly when fine-tuning with small numbers of supervised pairs, obtaining state-of-the-art results in 6 datasets with only 1000 examples.
Qualitatively we observed high quality outputs from our best models and validated this in human evaluation studies. We found that PEGASUS summaries are at least as good as reference summaries for the datasets we assessed – XSum, CNN/DailyMail, and Reddit TIFU – even at low-levels of supervision.
To summarize our contributions:
We propose a new self-supervised pre-training objective for abstractive summarization, gap-sentences generation, and study strategies for selecting those sentences.
We evaluate the proposed pre-training objective on a broad range of downstream summarization tasks, with careful ablations to choose the best model settings, which we use to train a 568M parameter PEGASUS model that surpasses or is on-par with the state-of-the-art on all 12 downstream datasets considered.
We show how good abstractive summarization performance can be achieved across broad domains with very little supervision by fine-tuning the PEGASUS model, surpassing previous state-of-the-art results on many tasks with as few as 1000 examples.
We conducted human evaluation studies to validate our experimental design and demonstrate human-level summarization performance on XSum, CNN/DailyMail, and Reddit TIFU.
Related Work
Dai & Le (2015); Ramachandran et al. (2017) used LM and autoencoder pre-training on in-domain data to improve performance of RNN sequence models. However, the combination of pre-training with much larger external text corpora (such as Wikipedia, books, or Web-pages) and Transformer-based sequence models has led to a dramatic improvement in performance when fine-tuned for both natural language understanding and text generation tasks (Radford et al., 2018a; Devlin et al., 2019; Rothe et al., 2019; Yang et al., 2019; Joshi et al., 2019; Song et al., 2019; Dong et al., 2019; Lewis et al., 2019). Most similar to our approach are Transformer encoder-decoder models pre-trained on some masked input pre-training objective.
MASS (Song et al., 2019) proposed masked sequence-to-sequence generation that reconstructs a sentence fragment given the remaining part of the sentence. A single sentence fragment was randomly selected.
UniLM (Dong et al., 2019) proposed jointly training on three types of language modeling tasks: unidirectional (left-to-right and right-to-left), bidirectional (word-level mask, with next sentence prediction), and sequence-to-sequence (word-level mask) prediction.
T5 (Raffel et al., 2019) generalized the text-to-text framework to a variety of NLP tasks and showed the advantage of scaling up model size (to 11 billion parameters) and pre-training corpus, introducing C4, a massive text corpus derived from Common Crawl, which we also use in some of our models. T5 was pre-trained with randomly corrupted text spans of varying mask ratios and sizes of spans.
BART (Lewis et al., 2019) introduced a denoising autoencoder to pre-train sequence-to-sequence models. BART corrupted text with an arbitrary noising function and learned to reconstruct the original text. For generation tasks, the noising function was text infilling which used single mask tokens to mask random sampled spans of text.
In contrast to MASS, UniLM, BART and T5, the proposed PEGASUS masks multiple whole sentences rather than smaller continuous text spans. In our final objective we deterministically choose sentences based on importance, rather than randomly. As in T5, PEGASUS does not reconstruct full input sequences, and only generates the masked sentences as a single output sequence. In this work we focus entirely on downstream summarization (generative) tasks and do not evaluate on NLU classification tasks.
There has been some work on the low-resource summarization setting using the CNN/DailyMail dataset. Radford et al. (2018b) showed that a large Transformer language model pre-trained on Web text could generate summaries if prompted with "TL;DR", achieving a ROUGE-2 of 8.27 on CNN/DailyMail. Khandelwal et al. (2019) pre-trained a Transformer language model on Wikipedia, and fine-tuned using 3000 examples, achieving 13.1 ROUGE-2.
Pre-training Objectives
We propose a new pre-training objective, GSG, in this work, but for comparison, we also evaluate BERT’s masked-language model objective, in isolation and in conjunction with GSG.
We hypothesize that using a pre-training objective that more closely resembles the downstream task leads to better and faster fine-tuning performance. Given our intended use for abstractive summarization, our proposed pre-training objective involves generating summary-like text from an input document. In order to leverage massive text corpora for pre-training, we design a sequence-to-sequence self-supervised objective in the absence of abstractive summaries. A naive option would be to pre-train as an extractive summarizer; however, such a procedure would only train a model to copy sentences, and is thus not suitable for abstractive summarization.
Inspired by recent success in masking words and contiguous spans (Joshi et al., 2019; Raffel et al., 2019), we select and mask whole sentences from documents, and concatenate the gap-sentences into a pseudo-summary. The corresponding position of each selected gap sentence is replaced by a mask token [MASK1] to inform the model. Gap sentences ratio, or GSR, refers to the ratio of the number of selected gap sentences to the total number of sentences in the document, which is similar to mask rate in other works.
To even more closely approximate a summary, we select sentences that appear to be important/principal to the document. The resulting objective has both the empirically demonstrated benefits of masking, and anticipates the form of the downstream task.
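As an illustration of the masking step described above, the following is a minimal sketch, not the released implementation. The function names `build_gsg_example` and `gap_sentences_ratio` are hypothetical, sentence splitting is assumed to have been done already, and concatenating the gap sentences in document order is an assumption.

```python
from typing import List, Sequence, Tuple

MASK1 = "[MASK1]"  # sentence-level mask token named in the text


def build_gsg_example(sentences: Sequence[str],
                      gap_indices: Sequence[int]) -> Tuple[str, str]:
    """Return (input_text, target_text) for one document.

    Each selected sentence is replaced in place by MASK1. The selected
    sentences are concatenated (in document order, an assumption) to form
    the pseudo-summary target.
    """
    gaps = set(gap_indices)
    input_parts: List[str] = [
        MASK1 if i in gaps else s for i, s in enumerate(sentences)
    ]
    target_parts = [sentences[i] for i in sorted(gaps)]
    return " ".join(input_parts), " ".join(target_parts)


def gap_sentences_ratio(num_gap: int, num_sentences: int) -> float:
    """GSR: selected gap sentences divided by total sentences."""
    return num_gap / num_sentences


doc = ["Pegasus is mythical.", "It is pure white.", "It names the model."]
inp, tgt = build_gsg_example(doc, gap_indices=[1])
print(inp)  # Pegasus is mythical. [MASK1] It names the model.
print(tgt)  # It is pure white.
```

Here one of three sentences is masked, so the GSR of this toy document is 1/3.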
We consider 3 primary strategies for selecting $m$ gap sentences without replacement from a document, $D = \{x_i\}_n$, comprised of $n$ sentences:
Random: Uniformly select $m$ sentences at random.
Lead: Select the first $m$ sentences.
Principal: Select the top-$m$ scored sentences according to importance. As a proxy for importance we compute ROUGE1-F1 (Lin, 2004) between the sentence and the rest of the document, $s_i = \mathrm{rouge}(x_i, D \setminus \{x_i\}), \forall i$.
In this formulation sentences are scored independently (Ind) and the top $m$ selected. We also consider selecting them sequentially (Seq) as in Nallapati et al. (2017) by greedily maximizing the ROUGE1-F1 between selected sentences, $S \cup \{x_i\}$, and remaining sentences, $D \setminus (S \cup \{x_i\})$, as in Algorithm 1.
When calculating ROUGE1-F1, we also consider n-grams as a set (Uniq) instead of double-counting identical n-grams as in the original implementation (Orig). This results in four variants of the principal sentence selection strategy, choosing Ind/Seq and Orig/Uniq options.
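The Ind/Seq and Orig/Uniq variants can be sketched as follows. This is a simplified illustration under stated assumptions: `rouge1_f1` is a bare unigram-overlap F1 over pre-tokenized sentences (no stemming or other normalization that a full ROUGE implementation may apply), ties are broken by sentence position, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def rouge1_f1(candidate: Sequence[str], reference: Sequence[str],
              uniq: bool = False) -> float:
    """Unigram-overlap F1; uniq=True counts each distinct token once."""
    if uniq:
        candidate, reference = set(candidate), set(reference)
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def _tokens(sents: List[List[str]], indices: Sequence[int]) -> List[str]:
    return [tok for i in indices for tok in sents[i]]


def select_ind(sents: List[List[str]], m: int, uniq: bool = False) -> List[int]:
    """Ind: score every sentence against the rest of the document, keep top m."""
    n = len(sents)
    scores = [
        rouge1_f1(sents[i], _tokens(sents, [j for j in range(n) if j != i]), uniq)
        for i in range(n)
    ]
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))
    return sorted(ranked[:m])


def select_seq(sents: List[List[str]], m: int, uniq: bool = False) -> List[int]:
    """Seq: greedily add the sentence that maximizes ROUGE1-F1 between the
    selected set and the remaining sentences."""
    n = len(sents)
    selected: List[int] = []
    for _ in range(min(m, n)):
        best_i, best_score = -1, -1.0
        for i in range(n):
            if i in selected:
                continue
            chosen = selected + [i]
            rest = [j for j in range(n) if j not in chosen]
            score = rouge1_f1(_tokens(sents, chosen), _tokens(sents, rest), uniq)
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return sorted(selected)


doc = [["a", "b"], ["a", "b", "c"], ["c", "d"], ["e"]]
print(select_ind(doc, 2))  # [0, 1]
print(select_seq(doc, 2))  # [1, 3]
```

On this toy document the two strategies disagree: Ind keeps the two sentences that individually overlap most with the rest, whereas Seq avoids picking a second sentence that is redundant with the first.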
An example containing lead, random and principal gap sentence selection strategies is shown in Figure 2.
3.2 Masked Language Model (MLM)
Following BERT, we select 15% of the tokens in the input text, and the selected tokens are (1) 80% of the time replaced by a mask token [MASK2], or (2) 10% of the time replaced by a random token, or (3) 10% of the time left unchanged. We apply MLM to train the Transformer encoder as the sole pre-training objective or along with GSG. When MLM is the sole pre-training objective, the Transformer decoder shares all parameters with the encoder when fine-tuning on downstream tasks, following Rothe et al. (2019).
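The 15% selection and the 80/10/10 replacement rule can be sketched as below. This is an illustrative approximation, not the training code: each token is selected independently with probability 0.15 (so the selected fraction is 15% only in expectation), and the function name `mlm_corrupt` and the toy vocabulary are assumptions.

```python
import random
from typing import List, Sequence, Tuple

MASK2 = "[MASK2]"  # token-level mask token named in the text


def mlm_corrupt(tokens: Sequence[str], vocab: Sequence[str],
                rng: random.Random, select_prob: float = 0.15
                ) -> Tuple[List[str], List[int]]:
    """Return (corrupted_tokens, selected_positions)."""
    corrupted = list(tokens)
    selected: List[int] = []
    for pos in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # token not selected for prediction
        selected.append(pos)
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK2              # 80%: mask token
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: keep the original token
    return corrupted, selected


rng = random.Random(0)
tokens = [f"t{i}" for i in range(2000)]
corrupted, selected = mlm_corrupt(tokens, vocab=["x", "y", "z"], rng=rng)
print(len(selected) / len(tokens))  # close to 0.15
```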
Figure 1 shows how GSG and MLM are applied simultaneously to the same example when used in conjunction. However, we found that MLM does not improve downstream tasks at a large number of pre-training steps (section 6.1.2), and chose not to include MLM in the final model (section 6.2).
Pre-training Corpus
For pre-training we considered two large text corpora:
C4, or the Colossal and Cleaned version of Common Crawl, introduced in Raffel et al. (2019); consists of text from 350M Web-pages (750GB).
HugeNews, a dataset of 1.5B articles (3.8TB) collected from news and news-like websites from 2013-2019. A whitelist of domains ranging from high-quality news publishers to lower-quality sites such as high-school newspapers, and blogs was curated and used to seed a web-crawler. Heuristics were used to identify news-like articles, and only the main article text was extracted as plain text.
Downstream Tasks/Datasets
For downstream summarization, we only used public abstractive summarization datasets, and accessed them through TensorFlow Summarization Datasets (https://www.tensorflow.org/datasets/catalog/overview), which provides publicly reproducible code for dataset processing and train/validation/test splits. We used a train/validation/test ratio of 80/10/10 if no split was provided, and held out 10% of the train split as validation if there was no validation split.
XSum (Narayan et al., 2018) consists of 227k BBC articles from 2010 to 2017 covering a wide variety of subjects along with professionally written single-sentence summaries.
The CNN/DailyMail (Hermann et al., 2015) dataset contains 93k articles from CNN and 220k articles from the Daily Mail newspapers. Both publishers supplement their articles with bullet point summaries. We use the non-anonymized variant used in See et al. (2017).
NEWSROOM (Grusky et al., 2018) is a large dataset containing 1.3M article-summary pairs written by authors and editors in the newsrooms of 38 major publications between 1998 and 2017.
Multi-News (Fabbri et al., 2019) is a multi-document summarization dataset consisting of 56k pairs of news articles and their human-written summaries from the site newser.com.
Gigaword (Rush et al., 2015) contains 4M examples extracted from news articles (seven publishers) from the Gigaword corpus (Graff et al., 2003). The task is to generate the headline from the first sentence.
arXiv, PubMed (Cohan et al., 2018) are two long document datasets of scientific publications from arXiv.org (113k) and PubMed (215k). The task is to generate the abstract from the paper body.
BIGPATENT (Sharma et al., 2019) consists of 1.3 million U.S. patents along with human summaries under nine patent classification categories.
WikiHow (Koupaee & Wang, 2018) is a large-scale dataset of instructions from the online WikiHow.com website. Each of 200k examples consists of multiple instruction-step paragraphs along with a summarizing sentence. The task is to generate the concatenated summary-sentences from the paragraphs.
Reddit TIFU (Kim et al., 2019) contains 120K posts of informal stories from the online discussion forum Reddit, more specifically the TIFU sub-reddit from 2013-Jan to 2018-Mar. The sub-reddit posts strictly follow the rule of writing a descriptive "TL;DR" summary, and based on our manual inspection the dataset has higher quality than that of Völske et al. (2017), which used more subreddits. We use the TIFU-long subset (using TLDR as summaries) in this work.
AESLC (Zhang & Tetreault, 2019) consists of 18k email bodies and their subjects from the Enron corpus (Klimt & Yang, 2004), a collection of email messages of employees in the Enron Corporation.
BillSum (Kornilova & Eidelman, 2019) contains 23k US Congressional bills and human-written reference summaries from the 103rd-115th (1993-2018) sessions of Congress. We do not use the California test set which is out-of-distribution.
Following Grusky et al. (2018), the number of examples and extractive fragment coverage/density for all downstream datasets are illustrated in Appendix A.
Experiments
In a similar strategy to Raffel et al. (2019), to save time and computation we conducted pre-training ablation experiments using a reduced-size model with 223M parameters, $\text{PEGASUS}_{\text{BASE}}$, smaller batch size, and only 4 of 12 datasets before scaling up pre-training with the best settings to the final 568M parameters, $\text{PEGASUS}_{\text{LARGE}}$. The datasets (XSum, CNN/DailyMail, WikiHow and Reddit TIFU) were chosen for diversity in abstractiveness, writing style, and size.
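$\text{PEGASUS}_{\text{BASE}}$ had $L = 12$, $H = 768$, $F = 3072$, $A = 12$ and $\text{PEGASUS}_{\text{LARGE}}$ had $L = 16$, $H = 1024$, $F = 4096$, $A = 16$, where $L$ denotes the number of layers for encoder and decoder (i.e. Transformer blocks), $H$ the hidden size, $F$ the feed-forward layer size and $A$ the number of self-attention heads. We pre-trained $\text{PEGASUS}_{\text{BASE}}$ with a batch size of 256 and $\text{PEGASUS}_{\text{LARGE}}$ with a batch size of 8192. We refer to $\text{PEGASUS}_{\text{BASE}}$ without pre-training as $\text{Transformer}_{\text{BASE}}$.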
We used sinusoidal positional encoding following Vaswani et al. (2017). For optimization, both pre-training and fine-tuning used Adafactor (Shazeer & Stern, 2018) with square root learning rate decay and dropout rate of 0.1.
We used greedy decoding for the studies in Section 6.1, and used beam search with a length penalty, $\alpha$, as in Wu et al. (2016) for the final large model.
All experiments’ hyper parameters can be found in Appendix C and reported numbers are in Appendix D and E.
We used $\text{PEGASUS}_{\text{BASE}}$ to evaluate choices of pre-training corpus, pre-training objective, and vocabulary size. For reproducibility, we evaluated the latter two using the publicly available C4 corpus.
Note that the y-axes in Figures 3, 4, 5 are normalized by the left-most bar using $\frac{1}{3}\left(\frac{R1}{R1_{base}} + \frac{R2}{R2_{base}} + \frac{RL}{RL_{base}}\right)$ where $R1$, $R2$, $RL$ are ROUGE F1 scores and $R1_{base}$, $R2_{base}$, $RL_{base}$ are the scores of the configuration corresponding to the first bar.
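As a small worked example of this normalization, the sketch below assumes the normalized value is the mean of the three per-metric ratios to the baseline (left-most bar). The function name and the ROUGE numbers are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def normalized_rouge(scores: Sequence[float], base: Sequence[float]) -> float:
    """Mean of R1/R1_base, R2/R2_base and RL/RL_base."""
    assert len(scores) == len(base) == 3
    return sum(s / b for s, b in zip(scores, base)) / 3


baseline = (40.0, 20.0, 30.0)   # hypothetical R1, R2, RL of the first bar
variant = (44.0, 22.0, 33.0)    # hypothetical scores of another configuration
print(normalized_rouge(baseline, baseline))  # 1.0
print(round(normalized_rouge(variant, baseline), 3))  # 1.1
```

By construction the left-most bar always has height 1.0, so the other bars read as relative gains or losses.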
With more pre-training steps, the model observed more documents in the pre-training corpus. A model trained for 500k steps (the highest we tried) did not observe all training examples in either C4 or HugeNews. Appendix B shows the number of pre-training steps had an unsurprisingly positive impact on downstream dataset performance. We used 500k steps for the ablation studies and the large model.
Figure 3 shows that pre-training on HugeNews was more effective than C4 on the two news downstream datasets, while the non-news informal datasets (WikiHow and Reddit TIFU) prefer the pre-training on C4. This suggests pre-training models transfer more effectively to downstream tasks when their domains are aligned better.
6.1.2 Effect of Pre-training Objectives
We compared six variants of GSG (Lead, Random, Ind-Orig, Ind-Uniq, Seq-Orig, Seq-Uniq) while choosing 30% of sentences as gap sentences. As shown in Figure 4(a), Ind-Orig achieved the best performance followed by Seq-Uniq. Ind-Orig and Seq-Uniq were consistently better than (or similar to) Random and Lead across the four downstream datasets. Lead had decent performance on the two news datasets but was significantly worse on the two non-news datasets, which agrees with findings of lead bias in news datasets (See et al., 2017; Zhong et al., 2019). The results suggest choosing principal sentences works best for downstream summarization tasks, and we chose Ind-Orig for $\text{PEGASUS}_{\text{LARGE}}$.
A significant hyper-parameter in GSG is the gap-sentences ratio (GSR). A low GSR makes the pre-training less challenging and computationally efficient. On the other hand, choosing gap sentences at a high GSR loses contextual information necessary to guide the generation. We compared GSRs from 15% to 75%. For a fair comparison, the original documents were truncated to have up to 400 words. The maximum input length, $L_{input}$, in the encoder and the maximum target length, $L_{target}$, in the decoder were set as 512 tokens.
Figure 4(b) shows that different downstream datasets had slightly different optima. The best performance always had GSR lower than 50%. The model with 15% gap sentences achieved the highest ROUGE scores on CNN/DailyMail, while XSum/Reddit TIFU and WikiHow did better with 30% and 45% respectively. When scaling up to $\text{PEGASUS}_{\text{LARGE}}$ (Section 6.2), we chose an effective GSR of 30%.
As mentioned, the MLM objective can either be applied solely or together with GSG. We jointly trained MLM with GSG Ind-Orig (MLM & Ind-Orig), which masks 30% of sentences and an extra 15% of tokens in unselected sentences, as shown in Figure 1. Figure 4(a) shows that the model pre-trained with MLM alone performed significantly worse and MLM & Ind-Orig had similar performance to Random. Interestingly, when comparing MLM & Ind-Orig to Ind-Orig, we empirically observed MLM improved fine-tuning performance at early pre-training checkpoints (100k - 200k steps), but inhibited further gains with more pre-training steps (500k). Therefore, we chose not to include MLM in $\text{PEGASUS}_{\text{LARGE}}$.
6.1.3 Effect of Vocabulary
We compared two tokenization methods (implemented in https://github.com/google/sentencepiece): the Byte-pair-encoding algorithm (BPE) (Wu et al., 2016; Sennrich et al., 2016), and the SentencePiece Unigram algorithm (Unigram) proposed in Kudo (2018). We evaluated Unigram with different vocabulary sizes ranging from 32k to 256k. In these experiments, models were pre-trained for 500k steps on the C4 corpus with the Ind-Orig objective and 15% GSR. As shown in Figure 5, BPE and Unigram were comparable on news datasets while Unigram outperformed BPE on non-news datasets, especially WikiHow. On XSum and CNN/DailyMail, Unigram 96k achieved the highest ROUGE scores. On WikiHow and Reddit TIFU, the best configurations were Unigram 128k and 64k respectively. Therefore, we used the overall best vocabulary option, Unigram 96k, in $\text{PEGASUS}_{\text{LARGE}}$.
6.2 Larger Model Results
Compared with $\text{PEGASUS}_{\text{BASE}}$, the large model $\text{PEGASUS}_{\text{LARGE}}$ had increased capacity from larger hidden size ($H: 768 \to 1024$, $F: 3072 \to 4096$, $A: 12 \to 16$), number of layers ($L: 12 \to 16$) and traversed much more data, due to larger batch size ($B: 256 \to 8192$) (same number of pre-training steps, 500k). We adopted the best practices found in the ablation studies using the GSG (Ind-Orig) pre-training objective without MLM and Unigram vocabulary size of 96k. In total, $\text{PEGASUS}_{\text{LARGE}}$ had 568M parameters.
To encourage the model to copy, which is an important aspect of the more extractive datasets, we left 20% of selected sentences unchanged in the input instead of replacing them with [MASK1]. We increased the GSR to 45% to achieve a similar number of "gaps" as the optimal 30% found above. We reported the performance of the models pre-trained on HugeNews and C4 separately. We conducted a simple hyper-parameter sweep of learning rate and length penalty, $\alpha$, when fine-tuning on each downstream dataset.
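The interaction between the 45% GSR and the 20% of selected sentences left unmasked can be sketched as follows. This is an illustrative reading, not the released code: each selected sentence is assumed to be kept independently with probability 0.2, and the function names are hypothetical. Under that assumption the expected fraction of sentences actually replaced by [MASK1] is 0.45 × 0.8 = 0.36, which is in the neighborhood of the 30% found in the ablation.

```python
import random
from typing import List, Sequence, Tuple

MASK1 = "[MASK1]"


def mask_gap_sentences(sentences: Sequence[str], gap_indices: Sequence[int],
                       rng: random.Random, keep_prob: float = 0.2
                       ) -> Tuple[List[str], str]:
    """Mask selected sentences, leaving each one visible with prob keep_prob.

    The target always contains every selected sentence, so a kept sentence
    can simply be copied from the input by the model.
    """
    inputs = list(sentences)
    for i in gap_indices:
        if rng.random() >= keep_prob:
            inputs[i] = MASK1
    target = " ".join(sentences[i] for i in sorted(gap_indices))
    return inputs, target


def expected_masked_ratio(gsr: float, keep_prob: float = 0.2) -> float:
    """Expected fraction of all sentences replaced by the mask token."""
    return gsr * (1.0 - keep_prob)


print(round(expected_masked_ratio(0.45), 2))  # 0.36
```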
The CNN/DailyMail, Multi-News, arXiv, PubMed and BIGPATENT datasets contain input documents longer than the maximum input length ($L_{input} = 512$ tokens) in pre-training. This would present a problem for position embeddings which would never be updated for longer input lengths, but we confirm the postulation that sinusoidal positional encodings (Vaswani et al., 2017) generalize well when fine-tuning beyond the input lengths observed in training up to $L_{input} = 1024$ tokens. Since the average input lengths in BIGPATENT, arXiv, PubMed and Multi-News are well beyond 1024 tokens, further scaling up $L_{input}$ or applying a two-stage approach (Liu et al., 2018) may improve performance even more, although this is outside the scope of this work.
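To illustrate why sinusoidal encodings impose no hard limit on input length, the sketch below computes the encoding of Vaswani et al. (2017) for an arbitrary position. It follows the interleaved sine/cosine form of that paper; the exact layout used in the PEGASUS code base may differ, and the function name is hypothetical.

```python
import math
from typing import List


def sinusoidal_position_encoding(position: int, d_model: int) -> List[float]:
    """PE(pos, 2k) = sin(pos / 10000^(2k/d_model)), PE(pos, 2k+1) = cos(same)."""
    encoding: List[float] = []
    for even_dim in range(0, d_model, 2):
        angle = position / (10000 ** (even_dim / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]


# The encoding is a closed-form function of the position, so position 1023
# is as well defined as position 511 even if only 512 tokens were seen
# during pre-training.
print(sinusoidal_position_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(len(sinusoidal_position_encoding(1023, 1024)))  # 1024
```

Unlike a learned embedding table, nothing here needs to be trained for the longer positions; whether the model uses them well is the empirical question the paragraph above answers.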
Tables 1 and 2 show the performance improvements of $\text{PEGASUS}_{\text{BASE}}$ and $\text{PEGASUS}_{\text{LARGE}}$ on downstream datasets. While $\text{PEGASUS}_{\text{BASE}}$ exceeded current state-of-the-art on many datasets, $\text{PEGASUS}_{\text{LARGE}}$ achieved better than state-of-the-art results on all downstream datasets using HugeNews, although C4 performed better on WikiHow.
The improvement from a Transformer model without pre-training ($\text{Transformer}_{\text{BASE}}$) to $\text{PEGASUS}_{\text{LARGE}}$ was more significant on smaller datasets. For example, the ROUGE2-F1 scores nearly tripled on AESLC and quintupled on Reddit TIFU. The large jumps in performance suggest that small text summarization datasets benefit the most from pre-training. We further investigate low resource summarization in Section 6.3.
6.3 Zero and Low-Resource Summarization
In real-world practice, it is often difficult to collect a large number of supervised examples to train or fine-tune a summarization model. To simulate the low-resource summarization setting, we picked the first $10^k$ ($k = 1, 2, 3, 4$) training examples from each dataset to fine-tune $\text{PEGASUS}_{\text{LARGE}}$ (HugeNews). We fine-tuned the models up to 2000 steps with batch size 256, learning rate 0.0005, and picked the checkpoint with best validation performance. As shown in Figure 6, in 8 out of 12 datasets, $\text{PEGASUS}_{\text{LARGE}}$ with just 100 examples could be fine-tuned to generate summaries at comparable quality to $\text{Transformer}_{\text{BASE}}$ trained on the full supervised datasets ranging from 20k to 200k examples. $\text{PEGASUS}_{\text{LARGE}}$ also beat previous state-of-the-art results on 6 out of 12 datasets with only 1000 fine-tuning examples.
On CNN/DailyMail, with half the number of parameters, $\text{PEGASUS}_{\text{LARGE}}$ demonstrated much better zero-shot (ROUGE2-F=13.28) performance than GPT-2 (ROUGE2-F=8.27). Using only 1000 examples, $\text{PEGASUS}_{\text{LARGE}}$ achieved ROUGE2-F of 19.35, much higher than the 13.1 obtained in Khandelwal et al. (2019) with 3000 examples.
6.4 Qualitative Observations and Human Evaluation
Overall, we observed high linguistic quality (in terms of fluency and coherence), closely emulating the style of ground-truth summaries. While some previous work suggested that maximum likelihood training results in repetitive text in model outputs (Welleck et al., 2019), we found this to be rare in our outputs and did not require additional counter-measures to mitigate disfluencies.
Although ROUGE clearly has its drawbacks (Kryscinski et al., 2019), over-penalizing abstractive approaches compared to extractive ones and having no sense of linguistic quality, we found that choosing perplexity-optimized models using aggregated ROUGE (rather than directly optimizing ROUGE as in Paulus et al. (2017)) resulted in qualitatively good models. Randomly sampled (by a program) model decodes across all datasets and a broad range of ROUGE scores can be found in Appendix I. We found that even low-ROUGE model summaries were often high-quality (Figure G.1).
To assess how close $\text{PEGASUS}_{\text{LARGE}}$ is to human performance we conducted human evaluation experiments on Amazon Mechanical Turk comparing model summaries with (human) reference summaries given the input document. The examples were drawn from the XSum, CNN/DailyMail, and Reddit TIFU datasets; the first two were chosen due to their popularity in past work, and the third was chosen for its significant difference in style. Workers were asked to rate the summaries on a 1-5 scale, with higher being better (full experiment details provided in Appendix F) and a paired t-test was used to assess whether scores were significantly different from human.
In the first experiment, $\text{PEGASUS}_{\text{LARGE}}$ (HugeNews), $\text{PEGASUS}_{\text{LARGE}}$ (C4), and $\text{Transformer}_{\text{BASE}}$ were compared with reference summaries; in the second experiment, $\text{PEGASUS}_{\text{LARGE}}$ (HugeNews) fine-tuned using 10, 100, 1000, and all supervised examples were compared with references; the results are shown in Table 3. According to the significance level of $p < 0.01$, both $\text{PEGASUS}_{\text{LARGE}}$ (HugeNews) and $\text{PEGASUS}_{\text{LARGE}}$ (C4) outputs were at least as good as the reference summaries in all cases. Even at low levels of supervision $\text{PEGASUS}_{\text{LARGE}}$ (HugeNews) was not measurably worse than human summaries on XSum and CNN/DailyMail. In the Reddit TIFU case, however, perhaps due to its diverse writing styles, human performance required full supervision.
6.5 Test-set Overlap with Pre-training Corpus
The pre-training corpora are a large collection of documents from the Internet and potentially have overlap with the downstream test sets. In this section, we measured the extent of overlap between the pre-training corpus and downstream datasets. We also studied if the pre-trained model was able to exploit memorization to achieve higher performance on the downstream datasets.
To measure the overlap, we calculated similarities between all pairs of downstream test set targets and pre-training documents. We use the ROUGE-2 recall as a similarity measure (common 2-grams / test set target 2-grams). A similarity score of 1.0 does not necessarily indicate an exact match. We filtered all test set examples that have similarity to any pre-training example above a threshold, and recalculated the ROUGE scores on the remaining test set. In Figure 7, we conducted this study on the pre-training corpus C4 and the test sets of XSum, CNN/DailyMail, Reddit TIFU and WikiHow, with similarity thresholds of 1.0 and 0.8. Results show that only XSum has a significant amount of overlap (15% to 20%), and filtering those examples does not change ROUGE scores by more than 1%. We also manually examined those overlapped examples with similarity of 1.0, and found that the models produce very different summaries compared to the human-written ones, suggesting that there was no clear memorization.
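The overlap filter can be sketched on toy data as follows. The brute-force all-pairs loop is for illustration only and would not scale to a corpus the size of C4; the function names are hypothetical, and treating a similarity equal to the threshold as overlapping (so that a threshold of 1.0 removes anything) is an assumption.

```python
from collections import Counter
from typing import List, Sequence


def bigrams(tokens: Sequence[str]) -> Counter:
    return Counter(zip(tokens, tokens[1:]))


def rouge2_recall(target: Sequence[str], document: Sequence[str]) -> float:
    """Common 2-grams divided by the number of 2-grams in the test target."""
    target_bigrams = bigrams(target)
    total = sum(target_bigrams.values())
    if total == 0:
        return 0.0
    common = sum((target_bigrams & bigrams(document)).values())
    return common / total


def keep_non_overlapping(test_targets: List[List[str]],
                         pretraining_docs: List[List[str]],
                         threshold: float) -> List[int]:
    """Indices of test examples whose best match stays below the threshold."""
    kept = []
    for idx, target in enumerate(test_targets):
        best = max((rouge2_recall(target, d) for d in pretraining_docs),
                   default=0.0)
        if best < threshold:
            kept.append(idx)
    return kept


docs = ["x a b c y".split()]
targets = ["a b c".split(), "p q r".split()]
print(rouge2_recall(targets[0], docs[0]))         # 1.0, yet not an exact match
print(keep_non_overlapping(targets, docs, 0.8))   # [1]
```

The first target illustrates the remark above: every one of its 2-grams appears in the document, giving a score of 1.0, although the document contains additional text.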
Following our experiments on $\text{PEGASUS}_{\text{LARGE}}$ pre-trained on C4 and HugeNews, we pre-trained a $\text{PEGASUS}_{\text{LARGE}}$ model on both corpora and stochastically sampled important sentences. The $\text{PEGASUS}_{\text{LARGE}}$ (mixed, stochastic) model includes the following changes: (1) The model was pre-trained on the mixture of C4 and HugeNews weighted by their number of examples. (2) The model dynamically chose the gap sentences ratio uniformly between 15% and 45%. (3) Important sentences were stochastically sampled with 20% uniform noise on their scores. (4) The model was pre-trained for 1.5M steps instead of 500k steps, as we observed slower convergence of pre-training perplexity. (5) The SentencePiece tokenizer was updated to encode the newline character. The $\text{PEGASUS}_{\text{LARGE}}$ (mixed, stochastic) model achieved the best results on almost all downstream tasks, as shown in Table 4.
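Changes (2) and (3) can be illustrated with the sketch below. The text does not specify how the noise is applied, so this is one possible reading and not the authors' implementation: each importance score is multiplied by a factor drawn uniformly from [0.8, 1.2] before the top-scoring sentences are taken. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def sample_gsr(rng: random.Random, low: float = 0.15, high: float = 0.45) -> float:
    """Draw a gap-sentences ratio uniformly from [low, high]."""
    return rng.uniform(low, high)


def noisy_top_m(scores: Sequence[float], m: int, rng: random.Random,
                noise: float = 0.2) -> List[int]:
    """Perturb scores multiplicatively (an assumption), then keep the top m."""
    noisy = [s * rng.uniform(1.0 - noise, 1.0 + noise) for s in scores]
    ranked = sorted(range(len(scores)), key=lambda i: (-noisy[i], i))
    return sorted(ranked[:m])


rng = random.Random(0)
scores = [0.50, 0.48, 0.10, 0.47]   # hypothetical importance scores
gsr = sample_gsr(rng)
m = max(1, round(gsr * len(scores)))
print(noisy_top_m(scores, m, rng))   # varies with the seed among close scores
```

With this reading, sentences whose scores are close can swap ranks from one pass to the next, while clearly unimportant sentences (such as the one scored 0.10 here) are still never selected.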
Conclusion
In this work, we proposed PEGASUS, a sequence-to-sequence model with gap-sentences generation as a pre-training objective tailored for abstractive text summarization. We studied several gap-sentence selection methods and identified principal sentence selection as the optimal strategy. We demonstrated the effects of the pre-training corpora, gap-sentences ratios, and vocabulary sizes, and scaled up the best configuration to achieve state-of-the-art results on all 12 diverse downstream datasets considered. We also showed that our model was able to adapt to unseen summarization datasets very quickly, achieving strong results with as few as 1000 examples. We finally showed our model summaries achieved human performance on multiple datasets using human evaluation.
Code and Model Checkpoints Release
The training code and instructions for using model checkpoints can be found at
https://github.com/google-research/pegasus
Acknowledgments
We thank Anastassia Kornilova, Eva Sharma, Shashi Narayan, Adam Roberts, Etienne Pot, and the Google News team for assistance with datasets, and Carey Radebaugh, David Grangier, Doug Eck, and Samy Bengio for reviewing the manuscript.
References
Appendix A Datasets Statistics
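Following Grusky et al. (2018), we calculate extractive fragment coverage/density for all downstream datasets. They were defined as
$$\mathrm{COVERAGE}(A, S) = \frac{1}{|S|} \sum_{f \in F(A, S)} |f|$$
$$\mathrm{DENSITY}(A, S) = \frac{1}{|S|} \sum_{f \in F(A, S)} |f|^2$$
where $A$ is the article, $S$ is the summary, and $F(A, S)$ are the extractive fragments. High density indicates more extractive datasets and low coverage suggests more novel words in the summary.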
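For concreteness, the two statistics can be computed as in the sketch below. It follows the greedy longest-match procedure described by Grusky et al. (2018) on pre-tokenized text; tokenization and normalization details of the original implementation are not reproduced, and the function names are hypothetical.

```python
from typing import List, Sequence


def extractive_fragments(article: Sequence[str],
                         summary: Sequence[str]) -> List[List[str]]:
    """Greedily match the longest article span starting at each summary token."""
    fragments: List[List[str]] = []
    i = 0
    while i < len(summary):
        best = 0
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(list(summary[i:i + best]))
        i += max(best, 1)
    return fragments


def coverage(article: Sequence[str], summary: Sequence[str]) -> float:
    """Fraction of summary tokens that lie inside an extractive fragment."""
    if not summary:
        return 0.0
    return sum(len(f) for f in extractive_fragments(article, summary)) / len(summary)


def density(article: Sequence[str], summary: Sequence[str]) -> float:
    """Average squared fragment length per summary token."""
    if not summary:
        return 0.0
    return sum(len(f) ** 2 for f in extractive_fragments(article, summary)) / len(summary)


article = "the quick brown fox jumps over the lazy dog".split()
summary = "quick brown fox sleeps near the lazy dog".split()
print(coverage(article, summary))  # 0.75
print(density(article, summary))   # 2.25
```

In this toy case two fragments of three tokens each are copied, so 6 of the 8 summary tokens are covered and the density is (9 + 9) / 8.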
Appendix B Pre-training Steps
Appendix C PEGASUS Hyper Parameters
Appendix D Experiment Figures’ Numbers
Appendix E Low Resource Numbers
Appendix F Human Evaluation Details
In all human evaluation experiments we used the same task template shown in Figure F.1, where workers were asked to rate 4 summaries for a document on a scale of 1 (poor summary) to 5 (great summary). The order in which the summaries were presented for each task was random per example. Each task was independently done by 3 different workers and we retained the median score across workers for each summary. We paid 1 USD per task and used the following criteria for workers to ensure high quality:
With these criteria we observed high reproducibility in the conclusions of the human evaluation. Multiple runs of the same experiment with different workers meeting these criteria yielded very similar results. The HIT template is provided at https://github.com/google-research/pegasus.
In experiment 1, the four summaries corresponded to 3 models ($\text{PEGASUS}_{\text{LARGE}}$ pre-trained on HugeNews, $\text{PEGASUS}_{\text{LARGE}}$ pre-trained on C4, and $\text{Transformer}_{\text{BASE}}$) that were fine-tuned using all the supervised examples, along with the reference (human) summary. We sampled 100 examples from each dataset (XSum, CNN/DailyMail, Reddit TIFU).
In experiment 2, we evaluated 4 models ($\text{PEGASUS}_{\text{LARGE}}$ pre-trained on HugeNews and fine-tuned using different amounts of supervision: 10, 100, 1000, and all examples) alongside the human summary. To do this with the same template, for each example we randomly selected 4 out of the 5 summaries. This resulted in fewer ratings per model, but did not increase the work (and cost) of the task.
We used a paired t-test to determine statistical significance when comparing the ratings of two sets of summaries.
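The aggregation and test statistic can be sketched with the standard library as below. The ratings are made up for illustration, the function name is hypothetical, and only the t statistic is computed here; obtaining a p-value additionally requires the Student t distribution with n − 1 degrees of freedom (for instance via `scipy.stats.ttest_rel`).

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = a - b.

    Raises ZeroDivisionError if all differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Median over the 3 workers who rated the same summary (hypothetical ratings).
worker_ratings = [3, 5, 4]
print(statistics.median(worker_ratings))  # 4

# Per-example median scores for model and reference summaries (hypothetical).
model_scores = [4, 5, 3, 4]
human_scores = [3, 4, 3, 2]
print(round(paired_t_statistic(model_scores, human_scores), 3))  # 2.449
```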
Appendix G Example of summary with relatively low ROUGE2-F but qualitatively good.
This figure shows an example model summary from the CNN/DailyMail dataset that exhibits high fluency and coherence despite being highly abstractive and having a ROUGE2-F of only 16. The model understood that the football team "Chelsea" could be paraphrased as "Jose Mourinho’s side" and "The Blues" and highlighted the same four matches to be played.
Appendix H Abstractiveness of Summaries
We compared the abstractiveness of model-generated summaries with the human-written ones for all downstream datasets. We measured abstractiveness of summaries using average values of extractive coverage and extractive density (Grusky et al., 2018) on each dataset. More abstractive summaries have smaller extractive coverage (more novel words) and smaller extractive density (smaller spans copied from inputs). Figure H.1 shows that the summaries generated by models were all less abstractive than the human-written counterparts. However, the models that were fine-tuned on more abstractive datasets, such as XSum and Reddit TIFU, could generate more abstractive summaries than human-written ones on other datasets.
Appendix I Example Model Outputs
Model outputs were selected (and LaTeX tables generated) automatically by a program in the following way: (1) pick the first 300 examples of triplets (document, gold summary, model output) from the dataset test split; (2) rank the examples by ROUGE1-F1/ROUGE2-F1/ROUGEL-F1 metrics in descending order; (3) divide the examples into 2-10 buckets depending on the document lengths; (4) randomly pick one example from each bucket.
We filtered out examples that contain bad words from the link https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en. Input documents were truncated at 300 words for visualization. Each page shows examples from one dataset sampled by one ROUGE metric.