BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization

Eva Sharma, Chen Li, Lu Wang

Introduction

There has been a growing interest in building neural abstractive summarization systems See et al. (2017); Paulus et al. (2017); Gehrmann et al. (2018a), which requires large-scale datasets with high quality summaries. A number of summarization datasets have been explored so far Sandhaus (2008); Napoles et al. (2012); Hermann et al. (2015); Grusky et al. (2018). However, as most of them are acquired from news articles, they share specific characteristics that limit current state-of-the-art models by making them more extractive rather than allowing them to understand input content and generate well-formed informative summaries. Specifically, in these datasets, the summaries are flattened narratives with a simpler discourse structure, e.g., entities are rarely repeated as illustrated by the news summary in Fig. 1. Moreover, these summaries usually contain long fragments of text directly extracted from the input. Finally, the summary-worthy salient content is mostly present in the beginning of the input articles.

We introduce BigPatent (available to download online at evasharma.github.io/bigpatent), a new large-scale summarization dataset consisting of $1.3$ million patent documents with human-written abstractive summaries. BigPatent addresses the aforementioned issues, thus guiding summarization research to better understand the input’s global structure and generate summaries with a more complex and coherent discourse structure. The key features of BigPatent are: i) summaries exhibit a richer discourse structure with entities recurring in multiple subsequent sentences as shown in Fig. 1, ii) salient content is evenly distributed in the document, and iii) summaries are considerably more abstractive while reusing fewer and shorter phrases from the input.

To further illustrate the challenges in text summarization, we benchmark BigPatent with baselines and popular summarization models, and compare with the results on existing large-scale news datasets. We find that many models yield noticeably lower ROUGE scores on BigPatent than on the news datasets, suggesting a need for developing more advanced models to address the new challenges presented by BigPatent. Moreover, while existing neural abstractive models produce more abstractive summaries on BigPatent, they tend to repeat irrelevant discourse entities excessively, and often fabricate information.

These observations demonstrate the importance of BigPatent in steering future research in text summarization towards global content modeling, semantic understanding of entities and relations, and discourse-aware text planning to build abstractive and coherent summarization systems.

Related Work

Recent advances in abstractive summarization show promising results in generating fluent and informative summaries Rush et al. (2015); Nallapati et al. (2016); Tan et al. (2017); Paulus et al. (2017). However, these summaries often contain fabricated and repeated content Cao et al. (2018). Fan et al. (2018) show that, for content selection, existing models rely on positional information and can be easily fooled by adversarial content present in the input. This underpins the need for global content modeling and semantic understanding of the input, along with discourse-aware text planning to yield a well-formed summary McKeown (1985); Barzilay and Lapata (2008).

Several datasets have been used to aid the development of text summarization models. These datasets are predominantly from the news domain and have several drawbacks such as limited training data (Document Understanding Conference, https://duc.nist.gov/), shorter summaries (Gigaword Napoles et al. (2012), XSum Narayan et al. (2018), and Newsroom Grusky et al. (2018)), and near-extractive summaries (CNN / Daily Mail dataset Hermann et al. (2015)). Moreover, due to the nature of news reporting, summary-worthy content is non-uniformly distributed within each article. ArXiv and PubMed datasets Cohan et al. (2018), which are collected from scientific repositories, are limited in size and have longer yet extractive summaries. Thus, existing datasets either lack crucial structural properties or are limited in size for learning robust deep learning methods. To address these issues, we present a new dataset, BigPatent, which guides research towards building more abstractive summarization systems with global content understanding.

BigPatent Dataset

We present BigPatent, a dataset consisting of 1.3 million U.S. patent documents collected from Google Patents Public Datasets using BigQuery Google (2018) (released and maintained by IFI CLAIMS Patent Services and Google, and licensed under the Creative Commons Attribution 4.0 International License). It contains patents filed after 1971 across nine different technological areas. We use each patent’s abstract as the gold-standard summary and its description as the input. (The summarization task studied using BigPatent is notably different from the traditional patent summarization task, in which patent claims are summarized into a more readable format Cinciruk (2015).) Additional details for the dataset, including the preprocessing steps, are in Appendix A.1.

Table 1 lists statistics, including compression ratio and extractive fragment density, for BigPatent and some commonly-used summarization corpora. Compression ratio is the ratio of the number of words in a document to the number of words in its summary, whereas density is the average length of the extractive fragment to which each word in the summary belongs Grusky et al. (2018). Extractive fragments are the set of shared sequences of tokens in the document and summary. Among existing datasets, CNN/DM Hermann et al. (2015), NYT Napoles et al. (2012), Newsroom (released) Grusky et al. (2018) and XSum Narayan et al. (2018) are news datasets, while arXiv and PubMed Cohan et al. (2018) contain scientific articles. Notably, BigPatent is significantly larger with longer inputs and summaries.
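To make the two statistics concrete, the following is a minimal Python sketch. It assumes pre-tokenized inputs and uses a simplified greedy matching (each summary position is matched to its longest shared token run in the document), which may differ in detail from the procedure of Grusky et al. (2018). The function names are illustrative and not taken from any released code.

```python
def extractive_fragments(article, summary):
    """Greedily collect shared token runs, scanning the summary left to right."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        # Find the longest run starting at summary position i that also occurs in the article.
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
        i += max(best, 1)  # skip past the fragment, or past one unmatched token
    return fragments


def compression_ratio(article, summary):
    """Number of document words divided by number of summary words."""
    return len(article) / len(summary)


def density(article, summary):
    """Average length of the fragment each summary word belongs to (0 for unmatched words)."""
    fragments = extractive_fragments(article, summary)
    return sum(len(f) ** 2 for f in fragments) / len(summary)


article = "a b c d e f g h i j".split()
summary = "a b x d e".split()
print(compression_ratio(article, summary), density(article, summary))
```

A fragment of length $|f|$ contributes $|f|$ to each of its $|f|$ words, which is why the sum runs over squared fragment lengths.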

Dataset Characterization

Inferring the distribution of salient content in the input is critical to content selection of summarization models. While prior work uses probabilistic topic models Barzilay and Lee (2004); Haghighi and Vanderwende (2009) or relies on classifiers trained with sophisticated features Yang et al. (2017), we focus on salient words and their occurrences in the input.

We consider all unigrams, except stopwords, in a summary as salient words for the respective document. We divide each document into four equal segments and measure the percentage of unique salient words in each segment. Formally, let $U$ be a function that returns all unique unigrams (except stopwords) for a given text. Then, $U(d^{i})$ denotes the unique unigrams in the $i^{th}$ segment of a document $d$, and $U(y)$ denotes the unique unigrams in the corresponding summary $y$. The percentage of salient unigrams in the $i^{th}$ segment of a document is calculated as:

$$\frac{|U(d^{i})\cap U(y)|}{|U(y)|}\times 100\%$$
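The segment-wise measurement described above can be sketched in Python as follows. This is an illustration under stated assumptions: the stopword list is a tiny placeholder, tokens are lowercased, segments are split by token count, and the percentage is normalized by the number of unique salient unigrams in the summary.

```python
# Tiny illustrative stopword list (an assumption; a real run would use a full list).
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "is"}


def unique_unigrams(tokens):
    """U(.): unique lowercased unigrams, excluding stopwords."""
    return {t.lower() for t in tokens if t.lower() not in STOPWORDS}


def salient_percent_by_segment(doc_tokens, summary_tokens, n_segments=4):
    """Percentage of the summary's salient unigrams found in each document segment."""
    salient = unique_unigrams(summary_tokens)
    if not salient:
        return [0.0] * n_segments
    n = len(doc_tokens)
    # Segment boundaries by token count; integer division keeps segments near-equal.
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    percentages = []
    for i in range(n_segments):
        segment = doc_tokens[bounds[i]:bounds[i + 1]]
        shared = unique_unigrams(segment) & salient
        percentages.append(100.0 * len(shared) / len(salient))
    return percentages


doc = ["shoe", "sole", "x", "x", "heel", "x", "x", "shoe"]
summary = ["the", "shoe", "heel"]
print(salient_percent_by_segment(doc, summary))
```

Because a salient word can occur in several segments, the four percentages need not sum to $100\%$.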

Fig. 2 shows that BigPatent has a fairly even distribution of salient words in all segments of the input. Only $6\%$ more salient words are observed in the $1^{st}$ segment than in other segments. In contrast, for CNN/DM, NYT and Newsroom, approximately $50\%$ of the salient words are present in the $1^{st}$ segment, and the proportion drops monotonically to $10\%$ in the $4^{th}$ segment. This indicates that most salient content is present in the beginning of news articles in these datasets. For XSum, another news dataset, although the trend in the first three segments is similar to BigPatent, the percentage of salient unigrams in the last segment drops by $5\%$ compared to $0.2\%$ for BigPatent.

For scientific articles (arXiv and PubMed), where content is organized into sections, there is a clear drop in the $2^{nd}$ segment where related work is often discussed, with most salient information being present in the first (introduction) and last (conclusion) sections. In BigPatent, by contrast, each embodiment of a patent’s invention is described sequentially in the document, which leads to a more uniform distribution of salient content.

Next, we probe how far one needs to read from the input’s start to cover the salient words (only those present in the input) from the summary. About $63\%$ of the sentences from the input are required to construct full summaries for CNN/DM, $57\%$ for XSum, $53\%$ for NYT, and $29\%$ for Newsroom. For BigPatent, in contrast, $\mathbf{80\%}$ of the input is required. These observations indicate the need for global content modeling to achieve good performance on BigPatent.
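As a rough illustration of this probe, the sketch below returns the fraction of input sentences that must be read, from the start, before every salient summary word that occurs in the input has been seen. The function name and the handling of stopwords (passed in as an optional set) are assumptions made for the example.

```python
def fraction_to_cover(doc_sentences, summary_tokens, stopwords=frozenset()):
    """Fraction of leading sentences needed to cover all salient words present in the input."""
    salient = {t for t in summary_tokens if t not in stopwords}
    doc_vocab = set()
    for sentence in doc_sentences:
        doc_vocab.update(sentence)
    target = salient & doc_vocab  # only salient words that occur in the input
    if not target:
        return 0.0
    covered = set()
    for k, sentence in enumerate(doc_sentences, start=1):
        covered |= target & set(sentence)
        if covered == target:
            return k / len(doc_sentences)
    return 1.0


doc = [["shoe", "has", "sole"], ["filler"], ["heel", "is", "high"], ["end"]]
print(fraction_to_cover(doc, ["shoe", "heel", "novelword"]))
```

In this toy input the third of four sentences completes the coverage, so three quarters of the document must be read.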

4.2 Summary Abstractiveness and Coherence

Following prior work See et al. (2017); Chen and Bansal (2018), we compute abstractiveness as the fraction of novel $n$-grams in the summaries that are absent from the input. As shown in Fig. 3, XSum comprises notably shorter but more abstractive summaries. Apart from XSum, BigPatent reports the second highest percentage of novel $n$-grams, for $n\in\{2,3,4\}$. Significantly higher novelty scores for trigrams and $4$-grams indicate that BigPatent has fewer and shorter extractive fragments, compared to others (except for XSum, a smaller dataset). This further corroborates the fact that BigPatent has the lowest extractive fragment density (as shown in Table 1) and contains longer summaries.
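The abstractiveness measure can be sketched as below. The example assumes pre-tokenized text and counts unique $n$-grams; whether the reported numbers use unique or repeated $n$-grams is not stated here, so that choice is an assumption.

```python
def ngrams(tokens, n):
    """Set of unique n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_fraction(input_tokens, summary_tokens, n):
    """Fraction of the summary's unique n-grams that never appear in the input."""
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngrams(input_tokens, n)
    return len(novel) / len(summary_ngrams)


source = "the shoe has a flat sole".split()
summary = "a shoe with a flat sole".split()
for n in (1, 2, 3, 4):
    print(n, novel_ngram_fraction(source, summary, n))
```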

Coherence Analysis via Entity Distribution. To study the discourse structure of summaries, we analyze the distribution of entities that are indicative of coherence Grosz et al. (1995); Strube and Hahn (1999). To identify these entities, we extract non-recursive noun phrases (regex NP $\rightarrow$ ADJ$\ast$[NN]+) using NLTK Loper and Bird (2002). Finally, we use the entity-grid representation by Barzilay and Lapata (2008) and their coreference resolution rules to capture the entity distribution across summary sentences. In this work, we do not distinguish entities’ grammatical roles, and leave that for future study.
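The noun-phrase pattern (zero or more adjectives followed by one or more nouns) can be illustrated with a dependency-free Python sketch. It operates on tokens that are already part-of-speech tagged and assumes Penn Treebank style tags (`JJ*` for adjectives, `NN*` for nouns). It is an approximation for illustration only, not the NLTK-based extraction used for the reported numbers.

```python
def noun_phrases(tagged_tokens):
    """Extract non-recursive noun phrases matching ADJ* NN+ from (token, tag) pairs."""
    phrases, adjectives, nouns = [], [], []

    def flush():
        # Emit a phrase only if at least one noun was collected.
        if nouns:
            phrases.append(" ".join(adjectives + nouns))
        adjectives.clear()
        nouns.clear()

    for token, tag in tagged_tokens:
        if tag.startswith("NN"):
            nouns.append(token)
        elif tag.startswith("JJ"):
            if nouns:  # an adjective after nouns starts a new phrase
                flush()
            adjectives.append(token)
        else:
            flush()
    flush()
    return phrases


tagged = [("the", "DT"), ("upper", "JJ"), ("portion", "NN"), ("receives", "VBZ"),
          ("the", "DT"), ("sole", "NN"), ("portion", "NN")]
print(noun_phrases(tagged))
```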

On average, there are $6.7$, $10.9$, $12.4$ and $18.5$ unique entities in the summaries for Newsroom, NYT, CNN/DM and BigPatent, respectively (we exclude XSum as its summaries are all one-sentence). PubMed and arXiv report a higher number of unique entities in summaries ($39.0$ and $48.1$, respectively) since their summaries are considerably longer (Table 1). Table 2 shows that $24.1\%$ of entities recur in BigPatent summaries, which is higher than on other datasets, indicating more complex discourse structures in its summaries. To understand local coherence in summaries, we measure the longest chain formed across sentences by each entity, denoted as $l$. Table 3 shows that $11.1\%$ of the entities in BigPatent appear in two consecutive sentences, which is again higher than that of any other dataset. The presence of longer entity chains in the BigPatent summaries suggests higher sentence-to-sentence relatedness than in the news summaries.

Finally, we examine the entity recurrence pattern, which captures how many entities, first occurring in the $t^{th}$ sentence, are repeated in subsequent ($(t{+}i)^{th}$) sentences. Table 3 (right) shows that, on average, $2.3$ entities in BigPatent summaries recur in later sentences (summing up the numbers for $t{+}2$ and after). The corresponding recurrence frequency for a news dataset such as CNN/DM is only $0.4$. Though PubMed and arXiv report a higher number of recurrences, their patterns are different, i.e., entities often recur after three sentences. These observations imply a good combination of local and global coherence in BigPatent.
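The three entity statistics discussed in this section (share of recurring entities, longest chain $l$ over consecutive sentences, and recurrence offsets relative to the first mention) can be sketched as follows. The input is assumed to be one set of entity strings per summary sentence, with coreference already resolved. The function name and the bucketing of offsets at `max_offset` are illustrative choices.

```python
from collections import Counter


def entity_stats(sentence_entities, max_offset=3):
    """Recurrence share, longest consecutive chain per entity, and recurrence offsets."""
    occurrences = {}
    for index, entities in enumerate(sentence_entities):
        for entity in set(entities):
            occurrences.setdefault(entity, []).append(index)

    longest_chain = {}
    recurrence_offsets = Counter()
    for entity, indices in occurrences.items():
        best = run = 1
        for previous, current in zip(indices, indices[1:]):
            run = run + 1 if current == previous + 1 else 1
            best = max(best, run)
        longest_chain[entity] = best
        first = indices[0]
        for later in indices[1:]:
            # Offsets beyond max_offset are pooled into the last bucket.
            recurrence_offsets[min(later - first, max_offset)] += 1

    recurring = sum(1 for indices in occurrences.values() if len(indices) > 1)
    percent_recurring = 100.0 * recurring / len(occurrences) if occurrences else 0.0
    return {
        "unique_entities": len(occurrences),
        "percent_recurring": percent_recurring,
        "longest_chain": longest_chain,
        "recurrence_offsets": dict(recurrence_offsets),
    }


summary = [{"sole", "shoe"}, {"sole"}, {"heel"}, {"sole", "heel"}]
print(entity_stats(summary))
```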

Experiments and Analyses

We evaluate BigPatent with popular summarization systems and compare with well-known datasets such as CNN/DM and NYT. As a baseline, we use Lead-3, which selects the first three sentences from the input as the summary. We consider two oracles: i) OracleFrag builds the summary using all the longest fragments reused from the input in the gold summary Grusky et al. (2018), and ii) OracleExt selects the globally optimal combination of three sentences from the input that gets the highest ROUGE-1 F1 score. Next, we consider three unsupervised extractive systems: TextRank Mihalcea and Tarau (2004), LexRank Erkan and Radev (2004), and SumBasic Nenkova and Vanderwende (2005). We also adopt RNN-ext RL Chen and Bansal (2018), a Seq2seq model that selects three salient sentences to construct the summary using reinforcement learning. Finally, we train four abstractive systems: Seq2Seq with attention, Pointer-Generator (PointGen) and a version with coverage mechanism (PointGen + cov) See et al. (2017), and SentRewriting Chen and Bansal (2018). Experimental setups and model parameters are described in Appendix A.2.
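The Lead-3 baseline and the OracleExt selection can be illustrated with the short sketch below. It uses a simplified ROUGE-1 F1 (plain unigram overlap on whitespace tokens, without stemming or other normalization), so its scores will not match an official ROUGE toolkit. The exhaustive search over sentence triples is an assumption about how the globally optimal combination could be found, and is only practical for short inputs.

```python
from collections import Counter
from itertools import combinations


def rouge1_f1(candidate_tokens, reference_tokens):
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def lead_k(sentences, k=3):
    """Lead-k baseline: the first k sentences of the input."""
    return sentences[:k]


def oracle_ext(sentences, gold, k=3):
    """Exhaustively pick the k sentences whose concatenation maximizes ROUGE-1 F1."""
    gold_tokens = gold.split()
    best_indices, best_score = (), -1.0
    for indices in combinations(range(len(sentences)), min(k, len(sentences))):
        tokens = [t for i in indices for t in sentences[i].split()]
        score = rouge1_f1(tokens, gold_tokens)
        if score > best_score:
            best_indices, best_score = indices, score
    return best_indices, best_score


sentences = ["the cat sat", "dogs bark loudly", "a shoe has a sole",
             "the sole grips the ground", "birds fly"]
gold = "the shoe sole grips the ground"
print(lead_k(sentences), oracle_ext(sentences, gold))
```

In this toy input the relevant sentences sit at the end of the document, so the oracle selects them while Lead-3 misses the most relevant one.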

Table 4 reports F1 scores of ROUGE-1, 2, and L Lin and Hovy (2003) for all models. For BigPatent, almost all models outperform the Lead-3 baseline due to the more uniform distribution of salient content in BigPatent’s input articles. Among extractive models, TextRank and LexRank outperform RNN-ext RL which was trained on only the first 400 words of the input, again suggesting the need for neural models to efficiently handle longer input. Finally, SentRewriting, a reinforcement learning model with ROUGE as reward, achieves the best performance on BigPatent.

Table 5 presents the percentage of novel $n$-grams in the generated summaries. Although the novel content in the generated summaries (for both unigrams and bigrams) is comparable to that of Gold, we observe repeated instances of fabricated or irrelevant information. For example, “the upper portion is configured to receive the upper portion of the sole portion”, part of the summary generated by Seq2Seq, contains irrelevant repetitions compared to the human summary in Fig. 1. This suggests a lack of semantic understanding and control for generation in existing neural models.

Table 5 also shows the entity distribution (Section 4.2) in the generated summaries for BigPatent. We find that neural abstractive models (except PointGen+cov) tend to repeat entities more often than humans do. For Gold, only $5.2\%$ and $4.0\%$ of entities are mentioned thrice or more, compared to $6.7\%$ and $22.6\%$ for Seq2Seq. PointGen+cov, which employs a coverage mechanism to explicitly penalize repetition, generates significantly fewer entity repetitions. These findings indicate that current models fail to learn the entity distribution pattern, suggesting a lack of understanding of entity roles (e.g., their importance) and discourse-level text planning.

Conclusion

We present the BigPatent dataset with human-written abstractive summaries containing fewer and shorter extractive phrases, and a richer discourse structure compared to existing datasets. Salient content from the BigPatent summaries is more evenly distributed in the input. BigPatent can enable future research to build robust systems that generate abstractive and coherent summaries.

Acknowledgements

This research is supported in part by National Science Foundation through Grants IIS-1566382 and IIS-1813341, and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We also thank the anonymous reviewers for their constructive suggestions.

References

Appendix A Appendices

A.1 Dataset details

BigPatent, a novel large-scale summarization dataset of $1.3$ million US patent documents, is collected from Google Patents Public Datasets using BigQuery Google (2018). Google has indexed more than $87$ million patents with full text from $17$ different patent offices so far. We only consider patent documents from the United States Patent and Trademark Office (USPTO) filed in English after 1971, in order to obtain a more consistent writing and formatting style that makes the text easier to parse.

Each US patent application is filed under a Cooperative Patent Classification (CPC) code USPTO (2013) that provides a hierarchical system of language-independent symbols for the classification of patents according to the different areas of technology to which they pertain. There are nine such classification categories: A (Human Necessities), B (Performing Operations; Transporting), C (Chemistry; Metallurgy), D (Textiles; Paper), E (Fixed Constructions), F (Mechanical Engineering; Lighting; Heating; Weapons; Blasting), G (Physics), H (Electricity), and Y (General tagging of new or cross-sectional technology). Table 6 summarizes the statistics for BigPatent across all nine categories.

From the full public dataset, for each patent record, we retained its title, authors, abstract, claims of the invention and the description text. The abstract of the patent, which is generally written by the inventors after the patent application is approved, was considered as the gold-standard summary of the patent. The description text of the patent contains several other fields such as the background of the invention covering previously published related inventions, the description of figures, and the detailed description of the current invention. For the summarization task, we considered the detailed description of each patent as the input.

We tokenized the articles and summaries using the Natural Language Toolkit (NLTK) Bird et al. (2009). Since there was a large variation in the size of summary and input texts, we removed patent records with a compression ratio less than $5$ or higher than $500$. Further, we only kept records with a summary length between $10$ and $2,500$ words, and an input length of at least $150$ and at most $80,000$ words. Next, to focus on the abstractive summary-input pairs, we removed the records whose percentage of summary-worthy unigrams absent from the input (novel unigrams) was less than $15\%$. Finally, we removed references to figures from summaries and input, along with full tables from the input.
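The length, compression-ratio and novelty filters described above can be summarized in a small predicate. This is a sketch under assumptions: inputs are already tokenized, the novel-unigram percentage is computed over the unique unigrams of the summary without stopword removal, and the function and parameter names are hypothetical.

```python
def keep_record(input_tokens, summary_tokens,
                min_ratio=5, max_ratio=500,
                min_summary=10, max_summary=2500,
                min_input=150, max_input=80000,
                min_novel_percent=15.0):
    """Return True if a (description, abstract) pair passes all filters."""
    n_input, n_summary = len(input_tokens), len(summary_tokens)
    if not (min_summary <= n_summary <= max_summary):
        return False
    if not (min_input <= n_input <= max_input):
        return False
    ratio = n_input / n_summary  # compression ratio
    if not (min_ratio <= ratio <= max_ratio):
        return False
    # Share of unique summary unigrams that never occur in the input.
    summary_vocab = set(summary_tokens)
    novel = summary_vocab - set(input_tokens)
    novel_percent = 100.0 * len(novel) / len(summary_vocab)
    return novel_percent >= min_novel_percent


description = [f"w{i}" for i in range(200)]
abstract = [f"w{i}" for i in range(16)] + ["n1", "n2", "n3", "n4"]
print(keep_record(description, abstract))
```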

As also shown in the main paper, Figure 4 and Figure 5 indicate that BigPatent demonstrates a relatively uniform distribution of the salient content from the summary in all parts of the input. Here, the salient content is considered as all bigrams and longest common sub-sequences from the summary.

A.2 Experiment details

For all experiments, we randomly split BigPatent into $1,207,222$ training pairs, $67,068$ validation pairs, and $67,072$ test pairs. For CNN/DM, we followed preprocessing steps from See et al. (2017), using $287,226$ training, $13,368$ validation, and $11,490$ test pairs. For NYT, following preprocessing steps from Paulus et al. (2017), we used $589,298$ training, $32,739$ validation, and $32,739$ test pairs.

For TextRank, we used the summanlp package (https://pypi.org/project/summa/) Barrios et al. (2016) to generate a summary with three sentences based on the TextRank algorithm Mihalcea and Tarau (2004). For LexRank and SumBasic, we used sumy (https://pypi.python.org/pypi/sumy). For RNN-ext RL from Chen and Bansal (2018), we used the implementation provided by the authors (https://github.com/ChenRocks/fast_abs_rl).

Abstract-based Systems.

For all the neural abstractive summarization models (except for SentRewriting), we truncated the input to $400$ words and output to $100$ words. Except for SentRewriting, all other models were trained using the OpenNMT-py Python library (https://opennmt.net/OpenNMT-py/Summarization.html) based on the instructions provided by the authors Gehrmann et al. (2018b). We provide further details for each model below.

Seq2Seq with attention Sutskever et al. (2014) was trained using a $128$ -dimensional word-embedding and $512$ -dimensional $1$ -layer LSTM. We used a bidirectional LSTM for the encoder and attention mechanism from Bahdanau et al. (2014). The model was trained using Adagrad Duchi et al. (2011) with learning rate 0.15 and an initial accumulator value of $0.1$ . At inference time, we used the beam size $5$ . We used the same settings for training PointGen and PointGen + cov See et al. (2017), adding the copy attention mechanism that allows the model to copy words from the source. At inference time, for PointGen + cov, we used coverage penalty with beta set to $5$ and length penalty Wu et al. (2016) with alpha as $0.9$ .

For SentRewriting from Chen and Bansal (2018), we again used the implementation by the authors (https://github.com/ChenRocks/fast_abs_rl) to train their full RL-based model using their default parameters.

A.3 Summaries for sample Input Document from BigPatent

For the sample summary presented in the introduction of the main paper, Table 7 lists the complete gold-standard summary along with the summaries generated by Seq2Seq, PointGen + cov and SentRewriting. For brevity, we list only the first 400 words of the respective input.