AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization

Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, Eugene Ie

Introduction

Summarization has been a challenging problem in natural language processing. Recently a number of neural encoder-decoder approaches have made significant progress in this research area Rush et al. (2015); See et al. (2017); Wang et al. (2018). State-of-the-art models such as PEGASUS Zhang et al. (2019) have leveraged related data sources and tasks, such as language modeling, to pre-train massive summarization models. However, only a few large-scale, high-quality, human-curated summarization datasets are available for training and evaluation.

Obtaining human annotation for summarization is a nontrivial task in itself and there are several contributing factors. Summarization is often subjective and depends on the annotators’ reading comprehension abilities (especially on unfamiliar topics), their interpretation of the text and their judgement on what piece of information should be considered important or relevant to the use case of the generated summaries. These are all influenced by the annotators’ own life experiences, and in the case of abstractive summarization, their ability to compose fluent and succinct text passages as well. In some scenarios, such as generating news headlines or identifying how-to instructions (as a form of summary outline), it is possible to obtain and repurpose pre-annotated data from established web publishers for single document summarization (SDS) Hermann et al. (2015a); Koupaee and Wang (2018). However it’s less clear how one would approach data collection for multi-document summarization (MDS) from the open web.

Two recent studies, WikiSum and Multi-News, have attempted to tackle this problem with automatic procedures to harvest documents for MDS by crawling hyperlinks from Wikipedia and newser.com web sites respectively Liu et al. (2018); Fabbri et al. (2019). Both studies target significantly longer summaries than short texts or headlines. However, the area of generating focused summaries conditioned on contexts has not been in the limelight. This is an important problem in natural language generation, for example in personalized news feed summaries, context-driven product review summaries, to name a few.

Our work considers a variant of MDS called query-based MDS (qMDS), which has crucial applications in augmenting information retrieval (IR) experiences Daumé III and Marcu (2006); Litvak and Vanetik (2017); Hasselqvist et al. (2017); Baumel et al. (2018a). Text documents are typically multi-faceted and users are often interested in identifying information that is most relevant to their stated preferences. For example, if a user is interested in car reliability and in cars from certain manufacturers, an effective IR system could consolidate car reviews from across the web and provide concise summaries relating to the reliability of those cars of interest. Similar to MDS, qMDS also suffers from the lack of large-scale annotated data, especially for generating long abstractive summaries Nema et al. (2017). While we mainly focus on qMDS, the proposed dataset generation methodology can be re-adapted for the more general problem of MDS (while the reverse is not necessarily true).

Our contributions in this paper are twofold. First, we introduce a general approach for automatically generating query-based multi-document summarization (qMDS) datasets at scale, with knobs to control the generated outputs along dimensions such as document diversity and degree of relevance of the target summary. Second, we provide an automatically generated large-scale dataset for qMDS that we validate with baseline summarization experiments and human rater evaluations.

As mentioned above, we focus on summarizing several documents as multi-faceted answers to complex queries. To this end, we leverage the publicly released Google Natural Questions (NQ) dataset Kwiatkowski et al. (2019), which contains real user queries from Google search logs, capturing a wide range of topics that interest people. Many questions have short answers (e.g., one or more entity names, or dates) derived from Wikipedia pages and have been used to form the NQ question answering dataset for training a SQuAD-like span-based QA system. More importantly, a sizable portion of the questions are paired with long-form answers (e.g., paragraphs) that are vetted by human raters. These long-form answers address user questions with content that is focused and coherent. As the Wikipedia passages have also been read and edited by the Wikipedia readership, the writing of the passages should be of adequate quality.

As we are primarily interested in the qMDS setting for general web IR applications, we would like to simulate how a search engine might synthesize documents of high relevance to a user query. To identify high quality passages from the web that can be used to recreate target Wikipedia paragraphs, we use a pre-processed and cleaned version of the Common Crawl corpus Raffel et al. (2019) as a proxy web search index to select documents relevant to the NQ long-form answers. We take special care to include documents of varying semantic relevance, so that our baseline task involves deriving summaries from documents with enough distracting information to challenge summarization models. Furthermore, we ensure the sources are sufficiently diverse among themselves, as our primary interest is to summarize multi-faceted information. Figure 1 illustrates the overall data generation procedure. We publicly release an instance of such a dataset containing 5,519 qMDS examples, which we split into training, validation and test sets of sizes 4,555, 440 and 524 respectively (https://github.com/google-research-datasets/aquamuse). Each example contains an average of 6 source documents to be synthesized into a long-form answer.

It is worth noting that the approach used in Liu et al. (2018) to harvest Wikipedia article texts as summary targets is related to ours, but there are a few key distinctions in goal and methodology. First and foremost, we aim to provide a high-quality dataset for the task of qMDS (not just MDS). As such, we aim to generate paragraphs, which are more consistent and coherent than full length articles, whose structure and content vary much more strongly. The cited references used in the WikiSum dataset do not necessarily provide adequate coverage for the summaries, especially if some sentences in the Wikipedia text are missing references. Instead we use crawled documents from the web as potential source material. Since these are “naturally occurring” documents from the web (albeit a cached subset), we are simulating a realistic web IR application scenario across a large document corpus.

AQuaMuSe

The qMDS problem is formalized as follows. Given a query $q$ , a set of related documents $R=\{r_{i}\}$ , document passages $\{r_{i,j}\}$ relevant to the query are synthesized to an answer $a$ . Various MDS approaches synthesize $a$ such that it is succinct and fluent natural language text that covers the information content in $\{r_{i,j}\}$ rather than just a concatenation of relevant spans. Such synthesized answers can augment information retrieval (IR) applications by enhancing the user experience with high-level query specific summaries.

We propose an automated approach to generating large datasets for the qMDS task for training and evaluating both abstractive and extractive approaches. We illustrate our approach using Google’s Natural Questions (NQ) and Common Crawl (CC). But the methodology is general enough to be extended to any other question answering dataset (containing answers that span multiple sentences) and web corpora (to serve as the domain for retrieval).

Google’s NQ is an open-domain question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples in version 1.0 Kwiatkowski et al. (2019). Each example is a Google search query (generated by real users) paired with a crowd-sourced short answer (one or more entities) and/or long answer span (typically a paragraph) from a Wikipedia page. Queries annotated with only a long answer span serve as summarization targets, since these cannot be addressed tersely by entity names (e.g. “Who lives in the Imperial Palace in Tokyo?”) or a boolean. These queries elicit open-ended answers on complex topics (e.g., “What does the word China mean in Chinese?”).

Suppose a long answer comprises $n$ sentences $a=[l_{1},...,l_{n}]$ and a document corpus $D=\{d_{i}\}$ consists of sentences $[d_{i,j}]$. We use the Universal Sentence Encoder Cer et al. (2018) ($\phi$) to encode sentences $\phi(l_{k})$ and $\phi(d_{i,j})$ for semantic similarity comparisons (e.g., using a dot product $s_{k,i,j}=\langle\phi(l_{k}),\phi(d_{i,j})\rangle$). This yields partial result sets $R_{k}=\{(d_{i},s_{k,i,j}):\theta_{U}>s_{k,i,j}>\theta_{L}\}$. These sets $R_{1..n}$ are then combined by $\psi(d_{i})=\sum_{k,j}s_{k,i,j}$ to yield document-level scores, giving the result set $R=\{(d_{i},\psi(d_{i}))\}$. We restrict the result set by selecting the top-K ranked documents. While we have made specific choices for $\phi$, $s_{k,i,j}$ and $\psi$, they can be customized and tuned to construct result sets $R$ with higher/lower diversity, tighter/looser topicality match, or a different number of documents retrieved (to name a few) for evaluating summarization approaches under different qMDS task conditions.
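The scoring and ranking procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released pipeline: the function name `rank_documents`, the toy 2-d vectors standing in for Universal Sentence Encoder embeddings, and the reading of $\psi$ as a sum over only those sentence pairs that pass both thresholds are ours.

```python
from collections import defaultdict


def dot(u, v):
    """Dot product of two equal-length vectors (the similarity s_{k,i,j})."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(answer_vecs, corpus_vecs, theta_l, theta_u, top_k):
    """Rank documents by summed sentence match scores.

    answer_vecs: list of embeddings, one per answer sentence l_k.
    corpus_vecs: dict mapping doc id -> list of embeddings, one per sentence d_{i,j}.
    Returns up to top_k (doc_id, psi) pairs, highest psi first.
    """
    psi = defaultdict(float)
    for l_vec in answer_vecs:
        for doc_id, sent_vecs in corpus_vecs.items():
            for d_vec in sent_vecs:
                s = dot(l_vec, d_vec)
                # Keep only pairs inside the (theta_L, theta_U) band.
                if theta_l < s < theta_u:
                    psi[doc_id] += s
    ranked = sorted(psi.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy example with hypothetical 2-d "embeddings".
answer = [[1.0, 0.0]]
corpus = {
    "dup": [[1.0, 0.0]],                 # exact duplicate, score 1.0, rejected by theta_U
    "para": [[0.9, 0.1]],                # one paraphrase-like sentence
    "two": [[0.9, 0.0], [0.85, 0.2]],    # two matching sentences
    "off": [[0.0, 1.0]],                 # unrelated, rejected by theta_L
}
ranked = rank_documents(answer, corpus, theta_l=0.8, theta_u=0.99, top_k=2)
print(ranked)
```

In this toy run the duplicate and the unrelated document never enter the result set, and the document with two matching sentences outranks the one with a single match, which is the behavior the thresholds and the sum are meant to produce.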

With appropriate tuning of $\theta_{U}$ and $\theta_{L}$, the process above admits documents with sentences of varying semantic relevance into the result set $R$. Lowering $\theta_{U}$ (while keeping $\theta_{L}$ high enough) ensures that we discard sentences $d_{i,j}$ that are exact matches of $l_{k}$ while retaining sentences that are at least partly semantically equivalent, thereby generating qMDS abstractive summarization examples $(q,a,R)$. The relationship between $q$ and $R$ is transitive through the annotated long answer span $a$. For constructing extractive qMDS examples, we perform an in-place substitution of $d_{i,j}$ with $l_{k}$ (which can optionally be sampled according to match score $s_{k,i,j}$ in future work).

Implementation details

We use a pre-processed and cleaned version of the English CC corpus called the Colossal Clean Crawled Corpus Raffel et al. (2019). It contains 355M web pages in total. For the question answering data source, we use a 62.5% sample of the NQ dataset from the train and development splits, in which 8.2% are question answering examples that we matched with the CC corpus. These NQ questions are marked “good” by a majority of NQ raters and are paired with long-form answers. We limit ourselves to question-answer pairs that cannot be addressed by terse responses (e.g. factoids), to simulate realistic qMDS use cases.

Using TensorFlow Hub (https://tfhub.dev/), we compute Universal Sentence Embeddings (which are approximately normalized) for sentences tokenized from both NQ and CC data sources. The encoded CC sentences are around 11Tb on disk while the NQ portion that formulates the target summaries is comparatively negligible in size. An exhaustive all pairwise comparison is performed using Apache Beam (https://beam.apache.org/). The sentences from the NQ long answers are matched with the CC corpus using efficient nearest neighbor searches over sentence embeddings indexed by space partitioning trees Liu et al. (2004).

$\theta_{U}$ and $\theta_{L}$ control the semantic relevance of sentences matched between CC and NQ. Sentence pairs with matching scores below 0.8 are filtered out ($\theta_{L}$). To avoid exact sentence matches from pages with near-Wikipedia duplicates, we also filter out sentence pairs with scores above 0.99 ($\theta_{U}$). The CC document match score is based on the sum of these sentence-to-sentence match scores. These thresholds can be used to trade off the quality of the matched documents against the abstractive nature of the task. We use the coverage and density metrics defined in Grusky et al. (2018) to construct the normalized bivariate density plot illustrated in Figure 2.
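For readers unfamiliar with the coverage and density metrics, the following sketch computes them from greedily matched extractive fragments, following our reading of the definitions in Grusky et al. (2018): coverage is the fraction of summary tokens that lie in a fragment shared with the source, and density is the average squared fragment length per summary token. The function names and the whitespace tokenization are assumptions made for illustration.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest token sequences shared by summary and article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best   # skip past the matched fragment
        else:
            i += 1      # summary token does not occur in the article
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density) of a summary with respect to an article."""
    if not summary:
        return 0.0, 0.0
    frags = extractive_fragments(article, summary)
    coverage = sum(len(f) for f in frags) / len(summary)
    density = sum(len(f) ** 2 for f in frags) / len(summary)
    return coverage, density


article = "the cat sat on the mat".split()
summary = "a cat sat on a mat".split()
# Shared fragments are ["cat", "sat", "on"] and ["mat"].
cov, den = coverage_and_density(article, summary)
print(cov, den)
```

Low density with moderate coverage indicates a summary that reuses source words without copying long spans, which is the regime an abstractive dataset instance aims for.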

Summary Recall

It is entirely possible that we cannot locate a match for every sentence in a long answer. The Summary Recall is the fraction of sentences in a long answer with matching CC document sentences. A summary recall of 1.0 guarantees a summary can be generated. In our specific dataset instance, we restrict the summary recall to 0.75. Though this may seem like a handicap, we observed in experiments that the input documents have enough information to reconstruct the target summary fully and can be used to test the language generation capabilities of qMDS approaches.
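As a small illustration of this definition, the sketch below computes summary recall from per-sentence match scores. The function name, the input layout (one list of candidate scores per answer sentence) and the reuse of the 0.8 and 0.99 bounds as strict inequalities are assumptions on our part.

```python
def summary_recall(match_scores, theta_l=0.8, theta_u=0.99):
    """Fraction of answer sentences with at least one in-band CC sentence match.

    match_scores: one list per answer sentence, holding its similarity
    scores against candidate CC sentences.
    """
    if not match_scores:
        return 0.0
    matched = sum(
        1 for scores in match_scores
        if any(theta_l < s < theta_u for s in scores)
    )
    return matched / len(match_scores)


# Four answer sentences; the second only has a near-duplicate match (0.995),
# which falls outside the band and therefore does not count.
scores = [[0.85, 0.30], [0.995], [0.90], [0.81, 0.95]]
print(summary_recall(scores))  # 3 of 4 sentences matched
```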

Top-K Parameter

This is analogous to the number of top results returned by web search and controls the degree of support as well as diversity across multiple documents for a qMDS task. We evaluated the quality of the matched documents as ranked by their match scores. We use $K=7$ as we found that document quality degrades beyond that point (given our specific settings of sentence matching threshold and summary recall).

Dataset Statistics

Our specific dataset instance is derived from a subset of the NQ dataset that we match with CC. Based on the thresholds detailed above, we construct 5,519 examples that are split into training, development and testing sets (4,555, 440 and 524 examples, respectively) using a hash of the NQ long answer. Given the thresholds $\theta$ and $K$, the total number of CC documents that matched our restricted set of NQ long answers is 33,760. Not every example can find up to $K=7$ matching documents from CC within the $\theta$ bounds. The distribution of input document counts is shown in Table 2.
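Splitting by a hash of the long answer keeps the assignment deterministic, so that the same answer always lands in the same split. A minimal sketch of such a scheme follows; the hash function (SHA-256), the 100-bucket layout and the default percentages are hypothetical choices for illustration and are not taken from the released dataset.

```python
import hashlib


def assign_split(long_answer, train_pct=80, dev_pct=10):
    """Deterministically map a long answer to 'train', 'dev' or 'test'.

    The bucket boundaries are illustrative assumptions, not the paper's values.
    """
    digest = hashlib.sha256(long_answer.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + dev_pct:
        return "dev"
    return "test"


print(assign_split("Guttation is the exudation of drops of xylem sap."))
```

Because the split depends only on the answer text, regenerating the dataset with different $\theta$ or $K$ settings would not move an existing answer between splits.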

Although we explicitly pick the examples from NQ which only have a long answer, we find many descriptive queries for factoid-like questions. For example, “Where is silver found and in what form (compound)”, “Where was Moses when he saw a burning bush”, “When is a jury used in civil cases”. The most interesting examples are the why queries since they have a descriptive summary in the long-form answer. For example, “Why does Friedman think the world is flat”, “Why do plants drip water from their leaves”.

Document and Summary Lexical Overlaps

Our approach relies on sentence-level matching to retrieve documents, but that does not guarantee a high recall of the n-grams in the summary. To sanity check that the set of no more than 7 documents retrieved this way has a high lexical overlap with the summary, we used the BLEU precision score. Note that a perfect BLEU precision score in this case implies that every n-gram in the summary can be mapped to a distinct n-gram in the source. Figure 3 shows the histogram of this overlap measure.
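The overlap measure can be illustrated with a clipped n-gram precision, the core quantity of BLEU. The sketch below is a simplified stand-in rather than a full BLEU implementation: it handles a single n-gram order, pools n-gram counts over all source documents, and uses function names of our own choosing.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(summary, source_docs, n):
    """Fraction of summary n-grams that map to a distinct n-gram in the sources."""
    summary_counts = Counter(ngrams(summary, n))
    source_counts = Counter()
    for doc in source_docs:
        source_counts.update(ngrams(doc, n))
    total = sum(summary_counts.values())
    if total == 0:
        return 0.0
    # Clip each summary n-gram count by its count in the pooled sources.
    overlap = sum(min(c, source_counts[g]) for g, c in summary_counts.items())
    return overlap / total


summary = "the cat the cat".split()
one_doc = ["the cat sat".split()]
two_docs = ["the cat sat".split(), "the cat ran".split()]
print(clipped_precision(summary, one_doc, 1))   # 0.5: "the" and "cat" each occur once in the source
print(clipped_precision(summary, two_docs, 2))
```

Clipping is what makes a perfect score require a distinct source n-gram for every summary n-gram: repeating a phrase in the summary is only rewarded if the sources repeat it as well.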

Comparing to other datasets

Table 1 compares our dataset instance with other commonly used datasets for summarization. CNN/DM is an abstractive SDS dataset Nallapati et al. (2016). Multi-News is the first large-scale MDS dataset Fabbri et al. (2019). While Multi-News has more summarization examples, our dataset includes query contexts and covers more documents, sentences and tokens per summary. Also, the number of AQuaMuSe examples can increase with looser restrictions on $\theta$ and $K$. Furthermore, our approach generalizes to more MDS examples if we remove the query context and operate on any Wikipedia paragraph spans. Recently, Nema et al. (2017) introduced a qMDS dataset built from Debatepedia (http://www.debatepedia.org). Their input documents are relatively short (75 words/doc). AQuaMuSe includes much longer input documents that can be more challenging for qMDS models.

Quality Assessment

In this section, we carefully assess the quality of the automatically generated qMDS examples along several axes: correctness of matched documents, fluency of machine edited extractive summaries, and overall example quality. All our human evaluation tasks are based on human rater pools consisting of fluent English speakers. The annotation tasks are discriminative in nature (e.g., judging semantic match), which are cheaper to source and easier to validate through replication than generative annotation tasks (e.g., open-ended text generation). We also provide a few qMDS examples for illustration.

We first evaluate the factual and semantic correctness of the CC documents that were matched with the NQ long answers. We focus on the abstractive setup as we will demonstrate later how the derived extractive version is qualitatively similar.

For this annotation task, we presented raters with a Wikipedia paragraph (corresponding to the long-form answer) and a matched sentence (one from each of the top-7 CC documents). They were asked to rate “+1” if the CC sentence matched some of the content of the Wikipedia paragraph. Raters were instructed not to rely on external knowledge in the rating process. Numerical facts were subjectively evaluated, e.g., 4B years is close to 4.5B years, but 3 electrons and 4 electrons are not.

We rated a sample of 5,215 examples corresponding to 856 queries. Each example rating is replicated 3 times across different raters to account for subjectivity. Raters were allowed to abstain if they could not make a decision. We found that 85.18% of the examples are marked relevant by a majority of raters, as illustrated in Figure 4. Top-ranked documents (per document match scores) contain many more sentences annotated with a +1 majority decision, as shown in Figure 5.

Fluency

The extractive dataset is created by replacing sentences from the CC document with the matched sentence in the Wikipedia long answer $a$. This, however, may distort the overall flow of the original CC passage. This evaluation task checks that fluency is not harmed by the substitution.

First, we designed a human evaluation where the raters were presented with the original and the edited CC document passages, including the replaced sentence. A +1 indicates that the replaced sentence does not appear out of place. We rated 500 examples with a rater replication of 3. 96.20% of the examples were rated positive.

Second, we measured the perplexity of the paragraphs with replaced sentences using a language model (https://tfhub.dev/google/wiki40b-lm-en/1). The mean perplexity increased slightly from 80 to 82 after replacement. This small increase is expected since a foreign sentence was inserted, but the difference is small enough to suggest that fluency is preserved.
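For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows only this final computation; it assumes that per-token natural-log probabilities have already been obtained from a language model, and it does not call the TF Hub model mentioned above.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Comparing the mean of this quantity over passages before and after sentence replacement gives the kind of before/after figure reported here.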

Overall quality

We now turn to evaluating the overall quality of a random sample of 55 qMDS example triplets $(q,R,a)$ along three dimensions (referential clarity, focus and the coherence of the summary) adapted from the DUC2007 task (https://duc.nist.gov/duc2007/quality-questions.txt). Since the summary $a$ is a Wikipedia passage, grammatical correctness and redundancy dimensions need not be evaluated.

Each triplet was rated by 3 raters. The raters were also instructed to consider the query $q$ when evaluating the focus of the summary $a$ rather than just a generic summary that can be generated from the set of input documents $R$ . Ratings were on a 5-point scale — 5 being very good and 1 being very poor. The results are summarized in Table 3 showing that the majority of ratings fall under good (4) and very good (5).

Examples

Finally, we also illustrate two specific challenging aspects of the qMDS dataset. The example in Table 4 demonstrates how a summary can cover multiple facets of a single query that can be sourced from multiple input documents. The example in Table 5 shows how the query context may require summarization models to attend to specific portions of the source documents.

Experiments

Our experiments are based on running popular summarization models on both abstractive and extractive versions of our qMDS dataset. These baseline summarization experiments are categorized into two types: (i) a query-agnostic setup where the query $q$ is ignored and the models map source documents $R$ to long answer $a$ as in standard SDS/MDS; and (ii) a query-based setup where the source document set $R$ is conditionally filtered by the input query $q$ followed by SDS/MDS approaches.

Fabbri et al. (2019) define Hi-MAP, a hierarchical abstractive MDS model that combines a pointer-generator network See et al. (2017) with Maximal Marginal Relevance (MMR) Carbonell and Goldstein (1998) scores to rank sentences based on relevancy and redundancy. We used 128-d word vectors in a single layer 512-d RNN that was trained up to 10K steps with an initial learning rate of 0.15.
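To make the MMR criterion concrete, the sketch below implements the classic greedy selection of Carbonell and Goldstein (1998): at each step it picks the sentence that maximizes $\lambda\cdot\text{relevance}-(1-\lambda)\cdot\text{redundancy}$, where redundancy is the highest similarity to any sentence already selected. This is not the neural model of Fabbri et al. (2019), which integrates MMR scores into the network; the function name, the precomputed score inputs and the default $\lambda=0.5$ are assumptions for illustration.

```python
def mmr_select(relevance, similarity, k, lam=0.5):
    """Greedy Maximal Marginal Relevance selection.

    relevance[i]: relevance score of sentence i.
    similarity[i][j]: pairwise similarity between sentences i and j.
    Returns the indices of up to k selected sentences, in selection order.
    """
    selected = []
    candidates = list(range(len(relevance)))

    def mmr_score(i):
        # Redundancy is the highest similarity to anything already chosen.
        redundancy = max((similarity[i][j] for j in selected), default=0.0)
        return lam * relevance[i] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


relevance = [0.9, 0.85, 0.4]
similarity = [
    [1.0, 0.95, 0.1],   # sentences 0 and 1 are near-duplicates
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_select(relevance, similarity, k=2))            # [0, 2]
print(mmr_select(relevance, similarity, k=2, lam=1.0))   # [0, 1]
```

With $\lambda=0.5$ the less relevant but novel sentence 2 is preferred over the near-duplicate sentence 1, whereas $\lambda=1$ reduces to ranking by relevance alone.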

PEGASUS

Zhang et al. (2019) propose pre-training Transformer-based MDS models with massive text corpora. Pre-training involves generating masked sentences, similar to an extractive summary. We fine-tuned PEGASUS with an initial learning rate of 0.01 for 100K steps and evaluated it on our test set, with the caveat that some test CC documents were part of the data used to pre-train the model (albeit for a different objective).

Extractive Summarization

Zhou et al. (2018) rank sentences using scores derived from a hierarchical encoder, with top ranked sentences forming the extractive summary. While it was designed for SDS, the hierarchical document representation is well suited for adapting to the MDS setting in future work. The model used 50-d GloVe word vectors and was trained with a learning rate of 0.001 and a batch size of 32 for 50 epochs. The output was set to 4 sentences to match the long answer summary statistics. Finally, the input sequence length was set to 500 sentences to accommodate the larger size of the multi-document input.

TextRank

This is an unsupervised summarization model based on sentence similarity: it ranks sentences using a weighted graph defined over the sentences in a document Mihalcea and Tarau (2004), and is often used as a baseline for extractive summarization.
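A compact sketch of TextRank for sentence ranking is given below, following our understanding of Mihalcea and Tarau (2004): sentence similarity is the word overlap normalized by the log lengths of the two sentences, and scores are obtained by iterating a weighted PageRank update. The fixed iteration count, the whitespace tokenization and the function names are simplifying assumptions; this is not the implementation used in our experiments.

```python
import math


def sentence_similarity(a, b):
    """Word overlap of two token lists, normalized by their log lengths."""
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:  # both sentences have a single token
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Return one TextRank score per tokenized sentence."""
    n = len(sentences)
    weights = [
        [0.0 if i == j else sentence_similarity(sentences[i], sentences[j])
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
scores = textrank(sentences)
print(scores)
```

An extractive summary is then formed by taking the highest scoring sentences; in the toy input, the third sentence shares no words with the others and therefore receives only the baseline score of $1-d$.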

Incorporating Query in SDS/MDS

As our dataset is explicitly designed for qMDS, we modified the standard SDS/MDS setup by pre-filtering the source documents $R$ to retain fragments that are relevant to query $q$ (based on BLEU scores) as input to the models. To retain source document fluency, fragments are defined at the paragraph level. Table 6 and Table 7 show the results with and without this variation. The filter acts as a crude attention mechanism that weeds out irrelevant content from the inputs, showing improvements in all the approaches except for PEGASUS. We believe this drop may be attributed to the sentence masking done in pre-training PEGASUS, which relies on undisrupted sentence orders.
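The pre-filtering step can be sketched as follows. For brevity, this illustration replaces BLEU with a simpler unigram overlap (the fraction of query tokens that occur in a paragraph); the scoring function, the `min_overlap` threshold and the function names are hypothetical and do not reflect the exact settings of our experiments.

```python
def query_overlap(query_tokens, paragraph_tokens):
    """Fraction of query tokens that also occur in the paragraph."""
    if not query_tokens:
        return 0.0
    vocab = set(paragraph_tokens)
    return sum(1 for t in query_tokens if t in vocab) / len(query_tokens)


def filter_paragraphs(query, paragraphs, min_overlap=0.5):
    """Keep paragraphs whose overlap with the query meets the threshold.

    Paragraphs are kept whole and in their original order, so the text
    that reaches the summarizer stays fluent.
    """
    q = query.lower().split()
    return [
        p for p in paragraphs
        if query_overlap(q, p.lower().split()) >= min_overlap
    ]


query = "why do plants drip water"
paragraphs = [
    "Plants drip water from their leaves through guttation .",
    "The stock market closed higher today .",
]
kept = filter_paragraphs(query, paragraphs)
print(kept)
```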

Human Evaluation

In addition to automatic evaluation, we also collected human judgements for summarization outputs of one specific abstractive MDS model (Hi-MAP) to understand the headroom available in qMDS on this dataset. We follow the question-answering approach in Clarke and Lapata (2010).

We created 32 questions from 17 randomly sampled summaries. Participants were asked to answer those questions after reading a summary, and their answers were scored: 1 (fully correct answer), 0.5 (partially correct answer), and 0 (incorrect answer). Note that the ground-truth answers are the answers derived from the ground-truth summaries. The more questions the participants can answer correctly from the generated summaries, the better the summarization system. Five participants were given the questions and the ground-truth summaries, and the other five were given the questions and the summaries generated by Hi-MAP. We then computed the averaged scores. For the ground-truth summaries, the average score is 30.6 out of 32. For the Hi-MAP summaries, it is 13.8. This is far lower than the score for the ground truth, showing a fairly wide headroom for improvement.

Related Work

Query-based summarization can be either extractive Dang (2006); Daumé III and Marcu (2006); Schilder and Kondadadi (2008); Otterbacher et al. (2009); Wang et al. (2016); Litvak and Vanetik (2017); Wang et al. (2019) or abstractive Nema et al. (2017); Baumel et al. (2018b); Hasselqvist et al. (2017); Ishigaki et al. (2020). Earlier studies were often extractive and relied on manually selected and curated datasets such as DUC2005 and DUC2006 Dang (2006). However, neural abstractive models often demand large amounts of labeled data, which are hard to obtain for summarization and other tasks with similar demands on manual annotation efforts. Recent studies show a two-step process of using extractive summarization followed by generation for abstractive summaries Fabbri et al. (2019) as well as for query-based abstractive summaries Egonmwan et al. (2019).

While our work is motivated by the use case of generating longer summaries to answer complex questions, there is related work on creating QA datasets for short answers: using news articles from CNN/DM Hermann et al. (2015b), HotpotQA Yang et al. (2018), TriviaQA Joshi et al. (2017), SearchQA Dunn et al. (2017), online debates with summarizing arguments and debating topics as queries Nema et al. (2017), and community question answering websites Deng et al. (2020). Some of them involve extracting text spans of words as answers from multiple text passages. However, our work focuses on longer answers.

Large-scale datasets for regular MDS over long documents with target long summaries have also started to appear Liu et al. (2018); Fabbri et al. (2019); Koupaee and Wang (2018). Besides extracting contents for IR applications, our effort differs from them in terms of heterogeneity in documents and lengths of summaries. The MS Marco dataset is close to our work in spirit Bajaj et al. (2018). The dataset contains 1M question-answer-context triplets where the answers are human created using the top-10 passages returned from Bing’s search queries. We use Wikipedia passages as summaries, thus avoiding additional human effort. Judging by the dataset statistics, our dataset also has longer input sources and answers.

Conclusion

We have presented AQuaMuSe, a scalable methodology for constructing new qMDS datasets, along with in-depth analyses and baseline experiments to demonstrate properties of one such dataset instance. Many parts of the approach are configurable, providing researchers a rich sandbox for evaluating summarization models under different task conditions. Our methodology greatly reduces the cost of data collection by converting a predominantly generative human annotation task (e.g., reading documents and writing succinct summaries) to a discriminative human annotation task (e.g., deciding on sentence-document relevance). While our present work does not propose new methods for query-based summarization, we ran baseline experiments on one specific instance of the AQuaMuSe dataset using a few popular neural approaches re-adapted with query conditioning. Our experiments demonstrate that there is still much headroom for existing state-of-the-art models, and we hope AQuaMuSe will spur further advancements in query-focused multi-document summarization algorithms.