Extractive Summarization as Text Matching

Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang

Introduction

The task of automatic text summarization aims to compress a textual document into a shorter highlight while preserving the salient information of the original text. In this paper, we focus on extractive summarization since it usually generates semantically and grammatically correct sentences Dong et al. (2018); Nallapati et al. (2017) and computes faster.

Currently, most neural extractive summarization systems score and extract sentences (or smaller semantic units Xu et al. (2019)) one by one from the original text, model the relationship between the sentences, and then select several sentences to form a summary. Cheng and Lapata (2016); Nallapati et al. (2017) formulate the extractive summarization task as a sequence labeling problem and solve it with an encoder-decoder framework. These models make independent binary decisions for each sentence, resulting in high redundancy. A natural way to address this problem is to introduce an auto-regressive decoder Chen and Bansal (2018); Jadhav and Rajan (2018); Zhou et al. (2018), allowing the scoring operations of different sentences to influence each other. Trigram Blocking Paulus et al. (2017); Liu and Lapata (2019), a more popular method recently, has the same motivation. At the stage of selecting sentences to form a summary, it skips any sentence that shares a trigram with the previously selected sentences. Surprisingly, this simple method of removing duplication brings a remarkable performance improvement on CNN/DailyMail.
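The Trigram Blocking heuristic described above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and a list of sentences already sorted by model score; the function names and the toy sentences are hypothetical and not taken from any cited system.

```python
def trigrams(tokens):
    """Return the set of word trigrams in a token list."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(ranked_sentences, max_sentences=3):
    """Greedily pick sentences in score order, skipping trigram repeats.

    ranked_sentences: sentences (strings) sorted from highest to lowest score.
    """
    selected, seen = [], set()
    for sentence in ranked_sentences:
        grams = trigrams(sentence.lower().split())
        if grams & seen:
            continue  # shares a trigram with an already selected sentence
        selected.append(sentence)
        seen |= grams
        if len(selected) == max_sentences:
            break
    return selected


ranked = [
    "the cat sat on the mat",
    "yesterday the cat sat on a chair",  # repeats "the cat sat"
    "dogs bark loudly at night",
]
summary = select_with_trigram_blocking(ranked, max_sentences=2)
```

In this toy run the second sentence is skipped because it repeats a trigram of the first, so the third sentence is selected instead.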

The above systems, which model the relationship between sentences, are essentially sentence-level extractors and do not consider the semantics of the entire summary. This makes them more inclined to select highly generalized sentences while ignoring the coupling of multiple sentences. Narayan et al. (2018b); Bae et al. (2019) utilize reinforcement learning (RL) to achieve summary-level scoring, but they are still limited to the architecture of sentence-level summarizers.

To better understand the advantages and limitations of sentence-level and summary-level approaches, we conduct an analysis on six benchmark datasets (in Section 3) to explore the characteristics of these two methods. We find that there is indeed an inherent gap between the two approaches across these datasets, which motivates us to propose the following summary-level method.

In this paper, we propose a novel summary-level framework (MatchSum, Figure 1) and conceptualize extractive summarization as a semantic text matching problem. The principal idea is that a good summary should be more semantically similar as a whole to the source document than the unqualified summaries. Semantic text matching is an important research problem that estimates the semantic similarity between a source and a target text fragment. It has been applied in many fields, such as information retrieval Mitra et al. (2017), question answering Yih et al. (2013); Severyn and Moschitti (2015), and natural language inference Wang and Jiang (2016); Wang et al. (2017). One of the most conventional approaches to semantic text matching is to learn a vector representation for each text fragment, and then apply typical similarity metrics to compute the matching scores.

Specific to extractive summarization, we propose a Siamese-BERT architecture to compute the similarity between the source document and the candidate summary. Siamese BERT leverages the pre-trained BERT Devlin et al. (2019) in a Siamese network structure Bromley et al. (1994); Hoffer and Ailon (2015); Reimers and Gurevych (2019) to derive semantically meaningful text embeddings that can be compared using cosine-similarity. A good summary has the highest similarity among a set of candidate summaries.
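The matching step can be illustrated without any encoder: given one embedding for the document and one per candidate summary, the chosen summary is the candidate with the highest cosine similarity. The sketch below assumes the embeddings are plain lists of floats; the toy 3-dimensional vectors and the helper names are hypothetical stand-ins for BERT outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_candidate(doc_embedding, candidate_embeddings):
    """Return (index, score) of the candidate most similar to the document."""
    scores = [cosine_similarity(doc_embedding, c) for c in candidate_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy 3-d vectors standing in for encoder outputs (hypothetical values).
doc = [1.0, 2.0, 0.0]
candidates = [[0.0, 0.0, 1.0], [2.0, 4.0, 0.1], [1.0, 0.0, 0.0]]
index, score = best_candidate(doc, candidates)
```

Here the second candidate points in almost the same direction as the document vector, so it is the one selected.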

We evaluate the proposed matching framework and perform significance testing on a range of benchmark datasets. Our model outperforms strong baselines significantly in all cases and improves the state-of-the-art extractive result on CNN/DailyMail. In addition, we design experiments to observe the gains brought by our framework.

We summarize our contributions as follows:

1) Instead of scoring and extracting sentences one by one to form a summary, we formulate extractive summarization as a semantic text matching problem and propose a novel summary-level framework. Our approach bypasses the difficulty of summary-level optimization by contrastive learning, that is, a good summary should be more semantically similar to the source document than the unqualified summaries.

2) We conduct an analysis to investigate whether extractive models must do summary-level extraction based on the properties of the dataset, and attempt to quantify the inherent gap between sentence-level and summary-level methods.

3) Our proposed framework has achieved superior performance compared with strong baselines on six benchmark datasets. Notably, we obtain a state-of-the-art extractive result on CNN/DailyMail (44.41 in ROUGE-1) by only using the base version of BERT. Moreover, we seek to observe where the performance gain of our model comes from.

Related Work

Recent research on extractive summarization spans a large range of approaches. These works usually instantiate their encoder-decoder framework by choosing an RNN Zhou et al. (2018), Transformer Zhong et al. (2019b); Wang et al. (2019) or GNN Wang et al. (2020) as the encoder, and a non-auto-regressive Narayan et al. (2018b); Arumae and Liu (2018) or auto-regressive decoder Jadhav and Rajan (2018); Liu and Lapata (2019). Despite their effectiveness, these models are essentially sentence-level extractors whose individual scoring process favors the highest-scoring sentence, which is probably not the optimal one for forming a summary (we quantify this phenomenon in Section 3).

The application of RL provides a means of summary-level scoring and brings improvement Narayan et al. (2018b); Bae et al. (2019). However, these efforts are still limited to auto-regressive or non-auto-regressive architectures. Besides, in the non-neural approaches, the Integer Linear Programming (ILP) method can also be used for summary-level scoring Wan et al. (2015).

In addition, some earlier work approaches extractive summarization from a semantic perspective, such as concept coverage Gillick and Favre (2009), reconstruction Miao and Blunsom (2016) and semantic volume maximization Yogatama et al. (2015).

2 Two-stage Summarization

Recent studies Alyguliyev (2009); Galanis and Androutsopoulos (2010); Zhang et al. (2019a) have attempted to build two-stage document summarization systems. Specific to extractive summarization, the first stage is usually to extract some fragments of the original text, and the second stage is to select or modify on the basis of these fragments.

Chen and Bansal (2018) and Bae et al. (2019) follow a hybrid extract-then-rewrite architecture, with policy-based RL to bridge the two networks together. Lebanoff et al. (2019); Xu and Durrett (2019); Mendes et al. (2019) focus on the extract-then-compress learning paradigm, namely compressive summarization, which will first train an extractor for content selection. Our model can be viewed as an extract-then-match framework, which also employs a sentence extractor to prune unnecessary information.

Sentence-Level or Summary-Level? A Dataset-dependent Analysis

Although previous work has pointed out the weakness of sentence-level extractors, there is no systematic analysis of the following questions: 1) For extractive summarization, is the summary-level extractor better than the sentence-level extractor? 2) Given a dataset, which extractor should we choose based on the characteristics of the data, and what is the inherent gap between these two extractors?

In this section, we investigate the gap between sentence-level and summary-level methods on six benchmark datasets, which can guide our search for an effective learning framework. It is worth noting that the sentence-level extractor we use here does not include a redundancy removal process, so that we can estimate the effect of the summary-level extractor on redundancy elimination. Notably, the analysis method presented in this section for estimating the theoretical effectiveness is general and applies to any summary-level approach.

We refer to $D=\{s_{1},\cdots,s_{n}\}$ as a single document consisting of $n$ sentences, and $C=\{s_{1},\cdots,s_{k}\mid s_{i}\in D\}$ as a candidate summary including $k$ ($k\leq n$) sentences extracted from a document. Given a document $D$ with its gold summary $C^{*}$, we measure a candidate summary $C$ by calculating the ROUGE Lin and Hovy (2003) value between $C$ and $C^{*}$ in two levels:
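One way to write the two levels, consistent with the later references to Eq. (1) and (2), is sketched below. The symbol names $g^{sen}$ and $g^{sum}$, and the use of $R(\cdot)$ for an average ROUGE score, are assumptions introduced here for illustration.

```latex
\begin{align}
g^{sen}(C) &= \frac{1}{|C|}\sum_{s\in C} R(s, C^{*}), \tag{1}\\
g^{sum}(C) &= R(C, C^{*}). \tag{2}
\end{align}
```

Under this reading, $|C|$ is the number of sentences in $C$: the sentence-level score averages the ROUGE of each sentence against the gold summary, while the summary-level score treats the sentences of $C$ as a whole.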

We define the pearl-summary to be the summary that has a lower sentence-level score but a higher summary-level score.

Clearly, if a candidate summary is a pearl-summary, it is challenging for sentence-level summarizers to extract it.

Best-Summary

The best-summary refers to the summary that has the highest summary-level score among all the candidate summaries.

2 Ranking of Best-Summary

For each document, we sort all candidate summaries in descending order based on the sentence-level score, and then define $z$ as the rank index of the best-summary $\hat{C}$. (We use an approximate method here: take #Ext (see Table 1) of ten highest-scoring sentences to form candidate summaries.)

Intuitively, 1) if $z=1$ ($\hat{C}$ comes first), the best-summary is composed of the sentences with the highest scores; 2) if $z>1$, the best-summary is a pearl-summary. As $z$ increases ($\hat{C}$ gets a lower ranking), we find more candidate summaries whose sentence-level score is higher than that of the best-summary, which makes learning harder for sentence-level extractors.
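The rank $z$ can be computed directly once each candidate has both scores. The following is a minimal sketch, assuming each candidate is given as a (sentence-level score, summary-level score) pair; the function name and the numeric values are hypothetical.

```python
def best_summary_rank(candidates):
    """Return z, the 1-based rank of the best-summary by sentence-level score.

    candidates: list of (sentence_level_score, summary_level_score) pairs,
    one per candidate summary of a single document.
    """
    # Best-summary: the candidate with the highest summary-level score.
    best = max(range(len(candidates)), key=lambda i: candidates[i][1])
    # Rank candidates by sentence-level score, highest first.
    order = sorted(
        range(len(candidates)), key=lambda i: candidates[i][0], reverse=True
    )
    return order.index(best) + 1


# Hypothetical scores for four candidate summaries of one document.
scores = [(0.42, 0.38), (0.40, 0.45), (0.35, 0.30), (0.33, 0.41)]
z = best_summary_rank(scores)
is_pearl = z > 1
```

In this toy document the best-summary is only second by sentence-level score, so it counts as a pearl-summary.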

Since the appearance of pearl-summaries brings challenges to sentence-level extractors, we investigate the proportion of pearl-summaries on six benchmark datasets. A detailed description of these datasets is displayed in Table 1.

As demonstrated in Figure 2, for all datasets, most of the best-summaries are not made up of the highest-scoring sentences. Specifically, for CNN/DM, only 18.9% of best-summaries are not pearl-summaries, indicating that sentence-level extractors will easily fall into a local optimum and miss better candidate summaries.

Different from CNN/DM, PubMed is the most suitable for sentence-level summarizers, because most of its best-summaries are not pearl-summaries. Additionally, it is challenging to achieve good performance on WikiHow and Multi-News without a summary-level learning process, as $z$ is most evenly distributed in these two datasets; that is, the appearance of pearl-summaries makes the selection of the best-summary more complicated.

In conclusion, the proportion of the pearl-summaries in all the best-summaries is a property to characterize a dataset, which will affect our choices of summarization extractors.

3 Inherent Gap between Sentence-Level and Summary-Level Extractors

The above analysis has shown that the summary-level method is better than the sentence-level method because it can pick out pearl-summaries, but how much improvement can it bring given a specific dataset?

Based on the definition of Eq. (1) and (2), we can characterize the upper bound of the sentence-level and summary-level summarization systems for a document $D$ as:

where $\mathcal{C}_{D}$ is the set of candidate summaries extracted from $D$.

Then, we quantify the potential gain for a document $D$ by calculating the difference between $\alpha^{sen}(D)$ and $\alpha^{sum}(D)$:

Finally, a dataset-level potential gain can be obtained as:

where $\mathcal{D}$ represents a specific dataset and $|\mathcal{D}|$ is the number of documents in this dataset.
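Writing $g^{sen}$ and $g^{sum}$ for the sentence-level and summary-level scores of Eq. (1) and (2) (these symbol names are an assumption), one form consistent with the definitions above is:

```latex
\begin{align}
\alpha^{sen}(D) &= \max_{C\in\mathcal{C}_{D}} g^{sen}(C), \\
\alpha^{sum}(D) &= \max_{C\in\mathcal{C}_{D}} g^{sum}(C), \\
\Delta(D) &= \alpha^{sum}(D) - \alpha^{sen}(D), \\
\Delta(\mathcal{D}) &= \frac{1}{|\mathcal{D}|}\sum_{D\in\mathcal{D}} \Delta(D).
\end{align}
```

The sign convention in $\Delta(D)$ is chosen so that a positive value means a potential gain for the summary-level method.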

As Figure 3 shows, the performance gain of the summary-level method varies with the dataset and reaches a maximum of 4.7 on CNN/DM. From Figure 3 and Table 1, we find that the performance gain is related to the length of the reference summary. In the case of short summaries (Reddit and XSum), even perfect identification of pearl-summaries does not lead to much improvement. Similarly, multiple sentences in a long summary (PubMed and Multi-News) already have a large degree of semantic overlap, making the improvement of the summary-level method relatively small. But for a medium-length summary (CNN/DM and WikiHow, about 60 words), the summary-level learning process is rewarding. We will discuss this performance gain with specific models in Section 5.4.

Summarization as Matching

The above quantitative analysis suggests that for most of the datasets, sentence-level extractors are inherently unaware of pearl-summaries, so obtaining the best-summary is difficult. To better utilize these characteristics of the data, we propose a summary-level framework that scores and extracts a summary directly.

Specifically, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries (extracted from the original text) are matched in a semantic space. The following section details how we instantiate our proposed matching summarization framework using a simple siamese-based architecture.

Inspired by the siamese network structure Bromley et al. (1994), we construct a Siamese-BERT architecture to match the document $D$ and the candidate summary $C$. Our Siamese-BERT consists of two BERTs with tied weights and a cosine-similarity layer during the inference phase.

In order to fine-tune Siamese-BERT, we use a margin-based triplet loss to update the weights. Intuitively, the gold summary $C^{*}$ should be semantically closest to the source document, which is the first principle our loss should follow:

where $C$ is the candidate summary in $D$ and $\gamma_{1}$ is a margin value. Besides, we also design a pairwise margin loss for all the candidate summaries. We sort all candidate summaries in descending order of ROUGE scores with the gold summary. Naturally, the candidate pair with a larger ranking gap should have a larger margin, which is the second principle to design our loss function:

where $C_{i}$ represents the candidate summary ranked $i$ and $\gamma_{2}$ is a hyperparameter used to distinguish between good and bad candidate summaries. Finally, our margin-based triplet loss can be written as:
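A sketch of these three losses, assuming $f(D, C)$ denotes the cosine similarity that Siamese-BERT assigns to document $D$ and candidate $C$ (the symbol $f$ and the names $\mathcal{L}_{1}$, $\mathcal{L}_{2}$ are introduced here for illustration):

```latex
\begin{align}
\mathcal{L}_{1} &= \max\bigl(0,\; f(D, C) - f(D, C^{*}) + \gamma_{1}\bigr), \\
\mathcal{L}_{2} &= \max\bigl(0,\; f(D, C_{j}) - f(D, C_{i}) + (j - i)\,\gamma_{2}\bigr), \quad i < j, \\
\mathcal{L} &= \mathcal{L}_{1} + \mathcal{L}_{2}.
\end{align}
```

The factor $(j - i)$ is what makes the required margin grow with the ranking gap between the two candidates.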

The basic idea is to let the gold summary have the highest matching score, and at the same time, a better candidate summary should obtain a higher score compared with the unqualified candidate summary. Figure 1 illustrates this idea.
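The two principles can be made concrete with a small framework-free function that operates on precomputed similarity scores. This is a sketch under stated assumptions: the loss is summed over all candidates and all ordered pairs (the reduction is not specified in the text), the pairwise margin is taken to be the ranking gap times $\gamma_{2}$, and the function name is hypothetical.

```python
def matching_loss(gold_score, candidate_scores, gamma1=0.0, gamma2=0.01):
    """Margin-based loss over one document's candidates.

    gold_score: similarity between the document and the gold summary.
    candidate_scores: similarities for the candidates, already sorted so that
        index 0 is the candidate with the highest ROUGE against the gold summary.
    """
    # First principle: the gold summary should beat every candidate by gamma1.
    gold_term = sum(max(0.0, s - gold_score + gamma1) for s in candidate_scores)
    # Second principle: a better-ranked candidate should beat a worse-ranked
    # one by a margin that grows with the ranking gap (j - i).
    pair_term = 0.0
    for i, s_i in enumerate(candidate_scores):
        for j in range(i + 1, len(candidate_scores)):
            pair_term += max(0.0, candidate_scores[j] - s_i + (j - i) * gamma2)
    return gold_term + pair_term


# Similarities that respect the ROUGE ranking incur no loss ...
well_ordered = matching_loss(0.9, [0.8, 0.7, 0.6])
# ... while reversed similarities are penalized.
mis_ordered = matching_loss(0.9, [0.6, 0.7, 0.8])
```

A training implementation would compute the same quantities on tensors so that gradients flow back into the encoder; the pure-Python version only shows the arithmetic.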

In the inference phase, we formulate extractive summarization as a task to search for the best summary among all the candidates $\mathcal{C}$ extracted from the document $D$.

2 Candidates Pruning

The matching idea is intuitive, but it suffers from combinatorial explosion. For example, how should we determine the size of the candidate summary set, and should we score all possible candidates? To alleviate these difficulties, we propose a simple candidate pruning strategy.

Concretely, we introduce a content selection module to pre-select salient sentences. The module learns to assign each sentence a salience score and prunes sentences irrelevant to the current document, resulting in a pruned document $D'=\{s'_{1},\cdots,s'_{ext}\mid s'_{i}\in D\}$.

Similar to much previous work on two-stage summarization, our content selection module is a parameterized neural network. In this paper, we use BertSum Liu and Lapata (2019) without trigram blocking (we call it BertExt) to score each sentence. Then, we use a simple rule to obtain the candidates: we generate all combinations of $sel$ sentences from the pruned document, and reorder the sentences in each combination according to their original position in the document to form candidate summaries. Therefore, we have a total of $\binom{ext}{sel}$ candidate summaries.
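The pruning and candidate generation rule above can be sketched with the standard library. The example assumes salience scores are already available as a list in document order; the function name and the score values are hypothetical.

```python
from itertools import combinations
from math import comb


def generate_candidates(sentence_scores, ext, sel):
    """Build candidate summaries as tuples of sentence indices.

    sentence_scores: salience score per sentence, in document order.
    ext: number of top-scoring sentences kept after pruning.
    sel: number of sentences per candidate summary.
    """
    # Keep the `ext` highest-scoring sentences, then restore document order.
    top = sorted(
        range(len(sentence_scores)),
        key=lambda i: sentence_scores[i],
        reverse=True,
    )[:ext]
    pruned = sorted(top)
    # combinations() preserves input order, so each candidate is already in
    # original document order.
    return list(combinations(pruned, sel))


scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.6, 0.2]  # hypothetical salience scores
candidates = generate_candidates(scores, ext=5, sel=3)
```

With `ext=5` and `sel=3` this yields `comb(5, 3)` candidates, which shows how quickly the set grows if `ext` is not kept small.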

Experiment

In order to verify the effectiveness of our framework and obtain more convincing explanations, we perform experiments on six diverse mainstream datasets as follows.

CNN/DailyMail Hermann et al. (2015) is a commonly used summarization dataset modified by Nallapati et al. (2016), which contains news articles and associated highlights as summaries. In this paper, we use the non-anonymized version.

PubMed Cohan et al. (2018) is collected from scientific papers and thus consists of long documents. We modify this dataset by using the introduction section as the document and the abstract section as the corresponding summary.

WikiHow Koupaee and Wang (2018) is a diverse dataset extracted from an online knowledge base. Articles in it span a wide range of topics.

XSum Narayan et al. (2018a) is a one-sentence summary dataset to answer the question “What is the article about?”. All summaries are professionally written, typically by the authors of documents in this dataset.

Multi-News Fabbri et al. (2019) is a multi-document news summarization dataset with a relatively long summary. We use the truncated version and concatenate the source documents as a single input in all experiments.

Reddit Kim et al. (2019) is a highly abstractive dataset collected from a social media platform. We only use the TIFU-long version of Reddit, which regards the body text of a post as the document and the TL;DR as the summary.

2 Implementation Details

We use the base version of BERT to implement our models in all experiments. The Adam optimizer Kingma and Ba (2014) with warmup is used, and our learning rate schedule follows Vaswani et al. (2017) as:

where each step uses a batch size of 32 and $wm$ denotes warmup steps of 10,000. We choose $\gamma_{1}=0$ and $\gamma_{2}=0.01$. When $\gamma_{1}<0.05$ and $0.005<\gamma_{2}<0.05$, they have little effect on performance; otherwise they cause performance degradation. We use the validation set to save the three best checkpoints during training, and record the performance of the best checkpoints on the test set. Importantly, all the experimental results listed in this paper are the average of three runs. To obtain a Siamese-BERT model on CNN/DM, we use 8 Tesla-V100-16G GPUs for about 30 hours of training.
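The warmup schedule of Vaswani et al. (2017) has the shape $\min(step^{-0.5},\, step \cdot wm^{-1.5})$ multiplied by a constant. A minimal sketch follows; the `scale` parameter is a placeholder for that constant, whose value is an assumption here, and the function name is hypothetical.

```python
def warmup_lr(step, warmup_steps=10_000, scale=1.0):
    """Learning rate at a 1-based optimizer step.

    Rises linearly for `warmup_steps` steps, then decays as step ** -0.5.
    `scale` is a placeholder for the constant factor, which is an assumption.
    """
    return scale * min(step ** -0.5, step * warmup_steps ** -1.5)


peak = warmup_lr(10_000)      # the two branches meet at the warmup boundary
early = warmup_lr(5_000)      # still in the linear warmup phase
late = warmup_lr(40_000)      # inverse-square-root decay phase
```

The rate peaks exactly at `step == warmup_steps`, where both arguments of `min` are equal.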

For all datasets, we remove samples with an empty document or summary and truncate the document to 512 tokens. Therefore, ORACLE in this paper is calculated on the truncated datasets. Details of the candidate summaries for the different datasets can be found in Table 2.

3 Experimental Results

As shown in Table 3, we list strong baselines with different learning approaches. The first section contains LEAD, ORACLE and MATCH-ORACLE. LEAD and ORACLE are common baselines in the summarization task: the former extracts the first several sentences of a document as a summary, and the latter is the ground truth used to train extractive models. MATCH-ORACLE is the ground truth used to train MatchSum. Because we prune documents before matching, MATCH-ORACLE is relatively low.

The second section shows that although RL can score the entire summary, it does not lead to much performance improvement. This is probably because it still relies on sentence-level summarizers such as Pointer networks or sequence labeling models, which select sentences one by one, rather than distinguishing the semantics of different summaries as a whole. Trigram Blocking is a simple yet effective heuristic on CNN/DM, even better than all redundancy removal methods based on neural models.

Compared with these models, our proposed MatchSum outperforms all competitors by a large margin. For example, it beats BertExt by 1.51 ROUGE-1 when using BERT-base as the encoder. Additionally, even compared with the baseline that uses a BERT-large pre-trained encoder, MatchSum (BERT-base) still performs better. Furthermore, when we change the encoder to RoBERTa-base Liu et al. (2019), the performance can be further improved. We think the improvement here is because RoBERTa introduced 63 million English news articles during pretraining. The superior performance on this dataset demonstrates the effectiveness of our proposed matching framework.

Results on Datasets with Short Summaries

Reddit and XSum have been heavily evaluated with abstractive summarizers due to their short summaries. Here, we evaluate our model on these two datasets to investigate whether MatchSum can achieve improvement over other typical extractive models when dealing with summaries containing fewer sentences.

When taking just one sentence to match the original document, MatchSum degenerates into a re-ranking of sentences. Table 4 illustrates that this degenerate case can still bring a small improvement (compared to BertExt (Num = 1), 0.88 $\Delta$R-1 on Reddit, 0.82 $\Delta$R-1 on XSum). However, when the number of sentences increases to two and summary-level semantics need to be taken into account, MatchSum obtains a more remarkable improvement (compared to BertExt (Num = 2), 1.04 $\Delta$R-1 on Reddit, 1.62 $\Delta$R-1 on XSum).

In addition, our model maps each candidate summary as a whole into semantic space, so it can flexibly choose any number of sentences, while most other methods can only extract a fixed number of sentences. From Table 4, we can see that this advantage leads to further performance improvement.

Results on Datasets with Long Summaries

When the summary is relatively long, summary-level matching becomes more complicated and is harder to learn. We aim to compare the difference between Trigram Blocking and our model when dealing with long summaries.

Table 5 shows that although Trigram Blocking works well on CNN/DM, it does not always maintain a stable improvement. Ngram Blocking has little effect on WikiHow and Multi-News, and it causes a large performance drop on PubMed. We think the reason is that Ngram Blocking cannot really understand the semantics of sentences or summaries; it merely restricts multi-word entities to appearing only once, which is obviously not suitable for the scientific domain, where entities may often appear multiple times.

In contrast, our proposed method does not impose these strong constraints, but aligns the original document with the summary in semantic space. Experimental results show that our model is robust across all domains. On WikiHow in particular, MatchSum beats the state-of-the-art BERT model by 1.54 ROUGE-1.

4 Analysis

In the following, our analysis is driven by two questions:

1) Are the benefits of MatchSum consistent with the properties of the datasets analyzed in Section 3?

2) Why has our model achieved different performance gains on diverse datasets?

We choose the three datasets (XSum, CNN/DM and WikiHow) with the largest performance gain for this experiment. We split each test set into five parts of roughly equal size according to $z$ described in Section 3.2, and then experiment with each subset.

Figure 4 shows that the performance gap between MatchSum and BertExt is always the smallest when the best-summary is not a pearl-summary ($z=1$). This is in line with our understanding: in these samples, the ability of the summary-level extractor to discover pearl-summaries brings no advantage.

As $z$ increases, the performance gap generally tends to increase. Specifically, the benefit of MatchSum on CNN/DM is highly consistent with the appearance of pearl-summaries. It brings an improvement of only 0.49 in the subset with the smallest $z$, but this rises sharply to 1.57 when $z$ reaches its maximum value. WikiHow is similar to CNN/DM: when the best-summary consists entirely of the highest-scoring sentences, the performance gap is clearly smaller than in other samples. XSum is slightly different: although the trend remains the same, our model does not perform well on the samples with the largest $z$, which needs further improvement and exploration.

From the above comparison, we can see that the performance improvement of MatchSum is concentrated in the samples with more pearl-summaries, which illustrates our semantic-based summary-level model can capture sentences that are not particularly good when viewed individually, thereby forming a better summary.

Comparison Across Datasets

Intuitively, improvements brought by the MatchSum framework should be associated with the inherent gaps presented in Section 3.3. To better understand their relation, we introduce $\Delta(\mathcal{D})^{*}$ as follows:

where $C_{MS}$ and $C_{BE}$ represent the candidate summary selected by MatchSum and BertExt in the document $D$, respectively. Therefore, $\Delta(\mathcal{D})^{*}$ can indicate the improvement by MatchSum over BertExt on dataset $\mathcal{D}$. Moreover, compared with the inherent gap between sentence-level and summary-level extractors, we define the ratio that MatchSum can learn on dataset $\mathcal{D}$ as:

where $\Delta(\mathcal{D})$ is the inherent gap between sentence-level and summary-level extractors.
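The ratio described above is the mean per-document improvement of MatchSum over BertExt divided by the inherent gap. A minimal sketch, assuming per-document summary-level scores are available as two aligned lists; the function name and all numbers are hypothetical and do not reproduce any reported result.

```python
def learned_ratio(matchsum_scores, bertext_scores, inherent_gap):
    """Share of the inherent gap that is recovered on one dataset.

    matchsum_scores, bertext_scores: per-document summary-level scores of the
        summaries chosen by MatchSum and BertExt (same document order).
    inherent_gap: dataset-level gap between summary-level and sentence-level
        upper bounds.
    """
    improvements = [m - b for m, b in zip(matchsum_scores, bertext_scores)]
    mean_improvement = sum(improvements) / len(improvements)
    return mean_improvement / inherent_gap


# Hypothetical per-document scores and gap, for illustration only.
psi = learned_ratio([41.0, 39.5, 44.0], [40.0, 39.0, 42.5], inherent_gap=4.0)
```

A value of 1 would mean the model closes the whole gap between the two upper bounds; the toy numbers here give one quarter of it.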

As Figure 5 makes clear, the value of $\psi(\mathcal{D})$ depends on $z$ (see Figure 2) and the length of the gold summary (see Table 1). As the gold summaries get longer, the upper bound of summary-level approaches becomes more difficult for our model to reach. MatchSum achieves a $\psi(\mathcal{D})$ of 0.64 on XSum (23.3-word summaries); however, $\psi(\mathcal{D})$ is less than 0.2 on PubMed and Multi-News, whose summary lengths exceed 200. From another perspective, when the summary lengths are similar, our model performs better on datasets with more pearl-summaries. For instance, $z$ is evenly distributed in Multi-News (see Figure 2), so a higher $\psi(\mathcal{D})$ (0.18) can be obtained than on PubMed (0.09), which has the fewest pearl-summaries.

A better understanding of the dataset allows us to get a clear awareness of the strengths and limitations of our framework, and we also hope that the above analysis could provide useful clues for future research on extractive summarization.

Conclusion

We formulate the extractive summarization task as a semantic text matching problem and propose a novel summary-level framework to match the source document and candidate summaries in the semantic space. We conduct an analysis to show how our model could better fit the characteristics of the data. Experimental results show MatchSum outperforms the current state-of-the-art extractive model on six benchmark datasets, which demonstrates the effectiveness of our method. We believe the power of this matching-based summarization framework has not been fully exploited. In the future, more forms of matching models can be explored to instantiate the proposed framework.

Acknowledgment

We would like to thank the anonymous reviewers for their valuable comments. This work is supported by the National Key Research and Development Program of China (No. 2018YFC0831103), National Natural Science Foundation of China (No. U1936214 and 61672162), Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01) and ZJLab.

References