How to Write Summaries with Patterns? Learning towards Abstractive Summarization through Prototype Editing
Shen Gao, Xiuying Chen, Piji Li, Zhangming Chan, Dongyan Zhao, Rui Yan
Introduction
Abstractive summarization can be regarded as a sequence mapping task that maps the source text to the target summary Rush et al. (2015); Li et al. (2017); Cao et al. (2018); Gao et al. (2019a). It has drawn significant attention since the introduction of deep neural networks to natural language processing. In some settings, the generated summaries are required to conform to a specific pattern, as with court judgments, diagnosis certificates, and abstracts of academic papers. In court judgments, for example, there is always a statement of the crime committed by the accused, followed by the motives and the results of the judgment. An example case is shown in Table 1, where the summary shares the same writing style and has words in common with the prototype summary (retrieved from the training dataset).
Existing prototype-based generation models such as Wu et al. (2018) have all been applied to short text and thus cannot handle the summarization of long documents. Another line of work focuses on template-based methods such as Oya et al. (2014). However, template-based methods are too rigid for our patternized summary generation task. Hence, in this paper, we propose a summarization framework named Prototype Editing based Summary Generator (PESG) that incorporates prototype document-summary pairs to improve summarization performance when generating patternized summaries. First, we calculate the cross dependency between the prototype document-summary pair to obtain a summary pattern and prototype facts (explained in § 4.2). Then, we extract facts from the input document with the help of the prototype facts (explained in § 4.3). Next, a recurrent neural network (RNN) based decoder is used to generate a new summary, incorporating both the summary pattern and extracted facts (explained in § 4.4). Finally, a fact checker is designed to provide mutual information between the generated summary and the input document to prevent the generator from copying irrelevant facts from the prototype (explained in § 4.5). To evaluate PESG, we collect a large-scale court judgment dataset, where each judgment is a summary of the case description with a patternized style. Extensive experiments conducted on this dataset show that PESG outperforms the state-of-the-art summarization baselines in terms of ROUGE metrics and human evaluations by a large margin.
Our contributions can be summarized as follows:
We propose to use prototype information to help generate better summaries with patterns.
Specifically, we propose to generate the summary by incorporating the prototype summary pattern and facts extracted from the input document.
We provide a mutual information signal for the generator to prevent it from copying irrelevant facts from the prototype.
We release a large-scale prototype based summarization dataset that is beneficial for the community.
Related Work
We detail related work on text summarization and prototype editing.
Text summarization can be classified into extractive and abstractive methods. Extractive methods Narayan et al. (2018b); Chen et al. (2018) directly select salient sentences from an article to compose a summary. One shortcoming of these models is that they tend to suffer from redundancy. Recently, with the emergence of neural network models for text generation, a vast majority of the literature on summarization Ma et al. (2018); Zhou et al. (2018); Gao et al. (2019a); Chen et al. (2019) is dedicated to abstractive summarization, which aims to generate new content that concisely paraphrases a document from scratch.
Another line of research focuses on prototype editing. Guu et al. (2018) proposed the first prototype editing model, which samples a prototype sentence from training data and then edits it into a new sentence. Following this work, Wu et al. (2018) proposed a new paradigm for response generation, which first retrieves a prototype response from a pre-defined index and then edits the prototype response. Cao et al. (2018) applied this method to summarization, where they employed existing summaries as soft templates to generate new summaries without modeling the dependency between the prototype document, summary and input document. Unlike these soft attention methods, Cai et al. (2018) proposed a hard-editing skeleton-based model to promote the coherence of generated stories. Template-based summarization is also a hard-editing method Oya et al. (2014), where a multi-sentence fusion algorithm is extended in order to generate summary templates.
Unlike all of the above work, our model focuses on patternized summary generation, which is more challenging than traditional news summarization and short-sentence prototype editing.
Problem Formulation
For an input document, we assume there is a ground truth summary. In our prototype summarization task, a retrieved prototype document with a corresponding prototype summary is also attached according to their similarity with the input document.
For a given document, our model extracts salient facts from it, guided by a prototype document, and then generates the summary by referring to the prototype summary. The goal is to generate a summary that not only follows a patternized style (as defined by the prototype summary) but also is consistent with the facts in the input document.
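The retrieval step above can be sketched as follows. This is only an illustration: the text says the prototype pair is attached according to similarity with the input but does not name the similarity measure here, so the token-overlap (Jaccard) score and the names `jaccard` and `retrieve_prototype` are assumptions.

```python
def jaccard(a_tokens, b_tokens):
    """Token-set overlap; an assumed stand-in for the unspecified similarity."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_prototype(input_doc, training_pairs):
    """Return the (document, summary) pair whose document is most similar to input_doc."""
    return max(
        training_pairs,
        key=lambda pair: jaccard(input_doc.split(), pair[0].split()),
    )


# Toy training set of (document, summary) pairs, invented for illustration.
training_pairs = [
    ("PERS stole a bicycle from the market", "PERS is guilty of theft"),
    ("PERS drove a car after drinking alcohol", "PERS is guilty of dangerous driving"),
]
proto_doc, proto_summary = retrieve_prototype(
    "PERS stole a phone from the shop", training_pairs
)
```

On a corpus of millions of documents, an inverted index or another retrieval engine would likely replace this linear scan.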
Model
In this section, we propose our prototype editing based summary generator, which can be split into two main parts, as shown in Figure 1:
Summary Generator. (1) The Prototype Reader analyzes the dependency between the prototype document and the prototype summary to determine the summary pattern and prototype facts. (2) The Fact Extraction module extracts facts from the input document under the guidance of the prototype facts. (3) The Editing Generator module generates the summary of the input document by incorporating the summary pattern and facts.
Fact Checker estimates the mutual information between the generated summary and the input document. This information provides an additional signal for the generation process, preventing irrelevant facts from being copied from the prototype document.
4.2 Prototype Reader
To begin with, we use an embedding matrix to map a one-hot representation of each word in , , into a high-dimensional vector space. We then employ a bi-directional recurrent neural network (Bi-RNN) to model the temporal interactions between words:
where , and denote the hidden state of -th step in Bi-RNN for , and , respectively. Following Tao et al. (2018); Gao et al. (2019b); Hu et al. (2019), we choose long short-term memory (LSTM) as the cell for Bi-RNN.
where is a trainable scalar function that calculates the similarity between two input vectors. denotes a concatenation operation and is an element-wise multiplication.
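As a hedged illustration of such a similarity function, one common form that uses both concatenation and element-wise multiplication scores the vector [x ; y ; x*y] with a weight vector. Whether this matches the exact equation of the paper cannot be confirmed from the text; the function name `similarity` and the weight vector `w` are assumptions.

```python
def similarity(x, y, w):
    """Scalar score w . [x ; y ; x*y]; w stands in for the trainable weights."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    # Concatenate x, y and their element-wise product into one feature vector.
    features = list(x) + list(y) + [xi * yi for xi, yi in zip(x, y)]
    if len(w) != len(features):
        raise ValueError("w must have 3 * len(x) entries")
    return sum(wi * fi for wi, fi in zip(w, features))


# With all weights set to 1 the score is the sum of every feature.
score = similarity([1.0, 2.0], [3.0, 4.0], [1.0] * 6)
```

In a trained model, `w` would be learned jointly with the rest of the network rather than fixed.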
We sum the prototype facts to obtain the overall representation of these facts:
4.3 Fact Extraction
In this section, we discuss how to extract useful facts from an input document with the help of prototype facts.
We first extract the facts from an input document by calculating their relevance to prototype facts. The similarity matrix is then calculated between the weighted prototype document and input document representation :
where is the similarity function introduced in Equation 4. Then, we sum up along the length of the prototype document to obtain the weight for -th word in the document. Next, similar to Equation 6, a convolutional layer is applied on the weighted hidden states to obtain the fact representation from the input document:
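The weighting step described above can be sketched in a few lines: score every input word against every prototype position, sum over the prototype axis to get one weight per input word, and scale the hidden states. The dot product used as the default score is a placeholder assumption (the paper uses its trainable similarity function), and any normalization of the weights is not stated in the text.

```python
def dot(u, v):
    """Placeholder similarity; the paper uses a trainable function instead."""
    return sum(a * b for a, b in zip(u, v))


def word_weights(doc_states, proto_states, score=dot):
    """Sum each input word's similarity scores over all prototype positions."""
    return [sum(score(h, p) for p in proto_states) for h in doc_states]


def weight_states(doc_states, weights):
    """Scale every hidden state by the weight of its word."""
    return [[w * v for v in h] for h, w in zip(doc_states, weights)]


# Toy hidden states: three input words, two prototype positions.
doc_states = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
proto_states = [[1.0, 1.0], [0.5, 0.0]]
weights = word_weights(doc_states, proto_states)
weighted = weight_states(doc_states, weights)
```

The convolutional layer mentioned in the text would then be applied on `weighted` to produce the fact representations.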
Inspired by the polishing strategy in extractive summarization Chen et al. (2018), we propose to use the prototype facts to polish the extracted facts and obtain the final fact representation , as shown in Figure 2. Generally, the polishing process consists of two hierarchical recurrent layers. The first recurrent layer is made up of Selective Recurrent Units (SRUs), which take facts and polished fact as input, outputting the hidden state . The second recurrent layer consists of regular Gated Recurrent Units (GRUs), which are used to update the polished fact from to using .
SRU is a modified version of the original GRU introduced in Chen et al. (2018), details of which can be found in Appendix A. Its difference from GRU lies in that the update gate in SRU is decided by both the polished fact and original fact together. The -th hidden state of SRU is calculated as:
We take as the overall representation of all input facts . In this way, SRU can decide to which degree each unit should be updated based on its relationship with the polished fact .
Next, is used to update the polished fact using the second recurrent layer, consisting of GRUs:
where is the cell state, is the input and is the output hidden state. is initialized using in Equation 7. This iterative process is conducted times, and each output is stored as extracted facts . In this way, stores facts with different polished levels.
4.4 Editing Generator
The editing generator aims to generate a summary based on the input document, prototype summary and extracted facts. As with the settings of prototype reader, we use LSTM as the RNN cell. We first apply a linear transformation on the summation of the summary pattern and input document representations , and then employ this vector as the initial state of the RNN generator as shown in Equation 12. The procedure of -th generation is shown in Equation 13:
where are trainable parameters, is the hidden state of the -th generating step, and is the context vector produced by the standard attention mechanism Bahdanau et al. (2015).
To take advantage of the extracted facts and the prototype summary, we incorporate both into summary generation using dynamic attention. More specifically, we utilize a matching function to model the relationship between the current decoding state and each candidate vector, which can be an extracted fact or a summary pattern:
where can be or for attending to extracted facts or a summary pattern, respectively. We use a simple but efficient bi-linear layer as the matching function . As for combining and , we propose to use an “editing gate” , which is determined by the decoder state , to decide the importance of the summary pattern and extracted facts at each decoding step.
where denotes the sigmoid function. Using the editing gate, we obtain which dynamically combines information from the extracted facts and summary pattern with the editing gate , as:
Finally, the context vector is concatenated with the decoder state and fed into a linear layer to obtain the generated word distribution :
The loss is the negative log likelihood of the target word :
In order to handle the out-of-vocabulary (OOV) problem, we equip our decoder with a pointer network Gu et al. (2016); Vinyals et al. (2015); See et al. (2017). This process is the same as in the model described in See et al. (2017) and is therefore omitted here due to limited space.
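The dynamic attention and editing gate described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter names (`W`, `w_gate`, `b_gate`) are hypothetical, and the gate direction (a higher gate favours extracted facts, a lower gate favours the summary pattern) is inferred from the visualization discussion in § 6.3.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear(s, W, k):
    """Bi-linear matching score s^T W k."""
    return sum(s[i] * W[i][j] * k[j] for i in range(len(s)) for j in range(len(k)))


def attend(s, W, keys):
    """Attention-weighted sum of keys, scored against the decoder state s."""
    weights = softmax([bilinear(s, W, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]


def editing_gate(s, w_gate, b_gate):
    """Sigmoid gate computed from the decoder state only."""
    z = sum(wi * si for wi, si in zip(w_gate, s)) + b_gate
    return 1.0 / (1.0 + math.exp(-z))


def combine(gate, fact_ctx, pattern_ctx):
    """Higher gate favours facts; lower gate favours the summary pattern."""
    return [gate * f + (1.0 - gate) * p for f, p in zip(fact_ctx, pattern_ctx)]


# Toy example with a 2-dimensional decoder state.
state = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
fact_ctx = attend(state, identity, [[1.0, 0.0], [0.0, 1.0]])
pattern_ctx = attend(state, identity, [[2.0, 2.0]])
gate = editing_gate(state, [0.0, 0.0], 0.0)
context = combine(gate, fact_ctx, pattern_ctx)
```

Because the gate is recomputed at every decoding step, the decoder can lean on the pattern for formulaic words and on the facts for case-specific words.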
Moreover, previous work Holtzman et al. (2018) has found that using a cross-entropy loss alone is not enough for generating coherent text. Similarly, in our task, using the negative log-likelihood loss alone is not enough to distinguish a good summary with accurate facts from a bad summary that copies detailed facts from the prototype document (see § 6.2). Thus, we propose a fact checker to determine whether the generated summary is highly related to the input document.
4.5 Fact Checker
To generate accurate summaries that are consistent with the detailed facts from the input document rather than facts from the prototype document, we add a fact checker to provide additional training signals for the generator. Following Hjelm et al. (2019), we employ the neural mutual information estimator to estimate the mutual information between the generated summary and its corresponding document , as well as the prototype document . Generally, mutual information is estimated from a local and global level, and we expect the matching degree to be higher between the generated summary and input document than the prototype document. An overview of the fact checker is shown in Figure 3.
To begin, we use a local matching network to calculate the matching degree, for local features, between the generated summary and the input, as well as prototype document. Remember that, in § 4.3, we obtain the fact representation of an input document and prototype facts . Combining these with the final hidden state of the generator RNN (in Equation 13), yields the local features of input extracted facts and the prototype facts:
A convolutional layer and a fully-connected layer are applied to score these two features:
We also have a global matching network to measure the matching degree, for global features, between the generated summary and the input document, as well as prototype document. To do so, we concatenate the representation of the generated summary with the final hidden state of the input document and final state of the prototype document , respectively, and apply a linear layer to these:
Finally, we combine the local and global loss functions to obtain the final loss , which we use to calculate the gradients for all parameters:
where are both hyper parameters. To optimize the trainable parameters, we employ the gradient descent method Adagrad Duchi et al. (2010) to update all parameters.
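To make the training signal concrete, here is a hedged sketch of how a matching-score objective and a weighted total loss could look. The softplus-based contrastive form is borrowed from the Jensen-Shannon estimator of Hjelm et al. (2019) and is an assumption about the exact loss used here; `alpha` and `beta` are hypothetical names for the two hyper-parameters, and the additive combination is likewise assumed.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def contrastive_loss(score_input, score_prototype):
    """Low when the summary matches the input document better than the prototype."""
    return softplus(-score_input) + softplus(score_prototype)


def total_loss(generation_loss, local_loss, global_loss, alpha=1.0, beta=1.0):
    """Weighted sum; alpha and beta are hypothetical names for the two weights."""
    return generation_loss + alpha * local_loss + beta * global_loss


# A summary that matches its own document (high score) and not the
# prototype (low score) is penalized less than the reverse case.
good = contrastive_loss(5.0, -5.0)
bad = contrastive_loss(-5.0, 5.0)
```

The same contrastive form could be applied once to the local matching scores and once to the global ones before they are combined.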
Experimental Setup
We collect a large-scale prototype-based summarization dataset (https://github.com/gsh199449/proto-summ), which contains 2,003,390 court judgment documents. In this dataset, we use a case description as an input document and the court judgment as the summary. The average lengths of the input documents and summaries are 595.15 words and 273.57 words respectively. The percentage of words common to a prototype summary and the reference summary is 80.66%, which confirms the feasibility and necessity of prototype summarization. Following other summarization datasets Grusky et al. (2018); Kim et al. (2019); Narayan et al. (2018a), we also count the novel n-grams in a summary compared with the n-grams in the original document; the percentages of novel n-grams are 51.21%, 84.59%, 91.48% and 94.83% for 1-grams to 4-grams, respectively. Coverage, compression and density Grusky et al. (2018) are commonly used metrics for evaluating the abstractness of a summary. For the summaries in our dataset, the coverage percentage is 48.78%, compression is 2.28 and density is 1.31. We anonymize entity tokens into special tags, such as using “PERS” to replace a person’s name.
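The novel n-gram statistic can be computed roughly as below. This is a sketch under assumptions: it counts n-gram occurrences (not unique types) in the summary and works on whitespace-separated tokens, neither of which is specified in the text; the example sentences are invented.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_percentage(document_tokens, summary_tokens, n):
    """Percentage of summary n-grams that never occur in the document."""
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    document_ngrams = set(ngrams(document_tokens, n))
    novel = sum(1 for g in summary_ngrams if g not in document_ngrams)
    return 100.0 * novel / len(summary_ngrams)


doc = "the defendant stole a bicycle from the market".split()
summ = "the court held that the defendant stole property".split()
novel_1 = novel_ngram_percentage(doc, summ, 1)
novel_2 = novel_ngram_percentage(doc, summ, 2)
```

As in the reported figures, the share of novel n-grams grows with n because longer n-grams are less likely to be copied verbatim.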
5.2 Comparisons
To verify the effectiveness of each module of PESG, we conduct several ablation studies, shown in Table 2. We also compare our model with the following baselines: (1) Lead-3 is a commonly used summarization baseline Nallapati et al. (2017); See et al. (2017), which selects the first three sentences of the document as the summary. (2) S2S is a sequence-to-sequence framework with a pointer network, proposed by See et al. (2017). (3) Proto is a context-aware prototype editing dialog response generation model proposed by Wu et al. (2018). (4) Re3Sum, proposed by Cao et al. (2018), uses an IR platform to retrieve proper summaries and extends the seq2seq framework to jointly conduct template-aware summary generation. (5) Uni-model was proposed by Hsu et al. (2018), and is the current state-of-the-art abstractive summarization approach on the CNN/DailyMail dataset. (6) We also directly concatenate the prototype summary with the original document as input for S2S and Uni-model, named Concat-S2S and Concat-Uni, respectively.
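Of these baselines, Lead-3 is simple enough to sketch directly. The sentence splitter below is an assumed, simplistic one (it splits after Western or Chinese sentence-final punctuation and would also split decimals such as "3.5"); the function name `lead_n` is hypothetical.

```python
import re


def lead_n(document, n=3):
    """Return the first n sentences of the document as the summary."""
    # Split after sentence-final punctuation; empty pieces are discarded.
    pieces = re.split(r"(?<=[.!?。！？])\s*", document)
    sentences = [s.strip() for s in pieces if s.strip()]
    return " ".join(sentences[:n])


summary = lead_n("One. Two. Three. Four.")
```

Documents with fewer than three sentences are returned unchanged apart from whitespace normalization.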
5.3 Evaluation Metrics
For the court judgment dataset, we report the standard ROUGE-1, ROUGE-2 and ROUGE-L Lin (2004) full-length F1 scores following previous works Nallapati et al. (2017); See et al. (2017); Paulus et al. (2018), where ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) refer to the matches of unigrams, bigrams, and the longest common subsequence respectively.
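For readers unfamiliar with these metrics, the following sketch computes simplified ROUGE-N and ROUGE-L F1 scores for a single candidate and reference. It is an illustration only: it omits the stemming, multi-reference handling and sentence-level LCS union of the official ROUGE toolkit, and the example sentences are invented.

```python
from collections import Counter


def f1(matches, candidate_total, reference_total):
    if matches == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = matches / candidate_total
    recall = matches / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """F1 over clipped n-gram matches."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matches = sum((cand & ref).values())  # multiset intersection clips counts
    return f1(matches, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


cand = "the court held that PERS is guilty".split()
ref = "the court held PERS guilty of theft".split()
r1, r2, rl = rouge_n(cand, ref, 1), rouge_n(cand, ref, 2), rouge_l(cand, ref)
```

Published results should be computed with an established ROUGE implementation so that scores remain comparable across papers.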
Schluter (2017) notes that only using the ROUGE metric to evaluate summarization quality can be misleading. Therefore, we also evaluate our model by human evaluation. Three highly educated participants are asked to score 100 randomly sampled summaries generated by three models: Uni-model, Re3Sum and PESG. The statistical significance of observed differences between the performance of two runs is tested using a two-tailed paired t-test and is denoted using ▲(or ▼) for strong (or weak) significance for .
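The paired t-test used for significance testing can be illustrated with a small sketch that computes the test statistic from per-item score pairs. The scores below are invented for illustration; turning the statistic into a two-tailed p-value additionally requires the t distribution with the returned degrees of freedom, which statistics libraries provide (for example `scipy.stats.ttest_rel` in SciPy).

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return the t statistic and degrees of freedom for paired samples."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if variance == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean / math.sqrt(variance / n), n - 1


# Invented ratings of two systems on the same five summaries.
t_value, dof = paired_t_statistic([3, 3, 2, 3, 2], [2, 3, 1, 2, 2])
```

Pairing matters here because both systems are rated on the same sampled summaries, so per-item differences remove item-level variation.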
5.4 Implementation Details
We implement our experiments in TensorFlow Abadi et al. (2016) on an NVIDIA GTX 1080 Ti GPU. The word embedding dimension is 256 and the number of hidden units is 256. The batch size is set to 64. We pad or truncate each input document to exactly 250 words, and the decoding length is set to 100. The two hyper-parameters in Equation 28 are both set to 1.0. We initialize all of the parameters randomly using a Gaussian distribution. We use the Adagrad optimizer Duchi et al. (2010) as our optimizing algorithm and employ beam search with size 5 to generate more fluent summaries. We also apply gradient clipping Pascanu et al. (2013) with range $p=0.7$.
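The fixed-length input preparation can be sketched as follows. The padding token `<pad>` and right-side padding are assumptions, since the text only states that documents are padded or cut to 250 words.

```python
def pad_or_truncate(tokens, length=250, pad_token="<pad>"):
    """Return exactly `length` tokens: cut long inputs, right-pad short ones."""
    clipped = list(tokens[:length])
    return clipped + [pad_token] * (length - len(clipped))


long_doc = pad_or_truncate(["word"] * 300)
short_doc = pad_or_truncate(["a", "b"], length=4)
```

Given the reported average document length of 595.15 words, truncation to 250 words discards a substantial part of many inputs.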
Experimental Result
We compare our model with the baselines listed in Table 3. Our model performs consistently better than other summarization models, including the state-of-the-art model, with improvements of 6%, 12% and 6% in terms of ROUGE-1, ROUGE-2 and ROUGE-L. This demonstrates that the prototype document-summary pair provides strong guidance for summary generation that cannot be replaced by more complicated baselines without prototype information. Meanwhile, directly concatenating the prototype summary with the original input does not increase performance, instead leading to drops of 9%, 17%, 8% and 1%, 3%, 2% in terms of ROUGE-1, ROUGE-2 and ROUGE-L on S2S and Uni-model, respectively. As for the baseline model Proto, we found that it directly copies the prototype summary as the generated summary, which yields a summary that is factually incorrect for the input document and therefore useless.
For the human evaluation, we asked annotators to rate each summary according to its consistency and fluency. The rating score ranges from 1 to 3, with 3 being the best. Table 4 lists the average scores of each model, showing that PESG outperforms the other baseline models in both fluency and consistency. The kappa statistics are 0.33 and 0.29 for fluency and consistency respectively, which indicates only fair agreement between annotators. To test the significance of these results, we also conduct a paired Student's t-test between our model and Re3Sum (row with shaded background). We obtain a p-value of and for fluency and consistency, respectively.
We also analyze how performance varies with the two hyper-parameters from Equation 28. Our model performs consistently well across the tested settings, with ROUGE-1, ROUGE-2 and ROUGE-L scores above 39.5, 27.5 and 39.4, respectively, which indicates that it is robust to the choice of these hyper-parameters.
6.2 Ablation Study
The ROUGE scores of different ablation models are shown in Table 5. All ablation models perform worse than PESG in terms of all metrics, which demonstrates the advantage of the full model. More importantly, this controlled experiment lets us verify the contribution of each module in PESG.
6.3 Analysis of Editing Generator
We visualize the editing gate (illustrated in Equation 16) of two randomly sampled examples, shown in Figure 4. A lower weight (lighter color) means that the word is more likely to be copied from the summary pattern; that is to say, this word is a universal patternized word. We can see that the phrase 本院认为 (the court held that) has a lower weight than the name of the defendant (PERS), which is consistent with the fact that 本院认为 (the court held that) is a patternized phrase, whereas the name of the defendant is closely related to the input document.
We also show a case study in Table 6, which includes the input document and reference summary with the generated summaries. Underlined text denotes a grammar error and a strike-through line denotes a fact contrary to the input document. We only show part of the document and summary due to limited space; the full version is shown in the Appendix. As can be seen, the summary generated by Uni-model faces an inconsistency problem and the summary generated by Re3Sum is contrary to the facts described in the input document. However, PESG overcomes both of these problems and generates an accurate summary with good grammar and logic.
6.4 Analysis of Fact Extraction Module
We investigate the influence of the iteration number when facts are extracted. Figure 5 illustrates the relationship between the iteration number and the F-score of ROUGE. The results show that the ROUGE scores first increase with the number of hops and, after reaching a peak, begin to drop. This indicates that polishing the fact representation is effective, although too many hops hurt performance.
Conclusion
In this paper, we propose a framework named Prototype Editing based Summary Generator (PESG), which aims to generate summaries in formal writing scenarios, where summaries should conform to a patternized style. Given a prototype document-summary pair, our model first calculates the cross dependency between the prototype document-summary pair. Next, a fact extraction module is employed to extract facts from the document, which are then polished. Finally, we design an editing-based generator to produce a summary by incorporating the polished fact and summary pattern. To ensure that the generated summary is consistent with the input document, we propose a fact checker to estimate the mutual information between the input document and generated summary. Our model outperforms state-of-the-art methods in terms of ROUGE scores and human evaluations by a large margin, which demonstrates the effectiveness of PESG.
Acknowledgments
We would like to thank the anonymous reviewers for their constructive comments. We would also like to thank Anna Hennig at the Inception Institute of Artificial Intelligence for her help on this paper. This work was supported by the National Key Research and Development Program of China (No. 2017YFC0804001) and the National Science Foundation of China (NSFC No. 61876196 and NSFC No. 61672058).
Appendix A SRU Cell
The gated recurrent unit (GRU) Cho et al. (2014) is a gating mechanism in recurrent neural networks, which incorporates an update gate into an RNN. We first give the details of the original GRU here.
where are all trainable parameters and is the hop number in the multi-hop situation which is a hyper-parameter manually set. The effectiveness of this hyper-parameter is verified in the experimental results shown in § 6.4. Equation 32 now becomes:
We use the name “SRU” to denote this modified version of a GRU cell.
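For reference, the standard GRU update that the SRU modifies is commonly written as below. The symbol names are assumptions and may differ from the notation of the paper's own equations, and some presentations swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Here $z_t$ is the update gate, $r_t$ is the reset gate and $\odot$ is element-wise multiplication. As described in § 4.3, the SRU differs in that its update gate is decided by both the polished fact and the original fact.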