Heterogeneous Graph Neural Networks for Extractive Document Summarization

Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, Xuanjing Huang

Introduction

Extractive document summarization aims to extract relevant sentences from the original documents and reorganize them as the summary. Recent years have seen a resounding success in the use of deep neural networks on this task Cheng and Lapata (2016); Narayan et al. (2018); Arumae and Liu (2018); Zhong et al. (2019a); Liu and Lapata (2019b). These existing models mainly follow the encoder-decoder framework in which each sentence will be encoded by neural components with different forms.

To effectively extract the summary-worthy sentences from a document, a core step is to model the cross-sentence relations. Most current models capture cross-sentence relations with recurrent neural networks (RNNs) Cheng and Lapata (2016); Nallapati et al. (2017); Zhou et al. (2018). However, RNNs-based models are usually hard to capture sentence-level long-distance dependency, especially in the case of the long document or multi-documents. One more intuitive way is to model the relations of sentences using the graph structure. Nevertheless, it is challenging to find an effective graph structure for summarization. Efforts have been made in various ways. Early traditional work makes use of inter-sentence cosine similarity to build the connectivity graph like LexRank Erkan and Radev (2004) and TextRank Mihalcea and Tarau (2004). Recently, some works account for discourse inter-sentential relationships when building summarization graphs, such as the Approximate Discourse Graph (ADG) with sentence personalization features Yasunaga et al. (2017) and Rhetorical Structure Theory (RST) graph Xu et al. (2019). However, they usually rely on external tools and need to take account of the error propagation problem. A more straightforward way is to create a sentence-level fully-connected graph. To some extent, the Transformer encoder Vaswani et al. (2017) used in recent workZhong et al. (2019a); Liu and Lapata (2019b) can be classified into this type, which learns the pairwise interaction between sentences. Despite their success, how to construct an effective graph structure for summarization remains an open question.

In this paper, we propose a heterogeneous graph network for extractive summarization. Instead of solely building graphs on sentence-level nodes, we introduce more semantic units as additional nodes in the graph to enrich the relationships between sentences. These additional nodes act as the intermediary that connects sentences. Namely, each additional node can be viewed as a special relationship between sentences containing it. During the massage passing over the heterogeneous graph, these additional nodes will be iteratively updated as well as sentence nodes.

Although more advanced features can be used (e.g., entities or topics), for simplicity, we use words as the semantic units in this paper. Each sentence is connected to its contained words. There are no direct edges for all the sentence pairs and word pairs. The constructed heterogeneous word-sentence graph has the following advantages: (a) Different sentences can interact with each other in consideration of the explicit overlapping word information. (b) The word nodes can also aggregate information from sentences and get updated. Unlike ours, existing models usually keep the words unchanged as the embedding layer. (c) Different granularities of information can be fully used through multiple message passing processes. (d) Our heterogeneous graph network is expandable for more types of nodes. For example, we can introduce document nodes for multi-document summarization.

We highlight our contributions as follows:

(1) To our knowledge, we are the first one to construct a heterogeneous graph network for extractive document summarization to model the relations between sentences, which contains not only sentence nodes but also other semantic units. Although we just use word nodes in this paper, more superior semantic units (e.g. entities) can be incorporated.

(2) Our proposed framework is very flexible in extension that can be easily adapt from single-document to multi-document summarization tasks.

(3) Our model can outperform all existing competitors on three benchmark datasets without the pre-trained language modelsSince our proposed model is orthogonal to the methods that using pre-trained models, we believe our model can be further boosted by taking the pre-trained models to initialize the node representations, which we reserve for the future.. Ablation studies and qualitative analysis show the effectiveness of our models.

Related Work

With the development of neural networks, great progress has been made in extractive document summarization. Most of them focus on the encoder-decoder framework and use recurrent neural networks Cheng and Lapata (2016); Nallapati et al. (2017); Zhou et al. (2018) or Transformer encoders Zhong et al. (2019b); Wang et al. (2019a) for the sentential encoding. Recently, pre-trained language models are also applied in summarization for contextual word representations Zhong et al. (2019a); Liu and Lapata (2019b); Xu et al. (2019); Zhong et al. (2020).

Another intuitive structure for extractive summarization is the graph, which can better utilize the statistical or linguistic information between sentences. Early works focus on document graphs constructed with the content similarity among sentences, like LexRank Erkan and Radev (2004) and TextRank Mihalcea and Tarau (2004). Some recent works aim to incorporate a relational priori into the encoder by graph neural networks (GNNs) Yasunaga et al. (2017); Xu et al. (2019). Methodologically, these works only use one type of nodes, which formulate each document as a homogeneous graph.

Heterogeneous Graph for NLP

Graph neural networks and their associated learning methods (i.e. message passing Gilmer et al. (2017), self-attention Velickovic et al. (2017)) are originally designed for the homogeneous graph where the whole graph shares the same type of nodes. However, the graph in the real-world application usually comes with multiple types of nodes Shi et al. (2016), namely the heterogeneous graph. To model these structures, recent works have made preliminary exploration. Tu et al. (2019) introduced a heterogeneous graph neural network to encode documents, entities and candidates together for multi-hop reading comprehension. Linmei et al. (2019) focused on semi-supervised short text classification and constructed a topic-entity heterogeneous neural graph.

For summarization, Wei (2012) proposes a heterogeneous graph consisting of topic, word and sentence nodes and uses the markov chain model for the iterative update. Wang et al. (2019b) modify TextRank for their graph with keywords and sentences and thus put forward HeteroRank. Inspired by the success of the heterogeneous graph-based neural network on other NLP tasks, we introduce it to extractive text summarization to learn a better node representation.

Methodology

Given a document D={s1,,sn}D=\{s_{1},\cdots,s_{n}\} with nn sentences, we can formulate extractive summarization as a sequence labeling task as Narayan et al. (2018); Liu and Lapata (2019b). Our goal is to predict a sequence of labels y1,,yny_{1},\cdots,y_{n} (yi{0,1}y_{i}\in\{0,1\}) for sentences, where yi=1y_{i}=1 represents the ii-th sentence should be included in the summaries. The ground truth labels, which we call oracle, is extracted using the greedy approach introduced by Nallapati et al. (2016) with the automatic metrics ROUGE Lin and Hovy (2003).

Generally speaking, our heterogeneous summarization graph consists of two types of nodes: basic semantic nodes (e.g. words, concepts, etc.) as relay nodes and other units of discourse (e.g. phrases, sentences, documents, etc.) as supernodes. Each supernode connects with basic nodes contained in it and takes the importance of the relation as their edge feature. Thus, high-level discourse nodes can establish relationships between each other via basic nodes.

In this paper, we use words as the basic semantic nodes for simplicity. HeterSUMGraph in Section 3.1 is a special case which only contains one type of supernodes (sentences) for classification, while HeterDocSUMGraph in Section 3.5 use two (documents and sentences). Based on our framework, other types of supernodes (such as paragraphs) can also be introduced and the only difference lies in the graph structure.

Given a graph G={V,E}G=\{V,E\}, where VV stands for a node set and EE represents edges between nodes, our undirected heterogeneous graph can be formally defined as V=VwVsV=V_{w}\cup V_{s} and E={e11,,emn}E=\{e_{11},\cdots,e_{mn}\}. Here, Vw={w1,,wm}V_{w}=\{w_{1},\cdots,w_{m}\} denotes mm unique words of the document and Vs={s1,,sn}V_{s}=\{s_{1},\cdots,s_{n}\} corresponds to the nn sentences in the document. EE is a real-value edge weight matrix and eij0e_{ij}\neq 0 (i{1,,m},j{1,,n}i\in\{1,\cdots,m\},j\in\{1,\cdots,n\}) indicates the jj-th sentence contains the ii-th word.

Figure 1 presents the overview of our model, which mainly consists of three parts: graph initializers for nodes and edges, the heterogeneous graph layer and the sentence selector. The initializers first create nodes and edges and encode them for the document graph. Then the heterogeneous graph updates these node representations by iteratively passing messages between word and sentence nodes via Graph Attention Network (GAT) Velickovic et al. (2017). Finally, the representations of sentence nodes are extracted to predict labels for summaries.

2 Graph Initializers

To further include information about the importance of relationships between word and sentence nodes, we infuse TF-IDF values in the edge weights. The term frequency (TF) is the number of times wiw_{i} occurs in sjs_{j} and the inverse document frequency (IDF) is made as the inverse function of the out-degree of wiw_{i}.

3 Heterogeneous Graph Layer

Given a constructed graph GG with node features XwXs\textbf{X}_{w}\cup\textbf{X}_{s} and edge features E, we use graph attention networks Velickovic et al. (2017) to update the representations of our semantic nodes.

where Wa\mathbf{W}_{a}, Wq\mathbf{W}_{q}, Wk\mathbf{W}_{k}, Wv\mathbf{W}_{v} are trainable weights and αij\alpha_{ij} is the attention weight between hi\bm{h}_{i} and hj\bm{h}_{j}. The multi-head attention can be denoted as:

Besides, we also add a residual connection to avoid gradient vanishing after several iterations. Therefore, the final output can be represented as:

After each graph attention layer, we introduce a position-wise feed-forward (FFN) layer consisting of two linear transformations just as Transformer Vaswani et al. (2017).

To pass messages between word and sentence nodes, we define the information propagation as Figure 2. Specifically, after the initialization, we update sentence nodes with their neighbor word nodes via the above GAT and FFN layer:

After that, we obtain new representations for word nodes using the updated sentence nods and further update sentence nodes iteratively. Each iteration contains a sentence-to-word and a word-to-sentence update process. For the tt-th iteration, the process can be represented as:

As Figure 2 shows, word nodes can aggregate the document-level information from sentences. For example, the high degree of a word node indicates the word occurs in many sentences and is likely to be the keyword of the document. Regarding sentence nodes, the one with more important words tends to be selected as the summary.

4 Sentence Selector

Finally, we need to extract sentence nodes included in the summary from the heterogeneous graph. Therefore, we do node classification for sentences and cross-entropy loss is used as the training objective for the whole system.

Following Paulus et al. (2017) and Liu and Lapata (2019b), we use Trigram Blocking for decoding, which is simple but powerful version of Maximal Marginal Relevance Carbonell and Goldstein (1998). Specifically, we rank sentences by their scores and discard those which have trigram overlappings with their predecessors.

5 Multi-document Summarization

For multi-document summarization, the document-level relation is crucial for better understanding the core topic and most important content of this cluster. However, most existing neural models ignore this hierarchical structure and concatenate documents to a single flat sequenceLiu et al. (2018); Fabbri et al. (2019). Others try to model this relation by attention-based full-connected graph or take advantage of similarity or discourse relationsLiu and Lapata (2019a).

Our framework can establish the document-level relationship in the same way as the sentence-level by just adding supernodes for documents(as Figure 3), which means it can be easily adapted from single-document to multi-document summarization. The heterogeneous graph is then extended to three types of nodes: V=VwVsVdV=V_{w}\cup V_{s}\cup V_{d} and Vd={d1,,dl}V_{d}=\{d_{1},\cdots,d_{l}\} and ll is the number of source documents. We name it as HeterDocSUMGraph.

As we can see in Figure 3, word nodes become the bridges between sentences and documents. Sentences containing the same words connect with each other regardless of their distance across documents, while documents establish relationships based on their similar contents.

Document nodes can be viewed as a special type of sentence nodes: a document node connects with contained word nodes and the TF-IDF value is used as the edge weight. Besides, document nodes also share the same update process as sentence nodes. The differences lie in the initialization, where the document node takes the mean-pooling of its sentence node features as its initial state. During the sentence selection, the sentence nodes are concatenated with the corresponding document representations to obtain the final scores for multi-document summarization.

Experiment

We evaluate our models both on single- and multi-document summarization tasks. Below, we start our experiment with the description of the datasets.

The CNN/DailyMail question answering dataset Hermann et al. (2015); Nallapati et al. (2016) is the most widely used benchmark dataset for single-document summarization. The standard dataset split contains 287,227/13,368/11,490 examples for training, validation, and test. For the data prepossessing, we follow Liu and Lapata (2019b), which use the non-anonymized version as See et al. (2017), to get ground-truth labels.

NYT50

NYT50 is also a single-document summarization dataset, which was collected from New York Times Annotated Corpus Sandhaus (2008) and preprocessed by Durrett et al. (2016). It contains 110,540 articles with summaries and is split into 100,834 and 9706 for training and test. Following Durrett et al. (2016), we use the last 4,000 examples from the training set as validation and filter test examples to 3,452.

Multi-News

The Multi-News dataset is a large-scale multi-document summarization introduced by Fabbri et al. (2019). It contains 56,216 articles-summary pairs and each example consists of 2-10 source documents and a human-written summary. Following their experimental settings, we split the dataset into 44,972/5,622/5,622 for training, validation and test examples and truncate input articles to 500 tokens.

2 Settings and Hyper-parameters

For both single-document and multi-document summarization, we limit the vocabulary to 50,000 and initialize tokens with 300-dimensional GloVe embeddings Pennington et al. (2014). We filter stop words and punctuations when creating word nodes and truncate the input document to a maximum length of 50 sentences. To get rid of the noisy common words, we further remove 10% of the vocabulary with low TF-IDF values over the whole dataset. We initialize sentence nodes with ds=128d_{s}=128 and edge features eij\bm{e}_{ij} in GATe\text{GAT}_{e} with de=50d_{e}=50. Each GAT layer is 8 heads and the hidden size is dh=64d_{h}=64, while the inner hidden size of FFN layers is 512.

During training, we use a batch size of 32 and apply Adam optimizer Kingma and Ba (2014) with a learning rate 5e-4. An early stop is performed when valid loss does not descent for three continuous epochs. We select the number of iterations t=1t=1 based on the performance on the validation set.The detailed experimental results are attached in the Appendix Section. For decoding, we select top-3 sentences for CNN/DailyMail and NYT50 datasets and top-9 for Multi-New according to the average length of their human-written summaries.

3 Models for Comparison

Extractive summarizer with BiLSTM encoder learns the cross-sentence relation by regarding a document as a sequence of sentences. For simplification, we directly take out the initialization of sentence nodes for classification, which includes a CNN encoder for the word level and 2-layer BiLSTM for sentence level. This model can also be viewed as an ablation study of our HeterSUMGraph on the updating of sentence nodes.

Ext-Transformer

Extractive summarizers with Transformer encoder learn the pairwise interaction Vaswani et al. (2017) between sentences in a purely data-driven way with a fully connected priori. Following Liu and Lapata (2019b), we implement a Transformer-based extractor as a baseline, which contains the same encoder for words followed by 12 Transformer encoder layers for sentences. Ext-Transformer can be regarded as the sentence-level fully connected graph.

HeterSUMGraph

Our heterogeneous summarization graph model relations between sentences based on their common words, which can be denoted as sentence-word-sentence relationships. HeterSUMGraph directly selects sentences for the summary by node classification, while HeterSUMGraph with trigram blocking further utilizes the n-gram blocking to reduce redundancy.

Results and Analysis

We evaluate our single-document model on CNN/DailyMail and NYT50 and report the unigram, bigram and longest common subsequence overlap with reference summaries by R-1, R-2 and R-L. Due to the limited computational resource, we don’t apply pre-trained contextualized encoder (i.e. BERT Devlin et al. (2018)) to our models, which we will regard as our future work. Therefore, here, we only compare with models without BERT for the sake of fairness.

Table 1 shows the results on CNN/DailyMail. The first part is the Lead-3 baseline and oracle upper bound, while the second part includes other summarization models.

We present our models (described in Section 4.3) in the third part. Compared with Ext-BiLSTM, our heterogeneous graphs achieve more than 0.6/0.51/0.7 improvements on R-1, R-2 and R-L, which indicates the cross-sentence relationships learned by our sentence-word-sentence structure is more powerful than the sequential structure. Besides, Our models also outperform Ext-Transformer based on fully connected relationships. This demonstrates that our graph structures effectively prune unnecessary connections between sentences and thus improve the performance of sentence node classification.

Compared with the second block of Figure 1, we observe that HeterSUMGraph outperforms all previous non-BERT-based summarization systems and trigram blocking leads to a great improvement on all ROUGE metrics. Among them, HER Luo et al. (2019) is a comparable competitor to our HeterSUMGraph, which formulated the extractive summarization task as a contextual-bandit problem and solved it with reinforcement learning. Since the reinforcement learning and our trigram blocking plays a similar role in reorganizing sentences into a summary Zhong et al. (2019a), we additionally compare HER without policy gradient with HeterSUMGraph. Our HeterSUMGraph achieve 0.61 improvements on R-1 over HER without policy for sentence scoring, and HeterSUMGraph with trigram blocking outperforms by 0.65 over HER for the reorganized summaries.

Results on NYT50

Results on NYT50 are summarized in Table 2. Note that we use limited-length ROUGE recall as Durrett et al. (2016), where the selected sentences are truncated to the length of the human-written summaries and the recall scores are used instead of F1. The first two lines are baselines given by Durrett et al. (2016) and the next two lines are our baselines for extractive summarization. The second and third part report the performance of other non-BERT-based works and our models respectively.

Again, we observe that our cross-sentence relationship modeling performs better than BiLSTM and Transformer. Our models also have strong advantages over other non-BERT-based approaches on NYT50. Meanwhile, we find trigram block doesn’t work as well as shown on CNN/DailyMail, and we attribute the reason to the special formation of summaries of CNN/DailyMail dataset. Nallapati et al. (2016) concatenate summary bullets, which are written for different parts of the article and have few overlaps with each other, as a multi-sentence summary. However, when human write summaries for the whole article (such as NYT50 and Multi-News), they will use key phrases repeatedly. This means roughly removing sentences by n-gram overlaps will lead to loss of important information.

Ablation on CNN/DailyMail

In order to better understand the contribution of different modules to the performance, we conduct ablation study using our proposed HeterSUMGraph model on CNN/DailyMail dataset. First, we remove the filtering mechanism for low TF-IDF words and the edge weights respectively. We also remove residual connections between GAT layers. As a compensation, we concatenate the initial sentence feature after updating messages from nearby word nodes in Equal 8:

Furthermore, we make iteration number t=0t=0, which deletes the word updating and use the sentence representation Hs1\textbf{H}_{s}^{1} for classification. Finally, we remove the BiLSTM layer in the initialization of sentence nodes.

As Table 3 shows, the removal of low TF-IDF words leads to increases on R-1 and R-L but drops on R-2. We suspect that filtering noisy words enable the model to better focus on useful word nodes, at the cost of losing some bigram information. The residual connection plays an important role in the combination of the original representation and the updating message from another type of nodes, which cannot be replaced by the concatenation. Besides, the introduction of edge features, word update and BiLSTM initialization for sentences also show their effectiveness.

2 Multi-document Summarization

We first take the concatenation of the First-k sentences from each source document as the baseline and use the codes and model outputshttps://github.com/Alex-Fabbri/ Multi-News released by Fabbri et al. (2019) for other models.

To explore the adaptability of our model to multi-document summarization, we concatenate multi-source documents to a single mega-document and apply HeterSUMGraph as the baseline. For comparison, we extend HeterSUMGraph to multi-document settings HeterDocSUMGraph as described in Section 3.5. Our results are presented in Table 4.

Specifically, we observe that both of our HeterSUMGraph and HeterDocSUMGraph outperform previous methods while HeterDocSUMGraph achieves better performance improvements. This demonstrates the introduction of document nodes can better model the document-document relationships and is beneficial for multi-document summarization. As mentioned above, trigram blocking does not work for the Multi-News dataset, since summaries are written as a whole instead of the concatenations of summary bullets for each source document.

3 Qualitative Analysis

We further design several experiments to probe into how our HeterSUMGraph and HeterDocSUMGraph help the single- and multi-document summarization.

In HeterSUMGraph, the degree of a word node indicates its occurrence across sentences and thus can measure the redundancy of the document to some extent. Meanwhile, words with a high degree can aggregate information from multiple sentences, which means that they can benefit more from the iteration process. Therefore, it is important to explore the influence of the node degree of words on the summarization performance.

Number of source documents

We also investigate how the number of source documents influences the performance of our model. To this end, we divide the test set of Multi-News into different parts by the number of source documents and discard parts with less than 100 examples. Then, we take First-3 as the baseline, which concatenates the top-3 sentences of each source document as the summary.

In Figure 5, we can observe that the lead baseline raises while both of our model performance degrade and finally they converge to the baseline. This is because it is more challenging for models to extract limited-number sentences that can cover the main idea of all source documents with the increasing number of documents. However, the First-3 baseline is forced to take sentences from each document which can ensure the coverage. Besides, the increase of document number enlarges the performance gap between HeterSUMGraph and HeterDocSUMGraph. This indicates the benefit of document nodes will become more significant for more complex document-document relationships.

Conclusion

In this paper, we propose a heterogeneous graph-based neural network for extractive summarization. The introduction of more fine-grained semantic units in the summarization graph helps our model to build more complex relationships between sentences . It is also convenient to adapt our single-document graph to multi-document with document nodes. Furthermore, our models have achieved the best results on CNN/DailyMail compared with non-BERT-based models, and we will take the pre-trained language models into account for better encoding representations of nodes in the future.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. U1936214 and 61672162), Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01) and ZJLab.

References

Appendix A Appendices

In order to select the best iteration number for HeterSUMGraph, we compare performances of different tt on the validation set of CNN/DM. All models are trained on a single GeForce RTX 2080 Ti GPU for about 5 epochs. As Table 5 shows, our HeterSUMGraph has comparable results for t=1t=1 and t=3t=3. However, when the iteration number goes from 1 to 3, the time for one epoch nearly doubles. Therefore, we take t=1t=1 as a result of the balance of time cost and model performance.