Enhancing Scientific Papers Summarization with Citation Graph

Chenxin An, Ming Zhong, Yiran Chen, Danqing Wang, Xipeng Qiu, Xuanjing Huang

Introduction

Text summarization aims to automatically compress a document into a shorter version that preserves a concise description of its content. Most previous work has focused on the news domain (Nallapati et al. 2016; Rush, Chopra, and Weston 2015; Nallapati, Zhai, and Zhou 2016; Zhong et al. 2019) and achieved promising results using the neural encoder-decoder architecture. Although text summarization systems have been explored far less in other domains, such as scientific papers, they have broad application prospects there.

Generating a good abstract for a scientific paper is a challenging task, even for a beginning researcher, since scientific papers are usually long and full of complex concepts and domain-specific terms. Cohan et al. (2018) and Xiao and Carenini (2019) leveraged paper structure information to generate abstracts for scientific papers. However, their methods are dedicated to the problem of long-document modeling and do not utilize information from references. In practice, researchers usually write the abstract of a paper by referring to examples, and papers on the same topic are often similar in content. Reasonable use of the information in reference papers may therefore help solve the scientific papers summarization task. To generate a better summary for a scientific paper, Yasunaga et al. (2019) integrated the information of the source paper and the papers that cite it. However, citing papers appear only after the source paper, so this setting does not help a researcher draft an abstract when the paper has not yet been cited.

In this paper, we highlight the importance of the citation graph and argue that it can assist in generating high-quality summaries. Figure 1 shows an example of a small research community consisting of the source paper and several reference papers. They all concern the topic Weak Galerkin Finite Element Method and are thus very similar in content, logic, and writing style. For instance, many uncommon domain-specific terms (green text) are shared across these papers. It is almost impossible for a model to understand the true meaning of these concepts without sufficient descriptions, so we should encourage models to learn from the reference papers. The same expression is often written in different styles (orange text) in different papers, and some academic definitions that do not appear in the original text can be found in other papers (blue text). This relevant information helps the model better understand the entire research community.

Motivated by the above observations, we augment the task of scientific papers summarization with the citation graph. While generating the abstract of a source paper, summarization systems can refer to papers in the same research community. Since no current large-scale scientific summarization dataset provides citation relationships between papers, we construct a scientific papers summarization dataset, Semantic Scholar Network (SSN), which contains 141K papers and 661K citation relationships extracted from the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). Notably, our dataset is a large connected citation graph, and each paper has class labels denoting its research field. We divide the enhanced summarization task into two settings: (1) transductive: during training, models can access all the nodes and edges in the whole dataset, including the papers (excluding abstracts) in the test set; (2) inductive: papers in the test set come from a totally new graph, which means that no test node can be used during training.

Further, we propose a citation graph-based summarization model (CGSum), which incorporates the document and the relevant citation graph when generating summaries. For each source paper, we obtain its corresponding research community by sampling a subgraph from the whole citation graph. We first encode the content of the source paper and use a graph encoder to capture the information of the subgraph. A decoder then combines the features output by the two encoders to produce the final summary. Additionally, we introduce a novel ROUGE credit method, which instructs the model how to write summaries with the help of the abstracts of other papers in the same research community. Although our model only uses BiLSTM and GNN structures, experimental results show that it achieves competitive performance compared with pretrained models. We summarize our contributions as follows:

We augment the task of scientific papers summarization by introducing the citation graph.

We construct a large-scale summarization dataset, SSN. To the best of our knowledge, this is the first large-scale scientific papers summarization dataset with a citation graph.

We propose a citation graph-based summarization model to solve the enhanced task of scientific papers summarization, which can incorporate the source paper information and the features of the citation graph at the same time.

Related Work

Early approaches to extractive summarization, such as TextRank (Mihalcea and Tarau 2004), took advantage of graph structures by building a connectivity graph from inter-sentence cosine similarity. Among neural systems, Wang et al. (2020) construct a heterogeneous graph network to model the relations between different semantic units. For abstractive systems, inspired by the success of Graph Attention Networks (GATs) (Veličković et al. 2017) in NLP, Song et al. (2018) proposed the task of text generation from graphs, and Koncel-Kedziorski et al. (2019) designed a GAT-based transformer encoder to generate summaries with the help of knowledge graphs extracted from scientific texts. To combine text and graphs, Fernandes, Allamanis, and Brockschmidt (2018) couple a regular document encoder with graph neural networks to make use of both the input sequence and the graph structure, while Zhu et al. (2020) and Huang, Wu, and Wang (2020) build a knowledge graph from the input document and integrate it into the decoding process. Instead of generating the abstract directly from the graph, our model uses a graph-enhanced encoder and views the citation graph as complementary information.

Scientific Papers Summarization

Automatic summarization of scientific papers has been studied for decades. Previous work mainly focused on the content of the document (Luhn 1958; Cohan and Goharian 2018), and most of it is extractive (Teufel and Moens 2002; Xiao and Carenini 2019). For instance, Cohan et al. (2018) propose a neural model under the sequence-to-sequence framework that uses the discourse structure of scientific papers. These methods focus on modeling long documents but ignore the influence of the research community a paper belongs to. Another direction is citation summarization (Qazvinian and Radev 2008; Cohan and Goharian 2018; Yasunaga et al. 2019), which makes use of the reference relationships between papers. Citation summarization aims to generate the summary of a source paper from the papers citing it. Although citation information can improve the quality of a paper's summary, it cannot help authors draft the summary while writing the paper. Unlike citation summarization, we generate the summary of the source paper by using its reference papers as background knowledge. In our setting, the papers citing the source paper are not visible while the summary is written.

Semantic Scholar Network (SSN) Dataset

Many scientific summarization datasets have emerged in recent years. The most commonly used scientific datasets, arXiv and PubMed (Cohan et al. 2018), focus on long document summarization without providing citation relationships between papers, which ignores a defining characteristic of the academic domain. Yasunaga et al. (2019) propose a relatively small dataset containing 1k papers based on the ACL Anthology Network (AAN) (Radev et al. 2013), but they generate summaries using only papers that cite the current paper (i.e., citing papers), which are not available when an abstract is first drafted. In view of the above, we construct a large-scale summarization dataset, Semantic Scholar Network (SSN), which consists of 141k research papers extracted from the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). All the papers in SSN form a large connected citation graph, allowing us to make full use of citation relationships between papers.

The Semantic Scholar Open Research Corpus (Lo et al. 2020) contains 81.1M academic papers from multiple research fields. We extract only the papers with full-text LaTeX parses (1.5M), which provide more details about each paper (e.g., section names and paragraph/section boundaries). We keep papers whose abstract length is between 50 and 1000 and whose body length is between 1000 and 8000. Additionally, papers with fewer than 4 sections or without an Introduction section are filtered out because they are likely to have lost their discourse structure. A breadth-first search algorithm is applied to obtain a large connected citation graph. To prevent the graph from being too sparse, we recursively remove papers with only one 1-hop neighbour. We also normalize inline formulas, equations, tables, figures, and citation markers with special tokens.
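The two graph-level steps above (keeping one connected component and recursively pruning sparse nodes) can be sketched as follows. This is a minimal illustration, not the pipeline used to build SSN: the function names, the undirected adjacency-dict representation, and the choice to keep the largest component are all assumptions made for the example.

```python
from collections import deque


def largest_connected_component(adj):
    """Return the node set of the largest connected component, found with BFS.

    adj maps each node to the set of its neighbours (treated as undirected).
    """
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def prune_sparse_nodes(adj, min_degree=2):
    """Recursively remove nodes that have fewer than min_degree neighbours.

    Removing a node can push its neighbours below the threshold, so the
    removal is propagated with a work queue until no such node remains.
    """
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    queue = deque(u for u, vs in adj.items() if len(vs) < min_degree)
    while queue:
        u = queue.popleft()
        if u not in adj:  # already removed
            continue
        for w in adj.pop(u):
            adj[w].discard(u)
            if len(adj[w]) < min_degree:
                queue.append(w)
    return adj
```

With `min_degree=2`, a chain hanging off a densely connected core is removed node by node, which matches the "recursive" wording in the text.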

Statistics

SSN has 140,799 nodes and 660,908 edges, and most papers come from the fields of mathematics, physics, and computer science. Statistics of our dataset and other common datasets are shown in Table 1. CNN/DailyMail (Hermann et al. 2015) is a widely used news dataset; the others belong to the scientific domain. SSN has the longest text, which makes modeling difficult. It also has the most sections, showing that our dataset retains as much of the paper structure as possible. Besides, SSN is a large connected citation graph, so it can be used to train auxiliary tasks such as node classification and link prediction to help models better understand the research community in the whole graph.

Method

In this section, we first define the task of scientific papers summarization with citation graph, then describe our citation graph-based summarization model (CGSum) in detail.

Existing document summarization methods usually conceptualize this task as a sequence-to-sequence problem. Given a dataset $D=(d_{1},d_{2},\ldots,d_{k})$, each document $d_{i}$ can be represented as a sequence of $n$ words $d_{i}=(x_{1},x_{2},\ldots,x_{n})$, and the objective is to generate a target summary $Y=(y_{1},y_{2},\ldots,y_{m})$ by modeling the conditional distribution $p(y_{1},y_{2},\ldots,y_{m}|x_{1},\ldots,x_{n})$.

However, scientific papers have their own characteristics: there are citation relationships between papers, and the content of these papers is logically closely related. Therefore, we introduce the concept of citation graph to strengthen summarization tasks in the scientific domain. We define a citation graph $G=(V,E)$ on the whole dataset, which contains scientific papers and citation relationships. Each node $v\in V$ represents a scientific paper in the dataset, and each edge $e\in E$ indicates the citation relationship between two papers. Notably, when generating the summary of a paper, we cannot rely on information from the papers that cite it (because they appear later in chronological order). We therefore extract a subgraph $G_{v}$ for each node $v$ to avoid introducing information that should not be used; the specific method is given in Algorithm 1.
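As an illustration of the constraint described above, the following sketch collects a subgraph by following only outgoing reference edges from the source paper, so that papers citing the source are never reached. It is not Algorithm 1: the function name, the `references` dictionary layout, and the hop limit `k` are assumptions made for the example.

```python
from collections import deque


def reference_subgraph(references, source, k=2):
    """Collect the papers reachable from `source` within k reference hops.

    references maps a paper id to the list of papers it cites. Only these
    outgoing edges are followed, so a paper that cites `source` (and was
    therefore written later) cannot enter the subgraph through that edge.
    Returns (nodes, edges), where each edge is a (citing, cited) pair.
    """
    depth = {source: 0}
    edges = set()
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if depth[u] == k:  # do not expand beyond k hops
            continue
        for w in references.get(u, ()):
            edges.add((u, w))
            if w not in depth:
                depth[w] = depth[u] + 1
                queue.append(w)
    return set(depth), edges
```

For example, if paper `c` cites the source `p`, then `c` appears only as a key of `references` and is never visited when the search starts from `p`.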

Given the source paper $v$ (w/o abstract) and the related citation graph $G_{v}$ (we only use the abstract of other nodes), we need to generate a summary $Y$ of $v$ by modeling the conditional distribution $p(y_{1},y_{2},\ldots,y_{m}|x_{1},\ldots,x_{n};G_{v})$ .
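For an autoregressive decoder, this conditional distribution factorizes token by token via the chain rule. The identity below is standard and is written out only to make the role of $G_{v}$ explicit; it is not a formula specific to this work.

```latex
p(y_{1},\ldots,y_{m} \mid x_{1},\ldots,x_{n}; G_{v})
  = \prod_{t=1}^{m} p\left(y_{t} \mid y_{1},\ldots,y_{t-1},\, x_{1},\ldots,x_{n};\, G_{v}\right)
```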

Citation Graph-Based Summarization Model

In this part, we describe our citation graph-based model (CGSum), displayed in Figure 2. The key idea is not only to encode the source document $v$, but also to capture the features of the corresponding citation graph $G_{v}$ to help generate the summary. Our model consists of a document encoder, a graph encoder, and a decoder. In addition, we introduce a novel ROUGE credit approach.

We employ a single-layer bidirectional LSTM (BiLSTM) to convert the input document $\mathbf{d}=(x_{1},x_{2},\ldots,x_{n})$ to a sequence of hidden representations $\mathbf{H}={\rm BiLSTM}(x_{1},\ldots,x_{n})$. We initialize the source node $v_{i}$ by pooling its hidden representations $\mathbf{H}$. For each neighbor node $v_{j}\in\mathcal{N}(v_{i})$, where $\mathcal{N}(v_{i})$ denotes the input neighborhood of $v_{i}$, we feed its abstract $t$ to another BiLSTM and obtain the initial representation $\mathbf{h}_{v_{j}}$ of node $v_{j}$ by aggregating the hidden representations of $t$ with a pooling layer.

Neighborhood Extraction

For each node $v$, it is too computationally expensive to use the whole citation graph $G_{v}$, so it is necessary to sample an informative subgraph. Specifically, we first extract a directed subgraph $G_{v_{i}}^{\prime}$ consisting of the source paper $v_{i}$ and its $K$-hop neighbors, and add self-loops to $G_{v_{i}}^{\prime}$ for information enhancement. Before feeding $G_{v_{i}}^{\prime}$ to the graph encoder, we employ a neighborhood extraction method to further select $T$ neighbors by their salience scores with respect to the source node $v_{i}$:

where $v_{j}\in G_{v_{i}}^{\prime}$, $s_{i,j}$ denotes the salience score between $v_{i}$ and $v_{j}$, and $f$ is a 3-layer feed-forward neural network. We extract the $T$ most salient vertices with the $\operatorname{argmax}$ function to construct the final citation graph $G^{*}_{v_{i}}$. However, directly selecting nodes in this way hinders the training of the parameters in $f$, since the hard selection passes no gradient back to the scores. To overcome this problem, we follow Huang, Wu, and Wang (2020), view $f$ as an information gate, and multiply each $\mathbf{h}_{v_{j}}$ by its score: $\mathbf{h}_{v_{j}}=s_{i,j}\mathbf{h}_{v_{j}}$.
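The selection-plus-gating step described above can be sketched in plain Python. The salience scores are assumed to be given (in the model they come from the network $f$), and the function name and dictionary layout are hypothetical. In a differentiable framework, the multiplication is what keeps a gradient path from the loss back to the scores.

```python
def select_salient_neighbors(features, scores, t):
    """Keep the t highest-scoring neighbours and gate their features.

    features maps a node id to its feature vector (a list of floats);
    scores maps a node id to its precomputed salience score s_ij.
    Each kept vector is scaled by its score, mirroring h_vj = s_ij * h_vj.
    """
    keep = sorted(scores, key=scores.get, reverse=True)[:t]
    return {v: [scores[v] * x for x in features[v]] for v in keep}
```

A call such as `select_salient_neighbors(features, scores, t=2)` returns only the two most salient nodes, each with its representation scaled by its score.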

Graph Encoder

Given a sampled citation graph $G^{*}_{v}$ and the initial node features $\mathbf{H}_{v}$, we use a 2-layer graph attention network (GAT) (Veličković et al. 2017) to update the representation of each node. To avoid the vanishing gradient problem, we add residual connections between layers. Each node $v_{i}$ is represented by the aggregation of its neighbors:
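One form consistent with the symbols defined below is sketched here. It assumes the LeakyReLU scoring of the standard GAT and a nonlinearity $\sigma$ after aggregation, neither of which is stated in the text; the names $z_{i,j}^{n}$ and $\mathbf{u}_{v_{i}}$ are introduced only for illustration.

```latex
\begin{aligned}
z_{i,j}^{n} &= \mathrm{LeakyReLU}\left(\mathbf{W}_{a}^{n}\left[\mathbf{W}_{q}^{n}\mathbf{h}_{v_{i}} ;\, \mathbf{W}_{k}^{n}\mathbf{h}_{v_{j}}\right]\right), \\
\alpha_{i,j}^{n} &= \frac{\exp\left(z_{i,j}^{n}\right)}{\sum_{l\in\mathcal{N}(v_{i})}\exp\left(z_{i,l}^{n}\right)}, \\
\mathbf{u}_{v_{i}} &= \mathbin{\|}_{n=1}^{N}\, \sigma\Big(\sum_{j\in\mathcal{N}(v_{i})}\alpha_{i,j}^{n}\,\mathbf{W}_{v}^{n}\mathbf{h}_{v_{j}}\Big), \\
\mathbf{h}_{v_{i}}^{\prime} &= \mathbf{u}_{v_{i}} + \mathbf{h}_{v_{i}} \quad \text{(residual connection)}
\end{aligned}
```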

where $\mathbin{\|}_{n=1}^{N}$ denotes the concatenation of $N$ attention heads, $\alpha_{i,j}^{n}$ is the normalized attention weight between $\mathbf{h}_{v_{i}}$ and $\mathbf{h}_{v_{j}}$ computed by the $n$-th attention head, and $\mathbf{W}_{a}^{n},\mathbf{W}_{k}^{n},\mathbf{W}_{q}^{n},\mathbf{W}_{v}^{n}$ are trainable parameters. Dropout (Hinton et al. 2012) with probability 0.1 is applied in each layer.

Decoder

Our decoder is a single-layer unidirectional LSTM. At each step $t$, the decoder has a hidden state $\mathbf{s}_{t}$. Previous work (See, Liu, and Manning 2017) employs an attention mechanism to compute the attention distribution over the source words in the sequence-to-sequence structure, and we extend it to the graph structure as:
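A form consistent with the symbols defined below, assuming the additive attention of See, Liu, and Manning (2017) applied to node representations (the score name $e_{i,t}^{v}$ is introduced here for illustration):

```latex
\begin{aligned}
e_{i,t}^{v} &= \mathbf{v}^{T}\tanh\left(\mathbf{W}_{h}^{v}\mathbf{h}_{v_{i}} + \mathbf{W}_{s}^{v}\mathbf{s}_{t} + \mathbf{b}^{v}\right), \\
a_{t}^{v} &= \mathrm{softmax}\left(e_{t}^{v}\right), \\
\mathbf{h}_{t}^{v,*} &= \sum_{v_{i}\in G^{*}_{v}} a_{i,t}^{v}\,\mathbf{h}_{v_{i}}
\end{aligned}
```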

where $\mathbf{v}^{T}$, $\mathbf{W}_{h}^{v}$, $\mathbf{W}_{s}^{v}$ and $\mathbf{b}^{v}$ are trainable weights. We compute the attention distribution over the nodes in $G^{*}_{v}$ and obtain a graph context vector $\mathbf{h}_{t}^{v,*}$. Furthermore, besides the features of the citation graph, the decoder still needs to attend to the source document:
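A sketch of the document-side attention in the same additive form, assuming the formulation of See, Liu, and Manning (2017). Here $\mathbf{h}_{i}$ is the $i$-th hidden representation in $\mathbf{H}$; the projection vector $\mathbf{w}$ and the score name $e_{i,t}$ are hypothetical names, and any coverage term inside the score is omitted.

```latex
\begin{aligned}
e_{i,t} &= \mathbf{w}^{T}\tanh\left(\mathbf{W}_{h}\mathbf{h}_{i} + \mathbf{W}_{s}\mathbf{s}_{t} + \mathbf{b}\right), \\
a_{t} &= \mathrm{softmax}\left(e_{t}\right), \\
\mathbf{h}_{t}^{*} &= \sum_{i=1}^{n} a_{i,t}\,\mathbf{h}_{i}
\end{aligned}
```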

where $\mathbf{W}_{h}$ , $\mathbf{W}_{s}$ and $\mathbf{b}$ are trainable parameters. $\mathbf{h}_{t}^{v,*}$ and $\mathbf{h}_{t}^{*}$ can be viewed as the aggregated representation of the citation graph and the source document respectively, so we concatenate them with the decoder hidden state $\mathbf{s}_{t}$ to produce the vocabulary distribution $P_{vocab}$ :
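A sketch of this step, assuming the two linear layers followed by a softmax used in See, Liu, and Manning (2017). The parameter names $\mathbf{V}_{1}$, $\mathbf{V}_{2}$, $\mathbf{c}_{1}$, $\mathbf{c}_{2}$ and the concatenation order are illustrative assumptions.

```latex
P_{vocab} = \mathrm{softmax}\left(\mathbf{V}_{2}\left(\mathbf{V}_{1}\left[\mathbf{s}_{t} ;\, \mathbf{h}_{t}^{*} ;\, \mathbf{h}_{t}^{v,*}\right] + \mathbf{c}_{1}\right) + \mathbf{c}_{2}\right)
```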

In addition, to overcome the out-of-vocabulary (OOV) problem, we allow the decoder to copy words from the source document as proposed by See, Liu, and Manning (2017). The generation probability $p_{gen}\in[0,1]$ (so that the copying probability is $p_{copy}=1-p_{gen}$) for step $t$ is calculated as:
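In the formulation of See, Liu, and Manning (2017), which this sketch assumes, the gate is a sigmoid over the document context, the decoder state, and the decoder input. The weight names are hypothetical, and whether the graph context $\mathbf{h}_{t}^{v,*}$ also enters the gate is not stated in the text.

```latex
p_{gen} = \sigma\left(\mathbf{w}_{h}^{T}\mathbf{h}_{t}^{*} + \mathbf{w}_{s}^{T}\mathbf{s}_{t} + \mathbf{w}_{x}^{T}\mathbf{x}_{t} + b_{gen}\right)
```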

where $\mathbf{x}_{t}$ denotes the decoder input at time step $t$ . Therefore, the probability distribution over the extended vocabulary is:
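Written out in the pointer-generator form that the two special cases in the next paragraph imply (a sketch assuming that copying draws only on the document attention $a_{i,t}$):

```latex
P(w) = p_{gen}\,P_{vocab}(w) + \left(1 - p_{gen}\right)\sum_{i:w_{i}=w} a_{i,t}
```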

Note that if $w$ does not appear in the source document, $\sum_{i:w_{i}=w}a_{i,t}$ is equal to zero, and if $w$ is an OOV word, $P_{vocab}(w)$ is zero. The loss at time step $t$ is the negative log likelihood of the target word $y_{t}$:
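A sketch assuming the coverage formulation of See, Liu, and Manning (2017), in which the coverage vector $c_{i,t}$ (a name introduced here) accumulates the past attention on source position $i$:

```latex
\begin{aligned}
loss_{t} &= -\log P\left(y_{t}\right) + \lambda\sum_{i}\min\left(a_{i,t},\, c_{i,t}\right), \\
c_{i,t} &= \sum_{t^{\prime}=0}^{t-1} a_{i,t^{\prime}}
\end{aligned}
```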

where $\lambda$ is the hyperparameter to reweight the coverage loss.

ROUGE Credit

Intuitively, the information brought by the citation graph is useful not only during training but also when the model generates summaries at inference time. Motivated by this, we propose a novel ROUGE credit score in the beam search algorithm to instruct our model to write summaries with the help of nearby nodes’ abstracts.

Specifically, at the decoding step $t$, we first select the neighbor $nbr_{max}$ that has the most influence on the generated summary by applying the $\operatorname{argmax}$ function to the attention distribution $a_{t}^{v}$ over the graph $G^{*}_{v}$. In the beam search process, there are $k$ candidate sequences $C=(c_{1},\ldots,c_{k})$ per time step, and we calculate the ROUGE credit score between the abstract of $nbr_{max}$ and each $c_{i}$ as:
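A minimal sketch of the two ingredients named above: picking $nbr_{max}$ from the graph attention and scoring each beam candidate against its abstract. The score here is a simplified ROUGE-1 $\rm F_{1}$ (unigram overlap without stemming), and all function names are hypothetical. How the credit is combined with the beam log-probability is not assumed here.

```python
from collections import Counter


def rouge1_f1(candidate_tokens, reference_tokens):
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def pick_most_attended_neighbor(graph_attention):
    """Return the node with the largest attention weight at this step."""
    return max(graph_attention, key=graph_attention.get)


def rouge_credits(candidates, neighbor_abstract_tokens):
    """Score every beam candidate against the selected neighbour's abstract."""
    return [rouge1_f1(c, neighbor_abstract_tokens) for c in candidates]
```

In a beam search loop, `pick_most_attended_neighbor` would be called with the attention weights of the current step, and `rouge_credits` with the $k$ partial hypotheses.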

Experiments

We evaluate our model on our Semantic Scholar Network dataset. Details about the dataset are shown in Table 1. We lowercase all tokens and perform sentence and word tokenization using spaCy (Honnibal and Johnson 2015). As shown in Figure 3, for SSN (transductive) we randomly choose 6,250/6,250 papers from the whole dataset as the test/validation sets, and the remaining 128,299 papers form the training set, which is the most common way to split a dataset. Under the transductive division, most neighbors of the papers in the test set come from the training set. In real cases, however, the test papers may come from a new graph that has nothing to do with the papers used for training. We therefore introduce SSN (inductive), which splits the whole citation graph into three independent subgraphs (training, validation, and test graphs) with the breadth-first search algorithm. The training/validation/test graphs in the inductive setting contain 128,400/6,123/6,276 nodes and 603,737/17,221/14,515 edges. In both the inductive and transductive settings, the summaries of the papers in the test and validation sets are kept invisible during the training phase. The inductive setting is also intended to test whether models trained on a large-scale citation graph can transfer to another citation graph. The inductive setting of our task is therefore considered more difficult.

Training Details and Parameters

We use the hyperparameters suggested by See, Liu, and Manning (2017) in the BiLSTM model. The word embedding layer is trained from scratch without using any pretrained language models or embeddings. We use mini-batches of size 16 and limit the input document length to 500 tokens. The input citation graph includes the source paper and its $K$-hop neighbors ($K$ = 1, 2), and we initialize the node representations with the body text of source papers and the abstracts of neighbors. We constrain the maximum number of papers in an input graph to 64. We implement the graph attention network with Deep Graph Library (Wang et al. 2019) and set the number of attention heads to 4. We use the Adagrad optimizer with learning rate 0.15 and an initial accumulator value of 0.1. We set the beam size $b$ to 5 and $l_{s}$ to 75 in the ROUGE credit, and the ROUGE (Lin 2004) score used is ROUGE-1 $\rm F_{1}$. We do not train the model with coverage loss in the first epoch to help the model converge faster. We train our model for 10 epochs and validate every 2000 steps. We select the best checkpoint based on the ROUGE-L score on the validation set.
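For reference, the settings stated in this paper can be gathered into a single configuration. The values below are taken from the text; the key names and the helper function are hypothetical and do not come from any released code.

```python
# Hyperparameters as stated in the text; the key names are hypothetical.
CONFIG = {
    "batch_size": 16,
    "max_document_tokens": 500,
    "neighbor_hops": (1, 2),        # K-hop neighbours, K = 1, 2
    "max_graph_nodes": 64,
    "gat_layers": 2,
    "gat_attention_heads": 4,
    "gat_dropout": 0.1,
    "optimizer": "adagrad",
    "learning_rate": 0.15,
    "initial_accumulator_value": 0.1,
    "beam_size": 5,
    "rouge_credit_l_s": 75,
    "epochs": 10,
    "validate_every_steps": 2000,
    "coverage_loss_from_epoch": 2,  # no coverage loss in the first epoch
    "checkpoint_metric": "rouge_l",
}


def use_coverage_loss(epoch, config=CONFIG):
    """Return True when coverage loss is active (epochs are 1-indexed)."""
    return epoch >= config["coverage_loss_from_epoch"]
```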

Baseline Methods

We provide the LEAD baseline, which extracts the first $L$ sentences from the source document (with $L$ depending on the number of sentences in the reference summary), and Oracle as an upper bound for extractive summarization systems. We use a greedy algorithm following Nallapati, Zhai, and Zhou (2016) to generate the oracle summary. Since we truncate the document to 500 tokens, Oracle in this paper is calculated on the truncated dataset.

Besides, we implement the following extractive systems: (1) TextRank (Mihalcea and Tarau 2004): an unsupervised extractive system based on the graph structure; (2) TransformerExt: an extractive system based on transformer encoders; (3) BertSumExt (Liu and Lapata 2019): an extractive summarization model with BERT. We further add the following abstractive baseline models: (1) PTGen+Cov (See, Liu, and Manning 2017): an abstractive summarization system with copy mechanism; (2) TransformerAbs: an abstractive summarization model based on the transformer; (3) BertSumAbs (Liu and Lapata 2019): an abstractive summarization system built on BERT. We employ trigram blocking (Paulus, Xiong, and Socher 2017) to reduce redundancy for both the baseline systems and our models.
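Trigram blocking, mentioned above, discards a hypothesis once it contains the same trigram twice. A minimal sketch of that check follows; the function names are hypothetical, and applying it as a filter over finished token lists (rather than inside the beam search step) is a simplification.

```python
def has_repeated_trigram(tokens):
    """Return True if any three consecutive tokens occur more than once."""
    seen = set()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False


def block_trigrams(hypotheses):
    """Keep only the hypotheses that never repeat a trigram."""
    return [h for h in hypotheses if not has_repeated_trigram(h)]
```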

Experimental Results

We evaluate summarization quality with the standard ROUGE score (Lin 2004), where R-1 and R-2 represent informativeness and R-L represents fluency. Tables 2 and 3 show the results on our dataset. Several well-known extractive and abstractive baselines, as well as models that make use of the pretrained language model BERT (Devlin et al. 2018) through their open-source implementations, are shown in the second and third parts. To better compare our model with the baselines, for each abstractive baseline we give an additional Concat Nbr. Summ version whose input is the concatenation of the source document and the neighbors’ abstracts, separated by a special token [SEP], following the general setting in multi-document summarization. In our experiments, we were surprised to find that TransformerAbs performs poorly on our dataset, although it improves significantly once a copy mechanism is added.

Although BERT has achieved state-of-the-art performance in the news domain (Zhong et al. 2020), it has not shown great advantages on scientific papers. We think the main reason is that BERT has a length limit of 512 tokens, while scientific papers are usually much longer than this limit. To address this issue, we relax the maximum-length constraint in BERT by adding more position embeddings, which are initialized randomly and finetuned in the training phase. This brings a remarkable improvement for the two BERT-based models (BertSumExt and BertSumAbs). In addition, none of the models improves significantly after adding the content of the cited papers (i.e., Concat Nbr. Summ), which shows that simply concatenating the content of the reference papers is not enough.

As can be seen from Tables 2 and 3, in both the inductive and transductive settings, CGSum outperforms all the pretrained models in terms of R-1 and R-L. Compared with BertSumAbs (mp = 640), which is also an abstractive model, our model uses a shorter input sequence (500 vs. 640) and a simpler encoder structure (1-layer BiLSTM and 2-layer GAT vs. 12-layer pretrained transformers), yet it still outperforms BertSumAbs (mp = 640). This result illustrates that combining the document information with the features of the citation graph structure helps the model better understand the relevant research community and thereby generate higher-quality abstracts. In the inductive setting, CGSum beats BERT by 0.63 R-1 and PTGen + Cov by 1.52 R-1, while it brings larger improvements in the transductive setting (beating BERT by 1.53 R-1 and PTGen + Cov by 3.99 R-1).

Degree of Source Paper

We further explore the relationship between model performance and the degree of the source node. We divide our test set into six parts according to the degree of the node. As shown in Figure 4, there is no obvious connection between the performance of PTGen + Cov and the degree of the source paper. Notably, PTGen + Cov can be viewed as our model with the graph encoder removed, so for nodes with degree 0 the two models have similar performance. However, as the degree of the nodes increases, our model gradually achieves better performance. This experiment shows that our model is good at handling papers with rich citation graph information; that is, an informative and relevant research community is very important for understanding a scientific paper. In the inductive setting, the average degree of nodes in the test set ($d_{avg}=2.3$) is much smaller than that in the transductive setting ($d_{avg}=4.7$). This also explains why CGSum outperforms the baseline models that do not use the citation graph by a larger margin in the transductive setting.

Ablation Study

To better understand the contribution of each component of our proposed model, we remove the neighborhood extraction, residual connection, trigram blocking, ROUGE credit, and GNN from the original model. As shown in Table 4, neighborhood extraction yields a performance improvement because it extracts a more informative subgraph. Residual connections and trigram blocking have been shown to work well in previous work, and they are also effective in our task. Besides, our proposed ROUGE credit method significantly improves the performance on R-1 and R-L because of the shared domain-specific terms and the similar writing style among papers in the same research community. Finally, if we remove the GNN encoder, our model becomes PTGen + Cov.

Conclusion

In this paper, we augment the task of scientific papers summarization with the citation graph. Specifically, summarization systems can use not only the document information of the source paper but also useful information from the corresponding research community in the citation graph to generate the final abstract. Unlike previous work, we aim to help researchers draft a paper abstract by using its references rather than the papers citing it. We construct a large-scale scientific summarization dataset that forms a large connected citation graph with 141K nodes and 661K citation edges. We also design a novel citation graph-based model which incorporates the features of a paper and its references. Experiments show the effectiveness of our proposed model and the important role of citation graphs for scientific paper summarization.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable suggestions. This work was supported by the National Key Research and Development Program of China (No. 2017YFB1002104).