Neural Sentence Ordering Based on Constraint Graphs

Yutao Zhu, Kun Zhou, Jian-Yun Nie, Shengchao Liu, Zhicheng Dou

Introduction

Text coherence is an essential problem in natural language processing (NLP). Coherent texts with well-organized logical structures are much easier for people to read and understand. As one subtask of coherence modeling, sentence ordering (Barzilay and Lapata 2008) aims at learning to reconstruct a coherent paragraph from an unordered set of sentences. This task underlies many downstream applications such as determining the ordering of: concepts in concept-to-text generation (Konstas and Lapata 2012, 2013), answer spans in retrieval-based question answering (Yu et al. 2018), information from various document in extractive multi-document summarization (Barzilay, Elhadad, and McKeown 2002; Nallapati, Zhai, and Zhou 2017), and events in story generation (Fan, Lewis, and Dauphin 2019; Zhu et al. 2020). An example of this task is shown in Table 1.

Early studies on sentence ordering generally use handcrafted linguistic features to model the document structure (Lapata 2003; Barzilay and Lee 2004; Barzilay and Lapata 2008). However, these manual features are strongly domain-dependent and their definition requires heavy domain expertise. Therefore, it is difficult to apply these methods to different domains. To avoid the burden of hand-crafted features, numerous neural approaches have been proposed recently, which can be roughly categorized into two groups: The first group tends to solve this problem by predicting the pairwise sentence order with a classifier then inferring the global order (Chen, Qiu, and Huang 2016; Li and Jurafsky 2017; Prabhumoye, Salakhutdinov, and Black 2020). An obvious missing part in these approaches is the global context information, which is complementary to the local order information. Another group predicts the sentence order sequentially based on contextual sentence representations. For example, typical methods are based on pointer networks (Gong et al. 2016; Logeswaran, Lee, and Radev 2018; Cui et al. 2018; Yin et al. 2020; Kumar et al. 2020), in which sentences and paragraphs are represented by encoders and the order is predicted by a decoder sequentially. However, local coherence information is not taken into account. Indeed, both local order information and global context information are useful in this task. For the example in Table 1, it is preferable to put sentence $s_{4}$ after $s_{3}$ as the pronoun “they” in $s_{4}$ may refer to the noun “kids” in $s_{3}$ . However, the word “they” also appears in $s_{2}$ . Without the contextual information in $s_{1}$ , it would be hard to decide the order between $s_{2}$ and $s_{3}$ . On the other hand, given the information in $s_{1}$ and $s_{2}$ , both $s_{3}$ and $s_{4}$ can be contextually coherent candidates for the next sentence. It is still hard to decide which should come first after $s_{2}$ without knowing the relative order $s_{3}\prec s_{4}$ .

Therefore, we argue that both local information and global context are crucial to sentence ordering. This raises two problems: 1) how to capture both the local information and the global context; 2) how to utilize them effectively for sentence ordering.

For the first problem, we observe that the order preference between sentences can be hinted by different types of information. For example in Table 1, when we compare sentences $s_{3}$ and $s_{4}$ , the words “kids” and “they played” indicate that $s_{4}$ may immediately follow $s_{3}$ . While comparing sentences $s_{5}$ and $s_{3}$ , no obvious information can tell that one sentence immediately should follow another. However, the word “finally” in $s_{5}$ implies that it should appear at some position after $s_{3}$ . In these examples, the clues we use to place sentences at successive positions or at some distance are different. In general, per writing rules, immediately successive sentences may present some strong intrinsic order information (whether it is causal, about time series, or others), while the same type of information may not be observed between sentences separated at a larger distance (e.g., three sentences apart). In the latter case, hints on more global order information can help. For example, “finally” suggests a position towards the end, the sentence containing “go to fair” should be placed somewhere before “played some games” because of their semantic contents. These examples illustrate the need to consider useful order information at different distances or granularities. Based on these observations, we propose to model sentence order at different distances or granularities, each based on its own features.

For the second problem, we propose to model sentence ordering using graph representation. Graph representation is appealing in that different types of information can be naturally described with it. In our task, it is natural to treat each sentence in a paragraph as a node and represent their relation (the relative order) by an edge between them. With the help of graph neural networks (GNNs), the representation of each sentence can aggregate the information from its neighbors, leading to a representation that integrates both local order information and global context information. More importantly, as we want to take into account sentence relations at different granularities, GNN provides an appropriate way to fuse such information by computing sentence representations over multiple graphs. We expect that such an enhanced representation can help better predict sentence order.

More specifically, we design two phases in our framework. In the first phase, we learn multiple classifiers to judge the relative order between two sentences, each within a given distance (granularity level). In the second phase, multiple graphs (called constraint graphs) are built based on the order preferences. Then, we represent sentences as vectors by an encoder and employ GNNs to update these representations based on the graphs. Finally, the sentence representations are fused to predict their order.

(1) We propose to capture sentence order information at different granularities, allowing to cover a wide spectrum of useful information;

(2) We propose a graph representation for sentence orders and design a novel GNN-based method to fuse local and global information;

(3) We conduct extensive experiments on five benchmark datasets and our model achieves better performance than the existing state-of-the-art methods. This clearly shows the superior capability of our model for sentence ordering by leveraging different types of useful information. Our ablation analysis also shows the impacts of different modules in our framework.

Related Work

Traditional methods for sentence ordering often rely on handcrafted linguistic features and domain knowledge. For example, Lapata (2003) computed transition probabilities between sentences and ordered them by a greedy algorithm. Barzilay and Lee (2004) proposed a content model which represents topics as states in a Hidden Markov Model (HMM). Then Barzilay and Lapata (2008) took into account entities and computed entity transition probabilities among sentences. Recently, neural network based methods have shown great capability in sentence ordering. We review two groups of neural approaches most relevant to ours:

Pairwise model. Pairwise models first predict pairwise sentence order, based on which the entire order is inferred. Chen, Qiu, and Huang (2016) investigated various methods to judge the order of a sentence pair, and the final ranking score of a sentence is obtained by summing up all its scores in sentence pairs. Prabhumoye, Salakhutdinov, and Black (2020) proposed to use a topological sort method to infer the entire order based on the pairwise order. While our framework also considers the relative order between each pair of sentences, it also uses a learning method based on graphs rather than a sort algorithm to infer the entire order. Furthermore, we consider sentence order within multiple distances, leading to a multi-granular view of sentence order. As we will see in our experiments, these extensions significantly improve the results of sentence ordering.

Sequence generation model. This family of models aims to compute better representations for sentences, that encode some order information. Typical methods are based on pointer network (Vinyals, Fortunato, and Jaitly 2015), which is a sequence-to-sequence model using attention as a pointer to select successively a member of the input sequence as the output. Gong et al. (2016) first applied such a model to sentence ordering task, where the encoder represents all sentences into vectors and the decoder predicts results by iteratively selecting one sentence from the input sequence. Later on, many extensions have been proposed. For example, Cui et al. (2018) refined the encoder by the self-attention mechanism, while Yin et al. (2019) modeled the co-occurrences between entities in the sentences by an entity transition graph. More recently, researchers found that using feed-forward neural network as decoder and training the whole model by a listwise loss can further improve the performance (Kumar et al. 2020). Besides, adding supplementary loss functions during the training process is also helpful (Yin et al. 2020). Sentence representations in these approaches are improved, since they can incorporate more contextual information. The relations between sentences are implicitly modeled by self-attention mechanism in these methods. Compared to these methods, our framework explicitly learns and captures the relation at multiple granularities, thus sentence representation can be further improved.

To our best knowledge, This is the first investigation that utilizes graphs to represent the relations between sentences in sentence ordering problemRelated work about graph representations and GNNs is provided in Appendix.. Our study will show that graph is a powerful and adequate formalism to represent order-related information for the task.

Methodology

The sentence ordering task aims at ordering a set of sentences as a coherent text (paragraph). Formally, a set of $n$ sentences with the order $\mathbf{o}=[o_{1},\cdots,o_{n}]$ can be denoted as $\mathbf{p}=[s_{o_{1}},\cdots,s_{o_{n}}]$ . The goal is to find the correct order $\mathbf{o}^{*}=[o_{1}^{*},\cdots,o_{n}^{*}]$ , with which the whole paragraph have the highest coherence probability:

where $\mathbf{o}$ indicates any order of the sentences, and $\Psi$ denotes the set of all possible orders. For example, the order of sentences in Table 1 is $ $, while the correct order$ \mathbf{o}^{*} $is$ $. Following the existing work (Chen, Qiu, and Huang 2016; Kumar et al. 2020; Prabhumoye, Salakhutdinov, and Black 2020), this task is framed as a ranking problem, where the model predicts a score for each sentence and the global order is determined by sorting all scores.

Model Overview

As shown in Figure 1, our framework contains two phases to capture and to use order information. In the first phase, different from existing methods (Chen, Qiu, and Huang 2016; Prabhumoye, Salakhutdinov, and Black 2020), we propose to learn multiple classifiers to predict the relative order at different distances between two sentences. Such relative orderings can describe sentence relations in various granularities, thus can provide more comprehensive information for the overall order prediction. This process indeed simulates a common strategy of our human beings to order sentences, i.e., analyzing the relative order between sentence pairs before inferring the entire order. In the second phase, we focus on inferring the overall order based on the previously learned relations. Concretely, we build multiple graphs based on the obtained relations, and employ a GNN on each graph to incorporate the relative ordering information into the sentence representations. Finally, the representations from multiple GNNs are fused together to calculate a final score for each sentence.

Phase 1: Relative Order Prediction

Given $n$ sentences $[s_{o_{1}},\cdots,s_{o_{n}}]$ in any order $[o_{1},\cdots,o_{n}]$ , we can create from it a set of constraints ( $\mathcal{S}^{d}_{n}$ ) at distance $d$ . The set $\mathcal{S}^{d}_{n}$ describes the relative order between a pair of sentences within distance $d$ , which can be described as:

For example, assuming the actual order of four sentences is $s_{1}\prec s_{2}\prec s_{3}\prec s_{4}$ , if $d=2$ , we have five constraints $\{s_{1}\prec s_{2},s_{1}\prec s_{3},s_{2}\prec s_{3},s_{2}\prec s_{4},s_{3}\prec s_{4}\}$ .

To obtain $\mathcal{S}^{d}_{n}$ , we build a classifier to predict if the relative order between any two sentences $s_{i}$ and $s_{j}$ belongs to this set. To this end, we fine-tune the Bidirectional Encoder Representations from Transformers (BERT) pre-trained language model (Devlin et al. 2019) on each dataset with a multi-layer perceptron (MLP). The input is the sequence of tokens of sentence $s_{i}$ , followed by a separator token “[SEP]”, followed by the sequence of tokens of sentence $s_{j}$ . Then the pooled representation of all the time steps is fed into the MLP and output a probability of $s_{i}\prec s_{j}\in\mathcal{S}^{d}_{n}$ , which is denoted as $p(s_{i},s_{j})$ . Formally, given $s_{i}=[w^{i}_{1},w^{i}_{2},\cdots,w^{i}_{u}]$ and $s_{j}=[w^{j}_{1},w^{j}_{2},\cdots,w^{j}_{v}]$ containing $u$ and $v$ words respectively, $p(s_{i},s_{j})$ is computed as:

In our model, we use several distances. Therefore, the above process is repeated several times, each for a given distance, leading to multiple sets of constraints. Notice also that the predicted result for $s_{i}$ and $s_{j}$ is not symmetrical because we cannot infer $s_{j}\prec s_{i}\in\mathcal{S}^{d}_{n}$ from $s_{i}\prec s_{j}\notin\mathcal{S}^{d}_{n}$ . Therefore, $p(s_{i},s_{j})$ and $p(s_{j},s_{i})$ should be predicted separately.

Phase 2: Sentence Ordering

The relative orderings are exploited in the second phase to infer the order of all sentences. As suggested by the existing work (Chen, Qiu, and Huang 2016; Kumar et al. 2020; Prabhumoye, Salakhutdinov, and Black 2020), the ordering process can be treated as a ranking problem, where the order can be obtained by sorting the scores of sentences. Thus the question is transformed to building better representations for sentences to incorporate the order information.

As shown in Figure 1, it is straightforward to represent the determined pairwise order as a graph, in which a sentence corresponds to a node, while the relation between sentences forms an edge. The advantage of using graphs is that the pairwise relations become explicit, thus provide together a global view of the order information. With the help of GNNs, the connection information (i.e., sentence relation) contained in the edge can be incorporated into the sentence representations.

More specifically, we build a graph from $\mathcal{S}^{d}_{n}$ as follows. First, each sentence $s_{i}$ is represented by BERT and treated as a node in the graph:

We then apply Graph Isomorphism Networks (GINs) to update the representation of the node by recursively aggregating and transforming representation vectors of its neighboring nodes. We choose GIN in this work because it achieves state-of-the-art performance on many graph representation tasks (Xu et al. 2019). Alternatively, it could be replaced by any other GNN model. The node representation is initialized by the BERT representation of sentence (i.e., $\mathbf{h}_{i}^{0}=\mathbf{s}_{i}$ ) and updated in a multi-layer GIN through computation with the adjacent matrix $\mathbf{M}$ . The representation of the $i^{\text{th}}$ node in the $l^{\text{th}}$ layer is computed as follows:

where $[;]$ is concatenation operation. It is worth noting that the number of layers $L$ should be carefully tuned according to the distance $d$ used in the first phase. We will discuss about this in more detail later.

Finally, we compute a score for each node (i.e., sentence) through an MLP:

which determines the order of sentences in the paragraph.

We apply ListMLE (Xia et al. 2008) as the objective function, which is a surrogate loss to the perfect order 0-1 based loss function. Given a corpus with $N$ paragraphs, the $i^{\text{th}}$ paragraph with $n_{i}$ unordered sentences is denoted by $\mathbf{p}_{i}=[s_{1},\cdots,s_{n_{i}}]$ . Assume the correct order of $\mathbf{p}_{i}$ is $\mathbf{o}_{i}^{*}=[o_{1}^{*},\cdots,o_{n_{i}}^{*}]$ , then ListMLE is computed as:

With ListMLE, the model will assign the highest score to the first sentence and the lowest score to the last one.

Experiment

Following previous work (Cui et al. 2018; Kumar et al. 2020), we conduct experiments on five public datasets. The detailed statistics of these datasets are shown in Table 2.

NeurIPS/AAN/NSF abstracts (Logeswaran, Lee, and Radev 2018). These datasets consist of abstracts from NeurIPS, ACL and NSF research award papers. The data are split into training, validation, and test set according to the publication year.

SIND captions (Huang et al. 2016). This is a visual story dataset. Each story contains five sentences.

ROCStory (Mostafazadeh et al. 2016). It is a commonsense story dataset. Each story comprises five sentences. Following (Wang and Wan 2019), we make an 8:1:1 random split on the dataset to get the training, validation, and test set.

Baseline Models

We compare our model with the following methods:

Traditional methods: Entity Grid (Barzilay and Lapata 2008); Window Network (Li and Hovy 2014). These methods use hand-crafted features or neural networks to capture the coherence of text.

Pairwise models: Seq2seq (Li and Jurafsky 2017); B-TSort (Prabhumoye, Salakhutdinov, and Black 2020)Note that the results of B-TSort are slightly worse than those reported in the original paper, because the provided source code does not shuffle the sentence order on test set when applying topological sort algorithm, which artificially improves the results.. These methods predict the relative order between sentence pairs, then infer the entire order.

Sequence generation models: CNN/LSTM+PtrNet (Gong et al. 2016); Variant-LSTM+PtrNet (Logeswaran, Lee, and Radev 2018); ATTOrderNet (Cui et al. 2018); HierarchicalATTNet (Wang and Wan 2019); SE-Graph (Yin et al. 2019); ATTOrderNet+TwoLoss (Yin et al. 2020); RankTxNet+ListMLE (Kumar et al. 2020). These architectures adopt CNN/RNN based approaches to obtain the representation for the input sentences and employ the pointer network as the decoder to predict order. The last four methods are extended models based on ATTOrderNet.

Evaluation Metrics

We use Kendall’s $\tau$ and Perfect Match Ratio (PMR) as metrics, both being commonly used in previous work (Gong et al. 2016; Cui et al. 2018; Logeswaran, Lee, and Radev 2018; Yin et al. 2020; Kumar et al. 2020).

Kendall’s Tau ( $\tau$ ): it is one of the most frequently used metrics for the automatic evaluation of text coherence (Lapata 2006, 2003; Logeswaran, Lee, and Radev 2018). It quantifies the distance between the predicted order and the correct order in terms of the number of the inversions. $\tau=1-2I/{n\choose 2}$ , where $I$ is the number of pairs in the predicted order with incorrect relative order, and $n$ is number of sentences in the paragraph. This value ranges from -1 (the worst) to 1 (the best).

Implementation Details

All models are implemented with PyTorch (Paszke et al. 2019) and trained on a TITAN V GPU. In the first phase, we employ BERT uncased model (Wolf et al. 2019) with an MLP to predict constraints. The batch size and learning rate is set as 50 and 5e-5 respectively for all datasets. In the second phase, sentences are also represented by an uncased BERT model and a dropout layer with rate 0.1 is applied over the representations. Three GINsWe use the implementation at https://github.com/chao1224/BioChemGNN˙Dense. with the number of layers $L=\{2,3,5\}$ are used for the three corresponding graphs ( $k=3$ ) obtained in the first phase (denoted as $g_{1}$ , $g_{2}$ and $g_{3}$ respectively). The hidden size of all GIN layers is 512 ( $m=512$ ). A ReLU activation function is added between each layer. $\epsilon$ is tuned in $\{0,0.1,0.5\}$ and set as 0 according to experimental results on validation set. Batch normalization is applied to avoid overfitting. The batch size is set as 128 for all datasets while the learning rate is set as 1e-4, 5e-4, 6e-4, 4e-4 and 3e-4 for NeurIPS, AAN, NSF, SIND and ROCStory dataset. The maximum number of words in sentences is set as 50 on all datasets, which means sentences containing more than 50 words are truncated while those having less than 50 words are padded. All paragraphs in the datasets are randomly shuffled. To make a fair comparison, all models are tested on the same test set without any other preprocessing operations. The models in both two phases are optimized with AdamW (Loshchilov and Hutter 2019).

Relationship between distance d𝑑d and the number of layers L𝐿L

In our experiments, we observe that the hyperparameter $L$ in the second phase should be selected according to $d$ in the first phase. The larger the distance $d$ is, the fewer the GIN layers are required to obtain good results. This trend can be explained by the coverage of the entire set of sentences in a paragraph as follows.

As illustrated in Figure 1, assuming that there are five sentences to be ordered, we can see when $d=1$ (the top one), the classifier in the first phase only predict the relative order between two successive sentences. Therefore, in the corresponding constraint graph, the model need to stack four layers (hops) to connect the first sentence to the last sentence. Similarly, if $d=2$ , the model need two layers to build the connection; and if $d=4$ , any two sentences in the graph are connected, thus only one layer is enough.

Therefore, to connect any sentence pair within a set of $n_{i}$ sentences for paragraph $i$ , we have the empirical formula:

where $\lceil\cdot\rceil$ is the ceiling function. In practice, we have to fix $d$ in order to train classifiers, and we can only implement models with a fixed $L$ . Therefore, in our experiments, we use GNNs with $L=\{2,3,5\}$ for all datasets and compute the corresponding distance $d$ as:

As a result, the distance $d$ is set as {14, 7, 4}, {19, 10, 5}, {39, 20, 10}, {4, 2, 1}, and {4, 2, 1} for NeurIPS, AAN, NSF, SIND, and ROCStory dataset respectively.

To validate the Equation (15), we design an experiment that use the ground-truth constraint graphs as input to test how many layers are necessary to obtain perfect results. The results on SIND validation sets ( $n=5$ for all samples) confirm our empirical analysis. The details of this experiment is presented in Appendix.

Sentence Ordering Results

Table 3 shows the sentence ordering results on all datasetsThe average accuracy of each classifier in the first phase on all datasets is around 80%, and the detailed results are reported in Appendix.. Our method significantly outperforms all baseline models on both evaluation metrics.

The improvement is generalized across different topics and sizes of data. On research paper abstracts datasets, our model outperforms the previous best baseline method by about 1.5% $\tau$ score and 1.7% PMR score on NeurIPS and AAN datasets, while the improvement is more than 2.5% on larger datasets, i.e., NSF. As for two story datasets, our method outperforms the previous state-of-the-art model by around 1.8% $\tau$ score and 1.3% PMR score. Interestingly, our method achieves 49.52% PMR score on ROCStory, meaning that about half of the stories in the test set can be ordered completely right. The performance clearly demonstrates the effectiveness and wide applicability of our proposed method.

B-TSort is a recently proposed pairwise method, which predicts the entire order merely based on the relative order of two sentences at only one distance (as $d=n-1$ in our framework). Compared to it, our method can achieve better performance because multiple sentence relations are considered. Besides, in the second phase, we use neural networks rather than a sorting algorithm (e.g., a topological sort) to infer the entire order, which can effectively fuse the order information from multiple constraint graphs.

According to the results, being equipped with BERT may bring improvements (e.g., RankTxNet+ListMLE on SIND and ROCStory dataset). However, compared with the last two baselines that also use BERT as encoder, our framework still yields better results. This indicates that the better performance we obtained is not merely due to BERT embeddings but also to the way we create representations on top of it.

Ablation Study

We investigate the impact of different modules in our model by an ablation study. The experiments are performed on AAN abstracts and SIND captions dataset. From the results shown in Table 4, we can observe:

(1) The effect of each graph varies on different datasets. Compared with using the graph $g_{1}$ (i.e., the graph with largest distance), our model with $g_{3}$ (smaller distance) performs better on AAN but worse on SIND. However, using more graphs always lead to better results. This empirically validates our assumption that multiple graphs can indeed capture various types of information between sentences that are useful for sentence ordering.

(2) In our model, the BERT encoder used in the second phase is not fine-tuned during the training process. We also test the performance with fine-tuning BERT. The result shows that fine-tuning BERT can bring further improvements in terms of $\tau$ . Nevertheless, the number of parameters becomes significantly larger than without fine-tuning (117M vs. 7M), and the training time is much longer, e.g., about 2.41 times more (917s vs. 381s per epoch) on SIND dataset. This suggests that fine-tuning BERT is a good strategy only if we have the necessary computation power.

(3) As many variants or extensions of BERT have been proposed recently, we also test our method with RoBERTa (Liu et al. 2019) and ALBERT (Lan et al. 2020). However, according to our experiments, these two models cannot perform well in the first phase. The potential reason is that they both remove the Next Sentence Prediction objective in the pre-training stage, which is very important for the constraint prediction of our framework. When applying them in the second phase, the final results show some slight improvements on AAN dataset. Through this experiment, we see that our method can work with other advanced pre-trained language models.

(4) We mentioned that GIN could be replaced by any GNN model. In this experiment, we replace it by graph convolutional networks (GCNs) (Kipf and Welling 2017) which are also widely used in many tasks. The results with GCNs are slightly worse compared to the model with GINs, but are still competitive. This shows that other GNN models could also be used on our graphs.

Further Analysis

As discussed by (Gong et al. 2016; Chen, Qiu, and Huang 2016; Cui et al. 2018; Kumar et al. 2020), the first and last sentence should be paid more attention to due to their crucial positions in a paragraph. We provide the accuracy for these sentences on AAN and SIND datasets in Table 5. Our method gives clearly better results than all baselines. In particular, on SIND dataset, our method achieves the absolute improvement of 2.08% on the last sentence prediction compared with the previous best method.

Following previous work (Logeswaran, Lee, and Radev 2018; Kumar et al. 2020), we use t-SNE embeddings to visualize the effect of training on the sentence representations for SIND dataset in Figure 2. We can clearly see that the updated representations contain more order information than the original BERT embeddings. This is another demonstration that our approach can effectively capture order information into sentence representations.

Conclusion

In this paper, we proposed a novel model for sentence ordering which takes into account relative order information at multiple granularities. This is based on the intuition that the order at different distances may rely on different types of information, and all such order information is useful for sentence ordering. We first classified sentence order within different distances and formed multiple constraint graphs from it. Then we employed GINs to incorporate the order information into sentence representations, which are finally used to predict the sentence order. Our model achieved state-of-the-art performance on five benchmark datasets. This study paves the way for a new research direction on sentence ordering by leveraging different types of information in the form of constraint graphs.

Acknowledgments

We thank Pan Du for the insightful discussions and the anonymous reviewers for their feedback. This work was supported by National Natural Science Foundation of China No. 61872370 and No. 61832017, Beijing Outstanding Young Scientist Program NO. BJJWZYJH012019100020098, and Shandong Provincial Natural Science Foundation under Grant ZR2019ZD06.

References

Appendix

Graph representation is appealing in the sense that many things can be naturally described with it: as long as the feature has intra-relations, e.g., pixels in images have spatial connection, words in the sentences have temporal correlation, atoms in molecules are connected following the graph topology, and entities in the knowledge graph form a graph in a straightforward way.

The graph neural networks are proposed accordingly for better representation learning, and the core idea is that, for each node, we use message passing operation to exchange information within the $k$ -hop neighborhood around it. Successful applications include image classification (Dwivedi et al. 2020), social network analysis (Kipf and Welling 2017; Velickovic et al. 2017), and molecule property prediction (Wu et al. 2018; Liu et al. 2018; Liu, Demirel, and Liang 2019; Gilmer et al. 2017; Xu et al. 2019).

Accuracy of our method in the first phase

In the first phase, our method learns a classifier to predict the constraint set with given distance. We report the accuracy of each classifier on the test set in Table 6. As the labels are imbalanced (samples with label zero are more than those with label one), we report the accuracy of each label and the overall accuracy respectively. We can observe that the accuracy is highly related to the dataset and influences the performance in the second phase directly. For example, the accuracy on AAN and ROCStory is higher than that on other datasets. Correspondingly, the performance in the second phase (as shown in Table 3) on these two datasets is also better. Moreover, although the accuracy on predicting some of the constraints is not very high, such as $d=4$ on NeurIPS and $d=1$ on SIND, the corresponding graph can still contribute to the final performance on sentence ordering. We believe that any prediction better than random could be useful. Therefore, using multiple graphs is an effective way to improve the sentence ordering performance.

Validation for relationship between distance d𝑑d and the number of layers L𝐿L

To validate the Equation (15), we design an experiment that uses the ground-truth constraint graphs as input to test how many layers are necessary to obtain perfect results. The results on SIND validation sets ( $n=5$ for all samples) confirm our empirical analysis. We observe that the results are consistent with our empirical formula that $d=1$ requires at least four GIN layers ( $L\geq 4$ ) to achieve perfect results, while $d=4$ requires only one layer ( $L\geq 1$ ). Therefore, on this dataset, the necessary setting of $L$ is $\{1,2,4\}$ . As a reference, the results with predicted constraints are illustrated in the lower part of the figure. According to the results, we find that adding one more layer may slightly improve the performance, thus we emperically set $L$ as $\{2,3,5\}$ on all datasets.