Graph-Revised Convolutional Network

Donghan Yu, Ruohong Zhang, Zhengbao Jiang, Yuexin Wu, Yiming Yang

Introduction

Graph Convolutional Networks (GCNs) have received increasing attention in recent years as they are highly effective in graph-based node feature induction and belief propagation, and widely applicable to many real-world problems, including computer vision , natural language processing , recommender systems , epidemiological forecasting , and more.

However, the power of GCNs has not been fully exploited as most of the models assume that the given graph perfectly depicts the ground-truth of the relationship between nodes. Such assumptions are bound to yield sub-optimal results as real-world graphs are usually highly noisy, incomplete (with many missing edges), and not necessarily ideal for different downstream tasks. Ignoring these issues is a fundamental weakness of many existing GCN methods.

Recent methods that attempt to modify the original graph can be split into two major streams: 1) Edge reweighting: GAT and GLCN use attention mechanism or feature similarity to reweight the existing edges of the given graph. Since the topological structure of the graph is not changed, the model is prone to be affected by noisy data when edges are sparse. 2) Full graph parameterization: LDS , on the other hand, allows every possible node pairs in a graph to be parameterized. Although this design is more flexible, the memory cost is intractable for large datasets, since the number of parameters increases quadratically with the number of nodes. Therefore, finding a balance between model expressiveness and memory consumption remains an open challenge.

To enable flexible edge editing while maintaining scalability, we develop a GCN-based graph revision module that performs edge addition and edge reweighting. In each iteration, we calculate an adjacency matrix via GCN-based node embeddings, and select the edges with high confidence to be added. Our method permits a gradient-based training of an end-to-end neural model that can predict unseen edges. Our theoretical analysis demonstrates the effectiveness of our model from the perspective of multigraph , which allows more than one edges from different sources between a pair of vertices. To the best of our knowledge, we are the first to reveal the connection between graph convolutional networks and multigraph propagation. Our contributions can be summarized as follows:

We introduce a novel structure that simultaneously learns both graph revision and node classification through different GCN modules.

Through theoretical analysis, we show our model’s advantages in the view of multigraph propagation.

Comprehensive experiments on six benchmark datasets from different domains show that our proposed model achieves the best or highly competitive results, especially under the scenarios of highly incomplete graphs or sparse training labels.

Background

Graph convolutional networks generalize the convolution operation on images to graph structure data, performing layer-wise propagation of node features . Suppose we are given a graph with adjacency matrix $A$ and node features $H^{(0)}=X$ . An $L$ -layer Graph Convolution Network (GCN) conducts the following inductive layer-wise propagation:

To allow for an adaptive aggregation paradigm, GLCN learns to reweight the existing edges by node feature embeddings. The reweighted adjacancy matrix $\widetilde{A}$ is calculated by:

where $x_{i}$ denotes the feature vector of node $i$ and $a,P$ are model parameters. Another model GAT reweights edges by a layer-wise self-attention across node-neighbor pairs to compute hidden representations. For each layer $l$ , the reweighted edge is computed by:

where $a(\cdot,\cdot)$ is a shared attention function to compute the attention coefficients. Compared with GLCN, GAT uses different layer-wise maskings to allow for more flexible representation. However, neither of the methods has the ability to add edges since the revised edge $\widetilde{A}_{ij}$ or $\widetilde{A}^{(l)}_{ij}\neq 0$ only if the original edge $A_{ij}\neq 0$ .

In order to add new edges into the original graph, LDS makes the entire adjacency matrix parameterizable. Then it jointly learns the graph structure $\theta$ and the GCN parameters $W$ by approximately solving a bilevel program as follows:

Proposed Method

The revised adjacency matrix is formed by an elementwise summation of the original adjacency matrix and the calculated similarity matrix: $\widetilde{A}=A+S$ . Compared with the graph revision in GAT and GLCN which use entrywise product, we instead adopt the entrywise addition operator “ $+$ ” in order for new edges to be considered. In this process, the original graph $A$ is revised by the similarity graph $S$ , which can insert new edges to $A$ and potentially reweight or delete existing edges in $A$ . In practice, we apply a sparsification technique on dense matrix $S$ to reduce computational cost and memory usage, which will be introduced in the next section. Then the predicted labels are calculated by:

where $GCN_{c}$ denotes the graph convolutional network for the downstream node classification task. Note that to prevent numerical instabilities, renormalization trick is also applied on the revised adjacency matrix $\widetilde{A}$ . Figure 1 provides an illustration of our model. Finally, we use cross-entropy loss as our objective function:

where $\mathcal{Y}_{L}$ is the set of node indices that have labels $Y$ and $C$ is the number of classes. It’s worth emphasizing that our model does not need other loss functions to guide the graph revision process.

where $K(\cdot)$ is the kernel matrix computed from the node embeddings in Equation (6). In our implementation, we use dot product as kernel function for simplicity, and we use a two-layer GCN in both modules. Note that our framework is highly flexible for other kernel functions and graph convolutional networks, which we leave for future exploration.

2 Sparsification

Since the adjacency matrix $S$ of similarity graph is dense, directly applying it in the classification module is inefficient. Besides, we only want those edges with higher confidence to avoid introducing too much noise. Thus we conduct a $K$ -nearest-neighbour (KNN) sparsification on the dense graph: for each node, we keep the edges with top- $K$ prediction scores. The adjacancy matrix of the KNN-sparse graph, denoted as $S^{(K)}$ , is computed as:

where $topK(S_{i})$ is the set of top- $K$ values of vector $S_{i}$ . Finally, in order to keep the symmetric property, the output sparse graph $\widehat{S}$ is calculated by:

Now since both original graph $A$ and similarity graph $\widehat{S}$ are sparse, efficient matrix multiplication can be applied on both GCNs as in the training time, gradients will only backpropagate through the top- $K$ values. By sparsification, the memory cost of similarity matrix is reduced from $O(N^{2})$ to $O(NK)$ .

3 Fast-GRCN

To further reduce the training time, we introduce a faster version of the propose model: Fast-GRCN. Note that GRCN needs to compute the dense adjacency matrix $S$ in every epoch, where the computational time complexity is $O(N^{2})$ . While in Fast-GRCN, the whole matrix $S$ is only calculated in the first epoch, and then the indices of the non-zero values of the KNN-sparse matrix $S^{(K)}$ are saved. For the remaining epochs, to obtain $S^{(K)}$ , we only compute the values of the saved indices while directly setting zeros for other indices, which reduces the time complexity to $O(NK)$ .

The intuition behind is that the top- $K$ important neighbours of each node may remain unchanged, while the weight of edges between them should still be adjusted during training.

4 Theoretical Analysis

In this section, we show the effectiveness of our model in the view of Multigraph propagation. The major observation is that for existing methods, the learned function from GCNs can be regarded as a linear combination of limited pre-defined kernels where the flexibility of kernels have a large influence on the final prediction accuracy.

We consider the simplified graph convolution neural network $GCN_{s}$ for the ease of analysis. That is, we remove feature transformation parameter $W$ and non-linear activation function $\sigma(\cdot)$ as:

where $k$ is the number of GCN layers. For simplicity we denote $A$ as the adjacency matrix with self-loop after normalization. The final output can be acquired by applying a linear or logistic regression function $f(\cdot)$ on the node embeddings above:

where $\widehat{Y}$ denotes the predicted labels of nodes. Then the following theorem shows that under certain conditions, the optimal function $f^{*}$ can be expressed as a linear combination of kernel functions defined on training samples.

Then, any minimizer of the empirical risk admits a representation of the form:

where $B=XX^{T}$ is the adjacency matrix of graph induced by node features. Now we have two graphs based on the same node set: original graph $G_{A}$ (associated with adjacency matrix $A$ ) and feature graph $G_{B}$ (associated with adjacency matrix $B$ ). They form a multigraph where multiple edges is permitted between the same end nodes. Then the random-walk-like matrix $A^{k}BA^{k}$ can be regarded as one way to perform graph label/feature propagation on the multigraph. Its limitation is obvious: the propagation only happens once on the feature graph $G_{B}$ , which lacks flexibility. However, for our method, we have:

where labels/features can propagate multiple times on the feature graph $G_{B}$ . Thus our model is more flexible and more effective especially when the original graph $G_{A}$ is not reliable or cannot provide enough information for downstream tasks. In Equation (16), $A+A^{m}BA^{m}$ can be regarded as a combination of different edges in the multigraph. To reveal the connection between GRCN and GLCN , we first consider the special case of our model that $m=0$ : $GRCN(A,X)=(A+B)^{k}X$ . The operator “ $+$ ” is analogous to the operator $OR$ which incorporates information from both graph $A$ and $B$ . While GLCN takes another combination denoted as $A\circ B$ using Hadamard (entrywise) product “ $\circ$ ”, which can be analogous to $AND$ operation.

We can further extend our model to a layer-wise version for comparison to GAT . More specifically, for the $l$ -th layer, we denote the input as $X_{l}$ . The output $X_{l+1}$ is then calculated by:

where $B_{l}=X_{l}X_{l}^{T}$ . Similar to the analysis mentioned before, if we consider the special case of GRCN that $m=0$ and change the edge combination operator from entrywise sum “ $+$ ” to entrywise product “ $\circ$ ”, we have $X_{l+1}=(A\circ B_{l})X_{l}$ , which is the key idea behind GAT . Due to the property of entrywise product, the combined edges of both GAT and GLCN are only the reweighted edges of $A$ , which becomes ineffective when the original graph $G_{A}$ is highly sparse. Through the analysis above, we see that our model is more general in combining different edges by varying the value of $m$ , and also has more robust combination operator “ $+$ ” compared to previous methods.

Experiments

We use six benchmark datasets for semi-supervised node classification evaluation. Among them, Cora, CiteSeer and PubMed are three commonly used datasets. The data split is conducted by two ways. The first is the fixed split originating from . In the second way, we conduct 10 random splits while keeping the same number of labels for training, validation and testing as previous work. This provides a more robust comparison of the model performance. To further test the scalability of our model, we utilize three other datasets: Cora-Full , Amazon-Computers and Coauthor CS . The first is an extended version of Cora, while the second and the third are co-purchase and co-authorship graphs respectively. On these three datasets, we follow the previous work and take 20 labels of each classes for training, 30 for validation, and the rest for testing. We also delete the classes with less than 50 labels to make sure each class contains enough instances. The data statistics are shown in Table 1.

We compare the effectiveness of our GRCN model with several baselines, where the first two models are vanilla graph convolutional networks without any graph revision: GCN , SGC , GAT , LDS , and GLCN .

2 Implementation Details

Transductive setting is used for node classification on all the datasets. We train GRCN for $300$ epochs using Adam and select the model with highest validation accuracy for test. We set learning rate as $1\text{\times}{10}^{-3}$ for graph refinement module and $5\text{\times}{10}^{-3}$ for label prediction module. Weight decay and sparsification parameter $K$ are tuned by grid search on validation set, with the search space $[$ $1\text{\times}{10}^{-4}$ , $5\text{\times}{10}^{-4}$ , $1\text{\times}{10}^{-3}$ , $5\text{\times}{10}^{-3}$ , $1\text{\times}{10}^{-2}$ , $5\text{\times}{10}^{-2}$ $]$ and $$ respectively. Our code is based on Pytorch and one geometric deep learning extension library , which provides implementation for GCN, SGC and GAT. For LDS, the results were obtained using the publicly available code. Since an implementation for GLCN was not available, we report the results based on our own implementation of the original paper.

3 Main Results

Table 2 shows the mean accuracy and the corresponding standard deviation for all models across the 6 datasets averaged over 10 different runsNote that the performance difference of baseline models between fixed split and random split is also observed in previous work , where they show that different splits can lead to different ranking of models, and suggest multiple random splits as a better choice. Our later experiments are based on the random split setting.. We see that our proposed models achieve the best or highly competitive results for all the datasets. The effectiveness of our model over the other baselines demonstrates that taking the original graph as input for GCN is not optimal for graph propagation in semi-supervised classification. It’s also worth noting that Fast-GRCN, with much lower computational time complexity, performs nearly as well as GRCN .

To further test the superiority of our model, we consider the edge-sparse scenario when a certain fraction of edges in the given graph is randomly removed. Given an edge retaining ratio, we randomly sample the retained edges 10 times and report the mean classification accuracy and standard deviation. For the Cora, CiteSeer and PubMed dataset, we conduct experiments under random data split. Figure 2 shows the results under different ratios of retained edges. There are several observations from this figure. First, GRCN and Fast-GRCN achieve notable improvements on almost all the datasets, especially when edge retaining ratio is low. For instance, when edge retaining ratio is $10\%$ , GRCN outperforms the second best baseline model by $6.5\%,2.5\%,1.1\%,11.0\%,4.6\%,2.3\%$ on each dataset. Second, the GAT and GLCN models which reweight the existing edges do not perform well, indicating that such a reweighting mechanism is not enough when the original graph is highly incomplete. Third, our method also outperforms the over-parameterized model LDS in Cora and CiteSeer because of our restrained edge editing procedure. Though LDS achieves better performances than other baseline methods in these two datasets, its inability to scale prevents us from testing it on four of the larger datasets.

4 Robustness on Training Labels

We also show that the gains achieved by our model are very robust to the reduction in the number of training labels for each class, denoted by $T$ . We compare all the models on the Cora-Full, Amazon Computers and Coauthor CS datasets and fix the edge sampling ratio to $20\%$ . We reduce $T$ from $15$ to $5$ and report the results in Table 3. While containing more parameters than vanilla GCN, our model still outperforms others. Moreover, it wins by a larger margin when $T$ is smaller. This demonstrates our model’s capability to handle tasks with sparse training labels.

5 Hyperparameter Analysis

We investigate the influence of the hyperparameter $K$ in this section. Recall that after calculating the similarity graph in GRCN , we use $K$ -nearest-neighbour specification to generate a sparse graph out of the dense graph. This is not only benificial to efficiency, but also important for effectiveness. Figure 3 shows the results of node classification accuracy vs. sampling ratio on Cora dataset, where we vary the edge sampling ratio from $10\%$ to $100\%$ and change $K$ from $5$ to $200$ . From this figure, increasing the value of $K$ helps improve the classification accuracy at the initial stage. However, after reaching a peak, further increasing $K$ lowers the model performance. We conjecture that this is because a larger $K$ will introduce too much noise and thus lower the quality of the revised graph.

6 Ablation Study

To further examine the effectiveness of our GCN-based graph revision module, we conduct an ablation study by testing four different simplifications of the graph revision module:

Truncated SVD Reconstruction : $\widetilde{A}=\text{SVD}_{k}(A)$

Feature plus Graph (FG): $\widetilde{A}=A+K(X)$

Random Walk Feature plus Graph (RWFG): $\widetilde{A}=A+K(A^{2}X)$

where SVD and FO only use the original graph structure or node features to construct the graph. They are followed by the FG method, which adds the original graph to the feature similarity graph used in FO. Our model is most closely related to the third method, RWFG, which constructs the feature graph with similarity of node features via graph propagation, but without feature learning.

We conduct the ablation experiment on Cora dataset with different edge retaining ratios and report the results in Figure 4. The performance of SVD method indicates that simply smoothing the original graph is insufficient. The comparison between FO and FG shows that adding original graph as residual links is helpful for all edge retaing ratios, especially when there are more known edges in the graph. Examining the results of FG and RWFG, we can also observe a large improvement brought by graph propagation on features. Finally, the performance of our model and RWFG underscores the importance of feature learning, especially in the cases of low edge retraining ratio.

Related Work

Graph Convolution Networks (GCNs) were first introduced in the work by , with subsequent development and improvements from . Overall, GCNs can be categorized into two categories: spectral convolution and spatial convolution. The spectral convolution operates on the spectral representation of graphs defined in the Fourier domain by the eigen-decomposition of graph Laplacian . The spatial convolution operates directly on the graph to aggregate groups of spatially close neighbors . GraphSage computes node representations by sampling a fixed-size number of adjacent nodes. Besides those methods that are directly applied to an existing graph, GAT , GLCN use attention mechanism or feature similarity to reweight the original graph for better GCN performance, while LDS reconstructs the entire graph via a bilevel optimization. Although our work is related to these methods, we develop a different strategy for graph revision that maintains both efficiency and high flexibility.

2 Link prediction

Link prediction aims at identifying missing links, or links that are likely to be formed in a given network. Previous line of work uses heuristic methods based on local neighborhood structure of nodes, including first-order heuristics by common neighbors and preferential attachment , second-order heuristics by Adamic-Adar and resource allocation , or high-order heuristics by PageRank . To loose the strong assumptions of heuristic method, a number of neural network based methods are proposed, which are capable to learn general structure features. The problem we study in this paper is related to link prediction since we try to revise the graph by adding or reweighting edges. However, instead of treating link prediction as an objective, our work focus on improving node classification by feeding the revised graph into GCNs.

Conclusion

This paper presents Graph-Revised Convolutional Network, a novel framework for incorporating graph revision into graph convolution networks. We show both theoretically and experimentally that the proposed way of graph revision can significantly enhance the prediction accuracy for downstream tasks. GRCN overcomes two main drawbacks in previous approaches to graph revision, which either employ over-parameterized models and consequently face scaling issues, or fail to consider missing edges. In our experiments with node classification tasks, the performance of GRCN stands out in particular when the input graphs are highly incomplete or if the labeled training instances are very sparse.

In the future, we plan to explore GRCN in a broader range of prediction tasks, such as knowledge base completion, epidemiological forecasting and aircraft anomaly detection based on sensor network data.

Acknowledgments

We thank the reviewers for their helpful comments. This work is supported in part by the National Science Foundation (NSF) under grant IIS-1546329.