Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks

Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, Cho-Jui Hsieh

Introduction

Graph convolutional network (GCN) (Kipf and Welling, 2017) has become increasingly popular in addressing many graph-based applications, including semi-supervised node classification (Kipf and Welling, 2017), link prediction (Zhang and Chen, 2018) and recommender systems (Ying et al., 2018). Given a graph, GCN uses a graph convolution operation to obtain node embeddings layer by layer—at each layer, the embedding of a node is obtained by gathering the embeddings of its neighbors, followed by one or a few layers of linear transformations and nonlinear activations. The final layer embedding is then used for some end tasks. For instance, in node classification problems, the final layer embedding is passed to a classifier to predict node labels, and thus the parameters of GCN can be trained in an end-to-end manner.

Since the graph convolution operator in GCN needs to propagate embeddings using the interaction between nodes in the graph, this makes training quite challenging. Unlike other neural networks that the training loss can be perfectly decomposed into individual terms on each sample, the loss term in GCN (e.g., classification loss on a single node) depends on a huge number of other nodes, especially when GCN goes deep. Due to the node dependence, GCN’s training is very slow and requires lots of memory – back-propagation needs to store all the embeddings in the computation graph in GPU memory.

Previous GCN Training Algorithms: To demonstrate the need of developing a scalable GCN training algorithm, we first discuss the pros and cons of existing approaches, in terms of 1) memory requirementHere we consider the memory for storing node embeddings, which is dense and usually dominates the overall memory usage for deep GCN., 2) time per epochAn epoch means a complete data pass. and 3) convergence speed (loss reduction) per epoch. These three factors are crucial for evaluating a training algorithm. Note that memory requirement directly restricts the scalability of algorithm, and the later two factors combined together will determine the training speed. In the following discussion we denote $N$ to be the number of nodes in the graph, $F$ the embedding dimension, and $L$ the number of layers to analyze classic GCN training algorithms.

Full-batch gradient descent is proposed in the first GCN paper (Kipf and Welling, 2017). To compute the full gradient, it requires storing all the intermediate embeddings, leading to $O(NFL)$ memory requirement, which is not scalable. Furthermore, although the time per epoch is efficient, the convergence of gradient descent is slow since the parameters are updated only once per epoch. $[$ memory: bad; time per epoch: good; convergence: bad $]$

Mini-batch SGD is proposed in (Hamilton et al., 2017). Since each update is only based on a mini-batch gradient, it can reduce the memory requirement and conduct many updates per epoch, leading to a faster convergence. However, mini-batch SGD introduces a significant computational overhead due to the neighborhood expansion problem—to compute the loss on a single node at layer $L$ , it requires that node’s neighbor nodes’ embeddings at layer $L-1$ , which again requires their neighbors’ embeddings at layer $L-2$ and recursive ones in the downstream layers. This leads to time complexity exponential to the GCN depth. GraphSAGE (Hamilton et al., 2017) proposed to use a fixed size of neighborhood samples during back-propagation through layers and FastGCN (Chen et al., 2018a) proposed importance sampling, but the overhead of these methods is still large and will become worse when GCN goes deep. $[$ memory: good; time per epoch: bad; convergence: good $]$

VR-GCN (Chen et al., 2018b) proposes to use a variance reduction technique to reduce the size of neighborhood sampling nodes. Despite successfully reducing the size of samplings (in our experiments VR-GCN with only 2 samples per node works quite well), it requires storing all the intermediate embeddings of all the nodes in memory, leading to $O(NFL)$ memory requirement. If the number of nodes in the graph increases to millions, the memory requirement for VR-GCN may be too high to fit into GPU. $[$ memory: bad; time per epoch: good; convergence: good. $]$

In this paper, we propose a novel GCN training algorithm by exploiting the graph clustering structure. We find that the efficiency of a mini-batch algorithm can be characterized by the notion of “embedding utilization”, which is proportional to the number of links between nodes in one batch or within-batch links. This finding motivates us to design the batches using graph clustering algorithms that aims to construct partitions of nodes so that there are more graph links between nodes in the same partition than nodes in different partitions. Based on the graph clustering idea, we proposed Cluster-GCN, an algorithm to design the batches based on efficient graph clustering algorithms (e.g., METIS (Karypis and Kumar, 1998)). We take this idea further by proposing a stochastic multi-clustering framework to improve the convergence of Cluster-GCN. Our strategy leads to huge memory and computational benefits. In terms of memory, we only need to store the node embeddings within the current batch, which is $O(bFL)$ with the batch size $b$ . This is significantly better than VR-GCN and full gradient decent, and slightly better than other SGD-based approaches. In terms of computational complexity, our algorithm achieves the same time cost per epoch with gradient descent and is much faster than neighborhood searching approaches. In terms of the convergence speed, our algorithm is competitive with other SGD-based approaches. Finally, our algorithm is simple to implement since we only compute matrix multiplication and no neighborhood sampling is needed. Therefore for Cluster-GCN, we have $[$ memory: good; time per epoch: good; convergence: good $]$ .

We conducted comprehensive experiments on several large-scale graph datasets and made the following contributions:

Cluster-GCN achieves the best memory usage on large-scale graphs, especially on deep GCN. For example, Cluster-GCN uses 5x less memory than VRGCN in a 3-layer GCN model on Amazon2M. Amazon2M is a new graph dataset that we construct to demonstrate the scalablity of the GCN algorithms. This dataset contains a amazon product co-purchase graph with more than 2 millions nodes and 61 millions edges.

Cluster-GCN achieves a similar training speed with VR-GCN for shallow networks (e.g., 2 layers) but can be faster than VR-GCN when the network goes deeper (e.g., 4 layers), since our complexity is linear to the number of layers $L$ while VR-GCN’s complexity is exponential to $L$ .

Cluster-GCN is able to train a very deep network that has a large embedding size. Although several previous works show that deep GCN does not give better performance, we found that with proper optimization, deeper GCN could help the accuracy. For example, with a 5-layer GCN, we obtain a new benchmark accuracy 99.36 for PPI dataset, comparing with the highest reported one 98.71 by (Zhang et al., 2018).

Implementation of our proposed method is publicly available.https://github.com/google-research/google-research/tree/master/cluster_gcn

Background

Semi-supervised node classification is a popular application of GCN. When using GCN for this application, the goal is to learn weight matrices in (1) by minimizing the loss function:

where $\mathcal{Y}_{L}$ contains all the labels for the labeled nodes; $z^{(L)}_{i}$ is the $i$ -th row of $Z^{(L)}$ with the ground-truth label to be $y_{i}$ , indicating the final layer prediction of node $i$ . In practice, a cross-entropy loss is commonly used for node classification in multi-class or multi-label problems.

Proposed Algorithm

We first discuss the bottleneck of previous training methods to motivate the proposed algorithm.

In the original paper (Kipf and Welling, 2017), full gradient descent is used for training GCN, but it suffers from high computational and memory cost. In terms of memory, computing the full gradient of (2) by back-propagation requires storing all the embedding matrices $\{Z^{(l)}\}_{l=1}^{L}$ which needs $O(NFL)$ space. In terms of convergence speed, since the model is only updated once per epoch, the training requires more epochs to converge.

It has been shown that mini-batch SGD can improve the training speed and memory requirement of GCN in some recent works (Hamilton et al., 2017; Chen et al., 2018a, b). Instead of computing the full gradient, SGD only needs to calculate the gradient based on a mini-batch for each update. In this paper, we use $\mathcal{B}\subseteq[N]$ with size $b=|\mathcal{B}|$ to denote a batch of node indices, and each SGD step will compute the gradient estimation

to perform an update. Despite faster convergence in terms of epochs, SGD will introduce another computational overhead on GCN training (as explained in the following), which makes it having much slower per-epoch time compared with full gradient descent.

We consider the computation of the gradient associated with one node $i:\nabla\text{loss}(y_{i},z^{(L)}_{i})$ . Clearly, this requires the embedding of node $i$ , which depends on its neighbors’ embeddings in the previous layer. To fetch each node $i$ ’s neighbor nodes’ embeddings, we need to further aggregate each neighbor node’s neighbor nodes’ embeddings as well. Suppose a GCN has $L+1$ layers and each node has an average degree of $d$ , to get the gradient for node $i$ , we need to aggregate features from $O(d^{L})$ nodes in the graph for one node. That is, we need to fetch information for a node’s hop- $k$ ( $k=1,\cdots,L$ ) neighbors in the graph to perform one update. Computing each embedding requires $O(F^{2})$ time due to the multiplication with $W^{(l)}$ , so in average computing the gradient associated with one node requires $O(d^{L}F^{2})$ time.

Embedding utilization can reflect computational efficiency. If a batch has more than one node, the time complexity is less straightforward since different nodes can have overlapped hop- $k$ neighbors, and the number of embedding computation can be less than the worst case $O(bd^{L})$ . To reflect the computational efficiency of mini-batch SGD, we define the concept of “embedding utilization” to characterize the computational efficiency. During the algorithm, if the node $i$ ’s embedding at $l$ -th layer $z^{(l)}_{i}$ is computed and is reused $u$ times for the embedding computations at layer $l+1$ , then we say the embedding utilization of $z^{(l)}_{i}$ is $u$ . For mini-batch SGD with random sampling, $u$ is very small since the graph is usually large and sparse. Assume $u$ is a small constant (almost no overlaps between hop- $k$ neighbors), then mini-batch SGD needs to compute $O(bd^{L})$ embeddings per batch, which leads to $O(bd^{L}F^{2})$ time per update and $O(Nd^{L}F^{2})$ time per epoch.

We illustrate the neighborhood expansion problem in the left panel of Fig. 1. In contrary, full-batch gradient descent has the maximal embedding utilization—each embedding will be reused $d$ (average degree) times in the upper layer. As a consequence, the original full gradient descent (Kipf and Welling, 2017) only needs to compute $O(NL)$ embeddings per epoch, which means on average only $O(L)$ embedding computation is needed to acquire the gradient of one node.

To make mini-batch SGD work, previous approaches try to restrict the neighborhood expansion size, which however do not improve embedding utilization. GraphSAGE (Hamilton et al., 2017) uniformly samples a fixed-size set of neighbors, instead of using a full-neighborhood set. We denote the sample size as $r$ . This leads to $O(r^{L})$ embedding computations for each loss term but also makes gradient estimation less accurate. FastGCN (Chen et al., 2018a) proposed an important sampling strategy to improve the gradient estimation. VR-GCN (Chen et al., 2018b) proposed a strategy to store the previous computed embeddings for all the $N$ nodes and $L$ layers and reuse them for unsampled neighbors. Despite the high memory usage for storing all the $NL$ embeddings, we find their strategy very useful and in practice, even for a small $r$ (e.g., 2) can lead to good convergence.

We summarize the time and space complexity in Table 1. Clearly, all the SGD-based algorithms suffer from exponential complexity with respect to the number of layers, and for VR-GCN, even though $r$ can be small, they incur huge space complexity that could go beyond a GPU’s memory capacity. In the following, we introduce our Cluster-GCN algorithm, which achieves the best of two worlds—the same time complexity per epoch with full gradient descent and the same memory complexity with vanilla SGD.

1. Vanilla Cluster-GCN

Our Cluster-GCN technique is motivated by the following question: In mini-batch SGD updates, can we design a batch and the corresponding computation subgraph to maximize the embedding utilization? We answer this affirmative by connecting the concept of embedding utilization to a clustering objective.

Consider the case that in each batch we compute the embeddings for a set of nodes $\mathcal{B}$ from layer $1$ to $L$ . Since the same subgraph $A_{\mathcal{B},\mathcal{B}}$ (links within $\mathcal{B}$ ) is used for each layer of computation, we can then see that embedding utilization is the number of edges within this batch $\|A_{\mathcal{B},\mathcal{B}}\|_{0}$ . Therefore, to maximize embedding utilization, we should design a batch $\mathcal{B}$ to maximize the within-batch edges, by which we connect the efficiency of SGD updates with graph clustering algorithms.

Now we formally introduce Cluster-GCN. For a graph $G$ , we partition its nodes into $c$ groups: $\mathcal{V}=[\mathcal{V}_{1},\cdots\mathcal{V}_{c}]$ where $\mathcal{V}_{t}$ consists of the nodes in the $t$ -th partition. Thus we have $c$ subgraphs as

where each $\mathcal{E}_{t}$ only consists of the links between nodes in $\mathcal{V}_{t}$ . After reorganizing nodes, the adjacency matrix is partitioned into $c^{2}$ submatrices as

where each diagonal block $A_{tt}$ is a $|\mathcal{V}_{t}|\times|\mathcal{V}_{t}|$ adjacency matrix containing the links within $G_{t}$ . $\bar{A}$ is the adjacency matrix for graph $\bar{G}$ ; $A_{st}$ contains the links between two partitions $\mathcal{V}_{s}$ and $\mathcal{V}_{t}$ ; $\Delta$ is the matrix consisting of all off-diagonal blocks of $A$ . Similarly, we can partition the feature matrix $X$ and training labels $Y$ according to the partition $[\mathcal{V}_{1},\cdots,\mathcal{V}_{c}]$ as $[X_{1},\cdots,X_{c}]$ and $[Y_{1},\cdots,Y_{c}]$ where $X_{t}$ and $Y_{t}$ consist of the features and labels for the nodes in $V_{t}$ respectively.

The benefit of this block-diagonal approximation $\bar{G}$ is that the objective function of GCN becomes decomposible into different batches (clusters). Let $\bar{A}^{\prime}$ denotes the normalized version of $\bar{A}$ , the final embedding matrix becomes

due to the block-diagonal form of $\bar{A}$ (note that $\bar{A}_{tt}^{\prime}$ is the corresponding diagonal block of $\bar{A}^{\prime}$ ). The loss function can also be decomposed into

The Cluster-GCN is then based on the decomposition form in (6) and (7). At each step, we sample a cluster $\mathcal{V}_{t}$ and then conduct SGD to update based on the gradient of $\mathcal{L}_{\bar{A}_{tt}^{\prime}}$ , and this only requires the sub-graph $A_{tt}$ , the $X_{t}$ , $Y_{t}$ on the current batch and the models $\{W^{(l)}\}_{l=1}^{L}$ . The implementation only requires forward and backward propagation of matrix products (one block of (6)) that is much easier to implement than the neighborhood search procedure used in previous SGD-based training methods.

We use graph clustering algorithms to partition the graph. Graph clustering methods such as Metis (Karypis and Kumar, 1998) and Graclus (Dhillon et al., 2007) aim to construct the partitions over the vertices in the graph such that within-clusters links are much more than between-cluster links to better capture the clustering and community structure of the graph. These are exactly what we need because: 1) As mentioned before, the embedding utilization is equivalent to the within-cluster links for each batch. Intuitively, each node and its neighbors are usually located in the same cluster, therefore after a few hops, neighborhood nodes with a high chance are still in the same cluster. 2) Since we replace $A$ by its block diagonal approximation $\bar{A}$ and the error is proportional to between-cluster links $\Delta$ , we need to find a partition to minimize number of between-cluster links.

In Figure 1, we illustrate the neighborhood expansion with full graph $G$ and the graph with clustering partition $\bar{G}$ . We can see that cluster-GCN can avoid heavy neighborhood search and focus on the neighbors within each cluster. In Table 2, we show two different node partition strategies: random partition versus clustering partition. We partition the graph into 10 parts by using random partition and METIS. Then use one partition as a batch to perform a SGD update. We can see that with the same number of epochs, using clustering partition can achieve higher accuracy. This shows using graph clustering is important and partitions should not be formed randomly.

Since each node in $\mathcal{V}_{t}$ only links to nodes inside $\mathcal{V}_{t}$ , each node does not need to perform neighborhoods searching outside $A_{tt}$ . The computation for each batch will purely be matrix products $\bar{A}_{tt}^{\prime}X^{(l)}_{t}W^{(l)}$ and some element-wise operations, so the overall time complexity per batch is $O(\|{A}_{tt}\|_{0}F+bF^{2})$ . Thus the overall time complexity per epoch becomes $O(\|A\|_{0}F+NF^{2})$ . In average, each batch only requires computing $O(bL)$ embeddings, which is linear instead of exponential to $L$ . In terms of space complexity, in each batch, we only need to load $b$ samples and store their embeddings on each layer, resulting in $O(bLF)$ memory for storing embeddings. Therefore our algorithm is also more memory efficient than all the previous algorithms. Moreover, our algorithm only requires loading a subgraph into GPU memory instead of the full graph (though graph is usually not the memory bottleneck). The detailed time and memory complexity are summarized in Table 1.

2. Stochastic Multiple Partitions

Although vanilla Cluster-GCN achieves good computational and memory complexity, there are still two potential issues:

After the graph is partitioned, some links (the $\Delta$ part in Eq. (4)) are removed. Thus the performance could be affected.

Graph clustering algorithms tend to bring similar nodes together. Hence the distribution of a cluster could be different from the original data set, leading to a biased estimation of the full gradient while performing SGD updates.

In Figure 2, we demonstrate an example of unbalanced label distribution by using the Reddit data with clusters formed by Metis. We calculate the entropy value of each cluster based on its label distribution. Comparing with random partitioning, we clearly see that entropy of most clusters are smaller, indicating that the label distributions of clusters are biased towards some specific labels. This increases the variance across different batches and may affect the convergence of SGD.

To address the above issues, we propose a stochastic multiple clustering approach to incorporate between-cluster links and reduce variance across batches. We first partition the graph into $p$ clusters $\mathcal{V}_{1},\cdots,\mathcal{V}_{p}$ with a relatively large $p$ . When constructing a batch $B$ for an SGD update, instead of considering only one cluster, we randomly choose $q$ clusters, denoted as $t_{1},\dots,t_{q}$ and include their nodes $\{\mathcal{V}_{t_{1}}\cup\dots\cup\mathcal{V}_{t_{q}}\}$ into the batch. Furthermore, the links between the chosen clusters,

are added back. In this way, those between-cluster links are re-incorporated and the combinations of clusters make the variance across batches smaller. Figure 3 illustrates our algorithm—for each epochs, different combinations of clusters are chosen as a batch. We conduct an experiment on Reddit to demonstrate the effectiveness of the proposed approach. In Figure 4, we can observe that using multiple clusters as one batch could improve the convergence. Our final Cluster-GCN algorithm is presented in Algorithm 1.

3. Issues of training deeper GCNs

Previous attempts of training deeper GCNs (Kipf and Welling, 2017) seem to suggest that adding more layers is not helpful. However, the datasets used in the experiments may be too small to make a proper justification. For example, (Kipf and Welling, 2017) considered a graph with only a few hundreds of training nodes for which overfitting can be an issue. Moreover, we observe that the optimization of deep GCN models becomes difficult as it may impede the information from the first few layers being passed through. In (Kipf and Welling, 2017), they adopt a technique similar to residual connections (He et al., 2016) to enable the model to carry the information from a previous layer to a next layer. Specifically, they modify (1) to add the hidden representations of layer $l$ into the next layer.

Here we propose another simple technique to improve the training of deep GCNs. In the original GCN settings, each node aggregates the representation of its neighbors from the previous layer. However, under the setting of deep GCNs, the strategy may not be suitable as it does not take the number of layers into account. Intuitively, neighbors nearby should contribute more than distant nodes. We thus propose a technique to better address this issue. The idea is to amplify the diagonal parts of the adjacency matrix $A$ used in each GCN layer. In this way, we are putting more weights on the representation from the previous layer in the aggregation of each GCN layer. An example is to add an identity to $\bar{A}$ as follows.

While (9) seems to be reasonable, using the same weight for all the nodes regardless of their numbers of neighbors may not be suitable. Moreover, it may suffer from numerical instability as values can grow exponentially when more layers are used. Hence we propose a modified version of (9) to better maintain the neighborhoods information and numerical ranges. We first add an identity to the original $A$ and perform the normalization

Experimental results of adopting the “diagonal enhancement” techniques are presented in Section 4.3 where we show that this new normalization strategy can help to build deep GCN and achieve SOTA performance.

Experiments

We evaluate our proposed method for training GCN on two tasks: multi-label and multi-class classification on four public datasets. The statistic of the data sets are shown in Table 3. Note that the Reddit dataset is the largest public dataset we have seen so far for GCN, and the Amazon2M dataset is collected by ourselves and is much larger than Reddit (see more details in Section 4.2).

We include the following state-of-the-art GCN training algorithms in our comparisons:

Cluster-GCN (Our proposed algorithm): the proposed fast GCN training method.

VRGCNGitHub link: https://github.com/thu-ml/stochastic_gcn (Chen et al., 2018b): It maintains the historical embedding of all the nodes in the graph and expands to only a few neighbors to speedup training. The number of sampled neighbors is set to be 2 as suggested in (Chen et al., 2018b)Note that we also tried the default sample size 20 in VRGCN package but it performs much worse than sample size $=2$ . .

GraphSAGEGitHub link: https://github.com/williamleif/GraphSAGE (Hamilton et al., 2017): It samples a fixed number of neighbors per node. We use the default settings of sampled sizes for each layer ( $S_{1}=25,S_{2}=10$ ) in GraphSAGE.

We implement our method in PyTorch (Paszke et al., 2017). For the other methods, we use all the original papers’ code from their github pages. Since (Kipf and Welling, 2017) has difficulty to scale to large graphs, we do not compare with it here. Also as shown in (Chen et al., 2018b) that VRGCN is faster than FastGCN, so we do not compare with FastGCN here. For all the methods we use the Adam optimizer with learning rate as 0.01, dropout rate as 20%, weight decay as zero. The mean aggregator proposed by (Hamilton et al., 2017) is adopted and the number of hidden units is the same for all methods. Note that techniques such as (11) is not considered here. In each experiment, we consider the same GCN architecture for all methods. For VRGCN and GraphSAGE, we follow the settings provided by the original papers and set the batch sizes as 512. For Cluster-GCN, the number of partitions and clusters per batch for each dataset are listed in Table 4. Note that clustering is seen as a preprocessing step and its running time is not taken into account in training. In Section 6, we show that graph clustering only takes a small portion of preprocessing time. All the experiments are conducted on a machine with a NVIDIA Tesla V100 GPU (16 GB memory), 20-core Intel Xeon CPU (2.20 GHz), and 192 GB of RAM.

Training Time vs Accuracy: First we compare our proposed method with other methods in terms of training speed. In Figure 6, the $x$ -axis shows the training time in seconds, and $y$ -axis shows the accuracy (F1 score) on the validation sets. We plot the training time versus accuracy for three datasets with 2,3,4 layers of GCN. Since GraphSAGE is slower than VRGCN and our method, the curves for GraphSAGE only appear for PPI and Reddit datasets. We can see that our method is the fastest for both PPI and Reddit datasets for GCNs with different numbers of layers.

For Amazon data, since nodes’ features are not available, an identity matrix is used as the feature matrix $X$ . Under this setting, the shape of parameter matrix $W^{(0)}$ becomes 334863x128. Therefore, the computation is dominated by sparse matrix operations such as $AW^{(0)}$ . Our method is still faster than VRGCN for 3-layer case, but slower for 2-layer and 4-layer ones. The reason may come from the speed of sparse matrix operations from different frameworks. VRGCN is implemented in TensorFlow, while Cluster-GCN is implemented in PyTorch whose sparse tensor support are still in its very early stage. In Table 6, we show the time for TensorFlow and PyTorch to do forward/backward operations on Amazon data, and a simple two-layer network are used for benchmarking both frameworks. We can clearly see that TensorFlow is faster than PyTorch. The difference is more significant when the number of hidden units increases. This may explain why Cluster-GCN has longer training time in Amazon dataset.

Memory usage comparison: For training large-scale GCNs, besides training time, memory usage needed for training is often more important and will directly restrict the scalability. The memory usage includes the memory needed for training the GCN for many epochs. As discussed in Section 3, to speedup training, VRGCN needs to save historical embeddings during training, so it needs much more memory for training than Cluster-GCN. GraphSAGE also has higher memory requirement than Cluster-GCN due to the exponential neighborhood growing problem. In Table 5, we compare our memory usage with VRGCN’s memory usage for GCN with different layers. When increasing the number of layers, Cluster-GCN’s memory usage does not increase a lot. The reason is that when increasing one layer, the extra variable introduced is the weight matrix $W^{(L)}$ , which is relatively small comparing to the sub-graph and node features. While VRGCN needs to save each layer’s history embeddings, and the embeddings are usually dense and will soon dominate the memory usage. We can see from Table 5 that Cluster-GCN is much more memory efficient than VRGCN. For instance, on Reddit data to train a 4-layer GCN with hidden dimension to be 512, VRGCN needs 2064MB memory, while Cluster-GCN only uses 308MB memory.

2. Experimental results on Amazon2M

A new GCN dataset: Amazon2M. By far the largest public data for testing GCN is Reddit dataset with the statistics shown in Table 3, which contains about 200K nodes. As shown in Figure 6 GCN training on this data can be finished within a few hundreds seconds. To test the scalability of GCN training algorithms, we constructed a much larger graph with over 2 millions of nodes and 61 million edges based on Amazon co-purchasing networks (McAuley et al., 2015b; McAuley et al., 2015a). The raw co-purchase data is from Amazon-3Mhttp://manikvarma.org/downloads/XC/XMLRepository.html. In the graph, each node is a product, and the graph link represents whether two products are purchased together. Each node feature is generated by extracting bag-of-word features from the product descriptions followed by Principal Component Analysis (Hotelling, 1933) to reduce the dimension to be 100. In addition, we use the top-level categories as the labels for that product/node (see Table 7 for the most common categories). The detailed statistics of the data set are listed in Table 3.

In Table 8, we compare with VRGCN for GCNs with a different number of layers in terms of training time, memory usage, and test accuracy (F1 score). As can be seen from the table that 1) VRGCN is faster than Cluster-GCN with 2-layer GCN but slower than Cluster-GCN when increasing one layer while achieving similar accuracy. 2) In terms of memory usage, VRGCN is using much more memory than Cluster-GCN (5 times more for 3-layer case), and it is running out of memory when training 4-layer GCN, while Cluster-GCN does not need much additional memory when increasing the number of layers, and achieves the best accuracy for this data when training a 4-layer GCN.

3. Training Deeper GCN

In this section we consider GCNs with more layers. We first show the timing comparisons of Cluster-GCN and VRGCN in Table 9. PPI is used for benchmarking and we run 200 epochs for both methods. We observe that the running time of VRGCN grows exponentially because of its expensive neighborhood finding, while the running time of Cluster-GCN only grows linearly.

Next we investigate whether using deeper GCNs obtains better accuracy. In Section 4.3, we discuss different strategies of modifying the adjacency matrix $A$ to facilitate the training of deep GCNs. We apply the diagonal enhancement techniques to deep GCNs and run experiments on PPI. Results are shown in Table 11. For the case of 2 to 5 layers, the accuracy of all methods increases with more layers added, suggesting that deeper GCNs may be useful. However, when 7 or 8 GCN layers are used, the first three methods fail to converge within 200 epochs and get a dramatic loss of accuracy. A possible reason is that the optimization for deeper GCNs becomes more difficult. We show a detailed convergence of a 8-layer GCN in Figure 5. With the proposed diagonal enhancement technique (11), the convergence can be improved significantly and similar accuracy can be achieved.

State-of-the-art results by training deeper GCNs. With the design of Cluster-GCN and the proposed normalization approach, we now have the ability for training much deeper GCNs to achieve better accuracy (F1 score). We compare the testing accuracy with other existing methods in Table 10. For PPI, Cluster-GCN can achieve the state-of-art result by training a 5-layer GCN with 2048 hidden units. For Reddit, a 4-layer GCN with 128 hidden units is used.

Conclusion

We present ClusterGCN, a new GCN training algorithm that is fast and memory efficient. Experimental results show that this method can train very deep GCN on large-scale graph, for instance on a graph with over 2 million nodes, the training time is less than an hour using around 2G memory and achieves accuracy of 90.41 (F1 score). Using the proposed approach, we are able to successfully train much deeper GCNs, which achieve state-of-the-art test F1 score on PPI and Reddit datasets.

Acknowledgement CJH acknowledges the support of NSF via IIS-1719097, Intel faculty award, Google Cloud and Nvidia.

References

More Details about the experiments

In this section we describe more detailed settings about the experiments to help in reproducibility.

We describe more details about the datasets in Table 12. We download the datasets PPI, Reddit from the websitehttp://snap.stanford.edu/graphsage/ and Amazon from the websitehttps://github.com/Hanjun-Dai/steady_state_embedding. Note that for Amazon, we consider GCN in an inductive setting, meaning that the model only learns from training data. In (Dai et al., 2018) they consider a transductive setting. Regarding software versions, we install CUDA 10.0 and cuDNN 7.0. TensorFlow 1.12.0 and PyTorch 1.0.0 are used. We download METIS 5.1.0 via the offcial websitehttp://glaros.dtc.umn.edu/gkhome/metis/metis/download and use a Python wrapperhttps://metis.readthedocs.io/en/latest/ for METIS library.

2. Implementation details

Previous works (Chen et al., 2018a, b) propose to pre-compute the multiplication of $AX$ in the first GCN layer. We also adopt this strategy in our implementation. By precomputing $AX$ , we are essentially using the exact 1-hop neighborhood for each node and the expensive neighbors searching in the first layer can be saved.

Another implementation detail is about the technique mentioned in Section 3.2 When multiple clusters are selected, some between-cluster links are added back. Thus the new combined adjacency matrix should be re-normalized to maintain numerical ranges of the resulting embedding matrix. From experiments we find the renormalization is helpful.

As for the inductive setting, the testing nodes are not visible during the training process. Thus we construct an adjacency matrix containing only training nodes and another one containing all nodes. Graph partitioning are applied to the former one and the partitioned adjacency matrix is then re-normalized. Note that feature normalization is also conducted. To calculate the memory usage, we consider tf.contrib.memory_stats.BytesInUse() for TensorFlow and torch.cuda.memory_allocated() for PyTorch.

3. The running time of graph clustering algorithm and data preprocessing

The experiments of comparing different GCN training methods in Section 4 consider running time for training. The preprocessing time for each method is not presented in the tables and figures. While some of these preprocessing steps such as data loading or parsing are shared across different methods, some steps are algorithm specific. For instance, our method needs to run graph clustering algorithm during the preprocessing stage.

In Table 13, we present more details about preprocessing time of Cluster-GCN on the four GCN datasets. For graph clustering, we adopt Metis, which is a fast and scalable graph clustering library. We observe that the graph clustering algorithm only takes a small portion of preprocessing time, showing a small extra cost while applying such algorithms and its scalability on large data sets. In addition, graph clustering only needs to be conducted once to form the node partitions, which can be re-used for later training processes.