Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks
Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, Quanquan Gu
Introduction
Graph convolutional networks (GCNs) recently proposed by Kipf et al. adopt the concept of convolution filter into graph domain . For a given node, a GCN layer aggregates the embeddings of its neighbors from the previous layer, followed by a non-linear transformation, to obtain an updated contextualized node representation. Similar to the convolutional neural networks (CNNs) in the computer vision domain, by stacking multiple GCN layers, each node representation can utilize a wide receptive field from both its immediate and distant neighbors, which intuitively increases the model capacity.
Despite the success of GCNs in many graph-related applications , training a deep GCN for large-scale graphs remains a big challenge. Unlike tokens in a paragraph or pixels in an image, which normally have limited length or size, graph data in practice can be extremely large. For example, Facebook social network in 2019 contains billion usershttps://zephoria.com/top-15-valuable-facebook-statistics/. Such a large-scale graph is impossible to be handled using full-batch GCN training, which takes all the nodes into one batch to update parameters. However, conducting mini-batch GCN training is non-trivial, as the nodes in a graph are closely coupled and correlated. In particular, in GCNs, the embedding of a given node depends recursively on all its neighbors’ embeddings, and such computation dependency grows exponentially with respect to the number of layers. Therefore, training deep and large GCNs is still a challenge, which prevents their application to many large-scale graphs, such as social networks , recommender systems , and knowledge graphs .
To alleviate the previous issues, researchers have proposed sampling-based methods to train GCNs based on mini-batch of nodes, which only aggregate the embeddings of a sampled subset of neighbors of each node in the mini-batch. Among them, one direction is to use a node-wise neighbor-sampling method. For example, GraphSAGE calculates each node embedding by leveraging only a fixed number of uniformly sampled neighbors. Although this kind of approaches reduces the computation cost in each aggregation operation, the total cost can still be large. As is pointed out in , the recursive nature of node-wise sampling brings in redundancy for calculating embeddings. Even if two nodes share the same sampled neighbor, the embedding of this neighbor has to be calculated twice. Such redundant calculation will be exaggerated exponentially when the number of layers increases. Following this line of research as well as reducing the computation redundancy, a series of work was proposed to reduce the size of sampled neighbors. VR-GCN proposes to leverage variance reduction techniques to improve the sample complexity. Cluster-GCN considers restricting the sampled neighbors within some dense subgraphs, which are identified by a graph clustering algorithm before the training of GCN. However, these methods still cannot well address the issue of redundant computations, which may become worse when training very deep and large GCNs.
Another direction uses a layer-wise importance-sampling method. For example, FastGCN calculates a sampling probability based on the degree of each node, and samples a fixed number of nodes for each layer accordingly. Then, it only uses the sampled nodes to build a much smaller sampled adjacency matrix, and thus the computation cost is reduced. Ideally, the sampling probability is calculated to reduce the estimation variance in FastGCN , and guarantee fast and stable convergence. However, since the sampling probability is independent for each layer, the sampled nodes from two consecutive layers are not necessarily connected. Therefore, the sampled adjacency matrix can be extremely sparse, and may even have all-zero rows, meaning some nodes are disconnected. Such a sparse connection problem incurs an inaccurate computation graph and further deteriorates the training and generalization performance of FastGCN.
Based on the pros and cons of the aforementioned sampling approaches, we conclude that an ideal sampling method should have the following features: 1) layer-wise, thus the neighbor nodes can be taken into account together to calculate next layers’ embeddings without redundancy; 2) neighbor-dependent, thus the sampled adjacency matrix is dense without losing much information for training; 3) the importance sampling method should be adopted to reduce the sampling variance and accelerate convergence. To this end, we propose a new efficient sampling algorithm called LAyer-Dependent ImportancE-Sampling (LADIES) We would like to clarify that the proposed layer-dependent importance sampling is different from “layered importance sampling” proposed in . , which fulfills all the above features.
The procedure of LADIES is described as below: (1) For each current layer (), based on the nodes sampled in the upper layer (), it picks all their neighbors in the graph, and constructs a bipartite graph among the nodes between the two layers. (2) Then it calculates the sampling probability according to the degree of nodes in the current layer, with the purpose to reduce sampling variance. (3) Next, it samples a fixed number of nodes based on this probability. (4) Finally, it constructs the sampled adjacency matrix between layers and conducts training and inference, where row-wise normalization is applied to all sampled adjacency matrices to stabilize training. As illustrated in Figure 1, our proposed sampling strategy can avoid two pitfalls faced by existing two sampling strategies: layer-wise structure avoids exponential expansion of receptive field; layer-dependent importance sampling guarantees the sampled adjacency matrix to be dense such that the connectivity between nodes in two adjacent layers can be well maintained.
We highlight our contributions as follows:
We propose LAyer-Dependent ImportancE Sampling (LADIES) for training deep and large graph convolutional networks, which is built upon a novel layer-dependent sampling scheme to avoid exponential expansion of receptive field as well as guarantee the connectivity of the sampled adjacency matrix.
We prove theoretically that the proposed algorithm achieves significantly better memory and time complexities compared with node-wise sampling methods including GraphSage and VR-GCN , and has a dramatically smaller variance compared with FastGCN .
We evaluate the performance of the proposed LADIES algorithm on benchmark datasets and demonstrate its superior performance in terms of both running time and test accuracy. Moreover, we show that LADIES achieves high accuracy with an extremely small sample size (e.g., 256 for a graph with 233k nodes), which enables the use of LADIES for training very deep and large GCNs.
Background
In this section, we review GCNs and several state-of-the-art sampling-based training algorithms, including full-batch, node-wise neighbor sampling methods, and layer-wise importance sampling methods. We summarize the notation used in this paper in Table 1.
Full-batch GCN. When given an undirected graph , with defined in Table 1, the -th GCN layer is defined as:
When the graph is large and dense, full-batch GCN’s memory and time costs can be very expensive while computing such gradient, since during both backward and forward propagation process it requires to store and aggregate embeddings for all nodes across all layer. Also, since each epoch only updates parameters once, the convergence would be very slow.
Node-wise Neighbor Sampling Algorithms. GraphSAGE proposed to reduce receptive field size by neighbor sampling (NS). For each node in -th GCN layer, NS randomly samples of its neighbors at the -th GCN layer and formulate an unbiased estimator to approximate in graph convolution layer:
where and are the full and sampled neighbor sets of node for -th GCN layer respectively.
VR-GCN is another neighbor sampling work. It proposed to utilize historical activations to reduce the variance of the estimator under the same sample strategy as GraphSAGE. Though successfully achieved comparable convergence rate, the memory complexity is higher and the time complexity remains the same. Though NS scheme alleviates the memory bottleneck of GCN, there exists redundant computation under NS since the embedding calculation for each node in the same layer is independent, thus the complexity grows exponentially when the number of layers increases.
Layer-wise Importance Sampling Algorithms. FastGCN proposed a more advanced layer-wise importance sampling (IS) scheme aiming to solve the scalability issue as well as reducing the variance. IS conducts sampling for each layer with a degree-based sampling probability. The approximation of the -th row of with samples per layer can be estimated as:
where is the distribution over , is the probability assigned to node .
Though IS outperforms uniform sampling and the layer-wise sampling successfully reduced both time and memory complexities, however, this sampling strategy has a major limitation: since the sampling operation is conducted independently at each layer, FastGCN cannot guarantee connectivity between sampled nodes at different layers, which incurs large variance of the approximate embeddings.
2 Complexity and Variance Comparison
We compare each method’s memory, time complexity, and variance with that of our proposed LADIES algorithm in Table 5.
Complexity. We now compare the complexity of the proposed LADIES algorithm and existing sampling-based GCN training algorithms. The complexity for all methods are summarized in Table 5, detailed calculation could be found in Appendix A. Compare with full-batch GCN, the time and memory complexities of LADIES do not depend on the total number of nodes and edges, thus our algorithm does not have scalability issue on large and dense graphs. Unlike NS based methods including GraphSAGE and VR-GCN, LADIES is not sensitive to the number of layers and will not suffer exponential growth in complexity, therefore it can perform well when the neural network goes deeper. Compared to layer-wise importance sampling proposed in FastGCN, it maintains the same complexity while obtaining a better convergence guarantee as analyzed in the next paragraph. In fact, in order to guarantee good performance, our method requires a much smaller sample size than that of FastGCN, thus the time and memory burden is much lighter. Therefore, our proposed LADIES algorithm can achieve the best time and memory complexities and is applicable to training very deep and large graph convolution networks.
Variance. We compare the variance of our algorithm with existing sampling-based algorithms. To simplify the analysis, when evaluating the variance we only consider two-layer GCN. The results are summarized in Table 5. We defer the detailed calculation to Appendix B. Compared with FastGCN, our variance result is strictly better since , where denotes the number of nodes which are connected to the nodes sampled in the training batch. Moreover, scales with , which implies that our method can be even better when using a small batch size. Compared with node-wise sampling, consider the same sample size, i.e., . Ignoring the same factors, the variance of LADIES is in the order of and the variance of GraphSAGE is , where denotes the average degree. Based on the definition of , we strictly have since there is no redundant node been calculated in . Therefore our method is also strictly better than GraphSAGE especially when the graph is dense, i.e., many neighbors can be shared. The variance of VR-GCN resembles that of GraphSAGE but relies on the difference between the embedding and its history, which is not directly comparable to our results.
LADIES: LAyer-Dependent ImportancE Sampling
We present our method, LADIES, in this section. As illustrated in previous sections, for node-wise sampling methods , one has to sample a certain number of nodes in the neighbor set of all sampled nodes in the current layer, then the number of nodes that are selected to build the computation graph is exponentially large with the number of hidden layers, which further slows down the training process of GCNs. For the existing layer-wise sampling scheme , it is inefficient when the graph is sparse, since some nodes may have no neighbor been sampled during the sampling process, which results in meaningless zero activations .
To address the aforementioned drawbacks and weaknesses of existing sampling-based methods for GCN training, we propose our training algorithms that can achieve good convergence performance as well as reduce sampling complexity. In the following, we are going to illustrate our method in detail.
We first revisit the independent layer-wise sampling scheme for building the computation graph of GCN training. Recall in the forward process of GCN (1), the matrix production can be regarded as the embedding aggregation process. Then, the layer-wise sampling methods aim to approximate the intermediate embedding matrix by only sampling a subset of nodes at the -th layer and aggregating their embeddings for approximately estimating the embeddings at the -th layer. Mathematically, similar to (5), let with be the set of sampled nodes at the -th layer, we can approximate as
where we adopt non-uniformly sampling scheme by assigning probabilities to all nodes in . Then the corresponding discount weights are . Then let be the indices of sampled nodes at the -th layer, the estimator of can be formulated as
the forward process of GCN with layer-wise sampling can be approximated by
2 Layer-dependent Importance Sampling
However, independently conducting layer-wise sampling at different layers is not efficient since the sampled bipartite graph may still be sparse and even have all-zero rows. This further results in very poor performance and require us to sample lots of nodes in order to guarantee convergence throughout the GCN training. To alleviate this issue, we propose to apply neighbor-dependent sampling that can leverage the dependency between layers which further leads to dense computation graph. Specifically, our layer-dependent sampling mechanism is designed in a top-down manner, i.e., the sampled nodes at the -th layer are generated depending on the sampled nodes that have been generated in all upper layers. Note that for each node we only need to aggregate the embeddings from its neighbor nodes in the previous layer. Thus, at one particular layer, we only need to generate samples from the union of neighbors of the nodes we have sampled in the upper layer, which is defined by
where is the set of nodes we have sampled at the -th layer and denotes the neighbors set of node . Therefore during the sampling process, we only assign probability to nodes in , denoted by . Similar to FastGCN , we apply importance sampling to reduce the variance. However, we have no information regarding the activation matrix when characterizing the samples at the -th layer. Therefore, we resort to a important sampling scheme which only relies on the matrices and . Specifically, we define the importance probabilities as:
Since , applying independent sampling method results in the fact that many nodes are in the set , which has no contribution to the construction of computation graph. In contrast, we only sample nodes from which can guarantee more connections between the sampled nodes at -th and -th layers, and further leads to a dense computation graph between these two layers.
3 Normalization
Experiments
In this section, we conduct experiments to evaluate LADIES for training deep GCNs on different node classification datasets, including Cora, Citeseer, Pubmed and Reddit .
We compare LADIES with the original GCN (full-batch) , GraphSage (neighborhood sampling) and FastGCN (important sampling). We modified the PyTorch implementation of GCN https://github.com/tkipf/pygcn to add our LADIES sampling mechanism. To make the fair comparison only on the sampling part, we also choose the online PyTorch implementation of all these baselines released by their authors and use the same training code for all the methods. By default, we train 5-layer GCNs with hidden state dimension as 256, using the four methods. We choose 5 neighbors to be sampled for GraphSage, 64 and 512 nodes to be sampled for both FastGCN and LADIES per layer. We update the model with a batch size of 512 and ADAM optimizer with a learning rate of 0.001.
For all the methods and datasets, we conduct training for 10 times and take the mean and variance of the evaluation results. Each time we stop training when the validation accuracy doesn’t increase a threshold (0.01) for 200 batches, and choose the model with the highest validation accuracy as the convergence point. We use the following metrics to evaluate the effectiveness of sampling methods:
Accuracy: The micro F1-score of the test data at the convergence point. We calculate it using the full-batch version to get the most accurate inference (only care about training).
Total Running Time (s): The total training time (exclude validation) before convergence point.
Memory (MB): Total memory costs of model parameters and all hidden representations of a batch.
Batch Time and Num: Time cost to run a batch and the total batch number before convergence.
2 Experiment Results
As is shown in Table 3, our proposed LADIES can achieve the highest accuracy score among all the methods, using a small sampling number. One surprising thing is that the sampling-based method can achieve higher accuracy than the Full-Batch version, and in some cases using a smaller sampling number can lead to better accuracy (though it may take longer batches to converge). This is probably because the graph data is incomplete and noisy, and the stochastic nature of the sampling method can bring in regularization for training a more robust GCN with better generalization accuracy . Another observation is that no matter the total size of the graph, LADIES with a small sample number (64) can still converge well, and sometimes even better than a larger sample number (512). This indicates that LADIES is scalable to training very large and deep GCN while maintaining high accuracy.
As a comparison, FastGCN with 64 and 512 sampled nodes can lead to similar accuracy for small graphs (Cora). But for bigger graphs as Citeseer, Pubmed, and Reddit, it cannot converge to a good point, partly because of the computation graph sparsity issue. For a fair comparison, we choose a higher sampling number for FastGCN on these big graphs. For example, in Reddit, we choose 8192 nodes to be sampled, and FastGCN in such cases can converge to a similar accurate result compared with LADIES, but obviously taking more memory and time cost. GraphSage with 5 nodes to be sampled takes far more memory and time cost because of the redundancy problem we’ve discussed, and its uniform sampling makes it fail to converge well and fast compared to importance sampling.
In addition to the above comparison, we show that our proposed LADIES can converge pretty well with a much fewer sampling number. As is shown in Figure 2, when we choose the sampling number as small as 16, the algorithm can already converge to the best result, with low time and memory cost. This implies that although in Table 3, we choose sample number as 64 and 512 for a fair comparison, but actually, the performance of our method can be further enhanced with a smaller sampling number.
Furthermore, we show that the stochastic nature of LADIES can help to achieve better generalization accuracy than original full-batch GCN. We plot the F1-score of both full-batch GCN and LADIES on the PubMed dataset for 300 epochs without early stop in Figure 3. From Figure 3(a), we can see that, full-batch GCN can achieve higher F1-Score than LADIES on the training set. Nevertheless, on the validation and test datasets, we can see from Figures 3(b) and 3(c) that LADIES can achieve significantly higher F1-Score than full-batch GCN. This suggests that LADIES has better generalization performance than full-batch GCN. The reason is: real graphs are often noisy and incomplete. Full-batch GCN uses the entire graph in the training phase, which can cause overfitting to the noise. In sharp contrast, LADIES employs stochastic sampling to use partial information of the graph and therefore can mitigate the noise of the graph and avoid overfitting to the training data. At a high-level, the sampling scheme adopted in LADIES shares a similar spirit as bagging and bootstrap , which is known to improve the generalization performance of machine learning predictors.
Conclusions
We propose a new algorithm namely LADIES for training deep and large GCNs. The crucial ingredient of our algorithm is layer-dependent importance sampling, which can both ensure dense computation graph as well as avoid drastic expansion of receptive field. Theoretically, we show that LADIES enjoys significantly lower memory cost, time complexity and estimation variance, compared with existing GCN training methods including GraphSAGE and FastGCN. Experimental results demonstrate that LADIES can achieve the best test accuracy with much lower computational time and memory cost on benchmark datasets.
Acknowledgement
We would like to thank the anonymous reviewers for their helpful comments. D. Zou and Q. Gu were partially supported by the NSF BIGDATA IIS-1855099, NSF CAREER Award IIS-1906169 and Salesforce Deep Learning Research Award. Z. Hu, Y. Wang, S. Jiang and Y. Sun were partially supported by NSF III-1705169, NSF 1937599, NSF CAREER Award 1741634, and Amazon Research Award. We also thank AWS for providing cloud computing credits associated with the NSF BIGDATA award. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.
References
Appendix A Complexity comparison between different algorithms
We illustrate how to compute memory and time complexity in this part. The notations remains the same as in table 1 and 5.
The memory requirement for all the mentioned methods consists of two parts, including for storing the intermediate embedding matrices and weight matrices. The first part varies for different algorithms, but the second part is always since it has to store weight matrices with dimension .
For computation cost, it mainly consists of two steps, including the feature propagation and feature transformation. Feature propagation corresponds to the neighbors’ embeddings aggregation operation while updating the activations, and the feature transformation corresponds to the linear transformation associated with learnable weight matrix .
For full-batch GCN, it stores the intermediate embedding matrices for all layers, since it has layers, each layer has nodes with dimension of , thus its memory complexity for embedding matrices storage is , combining with weight matrices’ memory requirement , the memory complexity for full-batch GCN is in total. For its time complexity, the feature propagation operation corresponds to sparse-dense matrix multiplication , which leads to time complexity . The feature transformation operation corresponds to dense-dense matrix multiplication , which leads to time complexity . Thus, the per-batch time complexity for a -layer network is in total.
Now we analyze the complexity of node-wise sampling methods. For GraphSAGE, with batch-size , memory requirement for storing the intermediate embedding matrices is , since it has -dimension embedding aggregation operation. This leads to total memory complexity . For time complexity, for each batch of b nodes, we need to update activations for one node. Each new activation needs to aggregate embeddings in previous layers, so for one batch, the computation cost for feature propagation is . For feature transformation, since each new activation has a vector-matrix multiplication operation after the aggregation, thus the computation cost would be for this part. Therefore, we know the total per-batch time complexity for GraphSAGE is .
For another node-wise sampling strategy VR-GCN, it needs to store all historical activations, so its storage requirement for embedding matrices is , therefore the total memory is . The time complexity is the same as GraphSAGE, which is .
For layer-wise sampling method FastGCN, for embedding matrices storage, it only has to store embedding vectors in the last layer, and embedding vectors in the previous layers, so the total memory complexity for this part is . Also, it is easy to see that time complexity would be for feature propagation, and for feature transformation. Usually, we have , so after discarding relatively small terms, the memory complexity in total would be , and time complexity in total is .
Then, for our proposed LADIES algorithm, since its also a layer-wise sampling method, its memory complexity and time complexity are and respectively, which is same as FastGCN.
Appendix B Variance comparison between different algorithms
We first provide the following lemma which is useful in our computation of variance.
Note that , we have
In order to explicitly compare the variance of different algorithms, we made the following assumptions on the Laplacian matrix .
We assume there exist absolute constants and such that the following holds for all ,
Then we make the following assumption on the matrix product .
Variance of GraphSAGE.
We assume for each node, GraphSAGE randomly sample nodes from its neighbors to estimate the embedding. Also, the randomness of sampling at the hidden layer is independent of that at the output layer. Therefore, it is clearly that
where the second inequality is by Assumption 1 and the definition of , and the second inequality is by Assumption 2.
Variance of VR-GCN.
Assuming we have history activation matrix , VR-GCN turns the estimation of to that of . Thus, we can simply derive the variance of VR-GCN by replacing in (B) with . Therefore, the variance of VR-GCN is
The variance of VR-GCN no longer depends on the norm of but . We define and thus
Variance of LADIES.
Different from other algorithms, the random samples at the output layer and hidden layer are not independent. We first define by the union of neighbors of nodes in (i.e., nodes sampled in the upper layer). Then we have
where is a diagonal matrix with if and otherwise, and the second equation is by Lemma 1 and our choice of importance probabilities.
where denotes the average number of nodes that are connected to the nodes sampled in the training batch.
Appendix C Convergence Curves
Figure 4 shows the training curve of LADIES on Pubmed, with different sampling number. We can see that even with a very small sampling number (i.e., 8), it’s already converged pretty well. On the contrary, our previous experiments show that FastGCN with a huge sampling number (i.e., 512) cannot even converge. Figure 5 shows that our method can achieve better and robust convergence performance than FastGCN with smaller sampling number and fewer time.
Appendix D Differences with FastGCN
Since our proposed LADIES also uses Layer-wise Importance Sampling, here we emphasize the key differences in the following aspects: (1) our algorithm restricts the candidate nodes in the union of the neighborhoods of the sampled nodes in the upper layer, which can lead to a much denser sampled adjacency matrix compared with FastGCN; (2) our sampling method is layer-dependent, and the importance sampling probabilities are novelly derived from optimizing on the theoretical derivation of the modified Laplacian matrix (see formula (14)), which leads to smaller estimation variance than FastGCN, as shown in Table 2; and (3) our algorithm uses normalization to the modified Laplacian matrix in each layer, which further stabilizes the forward process and avoids exploding/vanishing gradient. Because of the above innovative and principled algorithmic designs, our algorithm enjoys lower time and memory complexities, as well as better predictive performance than FastGCN in both theory and practice.