GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models

Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, Jure Leskovec

Introduction and Related Work

Generative models for real-world graphs have important applications in many domains, including modeling physical and social interactions, discovering new chemical and molecular structures, and constructing knowledge graphs. Development of generative graph models has a rich history, and many methods have been proposed that can generate graphs based on a priori structural assumptions (Newman, 2010). However, a key open challenge in this area is developing methods that can directly learn generative models from an observed set of graphs. Developing generative models that can learn directly from data is an important step towards improving the fidelity of generated graphs, and paves a way for new kinds of applications, such as discovering new graph structures and completing evolving graphs.

In contrast, traditional generative models for graphs (e.g., Barabási-Albert model, Kronecker graphs, exponential random graphs, and stochastic block models) (Erdős & Rényi, 1959; Leskovec et al., 2010; Albert & Barabási, 2002; Airoldi et al., 2008; Leskovec et al., 2007; Robins et al., 2007) are hand-engineered to model a particular family of graphs, and thus do not have the capacity to directly learn the generative model from observed data. For example, the Barabási-Albert model is carefully designed to capture the scale-free nature of empirical degree distributions, but fails to capture many other aspects of real-world graphs, such as community structure.

Recent advances in deep generative models, such as variational autoencoders (VAE) (Kingma & Welling, 2014) and generative adversarial networks (GAN) (Goodfellow et al., 2014), have made important progress towards generative modeling for complex domains, such as image and text data. Building on these approaches a number of deep learning models for generating graphs have been proposed (Kipf & Welling, 2016; Grover et al., 2017; Simonovsky & Komodakis, 2018; Li et al., 2018). For example, Simonovsky & Komodakis 2018 propose a VAE-based approach, while Li et al. 2018 propose a framework based upon graph neural networks. However, these recently proposed deep models are either limited to learning from a single graph (Kipf & Welling, 2016; Grover et al., 2017) or generating small graphs with 40 or fewer nodes (Li et al., 2018; Simonovsky & Komodakis, 2018)—limitations that stem from three fundamental challenges in the graph generation problem:

Large and variable output spaces: To generate a graph with nn nodes the generative model has to output n2n^{2} values to fully specify its structure. Also, the number of nodes nn and edges mm varies between different graphs and a generative model needs to accommodate such complexity and variability in the output space.

Non-unique representations: In the general graph generation problem studied here, we want distributions over possible graph structures without assuming a fixed set of nodes (e.g., to generate candidate molecules of varying sizes). In this general setting, a graph with nn nodes can be represented by up to n!n! equivalent adjacency matrices, each corresponding to a different, arbitrary node ordering/numbering. Such high representation complexity is challenging to model and makes it expensive to compute and then optimize objective functions, like reconstruction error, during training. For example, GraphVAE (Simonovsky & Komodakis, 2018) uses approximate graph matching to address this issue, requiring O(n4)O(n^{4}) operations in the worst case (Cho et al., 2014).

Present work. Here we address the above challenges and present Graph Recurrent Neural Networks (GraphRNN), a scalable framework for learning generative models of graphs. GraphRNN models a graph in an autoregressive (or recurrent) manner—as a sequence of additions of new nodes and edges—to capture the complex joint probability of all nodes and edges in the graph. In particular, GraphRNN can be viewed as a hierarchical model, where a graph-level RNN maintains the state of the graph and generates new nodes, while an edge-level RNN generates the edges for each newly generated node. Due to its autoregressive structure, GraphRNN can naturally accommodate variable-sized graphs, and we introduce a breadth-first-search (BFS) node-ordering scheme to drastically improve scalability. This BFS approach alleviates the fact that graphs have non-unique representations—by collapsing distinct representations to unique BFS trees—and the tree-structure induced by BFS allows us to limit the number of edge predictions made for each node during training. Our approach requires O(n2)O(n^{2}) operations on worst-case (i.e., complete) graphs, but we prove that our BFS ordering scheme permits sub-quadratic complexity in many cases.

In addition to the novel GraphRNN framework, we also introduce a comprehensive suite of benchmark tasks and baselines for the graph generation problem, with all code made publicly availableThe code is available in https://github.com/snap-stanford/GraphRNN, the appendix is available in https://arxiv.org/abs/1802.08773.. A key challenge for the graph generation problem is quantitative evaluation of the quality of generated graphs. Whereas prior studies have mainly relied on visual inspection or first-order moment statistics for evaluation, we provide a comprehensive evaluation setup by comparing graph statistics such as the degree distribution, clustering coefficient distribution and motif counts for two sets of graphs based on variants of the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012). This quantitative evaluation approach can compare higher order moments of graph-statistic distributions and provides a more rigorous evaluation than simply comparing mean values.

Extensive experiments on synthetic and real-world graphs of varying size demonstrate the significant improvement GraphRNN provides over baseline approaches, including the most recent deep graph generative models as well as traditional models. Compared to traditional baselines (e.g., stochastic block models), GraphRNN is able to generate high-quality graphs on all benchmark datasets, while the traditional models are only able to achieve good performance on specific datasets that exhibit special structures. Compared to other state-of-the-art deep graph generative models, GraphRNN is able to achieve superior quantitative performance—in terms of the MMD distance between the generated and test set graphs—while also scaling to graphs that are 50×50\times larger than what these previous approaches can handle. Overall, GraphRNN reduces MMD by 80%-90%80\%\text{-}90\% over the baselines on average across all datasets and effectively generalizes, achieving comparatively high log-likelihood scores on held-out data.

Proposed Approach

We first describe the background and notation for building generative models of graphs, and then describe our autoregressive framework, GraphRNN.

Finally, note that traditional graph generative models (surveyed in the introduction) usually assume a single input training graph. Our approach is more general and can be applied to a single as well as multiple input training graphs.

2 A Brief Survey of Possible Approaches

We start by surveying some general alternative approaches for modeling p(G)p(G), in order to highlight the limitations of existing non-autoregressive approaches and motivate our proposed autoregressive architecture.

Node-embedding based models. There have been recent successes in encoding a graph’s structural properties into node embeddings (Hamilton et al., 2017), and one approach to graph generation could be to define a generative model that decodes edge probabilities based on pairwise relationships between learned node embeddings (as in Kipf & Welling 2016). However, this approach is only well-defined when given a fixed-set of nodes, limiting its utility for the general graph generation problem, and approaches based on this idea are limited to learning from a single input graph (Kipf & Welling, 2016; Grover et al., 2017).

3 GraphRNN: Deep Generative Models for Graphs

The key idea of our approach is to represent graphs under different node orderings as sequences, and then to build an autoregressive generative model on these sequences. As we will show, this approach does not suffer from the drawbacks common to other general approaches (c.f., Section 2.2), allowing us to model graphs of varying size with complex edge dependencies, and we introduce a BFS node ordering scheme to drastically reduce the complexity of learning over all possible node sequences (Section 2.3.4). In this autoregressive framework, the model complexity is greatly reduced by weight sharing with recurrent neural networks (RNNs). Figure 1 illustrates our GraphRNN approach, where the main idea is that we decompose graph generation into a process that generates a sequence of nodes (via a graph-level RNN), and another process that then generates a sequence of edges for each newly added node (via an edge-level RNN).

We first define a mapping fSf_{S} from graphs to sequences, where for a graph Gp(G)G\sim p(G) with nn nodes under node ordering π\pi, we have

where each element Siπ{0,1}i1,i{1,...,n}S^{\pi}_{i}\in\{0,1\}^{i-1},i\in\{1,...,n\} is an adjacency vector representing the edges between node π(vi)\pi(v_{i}) and the previous nodes π(vj),j{1,...,i1}\pi(v_{j}),j\in\{1,...,i-1\} already in the graph:We prohibit self-loops and S1πS^{\pi}_{1} is defined as an empty vector.

For undirected graphs, SπS^{\pi} determines a unique graph GG, and we write the mapping as fG()f_{G}(\cdot) where fG(Sπ)=Gf_{G}(S^{\pi})=G.

Thus, instead of learning p(G)p(G), whose sample space cannot be easily characterized, we sample the auxiliary π\pi to get the observations of SπS^{\pi} and learn p(Sπ)p(S^{\pi}), which can be modeled autoregressively due to the sequential nature of SπS^{\pi}. At inference time, we can sample GG without explicitly computing p(G)p(G) by sampling SπS^{\pi}, which maps to GG via fGf_{G}.

Given the above definitions, we can write p(G)p(G) as the marginal distribution of the joint distribution p(G,Sπ)p(G,S^{\pi}):

where p(Sπ)p(S^{\pi}) is the distribution that we want to learn using a generative model. Due to the sequential nature of SπS^{\pi}, we further decompose p(Sπ)p(S^{\pi}) as the product of conditional distributions over the elements:

where we set Sn+1πS^{\pi}_{n+1} as the end of sequence token EOS, to represent sequences with variable lengths. We simplify p(SiπS1π,...,Si1π)p(S^{\pi}_{i}|S^{\pi}_{1},...,S^{\pi}_{i-1}) as p(SiπS<iπ)p(S^{\pi}_{i}|S^{\pi}_{<i}) in further discussions.

3.2 The GraphRNN framework

So far we have transformed the modeling of p(G)p(G) to modeling p(Sπ)p(S^{\pi}), which we further decomposed into the product of conditional probabilities p(SiπS<iπ)p(S^{\pi}_{i}|S^{\pi}_{<i}). Note that p(SiπS<iπ)p(S^{\pi}_{i}|S^{\pi}_{<i}) is highly complex as it has to capture how node π(vi)\pi(v_{i}) links to previous nodes based on how previous nodes are interconnected among each other. Here we propose to parameterize p(SiπS<iπ)p(S^{\pi}_{i}|S^{\pi}_{<i}) using expressive neural networks to model the complex distribution. To achieve scalable modeling, we let the neural networks share weights across all time steps ii. In particular, we use an RNN that consists of a state-transition function and an output function:

Note that the proposed problem formulation is fully general; we discuss and present some specific variants with implementation details in the next section. Note also that RNNs require fixed-size input vectors, while we previously defined SiπS^{\pi}_{i} as having varying dimensions depending on ii; we describe an efficient and flexible scheme to address this issue in Section 2.3.4.

3.3 GraphRNN variants

Dependent Bernoulli sequence. To fully capture complex edge dependencies, in the full GraphRNN model we further decompose p(SiπS<iπ)p(S^{\pi}_{i}|S^{\pi}_{<i}) into a product of conditionals,

where Si,jπS^{\pi}_{i,j} denotes a binary scalar that is 11 if node π(vi+1)\pi(v_{i+1}) is connected to node π(vj)\pi(v_{j}) (under ordering π\pi). In this variant, each distribution in the product is approximated by an another RNN. Conceptually, we have a hierarchical RNN, where the first (i.e., the graph-level) RNN generates the nodes and maintains the state of the graph, while the second (i.e., the edge-level) RNN generates the edges of a given node (as illustrated in Figure 1). In our implementation, the edge-level RNN is a GRU model, where the hidden state is initialized via the graph-level hidden state hih_{i} and where the output at each step is mapped by a MLP to a scalar indicating the probability of having an edge. Si,jπS^{\pi}_{i,j} is sampled from this distribution specified by the jjth output of the iith edge-level RNN, and is fed into the j+1j+1th input of the same RNN. All edge-level RNNs share the same parameters.

3.4 Tractability via breadth-first search

A crucial insight in our approach is that rather than learning to generate graphs under any possible node permutation, we learn to generate graphs using breadth-first-search (BFS) node orderings, without a loss of generality. Formally, we modify Equation (1) to

where \textscBFS()\textsc{BFS}(\cdot) denotes the deterministic BFS function. In particular, this BFS function takes a random permutation π\pi as input, picks π(v1)\pi(v_{1}) as the starting node and appends the neighbors of a node into the BFS queue in the order defined by π\pi. Note that the BFS function is many-to-one, i.e., multiple permutations can map to the same ordering after applying the BFS function.

Using BFS to specify the node ordering during generation has two essential benefits. The first is that we only need to train on all possible BFS orderings, rather than all possible node permutations, i.e., multiple node permutations map to the same BFS ordering, providing a reduction in the overall number of sequences we need to consider.In the worst case (e.g., star graphs), the number of BFS orderings is n!n!, but we observe substantial reductions on many real-world graphs. The second is that the BFS ordering makes learning easier by reducing the number of edge predictions we need to make in the edge-level RNN; in particular, when we are adding a new node under a BFS ordering, the only possible edges for this new node are those connecting to nodes that are in the “frontier” of the BFS (i.e., nodes that are still in the BFS queue)—a notion formalized by Proposition 1 (proof in the Appendix):

Suppose v1,,vnv_{1},\ldots,v_{n} is a BFS ordering of nn nodes in graph GG, and (vi,vj1)E(v_{i},v_{j-1})\in E but (vi,vj)∉E(v_{i},v_{j})\not\in E for some i<jni<j\leq n, then (vi,vj)∉E(v_{i^{\prime}},v_{j^{\prime}})\not\in E, 1ii\forall 1\leq i^{\prime}\leq i and jj<nj\leq j^{\prime}<n.

Importantly, this insight allows us to redefine the variable size SiπS^{\pi}_{i} vector as a fixed MM-dimensional vector, representing the connectivity between node π(vi)\pi(v_{i}) and nodes in the current BFS queue with maximum size MM:

As a consequence of Proposition 1, we can bound MM as follows:

The overall time complexity of GraphRNN is thus O(Mn)O(Mn). In practice, we estimate an empirical upper bound for MM (see the Appendix for details).

GraphRNN Model Capacity

In this section we analyze the representational capacity of GraphRNN, illustrating how it is able to capture complex edge dependencies. In particular, we discuss two very different cases on how GraphRNN can learn to generate graphs with a global community structure as well as graphs with a very regular geometric structure. For simplicity, we assume that hih_{i} (the hidden state of the graph-level RNN) can exactly encode S<iπS^{\pi}_{<i}, and that the edge-level RNN can encode Si,<jπS^{\pi}_{i,<j}. That is, we assume that our RNNs can maintain memory of the decisions they make and elucidate the models capacity in this ideal case. We similarly rely on the universal approximation theorem of neural networks (Hornik, 1991).

Graphs with community structure. GraphRNN can model structures that are specified by a given probabilistic model. This is because the posterior of a new edge probability can be expressed as a function of the outcomes of previous nodes. For instance, suppose that the training set contains graphs generated from the following distribution pcom(G)p_{com}(G): half of the nodes are in community AA, and half of the nodes are in community BB (in expectation), and nodes are connected with probability psp_{s} within each community and probability pdp_{d} between communities. Given such a model, we have the following key (inductive) observation:

Assume there exists a parameter setting for GraphRNN such that it can generate S<iπS^{\pi}_{<i} and Si,<jπS^{\pi}_{i,<j} according to the distribution over SπS^{\pi} implied by pcom(G)p_{com}(G), then there also exists a parameter setting for GraphRNN such that it can output p(Si,jπSi,<jπ,S<iπ)p(S^{\pi}_{i,j}|S^{\pi}_{i,<j},S^{\pi}_{<i}) according to pcom(G)p_{com}(G).

This observation follows from three facts: First, we know that p(Si,jπSi,<jπ,S<iπ)p(S^{\pi}_{i,j}|S^{\pi}_{i,<j},S^{\pi}_{<i}) can be expressed as a function of psp_{s}, pdp_{d}, and p(π(vj)A),p(π(vj)B)1jip(\pi(v_{j})\in A),p(\pi(v_{j})\in B)\>\forall 1\leq j\leq i (which holds by pcomp_{com}’s definition). Second, by our earlier assumptions on the RNN memory, S<iπS^{\pi}_{<i} can be encoded into the initial state of the edge-level RNN, and the edge-level RNN can also encode the outcomes of Si,<jπS^{\pi}_{i,<j}. Third, we know that p(π(vi)A)p(\pi(v_{i})\in A) is computable from S<iπS^{\pi}_{<i} and Si,1πS^{\pi}_{i,1} (by Bayes’ rule and pcomp_{com}’s definition, with an analogous result for p(π(vi)B)p(\pi(v_{i})\in B)). Finally, GraphRNN can handle the base case of the induction in Observation 1, i.e., Si,1S_{i,1}, simply by sampling according to 0.5ps+0.5pd0.5p_{s}+0.5p_{d} at the first step of the edge-level RNN (i.e., 0.5 probability ii is in same community as node π(v1)\pi(v_{1})).

Graphs with regular structure. GraphRNN can also naturally learn to generate regular structures, due to its ability to learn functions that only activate for Si,jπS^{\pi}_{i,j} where vjv_{j} has specific degree. For example, suppose that the training set consists of ladder graphs (Noy & Ribó, 2004). To generate a ladder graph, the edge-level RNN must handle three key cases: if k=1jSi,jπ=0\sum_{k=1}^{j}S^{\pi}_{i,j}=0, then the new node should only connect to the degree 11 node or else any degree 22 node; if k=1jSi,jπ=1\sum_{k=1}^{j}S^{\pi}_{i,j}=1, then the new node should only connect to the degree 22 node that is exactly two hops away; and finally, if k=1jSi,jπ=2\sum_{k=1}^{j}S^{\pi}_{i,j}=2 then the new node should make no further connections. And note that all of the statistics needed above are computable from S<iπS^{\pi}_{<i} and Si,<jπS^{\pi}_{i,<j}. The appendix contains visual illustrations and further discussions on this example.

Experiments

We compare GraphRNN to state-of-the-art baselines, demonstrating its robustness and ability to generate high-quality graphs in diverse settings.

We perform experiments on both synthetic and real datasets, with drastically varying sizes and characteristics. The sizes of graphs vary from V=10|V|=10 to V=2025|V|=2025.

Community. 500 two-community graphs with 60V16060\leq|V|\leq 160. Each community is generated by the Erdős-Rényi model (E-R) (Erdős & Rényi, 1959) with n=V/2n=|V|/2 nodes and p=0.3p=0.3. We then add 0.05V0.05|V| inter-community edges with uniform probability.

Grid. 100 standard 2D grid graphs with 100V400100\leq|V|\leq 400. We also run our models on 100 standard 2D grid graphs with 1296V20251296\leq|V|\leq 2025, and achieve comparable results.

B-A. 500 graphs with 100V200100\leq|V|\leq 200 that are generated using the Barabási-Albert model. During generation, each node is connected to 4 existing nodes.

Protein. 918 protein graphs (Dobson & Doig, 2003) with 100V500100\leq|V|\leq 500. Each protein is represented by a graph, where nodes are amino acids and two nodes are connected if they are less than 66 Angstroms apart.

Ego. 757 3-hop ego networks extracted from the Citeseer network (Sen et al., 2008) with 50V39950\leq|V|\leq 399. Nodes represent documents and edges represent citation relationships.

2 Experimental Setup

We compare the performance of our model against various traditional generative models for graphs, as well as some recent deep graph generative models.

Traditional baselines. Following Li et al. 2018 we compare against the Erdős-Rényi model (E-R) (Erdős & Rényi, 1959) and the Barabási-Albert (B-A) model (Albert & Barabási, 2002). In addition, we compare against popular generative models that include learnable parameters: Kronecker graph models (Leskovec et al., 2010) and mixed-membership stochastic block models (MMSB) (Airoldi et al., 2008).

Deep learning baselines. We compare against the recent methods of Simonovsky & Komodakis 2018 (GraphVAE) and Li et al. 2018 (DeepGMG). We provide reference implementations for these methods (which do not currently have associated public code), and we adapt GraphVAE to our problem setting by using one-hot indicator vectors as node features for the graph convolutional network encoder.We also attempted using degree and clustering coefficients as features for nodes, but did not achieve better performance.

Experiment settings. We use 80%80\% of the graphs in each dataset for training and test on the rest. We set the hyperparameters for baseline methods based on recommendations made in their respective papers. The hyperparameter settings for GraphRNN were fixed after development tests on data that was not used in follow-up evaluations (further details in the Appendix). Note that all the traditional methods are only designed to learn from a single graph, therefore we train a separate model for each training graph in order to compare with these methods. In addition, both deep learning baselines suffer from aforementioned scalability issues, so we only compare to these baselines on a small version of the community dataset with 12V2012\leq|V|\leq 20 (Community-small) and 200 ego graphs with 4V184\leq|V|\leq 18 (Ego-small).

3 Evaluating the Generated Graphs

Evaluating the sample quality of generative models is a challenging task in general (Theis et al., 2016), and in our case, this evaluation requires a comparison between two sets of graphs (the generated graphs and the test sets). Whereas previous works relied on qualitative visual inspection (Simonovsky & Komodakis, 2018) or simple comparisons of average statistics between the two sets (Leskovec et al., 2010), we propose novel evaluation metrics that compare all moments of their empirical distributions.

Our proposed metrics are based on Maximum Mean Discrepancy (MMD) measures. Suppose that a unit ball in a reproducing kernel Hilbert space (RKHS) H\mathcal{H} is used as its function class F\mathcal{F}, and kk is the associated kernel, the squared MMD between two sets of samples from distributions pp and qq can be derived as (Gretton et al., 2012)

where Π(p,q)\Pi(p,q) is the set of all distributions whose marginals are pp and qq respectively, and γ\gamma is a valid transport plan. To capture high-order moments, we use the following kernel, whose Taylor expansion is a linear combination of all moments (proof in the Appendix):

The kernel function defined by kW(p,q)=expW(p,q)2σ2k_{W}(p,q)=\exp{\frac{W(p,q)}{2\sigma^{2}}} induces a unique RKHS.

In experiments, we show this derived MMD score for degree and clustering coefficient distributions, as well as average orbit counts statistics, i.e., the number of occurrences of all orbits with 4 nodes (to capture higher-level motifs) (Hočevar & Demšar, 2014). We use the RBF kernel to compute distances between count vectors.

4 Generating High Quality Graphs

Our experiments demonstrate that GraphRNN can generate graphs that match the characteristics of the ground truth graphs in a variety of metrics.

Graph visualization. Figure 2 visualizes the graphs generated by GraphRNN and various baselines, showing that GraphRNN can capture the structure of datasets with vastly differing characteristics—being able to effectively learn regular structures like grids as well as more natural structures like ego networks. Specifically, we found that grids generated by GraphRNN do not appear in the training set, i.e., it learns to generalize to unseen grid widths/heights.

Evaluation with graph statistics. We use three graph statistics—based on degrees, clustering coefficients and orbit counts—to further quantitatively evaluate the generated graphs. Figure 3 shows the average graph statistics in the test vs. generated graphs, which demonstrates that even from hundreds of graphs with diverse sizes, GraphRNN can still learn to capture the underlying graph statistics very well, with the generated average statistics closely matching the overall test set distribution.

Tables 1 and 2 summarize MMD evaluations on the full datasets and small versions, respectively. Note that we train all the models with a fixed number of steps, and report the test set performance at the step with the lowest training error.Using the training set or a validation set to evaluate MMD gave analogous results, so we used the train set for early stopping. GraphRNN variants achieve the best performance on all datasets, with 80%80\% decrease of MMD on average compared with traditional baselines, and 90%90\% decrease of MMD compared with deep learning baselines. Interestingly, on the protein dataset, our simpler GraphRNN-S model performs very well, which is likely due to the fact that the protein dataset is a nearest neighbor graph over Euclidean space and thus does not involve highly complex edge dependencies. Note that even though some baseline models perform well on specific datasets (e.g., MMSB on the community dataset), they fail to generalize across other types of input graphs.

Generalization ability. Table 2 also shows negative log-likelihoods (NLLs) on the training and test sets. We report the average p(Sπ)p(S^{\pi}) in our model, and report the likelihood in baseline methods as defined in their papers. A model with good generalization ability should have small NLL gap between training and test graphs. We found that our model can generalize well, with 22%22\% smaller average NLL gap.The average likelihood is ill-defined for the traditional models.

5 Robustness

Finally, we also investigate the robustness of our model by interpolating between Barabási-Albert (B-A) and Erdős-Rényi (E-R) graphs. We randomly perturb [0%,20%,...,100%0\%,20\%,...,100\%] edges of B-A graphs with 100100 nodes. With 0%0\% edges perturbed, the graphs are E-R graphs; with 100%100\% edges perturbed, the graphs are B-A graphs. Figure 4 shows the MMD scores for degree and clustering coefficient distributions for the 66 sets of graphs. Both B-A and E-R perform well when graphs are generated from their respective distributions, but their performance degrades significantly once noise is introduced. In contrast, GraphRNN maintains strong performance as we interpolate between these structures, indicating high robustness and versatility.

Further Related Work

In addition to the deep graph generative approaches and traditional graph generation approaches surveyed previously, our framework also builds off a variety of other methods.

Molecule and parse-tree generation. There has been related domain-specific work on generating candidate molecules and parse trees in natural language processing. Most previous work on discovering molecule structures make use of a expert-crafted sequence representations of molecular graph structures (SMILES) (Olivecrona et al., 2017; Segler et al., 2017; Gómez-Bombarelli et al., 2016). Most recently, SD-VAE (Dai et al., 2018) introduced a grammar-based approach to generate structured data, including molecules and parse trees. In contrast to these works, we consider the fully general graph generation setting without assuming features or special structures of graphs.

Deep autoregressive models. Deep autoregressive models decompose joint probability distributions as a product of conditionals, a general idea that has achieved striking successes in the image (Oord et al., 2016b) and audio (Oord et al., 2016a) domains. Our approach extends these successes to the domain of generating graphs. Note that the DeepGMG algorithm (Li et al., 2018) and the related prior work of Johnson 2017 can also be viewed as deep autoregressive models of graphs. However, unlike these methods, we focus on providing a scalable (i.e., O(n2)O(n^{2})) algorithm that can generate general graphs.

Conclusion and Future Work

We proposed GraphRNN, an autoregressive generative model for graph-structured data, along with a comprehensive evaluation suite for the graph generation problem, which we used to show that GraphRNN achieves significantly better performance compared to previous state-of-the-art models, while being scalable and robust to noise. However, significant challenges remain in this space, such as scaling to even larger graphs and developing models that are capable of doing efficient conditional graph generation.

Acknowledgements

The authors thank Ethan Steinberg, Bowen Liu, Marinka Zitnik and Srijan Kumar for their helpful discussions and comments on the paper. This research has been supported in part by DARPA SIMPLEX, ARO MURI, Stanford Data Science Initiative, Huawei, JD, and Chan Zuckerberg Biohub. W.L.H. was also supported by the SAP Stanford Graduate Fellowship and an NSERC PGS-D grant.

References

Appendix A Appendix

In this section we detail parameter setting, data preparation and training strategies for GraphRNN.

We use two sets of model parameters for GraphRNN. One larger model is used to train and test on the larger datasets that are used to compare with traditional methods. One smaller model is used to train and test on datasets with nodes up to 2020. This model is only used to compare with the two most recent preliminary deep generative models for graphs proposed in (Li et al., 2018; Simonovsky & Komodakis, 2018).

For GraphRNN, the graph-level RNN uses 44 layers of GRU cells, with 128128 dimensional hidden state for the larger model, and 6464 dimensional hidden state for the smaller model in each layer. The edge-level RNN uses 44 layers of GRU cells, with 1616 dimensional hidden state for both the larger model and the smaller model. To output the adjacency vector prediction, the edge-level RNN first maps the highest layer of the 1616 dimensional hidden state to a 88 dimensional vector through a MLP with ReLU activation, then another MLP maps the vector to a scalar with sigmoid activation. The edge-level RNN is initialized by the output of the graph-level RNN at the start of generating SiπS^{\pi}_{i}, 1in\forall 1\leq i\leq n. Specifically, the highest layer hidden state of the graph-level RNN is used to initialize the lowest layer of edge-level RNN, with a liner layer to match the dimensionality. During training time, teacher forcing is used for both graph-level and edge-level RNNs, i.e., we use the groud truth rather than the model’s own prediction during training. At inference time, the model uses its own preditions at each time steps to generate a graph.

For the simple version GraphRNN-S, a two-layer MLP with ReLU and sigmoid activations respectively is used to generate SiπS^{\pi}_{i}, with 6464 dimensional hidden state for the larger model, and 3232 dimensional hidden state for the smaller model. In practice, we find that the performance of the model is relatively stable with respect to these hyperparameters.

We generate the graph sequences used for training the model following the procedure in Section 2.3.4. Specifically, we first randomly sample a graph from the training set, then randomly permute the node ordering of the graph. We then do the deterministic BFS discussed in Section 2.3.4 over the graph with random node ordering, resulting a graph with BFS node ordering. An exception is in the robustness section, where we use the node ordering that generates B-A graphs to get graph sequences, in order to see if GraphRNN can capture the underlying preferential attachment properties of B-A graphs.

With the proposed BFS node ordering, we can reduce the maximum dimension MM of SiπS^{\pi}_{i}, illustrated in Figure 5. To set the maximum dimension MM of SiπS^{\pi}_{i}, we use the following empirical procedure. We randomly ran 100000100000 times the above data pre-processing procedure to get graph with BFS node orderings. We remove the all consecutive zeros in all resulting SiπS^{\pi}_{i}, to find the empirical distribution of the dimensionality of SiπS^{\pi}_{i}. We set MM to be roughly the 99.999.9 percentile, to account for the majority dimensionality of SiπS^{\pi}_{i}. In principle, we find that graphs with regular structures tend to have smaller MM, while random graphs or community graphs tend to have larger MM. Specifically, for community dataset, we set M=100M=100; for grid dataset, we set M=40M=40; for B-A dataset, we set M=130M=130; for protein dataset, we set M=230M=230; for ego dataset, we set M=250M=250; for all small graph datasets, we set M=20M=20.

The Adam Optimizer is used for minibatch training. Each minibatch contains 3232 graph sequences. We train the model for 9600096000 batchs in all experiments. We set the learning rate to be 0.0010.001, which is decayed by 0.30.3 at step 1280012800 and 3200032000 in all experiments.

A.2 Running Time of GraphRNN

Training is performed on only 11 Titan X GPU. For the protein dataset that consists of about 10001000 graphs, each containing about 500500 nodes, training converges at around 6400064000 iterations. The runtime is around 1212 to 2424 hours. This also includes pre-processing, batching and BFS, which are currently implemented using CPU without multi-threading. The less expressive GraphRNN-S variant is about twice faster. At inference time, for the above dataset, generating a graph using the trained model only takes about 11 second.

A.3 More Details on GraphRNN’s Expressiveness

We illustrate the intuition underlying the good performance of GraphRNNon graphs with regular structures, such as grid and ladder networks. Figure 6 (a) shows the generation process of a ladder graph at an intermediate step. At this time step, the ground truth data (under BFS node ordering) specifies that the new node added to the graph should make an edge to the node with degree 11. Note that node degree is a function of S<iπS^{\pi}_{<i}, thus could be approximated by a neural network.

Once the first edge has been generated, the new node should make an edge with another node of degree 22. However, there are multiple ways to do so, but only one of them gives a valid grid structure, i.e. one that forms a 44-cycle with the new edge. GraphRNN crucially relies on the edge-level RNN and the knowledge of the previously added edge, in order to distinguish between the correct and incorrect connections in Figure 6 (c) and (d).

A.4 Code Overview

In the code repository, main.py consists of the main training pipeline, which loads datasets and performs training and inference. It also consists of the Args class, which stores the hyper-parameter settings of the model. model.py consists of the RNN, MLP and loss function modules that are use to build GraphRNN. data.py contains the minibatch sampler, which samples a random BFS ordering of a batch of randomly selected graphs. evaluate.py contains the code for evaluating the generated graphs using the MMD metric introduced in Sec. 4.3.

Baselines including the Erdős-Rényi model, Barabási-Albert model, MMSB, and rge very recent deep generative models (GraphVAE, DeepGMG) are also implemented in the baselines folders. We adopt the C++ Kronecker graph model implementation in the SNAP package The SNAP package is available at http://snap.stanford.edu/snap/index.html..

A.5 Proofs

By definition of BFS, if i<ki<k, then the children of viv_{i} in the BFS ordering come before the children of vkv_{k} that do not connect to viv_{i^{\prime}}, 1ii\forall 1\leq i^{\prime}\leq i.

By definition of BFS, all neighbors of a node viv_{i} include the parent of viv_{i} in the BFS tree, all children of viv_{i} which have consecutive indices, and some children of viv_{i^{\prime}} which connect to both viv_{i^{\prime}} and viv_{i}, for some 1ii1\leq i^{\prime}\leq i. Hence if (vi,vj1)E(v_{i},v_{j-1})\in E but (vi,vj)∉E(v_{i},v_{j})\not\in E, vj1v_{j-1} is the last children of viv_{i} in the BFS ordering. Hence (vi,vj)∉E(v_{i},v_{j^{\prime}})\not\in E, jjn\forall j\leq j^{\prime}\leq n.

For all i[i]i^{\prime}\in[i], supposed that (vi,vj1)E(v_{i^{\prime}},v_{j^{\prime}-1})\in E but (vi,vj)∉E(v_{i^{\prime}},v_{j^{\prime}})\not\in E. By Observation 2, j<jj^{\prime}<j. By conclusion in the previous paragraph, (vi,vj)∉E(v_{i^{\prime}},v_{j^{\prime\prime}})\not\in E, jjn\forall j^{\prime}\leq j^{\prime\prime}\leq n. Specifically, (vi,vj)∉E(v_{i^{\prime}},v_{j^{\prime\prime}})\not\in E, jjn\forall j\leq j^{\prime\prime}\leq n. This is true for all i[i]i^{\prime}\in[i]. Hence we prove that (vi,vj)∉E(v_{i^{\prime}},v_{j^{\prime}})\not\in E, 1ii\forall 1\leq i^{\prime}\leq i and jj<nj\leq j^{\prime}<n.

A.5.2 Proof of Proposition 2

As proven in kolouri2016sliced, this Wasserstein distance based kernel is a positive definite (p.d.) kernel. By properties that linear combinations, product and limit (if exists) of p.d. kernels are p.d. kernels, kW(p,q)k_{W}(p,q) is also a p.d. kernel.This can be seen by expressing the kernel function using Taylor expansion. By the Moore-Aronszajn theorem, a symmetric p.d. kernel induces a unique RKHS. Therefore Equation (9) holds if we set kk to be kWk_{W}.

A.6 Extension to Graphs with Node and Edge Features

A.7 Extension to Graphs with Four Communities

To further show the ability of GraphRNN to learn from community graphs, we further conduct experiments on a four-community synthetic graph dataset. Specifically, the data set consists of 500 four community graphs with 48V6848\leq|V|\leq 68. Each community is generated by the Erdős-Rényi model (E-R) (Erdős & Rényi, 1959) with n[V/42,V/4+2]n\in[|V|/4-2,|V|/4+2] nodes and p=0.7p=0.7. We then add 0.01V20.01|V|^{2} inter-community edges with uniform probability. FIgure 7 shows the comparison of visualization of generated graph using GraphRNN and other baselines. We observe that in contrast to baselines, GraphRNN consistently generate 4-community graphs and each community has similar structure to that in the training set.