GraphMix: Improved Training of GNNs for Semi-Supervised Learning

Vikas Verma, Meng Qu, Kenji Kawaguchi, Alex Lamb, Yoshua Bengio, Juho Kannala, Jian Tang

Introduction

Due to the presence of graph-structured data across a wide variety of domains, such as biological networks, citation networks and social networks, there have been several attempts to design neural networks, known as graph neural networks (GNN), that can process arbitrarily structured graphs. Early work includes (Gori et al., 2005; Scarselli et al., 2009) which propose a neural network that can directly process most types of graphs e.g., acyclic, cyclic, directed, and undirected graphs. More recent approaches include (Bruna et al., 2013; Henaff et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2016; Gilmer et al., 2017; Hamilton et al., 2017; Veličković et al., 2018, 2019; Qu et al., 2019; Gao & Ji, 2019; Ma et al., 2019), among others. Many of these approaches are designed for addressing the problem of semi-supervised learning over graph-structured data (Zhou et al., 2018). Much of these research efforts have been dedicated to developing novel architectures.

Here we instead propose an architecture-agnostic method for regularized training of GNNs for semi-supervised node classification. Recently, regularization based on data-augmentation has been shown to be very effective in other types of neural networks but how to apply these techniques in GNNs is still under-explored. Our proposed method GraphMix 111code available at https://github.com/vikasverma1077/GraphMix is a unified framework that draws inspiration from interpolation based data augmentation (Zhang et al., 2018; Verma et al., 2019a) and self-training based data-augmentation (Laine & Aila, 2016; Tarvainen & Valpola, 2017; Verma et al., 2019b; Berthelot et al., 2019). We show that with our proposed method, we can achieve state-of-the-art performance even when using simpler GNN architectures such as Graph Convolutional Networks (Kipf & Welling, 2017), with no additional memory cost and with minimal additional computation cost. Further, we conduct a theoritical analysis to demonstrate the effectiveness of the proposed method over the underlying GNNs.

Problem Definition and Preliminaries

We are interested in the problem of semi-supervised node and edge classification using graph-structured data. We can formally define such graph-structured data as an undirected graph ${\mathcal{G}}=({\mathcal{V}},\mathcal{A},\mathcal{X})$ , where $\mathcal{V}=\mathcal{V}_{l}\cup\mathcal{V}_{u}$ is the union of labeled ( $\mathcal{V}_{l}$ ) and unlabeled ( $\mathcal{V}_{u}$ ) nodes in the graph with cardinalities $n_{l}$ and $n_{u}$ , and $\mathcal{A}$ is the adjacency matrix representing the edges between the nodes of $\mathcal{V}$ , $\mathcal{X}\in\mathbb{R}^{(n_{l}+n_{u})\times d}$ is the input node features. Each node $v$ belongs to one out of $C$ classes and can be labeled with a $C$ -dimensional one-hot vector ${y}_{v}\in\mathbb{R}^{C}$ . Given the labels $Y_{l}\in\mathbb{R}^{n_{l}\times C}$ of the labeled nodes $\mathcal{V}_{l}$ , the task is to predict the labels $Y_{u}\in\mathbb{R}^{n_{u}\times C}$ of the unlabeled nodes $\mathcal{V}_{u}$ .

Graph Neural Networks (GNN) learn the $l_{th}$ layer representations of a sample $i$ by leveraging the representations of the samples $NB(i)$ in the neighbourhood of $i$ . This is done by using an aggregation function that takes as an input the representations of all the samples along with the graph structure and outputs the aggregated representation. The aggregation function can be defined using the Graph Convolution layer (Kipf & Welling, 2017), Graph Attention Layer (Veličković et al., 2018), or any general message passing layer (Gilmer et al., 2017). Formally, let $\mathbf{h}^{(l)}\in\mathbb{R}^{n\times k}$ be a matrix containing the $k$ -dimensional representation of $n$ nodes in the $l_{th}$ layer, then:

𝑙1𝜎𝐴𝐺𝐺𝑅𝐸𝐺𝐴𝑇𝐸superscript𝐡𝑙𝐖𝒜\displaystyle\mathbf{h}^{(l+1)}=\sigma(AGGREGATE(\mathbf{h}^{(l)}\mathbf{W},\mathcal{A})) (1) where $\mathbf{W}\in\mathbb{R}^{k\times k^{\prime}}$ is a linear transformation matrix, $k^{\prime}$ is the dimension of $(l+1)_{th}$ layer, $AGGREGATE$ is the aggregation function that utilizes the graph adjacency matrix $\mathcal{A}$ to aggregate the hidden representations of neighbouring nodes and $\sigma$ is a non-linear activation function, e.g. ReLU.

Recently, interpolation-based techniques have been proposed for regularizing neural networks. We briefly describe some of these techniques here. Mixup (Zhang et al., 2018) trains a neural network on the convex combination of input and targets, whereas Manifold Mixup (Verma et al., 2019a) trains a neural network on the convex combination of the hidden states of a randomly chosen hidden layer and the targets. While Mixup regularizes a neural network by enforcing that the model output should change linearly in between the examples in the input space, Manifold Mixup regularizes the neural network by learning better (more discriminative) hidden states.

Formally, suppose $T_{\theta}(\mathbf{x})=(f\circ g)_{\theta}(\mathbf{x})$ is a neural network parametrized with $\theta$ such that $g:\mathbf{x}\xrightarrow{}\mathbf{h}$ is a function that maps input sample to hidden states, $f:\mathbf{h}\xrightarrow{}\mathbf{\hat{y}}$ is a function that maps hidden states to predicted output, $\lambda$ is a random variable drawn from $\operatorname{Beta}(\alpha,\alpha)$ distribution, $\text{Mix}_{\lambda}(\mathbf{a},\mathbf{b})=\lambda*\mathbf{a}+(1-\lambda)*\mathbf{b}$ is an interpolation function, $\mathcal{D}$ is the data distribution, $(\mathbf{x},\mathbf{y})$ and $(\mathbf{x^{\prime}},\mathbf{y^{\prime}})$ is a pair of labeled examples sampled from distribution $\mathcal{D}$ and $\ell$ be a loss function such as cross-entropy loss, then the Manifold Mixup Loss is defined as:

We use above Manifold Mixup loss for training an auxiliary Fully-connected-network as described in Section 3.

GraphMix

Data Augmentation is one of the simplest and most efficient technique for regularizing a neural network. In the domains of computer vision, speech and natural language, there exist efficient data augmentation techniques, for example, random cropping, translation or Cutout (Devries & Taylor, 2017) for computer vision, (Ko et al., 2015) and (Park et al., 2019) for speech and (Xie et al., 2017) for natural language. However, data augmentation for the graph-structured data remains under-explored. There exists some recent work along these lines but the prohibitive computation cost (see Section • ‣ 5) introduced by these methods make them impractical for real-world large graph datasets. Based on these limitations, our main objective is to propose an efficient data augmentation technique for graph datasets.

Recent work based on interpolation-based data augmentation (Zhang et al., 2018; Verma et al., 2019a) has seen sizable improvements in regularization performance across a number of tasks. However, these techniques are not directly applicable to graphs for an important reason: Although we can create additional nodes by interpolating the features and corresponding labels, it remains unclear how these new nodes must be connected to the original nodes via synthetic edges such that the structure of the whole graph is preserved. To alleviate this issue, we propose to train an auxiliary Fully-Connected Network (FCN) using Manifold Mixup as discussed in Section 3.2. Note that the FCN only uses the node features (not the graph structure), thus the Manifold mixup loss in Eq. 2 can be directly used for training the FCN.

Interpolation based data-augmentation techniques have an added advantage for training GNNs: A vanilla GNN learns the representation of each node by iteratively aggregating information from the neighbors of that node (Equation 1). However, this induces the problem of oversmoothing (Li et al., 2018; Xu et al., 2018) while training GNNs with many layers. Due to this limitation, GNNs are trained only with a few layers, and thus they can only leverage the local neighbourhood of each node for learning its representations, without leveraging the representations of the nodes which are multiple hops away in the graph. This limitation can be addressed using the interpolation-based method such as Manifold Mixup: in Manifold Mixup, the representations of a randomly chosen pair of nodes is used to facilitate better representation learning; it is possible that the randomly chosen pair of nodes will not be in the local neighbourhood of each other. Based on these challenges and motivations we present our proposed approach GraphMix for training GNNs in the following Section.

2 Method

We first describe GraphMix at a high-level and then give a more formal description. GraphMix augments the vanilla GNN with a Fully-Connected Network (FCN). The FCN loss is computed using Manifold Mixup as discussed below and the GNN loss is computed normally. Manifold Mixup training of FCN facilitates learning more discriminative node representations (Verma et al., 2019a). An important question is how these more discriminative node representations can be transferred to the GNN? One potential approach could involve maximizing the mutual information between the hidden states of the FCN and the GNN using formulations similar to those proposed by (Hjelm et al., 2019; Sun et al., 2020). However, this requires optimizing additional network parameters. Instead, we propose parameter sharing between FCN and GNN to facilitate the transfer of discriminative node representations from the FCN to the GNN. It is a viable option because as mentioned in Eq 1, a GNN layer typically performs an additional operation ( $AGGREGATE$ ) on the linear transformations of node representations (which are essentially pre-activation representations of the FCN layer). Using the more discriminative representations of the nodes from FCN, as well as the graph structure, the GNN loss is computed in the usual way to further refine the node representations. In this way we can exploit the improved representations from Manifold Mixup for training GNNs.

In Section 3.3, without making any assumption about the aggregation function and the depth of the graph neural network, we show that GraphMix improves the generalization of the underlying graph neural network. This makes GraphMix applicable to various kind of architectures having different aggregation functions, such as weighed averaging in GCN (Kipf & Welling, 2016), attention based aggregation in GAT (Veličković et al., 2018) and graph-pooling/unpooling operations in Graph U-Nets (Gao & Ji, 2019). In the aforementioned sense, GraphMix procedure is highly flexible: it can be applied to any underlying GNN as long as the underlying GNN applies parametric transformations to the node features.

Additionally, we propose to use the predicted targets from the GNN to augment the training set of the FCN. In this way, both the FCN and the GNN facilitate each other’s learning process. Both the FCN loss and the GNN loss are optimized in an alternating fashion during training. At inference time, predictions are made using only the GNN.

A diagram illustrating GraphMix is presented in Figure 1 and the full algorithm is presented in Appendix A.3. Further, we draw similarities and difference of GraphMix w.r.t. Co-training framework in the Appendix A.2.

So far we have presented the general design of GraphMix, now we present GraphMix more formally. Given a graph $\mathcal{G}$ , let $(\mathbf{X}_{l},\mathbf{Y}_{l})$ be the input features and the labels of the labeled nodes $\mathcal{V}_{l}$ and let $(\mathbf{X}_{u})$ be the input features of the unlabeled nodes $\mathcal{V}_{u}$ . Let $F_{\theta}$ and $G_{\theta}$ be a FCN and a GNN respectively, which share the parameters $\theta$ . The FCN loss from the labeled data is computed using Eq. 2 as follows:

For unlabeled nodes $\mathcal{V}_{u}$ , we compute the prediction $\mathbf{\hat{Y}_{u}}$ using the GNN:

We note that recent state-of-the-art semi-supervised learning methods use a teacher model to accurately predict targets for the unlabeled data. The teacher model can be realized as a temporal ensemble of the student model (the model being trained) (Laine & Aila, 2016) or by using an Exponential Moving Average (EMA) of the parameters of the student model (Tarvainen & Valpola, 2017; Verma et al., 2019b). Different from these approaches, we use the GNN as a teacher model for predicting the targets for the FCN. This is due to the fact that GNNs leverage graph structure, which in practice, allows them to make more accurate predictions than the temporal ensemble or EMA ensemble of FCN (although there is no theoretical guarantee for this).

Moreover, to improve the accuracy of the predicted targets in Eq 4, we applied the average of the model prediction on $K$ random perturbations of an input sample along with sharpening as discussed in Appendix A.1.

Using the predicted targets for unlabeled nodes, we create a new training set $(\mathbf{X}_{u},\mathbf{\hat{Y}_{u}})$ . The loss from the unlabeled data for the FCN is computed as:

Total loss for training the FCN is given as the weighted sum of above two loss terms.

subscriptℒsupervised𝑤𝑡subscriptℒunsupervised\displaystyle\mathcal{L}_{\text{FCN}}=\mathcal{L}_{\text{supervised}}+w(t)*\mathcal{L}_{\text{unsupervised}} (6) where $w(t)$ is a sigmoid ramp-up function (Tarvainen & Valpola, 2017) which increases from zero to its max value $\gamma$ during the course of training.

Now let us assume that the loss for an underlying GNN is $\mathcal{L}_{\text{GNN}}=\ell(G_{\theta}(\mathbf{X}_{l}),\mathbf{Y}_{l})$ ; the overall GraphMix loss for the joint training of the FCN and GNN can be defined as the weighted sum of the GNN loss and the FCN loss:

subscriptℒGNN𝜆subscriptℒFCN\displaystyle\mathcal{L}_{\text{GraphMix }}=\mathcal{L}_{\text{GNN}}+\lambda*\mathcal{L}_{\text{FCN}} (7) However, throughout our experiments, optimizing FCN loss and GNN loss alternatively at each training epoch achieved better test accuracy (discussed in Appendix A.12). This has an added benefit that it removes the need to tune weighing hyper-parameter $\lambda$ .

For Manifold Mixup training of FCN, we apply mixup only in the hidden layer. Note that in (Verma et al., 2019a), the authors recommended applying mixing in a randomly chosen layer (which also includes the input layer) at each training update. However, we observed under-fitting when applying mixup randomly at the input layer or the hidden layer. Applying mixup only in the input layer also resulted in underfitting and did not improve test accuracy.

Memory and Computational Requirements: One of the major limitations of current GNNs, which prohibits their application to real-world large datasets, is their memory complexity. For example, the fastest implementation of GCN, which stores the entire adjacency matrix $\mathcal{A}$ in the memory, has the memory complexity $O(|\mathcal{V}|^{2})$ . Implementations with lower memory requirements are possible but they incur higher latency cost due to repeatedly loading parts of adjacency matrix in the memory. Due to these reasons, methods which have additional memory requirement in comparison to the baseline GNNs, are less appealing in practice. In GraphMix, since the parameters of the FCN and GNN are shared, there is no additional memory requirement. Furthermore, GraphMix does not add any significant computation cost over the underlying GNN, because the underlying GNN is trained in the standard way and the FCN training requires trivial additional computation cost for computing the predicted-targets (Appendix A.1) and the interpolation function ( $\text{Mix}_{\lambda}(\mathbf{a},\mathbf{b})$ in Section 2).

3 Analysis

In this subsection, we study how GraphMix impacts the generalization bound of a underlying GNN. Our analysis, which is based on Rademacher complexity (Bartlett & Mendelson, 2002), provides a new generalization bound for GraphMix, which shows how adding regularization to training the FCN with pseudolabels improves overall generalization. We rely on the effect of changing one sample in Manifold Mixup and the fact that the weights are shared by a GNN and the corresponding FCN in GraphMix.

Let $G$ be a fixed graph with $n$ total nodes. Define $z_{i}=(x_{i},y_{i})$ to be the pair of the feature and the true label of the $i$ -th node. Without loss of generality, let $S=(z_{1},\dots,z_{m})$ be the training set with the labeled nodes where $m=n_{l}$ and data points $z_{1},\dots,z_{m}$ are sampled according to an unknown distribution $\mathcal{D}$ to form the labeled node set $S$ . In this subsection, we follow the previous paper on graph neural networks by assuming that all samples are i.i.d. (including replacement sample) (Verma & Zhang, 2019).

Let $\Gamma$ be a finite set of the hyperparameters. For every hyperparameter $\gamma\in\Gamma$ , define $\mathcal{F}_{\gamma}$ to be a distribution-dependent hypothesis space corresponding the hyperparameter $\gamma$ . That is, $\mathcal{F}_{\gamma}=\{f_{\gamma}:(\exists S\in\mathcal{S})[f_{\gamma}=\mathcal{A}_{\gamma}(S)]\}$ where $\mathcal{A}_{\gamma}$ is an algorithm that outputs the hypothesis $f_{\gamma}$ given a dataset $S$ , and $\mathcal{S}$ is the set of training datasets such that the probability of $S\in\mathcal{S}$ according to $\mathcal{D}$ is one. For each $f_{\gamma}\in\mathcal{F}_{\gamma}$ , $f_{\gamma}(\cdot\ ;G)$ represents GNN with the graph $G$ and $f_{\gamma}(\cdot\ ;G_{0})$ represents FCN where $G_{0}$ is the null graph version of $G$ (i.e., $G$ without edges). Let $\mathcal{R}_{m}^{\ell}(\mathcal{F}_{\gamma})$ be the Rademacher complexity (Bartlett & Mendelson, 2002) of the set $\{(x,y)\mapsto\ell(f_{\gamma}(x;G),y):f_{\gamma}\in\mathcal{F}_{\gamma}\}$ .

Let $L_{{\text{GNN}}}(S,f_{\gamma})$ be the $\mathcal{L}_{{\text{GNN}}}$ with the GNN $f_{\gamma}(\cdot\ ;G)$ and labeled data points $S$ as $L_{{\text{GNN}}}(S,f_{\gamma})=\frac{1}{m}\sum_{i=1}^{m}\ell(f_{\gamma}(x_{i};G),y_{{i}})$ . Let $L_{{\text{FCN}}}(S,f_{\gamma})$ be $\mathcal{L}_{{\text{FCN}}}$ with the FCN $f_{\gamma}(\cdot\ ;G_{0})$ and labeled data points $S$ as $L_{{\text{FCN}}}(S,f_{\gamma})=\mathcal{L}_{\text{MM}}(S,f_{\gamma}(\cdot\ ;G_{0}),\alpha)+\frac{n_{u}}{m}\mathcal{L}_{\text{MM}}((\mathbf{X}_{u}^{S},\mathbf{\hat{Y}}_{u}^{S}),f_{\gamma}(\cdot\ ;G_{0}),\alpha)$ where $(\mathbf{X}_{u}^{S},\mathbf{\hat{Y}}_{u}^{S})$ is the unlabeled nodes $(\mathbf{X}_{u},\mathbf{\hat{Y}}_{u})$ that corresponds to the labeled node set $S$ . Let $L_{{\text{GraphMix}}}(S,f_{\gamma})=L_{{\text{GNN}}}(S,f_{\gamma})+\lambda L_{{\text{FCN}}}(S,f_{\gamma})$ . Let $c$ be the upper bound on per-sample loss as $c\geq\ell(f_{\gamma}(x_{i};G),y_{{i}})$ and $c\geq\ell(f_{\gamma}(x_{i};G_{0}),y_{{i}})$ . For example, $c=1$ for training and test error (or 0-1 loss).

Theorem 1 provides a generalization bound for GraphMix, which shows that GraphMix can improve the generalization bound of the underlying GNN under the condition of $V>0$ as discussed below.

For any $\delta>0$ , with probability at least $1-\delta$ , the following holds: for all $\gamma\in\Gamma$ and all $f_{\gamma}\in\mathcal{F}_{\gamma}$ , we have that $\mathbb{E}_{x,y\sim\mathcal{D}}[\ell(f_{\gamma}(x;G),y)]-L_{{\text{GraphMix}}}(S,f_{\gamma})\leq\mathcal{R}_{m}^{\ell}(\mathcal{F}_{\gamma})+c\sqrt{\frac{\ln(|\Gamma|/\delta)}{2m}}-\lambda V$ , where $V=\mathbb{E}_{S^{\prime}\sim\mathcal{D}^{m}}[\inf_{f_{\gamma}\in\mathcal{F}_{\gamma}}L_{{\text{FCN}}}(S^{\prime},f_{\gamma})]-4c\sqrt{\frac{\ln(|\Gamma|/\delta)}{2m}}.$

The proof of Theorem 1 is given in the following. The generalization bound in Theorem 1 becomes a generalization bound for GNN without GraphMix when $\lambda=0$ as desired. By comparing the generalization bound with $\lambda=0$ (no GraphMix) and $\lambda>0$ (with GraphMix), we can see that GraphMix improves the generalization bound of underlying GNN when $V=\mathbb{E}_{S^{\prime}\sim\mathcal{D}^{m}}[\inf_{f_{\gamma}\in\mathcal{F}_{\gamma}}L_{{\text{FCN}}}(S^{\prime},f_{\gamma})]-4c\sqrt{\ln(|\Gamma|/\delta)/(2m)}>0.$ Here, the first term $\mathbb{E}_{S^{\prime}\sim\mathcal{D}^{m}}[\inf_{f_{\gamma}\in\mathcal{F}_{\gamma}}L_{{\text{FCN}}}(S^{\prime},f_{\gamma})]$ increases as the hypothesis space $\mathcal{F}_{\gamma}$ gets “smaller” or has less complexity. Thus, GraphMix improves the generalization bound of an underlying GNN when the hyperparameter search over $\gamma\in\Gamma$ results in $\mathcal{F}_{\gamma}$ of moderate complexity such that the first term is greater than $4c\sqrt{\ln(|\Gamma|/\delta)/(2m)}$ . The first term contains the manifold mixup loss $L_{{\text{FCN}}}(S^{\prime},f_{\gamma})$ over random training datasets $S^{\prime}$ ( $\neq S$ ), which is expected to be larger when compared with that of the standard loss without manifold mixup.

For each fixed $\mathcal{F}_{\gamma}$ , the generalization bound in Theorem 1 goes to zero as $m\rightarrow\infty$ since $\mathcal{R}_{m}^{\ell}(\mathcal{F}_{\gamma})\rightarrow 0$ and $V\rightarrow V_{0}\geq 0$ as $m\rightarrow\infty$ . The training loss is also minimized when the trained model $f_{\gamma}\in\mathcal{F}_{\gamma}$ fits well to the particular given training data set $S$ . Therefore, given a particular training dataset $S$ , the expected loss is minimized if we conduct a hyperparameter search over $\gamma\in\Gamma$ such that $f_{\gamma}\in\mathcal{F}_{\gamma}$ minimize the training loss for the given $S$ but $\mathcal{F}_{\gamma}$ has moderate complexity to not being able to minimize the manifold mixup losses for other datasets $S^{\prime}$ ( $\neq S$ ) drawn from $\mathcal{D}$ . Unlike the standard data points, the data points generated during manifold mixup in GraphMix are typically not memorizable or interpolated by FCN.

4 Proof of Theorem 1

In the proof, we analyse the effect of changing one sample in Manifold Mixup and GNN-generated targets by utilizing the fact that the weights are shared by a GNN and the corresponding FCN. The the fact of sharing weights also results in the generalization bound that relates the generalization in FCN via Manifold Mixup to the generalization of the GNN.

Let $\gamma\in\Gamma$ be fixed. Define $\varphi(S)=\sup_{f_{\gamma}\in\mathcal{F}_{\gamma}}\mathbb{E}_{x,y\sim\mathcal{D}}[\ell(f_{\gamma}(x;G),y)]-L_{{\text{GraphMix}}}(S,f_{\gamma})$ . We first provide an upper bound on $\varphi(S)$ by using McDiarmid’s inequality. To apply McDiarmid’s inequality to $\varphi(\mathcal{D})$ , we compute an upper bound on $|\varphi(S)-\varphi(S^{\prime})|$ where $S$ and $S^{\prime}$ be two training datasets differing by exactly one point of an arbitrary index $i_{0}$ ; i.e., $S_{i}=S^{\prime}_{i}$ for all $i\neq i_{0}$ and $S_{i_{0}}\neq S^{\prime}_{i_{0}}$ .

𝜆subscript𝐿FCN𝑆subscript𝑓𝛾subscript𝐿FCNsuperscript𝑆′subscript𝑓𝛾\displaystyle+\lambda(L_{{\text{FCN}}}(S,f_{\gamma})-L_{{\text{FCN}}}(S^{\prime},f_{\gamma})) where the first line follows the property of the supremum, $\sup(a)-\sup(b)\leq\sup(a-b)$ , and the second line follows the definition of $L_{{\text{GraphMix}}}$ .

where we used the fact that given a fixed $G$ and a fixed $f_{\gamma}$ , $\ell(f_{\gamma}(x_{i};G),y_{{i}})=\ell(f_{\gamma}(x_{i}^{\prime};G),y_{{i}}^{\prime})$ for $i\neq i_{0}$ . This holds since $f_{\gamma}(\cdot\ ;G)$ does not depend on $S$ given a $G$ and a $f_{\gamma}$ , even though $f_{\gamma}(x_{i};G)$ contains the aggregation functions over the graph $G$ .

where we use the fact that $\mathcal{L}_{\text{MM}}(S,f_{\gamma}(\cdot\ ;G_{0}),\alpha)$ has $m^{2}$ terms and $2m-1$ terms differ for $S$ and $S^{\prime}$ , each of which is bounded by the constant $c$ . Similarly, $\mathcal{L}_{\text{MM}}((\mathbf{X}_{u}^{S},\mathbf{\hat{Y}}_{u}^{S}),f_{\gamma}(\cdot\ ;G_{0}),\alpha)-\mathcal{L}_{\text{MM}}((\mathbf{X}_{u}^{S^{\prime}},\mathbf{\hat{Y}}_{u}^{S^{\prime}}),f_{\gamma}(\cdot\ ;G_{0}),\alpha)\leq\frac{2c}{n_{u}}$ , since the labels $\mathbf{\hat{Y}}_{u}^{S}$ and $\mathbf{\hat{Y}}_{u}^{S^{\prime}}$ are determined by $f_{\gamma}(x_{i};G)$ , and $f_{\gamma}(x_{i};G)=f_{\gamma}(x_{i}^{\prime};G)$ for $i\neq i_{0}$ , given a fixed $G$ and a fixed $f_{\gamma}$ . Therefore,

14𝜆𝑚\displaystyle\varphi(S^{\prime})-\varphi(S)\leq\frac{c(1+4\lambda)}{m}. Similarly, $\varphi(S)-\varphi(S^{\prime})\leq\frac{c(1+4\lambda)}{m}$ . Thus, by McDiarmid’s inequality, for any $\delta>0$ , with probability at least $1-\delta/|\Gamma|$ ,

subscript𝔼¯𝑆delimited-[]𝜑¯𝑆𝑐14𝜆Γ𝛿2𝑚\varphi(S)\leq\mathbb{E}_{\bar{S}}[\varphi(\bar{S})]+c(1+4\lambda)\sqrt{\frac{\ln(|\Gamma|/\delta)}{2m}}. Moreover,

subscript𝔼¯𝑆delimited-[]𝜑¯𝑆𝜆subscript𝔼¯𝑆delimited-[]subscriptinfimumsubscript𝑓𝛾subscriptℱ𝛾subscript𝐿FCN¯𝑆subscript𝑓𝛾\displaystyle\mathbb{E}_{\bar{S}}[\varphi(\bar{S})]+\lambda\mathbb{E}_{\bar{S}}\left[\inf_{f_{\gamma}\in\mathcal{F}_{\gamma}}L_{{\text{FCN}}}(\bar{S},f_{\gamma})\right] $\displaystyle\leq\mathbb{E}_{\bar{S}}\left[\sup_{f_{\gamma}\in\mathcal{F}_{\gamma}}\mathbb{E}_{\bar{x}^{\prime},\bar{y}^{\prime}\sim\mathcal{D}}[\ell(f_{\gamma}(\bar{x}^{\prime};G),\bar{y}^{\prime})]-L_{{\text{GNN}}}(\bar{S},f_{\gamma})\right]$ $\displaystyle\leq 2\mathcal{R}_{n}(\Theta)$ where the second line and the last line use the subadditivity of supremum, the third line uses the Jensen’s inequality and the convexity of the supremum, the fourth line follows that for each $\xi_{i}\in\{-1,+1\}$ , the distribution of each term $\xi_{i}(\ell(f_{\gamma}(\bar{x}_{i}^{\prime};G),\bar{y}_{i}^{\prime})-\ell(f_{\gamma}(\bar{x}_{i};G),\bar{y}_{i}))$ is the distribution of $(\ell(f_{\gamma}(\bar{x}_{i}^{\prime};G),\bar{y}_{i}^{\prime})-\ell(f_{\gamma}(\bar{x}_{i};G),\bar{y}_{i}))$ since $\bar{S}$ and $\bar{S}^{\prime}$ are drawn iid with the same distribution.

Therefore, for any $\delta>0$ , with probability at least $1-\delta/|\Gamma|$ , all $f_{\gamma}\in\mathcal{F}_{\gamma}$ ,

superscriptsubscriptℛ𝑚ℓsubscriptℱ𝛾𝑐Γ𝛿2𝑚𝜆𝑉\displaystyle\leq\mathcal{R}_{m}^{\ell}(\mathcal{F}_{\gamma})+c\sqrt{\frac{\ln(|\Gamma|/\delta)}{2m}}-\lambda V. by taking union bounds over all $\gamma\in\Gamma$ , we obtain the statement of this theorem.

Experiments

We present results for GraphMix algorithm using standard benchmark datasets and the standard architecture in Section 4.1 and 4.3. We also conduct an ablation study on GraphMix in Section A.5 to understand the contribution of various components to its performance. Refer to Appendix A.4 for dataset details and A.8 for implementation and hyperparameter details.

For baselines, we choose GCN (Kipf & Welling, 2017), and the recent state-of-the-art graph neural networks GAT (Veličković et al., 2018), GMNN (Qu et al., 2019) and Graph U-Net (Gao & Ji, 2019), as well as the recently proposed regularizers for graph neural networks. GraphMix(GCN) , GraphMix(GAT) and GraphMix(Graph U-Net) refer to the methods where underlying GNNs are GCN, GAT and Graph U-Net respectively. Refer to Appendix A.8 for implementation and hyperparameter details. (Shchur et al., 2018) demonstrated that the performance of the current state-of-the-art Graph Neural Networks on the standard train/validation/test split of the popular benchmark datasets (such as Cora, Citeseer, Pubmed, etc) is significantly different from their performance on the random splits. For fair evaluation, they recommend using multiple random partitions of the datasets. Along these lines, we created $10$ random splits of the Cora, Citeseer and Pubmed with the same train/ validation/test number of samples as in the standard split. We also provide the results for the standard train/validation/test split. The results are presented in Table 2. We observe that GraphMix always improves the accuracy of the underlying GNNs such as GCN, GAT and Graph-U-Net across all the dataset, with GraphMix(GCN) achieving the best results. We further present results with fewer labeled samples in Appendix A.7. We observer that the relative increase in test accuracy using GraphMix over the baseline GNN is more pronounced when the labeled samples are fewer. This makes GraphMix particularly appealing for very few labeled data problems.

2 Results on Larger Datasets

We also provide results on three recently proposed datasets which are relatively larger than standard benchmark datasets (Cora/Citeseer/Pubmed). We use Cora-Full dataset proposed in (Bojchevski & Günnemann, 2018) and Coauthor-CS and Coauthor-Physics datasets proposed in (Shchur et al., 2018). The results are presented in Table 3222We do not provide results for GAT based experiments in Table 3 and Table 4 because we ran out of GPU memory required to run these experiments with larger (higher number of nodes) datasets.. We observe that GraphMix(GCN) improves the results over GCN for all the three datasets with a significant margin. We note that we did minimal hyperparameter search for GraphMix(GCN) as mentioned in Appendix A.8.2. The details of the datasets is given in Appendix A.4.

3 Semi-supervised Link Classification

In the semi-supervised link classification problem, the task is to predict the labels of the remaining links, given a graph and labels of a few links. Following (Taskar et al., 2004), we can formulate the link classification problem as a node classification problem, i.e., given an original graph $G$ , we construct a dual Graph $G^{\prime}$ , the node set $V^{\prime}$ of the dual graph corresponds to the link set $E^{\prime}$ of the original graph. The nodes in the dual graph $G^{\prime}$ are connected if their corresponding links in the graph $G$ share a node. The attributes of a node in the dual graph are defined as the index of the nodes of the corresponding link in the original graph. Using this formulation, we present results on link classification on Bit OTC and Bit Alpha benchmark datasets in the Table 4. As the numbers of the positive and negative edges are strongly imbalanced, we report the F1 score. Our results show that GraphMix(GCN) improves the performance over the baseline GCN method and is comparable with the recently proposed state-of-the-art method GMNN (Qu et al., 2019) for both the datasets.

4 Visualization of the Learned Features

In Figure 2, we present an analysis of the features learned by GraphMix for Cora dataset using the t-SNE (van der Maaten & Hinton, 2008) based 2D visualization of the hidden states. We observe that GraphMix learns hidden states which are better separated and condensed than GCN and GCN(self-training). Here, GCN(self-training) refers to training a GCN in a normal way but with additional self-prediction based targets for the unlabeled samples. We further evaluate the Soft-rank (refer to Appendix A.10) of the class-specific hidden states to demonstrate that GraphMix(GCN) makes the class-specific hidden states more concentrated than GCN and GCN(self-training), as shown in Figure 2(d). Refer to Appendix A.11 for 2D representation of other datasets.

Related Work

Semi-supervised Learning over Graph Data: There exists a long line of work for semi-supervised learning over graph data (Zhu & Ghahramani, 2002; Zhu et al., 2003). Contrary to previous methods, the recent GNN based approaches define the convolutional operators using the neighbourhood information of the nodes (Kipf & Welling, 2017; Veličković et al., 2018).These convolution operator based method exhibit state-of-the-results for semi-supervised learning over graph data, hence much of the recent attention is dedicated to proposing architectural changes to these methods (Qu et al., 2019; Gao & Ji, 2019; Ma et al., 2019). Unlike these methods, we propose a regularization technique that can be applied to any of these GNNs which uses a parameterized transformation on the node features.

Data Augmentation: It is well known that the generalization of a learning algorithm can be improved by enlarging the training data size. Since labeling more samples is labour-intensive and costly, Data-augmentation has become de facto technique for enlarging the training data size. Mixing based algorithms are a particular class of data-augmentation methods in which additional training samples are generated by interpolating the samples (either in the input or hidden space) and/or their corresponding targets. Mixup (Zhang et al., 2018), BC-learning (Tokozume et al., 2017), Manifold Mixup (Verma et al., 2019a), AMR (Beckham et al., 2019) are notable examples of this class of algorithms. Unlike, these methods which have been proposed for the fixed topology datasets such as images, in this work, we propose interpolation based data-augmentation techniques for graph-structured data.

Regularizing Graph Neural Networks: Regularizing Graph Neural Networks has drawn some attention recently. GraphSGAN (Ding et al., 2018) first uses an embedding method such as DeepWalk (Perozzi et al., 2014) and then trains generator-classifier networks in the adversarial learning setting to generate fake samples in the low-density region between sub-graphs. BVAT (Deng et al., 2019) and (Feng et al., 2019) generate adversarial perturbations to the features of the graph nodes while taking graph structure into account. While these methods improve generalization in graph-structured data, they introduce significant additional computation cost: GraphScan requires computing node embedding as a preprocessing step, BVAT and (Feng et al., 2019) require additional gradient computation for computing adversarial perturbations. Unlike these methods, GraphMix does not introduce any significant additional computation since it is based on interpolation-based techniques and self-generated targets.

Discussion

We presented GraphMix, a simple and efficient regularizer for semi-supervised node classification using graph neural networks. Through extensive experiments, we demonstrated state-of-the-art performances using GraphMix on various benchmark datasets. We also presented a theoritical analysis to compare generalization bounds of GraphMix vs the underlying GNNs. Further, we conduct a systematic ablation study to understand the effect of different components in the performance of GraphMix. The strong empirical results of GraphMix suggest that in parallel to designing new architectures, exploring better regularization for graph-structured data is a promising avenue for research. A future research direction is to jointly model the node features and edges of the graph such that it can be further used for generating the synthetic interpolated nodes and their corresponding connectivity to the other nodes in the graph. This will alleviate the need to train the auxiliary FCN in GraphMix.

References

Appendix A Appendix

A recently proposed method for accurate target predictions for unlabeled data uses the average of predicted targets across $K$ random augmentations of the input sample (Berthelot et al., 2019). Along these lines, in GraphMix we compute the predicted-targets as the average of predictions made by GNN on $K$ drop-out versions of the input sample.

Many recent semi-supervised learning algorithms (Laine & Aila, 2016; Miyato et al., 2018; Tarvainen & Valpola, 2017; Verma et al., 2019b) are based on the cluster assumption (Chapelle et al., 2010), which posits that the class boundary should pass through the low-density regions of the marginal data distribution. One way to enforce this assumption is to explicitly minimize the entropy of the model’s predictions $p(y|\mathbf{x},\theta)$ on unlabeled data by adding an extra loss term to the original loss term (Grandvalet & Bengio, 2005). The entropy minimization can be also achieved implicitly by modifying the model’s prediction on the unlabeled data such that the prediction has low entropy and using these low-entropy predictions as targets for the further training of the model. Examples include "Pseudolabels" (Lee, 2013) and "Sharpening" (Berthelot et al., 2019). Pseudolabeling constructs hard (one-hot) labels for the unlabeled samples which have “high-confidence predictions”. Since many of the unlabeled samples may have “low-confidence predictions”, they can not be used in the Pseudolabeling technique. On the other hand, Sharpening does not require “high-confidence predictions” , and thus it can be used for all the unlabelled samples. Hence in this work, we use Sharpening for entropy minimization. The Sharpening function over the model prediction $p(y|\mathbf{x},\theta)$ can be formally defined as follows (Berthelot et al., 2019), where $T$ is the temperature hyperparameter and $C$ is the number of classes:

A.2 Connection to Co-training

The GraphMix approach can be seen as a special instance of the Co-training framework (Blum & Mitchell, 1998). Co-training assumes that the description of an example can be partitioned into two distinct views and either of these views would be sufficient for learning given sufficient labeled data. In this framework, two learning algorithms are trained separately on each view and then the prediction of each learning algorithm on the unlabeled data is used to enlarge the training set of the other. Our method has some important differences and similarities to the Co-training framework. Similar to Co-training, we train two neural networks and the predictions from the GNN are used to enlarge the training set of the FCN. An important difference is that instead of using the predictions from the FCN to enlarge the training set for the GNN, we employ parameter sharing for passing the learned information from the FCN to the GNN. In our experiments, directly using the predictions of the FCN for GNN training resulted in reduced accuracy. This is due to the fact that the number of labeled samples for training the FCN is sufficiently low and hence the FCN does not make accurate enough predictions. Another important difference is that unlike the co-training framework, the FCN and GNN do not use completely distinct views of the data: the FCN uses feature vectors $\mathcal{X}$ and the GNN uses the feature vector and adjacency matrix $(\mathcal{X},\mathcal{A})$ .

A.3 Algorithm

The procedure for GraphMix training is given in Algorithm 1.

A.4 Datasets

We use three standard benchmark citation network datasets for semi-supervised node classification, namely Cora, Citeseer and Pubmed. In all these datasets, nodes correspond to documents and edges correspond to citations. Node features correspond to the bag-of-words representation of the document. Each node belongs to one of $C$ classes. During training, the algorithm has access to the feature vectors and edge connectivity of all the nodes but has access to the class labels of only a few of the nodes.

For semi-supervised link classification, we use two datasets Bitcoin Alpha and Bitcoin OTC from (Kumar et al., 2016, 2018). The nodes in these datasets correspond to the bitcoin users and the edge weights between them correspond to the degree of trust between the users. Following (Qu et al., 2019), we treat edges with weights greater than 3 as positive instances, and edges with weights less than -3 are treated as negative ones. Given a few labeled edges, the task is to predict the labels of the remaining edges. The statistics of these datasets as well as the number of training/validation/test nodes is presented in Appendix A.4.

The statistics of standard benchmark datasets as well as the number of training/validation/test nodes is presented in Table 5. The statistics of larger datasets in given in Table 6.

For larger datasets of Section 4.2, we took processed versions of these dataset available here 333https://github.com/shchur/gnn-benchmark. We did 10 random splits of the the data into train/validation/test split. For the classes which had more than 100 samples. We choose 20 samples per class for training, 30 samples per class for validation and the remaining samples as test data. For the classes which had less than 100 samples, we chose 20% samples, per class for training, 30% samples for validation and the remaining for testing. For each split we run experiments using 100 random seeds. The statistics of these datasets is presented in Appendix Table 6 and

A.5 Ablation Study

Since GraphMix consists of various components, some of which are common with the existing literature of semi-supervised learning, we set out to study the effect of various components by systematically removing or adding a component from GraphMix . We measure the effect of the following:

Removing both the Manifold Mixup and predicted targets from the FCN training.

Having only the Manifold Mixup training for FCN ( No predicted targets for FCN training)

Using the predicted targets for the FCN training (No Manifold Mixup in FCN training)

Using both Manifold Mixup and predicted targets for FCN training

The ablation results for semi-supervised node classification are presented in Table 7. We did not do any hyperparameter tuning for the ablation study and used the best performing hyperparameters found for the results presented in Table 1. We observe that all the components of GraphMix contribute to its performance, with Manifold Mixup training of FCN contributing possibly the most.

A.6 Comparison with State-of-the-art Methods

We present the comparion of GraphMix with the recent state-of-the-art methods as well as earlier methods using the standard Train/Validation/Test split in Table 8. We additionally use self-training based baselines, where FCN, GCN, GAT and Graph-U-Net are trained with self-generated targets. These are named as FCN (self-training), GCN (self-training), GAT (self-training) and Graph-U-Net(self-training) respectively in Table 8. For generating the predicted-targets in above two baselines, we followed the procedure of Appendix A.1.

A.7 Results with fewer labeled samples

We further evaluate the effectiveness of GraphMix in the learning regimes where fewer labeled samples exist. For each class, we randomly sampled $K\in\{5,10\}$ samples for training and the same number of samples for the validation. We used all the remaining labeled samples as the test set. We repeated this process for $10$ times. The results in Table 9 show that GraphMix achieves even better improvements when the labeled samples are fewer.

A.8 Implementation and Hyperparameter Details

We use the standard benchmark architecture as used in GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018) and GMNN (Qu et al., 2019), among others. This architecture has one hidden layer and the graph convolution is applied twice : on the input layer and on the output of the hidden layer. The FCN in GraphMix shares the parameters with the GCN.

GraphMix introduces four additional hyperparameters, namely the $\alpha$ parameter of $\operatorname{Beta}$ distribution used in Manifold Mixup training of the FCN, the maximum weighing coefficient $\mathcal{\gamma}$ which controls the trade-off between the supervised loss and the unsupervised loss (loss computed using the predicted-targets) of FCN, the temparature $T$ in sharpening and the number of random perturbations $K$ applied to the input data for the averaging of the predictions.

We conducted minimal hyperparameter seach over only $\alpha$ and $\mathcal{\gamma}$ and fixed the hyperparameters $T$ and $K$ to $0.1$ and $10$ respectively. The other hyperparameters were set to the best values for underlying GNN (GCN or GAT), including the learning rate, the $L2$ decay rate, number of units in the hidden layer etc. We observed that GraphMix is not very sensitive to the values of $\alpha$ and $\mathcal{\gamma}$ and similar values of these hyperparameters work well across all the benchmark datasets. Refer to Appendix A.8 and A.9 for the details about the hyperparameter values and the procedure used for the best hyperparameters selection.

For GCN and GraphMix(GCN), we used Adam optimizer with learning rate $0.01$ and $L2$ -decay 5e-4, the number of units in the hidden layer $16$ , dropout rate in the input layer and hidden layer was set to $0.5$ and $0.0$ , respectively. For GAT and GraphMix(GAT), we used Adam optimizer with learning rate $0.005$ and $L2$ -decay 5e-4, the number of units in the hidden layer $8$ , and the dropout rate in the input layer and hidden layer was searched from the values $\{0.2,0.5,0.8\}$ .

For $\alpha$ and $\mathcal{\gamma}$ of GraphMix(GCN) and GraphMix(GAT) , we searched over the values in the set $[0.0,0.1,1.0,2.0]$ and $[0.1,1.0,10.0,20.0]$ respectively.

For GraphMix(GCN) : $\alpha=1.0$ works best across all the datasets. $\mathcal{\gamma}=1.0$ works best for Cora and Citeseer and $\mathcal{\gamma}=10.0$ works best for Pubmed.

For GraphMix(GAT) : $\alpha=1.0$ works best for Cora and Citeseer and $\alpha=0.1$ works best for Pubmed. $\mathcal{\gamma}=1.0$ works best for Cora and Citeseer and $\mathcal{\gamma}=10.0$ works best for Pubmed. Input droputrate=0.5 and hidden dropout rate=0.5 work best for Cora and Citeseer and Input dropout rate=0.2 and hidden dropout rate =0.2 work best for Pubmed.

We conducted all the experiments for 2000 epochs. The value of weighing coefficient $w(t)$ in Algorithm 1) is increased from to its maximum value $\gamma$ from epoch 500 to 1000 using the sigmoid ramp-up of Mean-Teacher (Tarvainen & Valpola, 2017).

A.8.2 Hyperparameter Details for Results in Section 4.2

For all the experiments we use the standard architecture mentioned in Section A.8 and used Adam optimizer with learning rate 0.001 and 64 hidden units in the hidden layer. For Coauthor-CS and Coauthor-Physics, we trained the network for 2000 epochs. For Cora-Full, we trained the network for 5000 epochs because we observed the training loss of Cora-Full dataset takes longer to converge.

For Coauthor-CS and Coauthor-Physics: We set the input layer dropout rate to 0.5 and weight-decay to 0.0005, both for GCN and GraphMix(GCN) . We did not conduct any hyperparameter search over the GraphMix hyperparameters $\alpha$ , $\lambda_{max}$ , temparature $T$ and number of random permutations $K$ applied to the input data for GraphMix(GCN) for these two datasets, and set these values to $1.0$ , $1.0$ , $0.1$ and $10$ respectively.

For Cora-Full dataset: We found input layer dropout rate 0.2 and weight-decay 0.0 to be the best for both GCN and GraphMix(GCN) . For GraphMix(GCN) we fixed $\alpha$ , temparature $T$ and number of random permutations $K$ to $1.0$ $0.1$ and $10$ respectively. For $\lambda_{max}$ , we did search over $\{1.0,10.0,20.0\}$ and found that $10.0$ works best.

For all the GraphMix(GCN) experiments, the value of weighing coefficient $w(t)$ in Algorithm 1) is increased from to its maximum value $\gamma_{max}$ from epoch 500 to 1000 using the sigmoid ramp-up of Mean-Teacher (Tarvainen & Valpola, 2017).

A.8.3 For results reported in Section 4.3

For $\alpha$ of GraphMix(GCN) , we searched over the values in the set $[0.0,0.1,0.5,1.0]$ and found that $0.1$ works best for both the datasets. For $\mathcal{\gamma}$ , we searched over the values in the set $[0.1,1.0,10.0]$ and found that $0.1$ works best for both the datasets. We conducted all the experiments for 150 epochs. The value of weighting coefficient $w(t)$ in Algorithm 1) is increased from to its maximum value $\mathcal{\gamma}$ from epoch 75 to 125 using the sigmoid ramp-up of Mean-Teacher (Tarvainen & Valpola, 2017).

Both for GraphMix(GCN) and GCN, we use Adam optimizer with learning rate $0.01$ and $L2$ -decay $0.0$ , the number of units in the hidden layer $128$ , dropout rate in the input layer was set to $0.5$ .

A.8.4 For results reported in Section A.7

For $\alpha$ of GraphMix(GCN) , we searched over the values in the set $[0.0,0.1,0.5,1.0]$ and found that $0.1$ works best across all the datasets. For $\mathcal{\gamma}$ , we searched over the values in the set $[0.1,1.0,10.0]$ and found that $0.1$ and $1.0$ works best across all the datasets. Rest of the details for GraphMix(GCN) and GCN are same as Section A.8.1.

A.9 Hyperparameter Selection

For each configuration of hyperparameters, we run the experiments with $100$ random seeds. We select the hyperparameter configuration which has the best validation accuracy averaged over these $100$ trials. With this best hyperparameter configuration, for $100$ random seeds, we train the model again and use the validataion set for model selection ( i.e. we report the test accuracy at the epoch which has best validation accuracy.)

A.10 Soft-Rank

Let H be a matrix containing the hidden states of all the samples from a particular class. The Soft-Rank of matrix H is defined by the sum of the singular values of the matrix divided by the largest singular value. A lower Soft-Rank implies fewer dimensions with substantial variability and it provides a continuous analogue to the notion of rank from matrix algebra. This provides evidence that the concentration of class-specific states observed when using GraphMix in Figure 3 can be measured directly from the hidden states and is not an artifact of the T-SNE visualization.

A.11 Feature Visualization

We present the 2D visualization of the hidden states learned using GCN and GraphMix(GCN) for Cora, Pubmed and Citeseer datasets in Figure 3. We observe that for Cora and Citeseer, GraphMix learns substantially better hidden states than GCN. For Pubmed, we observe that although there is no clear separation between classes, "Green" and "Red" classes overlap less using the GraphMix, resulting in better hidden states.

A.12 Joint Optimization vs Alternate optimization

In this Section, we discuss the effect of hyperparameter $\lambda$ , that is used to compute the weighted sum of GCN and FCN losses in in Eq 7. In Figure 4, we see that a wide range of $\lambda$ ( from 0.1 to 1.0) achieves better validation accuracy than the vanilla GCN. Furthermore, alternate optimization of the FCN loss and the GCN loss achieves substantially better validation accuracy than the simultaneous optimization.