GraphMix: Improved Training of GNNs for Semi-Supervised Learning

Vikas Verma, Meng Qu, Kenji Kawaguchi, Alex Lamb, Yoshua Bengio, Juho Kannala, Jian Tang

Introduction

Due to the presence of graph-structured data across a wide variety of domains, such as biological networks, citation networks and social networks, there have been several attempts to design neural networks, known as graph neural networks (GNN), that can process arbitrarily structured graphs. Early work includes (Gori et al., 2005; Scarselli et al., 2009) which propose a neural network that can directly process most types of graphs e.g., acyclic, cyclic, directed, and undirected graphs. More recent approaches include (Bruna et al., 2013; Henaff et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2016; Gilmer et al., 2017; Hamilton et al., 2017; Veličković et al., 2018, 2019; Qu et al., 2019; Gao & Ji, 2019; Ma et al., 2019), among others. Many of these approaches are designed for addressing the problem of semi-supervised learning over graph-structured data (Zhou et al., 2018). Much of these research efforts have been dedicated to developing novel architectures.

Here we instead propose an architecture-agnostic method for regularized training of GNNs for semi-supervised node classification. Recently, regularization based on data-augmentation has been shown to be very effective in other types of neural networks but how to apply these techniques in GNNs is still under-explored. Our proposed method GraphMix 111code available at https://github.com/vikasverma1077/GraphMix is a unified framework that draws inspiration from interpolation based data augmentation (Zhang et al., 2018; Verma et al., 2019a) and self-training based data-augmentation (Laine & Aila, 2016; Tarvainen & Valpola, 2017; Verma et al., 2019b; Berthelot et al., 2019). We show that with our proposed method, we can achieve state-of-the-art performance even when using simpler GNN architectures such as Graph Convolutional Networks (Kipf & Welling, 2017), with no additional memory cost and with minimal additional computation cost. Further, we conduct a theoritical analysis to demonstrate the effectiveness of the proposed method over the underlying GNNs.

Problem Definition and Preliminaries

We are interested in the problem of semi-supervised node and edge classification using graph-structured data. We can formally define such graph-structured data as an undirected graph G=(V,A,X){\mathcal{G}}=({\mathcal{V}},\mathcal{A},\mathcal{X}), where V=VlVu\mathcal{V}=\mathcal{V}_{l}\cup\mathcal{V}_{u} is the union of labeled (Vl\mathcal{V}_{l}) and unlabeled (Vu\mathcal{V}_{u}) nodes in the graph with cardinalities nln_{l} and nun_{u}, and A\mathcal{A} is the adjacency matrix representing the edges between the nodes of V\mathcal{V}, XR(nl+nu)×d\mathcal{X}\in\mathbb{R}^{(n_{l}+n_{u})\times d} is the input node features. Each node vv belongs to one out of CC classes and can be labeled with a CC-dimensional one-hot vector yvRC{y}_{v}\in\mathbb{R}^{C}. Given the labels YlRnl×CY_{l}\in\mathbb{R}^{n_{l}\times C} of the labeled nodes Vl\mathcal{V}_{l}, the task is to predict the labels YuRnu×CY_{u}\in\mathbb{R}^{n_{u}\times C} of the unlabeled nodes Vu\mathcal{V}_{u}.

Graph Neural Networks (GNN) learn the lthl_{th} layer representations of a sample ii by leveraging the representations of the samples NB(i)NB(i) in the neighbourhood of ii. This is done by using an aggregation function that takes as an input the representations of all the samples along with the graph structure and outputs the aggregated representation. The aggregation function can be defined using the Graph Convolution layer (Kipf & Welling, 2017), Graph Attention Layer (Veličković et al., 2018), or any general message passing layer (Gilmer et al., 2017). Formally, let h(l)Rn×k\mathbf{h}^{(l)}\in\mathbb{R}^{n\times k} be a matrix containing the kk-dimensional representation of nn nodes in the lthl_{th} layer, then:

𝑙1𝜎𝐴𝐺𝐺𝑅𝐸𝐺𝐴𝑇𝐸superscript𝐡𝑙𝐖𝒜\displaystyle\mathbf{h}^{(l+1)}=\sigma(AGGREGATE(\mathbf{h}^{(l)}\mathbf{W},\mathcal{A})) (1) where WRk×k\mathbf{W}\in\mathbb{R}^{k\times k^{\prime}} is a linear transformation matrix, kk^{\prime} is the dimension of (l+1)th(l+1)_{th} layer, AGGREGATEAGGREGATE is the aggregation function that utilizes the graph adjacency matrix A\mathcal{A} to aggregate the hidden representations of neighbouring nodes and σ\sigma is a non-linear activation function, e.g. ReLU.

Recently, interpolation-based techniques have been proposed for regularizing neural networks. We briefly describe some of these techniques here. Mixup (Zhang et al., 2018) trains a neural network on the convex combination of input and targets, whereas Manifold Mixup (Verma et al., 2019a) trains a neural network on the convex combination of the hidden states of a randomly chosen hidden layer and the targets. While Mixup regularizes a neural network by enforcing that the model output should change linearly in between the examples in the input space, Manifold Mixup regularizes the neural network by learning better (more discriminative) hidden states.

Formally, suppose Tθ(x)=(fg)θ(x)T_{\theta}(\mathbf{x})=(f\circ g)_{\theta}(\mathbf{x}) is a neural network parametrized with θ\theta such that g:xhg:\mathbf{x}\xrightarrow{}\mathbf{h} is a function that maps input sample to hidden states, f:hy^f:\mathbf{h}\xrightarrow{}\mathbf{\hat{y}} is a function that maps hidden states to predicted output, λ\lambda is a random variable drawn from Beta(α,α)\operatorname{Beta}(\alpha,\alpha) distribution, Mixλ(a,b)=λa+(1λ)b\text{Mix}_{\lambda}(\mathbf{a},\mathbf{b})=\lambda*\mathbf{a}+(1-\lambda)*\mathbf{b} is an interpolation function, D\mathcal{D} is the data distribution, (x,y)(\mathbf{x},\mathbf{y}) and (x,y)(\mathbf{x^{\prime}},\mathbf{y^{\prime}}) is a pair of labeled examples sampled from distribution D\mathcal{D} and \ell be a loss function such as cross-entropy loss, then the Manifold Mixup Loss is defined as:

We use above Manifold Mixup loss for training an auxiliary Fully-connected-network as described in Section 3.

GraphMix

Data Augmentation is one of the simplest and most efficient technique for regularizing a neural network. In the domains of computer vision, speech and natural language, there exist efficient data augmentation techniques, for example, random cropping, translation or Cutout (Devries & Taylor, 2017) for computer vision, (Ko et al., 2015) and (Park et al., 2019) for speech and (Xie et al., 2017) for natural language. However, data augmentation for the graph-structured data remains under-explored. There exists some recent work along these lines but the prohibitive computation cost (see Section • ‣ 5) introduced by these methods make them impractical for real-world large graph datasets. Based on these limitations, our main objective is to propose an efficient data augmentation technique for graph datasets.

Recent work based on interpolation-based data augmentation (Zhang et al., 2018; Verma et al., 2019a) has seen sizable improvements in regularization performance across a number of tasks. However, these techniques are not directly applicable to graphs for an important reason: Although we can create additional nodes by interpolating the features and corresponding labels, it remains unclear how these new nodes must be connected to the original nodes via synthetic edges such that the structure of the whole graph is preserved. To alleviate this issue, we propose to train an auxiliary Fully-Connected Network (FCN) using Manifold Mixup as discussed in Section 3.2. Note that the FCN only uses the node features (not the graph structure), thus the Manifold mixup loss in Eq. 2 can be directly used for training the FCN.

Interpolation based data-augmentation techniques have an added advantage for training GNNs: A vanilla GNN learns the representation of each node by iteratively aggregating information from the neighbors of that node (Equation 1). However, this induces the problem of oversmoothing (Li et al., 2018; Xu et al., 2018) while training GNNs with many layers. Due to this limitation, GNNs are trained only with a few layers, and thus they can only leverage the local neighbourhood of each node for learning its representations, without leveraging the representations of the nodes which are multiple hops away in the graph. This limitation can be addressed using the interpolation-based method such as Manifold Mixup: in Manifold Mixup, the representations of a randomly chosen pair of nodes is used to facilitate better representation learning; it is possible that the randomly chosen pair of nodes will not be in the local neighbourhood of each other. Based on these challenges and motivations we present our proposed approach GraphMix for training GNNs in the following Section.

2 Method

We first describe GraphMix at a high-level and then give a more formal description. GraphMix augments the vanilla GNN with a Fully-Connected Network (FCN). The FCN loss is computed using Manifold Mixup as discussed below and the GNN loss is computed normally. Manifold Mixup training of FCN facilitates learning more discriminative node representations (Verma et al., 2019a). An important question is how these more discriminative node representations can be transferred to the GNN? One potential approach could involve maximizing the mutual information between the hidden states of the FCN and the GNN using formulations similar to those proposed by (Hjelm et al., 2019; Sun et al., 2020). However, this requires optimizing additional network parameters. Instead, we propose parameter sharing between FCN and GNN to facilitate the transfer of discriminative node representations from the FCN to the GNN. It is a viable option because as mentioned in Eq 1, a GNN layer typically performs an additional operation (AGGREGATEAGGREGATE) on the linear transformations of node representations (which are essentially pre-activation representations of the FCN layer). Using the more discriminative representations of the nodes from FCN, as well as the graph structure, the GNN loss is computed in the usual way to further refine the node representations. In this way we can exploit the improved representations from Manifold Mixup for training GNNs.

In Section 3.3, without making any assumption about the aggregation function and the depth of the graph neural network, we show that GraphMix improves the generalization of the underlying graph neural network. This makes GraphMix applicable to various kind of architectures having different aggregation functions, such as weighed averaging in GCN (Kipf & Welling, 2016), attention based aggregation in GAT (Veličković et al., 2018) and graph-pooling/unpooling operations in Graph U-Nets (Gao & Ji, 2019). In the aforementioned sense, GraphMix procedure is highly flexible: it can be applied to any underlying GNN as long as the underlying GNN applies parametric transformations to the node features.

Additionally, we propose to use the predicted targets from the GNN to augment the training set of the FCN. In this way, both the FCN and the GNN facilitate each other’s learning process. Both the FCN loss and the GNN loss are optimized in an alternating fashion during training. At inference time, predictions are made using only the GNN.

A diagram illustrating GraphMix is presented in Figure 1 and the full algorithm is presented in Appendix A.3. Further, we draw similarities and difference of GraphMix w.r.t. Co-training framework in the Appendix A.2.

So far we have presented the general design of GraphMix, now we present GraphMix more formally. Given a graph G\mathcal{G}, let (Xl,Yl)(\mathbf{X}_{l},\mathbf{Y}_{l}) be the input features and the labels of the labeled nodes Vl\mathcal{V}_{l} and let (Xu)(\mathbf{X}_{u}) be the input features of the unlabeled nodes Vu\mathcal{V}_{u}. Let FθF_{\theta} and GθG_{\theta} be a FCN and a GNN respectively, which share the parameters θ\theta. The FCN loss from the labeled data is computed using Eq. 2 as follows:

For unlabeled nodes Vu\mathcal{V}_{u}, we compute the prediction Y^u\mathbf{\hat{Y}_{u}} using the GNN:

We note that recent state-of-the-art semi-supervised learning methods use a teacher model to accurately predict targets for the unlabeled data. The teacher model can be realized as a temporal ensemble of the student model (the model being trained) (Laine & Aila, 2016) or by using an Exponential Moving Average (EMA) of the parameters of the student model (Tarvainen & Valpola, 2017; Verma et al., 2019b). Different from these approaches, we use the GNN as a teacher model for predicting the targets for the FCN. This is due to the fact that GNNs leverage graph structure, which in practice, allows them to make more accurate predictions than the temporal ensemble or EMA ensemble of FCN (although there is no theoretical guarantee for this).

Moreover, to improve the accuracy of the predicted targets in Eq 4, we applied the average of the model prediction on KK random perturbations of an input sample along with sharpening as discussed in Appendix A.1.

Using the predicted targets for unlabeled nodes, we create a new training set (Xu,Y^u)(\mathbf{X}_{u},\mathbf{\hat{Y}_{u}}). The loss from the unlabeled data for the FCN is computed as:

Total loss for training the FCN is given as the weighted sum of above two loss terms.

subscriptℒsupervised𝑤𝑡subscriptℒunsupervised\displaystyle\mathcal{L}_{\text{FCN}}=\mathcal{L}_{\text{supervised}}+w(t)*\mathcal{L}_{\text{unsupervised}} (6) where w(t)w(t) is a sigmoid ramp-up function (Tarvainen & Valpola, 2017) which increases from zero to its max value γ\gamma during the course of training.

Now let us assume that the loss for an underlying GNN is LGNN=(Gθ(Xl),Yl)\mathcal{L}_{\text{GNN}}=\ell(G_{\theta}(\mathbf{X}_{l}),\mathbf{Y}_{l}); the overall GraphMix loss for the joint training of the FCN and GNN can be defined as the weighted sum of the GNN loss and the FCN loss:

subscriptℒGNN𝜆subscriptℒFCN\displaystyle\mathcal{L}_{\text{GraphMix }}=\mathcal{L}_{\text{GNN}}+\lambda*\mathcal{L}_{\text{FCN}} (7) However, throughout our experiments, optimizing FCN loss and GNN loss alternatively at each training epoch achieved better test accuracy (discussed in Appendix A.12). This has an added benefit that it removes the need to tune weighing hyper-parameter λ\lambda.

For Manifold Mixup training of FCN, we apply mixup only in the hidden layer. Note that in (Verma et al., 2019a), the authors recommended applying mixing in a randomly chosen layer (which also includes the input layer) at each training update. However, we observed under-fitting when applying mixup randomly at the input layer or the hidden layer. Applying mixup only in the input layer also resulted in underfitting and did not improve test accuracy.

Memory and Computational Requirements: One of the major limitations of current GNNs, which prohibits their application to real-world large datasets, is their memory complexity. For example, the fastest implementation of GCN, which stores the entire adjacency matrix A\mathcal{A} in the memory, has the memory complexity O(V2)O(|\mathcal{V}|^{2}). Implementations with lower memory requirements are possible but they incur higher latency cost due to repeatedly loading parts of adjacency matrix in the memory. Due to these reasons, methods which have additional memory requirement in comparison to the baseline GNNs, are less appealing in practice. In GraphMix, since the parameters of the FCN and GNN are shared, there is no additional memory requirement. Furthermore, GraphMix does not add any significant computation cost over the underlying GNN, because the underlying GNN is trained in the standard way and the FCN training requires trivial additional computation cost for computing the predicted-targets (Appendix A.1) and the interpolation function ( Mixλ(a,b)\text{Mix}_{\lambda}(\mathbf{a},\mathbf{b}) in Section 2).

3 Analysis

In this subsection, we study how GraphMix impacts the generalization bound of a underlying GNN. Our analysis, which is based on Rademacher complexity (Bartlett & Mendelson, 2002), provides a new generalization bound for GraphMix, which shows how adding regularization to training the FCN with pseudolabels improves overall generalization. We rely on the effect of changing one sample in Manifold Mixup and the fact that the weights are shared by a GNN and the corresponding FCN in GraphMix.

Let GG be a fixed graph with nn total nodes. Define zi=(xi,yi)z_{i}=(x_{i},y_{i}) to be the pair of the feature and the true label of the ii-th node. Without loss of generality, let S=(z1,,zm)S=(z_{1},\dots,z_{m}) be the training set with the labeled nodes where m=nlm=n_{l} and data points z1,,zmz_{1},\dots,z_{m} are sampled according to an unknown distribution D\mathcal{D} to form the labeled node set SS. In this subsection, we follow the previous paper on graph neural networks by assuming that all samples are i.i.d. (including replacement sample) (Verma & Zhang, 2019).

Let Γ\Gamma be a finite set of the hyperparameters. For every hyperparameter γΓ\gamma\in\Gamma, define Fγ\mathcal{F}_{\gamma} to be a distribution-dependent hypothesis space corresponding the hyperparameter γ\gamma. That is, Fγ={fγ:(SS)[fγ=Aγ(S)]}\mathcal{F}_{\gamma}=\{f_{\gamma}:(\exists S\in\mathcal{S})[f_{\gamma}=\mathcal{A}_{\gamma}(S)]\} where Aγ\mathcal{A}_{\gamma} is an algorithm that outputs the hypothesis fγf_{\gamma} given a dataset SS, and S\mathcal{S} is the set of training datasets such that the probability of SSS\in\mathcal{S} according to D\mathcal{D} is one. For each fγFγf_{\gamma}\in\mathcal{F}_{\gamma}, fγ( ;G)f_{\gamma}(\cdot\ ;G) represents GNN with the graph GG and fγ( ;G0)f_{\gamma}(\cdot\ ;G_{0}) represents FCN where G0G_{0} is the null graph version of GG (i.e., GG without edges). Let Rm(Fγ)\mathcal{R}_{m}^{\ell}(\mathcal{F}_{\gamma}) be the Rademacher complexity (Bartlett & Mendelson, 2002) of the set {(x,y)(fγ(x;G),y):fγFγ}\{(x,y)\mapsto\ell(f_{\gamma}(x;G),y):f_{\gamma}\in\mathcal{F}_{\gamma}\}.

Let LGNN(S,fγ)L_{{\text{GNN}}}(S,f_{\gamma}) be the LGNN\mathcal{L}_{{\text{GNN}}} with the GNN fγ( ;G)f_{\gamma}(\cdot\ ;G) and labeled data points SS as LGNN(S,fγ)=1mi=1m(fγ(xi;G),yi)L_{{\text{GNN}}}(S,f_{\gamma})=\frac{1}{m}\sum_{i=1}^{m}\ell(f_{\gamma}(x_{i};G),y_{{i}}). Let LFCN(S,fγ)L_{{\text{FCN}}}(S,f_{\gamma}) be LFCN\mathcal{L}_{{\text{FCN}}} with the FCN fγ( ;G0)f_{\gamma}(\cdot\ ;G_{0}) and labeled data points SS as LFCN(S,fγ)=LMM(S,fγ( ;G0),α)+numLMM((XuS,Y^uS),fγ( ;G0),α)L_{{\text{FCN}}}(S,f_{\gamma})=\mathcal{L}_{\text{MM}}(S,f_{\gamma}(\cdot\ ;G_{0}),\alpha)+\frac{n_{u}}{m}\mathcal{L}_{\text{MM}}((\mathbf{X}_{u}^{S},\mathbf{\hat{Y}}_{u}^{S}),f_{\gamma}(\cdot\ ;G_{0}),\alpha) where (XuS,Y^uS)(\mathbf{X}_{u}^{S},\mathbf{\hat{Y}}_{u}^{S}) is the unlabeled nodes (Xu,Y^u)(\mathbf{X}_{u},\mathbf{\hat{Y}}_{u}) that corresponds to the labeled node set SS. Let LGraphMix(S,fγ)=LGNN(S,fγ)+λLFCN(S,fγ)L_{{\text{GraphMix}}}(S,f_{\gamma})=L_{{\text{GNN}}}(S,f_{\gamma})+\lambda L_{{\text{FCN}}}(S,f_{\gamma}). Let cc be the upper bound on per-sample loss as c(fγ(xi;G),yi)c\geq\ell(f_{\gamma}(x_{i};G),y_{{i}}) and c(fγ(xi;G0),yi)c\geq\ell(f_{\gamma}(x_{i};G_{0}),y_{{i}}). For example, c=1c=1 for training and test error (or 0-1 loss).

Theorem 1 provides a generalization bound for GraphMix, which shows that GraphMix can improve the generalization bound of the underlying GNN under the condition of V>0V>0 as discussed below.

For any δ>0\delta>0, with probability at least 1δ1-\delta, the following holds: for all γΓ\gamma\in\Gamma and all fγFγf_{\gamma}\in\mathcal{F}_{\gamma}, we have that Ex,yD[(fγ(x;G),y)]LGraphMix(S,fγ)Rm(Fγ)+cln(Γ/δ)2mλV\mathbb{E}_{x,y\sim\mathcal{D}}[\ell(f_{\gamma}(x;G),y)]-L_{{\text{GraphMix}}}(S,f_{\gamma})\leq\mathcal{R}_{m}^{\ell}(\mathcal{F}_{\gamma})+c\sqrt{\frac{\ln(|\Gamma|/\delta)}{2m}}-\lambda V, where V=ESDm[inffγFγLFCN(S,fγ)]4cln(Γ/δ)2m.V=\mathbb{E}_{S^{\prime}\sim\mathcal{D}^{m}}[\inf_{f_{\gamma}\in\mathcal{F}_{\gamma}}L_{{\text{FCN}}}(S^{\prime},f_{\gamma})]-4c\sqrt{\frac{\ln(|\Gamma|/\delta)}{2m}}.

The proof of Theorem 1 is given in the following. The generalization bound in Theorem 1 becomes a generalization bound for GNN without GraphMix when λ=0\lambda=0 as desired. By comparing the generalization bound with λ=0\lambda=0 (no GraphMix) and λ>0\lambda>0 (with GraphMix), we can see that GraphMix improves the generalization bound of underlying GNN when V=ESDm[inffγFγLFCN(S,fγ)]4cln(Γ/δ)/(2m)>0.V=\mathbb{E}_{S^{\prime}\sim\mathcal{D}^{m}}[\inf_{f_{\gamma}\in\mathcal{F}_{\gamma}}L_{{\text{FCN}}}(S^{\prime},f_{\gamma})]-4c\sqrt{\ln(|\Gamma|/\delta)/(2m)}>0. Here, the first term ESDm[inffγFγLFCN(S,fγ)]\mathbb{E}_{S^{\prime}\sim\mathcal{D}^{m}}[\inf_{f_{\gamma}\in\mathcal{F}_{\gamma}}L_{{\text{FCN}}}(S^{\prime},f_{\gamma})] increases as the hypothesis space Fγ\mathcal{F}_{\gamma} gets “smaller” or has less complexity. Thus, GraphMix improves the generalization bound of an underlying GNN when the hyperparameter search over γΓ\gamma\in\Gamma results in Fγ\mathcal{F}_{\gamma} of moderate complexity such that the first term is greater than 4cln(Γ/δ)/(2m)4c\sqrt{\ln(|\Gamma|/\delta)/(2m)}. The first term contains the manifold mixup loss LFCN(S,fγ)L_{{\text{FCN}}}(S^{\prime},f_{\gamma}) over random training datasets SS^{\prime} (S\neq S), which is expected to be larger when compared with that of the standard loss without manifold mixup.

For each fixed Fγ\mathcal{F}_{\gamma}, the generalization bound in Theorem 1 goes to zero as mm\rightarrow\infty since Rm(Fγ)0\mathcal{R}_{m}^{\ell}(\mathcal{F}_{\gamma})\rightarrow 0 and VV00V\rightarrow V_{0}\geq 0 as mm\rightarrow\infty. The training loss is also minimized when the trained model fγFγf_{\gamma}\in\mathcal{F}_{\gamma} fits well to the particular given training data set SS. Therefore, given a particular training dataset SS, the expected loss is minimized if we conduct a hyperparameter search over γΓ\gamma\in\Gamma such that fγFγf_{\gamma}\in\mathcal{F}_{\gamma} minimize the training loss for the given SS but Fγ\mathcal{F}_{\gamma} has moderate complexity to not being able to minimize the manifold mixup losses for other datasets SS^{\prime} (S\neq S) drawn from D\mathcal{D}. Unlike the standard data points, the data points generated during manifold mixup in GraphMix are typically not memorizable or interpolated by FCN.

4 Proof of Theorem 1

In the proof, we analyse the effect of changing one sample in Manifold Mixup and GNN-generated targets by utilizing the fact that the weights are shared by a GNN and the corresponding FCN. The the fact of sharing weights also results in the generalization bound that relates the generalization in FCN via Manifold Mixup to the generalization of the GNN.

Let γΓ\gamma\in\Gamma be fixed. Define φ(S)=supfγFγEx,yD[(fγ(x;G),y)]LGraphMix(S,fγ)\varphi(S)=\sup_{f_{\gamma}\in\mathcal{F}_{\gamma}}\mathbb{E}_{x,y\sim\mathcal{D}}[\ell(f_{\gamma}(x;G),y)]-L_{{\text{GraphMix}}}(S,f_{\gamma}). We first provide an upper bound on φ(S)\varphi(S) by using McDiarmid’s inequality. To apply McDiarmid’s inequality to φ(D)\varphi(\mathcal{D}), we compute an upper bound on φ(S)φ(S)|\varphi(S)-\varphi(S^{\prime})| where SS and SS^{\prime} be two training datasets differing by exactly one point of an arbitrary index i0i_{0}; i.e., Si=SiS_{i}=S^{\prime}_{i} for all ii0i\neq i_{0} and Si0Si0S_{i_{0}}\neq S^{\prime}_{i_{0}}.

𝜆subscript𝐿FCN𝑆subscript𝑓𝛾subscript𝐿FCNsuperscript𝑆′subscript𝑓𝛾\displaystyle+\lambda(L_{{\text{FCN}}}(S,f_{\gamma})-L_{{\text{FCN}}}(S^{\prime},f_{\gamma})) where the first line follows the property of the supremum, sup(a)sup(b)sup(ab)\sup(a)-\sup(b)\leq\sup(a-b), and the second line follows the definition of LGraphMixL_{{\text{GraphMix}}}.

where we used the fact that given a fixed GG and a fixed fγf_{\gamma}, (fγ(xi;G),yi)=(fγ(xi;G),yi)\ell(f_{\gamma}(x_{i};G),y_{{i}})=\ell(f_{\gamma}(x_{i}^{\prime};G),y_{{i}}^{\prime}) for ii0i\neq i_{0}. This holds since fγ( ;G)f_{\gamma}(\cdot\ ;G) does not depend on SS given a GG and a fγf_{\gamma}, even though fγ(xi;G)f_{\gamma}(x_{i};G) contains the aggregation functions over the graph GG.

where we use the fact that LMM(S,fγ( ;G0),α)\mathcal{L}_{\text{MM}}(S,f_{\gamma}(\cdot\ ;G_{0}),\alpha) has m2m^{2} terms and 2m12m-1 terms differ for SS and SS^{\prime}, each of which is bounded by the constant cc. Similarly, LMM((XuS,Y^uS),fγ( ;G0),α)LMM((XuS,Y^uS),fγ( ;G0),α)2cnu\mathcal{L}_{\text{MM}}((\mathbf{X}_{u}^{S},\mathbf{\hat{Y}}_{u}^{S}),f_{\gamma}(\cdot\ ;G_{0}),\alpha)-\mathcal{L}_{\text{MM}}((\mathbf{X}_{u}^{S^{\prime}},\mathbf{\hat{Y}}_{u}^{S^{\prime}}),f_{\gamma}(\cdot\ ;G_{0}),\alpha)\leq\frac{2c}{n_{u}}, since the labels Y^uS\mathbf{\hat{Y}}_{u}^{S} and Y^uS\mathbf{\hat{Y}}_{u}^{S^{\prime}} are determined by fγ(xi;G)f_{\gamma}(x_{i};G), and fγ(xi;G)=fγ(xi;G)f_{\gamma}(x_{i};G)=f_{\gamma}(x_{i}^{\prime};G) for ii0i\neq i_{0}, given a fixed GG and a fixed fγf_{\gamma}. Therefore,

14𝜆𝑚\displaystyle\varphi(S^{\prime})-\varphi(S)\leq\frac{c(1+4\lambda)}{m}. Similarly, φ(S)φ(S)c(1+4λ)m\varphi(S)-\varphi(S^{\prime})\leq\frac{c(1+4\lambda)}{m}. Thus, by McDiarmid’s inequality, for any δ>0\delta>0, with probability at least 1δ/Γ1-\delta/|\Gamma|,

subscript𝔼¯𝑆delimited-[]𝜑¯𝑆𝑐14𝜆Γ𝛿2𝑚\varphi(S)\leq\mathbb{E}_{\bar{S}}[\varphi(\bar{S})]+c(1+4\lambda)\sqrt{\frac{\ln(|\Gamma|/\delta)}{2m}}. Moreover,

subscript𝔼¯𝑆delimited-[]𝜑¯𝑆𝜆subscript𝔼¯𝑆delimited-[]subscriptinfimumsubscript𝑓𝛾subscriptℱ𝛾subscript𝐿FCN¯𝑆subscript𝑓𝛾\displaystyle\mathbb{E}_{\bar{S}}[\varphi(\bar{S})]+\lambda\mathbb{E}_{\bar{S}}\left[\inf_{f_{\gamma}\in\mathcal{F}_{\gamma}}L_{{\text{FCN}}}(\bar{S},f_{\gamma})\right] ESˉ[supfγFγExˉ,yˉD[(fγ(xˉ;G),yˉ)]LGNN(Sˉ,fγ)]\displaystyle\leq\mathbb{E}_{\bar{S}}\left[\sup_{f_{\gamma}\in\mathcal{F}_{\gamma}}\mathbb{E}_{\bar{x}^{\prime},\bar{y}^{\prime}\sim\mathcal{D}}[\ell(f_{\gamma}(\bar{x}^{\prime};G),\bar{y}^{\prime})]-L_{{\text{GNN}}}(\bar{S},f_{\gamma})\right] 2Rn(Θ)\displaystyle\leq 2\mathcal{R}_{n}(\Theta) where the second line and the last line use the subadditivity of supremum, the third line uses the Jensen’s inequality and the convexity of the supremum, the fourth line follows that for each ξi{1,+1}\xi_{i}\in\{-1,+1\}, the distribution of each term ξi((fγ(xˉi;G),yˉi)(fγ(xˉi;G),yˉi))\xi_{i}(\ell(f_{\gamma}(\bar{x}_{i}^{\prime};G),\bar{y}_{i}^{\prime})-\ell(f_{\gamma}(\bar{x}_{i};G),\bar{y}_{i})) is the distribution of ((fγ(xˉi;G),yˉi)(fγ(xˉi;G),yˉi))(\ell(f_{\gamma}(\bar{x}_{i}^{\prime};G),\bar{y}_{i}^{\prime})-\ell(f_{\gamma}(\bar{x}_{i};G),\bar{y}_{i})) since Sˉ\bar{S} and Sˉ\bar{S}^{\prime} are drawn iid with the same distribution.

Therefore, for any δ>0\delta>0, with probability at least 1δ/Γ1-\delta/|\Gamma|, all fγFγf_{\gamma}\in\mathcal{F}_{\gamma},

superscriptsubscriptℛ𝑚ℓsubscriptℱ𝛾𝑐Γ𝛿2𝑚𝜆𝑉\displaystyle\leq\mathcal{R}_{m}^{\ell}(\mathcal{F}_{\gamma})+c\sqrt{\frac{\ln(|\Gamma|/\delta)}{2m}}-\lambda V. by taking union bounds over all γΓ\gamma\in\Gamma, we obtain the statement of this theorem.

Experiments

We present results for GraphMix algorithm using standard benchmark datasets and the standard architecture in Section 4.1 and 4.3. We also conduct an ablation study on GraphMix in Section A.5 to understand the contribution of various components to its performance. Refer to Appendix A.4 for dataset details and A.8 for implementation and hyperparameter details.

For baselines, we choose GCN (Kipf & Welling, 2017), and the recent state-of-the-art graph neural networks GAT (Veličković et al., 2018), GMNN (Qu et al., 2019) and Graph U-Net (Gao & Ji, 2019), as well as the recently proposed regularizers for graph neural networks. GraphMix(GCN) , GraphMix(GAT) and GraphMix(Graph U-Net) refer to the methods where underlying GNNs are GCN, GAT and Graph U-Net respectively. Refer to Appendix A.8 for implementation and hyperparameter details. (Shchur et al., 2018) demonstrated that the performance of the current state-of-the-art Graph Neural Networks on the standard train/validation/test split of the popular benchmark datasets (such as Cora, Citeseer, Pubmed, etc) is significantly different from their performance on the random splits. For fair evaluation, they recommend using multiple random partitions of the datasets. Along these lines, we created 1010 random splits of the Cora, Citeseer and Pubmed with the same train/ validation/test number of samples as in the standard split. We also provide the results for the standard train/validation/test split. The results are presented in Table 2. We observe that GraphMix always improves the accuracy of the underlying GNNs such as GCN, GAT and Graph-U-Net across all the dataset, with GraphMix(GCN) achieving the best results. We further present results with fewer labeled samples in Appendix A.7. We observer that the relative increase in test accuracy using GraphMix over the baseline GNN is more pronounced when the labeled samples are fewer. This makes GraphMix particularly appealing for very few labeled data problems.

2 Results on Larger Datasets

We also provide results on three recently proposed datasets which are relatively larger than standard benchmark datasets (Cora/Citeseer/Pubmed). We use Cora-Full dataset proposed in (Bojchevski & Günnemann, 2018) and Coauthor-CS and Coauthor-Physics datasets proposed in (Shchur et al., 2018). The results are presented in Table 3222We do not provide results for GAT based experiments in Table 3 and Table 4 because we ran out of GPU memory required to run these experiments with larger (higher number of nodes) datasets.. We observe that GraphMix(GCN) improves the results over GCN for all the three datasets with a significant margin. We note that we did minimal hyperparameter search for GraphMix(GCN) as mentioned in Appendix A.8.2. The details of the datasets is given in Appendix A.4.

3 Semi-supervised Link Classification

In the semi-supervised link classification problem, the task is to predict the labels of the remaining links, given a graph and labels of a few links. Following (Taskar et al., 2004), we can formulate the link classification problem as a node classification problem, i.e., given an original graph GG, we construct a dual Graph GG^{\prime}, the node set VV^{\prime} of the dual graph corresponds to the link set EE^{\prime} of the original graph. The nodes in the dual graph GG^{\prime} are connected if their corresponding links in the graph GG share a node. The attributes of a node in the dual graph are defined as the index of the nodes of the corresponding link in the original graph. Using this formulation, we present results on link classification on Bit OTC and Bit Alpha benchmark datasets in the Table 4. As the numbers of the positive and negative edges are strongly imbalanced, we report the F1 score. Our results show that GraphMix(GCN) improves the performance over the baseline GCN method and is comparable with the recently proposed state-of-the-art method GMNN (Qu et al., 2019) for both the datasets.

4 Visualization of the Learned Features

In Figure 2, we present an analysis of the features learned by GraphMix for Cora dataset using the t-SNE (van der Maaten & Hinton, 2008) based 2D visualization of the hidden states. We observe that GraphMix learns hidden states which are better separated and condensed than GCN and GCN(self-training). Here, GCN(self-training) refers to training a GCN in a normal way but with additional self-prediction based targets for the unlabeled samples. We further evaluate the Soft-rank (refer to Appendix A.10) of the class-specific hidden states to demonstrate that GraphMix(GCN) makes the class-specific hidden states more concentrated than GCN and GCN(self-training), as shown in Figure 2(d). Refer to Appendix A.11 for 2D representation of other datasets.

Related Work

Semi-supervised Learning over Graph Data: There exists a long line of work for semi-supervised learning over graph data (Zhu & Ghahramani, 2002; Zhu et al., 2003). Contrary to previous methods, the recent GNN based approaches define the convolutional operators using the neighbourhood information of the nodes (Kipf & Welling, 2017; Veličković et al., 2018).These convolution operator based method exhibit state-of-the-results for semi-supervised learning over graph data, hence much of the recent attention is dedicated to proposing architectural changes to these methods (Qu et al., 2019; Gao & Ji, 2019; Ma et al., 2019). Unlike these methods, we propose a regularization technique that can be applied to any of these GNNs which uses a parameterized transformation on the node features.

Data Augmentation: It is well known that the generalization of a learning algorithm can be improved by enlarging the training data size. Since labeling more samples is labour-intensive and costly, Data-augmentation has become de facto technique for enlarging the training data size. Mixing based algorithms are a particular class of data-augmentation methods in which additional training samples are generated by interpolating the samples (either in the input or hidden space) and/or their corresponding targets. Mixup (Zhang et al., 2018), BC-learning (Tokozume et al., 2017), Manifold Mixup (Verma et al., 2019a), AMR (Beckham et al., 2019) are notable examples of this class of algorithms. Unlike, these methods which have been proposed for the fixed topology datasets such as images, in this work, we propose interpolation based data-augmentation techniques for graph-structured data.

Regularizing Graph Neural Networks: Regularizing Graph Neural Networks has drawn some attention recently. GraphSGAN (Ding et al., 2018) first uses an embedding method such as DeepWalk (Perozzi et al., 2014) and then trains generator-classifier networks in the adversarial learning setting to generate fake samples in the low-density region between sub-graphs. BVAT (Deng et al., 2019) and (Feng et al., 2019) generate adversarial perturbations to the features of the graph nodes while taking graph structure into account. While these methods improve generalization in graph-structured data, they introduce significant additional computation cost: GraphScan requires computing node embedding as a preprocessing step, BVAT and (Feng et al., 2019) require additional gradient computation for computing adversarial perturbations. Unlike these methods, GraphMix does not introduce any significant additional computation since it is based on interpolation-based techniques and self-generated targets.

Discussion

We presented GraphMix, a simple and efficient regularizer for semi-supervised node classification using graph neural networks. Through extensive experiments, we demonstrated state-of-the-art performances using GraphMix on various benchmark datasets. We also presented a theoritical analysis to compare generalization bounds of GraphMix vs the underlying GNNs. Further, we conduct a systematic ablation study to understand the effect of different components in the performance of GraphMix. The strong empirical results of GraphMix suggest that in parallel to designing new architectures, exploring better regularization for graph-structured data is a promising avenue for research. A future research direction is to jointly model the node features and edges of the graph such that it can be further used for generating the synthetic interpolated nodes and their corresponding connectivity to the other nodes in the graph. This will alleviate the need to train the auxiliary FCN in GraphMix.

References

Appendix A Appendix

A recently proposed method for accurate target predictions for unlabeled data uses the average of predicted targets across KK random augmentations of the input sample (Berthelot et al., 2019). Along these lines, in GraphMix we compute the predicted-targets as the average of predictions made by GNN on KK drop-out versions of the input sample.

Many recent semi-supervised learning algorithms (Laine & Aila, 2016; Miyato et al., 2018; Tarvainen & Valpola, 2017; Verma et al., 2019b) are based on the cluster assumption (Chapelle et al., 2010), which posits that the class boundary should pass through the low-density regions of the marginal data distribution. One way to enforce this assumption is to explicitly minimize the entropy of the model’s predictions p(yx,θ)p(y|\mathbf{x},\theta) on unlabeled data by adding an extra loss term to the original loss term (Grandvalet & Bengio, 2005). The entropy minimization can be also achieved implicitly by modifying the model’s prediction on the unlabeled data such that the prediction has low entropy and using these low-entropy predictions as targets for the further training of the model. Examples include "Pseudolabels" (Lee, 2013) and "Sharpening" (Berthelot et al., 2019). Pseudolabeling constructs hard (one-hot) labels for the unlabeled samples which have “high-confidence predictions”. Since many of the unlabeled samples may have “low-confidence predictions”, they can not be used in the Pseudolabeling technique. On the other hand, Sharpening does not require “high-confidence predictions” , and thus it can be used for all the unlabelled samples. Hence in this work, we use Sharpening for entropy minimization. The Sharpening function over the model prediction p(yx,θ)p(y|\mathbf{x},\theta) can be formally defined as follows (Berthelot et al., 2019), where TT is the temperature hyperparameter and CC is the number of classes:

A.2 Connection to Co-training

The GraphMix approach can be seen as a special instance of the Co-training framework (Blum & Mitchell, 1998). Co-training assumes that the description of an example can be partitioned into two distinct views and either of these views would be sufficient for learning given sufficient labeled data. In this framework, two learning algorithms are trained separately on each view and then the prediction of each learning algorithm on the unlabeled data is used to enlarge the training set of the other. Our method has some important differences and similarities to the Co-training framework. Similar to Co-training, we train two neural networks and the predictions from the GNN are used to enlarge the training set of the FCN. An important difference is that instead of using the predictions from the FCN to enlarge the training set for the GNN, we employ parameter sharing for passing the learned information from the FCN to the GNN. In our experiments, directly using the predictions of the FCN for GNN training resulted in reduced accuracy. This is due to the fact that the number of labeled samples for training the FCN is sufficiently low and hence the FCN does not make accurate enough predictions. Another important difference is that unlike the co-training framework, the FCN and GNN do not use completely distinct views of the data: the FCN uses feature vectors X\mathcal{X} and the GNN uses the feature vector and adjacency matrix (X,A)(\mathcal{X},\mathcal{A}).

A.3 Algorithm

The procedure for GraphMix training is given in Algorithm 1.

A.4 Datasets

We use three standard benchmark citation network datasets for semi-supervised node classification, namely Cora, Citeseer and Pubmed. In all these datasets, nodes correspond to documents and edges correspond to citations. Node features correspond to the bag-of-words representation of the document. Each node belongs to one of CC classes. During training, the algorithm has access to the feature vectors and edge connectivity of all the nodes but has access to the class labels of only a few of the nodes.

For semi-supervised link classification, we use two datasets Bitcoin Alpha and Bitcoin OTC from (Kumar et al., 2016, 2018). The nodes in these datasets correspond to the bitcoin users and the edge weights between them correspond to the degree of trust between the users. Following (Qu et al., 2019), we treat edges with weights greater than 3 as positive instances, and edges with weights less than -3 are treated as negative ones. Given a few labeled edges, the task is to predict the labels of the remaining edges. The statistics of these datasets as well as the number of training/validation/test nodes is presented in Appendix A.4.

The statistics of standard benchmark datasets as well as the number of training/validation/test nodes is presented in Table 5. The statistics of larger datasets in given in Table 6.

For larger datasets of Section 4.2, we took processed versions of these dataset available here 333https://github.com/shchur/gnn-benchmark. We did 10 random splits of the the data into train/validation/test split. For the classes which had more than 100 samples. We choose 20 samples per class for training, 30 samples per class for validation and the remaining samples as test data. For the classes which had less than 100 samples, we chose 20% samples, per class for training, 30% samples for validation and the remaining for testing. For each split we run experiments using 100 random seeds. The statistics of these datasets is presented in Appendix Table 6 and

A.5 Ablation Study

Since GraphMix consists of various components, some of which are common with the existing literature of semi-supervised learning, we set out to study the effect of various components by systematically removing or adding a component from GraphMix . We measure the effect of the following:

Removing both the Manifold Mixup and predicted targets from the FCN training.

Having only the Manifold Mixup training for FCN ( No predicted targets for FCN training)

Using the predicted targets for the FCN training (No Manifold Mixup in FCN training)

Using both Manifold Mixup and predicted targets for FCN training

The ablation results for semi-supervised node classification are presented in Table 7. We did not do any hyperparameter tuning for the ablation study and used the best performing hyperparameters found for the results presented in Table 1. We observe that all the components of GraphMix contribute to its performance, with Manifold Mixup training of FCN contributing possibly the most.

A.6 Comparison with State-of-the-art Methods

We present the comparion of GraphMix with the recent state-of-the-art methods as well as earlier methods using the standard Train/Validation/Test split in Table 8. We additionally use self-training based baselines, where FCN, GCN, GAT and Graph-U-Net are trained with self-generated targets. These are named as FCN (self-training), GCN (self-training), GAT (self-training) and Graph-U-Net(self-training) respectively in Table 8. For generating the predicted-targets in above two baselines, we followed the procedure of Appendix A.1.

A.7 Results with fewer labeled samples

We further evaluate the effectiveness of GraphMix in the learning regimes where fewer labeled samples exist. For each class, we randomly sampled K{5,10}K\in\{5,10\} samples for training and the same number of samples for the validation. We used all the remaining labeled samples as the test set. We repeated this process for 1010 times. The results in Table 9 show that GraphMix achieves even better improvements when the labeled samples are fewer.

A.8 Implementation and Hyperparameter Details

We use the standard benchmark architecture as used in GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018) and GMNN (Qu et al., 2019), among others. This architecture has one hidden layer and the graph convolution is applied twice : on the input layer and on the output of the hidden layer. The FCN in GraphMix shares the parameters with the GCN.

GraphMix introduces four additional hyperparameters, namely the α\alpha parameter of Beta\operatorname{Beta} distribution used in Manifold Mixup training of the FCN, the maximum weighing coefficient γ\mathcal{\gamma} which controls the trade-off between the supervised loss and the unsupervised loss (loss computed using the predicted-targets) of FCN, the temparature TT in sharpening and the number of random perturbations KK applied to the input data for the averaging of the predictions.

We conducted minimal hyperparameter seach over only α\alpha and γ\mathcal{\gamma} and fixed the hyperparameters TT and KK to 0.10.1 and 1010 respectively. The other hyperparameters were set to the best values for underlying GNN (GCN or GAT), including the learning rate, the L2L2 decay rate, number of units in the hidden layer etc. We observed that GraphMix is not very sensitive to the values of α\alpha and γ\mathcal{\gamma} and similar values of these hyperparameters work well across all the benchmark datasets. Refer to Appendix A.8 and A.9 for the details about the hyperparameter values and the procedure used for the best hyperparameters selection.

For GCN and GraphMix(GCN), we used Adam optimizer with learning rate 0.010.01 and L2L2-decay 5e-4, the number of units in the hidden layer 1616 , dropout rate in the input layer and hidden layer was set to 0.50.5 and 0.00.0, respectively. For GAT and GraphMix(GAT), we used Adam optimizer with learning rate 0.0050.005 and L2L2-decay 5e-4, the number of units in the hidden layer 88 , and the dropout rate in the input layer and hidden layer was searched from the values {0.2,0.5,0.8}\{0.2,0.5,0.8\}.

For α\alpha and γ\mathcal{\gamma} of GraphMix(GCN) and GraphMix(GAT) , we searched over the values in the set [0.0,0.1,1.0,2.0][0.0,0.1,1.0,2.0] and [0.1,1.0,10.0,20.0][0.1,1.0,10.0,20.0] respectively.

For GraphMix(GCN) : α=1.0\alpha=1.0 works best across all the datasets. γ=1.0\mathcal{\gamma}=1.0 works best for Cora and Citeseer and γ=10.0\mathcal{\gamma}=10.0 works best for Pubmed.

For GraphMix(GAT) : α=1.0\alpha=1.0 works best for Cora and Citeseer and α=0.1\alpha=0.1 works best for Pubmed. γ=1.0\mathcal{\gamma}=1.0 works best for Cora and Citeseer and γ=10.0\mathcal{\gamma}=10.0 works best for Pubmed. Input droputrate=0.5 and hidden dropout rate=0.5 work best for Cora and Citeseer and Input dropout rate=0.2 and hidden dropout rate =0.2 work best for Pubmed.

We conducted all the experiments for 2000 epochs. The value of weighing coefficient w(t)w(t) in Algorithm 1) is increased from to its maximum value γ\gamma from epoch 500 to 1000 using the sigmoid ramp-up of Mean-Teacher (Tarvainen & Valpola, 2017).

A.8.2 Hyperparameter Details for Results in Section 4.2

For all the experiments we use the standard architecture mentioned in Section A.8 and used Adam optimizer with learning rate 0.001 and 64 hidden units in the hidden layer. For Coauthor-CS and Coauthor-Physics, we trained the network for 2000 epochs. For Cora-Full, we trained the network for 5000 epochs because we observed the training loss of Cora-Full dataset takes longer to converge.

For Coauthor-CS and Coauthor-Physics: We set the input layer dropout rate to 0.5 and weight-decay to 0.0005, both for GCN and GraphMix(GCN) . We did not conduct any hyperparameter search over the GraphMix hyperparameters α\alpha, λmax\lambda_{max}, temparature TT and number of random permutations KK applied to the input data for GraphMix(GCN) for these two datasets, and set these values to 1.01.0, 1.01.0, 0.10.1 and 1010 respectively.

For Cora-Full dataset: We found input layer dropout rate 0.2 and weight-decay 0.0 to be the best for both GCN and GraphMix(GCN) . For GraphMix(GCN) we fixed α\alpha, temparature TT and number of random permutations KK to 1.01.0 0.10.1 and 1010 respectively. For λmax\lambda_{max}, we did search over {1.0,10.0,20.0}\{1.0,10.0,20.0\} and found that 10.010.0 works best.

For all the GraphMix(GCN) experiments, the value of weighing coefficient w(t)w(t) in Algorithm 1) is increased from to its maximum value γmax\gamma_{max} from epoch 500 to 1000 using the sigmoid ramp-up of Mean-Teacher (Tarvainen & Valpola, 2017).

A.8.3 For results reported in Section 4.3

For α\alpha of GraphMix(GCN) , we searched over the values in the set [0.0,0.1,0.5,1.0][0.0,0.1,0.5,1.0] and found that 0.10.1 works best for both the datasets. For γ\mathcal{\gamma}, we searched over the values in the set [0.1,1.0,10.0][0.1,1.0,10.0] and found that 0.10.1 works best for both the datasets. We conducted all the experiments for 150 epochs. The value of weighting coefficient w(t)w(t) in Algorithm 1) is increased from to its maximum value γ\mathcal{\gamma} from epoch 75 to 125 using the sigmoid ramp-up of Mean-Teacher (Tarvainen & Valpola, 2017).

Both for GraphMix(GCN) and GCN, we use Adam optimizer with learning rate 0.010.01 and L2L2-decay 0.00.0, the number of units in the hidden layer 128128 , dropout rate in the input layer was set to 0.50.5.

A.8.4 For results reported in Section A.7

For α\alpha of GraphMix(GCN) , we searched over the values in the set [0.0,0.1,0.5,1.0][0.0,0.1,0.5,1.0] and found that 0.10.1 works best across all the datasets. For γ\mathcal{\gamma}, we searched over the values in the set [0.1,1.0,10.0][0.1,1.0,10.0] and found that 0.10.1 and 1.01.0 works best across all the datasets. Rest of the details for GraphMix(GCN) and GCN are same as Section A.8.1.

A.9 Hyperparameter Selection

For each configuration of hyperparameters, we run the experiments with 100100 random seeds. We select the hyperparameter configuration which has the best validation accuracy averaged over these 100100 trials. With this best hyperparameter configuration, for 100100 random seeds, we train the model again and use the validataion set for model selection ( i.e. we report the test accuracy at the epoch which has best validation accuracy.)

A.10 Soft-Rank

Let H be a matrix containing the hidden states of all the samples from a particular class. The Soft-Rank of matrix H is defined by the sum of the singular values of the matrix divided by the largest singular value. A lower Soft-Rank implies fewer dimensions with substantial variability and it provides a continuous analogue to the notion of rank from matrix algebra. This provides evidence that the concentration of class-specific states observed when using GraphMix in Figure 3 can be measured directly from the hidden states and is not an artifact of the T-SNE visualization.

A.11 Feature Visualization

We present the 2D visualization of the hidden states learned using GCN and GraphMix(GCN) for Cora, Pubmed and Citeseer datasets in Figure 3. We observe that for Cora and Citeseer, GraphMix learns substantially better hidden states than GCN. For Pubmed, we observe that although there is no clear separation between classes, "Green" and "Red" classes overlap less using the GraphMix, resulting in better hidden states.

A.12 Joint Optimization vs Alternate optimization

In this Section, we discuss the effect of hyperparameter λ\lambda, that is used to compute the weighted sum of GCN and FCN losses in in Eq 7. In Figure 4, we see that a wide range of λ\lambda ( from 0.1 to 1.0) achieves better validation accuracy than the vanilla GCN. Furthermore, alternate optimization of the FCN loss and the GCN loss achieves substantially better validation accuracy than the simultaneous optimization.