Dropout Training of Matrix Factorization and Autoencoder for Link Prediction in Sparse Graphs

Shuangfei Zhai, Zhongfei Zhang

Introduction

Link prediction is one of the fundamental problems of network analysis, as pointed out in , ”a network model is useful to the extent that it can support meaningful inferences from observed network data.” Given a graph $G(V,E)$ together with its adjacency matrix $A\in\{0,1\}^{N\times N}$ , the set of nodes $V$ , and the set of edges $E$ , link prediction can be considered as a matrix completion problem on $A$ . The problem is challenging because $A$ is often large and sparse, which means that only a small fraction of the links are observed. As a result, a good model should have enough capacity to accommodate the complex connectivity pattern between all $N^{2}$ pairs of nodes, as well as strong generalization ability to make accurate predictions on unobserved pairs.

Among the large number of models proposed over the decade, Matrix Factorization (MF) is one of the most popular ones in network modeling and relational learning in general . In its simplest form, MF directly models the interaction between a pair of nodes as the inner product of two latent factors, one for each node. This assumes that the large square adjacency matrix $A$ can be factorized into the product of a tall, thin matrix and a short, wide matrix as $A\approx W_{2}W_{1}$ , where $W_{2}\in R^{N\times K},W_{1}\in R^{K\times N}$ . Each row of $W_{1}$ and each column of $W_{1}$ are often called a latent factor, as they often capture the community membership information of the nodes. Training of such a model is usually conducted with stochastic gradient descent, which makes it easily scalable to large datasets .

Bayesian models, such as MMSB are another family of latent factor models for studying network structures. Compared with MF which directly learns the latent factors as free parameters by solving an optimization problem, Bayesian models treat the latent factors as random variables and model the stochastic process of the creation of a link. As a result, they can be considered as a stochastic version of MF. By putting various priors on the latent factors, link prediction is reduced to the inference problem of the posterior distribution of the link status. While Bayesian models can often significantly reduce overfitting compared with their deterministic counterparts, the inference processes such as MCMC and Variational Inference are usually much slower to run than direct optimization. As a result, their application has been limited to only moderate sized datasets.

Autoencoder (AE) together with its variants such as Restricted Boltzman Machine (RBM) has recently achieved great success in various machine learning applications, and is well recognized as the building block of Deep Learning . AE learns useful representations by learning a mapping from an input to itself, which makes it different from the above mentioned approaches. Surprisingly, it is not until recently that AE finds its application to modeling graphs . In this paper we investigate the effectiveness of AE on the link prediction problem. In particular, we show that AE is closely connected to MF when unified in the same architecture. Motivated by this observation, we argue that MF and AE are indeed two complementary views of the same problem, and propose a novel model MF+AE where MF and AE are jointly trained with shared parameters.

Model

Matrix Factorization: Given an undirected graph $G(V,E)$ with $N$ nodes and $|E|$ edges, MF approximates its adjacency matrix $A\in\{0,1\}^{N\times N}$ with the product of two low rank matrices $W_{2}$ and $W_{1}$ by solving the optimization problem:

where $W_{1}\in R^{K\times N},W_{2}\in R^{N\times K}$ ; $b_{1}\in R^{N},b_{2}\in R^{N}$ are the biases, $W_{1,i}$ is the $i^{th}$ column of $W_{1}$ , $W_{2}^{j}$ is the $j^{th}$ row of $W_{2}$ , $g$ is the link function, $L$ is the loss function defined on each pair of nodes, and $K$ is the dimension of the latent factor. Since $A$ is symmetric for undirected graph, it is sometimes useful to adopt tied weights, i.e., $W_{1}=W_{2},b_{1}=b_{2}$ . We refer to the model with tied weights as the symmetric version of MF in this paper.

with $W_{1}\in R^{K\times N},b_{1}\in R^{K},W_{2}\in R^{N\times K},b_{2}\in R^{N}$ . The parameters are learned by solving the following optimization problem:

Here we have slightly overloaded the loss function $L$ by defining it on the column vector $x_{i}$ . The natural way of applying AE to modeling graphs is to represent each node as the set of its neighbors; in other words, set $x_{i}=A_{i}$ . This is analogous to the bag of words representation prevalent in the document modeling community, and we call it the bag of nodes representation. Note that this representation is sufficient since when only the topological structure is available, we can learn an unseen node if we know all its neighbors.

The Joint Model: To better see the connection and difference between MF and AE, we now rewrite (2.3) by substituting $x_{i}$ with $A_{i}$ :

And we rewrite (2.1) by omitting the $b_{1}$ term:

where $\delta_{i}\in R^{N}$ is the indicator vector of node $i$ , which is a binary vector with $1$ at the $i^{th}$ entry and for all the rest. We deliberately organize (2.4) and (2.5) in such a way that in the high level, they share the same architecture. Both models first learn a hidden representation $h_{i}$ , which is then fed through a classifier with link function $g$ and loss function $L$ . The main difference is only in the form of the hidden representation. For each node $i$ , MF only looks at its id, and the hidden layer representation is learned by simply extracting the $i^{th}$ column of $W_{1}$ . For AE, we first sum up the columns of $W_{1}$ indicated by $i$ ’s neighbors, and then pass the sum through an activation function $f$ . As a result, in MF, two nodes propagate ”positive” information to each other only if they are directly connected; in AE, however, two nodes can do so as long as they appear in the same neighborhood of some other node, even if they are not directly connected. The different ways of the information propagation between that two models indicates that MF and AE are complementary to each other to model different aspects of the same topological structure.

We can also interpret the distinction between MF and AE as two different views of the same entity: MF uses $\delta_{i}$ , and AE uses $A_{i}$ . We note that the two views are disjoint and sufficient: they do not overlap with each other, but each of them can sufficiently represent a node. This perspective motivates us to build a unified architecture where we train the two models jointly, and require both of them to be able to uncover the graph structure. The idea is similar to co-training in the semisupervised learning setting, where one trains two classifiers on two sufficient views such that the two views can ”teach” each other on unlabeled data. While in our problem, there is no ”unlabeled data” available, we argue that the model can still benefit from the co-training idea by requiring the two views to ”agree with” each other. We then formulate our model, which we call MF+AE, as follows:

In (2.6), the objective function is composed of two parts, one for AE and the other for MF. $h_{1,i}$ and $h_{2,i}$ are the hidden representation learned by the AE and MF part, respectively, and the ”agreement” is achieved by using the same set of weights $W_{1}$ and $W_{2}$ for both AE and MF. We modify the architecture of the MF objective by adding the same activation function $f$ and a corresponding bias term $b_{3}$ . $\rho$ is a positive real number which could be simply set to $1$ in practice. We show the architecture of MF+AE in Figure 1.

where $\odot$ denotes the element-wise product of two vectors.

Activation Function and Loss Function: For the activation function $f$ , we choose the Rectified Linear Units ( $ReLU$ ), which is defined as:

$ReLU$ provides nonlinearity by zeroing out all the negative inputs and keeping positive ones intact. We choose $ReLU$ over other popular choices such as $Sigmoid$ or $\tanh$ for two reasons. First, it does not squash the output to a fixed interval as $Sigmoid$ or $tanh$ does. As a result, it is closest to our intuition of approximating the behavior of MF. In fact, from the point of view of MF, the effect of $ReLU$ can be considered as putting a non-negativity constraint on $h$ , which is closely related to the idea of Non-negative Matrix Factorization. Secondly, $ReLU$ is fast to compute compared with its alternatives, and still provides enough nonlinearity to significantly improve the model capacity over linear structures.

For $g$ and $L$ , we choose the $Sigmoid$ function combined with cross entropy loss:

The saturating property of the $Sigmoid$ function endows the model much flexibility since $h_{1,i}$ and $h_{2,i}$ need only to have similar activation patterns to both achieve good reconstructions; cross entropy is naturally a more appropriate choice than square loss for binary matrix. Moreover, as A is often extremely sparse, reconstructing the whole matrix incurs the class imbalance problem, which means that the loss of the positive entries is dominated by the negative entries. As a result, it is important to reweight the cost of the positive and negative classes by utilizing the cost sensitive strategy. Consequently, our final form of the loss function becomes:

Dropout As Regularization: Regularization is critical for most machine learning models to generalize well to unseen data. Instead of putting explicit constraints on the parameters or hidden units, we use dropout training as an implicit regularization. Dropout is a technique originally proposed for training feed forward neural networks to avoid overfitting. It works as follows: in the stochastic gradient training, we randomly drop out half of the hidden units (for both AE and MF) and half of the feature units (for AE) for each node in the current iteration. Mathematically, the effect of dropping out can be simulated as applying an element-wise random dropout mask as follows:

Here $\xi_{h,i}\in\{0,1\}^{K}$ and $\xi_{in,i}\in\{0,1\}^{N}$ are the dropout masks for the hidden and input units, respectively; each element of them is an iid draw from the Bernoulli distribution. And $\odot$ is the element-wise product of two vectors. Note that we use the same dropout mask $\xi_{h,i}$ for both the AE and MF modules. This is to ensure that dropout does not cause any difference in the architecture between the two views.

For the AE module, randomly dropping out the input can be considered as a ”denoising” technique, which was exploited by the previous work , and also was applied to link prediction . The motivation is that a model should be able to make a good reconstruction under a noisy or partial input. This property is particularly interesting to our problem because this is exactly the same link prediction problem: prediction of the whole based on parts.

While theoretically explaining the effect of dropout is difficult for complex models, we can still gain an insight by looking at an approximate surrogate. Previously, used the second order Taylor expansion to explain the effect of feature noising in generalized linear models and AE, respectively. We can borrow the same tool to showcase a simplified version of MF+AE.

Dropout for Matrix Factorization: We consider the effect of dropout training on (2.1). For a concise articulation we ignore the bias terms; the resulting model is described as the following objective function:

When $Sigmoid$ activation function with the cross entropy loss is used, we compute the second order approximation of (2.13) in a closed form as:

Next we proceed to study the effect of dropping out the input. Let us now denote $W=W_{2}W_{1}$ , and we have the dropout version of objective function:

The second order approximation immediately follows as:

To summarize, we show that in the simplified case, dropping out the hidden units and inputs can be both interpreted as an adaptive regularization. They both push the model to make confident predictions by minimizing the factor $g_{i,j}(1-g_{i,j})$ , while they penalize different aspects of the parameter matrices. When combined in the joint training architecture, they provide complementary regularization to prevent MF+AE from overfitting.

Experiments

We conduct the experiments on six real world datasets: DBLP, Facebook, Youtube, Twitter, GooglePlus, and LiveJournal, all of which are available to download at snap.stanford.edu/data. We summarize the statistics of the six datasets in Table 1. Except for DBLP which is an author collaboration network, all the rest are online social networks. In particular, Youtube, Twitter, Gplus, and LiveJournal are all originally directed network; we convert them to undirected graphs by ignoring the direction of the links.

Following the experimental protocol suggested in , we first split the observed links into a training graph $G_{train}$ and a testing graph $G_{test}$ . We then train the models on $G_{train}$ , and evaluate the performance of the models on $G_{test}$ . In particular, note that a naive algorithm which simply predicts a link for all the pairs that have at least one common neighbor would make a pretty good accuracy. We then only consider the nodes that are 2-hops from the target node as the candidate nodes . The metrics used are Precision at top 10 position( $Prec@10$ ) and AUC.

Adamic-Adar Score (AA) . This is a popular score based method, it calculates the pair-wise score of node $i,j$ as $S(i,j)=\sum\limits_{n\in CN(i,j)}\log(\frac{1}{d_{n}})$ , where $CN(i,j)$ denotes the set of common neighbors of node $i$ and $j$ , $d_{n}$ is the degree of node $n$ . Prediction is made by ranking the score of all nodes to the target node.

Random Walk with restart (RW). RW uses the stationary distribution of a random walk from a given node to all the nodes as the score. The restart probability needs to be set in advance. In practice we find that while the performance of RW is pretty robust within a wide range of values, a relatively large value (compared with $0.3$ used in ) works slightly better on our problem . We set it as $0.5$ throughout the experiments.

Matrix Factorization with dropout (MFd). This corresponds to the model described in (2.13) with the additional bias vectors. No weight decay is used.

Autoencoder with Dropout (AEd). This is the single AE with $ReLU$ activation function and the cross entropy loss trained by dropout.

Marginalized Denoising Model (MDM). This is one of the few existing AE based models where a linear activation together with square loss is used. MDM marginalizes the dropout noise of the features during training, but no hidden layer dropout is adopted. The model requires the input noise level to be set; we set it as $0.9$ throughout the experiments.

The Joint Model (MF+AE). This corresponds to the model described in (2.12), where we jointly train MF and AE with shared weights using dropout.

All the above methods are implemented in $Matlab$ . We use the authors’ implementation for MDM, and use our own implementations for the rest models. Note that MF2, AE2, MDM, MFd, AEd, and MF+AE all require the dimensionality of the latent factor or hidden layer $K$ as the input. To be fair for a comparison, we do not attempt to optimize this parameter for each model. Instead, we set it the same for all of the models on all the six datasets. In the experiments, we use $100$ .

Another important aspect for both MF and AE models is the choice of a symmetry model vs. an asymmetry model, i.e., whether or not to set $W_{1}=W_{2}^{T}$ . We find that AE models are less sensitive to the characteristics of a dataset, and almost always benefit from the tied weights. The optimal choice for MF models is, however, extremely problem-dependent. In our experiments, we use tied weights for all AE based models including MF+AE on all the six datasets. For MF based models, we use tied weights on Facebook, Twitter, DBLP, LiverJournal, and untied weights on Youtube and Gplus.

In this paper, we are interested in evaluating the performance of different models on sparse graphs. In other words, we investigate how well a model generalizes given only a sparse training graph $G_{train}$ . To this end, each of the six datasets is chosen as a relatively densely connected subgraph as in Table 1. We then randomly take $10\%$ of links for training and use the rest for testing. In this way, we train the model on a sparse graph, while still have enough held out links for testing. We train all the models (except AA and RW which require no training) to convergence, which means that we do not use a separate validation set to perform early stopping. We do this for two reasons. First, in sparse graphs, splitting out a separate set for validation is expensive. Secondly and more importantly, we are interested in testing the generalization ability of each model. In practice we find that almost all the models we have tested benefit from a properly chosen early stopping point. However, this makes the results very difficult to interpret as it is difficult to evaluate the contribution of early stopping in different models.

MF+AE Achieves Best Performance We first list the results of the experiments in Table 2 for a comparison. Overall, we see that MF+AE has the best average performance. In particular, it achieves the best performance on all the six datasets evaluated by $Prec@10$ , and on all but the LiveJournal dataset evaluated by AUC. This shows that the joint training of MF and AE consistently boosts the generalization ability without increasing the model size. AEd achieves the second best average performance evaluated by both metrics. MDM, as a variant of AE, achieves the third best performance on $Prec@10$ and fourth on AUC. This is reasonable since on the one hand, the utilization of feature noising improves the generalization ability, and on the other hand, the linear nature limits its ability to model complex graph structures, and also due to the use of the square loss and ignorance of the class imbalance, the performance further deteriorates in sparse graphs. One seemingly surprising result is that RW performs pretty well despite of its simplicity.

We then compare the predictions of MF+AE, AEd, MFd, MDM, respectively, which all use (different variants of) dropout training. It is not difficult to see that MF+AE’s prediction resembles the full graph the most. AEd and MFd make a lot of ”False Positive” predictions which are clearly shown by the pepper salt like pixels off the diagonal. MDM makes more ”False Negative” predictions, such as the missing of the small cluster near the right top corner. We note that the quality of the predictions shown by the visualization is also consistent with the results in Table 2.

Among all the six datasets we have tested, Youtube and Gplus datasets demonstrate least cohesive structures, i.e., they consist of more follower-followee relationships than the other datasets. The cohesive vs. non-cohesive distinction of graph structure has previously been investigated in . To show this, we have trained the symmetric and asymmetric versions of MFd and MF2, respectively, on all the six datasets. We then report the averaged performances of the symmetric MF and asymmetric MF in Figure 3. We see that the symmetric version works better on Facebook, Twitter, DBLP, LiveJournal, and the asymmetric version works better on Youtube and Gplus. This experiment shows that Youtube and Gplus demonstrate more non-cohesive structure which cannot be symmetrically modeled by the inner product of two vectors.

With this in mind, let us look at the performances of different models on these two datasets. It is clear that MF+AE and AEd as the best and second best models outperform the other methods by much larger margins than on the other four datasets. Note that AEd and MF+AE still use the tied weights on these two datasets, as we found little difference in performance when switched to untied weights. Also note that even though MDM uses feature dropout, it still fails to model the non-cohesive structures properly. We argue that it is the nonlinear activation function that gives the MF+AE and AEd more modeling power than linear models.

Related Work

The link prediction problem can be considered as a special case of relational learning and recommender systems , and a lot of techniques proposed are directly applicable to link prediction as well. Salakhutdinov et al. first apply RBM to movie recommendation. Recently, Chen and Zhang propose a variant of linear AE with marginalized feature noise for link prediction, and Li et al. apply RBM to link prediction in dynamic graphs.

MF+AE is also related to the supervised learning based methods . While these approaches directly train a classifier on manually collected features, MF+AE directly learns the appropriate features from the adjacency matrix.

The utilization of dropout training as an implicit regularization also contrasts with Bayesian models . While both dropout and Bayesian Inference are designed to reduce overfitting, their approaches are essentially orthogonal to each other. It would be an interesting future work to investigate whether they can be combined to further increase the generalization ability. Dropout has also been applied to training generalized linear models , log linear models with structured output , and distance metric learning .

This work is also related to graph representation learning. Recently, Perozzi et al. propose to learn node embeddings by predicting the path of a random walk, and they show that the learned representation can boost the performance of the classification task on graph data. It would also be interesting to evaluate the effectiveness of MF+AE in the same setting.

Conclusion

We propose a novel model MF+AE which jointly trains MF and AE with shared parameters. We show that dropout can significantly improve the generalization ability of both MF and AE by acting as an adaptive regularization on the weight matrices. We conduct experiments on six real world sparse graphs, and show that MF+AE outperforms all the competing methods, especially on datasets with strong non-cohesive structures.

Acknowledgements