Leveraging Node Attributes for Incomplete Relational Data

He Zhao, Lan Du, Wray Buntine

Introduction

Relational learning from network data, particularly with probabilistic methods, has gained a wide range of applications such as social network analysis (xiang2010modeling), recommender systems (gopalan2014bayesian), knowledge graph completion (hu2016topic), and bioinformatics (huopaniemi2010multivariate). Generally speaking, the goal of relational learning is to discover and analyse latent clusters of entities (i.e., community detection), and predict missing links (i.e., link prediction).

The standard approach for modelling relational data is latent factor analysis via matrix factorisation and its variations. Among the existing approaches, Non-negative Matrix Factorisation (NMF) and the Stochastic Block Model (SBM) are prominent foundational methods. NMF is usually used to model relationships between two sets of entities such as users and movies in collaborative filtering (mnih2008probabilistic). While developed independently, SBM (wang1987stochastic; nowicki2001estimation) can be viewed as an extension of NMF that introduces a block matrix to capture the interactions between latent factors. There have been many Bayesian extensions of these two methods, relaxing the assumptions and/or introducing extra components, such as the Infinite Relational Model (IRM) (kemp2006learning), the mixture membership stochastic block model (MMSB) (airoldi2008mixed), and the non-parametric latent feature models (NLFM) (miller2009nonparametric). Poisson Factorisation (PF) (dunson2005bayesian; zhou2012beta), is a popular version of NMF which models count data with convenient statistical properties (gopalan2014bayesian; Gopalan2015ScalableRW). Combining the ideas of PF and SBM, the infinite Edge Partition Model (EPM) (zhou2015infinite) and its extensions (hu2016topic) have proven successful for relational networks.

When a network has less data, relational learning becomes more difficult. One extreme case is the cold-start problem (Lin:2013; Sedhain:2014; Zhang:2015), where a node has no observed links, making suggestion of links for that node even more challenging. In such cases, it is natural to appeal to side information such as node attributes or features. For instance, papers in citation networks are often associated with categories and authors, and users in Facebook or Twitter are often asked to provide information such as age, gender and interests. It is reasonable to assume that nodes having similar attributes are more likely to relate to each other (i.e., homophily, Nickel:2016jd). Thus, node attributes serve as important complementary information to relational data.

There are few Bayesian probabilistic relational models that are able to leverage side information. For example, NLFM uses a linear regression model to transform the features of each node into a single number, which contributes to link probabilities. However, side information in NLFM cannot directly influence the latent factors, which gives little support for community detection. As an extension of MMSB, the Non-parametric Meta-data Dependent Relational (NMDR) model (kim2012nonparametric) incorporates attributes into the mixed-membership distribution of each node with the logistic-normal transform, which results in non-conjugacy for inference. fan2016learning further developed this idea in the Node information Involved Mixture Membership model (niMM), where side information is integrated in a conjugate way. Although these models demonstrate improvement using side information, they scale quadratically in the number of nodes and the incorporation of side information is often complicated.

Several recent methods (gopalan2014content; acharya2015gamma; hu2016non) extend PF with side information using the additivity of the Poisson and gamma distributions/processes. With improved scalability, the Structural Side Information Poisson Factorisation (SSI-PF) (hu2016non) models directed unweighted networks with node labels, such as citation networks with papers labelled with one of several categories. However, its performance remains untested when a node has multiple attributes. Moreover, undirected networks are not handled by SSI-PF.

In this paper we present the Node Attribute Relational Model (NARM)Code available at https://github.com/ethanhezhao/NARM/, a fully Bayesian approach that models large, sparse, and unweighted relational networks with arbitrary node attributes encoded in binary form. It works with Poisson gamma relational models to incorporate side information. Specifically, we propose the Symmetric NARM (Sym-NARM) for undirected networks, an extension of EPM (zhou2015infinite) and the Asymmetric NARM (Asym-NARM) for directed networks, an extension of PF (zhou2012beta). The proposed models have several key properties: (1) Effectively modelling node attributes: the proposed models are able to achieve improved link prediction performance, especially where training data are limited. (2) Fully Bayesian and conjugate: the inference is done by efficient, closed-form Gibbs sampling which scales linearly in the number of observed links and takes advantage of the sparsity of node attributes. It makes our models scalable for large but sparse relational networks with large sets of node attributes. (3) Flexibility: the proposed models work on directed and undirected relational networks with flat and hierarchical node attributes.

The Node Attribute Relational Model

Here we focus on modelling unweighted networks that can be either directed (i.e., the relationship is asymmetric) or undirected. Assume a relational network with $N$ nodes is stored in a binary adjacency matrix $\mathbf{Y}\in\{0,1\}^{N\times N}$ where $y_{i,j}=1$ indicates the presence of a link between nodes $i$ and $j$ . If the relationship described in the network is symmetric, then $y_{i,j}=y_{j,i}$ , and if asymmetric, possibly $y_{i,j}\neq y_{j,i}$ . Node attributes are encoded in a binary matrix $\mathbf{F}\in\{0,1\}^{N\times L}$ , where $L$ is the total number of attributes. Attribute $f_{i,l}=1$ indicates attribute $l$ is active with node $i$ and vice versa. Although our models incorporate binary attributes, categorical attributes and real-valued attributes can be converted into binary values with proper transformations (kim2012nonparametric; fan2016learning; hu2016non).

Sym-NARM works with undirected networks. Its generative process is shown in Figure 2. Instead of modelling the binary matrix $\mathbf{Y}$ directly, it applies the Bernoulli-Poisson link (BPL) function (zhou2015infinite) using an underlying latent count matrix $\mathbf{X}$ . One first draws a latent count $x_{i,j}$ from the Poisson distribution and then thresholds it at 1 to generate a binary value $y_{i,j}$ . This is shown in Eqs. (2)-(2). Analysed in (zhou2015infinite; hu2016topic; hu2016non), BPL has the appealing property that if $y_{i,j}=0$ , then $x_{i,j}=0$ with probability one. Thus, only non-zeros in $\mathbf{Y}$ need to be sampled, giving huge computational savings for large sparse networks, illustrated in Section 3 and Section LABEL:sec-runtime.

One appealing aspect of our model is the incorporation of node attributes on the prior of $\phi_{i,k}$ (i.e., $g_{i,k}$ ). Shown in Eq. (2), $g_{i,k}$ is constructed with a log linear combination of $f_{i,l}$ . $h_{l,k}$ is referred to as the $k^{\text{th}}$ attribute factor loading of attribute $l$ , which influences $g_{i,k}$ iff attribute $l$ is active with node $i$ (i.e., $f_{i,l}=1$ ). $b_{k}$ acts as an attribute-free bias for each latent factor $k$ . $h_{l,k}$ and $b_{k}$ are gamma distributed with mean $1$ , hence if attribute $l$ does not contribute to latent factor $k$ or is less useful, $h_{l,k}$ is expected to be near $1$ and to have little influence on $g_{i,k}$ . The hyper-parameter $\mu_{0}$ controls the variation of $h_{l,k}$ .

The intuition of our model is: if two nodes have more common attributes, their gamma shape parameters will be more similar, with similar node factor loadings, resulting in a larger probability that they relate to each other. Moreover, instead of incorporating the node attributes directly into the node factor loadings, Sym-NARM uses them as the prior information using Eq. (2), which results in a principled way of balancing the side information and the network data. In addition, different attributes can contribute differently to the latent factors. For example, the gender of an author may be much less important to co-authorship with others than the research fields. This is controlled by the attribute factor loading $h_{l,k}$ in our model.

2 The Asymmetric Node Attribute Relational Model

3 Incorporating Hierarchical Node Attributes

Relational networks can be associated with hierarchical side information (hu2016non). For example, in a patent citation network, patents can be labelled with the International Patent Classification (IPC) code, which is a hierarchy of patent categories and sub-categories. Suppose the second level attributes are stored in a binary matrix $\mathbf{F^{\prime}}\in\{0,1\}^{L\times M}$ where $M$ is the number of attributes in the second level. Our models can be used to incorporate hierarchical node attributes via a straightforward extension: replace hyper-parameter $\mu_{0}$ in Eq. (2) with $\mu_{l,k}=\prod_{m}^{M}\delta^{f^{\prime}_{l,m}}_{m,k}$ . This extension mirrors what is done for first level attributes.

Inference with Gibbs Sampling

Both Sym-NARM and Asym-NARM enjoy local conjugacy so the inference of all latent variables can be done by closed-form Gibbs sampling. Moreover, the inference only needs to be conducted on the non-zero entries in $\mathbf{Y}$ and $\mathbf{F}$ . This section focuses on the sampling of $h_{l,k}$ ( $b_{k}$ ), the key variable in the proposed incorporation of node attributes. The sampling of the other latent variables is similar to those in EPM and BGGPF, detailed in (zhou2015infinite; zhou2012beta). As the sampling for $h_{l,k}$ is analogous in Sym-NARM and Asym-NARM, our introduction will be based on Asym-NARM alone.

With the Poisson gamma conjugacy, the likelihood for $g_{i,k}$ with $\phi_{i,k}$ marginalised out is: {IEEEeqnarray}+rCl+x* p(g_i,k — x_i,⋅,k ) &∝ (1-q_k)^g_i,k Γ(gi,k+ xi,⋅,k)Γ(gi,k) where $x_{i,\cdot,k}=\sum_{j}x_{i,j,k}$ and $x_{i,j,k}$ is the latent count. The gamma ratio in Eq. (3), i.e., the Pochhammer symbol for a rising factorial, can be augmented with an auxiliary variable $t_{i,k}$ : $\frac{\Gamma(g_{i,k}+x_{i,\cdot,k})}{\Gamma(g_{i,k})}=\sum_{t_{i,k}=0}^{x_{i,\cdot,k}}S^{x_{i,\cdot,k}}_{t_{i,k}}g_{i,k}^{t_{i,k}}$ where $S^{x}_{t}$ indicates an unsigned Stirling number of the first kind (chen2011sampling; teh2012hierarchical; zhou2015negative).

Taking $\mathcal{O}(x_{i,\cdot,k})$ , $t_{i,k}$ can be directly sampled by a Chinese Restaurant Process with $g_{i,k}$ as the concentration and $x_{i,\cdot,k}$ as the number of customers: {IEEEeqnarray}+rCl+x* t_i,k &← t_i,k + Bern(gi,kgi,k+i’) for i’ = 1:x_i,⋅,k where $\text{Bern}(\cdot)$ is the Bernoulli distribution. Alternatively, for large $x_{i,\cdot,k}$ , because the standard deviation of $t_{i,k}$ is $\mathcal{O}(\sqrt{\log x_{i,\cdot,k}})$ (BunHut:12), one can sample $t_{i,k}$ in a small window around the current value (Du:2010:STM:1842816.1842821).

With the above augmentation and Eq. (2), we get: {IEEEeqnarray}+rCl+x* $\displaystyle\operatorname{p}\left(\mathbf{G},\mathbf{H}\nonscript\;\middle|\nonscript\;x_{:,\cdot,:},\mathbf{T},\mathbf{F}\right)~{}\propto$ & ∏_i=1^N ∏_k=1^K S^x_i,⋅,k_t_i,k e^-log(11 - qk) g_i,k ⋅∏