Infinite Edge Partition Models for Overlapping Community Detection and Link Prediction

Mingyuan Zhou

INTRODUCTION

Community detection and link prediction are two important problems in network analysis. A vast number of community detection algorithms based on various useful heuristics, such as modularity maximization (Newman and Girvan, 2004) and clique percolation (Palla et al., 2005), have been proposed. See Fortunato (2010) for a comprehensive review. These algorithms, however, are not based on generative models and hence usually cannot be used to generate networks and predict missing edges (links). Moreover, how to set the number of communities is a critical issue that has not been well addressed by them. In this paper, we will fit unweighted undirected relational networks using nonparametric Bayesian generative models, which can be used to simulate random networks, detect latent overlapping communities and community-community interactions, and predict missing edges, with the number of communities automatically inferred from the data.

For a relational network, a community can be considered as a subset of nodes (vertices) that are densely connected to each other but sparsely to the others, such as those in a social network, or it can be considered as a subset of nodes that are sparsely connected to each other but densely connected to the nodes belonging to another community, such as those in a network consisting of carnivores and herbivores: tigers and bears both hunt deers but rarely prey on each other. The former phenomenon is usually described as assortativity or homophily, while the latter is known as dissortativity or stochastic equivalence (Hoff, 2008). As a relational network may exhibit both homophily and stochastic equivalence, an algorithm capable of modeling both phenomena would usually be preferred if no prior information on assortativity is available. If analyzing assortative networks with dense intra-community connections is the main goal, then one may consider an assortative algorithm that models homophily but not necessarily stochastic equivalence.

The stochastic blockmodel (SBM) is a popular latent class model to detect latent communities (Holland et al., 1983; Nowicki and Snijders, 2001). It partitions the nodes into disjoint communities, and models the probability for an edge to exist between two nodes solely based on which two communities that they belong to. It is simple and scalable, and models both homophily and stochastic equivalence. In addition, the infinite relational model, a nonparametric Bayesian extension of the SBM based on the Chinese restaurant process (Aldous, 1985), allows the number of communities to be automatically inferred from the data (Kemp et al., 2006). Despite these attractive properties, the SBM is restrictive in that communities are not allowed to overlap. In practice, however, it is common for a node to belong to multiple communities, motivating the development of more advanced latent class models, such as the mixed-membership stochastic blockmodel (MMSB) of Airoldi et al. (2008) and its various extensions (Gopalan et al., 2012; Kim et al., 2013). The MMSB generalizes the SBM to allow a node to participate in multiple communities, yet since it has to infer two community indicators for each pair of nodes, regardless of whether an edge exists in that pair, its computation grows quadratically as a function of the number of nodes $N$ . Moreover, the number of communities in the MMSB is a model parameter that needs to be carefully selected.

In this paper, instead of clustering nodes, as in the SBM, or clustering all possible edges, as in the MMSB, we propose the edge partition model (EPM) to partition only the observed edges, which readily leads to the partition of nodes: if the edges linked to a node are partitioned into multiple communities, then the node is naturally affiliated with all these communities, and could be hard assigned to a single community that has the strongest presence in its edges. In contrast to the SBM, the EPM allows communities to overlap; and in contrast to the MMSB that spends $O(N^{2})$ computation clustering all possible edges, the EPM spends $O(\bar{d}N)$ computation partitioning only observed edges, where $\bar{d}$ is the average degree (number of edges) per node, leading to notable computational savings as $\bar{d}$ is often much smaller than $N$ in a big sparse network commonly observed in practice.

To support a potentially infinite number of communities and to model both homophily and stochastic equivalence in an unweighted undirected relational network, we propose a hierarchical gamma process (HGP) EPM, which links each observed edge to a latent count using a Bernoulli-Poisson link, and then factorizes the latent $N\times N$ random count matrix. The HGP supports the EPM to have an infinite dimensional feature vector for each node to describe its affiliations with communities, and an infinite dimensional square rate matrix, whose diagonal and off-diagonal elements describe the intra- and inter- community interactions, respectively. We also propose a gamma process EPM as a simplified version of the HGP-EPM, which omits inter-community interactions to gain simpler inference and faster computation at the expense of reduced ability to model stochastic equivalence.

Conceptually, our idea of directly partitioning edges and implicitly partitioning nodes into communities is related to the one in Ahn et al. (2010) and Evans and Lambiotte (2009). In terms of construction, our EPMs are related to the Poisson factor models of Ball et al. (2011) and the Eigenmodel of Hoff (2008). In terms of supporting an infinite number of features, our EPMs are related to the models in Miller et al. (2009) and Morup et al. (2011) that use the Indian buffet process of Griffiths and Ghahramani (2005) to support an infinite binary feature matrix. The proposed models depart from existing ones with several distinctions: 1) a Bernoulli-Poisson link connects each edge to a latent count that is further partitioned; 2) a hierarchical gamma process is constructed to support an infinite number of communities and an infinite-dimensional square matrix to describe community-community interactions; 3) two nonparametric Bayesian EPMs are constructed to factorize the $N\times N$ binary adjacency matrix under the Bernoulli-Poisson link, supporting a nonnegative feature matrix with an unbounded number of columns, and at the same time assign each edge and hence each node to one or multiple latent communities; and 4) efficient and scalable Bayesian inference via Gibbs sampling is provided.

FACTOR ANALYSIS AND BERNOULLI-POISSON LINK

Our basic idea is to factorize the BINARY network adjacency matrix using tools developed for COUNT data analysis, and to discover overlapping communities and their interactions by examining how the latent count for each edge is partitioned. This section will primarily discuss individual model components and their properties, with hierarchical Bayesian models presented later.

We propose a Poisson factor model for a weighted undirected $N\times N$ relational network as

where $m_{ij}\equiv m_{ji}$ is the integer-valued weight (observed or latent) that links nodes $i$ and $j$ , $(\phi_{i1},\ldots,\phi_{iK})$ is the positive feature vector for node $i$ , $\lambda_{k_{1}k_{2}}\equiv\lambda_{k_{2}k_{1}}$ is a positive rate, and the symbol $\equiv$ denotes “equal by definition.” This model is conceptually simple: with $\phi_{ik_{1}}$ measuring how strongly node $i$ is affiliated with community $k_{1}$ and $\lambda_{k_{1}k_{2}}$ measuring how strongly communities $k_{1}$ and $k_{2}$ interact with each other, the product $\phi_{ik_{1}}\lambda_{k_{1}k_{2}}\phi_{jk_{2}}$ measures how strongly nodes $i$ and $j$ are connected due to their affiliations with communities $k_{1}$ and $k_{2}$ , respectively, and a weighted combination of all intra-community weights $\{\lambda_{kk}\}_{1\leq k\leq K}$ and inter-community ones $\{\lambda_{k_{1}k_{2}}\}_{1\leq k_{1}\neq k_{2}\leq K}$ is the expected value of $m_{ij}$ .

The factor model in (1) makes intuitive sense. For example, suppose persons $i$ and $j$ are both residents of City Avatar and active members of the Avatar anglers Meetup group that organizes fishing trips regularly. In addition, persons $i$ is an active member of the Avatar artificial intelligence (AI) Meetup group while person $j$ is an active member of the Avatar statistics Meetup group. Denoting $m_{ij}$ as the number of times that $i$ and $j$ attend the same group meeting in 2015, then due to their strong affiliations with the anglers Meetup group, $m_{ij}$ would have a large expected value, which is likely to be further increased if the AI and statistics Meetup groups hold joint events regularly.

To model an assortative relational network exhibiting homophily but not necessarily stochastic equivalence, we may omit the inter-community interactions by letting $\lambda_{k_{1}k_{2}}\equiv 0$ for $k_{1}\neq k_{2}$ and simplify (1) as

where $r_{k}\equiv\lambda_{kk}$ indicates the prevalence of community $k$ , and two nodes with similar latent features are encouraged to be linked by an edge with a large weight.

We note Ball et al. (2011) had examined a model related to (2) and briefly mentioned a model related to (1). However, they used a heuristic approach to model binary data under the Poisson distribution, did not provide a principled way to set the number of communities $K$ , and had to create possibly nonexistent self-edges in order to derive tractable expectation-maximization (EM) inference. This paper will address all these issues rigorously, in a nonparametric Bayesian manner, and carefully examine the models in both (2) and (1) and provide efficient Bayesian inference.

2 Bernoulli-Poisson Link

To use the Poisson factor models in (1) and (2) for an unweighted network with a binary adjacency matrix, we introduce a Bernoulli-Poisson (BerPo) link function that thresholds a random count at one to obtain a random variable in $\{0,1\}$ as

where $b=1$ if $m\geq 1$ and $b=0$ if $m=0$ . The intuition is that two nodes are connected if they interact at least once. The mathematical motivation is after transforming a binary-modeling problem into a count-modeling one, one is readily equipped with a rich set of statistical tools developed for count data analysis using the Poisson and negative binomial distributions.

If $m$ is marginalized out from (3), then given $\lambda$ , one obtains a Bernoulli random variable as

The conditional posterior of $m$ can be expressed as

where $x\sim\mbox{Po}_{+}(\lambda)$ follows a truncated Poisson distribution, with $P(x=k)=(1-e^{-\lambda})^{-1}{\lambda^{k}e^{-\lambda}}/{k!}$ for $k\in\{1,2,\ldots\}$ . Thus if $b=0$ , then $m=0$ almost surely (a.s.), and if $b=1$ , then $m\sim\mbox{Po}_{+}(\lambda)$ , which can be simulated with rejection sampling: if $\lambda\geq 1$ , we draw $m\sim\mbox{Po}(\lambda)$ till $m\geq 1$ ; and if $\lambda<1$ , we draw both $n\sim\mbox{Po}(\lambda)$ and $u\sim\mbox{Unif}(0,1)$ till $u<1/(n+1)$ , and then let $m=n+1$ . The acceptance rate is $1-e^{-\lambda}$ if $\lambda\geq 1$ and ${\lambda^{-1}}(1-e^{-\lambda})$ if $\lambda<1$ , and reaches its minimum, 63.2%, when $\lambda=1$ .

The BerPo link shares some similarities with the probit link that thresholds a normal random variable at zero, and the logit link that lets $b\sim\mbox{Ber}[e^{x}/(1+e^{x})]$ . We advocate the BerPo link as an alternative to the probit and logit links since if $b=0$ , then $m=0$ a.s., which could lead to significant computational savings if a considerable proportion of the data are equal to zero. In addition, the additive property of the Poisson allows us to model the link strength between any two nodes by aggregating the contributions of all possible intra- and inter- community interactions, and the conjugacy between the Poisson and gamma distributions makes it convenient to construct hierarchical Bayesian models amenable to posterior simulation.

3 Overlapping Community Structures

where $m_{ik_{1}k_{2}j}$ represents how often nodes $i$ and $j$ interact due to their affiliations with communities $k_{1}$ and $k_{2}$ , respectively. We may consider that the model is partitioning the count $m_{ij}$ into $\{m_{ik_{1}k_{2}j}\}_{1\leq k_{1},k_{2}\leq K}$ , and hence we call the Poisson factor model in (1) together with the BerPo link in (3) as an edge partition model (EPM), in which each edge is partitioned according to all possible $K^{2}$ community-community interactions, and how strongly node $i$ is affiliated with community $k$ can be measured with $\phi_{ik}\omega_{ik}$ , where

represents how strongly node $i$ would interact with all the other nodes through its affiliation with community $k$ . We further introduce the latent count

to represent how often node $i$ is connected to the other nodes due to its affiliation with community $k$ . We can then assign node $i$ to multiple communities in $\{k:m_{ik\boldsymbol{\cdot}\boldsymbol{\cdot}}\geq 1\}$ , or (hard) assign it to a single community using either $\underset{k}{\operatorname{argmax}}(\phi_{ik}\omega_{ik})$ or $\underset{k}{\operatorname{argmax}}(m_{ik\boldsymbol{\cdot}\boldsymbol{\cdot}})$ . Similar analysis applies to a simpler EPM built on (2). By hard assigning each node to a single community and ordering the nodes from the same community to be adjacent to each other, we expect the ordered adjacency matrix to exhibit a block structure, where the blocks along and off the diagonal represent the intra- and inter- community connections, respectively.

4 Scalability for Big Sparse Networks

We are motivated to construct the EPMs because they not only allow each edge and hence each node to participate in multiple communities, but also readily scale to a big sparse network whose average degree per node is much smaller than $N$ . A key observation for scalable computation is that (1) can be augmented as (4), where $m_{ik_{1}k_{2}j}=0$ a.s. for any $k_{1}$ and $k_{2}$ if no edge exists between nodes $i$ and $j$ (i.e., $m_{ij}=0$ ). On a sparse network, where the edges constitute only a small portion of all possible $N^{2}$ edges, this property makes the EPMs computationally appealing. By contrast, conceptually related models, including the MMSB of Airoldi et al. (2008), Eigenmodel of Hoff (2008) and latent feature relational model of Miller et al. (2009), spend computation indiscriminately on all pairs of nodes $(i,j)$ no matter whether an edge exists between nodes $i$ and $j$ , and hence they have $O(N^{2})$ computation and do not scale well as $N$ increases.

EDGE PARTITION MODELS

The EPM takes a weighted combination of all possible intra- and inter- community weights to explain each pair of node, however, the number of communities $K$ is still a model parameter that needs to be set appropriately. To allow $K$ to be inferred from the data and potentially grow to infinity, we need to introduce a stochastic process that can generate a countably infinite number of atoms $\{\boldsymbol{\phi}_{k}\}_{1,\infty}$ , where $\boldsymbol{\phi}_{k}=(\phi_{1k},\ldots,\phi_{Nk})^{T}$ measures how strongly the $N$ nodes are affiliated with community $k$ , and an infinite dimensional square matrix $\{\lambda_{k_{1}k_{2}}\}_{1\leq k_{1},k_{2}\leq\infty}$ , where $\lambda_{k_{2}k_{1}}=\lambda_{k_{1}k_{2}}$ measures how strongly communities $k_{1}$ and $k_{2}$ interact with each other. Moreover, we need to ensure $\sum_{k_{1}=1}^{\infty}\sum_{k_{2}=1}^{\infty}\lambda_{k_{1}k_{2}}$ to be finite a.s. and we may wish to impose some structural regularization on the infinite square matrix.

To satisfy all these needs, we first define

Given $G=\sum_{k=1}^{\infty}r_{k}\delta_{\boldsymbol{\phi}_{k}}$ , we further define a relational gamma process ( $\mbox{r}\Gamma\mbox{P}$ ) as

Given a relational gamma process draw $\Lambda$ , we generate a binary adjacency matrix ${{\bf B}}\in\{0,1\}^{N\times N}$ as

Equations (9), (8) and (7) constitute an HGP-EPM that supports countably infinite atoms and a countably infinite square matrix, the total sum of whose elements has a finite expectation, as shown in the following Lemma, with proof provided in the Appendix.

The expectation of $\sum_{k_{1}=1}^{\infty}\sum_{k_{2}=1}^{\infty}\lambda_{k_{1}k_{2}}$ is finite and can be expressed as

The usual scenario to consider an HGP construction is when one models grouped data and wishes to share statistical strengths across groups. For example, the gamma-negative binomial process of Zhou and Carin (2012), related to the hierarchical Dirichlet process of Teh et al. (2006), is considered for topic modeling, where each document is associated with a gamma process, and these gamma processes are coupled by sharing a lower-level ( $i.e.$ , further from the data) gamma process as their atomic base measure. The proposed HGP is distinct in that the product of the weights of any two atoms of the lower-level gamma process is used to parameterize the shape parameter of a gamma random variable higher in the hierarchy.

The proposed HGP also helps express our prior belief that an atom with a small weight tends to represent a small community, which also tends to interact with the others less frequently. Note that if we let $\lambda_{kk}\sim\mbox{Gam}(r_{k}^{2},1/\beta)$ , then the expectation of the matrix $\{\lambda_{k_{1}k_{2}}\}_{1\leq k_{1},k_{2}\leq\infty}$ given $\{r_{k}\}_{1,\infty}$ has a rank of one. We use $\xi r_{k}$ instead of $r_{k}^{2}$ as the shape parameter of $\lambda_{kk}$ to allow $r_{k}$ to be inferred with Gibbs sampling and to prevent overly shrinking $\lambda_{kk}$ for small communities. Note that Palla et al. (2014) proposed a reversible infinite hidden Markov model using a related HGP infinite square rate matrix, the normalization of whose each row represents a state transition probability vector. Our HGP serves a distinct modeling purpose; no normalization is required for the infinite square rate matrix, and our model allows exploiting unique data augmentation techniques to infer both $\lambda_{k_{1}k_{2}}$ and $r_{k}$ with closed-form Gibbs sampling update equations, as discussed in Section 3.4 and the Appendix.

2 Hierarchical Gamma Process EPM

We choose the base distribution of the gamma process $G\sim\Gamma\mbox{P}(G_{0},1/c_{0})$ as $g_{0}(\boldsymbol{\phi})=\prod_{i=1}^{N}\mbox{Gam}(a_{i},1/c_{i})$ . For implementation convenience, we consider a discrete base measure as $G_{0}=\sum_{k=1}^{K}\frac{\gamma_{0}}{K}\delta_{\boldsymbol{\phi}_{k}}$ , where $K$ is a truncation level that is set large enough to ensure a good approximation to the truly infinite model. We express the (truncated) HGP-EPM as

where $\lambda_{k_{2}k_{1}}\equiv\lambda_{k_{1}k_{2}}$ and conjugate gamma priors are imposed on $\gamma_{0}$ , $\xi$ , $c_{0}$ , $c_{i}$ and $\beta$ . Note that marginalizing out both $m_{ij}$ and $m_{ik_{1}k_{2}j}$ from (10) leads to

A noticeable advantage of the augmented representation in (10) over (11) is that (10) is amenable to posterior simulation, as discussed in Section 3.4.

Note that similar to Hoff (2008) and Lloyd et al. (2012), we assume that the nodes are exchangeable and hence the discussions of Hoover (1982) and Aldous (1985) on exchangeability also apply to our EPMs.

3 Gamma Process EPM

If we omit inter-community interactions by letting $\lambda_{k_{1}k_{2}}\equiv 0$ for $k_{1}\neq k_{2}$ and $\lambda_{kk}\equiv r_{k}$ , then the HGP-EPM reduces to a gamma process EPM (GP-EPM), which is likely to well fit assortative networks but not necessarily disassortative ones. We notice an interesting connection to the community-affiliation graph model (AGM) of Yang and Leskovec (2012, 2014): the GP-EPM generates an edge with probability

if we define $p_{k}=1-e^{-r_{k}}$ and further impose the restriction that $\phi_{ik}\in\{0,1\}$ , then (12) reduces to

Yang and Leskovec (2012, 2014) argue that all previous community detection methods, including clique percolation and MMSB, would fail to detect communities with dense overlaps, because they all had a hidden assumption that a community’s overlapping parts are less densely connected than its non-overlapping ones. The same as the AGM, both the GP-EPM and HGP-EPM do not make such a restrictive assumption, and they both allow overlaps of communities to be denser than communities themselves; Beyond the AGM, we do not restrict $\phi_{ik}$ to be either zero or one, and our generative models are built under a rigorous nonparametric Bayesian framework with efficient Bayesian inference, as presented below.

4 MCMC Inference

In this paper, we consider an unweighted undirected network, where $b_{ji}\equiv b_{ij}$ and self-links $b_{ii}$ are not defined. Thus we only consider $b_{ij}$ for $j>i$ in (10). Let $m_{ik\boldsymbol{\cdot}\boldsymbol{\cdot}}$ be defined as in (6) and $m_{\boldsymbol{\cdot}k_{1}k_{2}\boldsymbol{\cdot}}$ as

where $\delta_{k_{1}k_{2}}=1$ if $k_{1}=k_{2}$ and $\delta_{k_{1}k_{2}}=0$ otherwise. Using (5) and the Poisson additive property, we have

Using the BerPo link, the gamma-Poisson conjugacy, and the augment-and-conquer techniques to infer the negative binomial dispersion parameters (Zhou and Carin, 2012, 2015), we exploit (14)-(17) to derive closed-form Gibbs sampling update equations for all model parameters except $\gamma_{0}$ , and construct an excellent proposal distribution to sample $\gamma_{0}$ using an independence chain Metropolis-Hastings algorithm. We present in the Appendix the details of MCMC inference for the HGP-EPM, and the hierarchical model and closed-form Gibbs sampling update equations for the GP-EPM. The inference of the nonparametric Bayesian AGM would be almost the same as that of the GP-EPM, with the only difference that the $(\phi_{ik}|-)$ would be sampled from Bernoulli distributions.

EXPERIMENTAL RESULTS

For comparison, we consider the infinite relational model (IRM) of Kemp et al. (2006), the Eigenmodel of Hoff (2008), the infinite latent attribute (ILA) model of Palla et al. (2012), the AGM of Yang and Leskovec (2012, 2014), and our GP- and HGP-EPMs. We use the R package provided for the Eigenmodel. We use the ILA codehttp://mlg.eng.cam.ac.uk/konstantina/ILA/ILA $\_$ code(v1).tar.gz provided for Palla et al. (2012), in which it is shown that the ILA outperforms the related nonparametric latent feature relational model of Miller et al. (2009). We implement a nonparametric Bayesian version of the AGM as a special case of the GP-EPM, as discussed in Section 3.3. Matlab code for the EPMs is available on the author’s website.

For the Eigenmodel, we find the best $K$ in $\{5,10,25,50\}$ . For the ILA, we use its default parameter settingThe default training/testing partition of the ILA code sends self-edges into the testing set; whereas in this paper, we do not intend to predict self-edges and hence we do not allow them to appear in the testing set. . For the IRM, we choose $\mbox{Beta}(0.1,1)$ as the prior for each latent block and $\mbox{Gam}(0.01,1/0.01)$ as the prior for the Chinese restaurant process concentration parameter; for the nonparametric Bayesian AGM, we let $\phi_{ik}\sim\mbox{Ber}(\pi_{i}),~{}\pi_{i}\sim\mbox{Beta}(0.01,0.01)$ and $\epsilon\sim\mbox{Gam}(0.01,1/0.01)$ ; these parameters are found to consistently provide good performance. For our models’ hyper-parameters, we choose $e_{0}=f_{0}=0.01$ and let $\gamma_{0}$ , $c_{i}$ , $c_{0}$ and $\beta$ be all drawn from $\mbox{Gam}(1,1)$ .

We consider 3000 MCMC iterations and collect the last 1500 samples, unless otherwise stated. We consider two small-scale benchmark networks, for which we test all algorithms and set the truncation level as $K_{\max}=100$ for our algorithms, and another two networks with more than 2000 nodes, for which we set $K_{\max}=256$ .

To test a model’s ability to predict missing edges of an unweighted undirected relational network, we randomlyIf removing an edge disconnects a node to all the others, then the edge will be kept in the training set. hold out 20% pairs of nodes and use the the remaining 80% to predict the probability for an edge to exist in each of these held-out pairs. Letting $o_{ij}=0$ if $b_{ij}$ is held out and $o_{ij}=1$ otherwise, we only need to slightly modify the inference by only considering $\{b_{ij}:o_{ij}=1\}$ in the likelihood. For example, $\omega_{ik}$ in (5) would be redefined as $\omega_{ik}=\sum_{j:o_{ij}=1}\sum_{k^{\prime}}\lambda_{kk^{\prime}}\phi_{jk^{\prime}}$ . We consider exactly the same five random training-testing partitions for all algorithms and report the average area under the curve (AUC) of both the receiver operating characteristic (ROC) and precision-recall (PR) curves (Davis and Goadrich, 2006). For link prediction, the AUC-PR is more sensitive to the percentage of true edges among the top ranked ones. Note that in addition to link prediction, the HGP-EPM, GP-EPM, AGM and IRM all have easily interpretable latent representations that will be used to detect overlapping/disjoint communities.

We first consider the Protein230 dataset of Butland et al. (2005) that describes the interactions between 230 proteins, with 595 edges. This is a small-scale benchmark network that exhibits both homophily and stochastic equivalence, as shown in Hoff (2008) and also tested in Lloyd et al. (2012). We are able to run 3000 MCMC iterations quickly enough for all algorithms except for the ILA on this network.

As shown in Tab. 1, the HGP-EPM has the best overall performance. The Eigenmodel is the second best with $K=10$ and the IRM is the third best. The AGM is not competitive as it restricts its features to be binary. In this and all future tables, we highlight in bold both the best result and the ones that are less than one standard error away from the best. Below we analyze why the HGP-EPM performs the best while the simpler GP-EPM is not that competitive on this dataset.

As shown in Figs. 1 (b)-(d), the HGP-EPM captures both homophily and stochastic equivalence by accurately modeling both diagonal and off-diagonal dense regions of the adjacency matrix; the GP-EPM captures homophily by accurately modeling diagonal dense regions that represent intra-community interactions, but at the expense of creating nonexistent blocks in order to fit dense off-diagonal regions that represent strong inter-community interactions; and the IRM captures these large dense blocks, but produces a cartoonish estimation, which overlooks small communities that represent fine details along the diagonal.

Fig. 2 shows how the HGP-EPM works. First, each feature vector $\boldsymbol{\phi}_{k}$ shown in Fig. 2 (a) clearly describes how strongly the nodes are affiliated with the community it represents, and each node may have large weights on multiple community. Second, about 30 latent feature vectors are inferred and the remaining ones are essentially drawn from the prior $\prod_{i}\mbox{Gam}(a_{i},1/c_{i})$ . Third, the inter- and intra-community interaction strengths in Fig. 2 (b) can be matched to the corresponding communities (subsets of nodes) in Figs. 1 (a) and (b). For example, Fig. 2 (a) suggests that the first and second largest communities have 24 and 22 nodes, respectively, and Fig. 2 (b) suggests that the first and second communities have sparse and dense intra-community connections, respectively, and have denser connections between them, as confirmed by examining the block structures within the top-left $46\times 46$ corner of both Figs. 1 (a) and (b).

2 NIPS234 Coauthor Network

We consider the small-scale NIPS234 network consists of the top 234 authors in NIPS 1-17 conferenceshttp://chechiklab.biu.ac.il/ $\sim$ gal/data.html in terms of the number of publications, as studied in Miller et al. (2009). There are 598 edges. As shown in Tab. 2, the GP-EPM and HGP-EPM have the best overall performance, followed by the IRM. Comparing with the simpler GP-EPM, the extra flexibility to model stochastic equivalence does not bring the HGP-EPM additional advantages on this dataset, which is not surprising as Fig. 3 suggests that this coauthor network mainly exhibits homophily. Note that the IRM performs well measured by the AUC-ROC, but its AUC-PR is clearly worse than those of the EPMs. This may again be explained by its overly smoothed cartoonish estimation that overlooks small communities, as clearly shown in Fig. 3 (d).

3 Yeast and NIPS12 Networks

We also consider the Yeasthttp://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm protein interaction network of Bu et al. (2003), with 2361 nodes and 6646 non-self edges, and the NIPS12 coauthor networkhttp://www.cs.nyu.edu/ $\sim$ roweis/data.html that includes all the 2037 authors in NIPS papers vols 0-12, with 3134 edges. These two median-size networks are already too large for the Eigenmodel and ILA to produce reasonable results given our computational resources. The results in Tabs. 4 and 4 show that the HGP-EPM performs the best on the Yeast protein-protein interaction network, which is found to clearly exhibit stochastic equivalence by examining the plots corresponding to the ones in Figs. 1 and 3 (not shown for brevity), and the HGP-EPM and GP-EPM both perform well on the NIPS12 coauthor network, which is found to mainly exhibit homophily by examining related plots (not shown for brevity).

As discussed before, the HGP-EPM, GP-EPM, AGM and IRM can all be used to assign nodes to disjoint communities. In Fig. 4 we plot the size of an inferred latent community as a function of its rank (smaller ranks indicate larger sizes) on the log-10 scale, for the four scalable algorithms on the four tested real networks. It is clear that in contrast to the other three latent factor models, the IRM, a latent class model, infers a smaller number of communities, with more larger-size and fewer smaller-size ones. Examining the details we find that the IRM tends to place all the low-degree nodes into one or several large-size communities, whereas the other models are able to better preserve fine details involving small-size communities.

We mention that the HGP-EPM, GP-EPM and AGM have $O(N\bar{d}+NK)$ computation, whereas the Eigenmodel and ILA have at least $O(N^{2}+NK)$ computation, where $K$ is the number of latent features. With unoptimized Matlab on a 2.7 GHz CPU, for 1000 MCMC iterations, the HGP-EPM (GP-EPM) takes about 80 (20) seconds on Protein230, about 85 (28) seconds on NIPS234, about 50 (18) minutes on Yeast, and about 32 (12) minutes on NIPS12. The Eigenmodel with $K=25$ takes about 200 seconds on NIPS234 to run 1000 MCMC iterations. For the ILA on NIPS234, we considered 1000 MCMC iterations that took over 18 hours to run; for Protein230, the ILA inferred about two times more features as it did on NIPS234, and we considered 500 MCMC iterations that took over 21 hours to run.

CONCLUSIONS

To model unweighted undirected relational networks characterized by both homophily and stochastic equivalence, we propose a hierarchical gamma process edge partition model (EPM) that supports an infinite number of communities and an infinite square rate matrix to describe community-community interactions. The EPM exploits a Bernoulli-Poisson link to assign a latent count to each binary edge, and further partitions that count according to the edge’s affiliations with all pairs of communities, which naturally leads to the partition of each node into overlapping communities. We also provide a simpler gamma process EPM that omits inter-community interactions, which is found to perform well on assortative networks. Efficient MCMC inference with closed-form update equations is provided. Experimental results on four real networks illustrate the EPMs’ working mechanisms and properties, as well as their state-of-the-art performance and interpretable latent representations. While previous latent feature relational models and their nonparametric Bayesian versions are often not scalable, our infinite EPMs are readily scalable to networks with thousands of nodes. It would be interesting to investigate strategies to make them scalable to relational networks with millions of nodes and edges.

Appendix A Proof for Lemma 1

Using the law of total expectation, we have

Using Campbell’s theorem (Kingman, 1993), we have

Appendix B MCMC Inference for HGP-EPM

Sample $m_{ij}$ . As in Section 2.2, we sample a latent count for each $b_{ij}$ as

Sample $m_{ik_{1}k_{2}j}$ . Using the relationships between the Poisson and multinomial distributions, similar to the derivation in Zhou et al. (2012), we partition the latent count $m_{ij}$ as

Note that in each MCMC iteration we store $m_{ik\boldsymbol{\cdot}\boldsymbol{\cdot}}$ and $m_{\boldsymbol{\cdot}k_{1}k_{2}\boldsymbol{\cdot}}$ but not necessarily $m_{ik_{1}k_{2}j}$ in the memory. Sample $a_{i}$ . Using (16) and the data augmentation technique developed in Zhou and Carin (2012, 2015) for the negative binomial distribution, we sample $a_{i}$ as

where with a slight abuse of notation, but for added conciseness, we use $x\sim\sum_{t=1}^{m}\mbox{Ber}[a/(a+t)]$ to represent $x=\sum_{t=1}^{m}u_{t},~{}u_{t}\sim\mbox{Ber}[{a}/({a+t})]$ . Sample $\phi_{ik}$ . Using (14) and the gamma-Poisson conjugacy, we have

Sample $r_{k}$ . Similar to the inference of $a_{i}$ , using (17), we sample $r_{k}$ as

Sample $\xi$ . We resample the auxiliary variables $l_{kk}$ using the updated $r_{k}$ and then sample $\xi$ as

Sample $\lambda_{k_{1}k_{2}}$ . Using (15) and the gamma-Poisson conjugacy, we have

Sample $\beta$ , $c_{i}$ and $c_{0}$ . They can be sampled from gamma distributions using the conjugacy between gamma distributions, omitted here for brevity. Sample $\gamma_{0}$ . As show in Lemma 1, the mass parameter $\gamma_{0}$ plays an important role in determining the total sum of the infinite rate matrix $\{\lambda_{k_{1}k_{2}}\}$ . Our experiments show that it could be used as a tuning parameter to impose one’s prior preference on the number of active communities to be discovered. In this paper, we impose a gamma prior as $\gamma_{0}\sim\mbox{Gam}(1,1)$ to let the data infer the posterior of $\gamma_{0}$ . We employ an independence chain Metropolis-Hastings algorithm to sample $\gamma_{0}$ , with the proposal distribution constructed as

We accept $\gamma_{0}^{\ast}$ with probability $\min\{1,\pi\}$ , where $\pi$ is

which is usually greater than 50% for the networks considered in this paper.

Each iteration of the MCMC for the HGP-EPM proceeds from (18) to (25).

Appendix C Gamma Process EPM

The gamma process EPM differs from the HGP-EPM in that it omits inter-community interactions, which leads to a simpler hierarchical model and faster computation at the expense of reduced ability to model stochastic equivalence. It is found to have good performance on assortative networks but not necessarily on disassortative ones.

The (truncated) gamma process EPM is expressed as

where the $\mbox{Gam}(1,1)$ prior is also imposed on $c_{0}$ and $c_{i}$ . As $K\rightarrow\infty$ , we recover the (exact) gamma process with a finite and continuous base measure. We usually set $K$ to be large enough to ensure a good approximation to the truly infinite model.

Note that if we marginalize out both $m_{ij}$ and $m_{ijk}$ , then we have

C.2 Gibbs Sampling

Let the latent counts $m_{i\boldsymbol{\cdot}k}$ and $m_{\boldsymbol{\cdot}\boldsymbol{\cdot}k}$ be defined as

Using the Poisson additive property, we have

Marginalizing out $\phi_{ik}$ from (27), we have

Marginalizing out $r_{k}$ from (28), we have

Similar to the inference techniques used in Appendix B, one may exploit (27)-(30) to derive closed-form Gibbs sampling update equations for all model parameters, omitted here for brevity.

Appendix D Gamma Process AGM

Closely related to the gamma process EPM, the hierarchical model for the (truncated) gamma process AGM can be expressed as

We sample $r_{k}$ , $\gamma_{0}$ and $c_{0}$ in the same way we sample them in the gamma process EPM. To sample $\phi_{ik}$ , one may use (27) as the likelihood, under which $\phi_{ik}$ is equal to one a.s. if $m_{i\boldsymbol{\cdot}k}>0$ and is drawn from a Bernoulli distribution if $m_{i\boldsymbol{\cdot}k}=0$ . Gibbs sampling update equations for the other model parameters can be conviniently derived by exploiting conditional conjugacies, omitted here for brevity.