Recovering communities in the general stochastic block model without knowing the parameters
Emmanuel Abbe, Colin Sandon
Introduction
This paper studies the problem of recovering communities in the general stochastic block model with linear size communities, for constant and slowly diverging degree regimes. In contrast to [AS15], this paper does not require knowledge of the SBM parameters. In particular, the problem of learning the model parameters is solved when average degrees are diverging. We next provide some motivations on the problem and further background on the model.
Detecting communities (or clusters) in graphs is a fundamental problem in networks, computer science and machine learning. This applies to a large variety of complex networks (e.g., social and biological networks) as well as to data sets engineered as networks via similarly graphs, where one often attempts to get a first impression on the data by trying to identify groups with similar behavior. In particular, finding communities allows one to find like-minded people in social networks [GN02, NWS], to improve recommendation systems [LSY03, XWZ+14], to segment or classify images [SM97, SHB07], to detect protein complexes [CY06, MPN+99], to find genetically related sub-populations [PSD00, JTZ04], or discover new tumor subclasses [SPT+01].
While a large variety of community detection algorithms have been deployed in the past decades, the understanding of the fundamental limits of community detection has only appeared more recently, in particular for the SBM [Co10, DKMZ11, Mas14, MNS14, ABH14, MNSb, YC14, AS15]. The SBM is a canonical model for community detection [HLL83, WBB76, FMW85, WW87, BC09, KN11, BCLS87, DF89, Bop87, JS98, CK99, CI01, SN97, McS01, BC09, RCY11, CWA12, CSX12], where vertices are partitioned into communities of relative size , , and pairs of nodes in communities and connect independently with probability .
Recently the SBM came back to the center of the attention at both the practical level, due to extensions allowing overlapping communities [ABFX08] that have proved to fit well real data sets in massive networks [GB13], and at the theoretical level due to new phase transition phenomena [Co10, DKMZ11, Mas14, MNS14, ABH14, MNSb]. The latter works focus exclusively on the SBM with two symmetric communities, i.e., each community is of the same size and the connectivity in each community is identical. Denoting by the intra- and the extra-cluster probabilities, most of the results are concerned with two figure of merits: (i) recovery (also called exact recovery or strong consistency), which investigates the regimes of and for which there exists an algorithm that recovers with high probability the two communities completely [BCLS87, DF89, Bop87, JS98, CK99, CI01, SN97, McS01, BC09, RCY11, CWA12, CSX12, Vu14, YC14], (ii) detection, which investigates the regimes for which there exists an algorithm that recovers with high probability a positively correlated partition [Co10, DKMZ11, MNS12, Mas14, MNS14].
The sharp threshold for exact recovery was obtained in [ABH14, MNSb], showing[MNSb] generalizes this to . that for , , , exact recovery is solvable if and only if , with efficient algorithms achieving the threshold provided in [ABH14, MNSb]. In addition, [ABH14] introduces an SDP proved to achieve the threshold in [BH14, Ban15], while [YP14] shows that a spectral algorithm also achieves the threshold. Prior to these, the sharp threshold for detection was obtained in [Mas14, MNS14], showing that detection is solvable (and so efficiently) if and only if , when , , settling a conjecture made in [DKMZ11] and improving on [Co10].
Besides the detection and the recovery properties, one may ask about the partial recovery of the communities, studied in [MNSa, GV14, Vu14, CRV15, AS15]. Of particular interest to this paper is the case of almost exact recovery (also called weak consistency), where only a vanishing fraction of the nodes is allowed to be misclassified. For two-symmetric communities, [MNSb] shows that almost exact recovery is possible if and only if diverges, generalized in [AS15] for general SBMs.
In the next section, we discuss the results for the general SBM of interest in this paper and the problem of learning the model parameters. We conclude this section by providing motivations on the problem of achieving the threshold with an efficient and universal algorithm.
Threshold phenomena have long been studied in fields such as information theory (e.g., Shannon’s capacity) and constraint satisfaction problems (e.g., the SAT threshold). In particular, the quest of achieving the threshold has generated major algorithmic developments in these fields (e.g., LDPC codes, polar codes, survey propagation to name a few). Likewise, identifying thresholds in community detection models is key to benchmark and guide the development of clustering algorithms. Most reasonable algorithms may succeed in some regimes, while in others they may be doomed to fail due to computational barriers. However, it is particularly crucial to develop benchmarks that do not depend sensitively on the knowledge of the model parameters. A natural question is hence whether one can solve the various recovery problems in the SBM without having access to the parameters. This paper answers this question by the affirmative for the exact and almost exact recovery of the communities.
Most of the previous works are concerned with the SBM having symmetric communities (mainly 2 or sometimes ), with the exception of [Vu14] which provides some achievability results for the general SBM.[GV14] also study variations of the -symmetric model. Recently, [AS15] studied the fundamental limits for the general SBM, with results as follows (where SBM is the SBM with community prior and connectivity matrix ).
The algorithm Sphere-comparison is proposed that solves partial recovery with exponential accuracy and quasi-linear complexity when the SNR diverges, solving in particular almost exact recovery.
Note that for symmetric clusters, SNR reduces to , which is the quantity of interest for detection [DKMZ11, MNS12]. Moreover, the SNR must diverge to ensure almost exact recovery in the symmetric case [AS15]. The following is an important consequence of the previous theorem, as it shows that Sphere-comparison achieves almost exact recovery when the entries of are scaled.
II. Exact recovery in the general SBM. The second result in [AS15] is for the regime where the connectivity matrix scales as , fixed, where it is shown that exact recovery has a sharp threshold characterized by the divergence function
named the CH-divergence in [AS15]. Specifically, if all pairs of columns in are at -distance at least 1 from each other, then exact recovery is solvable in the general SBM. This provides in particular an operational meaning to a new divergence function analog to the KL-divergence in the channel coding theorem (see Section 2.3 in [AS15]). Moreover, an algorithm (Degree-profiling) is developed that solves exact recovery down to the limit in quasi-linear time, showing that exact recovery has no informational to computational gap (as opposed to the conjectures made for detection with more than 4 communities [DKMZ11]). The following gives a more general statement characterizing which subset of communities can be extracted — see Definition 3 for formal definitions.
In particular, exact recovery is information-theoretically solvable in SBM if and only if . (ii) The Degree-profiling algorithm (see [AS15]) recovers the finest partition that can be recovered with probability and runs in time for all . In particular, exact recovery is efficiently solvable whenever it is information-theoretically solvable.
In summary, exact or almost exact recovery is closed for the general SBM (and detection is closed for 2 symmetric communities). However this is for the case where the parameters of the SBM are assumed to be known, and with linear-size communities.
2 Estimating the parameters
For the estimation of the parameter, some results are known for two-symmetric communities. In the logarithmic degree regime, since the SDP is agnostic to the parameters (it is a relaxation of the min-bisection), and the parameters can be estimated by recovering the communities [ABH14, BH14, Ban15]. For the constant-degree regime, [MNS12] shows that the parameters can be estimated above the threshold by counting cycles (which is efficiently approximated by counting non-backtracking walks). These are however for a fixed number of communities, namely 2. We also became recently aware of a parallel work [BCS15], which considers private graphon estimation (including SBMs). In particular, for the logarithmic degree regime, [BCS15] obtains a procedure to estimate parameters of graphons in an appropriate version of the norm. This procedure is however not efficient.
For the general SBM, the results of [AS15] allow to find communities efficiently, however these rely on the knowledge of the parameters. Hence, a major open problem is to understand if these results can be extended without such a knowledge.
Results
Agnostic algorithms are developed for the constant and diverging node degrees. These afford optimal accuracy scaling for large node degrees and achieve the CH-divergence limit for logarithmic node degrees in quasi-linear time. In particular, these solve the parameter estimation problems for SBM without knowing the number of communities. An example with real data is provided in Section 4.
The general stochastic block model is a random graph ensemble defined on the vertex-set , where each vertex is assigned independently a hidden (or planted) label in under a probability distribution on , and each (unordered) pair of nodes is connected independently with probability , where is specified by a symmetric matrix with entries in $G\sim\text{SBM}(n,p,W)\sigma_{v}$ ) revealed. The goal is to recover these labels by observing only the graph.
This paper focuses on independent of (the communities have linear size), dependent on such that the average node degrees are either constant or logarithmically growing, and fixed. These assumptions on and could be relaxed, for example to slowly growing , but we leave this for future work. As discussed in the introduction, the above regimes for are both motivated by applications, as networks are typically sparse [LLDM08, Str01] though the average degrees may not be small, and by the fact that interesting mathematical phenomena take place in these regimes. For convenience, we attribute specific notations for the model in these regimes:
(Partial recovery.) An algorithm recovers or detects communities in with an accuracy of , if it outputs a labelling of the nodes , which agrees with the true labelling on a fraction of the nodes with probability . The agreement is maximized over relabellings of the communities.
(Exact recovery.) Exact recovery is solvable in for a community partition , where is a subset of , if there exists an algorithm that takes and assigns to each node in an element of that contains its true communityThis is again up to relabellings of the communities. with probability . Exact recovery is solvable in if it is solvable for the partition of into singletons, i.e., all communities can be recovered.
The problem is solvable information-theoretically if there exists an algorithm that solves it, and efficiently if the algorithm runs in polynomial-time in . Note that exact recovery for the partition is equivalent to extracting community . In general, recovering a partition is equivalent to merging the communities that are in a common subset and recovering the merged communities. Note also that exact recovery in requires the graph not to have vertices of degree in multiple communities with high probability (i.e., connectivity in the symmetric case). Therefore, for exact recovery, we focus below on where is fixed.
2 Partial recovery
Our main result in the Appendix (Theorme 6) applies to SBM with arbitrary . We provided here a specific instance which is easier to parse.
Note that a vertex in community has degree with probability exponential in , and there is no way to differentiate between vertices of degree from different communities. So, an error rate that decreases exponentially with is optimal. The above gives in particular the parameter estimation in the case (see also Lemma 17 in the Appendix).
The general result in the Appendix yields the following refined results in the -block symmetric case.
Consider the -block symmetric case. In other words, for all , and is if and otherwise. The vector whose entries are all s is an eigenvector of with eigenvalue , and every vector whose entries add up to is an eigenvector of with eigenvalue . So, and and Then, as long as and , there exist a constant such that Agnostic-sphere-comparison() detects communities with an accuracy of for sufficiently large .
We refer to Section 4 for an example of implementation with real data.
3 Exact recovery
We next show that this can be achieved without knowing the parameters. Recall that the finest partition is the largest partition of that ensure (19).
The proof assumes that the entries of are non-zero, see Remark 1 for zero entries. To achieve this result we rely on a two step procedure. First an algorithm is developed to recover all but a vanishing fraction of nodes — this is the main focus of our partial recovery result — and then a procedure is used to “clean up” the leftover graphs using the node degrees of the preliminary classification. This turns out to be much more efficient than aiming for an algorithm that directly achieves exact recovery. We already used this technique in [AS15], but here we also deal with the difficulties resulting from not knowing the SBM’s parameters.
Proof Techniques and Algorithms
For any vertex , let be the set of all vertices with shortest path to of length . We also refer to the vector with -th entry equal to the number of vertices in that are in community as . If there are multiple graphs that could be considered a vertex in, let be the set of all vertices with shortest paths in to of length .
One could probably determine and given the values of for a few different , but using to approximate that would require knowing how many of the vertices in are in each community. So, we attempt to get information relating to how many vertices in are in each community by checking how it connects to for some vertex and integer . The obvious way to do this would be to compute the cardinality of their intersection. Unfortunately, whether a given vertex in community is in is not independent of whether it is in , which causes the cardinality of to differ from what one would expect badly enough to disrupt plans to use it for approximations.
In order to get around this, we randomly assign every edge in to a set with probability . We hence define the following.
Note that and are disjoint; however, is sparse enough that even if they were generated independently a given pair of vertices would have an edge between them in both with probability . So, is approximately independent of . Thus, for any and , with a probability of approximately . As a result,
Let be the distinct eigenvalues of , ordered so that . Also define so that if and if . If is the eigenspace of corresponding to the eigenvalue , and is the projection operator on to , then
where the final equality holds because for all ,
and since , this implies that . In order to simplify the terminology,
Let for all , , and .
Equation (14) is dominated by the term, so getting good estimate of the through terms requires cancelling it out somehow. As a start, if then
Note that the left hand side of this expression is equal to . More generally, in order to get an expression that can be used to estimate the and , we consider the determinant of the following.
Let be the matrix such that for each and .
To the degree that approximation 8 holds and is small, each column of is a linear combination of the vectors
with coefficients that depend only on . So, by linearity of the determinant in one column, is a linear combination of these vector’s wedge products with coefficients that are independent of and . By antisymmetry of wedge products, only the products that use different such vectors contribute to the determinant, and the products involving the eigenvalues of highest magnitude will dominate. As a result, there exist constants and such that
if . These facts suggest the following plan for estimating the eigenvalues corresponding to a graph. First, pick several vertices at random. Then, use the fact that for any good vertex to estimate . Next, use the formulas above about to get an approximation of and all of ’s eigenvalues for each selected vertex. Finally, take the median of these estimates.
Now, note that whether or not , we have
This fact can be used in combination with estimates of ’s eigenvalues to approximate for arbitrary , , and .
Of course, this requires and to be large enough that
is large relative to the error terms for all . At a minimum, that requires that .
On a different note, for any and ,
with equality for all if and only if , so sufficiently good approximations of and can be used to determine which pairs of vertices are in the same community.
One could generate a reasonable classification based solely on this method of comparing vertices. However, that would require computing for every vertex in the graph with fairly large , which would be slow. Instead, we use the fact that for any vertices , , and with ,
for all , and the inequality is strict for at least one . So, subtracting from both sides gives us that
for all , and the inequality is still strict for at least one .
So, given a representative vertex in each community, we can determine which of them a given vertex, , is in the same community as without needing to know the value of .
This runs fairly quickly if is approximated using such that is large and is small because the algorithm only requires focusing on vertices. This leads to the following plan for partial recovery. First, randomly select a set of vertices that is large enough to contain at least one vertex from each community with high probability. Next, compare all of the selected vertices in an attempt to determine which of them are in the same communities. Then, pick one in each community. After that, use the algorithm referred to above to attempt to determine which community each of the remaining vertices is in. As long as there actually was at least one vertex from each community in the initial set and none of the approximations were particularly bad, this should give a reasonably accurate classification.
The risk that this randomly gives a bad classification due to a bad set of initial vertices can be mitigated as follows. First, repeat the previous classification procedure several times. Assuming that the procedure gives a good classification more often than not, the good classifications should comprise a set that contains more than half the classifications and which has fairly little difference between any two elements of the set. Furthermore, any such set would have to contain at least one good classification, so none of its elements could be too bad. So, find such a set and average its classifications together. This completes the Agnostic-Sphere-comparison-algorithm. We refer to Section 6 for a detailed version.
2 Exact recovery and the Agnostic-degree-profiling algorithm
This is proved in [AS15]. In the case of unknown parameters, the algorithmic approach is largely unchanged, adding a step where the best known classification is used to estimate and prior to any step in which vertices are classified based on their neighbors. The analysis of the algorithm requires however some careful handling.
First, it is necessary to prove that given a labelling of the graph’s vertices with an error rate of , one can compute approximations of and that are within of their true values with probability . Secondly, one needs to modify the robust degree profiling lemma to show that attempting to determine vertices’ communities based on estimates of and that are off by at most , and , and a classification of its neighbors that has an error rate of classifies the vertices with an error rate only times higher than it would be given accurate values of and and accurate classifications of the vertices’ neighbors. Combining these yields the conclusion that any errors in the estimates of the SBM’s parameters do not disrupt vertex classification any worse than the errors in the preliminary classifications already were.
The Agnostic-degree-profiling algorithm. The inputs are , where is a graph, and (see Theorem 7 for how to set ).
The algorithm outputs an assignment of each vertex to one of the groups of communities , where is the partition of in to the largest number of subsets such that for all in that are in different subsets (this is called the “finest partition”). It does the following:
(1) Define the graph on the vertex set by selecting each edge in independently with probability , and define the graph that contains the edges in that are not in .
(2) Run Agnostic-sphere-comparison on to obtain the preliminary classification (see Section 7.1.)
(3) Determine the size of each alleged community, and the edge density between each pair of alleged communities.
(4) For each node , determine in which community node is most likely to belong to based on its degree profile computed from the preliminary classification (see Section 7.2.2), and call it
(5) Use to get new estimates of and .
(6) For each node , determine in which group node is most likely to belong to based on its degree profile computed from the preliminary classification (see Section 7.2.2).
An example with real data
We have tested a simplified version of our algorithm on the data from “The political blogosphere and the 2004 US Election” [AG05], which contains a list of political blogs that were classified as liberal or conservative, and links between the blogs.
The algorithm we used has a few major modifications relative to our standard algorithm. First of all, instead of using as its basic tool for comparing vertices, it uses a different measure, which is defined as the fraction of pairs of an edge leaving the ball of radius centered on and an edge leaving the ball of radius centered on which hit the same vertex but are not the same edge. Making the measure a fraction of the pairs rather than a count of pairs was necessary to prevent from being massively dependent on the degrees of and , which would have resulted in the increased variance in vertex degree obscuring the effects of and on . The other changes to the definition make the measure somewhat less reliable, but it is still useable as long as the average degree is fairly high and .
Secondly, the version of Vertex-comparison-algorithm we used simply concludes that two vertices, and , are in different communities if is below average and the same community otherwise. This is reasonable because of the following facts. For one thing, because the normalization converts the dominant term to a constant, is approximately affine in . Also, as a result of the symmetry between communities, is the same for all . So, is also affine in . Furthermore, there are only two possible values of by symmetry, and , so iff is below average. Finally, because both communities have the same average degree, is independent of and so is always . The version of Vertex-classification-algorithm we used is comparably modified.
Finally, our algorithm generates reference vertices by repeatedly picking two vertices at random and comparing them. If it concludes that they are in different communities and they both have above-average degree, it accepts them as reference vertices; otherwise it tries again. Requiring above-average degree is useful because a higher degree vertex is less likely to have its neighborhood distorted by a couple of atypical neighbors.
Out of trials, the resulting algorithm gave a reasonably good classification times. Each of these classified all but to of the vertices in the graph’s main component correctly. The state-of-the-art described in [CG15] gives a lowest value at 58, with the best algorithms around , while algorithms regularized spectral methods such as the one in [QR13] obtain about 80 errors.
Open problems
The current result should also extend directly to a slowly growing number of communities (e.g., up to logarithmic). It would be interesting to extend the current approach to smaller sized communities or larger numbers of communities (watching the complexity scaling with the number of communities), as well as more general models with corrected-degrees, labeled-edges, or overlapping communities (though linear-sized overlapping communities can be treated with the approach of [AS15]).
References
The Agnostic-sphere-comparison algorithm in details
Recall the following motivation and definitions.
For any vertex , let be the set of all vertices with shortest path to of length . We also refer to the vector with -th entry equal to the number of vertices in that are in community as . If there are multiple graphs that could be considered a vertex in, let be the set of all vertices with shortest paths in to of length .
One could probably determine and given the values of for a few different , but using to approximate that would require knowing how many of the vertices in are in each community. So, we attempt to get information relating to how many vertices in are in each community by checking how it connects to for some vertex and integer . The obvious way to do this would be to compute the cardinality of their intersection. Unfortunately, whether a given vertex in community is in is not independent of whether it is in , which causes the cardinality of to differ from what one would expect badly enough to disrupt plans to use it for approximations.
In order to get around this, we randomly assign every edge in to a set with probability . We hence define the following.
Note that and are disjoint; however, is sparse enough that even if they were generated independently a given pair of vertices would have an edge between them in both with probability . So, is approximately independent of . Thus, for any and , with a probability of approximately . As a result,
Let be the distinct eigenvalues of , ordered so that . Also define so that if and if . If is the eigenspace of corresponding to the eigenvalue , and is the projection operator on to , then
where the final equality holds because for all ,
and since , this implies that . In order to simplify the terminology,
Let for all , , and .
Equation (14) is dominated by the term, so getting good estimate of the through terms requires cancelling it out somehow. As a start, if then
Note that the left hand side of this expression is equal to . More generally, in order to get an expression that can be used to estimate the and , we consider the determinant of the following.
Let be the matrix such that for each and .
To the degree that approximation 8 holds and is small, each column of is a linear combination of the vectors
with coefficients that depend only on . So, by linearity of the determinant in one column, is a linear combination of these vector’s wedge products with coefficients that are independent of and . By antisymmetry of wedge products, only the products that use different such vectors contribute to the determinant, and the products involving the eigenvalues of highest magnitude will dominate. As a result, there exist constants and such that
if . In the later case, , so either and have the same sign or and have the same sign. In any of these cases,
This suggests the following algorithm for finding ’s eigenvalues.
The Basic-Eigenvalue-approximation-algorithm The inputs are , where is a vertex of the graph, , and is a subset of ’s edges.
The algorithm ouputs a claim about how many distinct nonzero eigenvalues has and a list of approximations of them.
(1) Compute for each until , and then set .
(2) Set . Then, compute
until an is found for which this expression is less than . Then, set .
unless , in which case set
Repeat this for each
(4) Next, for each , if then set and . For each such that and set
(5) Return
The risk when using this algorithm is that if the set of edges in ’s immediate neighborhood is sufficiently atypical it may not work correctly. This can be solved by repeating it for several vertices and taking the median estimates.
The Improved-Eigenvalue-approximation-algorithm The input is
The algorithm ouputs a claim about how many distinct nonzero eigenvalues has and a list of approximations of them.
(1) Create a set of edges , that each of ’s edges is independently assigned to with probability .
(2) Randomly select of ’s vertices, , ,…, .
(3) Run Basic-Eigenvalue-approximation-algorithm(E,c,v[i]) for each , stopping the algorithm prematurely if it takes more than time.
(4) Return where and are the median outputs of the executions of Basic-Eigenvalue-approximation-algorithm for each .
Now, note that whether or not , we have
This fact can be used in combination with estimates of ’s eigenvalues to approximate for arbitrary , , and as follows.
The algorithm outputs such that for all .
(1) For each , set
(2) Return .
Of course, this requires and to be large enough that
is large relative to the error terms for all . At a minimum, that requires that , so
because otherwise the graph will start running out of vertices before one gets steps away from or steps away from .
Furthermore, for any and ,
with equality for all if and only if , so sufficiently good approximations of and can be used to determine which pairs of vertices are in the same community as follows.
The Vertex-comparison-algorithm. The inputs are , where are two vertices, are positive integers, is a subset of ’s edges, is a positive real number, is a real number between and , and are real numbers.
The algorithm outputs a decision on whether and are in the same community or not. It proceeds as follows.
(1) Run Vertex-product-approximation-algorithm(v,v’,r,r’,E,c,),Vertex-product-approximation-algorithm(v,v,r,r’,E,c,), and Vertex-product-approximation- algorithm(v’,v’,r,r’,E,c,).
(2) If then conclude that and are in different communities. Otherwise, conclude that and are in the same community.
One could generate a reasonable classification based solely on this method of comparing vertices (with an appropriate choice of the parameters, as later detailed). However, that would require computing for every vertex in the graph with fairly large , which would be slow. Instead, we use the fact that for any vertices , , and with ,
for all , and the inequality is strict for at least one . So, subtracting from both sides gives us that
for all , and the inequality is still strict for at least one .
So, given a representative vertex in each community, we can determine which of them a given vertex, , is in the same community as without needing to know the value of as follows.
The Vertex-classification-algorithm. The inputs are , where is a list of vertices, is a vertex, are positive integers, is a subset of ’s edges, is a real number between and , and are real numbers. It is assumed that has already been computed for each and .
The algorithm is supposed to output such that is in the same community as . It works as follows.
(1) Run Vertex-product-approximation-algorithm(,v’,r,r’,E,c,) for each .
(2) Find a that minimizes the value of
and conclude that is in the same community as .
This runs fairly quickly if is large and is small because the algorithm only requires focusing on vertices. This leads to the following plan for partial recovery. First, randomly select a set of vertices that is large enough to contain at least one vertex from each community with high probability. Next, compare all of the selected vertices in an attempt to determine which of them are in the same communities. Then, pick one anchor vertex in each community. After that, use the algorithm above to attempt to determine which community each of the remaining vertices is in. As long as there actually was at least one vertex from each community in the initial set and none of the approximations were particularly bad, this should give a reasonably accurate classification.
The Unreliable-graph-classification-algorithm. The inputs are , where is a graph, is a real number between and , is a positive integer, is a real number between and , is a positive real number, and are real numbers.
The algorithm outputs an alleged list of communities for . It works as follows.
(1) Randomly assign each edge in to independently with probability .
(2) Randomly select vertices in , .
(3) Set and
(4) Compute for each and .
(5) Run vertex-comparison-algorithm(v[i],v[j],r,r’,E,x,) for every and
(6) If these give consistent results, randomly select one alleged member of each community . Otherwise, fail.
(7) For every in the graph, compute for each . Then, run Vertex-classification-algorithm(v’[],v”, r,r’,E,) in order to get a hypothesized classification of
The risk that this randomly gives a bad classification due to a bad set of initial vertices can be mitigated as follows. First, repeat the previous classification procedure several times. Assuming that the procedure gives a good classification more often than not, the good classifications should comprise a set that contains more than half the classifications and which has fairly little difference between any two elements of the set. Furthermore, any such set would have to contain at least one good classification, so none of its elements could be too bad. So, find such a set and average its classifications together.
So, the overall Agnostic-sphere-comparison-algorithm starts by estimating ’s eigenvalues. Then, it uses those estimates to pick appropriate values of and for the Unreliable-graph-classification-algorithm. Finally, it runs it several times and combines the resulting classifications as explained above. The only inputs it requires are the graph itself and some such that for all .
The Reliable-graph-classification-algorithm (i.e., Agnostic Sphere comparison). The inputs are , where is a graph, is a positive integer, is a positive real number, and is a function from the positive integers to itself.
The algorithm outputs an alleged list of communities for . It works as follows.
(1) Run Improved-Eigenvalue-approximation-algorithm(.1) in order to compute
(2) Let , , and
(3) Let be the smallest rational number of minimal numerator such that
(4) Let be the smallest rational number of the form or such that and
(5) Let be the largest unit reciprocal less than such that all of the following hold:
(6) Run Unreliable-graph-classification-algorithm times and record the resulting classifications.
(7) Find the smallest such that there exists a set of more than half of the classifications no two of which have more than disagreement, and discard all classifications not in the set. In this step, define the disagreement between two classifications as the minimum disagreement over all bijections between their communities.
(8) For every vertex in , randomly pick one of the remaining classifications and assert that it is in the community claimed by that classification, where a community from one classification is assumed to correspond to the community it has the greatest overlap with in each other classification.
(9) Return the resulting combined classification.
If the conditions of theorem are satisfied, then there exists such that
of ’s vertices correctly with probability and it runs in time, for appropriate and .
Appendix
Considering the way , , , and scale when is multiplied by a scalar yields the following corollary.
If instead of having constant average degree, one has an average degree which increases as increases, one can slowly reduce , , and as increases, leading to the following corollary.
These corollaries are important as they show that if the entries of the connectivity matrix are amplified by a coefficient growing with , almost exact recovery is achieved by (Agnostic-sphere-comparison) without parameter knowledge.
1.2 Proof of Theorem 6
Proving Theorem 6 will require establishing some terminology. First, let be the distinct eigenvalues of , ordered so that and if then . Also define so that if and if . In addition to this, let be the largest sum of a column of .
For any vertex , let be the set of all vertices with shortest path to of length . If there are multiple graphs that could be considered a vertex in, let be the set of all vertices with shortest paths in to of length .
We also typically refer to as simply , as the context will make it clear whether the expression refers to a set or vector.
Note that since any such can be written as a linear combination of the , is -good if for all and .
First, note that for any eigenvector of , , and ,
Thus, for any , it must be the case that
Now, define ,…, such that for each and . For any ,
If , then , so this implies that . It follows from this that
Therefore, for any , we have that
The following two lemmas are proved in [AS15].
Note that if and have already been computed, can be computed by means of the following algorithm, where
Roughly speaking, for each and , with probability . This is complicated by the facts that is never in and no edge is in and . However, this changes the expected value of given by at most a constant unless has more than double its expected number of edges, something that happens with probability . Furthermore, whether is in is independent of whether is in unless or . So, the variance of is proportional to its expected value, which is
is within standard deviations of its expected value with probability , which completes the proof. ∎
Note that if is an eigenvector of , is an eigenvector of the symmetric matrix . So, since eigenvectors of a symmetric matrix with different eigenvalues are orthogonal, we have
where we temporarily adopt the convention that if , and for all .
Alternately, if then with probability , either is -bad, is -bad, or
For each and , let
Next, for each and , let be the column vector thats th entry is for . Also, for each , let be the length column vector thats th entry is for . Note that for each , the th column of is . So,
If and is some permutation of the integers from to , then
The th column of this matrix is proportional to , so there exists some such that the sum of all such terms is
Alternately, the sum of all such terms is equal to
That can only hold for all such if for all , and that can only be the case if . Therefore, the determinant is nonzero unless for some , which implies that . If then by similar logic, there exists such that the sum of all terms for which is a permutation of the integers from to and is
for all . Similarly, if is -good then
for all . If both hold, then for all . Furthermore, for any and ,
In other words, under these circumstances is upper bounded by a constant multiple of if and are both good, and every entry of has a magnitude that is upper bounded by a constant multiple of . Either way, every entry of is upper bounded by a constant multiple of .
Alternately, recall that if is -good then for any and ,
Also, for fixed values of for all , each element of has a variance of . So,
with probability for all and . This implies that if is -good and is -good then
with probability . There are only possible choices of , so with probability , either is -bad, is -bad, or
While this is a helpful result, it turns out to be more useful to have an expression that links the determinant to for fixed values of and . So, we have the following:
Combining these inequalities with the determinant lemma yields the desired result. ∎
However, in order to know this in a useful sense it is necessary to prove that these terms are large relative to the error terms. In order to do that, we need the following:
Every entry in is positive, so its eigenvector of largest eigenvalue is unique up to multiplication, and its entries all have the same sign. That means that it has a unique multiple, , such that and . In this case all of ’s entries must be positive, and since , it must be the case that .
Given any -good vertex , and any ,
For any eigenvector with eigenvalue and ,
, so there exists a constant such that for any ,
for all . For , , and a fixed value of , the probability distribution of is within of a poisson distribution with expected value . Furthermore, has negligible dependence on for all . So, has an expected value of and a standard deviation of
In particular, this means that if there exists such that then with probability ,
Furthermore, since the probability distribution of for a fixed value of is a sum of constant multiples of poisson distributions with expected values of , it must be the case that with probability . Therefore, there exists such that this holds with probability , and the lemma follows. ∎
This also implies that for any , , and , either is -bad, is -bad, or with probability , since the degree of dependence between and is negligible. This allows me to attempt to approximate ’s eigenvalues as follows.
With probability , either is -bad or Basic-Eigenvalue-approximation-algorithm(E,c,v) runs in time and returns such that and for all .
If is -good, there exists a constant such that for any ,
So, the minimum such that is within a constant of . That means that is upper bounded by a constant multiple of , and that is within a constant of the value it would have if were . This also implies that if is sufficiently large it is less than .
For , , and if then
with probability . If then , so either has the same sign as or has the same sign as . Either way, we have that
with probability . Furthermore, by lemma , these bounds are within a factor of of each other with probability .
is within a factor of of with probability , and thus that
with probability . , so with probability , this expression is not less than .
If , , and , then with probability either is -bad or
Therefore, with probability , either is -bad or .
By the previous two lemmas, with probability either is -bad, or
for all and .
Assume that the later holds. If then if is sufficiently large the term will be negligable relative to the term, while if increasing by two would multiply both of these terms by . Either way, we have that for any and ,
as long as is sufficiently large, and
We assumed that is large, so that can only fail if . In this case, , so increasing by one would result in both terms having the same sign, at which point the condition is satisfied. So,
for any . For any , if , then for sufficiently large , will be greater than . On the other hand, if and is sufficiently large, will be less than . So, the algorithm will suceed at determining which eigenvalues have the same absolute values and assign the same sign as for each . So, as desired. Assuming this all works, the slowest part of the algorithm is computing expressions of the form , each such expression can be computed in time, and only a constant number of them need to be computed. So, the algorithm runs in time. ∎
So, this algorithm sometimes works, but it fails if is bad. This risk can be mitigated by using multiple vertices as follows.
then Improved-Eigenvalue-approximation-algorithm(c) returns such that and for all with probability .
Generating takes time, picking takes time, each execution of Basic-Eigenvalue-approximation-algorithm(c) runs in time, and combining their outputs takes time. So, this algorithm runs in time.
Assuming the conditions are satisfied, there exists such that of ’s vertices are -good with probability . So, with probability , the majority of the selected vertices of are -good and the majority of the executions of Basic-Eigenvalue-approximation-algorithm give good output. If this happens, then the median value of is , and for each , the median value of is within of for each , as desired. ∎
Given approximations of ’s eigenvalues, one can attempt to approximate as follows.
The slowest part of the algorithm is computing the expressions of the form . Each of these can be computed in average time, and the number of these that need to be computed is constant in . So, this algorithm runs in average time.
With probability either is -bad, is -bad, or
for all and .
for any with probability . So,
with probability . That implies that
By goodness of and , this is less than or equal to
For any two vertices in different communities, and , the fact that ’s rows are distinct implies that . So, for some . That means that for any two vertices and ,
for all , with equality for all if and only if and are in the same community. This also implies that given a vertex , another vertex in the same community , and a vertex in a different community ,
for all and the inequality is strict for at least one . This suggests the following algorithms for classifying vertices.
Assuming that each of ’s edges was independently assigned to with probability , this algorithm runs in average time. Furthermore, if each execution of Vertex-product-approximation-algorithm succeeds and is less than the minimum nonzero value of then the algorithm returns the correct result with probability .
The slowest step of the algorithm is using Vertex-product-approximation-algorithm. This runs in an average time of and must be done times. If each execution of Vertex-product-approximation-algorithm succeeds then with probability the are all within of the products they seek to approximate, in which case
which is true for some if and only if and are in different communities. ∎
Assuming that was generated properly, this algorithm runs in average time. Let and assume that each execution of Vertex-product-approximation-algorithm succeeds (including the previous ones to compute ), and that contains exactly one vertex from each community. Also, assume that is less than the minimum nonzero value of . Then this algorithm classifies correctly with probability .
Again, the slowest step of the algorithm is running Vertex-product-approximation-algorithm. This runs in an average time of and must be done times. If the conditions given above are satisifed, then each is within of the product it seeks to approximate with probability . If this is the case, then
for all and if is in the same community as , and
for all and iff is in the same community as . So,
iff is in the same community as . Therefore, the algorithm returns the correct result with probability . ∎
At this point, we can finally start giving algorithms for classifying a graph’s vertices.
Let , and assume that all of the following hold:
Let
and
This algorithm runs in time. Furthermore, with probability , is such that Unreliable-graph-classification-algorithm has at least a
chance of classifying at least of ’s vertices correctly.
Let . There exists such that if these conditions hold, then with probability , at least of ’s vertices are -good and the number of vertices in in community is within of for all . If this is the case, then for sufficiently large , it is at least likely that every one of the randomly selected vertices is -good and at least one is selected from each community.
If is -good for all , then with probability , vertex-comparison-algorithm determines whether or not and are in the same community correctly for every and , allowing the algorithm to pick one member of each community. If that happens, then the algorithm will classify each -good vertex correctly with probability . So, as long as the initial selection of is good, the algorithm classifies at least of the graph’s vertices correctly with probability .
Generating and takes time. Computing for all takes time, and computing for all and takes
time. Once these have been computed, running Vertex-comparison- algorithm for every and takes time, at which point an alleged member of each community can be found in time. Running Vertex-classification-algorithm for every takes time. So, the overall algorithm runs in average time. ∎
So, this algorithm can sometimes give a vertex classification that is nontrivially better than that obtained by guessing. However, it has an assymptotically nonzero failure rate and requires too much information about the graph’s parameters. In order to get around that, we combine the results of multiple executions of the algorithm and add in a parameter analysis procedure as follows.
Assume that there exist , , and such that is either a unit reciprocal or an integer, is a rational number of the form or , and all of the following hold:
With probability , Reliable-graph-classification-algorithm(G,m,, T(n)) runs in time and classifies at least of ’s vertices correctly, where
With probability , Improved-Eigenvalue-approximation-algorithm gives output such that and for each . Assuming that this holds, the algorithm finds the largest and that satisfy the conditions above and the largest unit reciprocal less than , , that satisifes the conditions for Unreliable-graph-classification-algorithm to have a greater than success rate for any with probability . Since its success rate is greater than and , more than half of the executions of Unreliable-graph-classification-algorithm give classifications with error or less with probability . That means that and at least one of the classifications in the selected set must have error or less. Thus, all of the selected classifications have error or less. The requirement that ensures that the bijection between any two of these classifications’ communities that minimizes disagreeement is the identity. Therefore, this algorithm classifies at least of the vertices correctly, as desired.
Assuming this all works correctly, Improved-Eigenvalue-approximation-algorithm runs in time. Finding , , and takes constant time, and running Unreliable-graph-classification-algorithm times takes . Computing the degree of agreement between each pair of classifications takes time. , so the brute force algorithm finds and the corresponding set in time. Combining the classifications takes time. Therefore, this whole algorithm runs in time with probability , as desired. ∎
If the conditions hold, Reliable-graph-classification- algorithm has the desired properties by the previous lemma. ∎
2 Exact recovery
Recall also that exact recovery is solvable for a community partition , if there exists an algorithm that assigns to each node in an element of that contains its true communityUp to a relabelling of the communities. with probability . Exact recovery is solvable in if it is solvable for the partition of into singletons, i.e., all communities can be recovered.
is an -divergence. For , i.e., the gap between the arithmetic and geometric means, we have
which is the Hellinger divergence (or distance), and the maximization over of the part is the exponential of the Chernoff divergence. We refer to Section 8.3 for further discussions on . Note also that we will often evaluate as where are vectors instead of measures.
We next present our main theorem for exact recovery. We first provide necessary and sufficient conditions for exact recovery of partitions, and then provide an agnostic algorithm that solves exact recovery efficiently, more precisely, in quasi-linear time.
Note that the second item in the theorem implies that Agnostic-degree-profiling solves exact recovery efficiently whenever the parameters and allow for exact recovery to be solvable.
If for some and then the results above still hold, except that if for all and in different subsets of the partition,
but there exist and in different subsets of the partition such that and for all , then the optimal algorithm will have an assymptotically constant failure rate. The recovery algorithm also needs to be modified to accomodate ’s in .
Agnostic-degree-profiling algorithm. Inputs: a graph , and a splitting parameter (see Theorem 7 for the choice of ). Output: Each node is assigned a community-list , where is intended to be the partition of in to the largest number of subsets such that for all in that are in different subsets. Algorithm: (1) Define the graph on the vertex set by selecting each edge in independently with probability , and define the graph that contains the edges in that are not in . (2) Run Agnostic-sphere-comparison on to obtain the preliminary classification (see Section 7.1 and Corollary 2.) (3) Estimate and based on the alleged communities’ sizes and the edge densities between them. (4) For each node , determine in which community node is most likely to belong to based on its degree profile computed from the preliminary classification and the estimates of and (see Section 7.2.2), and call it (5) Re-estimate and based on the sizes of the communities claimed by and the edge densities between them. (6) For each node , determine in which group node is most likely to belong to based on its degree profile computed from the preliminary classification and the new estimates of and (see Section 7.2.2).
2.2 Testing degree profiles
In this section, we consider the problem of deciding which community a node in the SBM belongs to based on its degree profile. The notions are the same as in [AS15], we repeat them to ease the reading and explain how the Agnostic-degree-profiling algorithm works.
hence, by the additivity of Poisson distribution and the triangular inequality,
We will rely on a simple one-sided bound (see (44)) to approximate our events under the Poisson measure.
In other words, has independent Poisson entries with different means. We use the following notation to summarize the above setting:
Our goal is to infer the value of by observing a realization of . To minimize the error probability given a realization of , we must pick the most likely hypothesis conditioned on this realization, i.e.,
which is the Maximum A Posteriori (MAP) decoding rule.Ties can be broken arbitrarily. To resolve this maximization, we can proceed to a tournament of pairwise comparisons of the hypotheses. Each comparison allows us to eliminate one candidate for the maxima, i.e.,
The error probability of this decoding rule is then given by,
where is the CH-divergence as defined in (16).
In other words, the CH-divergence provides the error exponent for deciding among multivariate Poisson distributions. We did not find this result in the literature, but found a similar result obtained by Verdú [Ver86], who shows that the Hellinger distance (the special case with instead of the maximization over ) appears in the error exponent for testing Poisson point-processes, although [Ver86] does not investigate the exact error exponent. This lemma was proven in [AS15].
This lemma together with previous bounds on imply that if for all , the true hypothesis is correctly recovered with probability . However, it may be that only for a subset of -pairs. What can we then infer? While we may not recover the true value of with probability , we may narrow down the search within a subset of possible hypotheses with that probability of error.
Testing composite multivariate Poisson distributions. We now consider the previous setting, but we are no longer interested in determining the true hypothesis, but in deciding between two (or more) disjoint subsets of hypotheses. Under hypothesis 1, the distribution of belongs to a set of possible distributions, namely where , and under hypothesis 2, the distribution of belongs to another set of distributions, namely where . Note that and are disjoint subsets such that . In short,
Moreover, applying bounds on the minima of two sums,
The same reasoning can be applied to the problem of deciding whether a given node belongs to a group of communities, with more than two groups. Also, for any and such that for each , , , and ,
So, the error rate for any algorithm that classifies vertices based on their degree profile in a graph drawn from a sparse SBM is at most more than twice what it would be if the probability distribution of degree profiles really was the poisson distribution.
In summary, we have proved the following.
So, this hypothesis testing procedure has an error probability of at most
With probability , the size of every community is within of its expected value, and the edge density between each pair of communities is within of its expected value. Also, there exists a constant such that with probability , no vertex has degree more than . Assuming this is the case, the misclassification of vertices can not change the apparent number of edges between any two communities by more than , and it can not change the apparent size of any community by more than . Thus, and
This lets us prove the following “robust” version of this lemma to prove Theorem 7.
First, note that by the previous lemma, with probability , is such that every element of the estimates of and is within of its true value. Choose such that every vertex in the graph has degree less than with probability . If this holds, then the probability of misclassifying a vertex based on its true degree profile and the estimates of its parameters is at most
based on the previous bounds on the probability that classifying a vertex based on its degree profile fails.
The key observation is that ’s th neighbor had at least a chance of actually being in community for each , so its probability of being reported as being in community is at most times the probability that it actually is. So, the probability that its reported degree profile is bad is at most times the probability that its actual degree profile is bad. The conclusion follows. ∎
2.3 Proof of Theorem 7
We prove the possibility result here. The converse was proven in [AS15] for known parameters, hence also applies here.
The idea behind Claim 1 is contained in Lemma 16. However, there are several technical steps that need to be handled:
The graphs and obtained in step 1 of the algorithm are correlated, since an edge cannot be both in and . However, this effect can be discarded since two independent versions would share edges with low enough probability.
The classification in step 2 using Agnostic-sphere-comparison has a vanishing fraction of vertices which are wrongly labeled, and the SBM parameters are unknown. This requires using the robust version of Lemma 16, namely Lemma 18.
In the case where a more careful classification is needed as carried in steps 3 and 4 of the algorithm.
With probability , no vertex in the graph has degree greater than . Assuming that this holds, no vertex’s set of neighbors in is more than
times as likely to occur as it would be if were independent of . So, the fact that they are not has negligible impact on the algorithm’s error rate. Now, let
By Lemma 18, if the classification in step has an error rate of at most , then the classification in step has an error rate of
observing that if the error rate of for adjacent to is at worst multiplied by a constant. That in turn ensures that the final classification has an error rate of at most