Accurate Community Detection in the Stochastic Block Model via Spectral Algorithms
Se-Young Yun, Alexandre Proutiere
Introduction
Extracting structures or communities in networks is a central task in many disciplines including social sciences, biology, computer science, statistics, and physics. The Stochastic Block Model (SBM) was introduced a few decades ago as a performance benchmark to study the problem of community detection in random graphs, and it has, since then, attracted a lot attention. In this paper, we provide new results on the performance of spectral algorithms for detecting communities in the SBM. We consider a network consisting of a set of nodes. admits a hidden partition of non-overlapping subsets or communities (). The size of community is for some . Without loss of generality, let . We assume that when the network size grows large, the number of communities and their relative sizes are kept fixed. The communities have to be reconstructed from an observed realization of a random graph constructed as follows. Each pair of vertices is connected independently with probability within communities and across communities, where and may depend on the network size . We assume that there exists such that uniformly in . We further restrict our attention to the sparse case such that and the case where , which is a necessary condition for asymptotically accurate community detection i.e., for the existence of algorithms that yield a vanishing proportion of misclassified vertices.
We show that under certain spectral algorithms, the number of misclassified vertices does not exceed with high probability as grows large, whenever and
This result extends recent work about exact community reconstruction in the binary symmetric SBM (i.e., in the specific case of two communities of equal sizes). Indeed, by choosing in (1), we get a condition under which spectral algorithms exactly recover the structure of any asymmetric networks with an arbitrary (but finite) number of communities. However our results is not limited to exact reconstruction, as we may choose any , e.g., .
We conjecture that the condition (1) is necessary for the existence of algorithms yielding less than misclassified vertices. The conjecture is true in the case of exact reconstruction () for the binary symmetric SBM. Please refer to the next section for a more detailed description on the related work.
Previous Results
Exact Detection. Asymptotically exact community reconstruction in the SBM has been recently addressed in [Abbe et al.(2014)Abbe, Bandeira, and Hall], [Mossel et al.(2014)Mossel, Neeman, and Sly], and [Hajek et al.(2014)Hajek, Wu, and Xu]. These papers only consider the binary symmetric SBM. They establish a necessary and sufficient condition for asymptotically exact reconstruction that coincides with (1) when applied to two communities of equal sizes () and . For example, when and for , (1) becomes equivalent to . The three aforementioned papers further provide optimal algorithms, i.e., algorithms exactly recovering the network structure when this is possible. Note that in [Abbe et al.(2014)Abbe, Bandeira, and Hall], and [Hajek et al.(2014)Hajek, Wu, and Xu], the proposed algorithms are based on SDP, and can be computationally expensive. In contrast, we prove that simple spectral algorithms are optimal.
Asymptotically Accurate Detection. Necessary and sufficient conditions for asymptotically accurate detection (i.e., the proportion of misclassified vertices vanishes when grows large) in the SBM has been derived in [Yun and Proutiere(2014)]. This condition is . In the present paper, we provide results that fill the gap between exact detection and asymptotically accurate detection.
Detectability. In the sparse regime where , and for the binary symmetric SBM, the main focus recently has been on identifying the phase transition threshold (a condition on and ) for detectability: It was conjectured in [Decelle et al.(2011)Decelle, Krzakala, Moore, and Zdeborová] that if (i.e., under the threshold), no algorithm can perform better than a simple random assignment of vertices to communities, and above the threshold, communities can partially be recovered. The conjecture was recently proved in [Mossel et al.(2012)Mossel, Neeman, and Sly] (necessary condition), and [Massoulié(2013)] (sufficient condition).
A more exhaustive list of papers related to the SBM can be found in [Yun and Proutiere(2014)].
Spectral Algorithms and Their Performance
The proposed algorithm, referred to as Spectral Partition, is the same as that in [Yun and Proutiere(2014)], and is simple modifications of algorithms initially presented in [Coja-Oghlan(2010)]. In this paper, we present a more precise analysis of its performance than that of [Yun and Proutiere(2014)]. Let denote the observed random adjacency matrix. The algorithm consists in three steps. 1. Trimming. We first trim the adjacency matrix , i.e., we keep the entries corresponding to a set of vertices whose degrees are not too large. More precisely, . The resulting trimmed observation matrix is denoted by . 2. Spectral decomposition. We then extract the communities from the spectral analysis of . 3. Improvement. Finally, we further improve the estimated communities. After the spectral decomposition step, the identified communities are good approximations of the true communities. The improvement is obtained by sequentially considering each vertex and by moving it to the community with which it has the largest number of edges. The pseudo-code of the algorithm is presented in Algorithms 1 and 2. The next theorem provides performance guarantees for the Spectral Partition algorithm.
Then under the Spectral Partition algorithm, the number of misclassified vertices is less than with high probability.
By assumption, we have , and . We may deduce that:
This can be proven using extensions of the weighted Arithmetic-Mean Geometric-Mean inequality. From Theorem 3.1, we deduce that: if
then the Spectral Partition algorithm yields less than misclassified vertices with high probability. We conclude this paper by exemplifying the condition (1). Consider the binary symmetric SBM, with and for some .
Exact reconstruction: with , (1) is equivalent to , which also constitutes a necessary condition for exact reconstruction. Theorem 3.1 then states that Spectral Partition is optimal for exact reconstruction, i.e., it extracts the communities exactly whenever this is at all possible.
Accurate reconstruction: choose for some . Then (1) is equivalent to
References
Appendix A Proof of Theorem 3.1
We first provide key intermediate results.
With high probability, .
The proof of Lemma A.1 relies on arguments used in the spectral analysis of random graphs, see [Feige and Ofek(2005)]. The next lemma provides a bound on the number of misclassified nodes after spectral decomposition applied to the trimmed matrix , see Algorithm 2.
Assume that and . Let denotes the output of Algorithm 2. With high probability, there exists a permutation of such that:
Observe that using Chernoff bound, we have , hence combining the two previous lemma yields:
Assume that . The output of Algorithm 2 satisfies: with high probability, there exists a permutation of such that
Proof of Theorem 3.1: Let be the largest set of vertices satisfying:
When , for all .
The proof proceeds as follows. We first show that with high probability. To this aim, we control the number of vertices satisfying (H1), (H2), and (H3), see Lemma A.4, Claim 1 and Lemma A.5, respectively. The result is summarised in Lemma A.5. Next Lemma A.6 establishes that there is no misclassified vertices in with high probability, which concludes the proof.
From Lemma A.4, with high probability, the number of vertices that do not satisfy (H1) is less than when , since
Claim 1. From Chernoff bound, we can easily show that does not satisfy (H2) with probability at most and thus, with high probability, the number of vertices that do not satisfy (H2) is less than , since
In Lemma A.5, we conclude that after showing the number of vertices that do not satisfy (H3) is less that with high probability.
When , , with high probability.
Lemma A.6 shows that when initial (after Algorithm 2) number of misclassified vertices is ,
Since the initial number of misclassified vertices is negligible compared to by Lemma A.3, after iterations, there is no misclassified vertice in .
If
A.2 Proof of Lemma A.4
From Chernoff bound, we know that for all ,
We conclude the proof by proving (6) and (7).
Proof of (7): Let
A.3 Proof of Lemma A.5
Let denote the set of vertices that do not satisfy at least one of (H1) and (H2). From Lemma A.4 and Chernoff bound, with high probability.
Next we prove the following intermediate claim: there is no subset such that and with high probability. For any subset such that by Markov inequality,
where, in the last two inequalities, we have set and used the fact that: which comes from the assumptions made in the theorem. Since the number of subsets with size is from (20), we deduce:
Therefore, by Markov inequality, we can conclude that there is no such that and with high probability.
To conclude the proof of the lemma, we build the following sequence of sets. Let be generated as follows:
For , if there exists such that and and if there does not exist, the sequence ends.
The sequence ends after the construction of . By construction, every satisfies the conditions (H1), (H2), and (H3). Since is the largest set of vertices satisfying (H1), (H2), and (H3), .
The proof is hence completed if we show that . Let . If and since ,
However, from the previous claim, we know that with high probability, all such that have to satisfy . Therefore, with high probability, and
A.4 Proof of Lemma A.6
Since and when ,
With the above inequality and (H1), we can bound as follows:
where stems from (H1), stems from the fact that where indicates the vector -th value is 1 if and 0 otherwise, and stems from (H3). Since , we conclude that