Accurate Community Detection in the Stochastic Block Model via Spectral Algorithms

Se-Young Yun, Alexandre Proutiere

Introduction

Extracting structures or communities in networks is a central task in many disciplines including social sciences, biology, computer science, statistics, and physics. The Stochastic Block Model (SBM) was introduced a few decades ago as a performance benchmark to study the problem of community detection in random graphs, and it has, since then, attracted a lot attention. In this paper, we provide new results on the performance of spectral algorithms for detecting communities in the SBM. We consider a network consisting of a set VV of nn nodes. VV admits a hidden partition of KK non-overlapping subsets or communities V1,,VKV_{1},\ldots,V_{K} (V=k=1KVkV=\bigcup_{k=1}^{K}V_{k}). The size of community VkV_{k} is αk×n\alpha_{k}\times n for some αk>0\alpha_{k}>0. Without loss of generality, let α1α2αK\alpha_{1}\leq\alpha_{2}\leq\dots\leq\alpha_{K}. We assume that when the network size nn grows large, the number of communities KK and their relative sizes are kept fixed. The communities have to be reconstructed from an observed realization of a random graph constructed as follows. Each pair of vertices is connected independently with probability pp within communities and qq across communities, where pp and qq may depend on the network size nn. We assume that there exists ϵ>0\epsilon>0 such that pq1+ϵ{p\over q}\geq 1+\epsilon uniformly in nn. We further restrict our attention to the sparse case such that p=o(1/log2n)p=o(1/\log^{2}n) and the case where pn=ω(1)pn=\omega(1), which is a necessary condition for asymptotically accurate community detection i.e., for the existence of algorithms that yield a vanishing proportion of misclassified vertices.

We show that under certain spectral algorithms, the number of misclassified vertices does not exceed ss with high probability as nn grows large, whenever s=o(n)s=o(n) and

This result extends recent work about exact community reconstruction in the binary symmetric SBM (i.e., in the specific case of two communities of equal sizes). Indeed, by choosing s<1s<1 in (1), we get a condition under which spectral algorithms exactly recover the structure of any asymmetric networks with an arbitrary (but finite) number of communities. However our results is not limited to exact reconstruction, as we may choose any s=o(n)s=o(n), e.g., s=ns=\sqrt{n}.

We conjecture that the condition (1) is necessary for the existence of algorithms yielding less than ss misclassified vertices. The conjecture is true in the case of exact reconstruction (s<1s<1) for the binary symmetric SBM. Please refer to the next section for a more detailed description on the related work.

Previous Results

Exact Detection. Asymptotically exact community reconstruction in the SBM has been recently addressed in [Abbe et al.(2014)Abbe, Bandeira, and Hall], [Mossel et al.(2014)Mossel, Neeman, and Sly], and [Hajek et al.(2014)Hajek, Wu, and Xu]. These papers only consider the binary symmetric SBM. They establish a necessary and sufficient condition for asymptotically exact reconstruction that coincides with (1) when applied to two communities of equal sizes (α1=α2=1/2\alpha_{1}=\alpha_{2}=1/2) and s<1s<1. For example, when p=alog(n)np={a\log(n)\over n} and q=blog(n)nq={b\log(n)\over n} for a>ba>b, (1) becomes equivalent to a+b2ab>1{a+b\over 2}-\sqrt{ab}>1. The three aforementioned papers further provide optimal algorithms, i.e., algorithms exactly recovering the network structure when this is possible. Note that in [Abbe et al.(2014)Abbe, Bandeira, and Hall], and [Hajek et al.(2014)Hajek, Wu, and Xu], the proposed algorithms are based on SDP, and can be computationally expensive. In contrast, we prove that simple spectral algorithms are optimal.

Asymptotically Accurate Detection. Necessary and sufficient conditions for asymptotically accurate detection (i.e., the proportion of misclassified vertices vanishes when nn grows large) in the SBM has been derived in [Yun and Proutiere(2014)]. This condition is n(pq)2/(p+q)=ω(1)n(p-q)^{2}/(p+q)=\omega(1). In the present paper, we provide results that fill the gap between exact detection and asymptotically accurate detection.

Detectability. In the sparse regime where p,q=o(1)p,q=o(1), and for the binary symmetric SBM, the main focus recently has been on identifying the phase transition threshold (a condition on pp and qq) for detectability: It was conjectured in [Decelle et al.(2011)Decelle, Krzakala, Moore, and Zdeborová] that if n(pq)<2n(p+q)n(p-q)<\sqrt{2n(p+q)} (i.e., under the threshold), no algorithm can perform better than a simple random assignment of vertices to communities, and above the threshold, communities can partially be recovered. The conjecture was recently proved in [Mossel et al.(2012)Mossel, Neeman, and Sly] (necessary condition), and [Massoulié(2013)] (sufficient condition).

A more exhaustive list of papers related to the SBM can be found in [Yun and Proutiere(2014)].

Spectral Algorithms and Their Performance

The proposed algorithm, referred to as Spectral Partition, is the same as that in [Yun and Proutiere(2014)], and is simple modifications of algorithms initially presented in [Coja-Oghlan(2010)]. In this paper, we present a more precise analysis of its performance than that of [Yun and Proutiere(2014)]. Let AA denote the observed random adjacency matrix. The algorithm consists in three steps. 1. Trimming. We first trim the adjacency matrix AA, i.e., we keep the entries corresponding to a set Γ\Gamma of vertices whose degrees are not too large. More precisely, Γ={v:wVAvw10(v,w)EAvwn}\Gamma=\{v:\sum_{w\in V}A_{vw}\leq 10\frac{\sum_{(v,w)\in E}A_{vw}}{n}\}. The resulting trimmed observation matrix is denoted by AΓA_{\Gamma}. 2. Spectral decomposition. We then extract the communities from the spectral analysis of AΓA_{\Gamma}. 3. Improvement. Finally, we further improve the estimated communities. After the spectral decomposition step, the identified communities (Sk)k=1,2(S_{k})_{k=1,2} are good approximations of the true communities. The improvement is obtained by sequentially considering each vertex and by moving it to the community with which it has the largest number of edges. The pseudo-code of the algorithm is presented in Algorithms 1 and 2. The next theorem provides performance guarantees for the Spectral Partition algorithm.

Then under the Spectral Partition algorithm, the number of misclassified vertices is less than ss with high probability.

By assumption, we have pq1+ϵ{p\over q}\geq 1+\epsilon, and pn=ω(1)pn=\omega(1). We may deduce that:

This can be proven using extensions of the weighted Arithmetic-Mean Geometric-Mean inequality. From Theorem 3.1, we deduce that: if

then the Spectral Partition algorithm yields less than ss misclassified vertices with high probability. We conclude this paper by exemplifying the condition (1). Consider the binary symmetric SBM, with p=alog(n)np={a\log(n)\over n} and q=blog(n)nq={b\log(n)\over n} for some a>ba>b.

Exact reconstruction: with s<1s<1, (1) is equivalent to a+b2ab>1{a+b\over 2}-\sqrt{ab}>1, which also constitutes a necessary condition for exact reconstruction. Theorem 3.1 then states that Spectral Partition is optimal for exact reconstruction, i.e., it extracts the communities exactly whenever this is at all possible.

Accurate reconstruction: choose s=nxs=n^{x} for some x(0,1)x\in(0,1). Then (1) is equivalent to

References

Appendix A Proof of Theorem 3.1

We first provide key intermediate results.

With high probability, XΓ=O(np)\|X_{\Gamma}\|=O(\sqrt{np}).

The proof of Lemma A.1 relies on arguments used in the spectral analysis of random graphs, see [Feige and Ofek(2005)]. The next lemma provides a bound on the number of misclassified nodes after spectral decomposition applied to the trimmed matrix AΓA_{\Gamma}, see Algorithm 2.

Assume that VΓ=O(1/p)|V\setminus\Gamma|=O(1/p) and XΓ=O(np)\|X_{\Gamma}\|=O(\sqrt{np}). Let (Sk)1kK(S_{k})_{1\leq k\leq K} denotes the output of Algorithm 2. With high probability, there exists a permutation σ\sigma of {1,,K}\{1,\ldots,K\} such that:

Observe that using Chernoff bound, we have VΓ=O(1/p)|V\setminus\Gamma|=O(1/p), hence combining the two previous lemma yields:

Assume that np=ω(1)np=\omega(1). The output (Sk)1kK(S_{k})_{1\leq k\leq K} of Algorithm 2 satisfies: with high probability, there exists a permutation σ\sigma of {1,,K}\{1,\ldots,K\} such that 1nk=1KVkSk=O(1np).\frac{1}{n}|\bigcup_{k=1}^{K}V_{k}\setminus S_{k}|=O(\frac{1}{np}).

Proof of Theorem 3.1: Let HH be the largest set of vertices vVv\in V satisfying:

When vVkv\in V_{k}, e(v,Vk)Vke(v,Vj)Vjplog4np\frac{e(v,V_{k})}{|V_{k}|}-\frac{e(v,V_{j})}{|V_{j}|}\geq\frac{p}{\log^{4}np} for all jkj\neq k.

The proof proceeds as follows. We first show that VHs|V\setminus H|\leq s with high probability. To this aim, we control the number of vertices satisfying (H1), (H2), and (H3), see Lemma A.4, Claim 1 and Lemma A.5, respectively. The result is summarised in Lemma A.5. Next Lemma A.6 establishes that there is no misclassified vertices in HH with high probability, which concludes the proof.

From Lemma A.4, with high probability, the number of vertices that do not satisfy (H1) is less than s/3s/3 when n(α1p+α2q(α1+α2)pα1α1+α2qα2α1+α2)log(n/s)np2lognp=ω(1)n(\alpha_{1}p+\alpha_{2}q-(\alpha_{1}+\alpha_{2})p^{\frac{\alpha_{1}}{\alpha_{1}+\alpha_{2}}}q^{\frac{\alpha_{2}}{\alpha_{1}+\alpha_{2}}})-\log(n/s)-\frac{np}{2\log np}=\omega(1), since

Claim 1. From Chernoff bound, we can easily show that vv does not satisfy (H2) with probability at most exp(5np)\exp(-5np) and thus, with high probability, the number of vertices that do not satisfy (H2) is less than s10\frac{s}{10}, since

In Lemma A.5, we conclude that VHsV\setminus H\leq s after showing the number of vertices that do not satisfy (H3) is less that s2\frac{s}{2} with high probability.

When n(α1p+α2q(α1+α2)pα1α1+α2qα2α1+α2)log(n/s)np2lognp=ω(1)n(\alpha_{1}p+\alpha_{2}q-(\alpha_{1}+\alpha_{2})p^{\frac{\alpha_{1}}{\alpha_{1}+\alpha_{2}}}q^{\frac{\alpha_{2}}{\alpha_{1}+\alpha_{2}}})-\log(n/s)-\frac{np}{2\log np}=\omega(1), VHs|V\setminus H|\leq s, with high probability.

Lemma A.6 shows that when initial (after Algorithm 2) number of misclassified vertices is O(1/p)O(1/p),

Since the initial number of misclassified vertices is negligible compared to nn by Lemma A.3, after logn\log n iterations, there is no misclassified vertice in HH.

If k=1K(Sk(0)Vk)H+VH=O(1/p),{|\bigcup_{k=1}^{K}(S^{(0)}_{k}\setminus V_{k})\cap H|+|V\setminus H|}=O(1/p),

A.2 Proof of Lemma A.4

From Chernoff bound, we know that for all 1tK1\leq t\leq K,

We conclude the proof by proving (6) and (7).

Proof of (7): Let x=nplog4np.x^{\star}=\lfloor\frac{np}{\log^{4}np}\rfloor.

A.3 Proof of Lemma A.5

Let Z1Z_{1} denote the set of vertices that do not satisfy at least one of (H1) and (H2). From Lemma A.4 and Chernoff bound, Z1<s2|Z_{1}|<\frac{s}{2} with high probability.

Next we prove the following intermediate claim: there is no subset SVS\subset V such that e(S,S)slog2npe(S,S)\geq s\log^{2}np and S=s|S|=s with high probability. For any subset SVS\in V such that S=s,|S|=s, by Markov inequality,

where, in the last two inequalities, we have set t=nplognpt=\frac{np}{\log np} and used the fact that: nsexp(nplognp),\frac{n}{s}\geq\exp(\frac{np}{\log np}), which comes from the assumptions made in the theorem. Since the number of subsets SVS\subset V with size ss is (ns)(ens)s,{{n}\choose{s}}\leq(\frac{en}{s})^{s}, from (20), we deduce:

Therefore, by Markov inequality, we can conclude that there is no SVS\subset V such that e(S,S)slog2npe(S,S)\geq s\log^{2}np and S=s|S|=s with high probability.

To conclude the proof of the lemma, we build the following sequence of sets. Let {Z(i)V}1ii\{Z(i)\subset V\}_{1\leq i\leq i^{\star}} be generated as follows:

For i1i\geq 1, Z(i)=Z(i1){vi}Z(i)=Z(i-1)\cup\{v_{i}\} if there exists viVv_{i}\in V such that e(vi,Z(i1))2log2npe(v_{i},Z(i-1))\geq 2\log^{2}np and viZ(i1)v_{i}\notin Z(i-1) and if there does not exist, the sequence ends.

The sequence ends after the construction of Z(i)Z(i^{\star}). By construction, every vVZ(i)v\in V\setminus Z(i^{\star}) satisfies the conditions (H1), (H2), and (H3). Since HH is the largest set of vertices satisfying (H1), (H2), and (H3), HVZ(i)|H|\geq|V\setminus Z(i^{\star})|.

The proof is hence completed if we show that Z(i)<s|Z(i^{\star})|<s. Let t=sZ1t^{\star}=s-|Z_{1}|. If it,i^{\star}\geq t^{\star}, Z(t)=s|Z(t^{\star})|=s and since Z1s2|Z_{1}|\leq\frac{s}{2},

However, from the previous claim, we know that with high probability, all SVS\subset V such that S=s|S|=s have to satisfy e(S,S)slog2npe(S,S)\leq s\log^{2}np. Therefore, with high probability, i<ti^{\star}<t^{\star} and

A.4 Proof of Lemma A.6

Since E(i)=O(1/p)|\mathcal{E}^{(i)}|=O(1/p) and e(v,V)10npe(v,V)\leq 10np when vHv\in H,

With the above inequality and (H1), we can bound E(i+1)E(i)\frac{|\mathcal{E}^{(i+1)}|}{|\mathcal{E}^{(i)}|} as follows:

where (a)(a) stems from (H1), (b)(b) stems from the fact that vE(i+1)(e(v,E(i))μ(v,E(i)))=1E(i)TXΓ1E(i+1)\sum_{v\in\mathcal{E}^{(i+1)}}(e(v,\mathcal{E}^{(i)})-\mu(v,\mathcal{E}^{(i)}))=1_{\mathcal{E}^{(i)}}^{T}\cdot X_{\Gamma}\cdot 1_{\mathcal{E}^{(i+1)}} where 1S1_{S} indicates the vector vv-th value is 1 if vSv\in S and 0 otherwise, and (c)(c) stems from (H3). Since E(i)=O(1/p)|\mathcal{E}^{(i)}|=O(1/p), we conclude that