Accurate Community Detection in the Stochastic Block Model via Spectral Algorithms

Se-Young Yun, Alexandre Proutiere

Introduction

Extracting structures or communities in networks is a central task in many disciplines including social sciences, biology, computer science, statistics, and physics. The Stochastic Block Model (SBM) was introduced a few decades ago as a performance benchmark to study the problem of community detection in random graphs, and it has, since then, attracted a lot attention. In this paper, we provide new results on the performance of spectral algorithms for detecting communities in the SBM. We consider a network consisting of a set $V$ of $n$ nodes. $V$ admits a hidden partition of $K$ non-overlapping subsets or communities $V_{1},\ldots,V_{K}$ ( $V=\bigcup_{k=1}^{K}V_{k}$ ). The size of community $V_{k}$ is $\alpha_{k}\times n$ for some $\alpha_{k}>0$ . Without loss of generality, let $\alpha_{1}\leq\alpha_{2}\leq\dots\leq\alpha_{K}$ . We assume that when the network size $n$ grows large, the number of communities $K$ and their relative sizes are kept fixed. The communities have to be reconstructed from an observed realization of a random graph constructed as follows. Each pair of vertices is connected independently with probability $p$ within communities and $q$ across communities, where $p$ and $q$ may depend on the network size $n$ . We assume that there exists $\epsilon>0$ such that ${p\over q}\geq 1+\epsilon$ uniformly in $n$ . We further restrict our attention to the sparse case such that $p=o(1/\log^{2}n)$ and the case where $pn=\omega(1)$ , which is a necessary condition for asymptotically accurate community detection i.e., for the existence of algorithms that yield a vanishing proportion of misclassified vertices.

We show that under certain spectral algorithms, the number of misclassified vertices does not exceed $s$ with high probability as $n$ grows large, whenever $s=o(n)$ and

This result extends recent work about exact community reconstruction in the binary symmetric SBM (i.e., in the specific case of two communities of equal sizes). Indeed, by choosing $s<1$ in (1), we get a condition under which spectral algorithms exactly recover the structure of any asymmetric networks with an arbitrary (but finite) number of communities. However our results is not limited to exact reconstruction, as we may choose any $s=o(n)$ , e.g., $s=\sqrt{n}$ .

We conjecture that the condition (1) is necessary for the existence of algorithms yielding less than $s$ misclassified vertices. The conjecture is true in the case of exact reconstruction ( $s<1$ ) for the binary symmetric SBM. Please refer to the next section for a more detailed description on the related work.

Previous Results

Exact Detection. Asymptotically exact community reconstruction in the SBM has been recently addressed in [Abbe et al.(2014)Abbe, Bandeira, and Hall], [Mossel et al.(2014)Mossel, Neeman, and Sly], and [Hajek et al.(2014)Hajek, Wu, and Xu]. These papers only consider the binary symmetric SBM. They establish a necessary and sufficient condition for asymptotically exact reconstruction that coincides with (1) when applied to two communities of equal sizes ( $\alpha_{1}=\alpha_{2}=1/2$ ) and $s<1$ . For example, when $p={a\log(n)\over n}$ and $q={b\log(n)\over n}$ for $a>b$ , (1) becomes equivalent to ${a+b\over 2}-\sqrt{ab}>1$ . The three aforementioned papers further provide optimal algorithms, i.e., algorithms exactly recovering the network structure when this is possible. Note that in [Abbe et al.(2014)Abbe, Bandeira, and Hall], and [Hajek et al.(2014)Hajek, Wu, and Xu], the proposed algorithms are based on SDP, and can be computationally expensive. In contrast, we prove that simple spectral algorithms are optimal.

Asymptotically Accurate Detection. Necessary and sufficient conditions for asymptotically accurate detection (i.e., the proportion of misclassified vertices vanishes when $n$ grows large) in the SBM has been derived in [Yun and Proutiere(2014)]. This condition is $n(p-q)^{2}/(p+q)=\omega(1)$ . In the present paper, we provide results that fill the gap between exact detection and asymptotically accurate detection.

Detectability. In the sparse regime where $p,q=o(1)$ , and for the binary symmetric SBM, the main focus recently has been on identifying the phase transition threshold (a condition on $p$ and $q$ ) for detectability: It was conjectured in [Decelle et al.(2011)Decelle, Krzakala, Moore, and Zdeborová] that if $n(p-q)<\sqrt{2n(p+q)}$ (i.e., under the threshold), no algorithm can perform better than a simple random assignment of vertices to communities, and above the threshold, communities can partially be recovered. The conjecture was recently proved in [Mossel et al.(2012)Mossel, Neeman, and Sly] (necessary condition), and [Massoulié(2013)] (sufficient condition).

A more exhaustive list of papers related to the SBM can be found in [Yun and Proutiere(2014)].

Spectral Algorithms and Their Performance

The proposed algorithm, referred to as Spectral Partition, is the same as that in [Yun and Proutiere(2014)], and is simple modifications of algorithms initially presented in [Coja-Oghlan(2010)]. In this paper, we present a more precise analysis of its performance than that of [Yun and Proutiere(2014)]. Let $A$ denote the observed random adjacency matrix. The algorithm consists in three steps. 1. Trimming. We first trim the adjacency matrix $A$ , i.e., we keep the entries corresponding to a set $\Gamma$ of vertices whose degrees are not too large. More precisely, $\Gamma=\{v:\sum_{w\in V}A_{vw}\leq 10\frac{\sum_{(v,w)\in E}A_{vw}}{n}\}$ . The resulting trimmed observation matrix is denoted by $A_{\Gamma}$ . 2. Spectral decomposition. We then extract the communities from the spectral analysis of $A_{\Gamma}$ . 3. Improvement. Finally, we further improve the estimated communities. After the spectral decomposition step, the identified communities $(S_{k})_{k=1,2}$ are good approximations of the true communities. The improvement is obtained by sequentially considering each vertex and by moving it to the community with which it has the largest number of edges. The pseudo-code of the algorithm is presented in Algorithms 1 and 2. The next theorem provides performance guarantees for the Spectral Partition algorithm.

Then under the Spectral Partition algorithm, the number of misclassified vertices is less than $s$ with high probability.

By assumption, we have ${p\over q}\geq 1+\epsilon$ , and $pn=\omega(1)$ . We may deduce that:

This can be proven using extensions of the weighted Arithmetic-Mean Geometric-Mean inequality. From Theorem 3.1, we deduce that: if

then the Spectral Partition algorithm yields less than $s$ misclassified vertices with high probability. We conclude this paper by exemplifying the condition (1). Consider the binary symmetric SBM, with $p={a\log(n)\over n}$ and $q={b\log(n)\over n}$ for some $a>b$ .

Exact reconstruction: with $s<1$ , (1) is equivalent to ${a+b\over 2}-\sqrt{ab}>1$ , which also constitutes a necessary condition for exact reconstruction. Theorem 3.1 then states that Spectral Partition is optimal for exact reconstruction, i.e., it extracts the communities exactly whenever this is at all possible.

Accurate reconstruction: choose $s=n^{x}$ for some $x\in(0,1)$ . Then (1) is equivalent to

References

Appendix A Proof of Theorem 3.1

We first provide key intermediate results.

With high probability, $\|X_{\Gamma}\|=O(\sqrt{np})$ .

The proof of Lemma A.1 relies on arguments used in the spectral analysis of random graphs, see [Feige and Ofek(2005)]. The next lemma provides a bound on the number of misclassified nodes after spectral decomposition applied to the trimmed matrix $A_{\Gamma}$ , see Algorithm 2.

Assume that $|V\setminus\Gamma|=O(1/p)$ and $\|X_{\Gamma}\|=O(\sqrt{np})$ . Let $(S_{k})_{1\leq k\leq K}$ denotes the output of Algorithm 2. With high probability, there exists a permutation $\sigma$ of $\{1,\ldots,K\}$ such that:

Observe that using Chernoff bound, we have $|V\setminus\Gamma|=O(1/p)$ , hence combining the two previous lemma yields:

Assume that $np=\omega(1)$ . The output $(S_{k})_{1\leq k\leq K}$ of Algorithm 2 satisfies: with high probability, there exists a permutation $\sigma$ of $\{1,\ldots,K\}$ such that $\frac{1}{n}|\bigcup_{k=1}^{K}V_{k}\setminus S_{k}|=O(\frac{1}{np}).$

Proof of Theorem 3.1: Let $H$ be the largest set of vertices $v\in V$ satisfying:

When $v\in V_{k}$ , $\frac{e(v,V_{k})}{|V_{k}|}-\frac{e(v,V_{j})}{|V_{j}|}\geq\frac{p}{\log^{4}np}$ for all $j\neq k$ .

The proof proceeds as follows. We first show that $|V\setminus H|\leq s$ with high probability. To this aim, we control the number of vertices satisfying (H1), (H2), and (H3), see Lemma A.4, Claim 1 and Lemma A.5, respectively. The result is summarised in Lemma A.5. Next Lemma A.6 establishes that there is no misclassified vertices in $H$ with high probability, which concludes the proof.

From Lemma A.4, with high probability, the number of vertices that do not satisfy (H1) is less than $s/3$ when $n(\alpha_{1}p+\alpha_{2}q-(\alpha_{1}+\alpha_{2})p^{\frac{\alpha_{1}}{\alpha_{1}+\alpha_{2}}}q^{\frac{\alpha_{2}}{\alpha_{1}+\alpha_{2}}})-\log(n/s)-\frac{np}{2\log np}=\omega(1)$ , since

Claim 1. From Chernoff bound, we can easily show that $v$ does not satisfy (H2) with probability at most $\exp(-5np)$ and thus, with high probability, the number of vertices that do not satisfy (H2) is less than $\frac{s}{10}$ , since

In Lemma A.5, we conclude that $V\setminus H\leq s$ after showing the number of vertices that do not satisfy (H3) is less that $\frac{s}{2}$ with high probability.

When $n(\alpha_{1}p+\alpha_{2}q-(\alpha_{1}+\alpha_{2})p^{\frac{\alpha_{1}}{\alpha_{1}+\alpha_{2}}}q^{\frac{\alpha_{2}}{\alpha_{1}+\alpha_{2}}})-\log(n/s)-\frac{np}{2\log np}=\omega(1)$ , $|V\setminus H|\leq s$ , with high probability.

Lemma A.6 shows that when initial (after Algorithm 2) number of misclassified vertices is $O(1/p)$ ,

Since the initial number of misclassified vertices is negligible compared to $n$ by Lemma A.3, after $\log n$ iterations, there is no misclassified vertice in $H$ .

If ${|\bigcup_{k=1}^{K}(S^{(0)}_{k}\setminus V_{k})\cap H|+|V\setminus H|}=O(1/p),$

A.2 Proof of Lemma A.4

From Chernoff bound, we know that for all $1\leq t\leq K$ ,

We conclude the proof by proving (6) and (7).

Proof of (7): Let $x^{\star}=\lfloor\frac{np}{\log^{4}np}\rfloor.$

A.3 Proof of Lemma A.5

Let $Z_{1}$ denote the set of vertices that do not satisfy at least one of (H1) and (H2). From Lemma A.4 and Chernoff bound, $|Z_{1}|<\frac{s}{2}$ with high probability.

Next we prove the following intermediate claim: there is no subset $S\subset V$ such that $e(S,S)\geq s\log^{2}np$ and $|S|=s$ with high probability. For any subset $S\in V$ such that $|S|=s,$ by Markov inequality,

where, in the last two inequalities, we have set $t=\frac{np}{\log np}$ and used the fact that: $\frac{n}{s}\geq\exp(\frac{np}{\log np}),$ which comes from the assumptions made in the theorem. Since the number of subsets $S\subset V$ with size $s$ is ${{n}\choose{s}}\leq(\frac{en}{s})^{s},$ from (20), we deduce:

Therefore, by Markov inequality, we can conclude that there is no $S\subset V$ such that $e(S,S)\geq s\log^{2}np$ and $|S|=s$ with high probability.

To conclude the proof of the lemma, we build the following sequence of sets. Let $\{Z(i)\subset V\}_{1\leq i\leq i^{\star}}$ be generated as follows:

For $i\geq 1$ , $Z(i)=Z(i-1)\cup\{v_{i}\}$ if there exists $v_{i}\in V$ such that $e(v_{i},Z(i-1))\geq 2\log^{2}np$ and $v_{i}\notin Z(i-1)$ and if there does not exist, the sequence ends.

The sequence ends after the construction of $Z(i^{\star})$ . By construction, every $v\in V\setminus Z(i^{\star})$ satisfies the conditions (H1), (H2), and (H3). Since $H$ is the largest set of vertices satisfying (H1), (H2), and (H3), $|H|\geq|V\setminus Z(i^{\star})|$ .

The proof is hence completed if we show that $|Z(i^{\star})|<s$ . Let $t^{\star}=s-|Z_{1}|$ . If $i^{\star}\geq t^{\star},$ $|Z(t^{\star})|=s$ and since $|Z_{1}|\leq\frac{s}{2}$ ,

However, from the previous claim, we know that with high probability, all $S\subset V$ such that $|S|=s$ have to satisfy $e(S,S)\leq s\log^{2}np$ . Therefore, with high probability, $i^{\star}<t^{\star}$ and

A.4 Proof of Lemma A.6

Since $|\mathcal{E}^{(i)}|=O(1/p)$ and $e(v,V)\leq 10np$ when $v\in H$ ,

With the above inequality and (H1), we can bound $\frac{|\mathcal{E}^{(i+1)}|}{|\mathcal{E}^{(i)}|}$ as follows:

where $(a)$ stems from (H1), $(b)$ stems from the fact that $\sum_{v\in\mathcal{E}^{(i+1)}}(e(v,\mathcal{E}^{(i)})-\mu(v,\mathcal{E}^{(i)}))=1_{\mathcal{E}^{(i)}}^{T}\cdot X_{\Gamma}\cdot 1_{\mathcal{E}^{(i+1)}}$ where $1_{S}$ indicates the vector $v$ -th value is 1 if $v\in S$ and 0 otherwise, and $(c)$ stems from (H3). Since $|\mathcal{E}^{(i)}|=O(1/p)$ , we conclude that