Exact Recovery in the Stochastic Block Model

Emmanuel Abbe, Afonso S. Bandeira, Georgina Hall

Introduction

Learning community structures in graphs is a central problem in machine learning, computer science and complex networks. Increasingly, data is available about interactions among agents (e.g., social, biological, computer or image networks), and the goal is to infer from these interactions communities that are alike or complementary. As the study of community detection grows at the intersections of various fields, in particular computer science, machine learning, statistics and social computing, the notions of clusters, the figure of merits and the models vary significantly, often based on heuristics (see for a survey). As a result, the comparison and validation of clustering algorithms remains a major challenge. Key enablers to benchmark algorithms and to measure the accuracy of clustering methods are statistical network models. More specifically, the stochastic block model has been at the center of the attention in a large body of literature , as a testbed for algorithms (see for a survey) as well as a scalable model for large data sets (see and reference therein). On the other hand, the fundamental analysis of the stochastic block model (SBM) is still holding major open problems, as discussed next.

The SBM can be seen as an extension of the ErdősRényi (ER) model . In the ER model, edges are placed independently with probability pp, providing a models described by a single parameter. This model has been (and still is) a source of intense research activity, in particular due to its phase transition phenomena. It is however well known to be too simplistic to model real networks, in particular due to its strong homogeneity and absence of community structure. The stochastic block model is based on the assumption that agents in a network connect not independently but based on their profiles, or equivalently, on their community assignment. More specifically, each node vv in the graph is assigned a label xvXx_{v}\in\mathcal{X}, where X\mathcal{X} denotes the set of community labels, and each pair of nodes u,vVu,v\in V is connected with probability p(xu,xv)p(x_{u},x_{v}), where p(,)p(\cdot,\cdot) is a fixed probability matrix. Upon observing the graph (without labels), the goal of community detection is to reconstruct the community assignments, with either full or partial recovery.

Of particular interest is the SBM with two communities and symmetric parameters, also known as the planted bisection model, denoted in this paper by G(n,p,q)\mathcal{G}(n,p,q), with nn an even integer denoting the number of vertices. In this model, the graph has two clusters of equal size, and the probabilities of connecting are pp within the clusters and qq across the clusters (see Figure 1). Of course, one can only hope to recover the communities up to a global flip of the labels, in other words, only the partition can be recovered. Hence we use the terminology exact recovery or simply recovery when the partition is recovered correctly with high probability (w.h.p.), i.e., with probability tending to one as nn tends to infinity. When p=qp=q, it is clearly impossible to recover the communities, whereas for p>qp>q or p<qp<q, one may hope to succeed in certain regimes. While this is a toy model, it captures some of the central challenges for community detection.

A large body of literature in statistics and computer science has focused on determining lower-bounds on the scaling of pq|p-q| for which efficient algorithms succeed in recovering the two communities in G(n,p,q)\mathcal{G}(n,p,q). We overview these results in the next section. The best bound seems to come from , ensuring recovery for (pq)/pΩ(log(n)/n)(p-q)/\sqrt{p}\geq\Omega(\sqrt{\log(n)/n}), and has not be improved for more than a decade. More recently, a new phenomena has been identified for the SBM in a regime where p=a/np=a/n and q=b/nq=b/n . In this regime, exact recovery is not possible, since the graph is, with high probability, not connected. However, partial recovery is possible, and the focus has been shifted on determining for which regime of aa and bb it is possible to obtain a reconstruction of the communities which is asymptotically better than a random guess (which gets roughly 50%50\% of accuracy). In other words, to recover only a proportion 1/2+ε1/2+\varepsilon of the vertices correctly, for some ε>0\varepsilon>0. We refer to this reconstruction requirement as detection. In , it was conjectured that detection is possible if and only if (ab)2>2(a+b)(a-b)^{2}>2(a+b). This is a particularly fascinating and strong conjecture, as it provides a necessary and sufficient condition for detection with a sharp closed-form expression. The study of this regime was initiated with the work of Coja-Oghlan , which obtains detection when (ab)2>2log(a+b)(a+b)(a-b)^{2}>2\log(a+b)(a+b) using spectral clustering on a trimmed adjacency matrix. The conjecture was recently proved by Massoulie and Mossel et al. using two different efficient algorithms. The impossibility result was first proved in .

While the sparse regime with constant degree points out a fascinating threshold phenomena for the detection property, it also raises a natural question: does exact recovery also admit a similar phase transition? Most of the literature has been focusing on the scaling of the lower-bounds, often up to poly-logarithmic terms, and the answer to this question appears to be currently missing in the literature. In particular, we did not find tight impossibility results, or guarantees of optimality of the proposed algorithms. This paper answers this question, establishing a sharp phase transition for recovery, obtaining a tight bound with an efficient algorithm achieving it.

Related works

There has been a significant body of literature on the recovery property for the stochastic block model with two communities G(n,p,q)\mathcal{G}(n,p,q), ranging from computer science and statistics literature to machine learning literature. We provide next a partialThe approach of McSherry was recently simplified and extended in . list of works that obtain bounds on the connectivity parameters to ensure recovery with various algorithms:

While these algorithmic developments are impressive, we next argue how they do not reveal the sharp behavioral transition that takes place in this model. In particular, we will obtain an improved bound that is shown to be tight.

Information theoretic perspective and main results

In this paper, rather than starting with a specific algorithmic approach, we first seek to establish the information-theoretic threshold for recovery irrespective of efficiency requirements. Obtaining an information-theoretic benchmark, we then seek for an efficient algorithm that achieves it. There are several reasons to expect that an information-theoretic phase transition takes place for recovery in the SBM:

From a random graph perspective, note that recovery requires the graph to be at least connected (with high probability), hence (α+β)/2>1(\alpha+\beta)/2>1, for p=αlog(n)/np=\alpha\log(n)/n and q=βlog(n)/nq=\beta\log(n)/n is necessary. In turn, if α=0\alpha=0 or β=0\beta=0, then (α+β)/2<1(\alpha+\beta)/2<1 prohibits recovery (since the model has either two separate Erdős-Reényi graphs that are not connected, or a bipartite Erdős-Reényi graph which is not connected). So one can expect that recovery take place in the regime p=αlog(n)/np=\alpha\log(n)/n and q=βlog(n)/nq=\beta\log(n)/n, if and only if f(α,β)>1f(\alpha,\beta)>1 for some function ff that satisfies f(α,0)=α/2f(\alpha,0)=\alpha/2, f(0,β)=β/2f(0,\beta)=\beta/2 and where f(α,β)>1f(\alpha,\beta)>1 implies (α+β)/2>1(\alpha+\beta)/2>1. In particular, such a result has been shown to take place for the detection property , where a giant component is necessary, i.e., (a+b)/2>1(a+b)/2>1 for p=a/np=a/n and q=b/nq=b/n, and where detection is shown to be possible if and only if (a+b)/2+2ab/(a+b)>1(a+b)/2+2ab/(a+b)>1 (which is equivalent to (ab)2>2(a+b)(a-b)^{2}>2(a+b)). Note also that the regime p=αlog(n)/np=\alpha\log(n)/n and q=βlog(n)/nq=\beta\log(n)/n is the bottleneck regime for recovery, as other regimes lead to extremal behaviour of the model (either trivially possible or impossible to recover the communities).

From an information theory perspective, note that the SBM can be seen as specific code on a discrete memoryless channel. Namely, the community assignment is a vector x{0,1}nx\in\{0,1\}^{n}, the graph is a vector (or matrix) y{0,1}Ny\in\{0,1\}^{N}, N=(n2)N={n\choose 2}, where yijy_{ij} is the output of xixjx_{i}\oplus x_{j} through the discrete memoryless channel (1pp1qq)(\begin{smallmatrix}1-p&p\\ 1-q&q\end{smallmatrix}), for 1i<jn1\leq i<j\leq n. The problem is hence to decode xnx^{n} from yNy^{N} correctly with high probability.

This information theory model is a specific structured channel: first the channel is memoryless but it is not time homogeneous, since p=alog(n)/np=a\log(n)/n and q=blog(n)/nq=b\log(n)/n are scaling with nn. Then the code has a specific structure, it has constant right-degree of 2 and constant left-degree of n1n-1, and rate 2/(n1)2/(n-1). However, as shown in for the constant-degree regime, this model can be approximated by another model where the sparsity of the channel (i.e., the fact that pp and qq tend to 0) can be transferred to the code, which becomes an LDGM code of constant degree 2, and for which maximum-likelihood is expected to have a phase transition . It is then legitimate to expect a phase transition, as in coding theory, for the recovery of the input (the community assignment) from the output (the graph).

To establish the information-theoretic limit, note that, as for channel coding, the algorithm maximizing the probability of reconstructing the communities correctly is the Maximum A Posteriori (MAP) decoding. Since the community assignment is uniform, MAP is in particular equivalent to Maximum Likelihood (ML) decoding. Hence if ML fails in reconstructing the communities with high probability when nn diverges, there is no algorithm (efficient or not) which can succeed with high probability. However, ML amounts to finding a balanced cut (a bisection) of the graph which minimizes the number of edges across the cut (in the case a>ba>b), i.e., the min-bisection problem, which is well-known to be NP-hard. Hence ML can be usedML was also used for the SBM in , requiring however poly-logarithmic degrees for the nodes. to establish the fundamental limit but does not provide an efficient algorithm, which we consider in a second stage.

We now summarize the main results of this paper. Theorem 1 and Theorem 2 provide the information-theoretic limit for recovery. Theorem 1 establishes the converse, showing that the maximum likelihood estimator does not coincide with the planted partition w.h.p. if (α+β)/2αβ<1(\alpha+\beta)/2-\sqrt{\alpha\beta}<1 and Theorem 2 states that ML succeeds w.h.p. if (α+β)/2αβ>1(\alpha+\beta)/2-\sqrt{\alpha\beta}>1. One can express the recovery requirement as

where (α+β)/2>1(\alpha+\beta)/2>1 is the requirement for the connectivity threshold (which is necessary), and the oversampling term αβ\sqrt{\alpha\beta} is needed to allow for recovery (this is also equivalent to αβ>1\sqrt{\alpha}-\sqrt{\beta}>1 for α>β\alpha>\beta). Analyzing ML requires a sharp analysis of the tail event of the sum of discrete random variables tending to constants with the number of summands. Interestingly, standard estimates à la CLT, Chernoff, or Sanov’s Theorem do not provide the right answer in our regime due to the slow concentration taking place.

Note that the best bounds from the table of Section 2 are obtained from and , which allow for recovery in the regime where p=αlog(n)/np=\alpha\log(n)/n and q=βlog(n)/nq=\beta\log(n)/n, obtaining the conditions (αβ)2>64(α+β)(\alpha-\beta)^{2}>64(\alpha+\beta) in and (αβ)2>72(α+β)(\alpha-\beta)^{2}>72(\alpha+\beta) in . Hence, although these works reach the scaling for nn where the threshold takes place, they do not obtain the right threshold behaviour in terms the parameters α\alpha and β\beta.

For efficient algorithms, we propose first an algorithm based on a semidefinite programming relaxation of ML, and show in Theorem 3 that it succeeds in recovering the communities w.h.p. when (αβ)2>8(α+β)+8/3(αβ)(\alpha-\beta)^{2}>8(\alpha+\beta)+8/3(\alpha-\beta). This is shown by building a candidate dual certificate and showing that it indeed satisfies all the require properties, using Berstein’s matrix inequality. To compare this expression with the optimal threshold, the latter can be rewritten as (αβ)2>4(α+β)4(\alpha-\beta)^{2}>4(\alpha+\beta)-4 and α+β>2\alpha+\beta>2. The SDP is hence provably successful with a slightly looser threshold. It however already improves on the state of the art for exact recovery in the SBM, since the above condition is implied by (αβ)2>31/3(α+β)(\alpha-\beta)^{2}>31/3(\alpha+\beta), which improves on . Moreover, numerical simulations suggest that the SDP algorithm works all the way down to the optimal threshold, and the analysis may not be tight. The success of the SDP algorithm under the model of this paper, suggest that it may have robustness properties relevant in practical contexts.

Finally, we provide in Section 7.2 an efficient algorithm whose guarantees match the information theoretical threshold, using an efficient partial recovery algorithm, followed by a procedure of local improvements.

Additional related literature

From an algorithmic point of view, the censored block model investigated in is also related to this paper. It considers the following problem: GG is a random graph from the ER(n,p)ER(n,p) ensemble, and each node vv is assigned an unknown binary label xvx_{v}. For each edge (i,j)(i,j) in GG, the variable Yij=xi+xj+Zijmod2Y_{ij}=x_{i}+x_{j}+Z_{ij}\mod 2 is observed, where ZijZ_{ij} are i.i.d. Bernouilli(ϵ\epsilon) variables. The goal is to recover the values of the node variables from the {Yij}\{Y_{ij}\} variables. Matching bounds are obtained in for ε\varepsilon close to 1/21/2, with an efficient algorithm based on SDP, which is related to the algorithm developed in this paper.

Shortly after the posting of this paper on arXiv, a paper of Mossel, Neeman and Sly , fruit of a parallel research effort, was posted for the recovery problem in G(n,p,q)\mathcal{G}(n,p,q). In , the authors obtained a similar type of result as in this paper, slightly more general, allowing in particular for the parameters aa and bb to depend on nn as long as both parameters are Θ(1)\Theta(1).

Information theoretic lower bound

In this section we prove an information theoretic lower bound for exact recovery on the stochastic block model. The techniques are similar to the estimates for decoding a codeword on a memoryless channel with a specific structured codes.

Recall the G(n,p,q)\mathcal{G}(n,p,q) stochastic block model: nn denotes the number of vertices in the graph, assumed to be even for simplicity, for each vertex v[n]v\in[n], a binary label XvX_{v} is attached, where {Xv}v[n]\{X_{v}\}_{v\in[n]} are uniformly drawn such that {v[n]:Xv=1}=n/2|\{v\in[n]:X_{v}=1\}|=n/2, and for each pair of distinct nodes u,v[n]u,v\in[n], an edge is placed with probability pp if Xu=XvX_{u}=X_{v} and qq if XuXvX_{u}\neq X_{v}, where edges are placed independently conditionally on the vertex labels. In the sequel, we consider p=αlog(n)/np=\alpha\log(n)/n and q=βlog(n)/nq=\beta\log(n)/n, and focus on the case α>β\alpha>\beta to simplify the writing.

Let α>β0\alpha>\beta\geq 0. If (α+β)/2αβ<1(\alpha+\beta)/2-\sqrt{\alpha\beta}<1, or equivalently, if either α+β<2\alpha+\beta<2 or (αβ)2<4(α+β)4(\alpha-\beta)^{2}<4(\alpha+\beta)-4 and α+β2\alpha+\beta\geq 2, then ML fails in recovering the communities with probability bounded away from zero.

If β=0\beta=0, recovery is possibly if and only if there are no isolated nodes which is known to have a sharp threshold at α=2\alpha=2. We will focus on α>β>0\alpha>\beta>0.

Let AA and BB denote the two communities, each with n2\frac{n}{2} nodes.

and let HH be a fixed subset of AA of size nγ(n)\frac{n}{\gamma(n)}. We define the following events:

where E(,)E(\cdot,\cdot) is the number of edges between two sets. Note that we identify nodes of our graph with integers with a slight abuse of notation when there is no risk of confusion.

By symmetry, the probability of a failure in BB is also at least 23\frac{2}{3} so, by union bound, with probability at least 13\frac{1}{3} both failures will happen simultaneously which implies that ML fails. ∎

It is easy to see that ΔFHFA\Delta\cap F_{H}\Rightarrow F_{A} and Lemma 10 states that

which together with Lemma 1 concludes the proof. ∎

Recall the definitions in (2) and (3). If

FH(i)F_{H}^{(i)} are independent and identically distributed random variables so

where the last inequality used the hypothesis ρ(n)>n1γ(n)log(10)\rho(n)>n^{-1}\gamma(n)\log(10). ∎

Let NN be a natural number, p,qp,q\in, and ϵ0\epsilon\geq 0, we define

where W1,,WNW_{1},\dots,W_{N} are i.i.d. Bernoulli(p)(p) and Z1,,ZNZ_{1},\dots,Z_{N} are i.i.d. Bernoulli(q)(q), independent of W1,,WNW_{1},\dots,W_{N}.

From the definitions in (2) and (3) we have

where W1,,WNW_{1},\dots,W_{N} are i.i.d. Bernoulli(αlog(n)n)\left(\frac{\alpha\log(n)}{n}\right) and Z1,,ZNZ_{1},\dots,Z_{N} are i.i.d. Bernoulli(βlog(n)n)\left(\frac{\beta\log(n)}{n}\right), all independent. Since

Hence ρ(n)>n1γ(n)log(10)\rho(n)>n^{-1}\gamma(n)\log(10), and the conclusion follows from Lemma 3. ∎

Information theoretic upper bound

We present now the main result of this Section.

If α+β2αβ>1\frac{\alpha+\beta}{2}-\sqrt{\alpha\beta}>1, i.e., if α+β>2\alpha+\beta>2 and (αβ)2>4(α+β)4(\alpha-\beta)^{2}>4(\alpha+\beta)-4, then the maximum likelihood estimator exactly recovers the communities (up to a global flip), with high probability.

The case β=0\beta=0 follows directly from the connectivity threshold phenomenon on Erdős-Rényi graphs so we will restrict our attention to α>β>0\alpha>\beta>0.

We will prove this theorem through a series of lemmas. The techniques are similar to the estimates for decoding a codeword on a memoryless channel with a specific structured codes. In what follows we refer to the true community partition as the ground truth.

If the maximum likelihood estimator does not coincide with the ground truth, then there exists 1kn41\leq k\leq\frac{n}{4} and a set AwAA_{w}\subset A and BwBB_{w}\subset B with Aw=Bw=k|A_{w}|=|B_{w}|=k such that

Recall that the maximum likelihood estimator finds two equally sized communities (of size n2\frac{n}{2} each) that have the minimum number of edges between them, thus for it to fail there must exist another balanced partition of the graph with a smaller cut, let us call it ZAZ_{A} and ZBZ_{B}. Without loss of generality ZAAn4Z_{A}\cap A\geq\frac{n}{4} and ZBBn4Z_{B}\cap B\geq\frac{n}{4}. Picking Aw=ZBAA_{w}=Z_{B}\cap A and Bw=ZABB_{w}=Z_{A}\cap B gives the result. ∎

Let FF be the event of the maximum likelihood estimator not coinciding with the ground truth. Given AwA_{w} and BwB_{w} both of size kk, define Pn(k)P_{n}^{(k)} as

We have, by a simple union bound argument,

Let WiW_{i} be a sequence of i.i.d. Bernoulli(αlognn)\left(\frac{\alpha\log n}{n}\right) random variables and ZiZ_{i} an independent sequence of i.i.d. Bernoulli(βlognn)\left(\frac{\beta\log n}{n}\right) random variables, note that (cf. Definition 3),

We thus have, combining (12) and (13), and using (nk)(ne/k)k{n\choose k}\leq(ne/k)^{k},

Recall that FF is the event of the maximum likelihood estimator not coinciding with the ground truth. We next show that for ϵ>0\epsilon>0, if,

then there exists a constant c>0c>0 such that

Note that, for sufficiently large nn, 1kn41\leq k\leq\frac{n}{4} we have

and nk4ϵn14ϵn^{-\frac{k}{4}\epsilon}\leq n^{-\frac{1}{4}\epsilon}. Hence, for sufficiently large nn,

which, together with the observation that k=1n/4exp[23k(log2k3)]=O(1)\sum_{k=1}^{n/4}\exp\left[-\frac{2}{3}k\left(\log 2k-3\right)\right]=O(1), concludes the proof of the theorem. ∎

Efficient algorithms

We propose and analyze an algorithm, based in semidefinite programming (SDP), to efficiently reconstruct the two communities. Let G=(V,E(G))\mathcal{G}=(V,E(\mathcal{G})) be the observed graph, where edges are independently present, with probability αlog(n)n\frac{\alpha\log(n)}{n} if they connect two nodes in the same community and with probability βlog(n)n\frac{\beta\log(n)}{n} if they connect two nodes in different communities, with α>β\alpha>\beta. Recall that there are n nodes in this graph and that with a slight abuse of notation, we will identify nodes in the graph by an integer in [n][n]. Our goal is to recover the two communities in G\mathcal{G}.

The proposed algorithm will attempt to maximize the following

Our approach will be to consider a simple SDP relaxation to this combinatorial problem. The SDP relaxation considered here dates back to the seminal work of Goemans and Williamson on the Max-Cut problem. The techniques behind our analysis are similar to the ones used by the first two authors on a recent publication :

If (αβ)2>8(α+β)+83(αβ)(\alpha-\beta)^{2}>8(\alpha+\beta)+\frac{8}{3}(\alpha-\beta), the following holds with high probability: (19) has a unique solution which is given by the outer-product of g{±1}ng\in\left\{\pm 1\right\}^{n} whose entries corresponding to community AA are 11 and community BB are 1-1. Hence, if (αβ)2>8(α+β)+83(αβ)(\alpha-\beta)^{2}>8(\alpha+\beta)+\frac{8}{3}(\alpha-\beta), full recovery of the communities is possible in polynomial time.

We will prove this result through a series of lemmas. Recall that G\mathcal{G} is the observed graph and that the vector gg corresponds to the correct choice of communities. As stated above, the optimization problem (19) is an SDP (Semidefinite Program) and any SDP can be solved in polynomial time using methods such as the Interior Point Method. Hence if we can prove that the solution of (19) is gg, then we will have proved that the algorithm can recover the correct choice of communities in polynomial time.

Recall that the degree matrix DD of a graph GG is a diagonal matrix where each diagonal coefficient DiiD_{ii} corresponds to the number of neighbours of vertex ii and that λ2(M)\lambda_{2}(M) is the second smallest eigenvalue of a symmetric matrix MM.

Let G+\mathcal{G}_{+} (resp. G\mathcal{G}_{-}) be a subgraph of G\mathcal{G} that includes the edges that link two nodes in the same community (resp. in different communities) and AA the adjacency matrix of G\mathcal{G}. We denote by DG+D_{\mathcal{G}}^{+} (resp. DGD_{\mathcal{G}}^{-}) the degree matrix of G+\mathcal{G}_{+} (resp. G\mathcal{G}_{-}) and define the Stochastic Block Model Laplacian to be

then ggTgg^{T} is the unique solution to the SDP (19).

We can suppose that g=(1,...,1,1,...,1)Tg=(1,...,1,-1,...,-1)^{T} WLOG. First of all, we obtain a sufficient condition for ggTgg^{T} to be a solution to SDP (19) by using the KKT conditions. This will give us the first part of condition (20). The primal problem of SDP (19) is

ggTgg^{T} is guaranteed to be an optimal solution to SDP (19) under the following conditions:

ggTgg^{T} is a feasible solution for the primal problem

There exists a matrix YY feasible for the dual problem such that Tr(BggT)=Tr(Y)\operatorname{Tr}(Bgg^{T})=\operatorname{Tr}(Y).

The first point being trivially verified, it remains to find such a YY (known as a dual certificate). Generally, one can also use complementary slackness to help find such a certificate but, in this case, it is equivalent to strong duality.

Define a correct (resp. incorrect) edge to be an edge between two nodes in the same (resp. different) community and a correct (resp. incorrect) non-edge to be the absence of an edge between two nodes in different (resp. same) communities. Notice that (BggT)ii(Bgg^{T})_{ii} counts positively the correct edges and non-edges incident from node ii and negatively incorrect edges and incorrect non edges incident from node ii. In other words

Hence: Tr(BggT)=Tr(2(DG+DG)+In)\operatorname{Tr}\left(Bgg^{T}\right)=\operatorname{Tr}\left(2\left(D_{\mathcal{G}}^{+}-D_{\mathcal{G}}^{-}\right)+I_{n}\right) so Y=2(DG+DG)+InY=2\left(D_{\mathcal{G}}^{+}-D_{\mathcal{G}}^{-}\right)+I_{n} verifies Tr(BggT)=Tr(Y)\operatorname{Tr}(Bgg^{T})=\operatorname{Tr}(Y) and, thus defined, is diagonal. As long as 2(DG+DG)+InB2\left(D_{\mathcal{G}}^{+}-D_{\mathcal{G}}^{-}\right)+I_{n}\succeq B, or in other words, 2LSBM+In11T02L_{SBM}+I_{n}-\textbf{1}\textbf{1}^{T}\succeq 0, we can then conclude that ggTgg^{T} is an optimal solution for SDP (19).

The second part of condition (20) ensures that ggTgg^{T} is the unique solution to SDP (19). Suppose that XX^{*} is another optimal solution to SDP (19), then Tr(X(2(DG+DG)+InB))=Tr(X(2LSBM+In11T))=0\operatorname{Tr}\left(X^{*}\left(2\left(D_{\mathcal{G}}^{+}-D_{\mathcal{G}}^{-}\right)+I_{n}-B\right)\right)=Tr\left(X^{*}\left(2L_{SBM}+I_{n}-\textbf{1}\textbf{1}^{T}\right)\right)=0 from complementary slackness and X0X^{*}\succeq 0. By assumption, the second smallest eigenvalue of 2LSBM+In11T2L_{SBM}+I_{n}-\textbf{1}\textbf{1}^{T} is non-zero. This entails that gg spans all of its null space. Combining this with complementary slackness, the fact that X0X^{*}\succeq 0 and 2LSBM+In11T02L_{SBM}+I_{n}-\textbf{1}\textbf{1}^{T}\succeq 0, we obtain that XX^{*} needs to be a multiple of ggTgg^{T}. Since Xii=1X^{*}_{ii}=1 we must have X=ggTX^{*}=gg^{T}. ∎

Given Lemma (6), the next natural step would be to control the eigenvalues of 2LSBM+In11T2L_{SBM}+I_{n}-\textbf{1}\textbf{1}^{T} when nn\rightarrow\infty. We want to use Benstein’s inequality to do this; to make its application easier, we rewrite 2LSBM+In11T2L_{SBM}+I_{n}-\textbf{1}\textbf{1}^{T} as a linear combination of elementary deterministic matrices with random coefficients. Define

where the (αij+)i,j(\alpha_{ij}^{+})_{i,j}, (αij)i,j(\alpha_{ij}^{-})_{i,j} are independent and independent of each other. Define

where eie_{i} (resp. eje_{j}) is the vector of all zeros except the ithi^{th} (resp. jthj^{th}) coefficient which is 1. Using these definitions, we can then write 2LSBM+In11T2L_{SBM}+I_{n}-\textbf{1}\textbf{1}^{T} as the difference of two matrices CC and Γ\Gamma where Γ\Gamma is a zero-expectation matrix and CC, a deterministic matrix that corresponds to the expectation, ie

2 Efficient full recovery from efficient partial recovery

In this section we show how to leverage state of the art algorithms for partial recovery in the sparse case in order to construct an efficient algorithm that achieves exact recovery down to the optimal information theoretical threshold.

The algorithm proceeds by splitting the information obtained in the graph into a part that is used by the partial recovery algorithm and a part that is used for the local steps. In order to make the two steps (almost) independent, we propose the following procedure: First take a random partition of the edges of complete graph on the nn nodes into 2 graphs H1H_{1} and H2H_{2} (done independently of the observed graph G\mathcal{G}). H1H_{1} is an Erdos-Renyi graph on n nodes with edge probability C/log(n)C/\log(n), H2H_{2} is the complement of H1H_{1}. We then define G1G_{1} and G2G_{2} subgraphs of G\mathcal{G} as G1=H1GG_{1}=H_{1}\cap\mathcal{G} and G2=H2GG_{2}=H_{2}\cap\mathcal{G}. In the second step, we apply Massoulie’s algorithm for partial recovery to G1G_{1}. As G1G_{1} is an SBM graph with parameters (Cα,Cβ)(C\alpha,C\beta), this algorithm is guaranteed to output, with high probability, a partition of the n nodes into two communities AA^{\prime} and BB^{\prime}, such that the partition is correct for at least (1δ(C))n(1-\delta(C))n nodes, where δ(C)0\delta(C)\to 0 as CC\to\infty. In other words, AA^{\prime} and BB^{\prime} coincide with AA and BB (the correct communities) on at least (1δ(C))n(1-\delta(C))n nodes. Lastly, we flip some of the nodes’ memberships depending on the edges they have in G2G_{2}. Using the communities AA^{\prime} and BB^{\prime} obtained in the previous step, we flip the membership of a given node if it has more edges in G2G_{2} going to the opposite community than it has to its own. If the the number of flips in each cluster is not the same, keep the clusters unchanged.

If α+β2αβ>1\frac{\alpha+\beta}{2}-\sqrt{\alpha\beta}>1, then, there exists large enough CC (depending only on α\alpha and β\beta) such that, with high probability, the algorithm described above will successfully recover the communities from the observed graph.

In the following, we will suppose that the partial recovery algorithm succeeds as described above w.h.p. and we want to show that when α+β2αβ>1\frac{\alpha+\beta}{2}-\sqrt{\alpha\beta}>1 and δ\delta small enough, the probability that there exists a node that doesn’t belong to the correct community, after the local improvements, goes to 0 when nn\rightarrow\infty. Our goal is to union bound over all possible nodes. We are thus interested in the probability that a node is mislabeled at the end of the algorithm.

Recall the random variables (Wi)i(W_{i})_{i} and (Zi)i(Z_{i})_{i} iid and mutually independent Bernoulli random variables with expectations respectively αlog(n)/n\alpha\log(n)/n and βlog(n)/n\beta\log(n)/n. WiW_{i} represents if there is an edge between two nodes in the same community and ZiZ_{i} if there is an edge between two nodes in different communities. Define (Wi)i(W^{\prime}_{i})_{i} and (Zi)i(Z^{\prime}_{i})_{i} iid copies of (Wi)i(W_{i})_{i} and (Zi)i(Z_{i})_{i}. For simplicity, we start by assuming that H2H_{2} is the complete graph. In this case we have at most δ(C)n\delta(C)n incorrectly labelled nodes (ie δ(C)n2\delta(C)\frac{n}{2} nodes that are in A but belong to B’ and δ(C)n2\delta(C)\frac{n}{2} nodes that are in B but belong to A’). A node in the graph is mislabeled only if it has at least as many connections to the wrong cluster as connections to the right one. This is illustrated in Figure 3. We can express the event with the random variables (Zi)i(Z_{i})_{i}, (Wi)i(W_{i})_{i}, their copies, and δ(C)\delta(C).

Recall that we assumed that H2H_{2} was a complete graph. In reality, using Lemma 14, it can be shown that the degree of any node in H2H_{2} is at least n(12Clog(n))n\left(1-2\frac{C}{\log(n)}\right) w.h.p. . Taking this into consideration, we will loosely upperbound (34) by removing 2Clog(n)n2\frac{C}{\log(n)}n on both the rhs terms. Notice that the removal of edges is independent of the outcome of the random variables and

Lemma 9 shows that (35) can be upperbounded as follows

Notice that g(α,β,δ)g(\alpha,\beta,\delta^{\prime}) is a function that converges continuously to f(α,β)f(\alpha,\beta) when δ0\delta^{\prime}\rightarrow 0. In this particular case, this is verified as γδ(C)0-\gamma\delta(C)\rightarrow 0 when CC\rightarrow\infty. Using a union bound on all nodes

For δ(C)\delta(C) small enough (ie C large enough) and 1f(α,β)<01-f(\alpha,\beta)<0, (40) goes to 0 when nn\rightarrow\infty. ∎

Conclusion and open problems

Note that at high SNR (large αβ\alpha-\beta), the SDP based algorithm succeeds in the regime of the optimal threshold obtained with ML, up to a factor 22. When running numerical simulations however, it would seem that the SDP based method achieves exact recovery all the way down to the optimal threshold. As a consequence, the additional factor 22 is likely a limitation of the analysis, in particular the matrix Bernstein inequality, rather than the algorithm itself. It remains open to show that this algorithm (or a spectral algorithm) achieves the optimal bound α+β2αβ>1\frac{\alpha+\beta}{2}-\sqrt{\alpha\beta}>1. While we obtain that there is no gap between what can be achieved with an efficient algorithm and the maximum likelihood, as shown in Section 7.2 using black-box algorithms for partial recovery and local improvement, obtaining direct algorithms would still be interesting. It would also be interesting to understand if efficient algorithms achieving the detection threshold can be used to achieve the recovery threshold and vice versa, or whether targeting the two different thresholds leads to different algorithmic developments.

Finally, it is natural to expect that the results obtained in this paper extend to a much more general family of network models, with multiple clusters, overlapping communities and labelled edges .

The authors are grateful to Dustin G. Mixon for thoroughly reading a preliminary version of this manuscript and providing useful comments, as well as to Philippe Rigollet and Van Vu for stimulating discussions on the efficient recovery via partial recovery.

References

Appendix A Proof of technical lemmas

Let mm be a natural number, p,qp,q\in, and δ0\delta\geq 0, we define

where W1,,WmW_{1},\dots,W_{m} are i.i.d. Bernoulli(p)(p) and Z1,,ZmZ_{1},\dots,Z_{m} are i.i.d. Bernoulli(q)(q), independent of W1,,WmW_{1},\dots,W_{m}.

For a better understanding of some of the proofs that follow, it is important to consider the behavior of T(n/2,αlog(n)/n,βlog(n)/n,0)T\left(n/2,\alpha\log(n)/n,\beta\log(n)/n,0\right) when nn\rightarrow\infty. It can be shown that

This result is particularly interesting as one can’t hope to obtain this bound using standard techniques such as Central Limit Theorem approximations or Chernoff bounds. This comes from the fact that when using these bounds, the error on the exponent is of order O(log(n))O(\log(n)) which is relevant here. In the same way, when using an approximation of the binomial coefficients to prove (42), one has to rely on tight estimates. Equation (42) has been extended to other values of the parameters (typically T(n/2,αlog(n)/n,βlog(n)/n,ϵ)T\left(n/2,\alpha\log(n)/n,\beta\log(n)/n,\epsilon\right) where ϵ\epsilon is given) but the main idea is contained in equation (42).

The idea in the subsequent proofs will be to bound T(m,p,q,εlog(n))T(m,p,q,\varepsilon\log(n)) with its dominant term T(m,p,q,ϵ)T^{*}(m,p,q,\epsilon) that we define below. As a consequence, it is particularly important to bound this dominant term as well, which is what is done in the following lemma.

We recall that p=αlog(n)np=\frac{\alpha\log(n)}{n} and q=βlog(n)nq=\frac{\beta\log(n)}{n} and we define:

where ϵ=O(1)\epsilon=O(1). We also define the function

Then we have the following results for T(m,p,q,ϵ)=maxτ>0V(m,p,q,τ,ϵ)T^{*}(m,p,q,\epsilon)=\max_{\tau>0}V(m,p,q,\tau,\epsilon) :

For cnm<cn3/2cn\leq m<c^{\prime}n^{3/2} and τ>0\forall\tau>0:

The proof is mainly computational and the main difficulty comes from upperbounding or lowerbounding the binomial coefficients. We start by writing out log(V(m,p,q,τ,ϵ))\log(V(m,p,q,\tau,\epsilon)).

In this expression, we replace p,qp,q by their expressions given above and obtain

To prove (44) we upperbound the binomial coefficients using the following result: if knk\leq n, then (nk)(nek)k\binom{n}{k}\leq\left(\frac{ne}{k}\right)^{k} and get

To prove (45) we lowerbound the binomial coefficients using the following bound for the binomial coefficient, for kNk\leq\sqrt{N} (see for a nice presentation)

Given the condition on mm, we can use the previous inequality and we obtain

We expand log(1+τ1nlog(n))\log(1+\tau\frac{1}{n}\log(n)) as mcnm\geq cn to get

We minimize h(α,β,τ,ϵ)h(\alpha,\beta,\tau,\epsilon) with respect to τ\tau. We obtain

We replace τ\tau by τ\tau^{*} in (49) and (57) and obtain the results given in the lemma. ∎

Let δ=δ(n)=log(n)/loglog(n)\delta=\delta(n)=\lceil\log(n)/\log\log(n)\rceil. For sake of brevity we take p=αlognnp=\frac{\alpha\log n}{n} and q=αlognnq=\frac{\alpha\log n}{n}.

By definition, T(n/2,p,q,δ)T(n/2,p,q,\delta) is larger than the probability that i=1n/2(ZiWi)\sum_{i=1}^{n/2}(Z_{i}-W_{i}) is equal to δ\delta, hence

Choosing k=τlog(n)k=\tau\log(n), for τ>0\tau>0, kk is in the range [0,n/2δ][0,n/2-\delta] for nn sufficiently large

where (n/2τlog(n)){n/2\choose\tau\log(n)} is defined as (n/2τlog(n)){n/2\choose\lfloor\tau\log(n)\rfloor} if τlog(n)\tau\log(n) is not an integer and ε=1/loglog(n)\varepsilon=1/\log\log(n).

We use the result from Lemma 1 with m=n2m=\frac{n}{2} and ϵ=1/loglog(n)\epsilon=1/\log\log(n) and notice that

Let WiW_{i} be a sequence of i.i.d. Bernoulli(αlognn)\left(\frac{\alpha\log n}{n}\right) random variables and ZiZ_{i} an independent sequence of i.i.d. Bernoulli(βlognn)\left(\frac{\beta\log n}{n}\right) random variables. Recall (Definition 3) that

The following bound holds for nn sufficiently large:

In the following, for clarity of notation, we have omitted the floor/ceiling symbols for numbers that are not integers but should be. Recall that

where ZZ is a Binomial(m,q)(m,q), WW is a Binomial(m,p)(m,p) and p=αlog(n)np=\frac{\alpha\log(n)}{n}, q=αlog(n)nq=\frac{\alpha\log(n)}{n}.

The idea behind the proof is to bound log(T(m,p,q,0))\log(T(m,p,q,0)) with the dominant term log(T(m,p,q,0)\log(T^{*}(m,p,q,0) when nn is large and then use Lemma 7. Notice that n2mn24n-2\leq m\leq\frac{n^{2}}{4}, we split the proof into 2 parts based on the regime of mm.

The first case corresponds to mm such that mnloglognm\geq n\log\log n. What is important is that n=o(m)n=o(m). We have

Notice that each term in the double-sum can be upper-bounded by T(m,p,q,0)T^{*}(m,p,q,0) as defined in (7). Hence

As mnloglogn\frac{m}{n}\geq\log\log n and mn2/4m\leq n^{2}/4, notice that log(m)=o(mnlog(n))\log(m)=o\left(\frac{m}{n}\log(n)\right) and

The second case corresponds to m<nloglognm<n\log\log n. We define δ=mn\delta=\frac{m}{n} and note that δ<loglogn\delta<\log\log n. Notice that the same idea as in the proof above does not work when mm is O(n)O(n). Nevertheless a similar idea gives valid results by restricting ourselves to the first log2(n)\log^{2}(n) terms of the sum over mm, breaking TT as

We can hence apply Bernestein’s inequality and get, for any t0t\geq 0,

Here we take t=m(αβ)log(n)n+log2(n)t=m(\alpha-\beta)\frac{\log(n)}{n}+\log^{2}(n)

And plugging (A.1) and (78) into (73) we obtain

From (44) and e(Ω(1)log2(n)loglogn)=o(elog(n))e^{\left(-\Omega(1)\frac{\log^{2}(n)}{\log\log n}\right)}=o\left(e^{\log(n)}\right) we get

Let (Wi)i(W_{i})_{i} and (Zi)i(Z_{i})_{i} be iid and mutually independent Bernouillis with expectations respectively αlog(n)n\frac{\alpha\log(n)}{n} and βlog(n)n\frac{\beta\log(n)}{n}. Define (Wi)i(W^{\prime}_{i})_{i} and (Zi)i(Z^{\prime}_{i})_{i} iid copies of (Wi)i(W_{i})_{i} and (Zi)i(Z_{i})_{i}. Then:

where g(α,β,δ)g(\alpha,\beta,\delta) is defined in (43).

where γ=1δlog(1/δ)\gamma=\frac{1}{\delta\sqrt{\log(1/\delta)}}. For the second part of (85), we upperbound using multiplicative Chernoff. Mutliplicative Chernoff states

In our case 1+ϵ=γα1+\epsilon=\frac{\gamma}{\alpha}. To simplify notation we will write δ\delta instead of δ(C)\delta(C) in the following.

For the first part of (85), we adapt Lemma 8.

As shown in Lemma 8 the second part of inequality (93) can be upperbounded in the following way

We now upperbound the first part of inequality (93) in a similar way to Lemma 8.

We group inequalities (93) and (A.1), we then take the log and using (44) we obtain

A.2 Information Theoretic Lower Bound Proofs

Recall that Δ\Delta is the event that in a graph with n/log3(n)n/\log^{3}(n) vertices where each pair of nodes is connected, independently, with probability αlognn\frac{\alpha\log n}{n}, every node has degree strictly less than lognloglogn\frac{\log n}{\log\log n}.

Let Δi\Delta_{i} be the probability that the degree of node ii is smaller than lognloglogn\frac{\log n}{\log\log n}. Let XiX_{i} be iid Bernoulli(αlognn)\left(\frac{\alpha\log n}{n}\right) random variables, then

We consider a slightly weaker version (since μ>0\mu>0)

This means that, by setting t=log2nαlognloglogn=log3nαloglognt=\frac{\log^{2}n}{\alpha}\frac{\log n}{\log\log n}=\frac{\log^{3}n}{\alpha\log\log n} so that tμ=lognloglognt\mu=\frac{\log n}{\log\log n}, we have

By union bound we have, for any vertex ii,

A.3 SDP Algorithm Proofs

Recall that ΓS\Gamma^{S} (resp. CSC^{S}) denotes the projection of Γ\Gamma (resp. C) onto the space S.

C is the following deterministic symmetric matrix

Assuming α>β\alpha>\beta the eigenvalues of C take three distinct values

λ1=n2βlog(n)\lambda_{1}=n-2\beta\log(n) associated to the eigenvector 1

λ2=0\lambda_{2}=0 associated to the eigenvector corresponding to the ground truth g

λ3=(αβ)log(n)\lambda_{3}=(\alpha-\beta)\log(n) associated to all other eigenvectors

g is also an eigenvector for Γ\Gamma corresponding to eigenvalue 0. As a consequence, condition (33) is satisfied if the following holds on the orthogonal of g

(Matrix Bernstein) Consider a finite sequence {Xk}\{X_{k}\} of independent, random, self adjoint matrices with dimension d. Assume that each random matrix satisfies

This particular formulation of the Theorem can be found in .

Let ϵ>0\epsilon>0. Let Q=11TnQ=\frac{\textbf{1}\textbf{1}^{T}}{n} be the projection matrix onto the 1 space. Then:

using the fact that Δij+Q=0n\Delta_{ij}^{+}Q=0_{n} and the fact that QTΔijQ=4nQQ^{T}\Delta_{ij}^{-}Q=-\frac{4}{n}Q.

For big n this is clearly verified as en2=o(n(1+ϵ))e^{-n^{2}}=o\left(n^{-(1+\epsilon)}\right). ∎

Let P=In11TnP=I_{n}-\frac{\textbf{1}\textbf{1}^{T}}{n} be the projection matrix onto the 1\perp\textbf{1} space. We have:

R=max(R1,R2)R=\max(R_{1},R_{2}) where R1R_{1} corresponds to Γ1\Gamma_{1} and R2R_{2} corresponds to Γ21\Gamma_{2}^{\perp\textbf{1}}.

σ2=σ12+σ22\sigma^{2}=\sigma_{1}^{2}+\sigma_{2}^{2} where σ12\sigma_{1}^{2} corresponds to Γ1\Gamma_{1} and σ22\sigma_{2}^{2} corresponds to Γ21\Gamma_{2}^{\perp\textbf{1}}.

We then apply Theorem (5) using the values found previously and obtain

A.4 Full Recovery Algorithm Proof

With high probability, the degree of any node in H1H_{1} is at most 2Clog(n)n2\frac{C}{\log(n)}n.

where the right hand side goes to 0 when nn\rightarrow\infty as C is fixed. Hence using a union bound on all nodes