Community detection thresholds and the weak Ramanujan property

Laurent Massoulie

Introduction

Community detection, like clustering, aims to identify groups of similar items from a global population. It is a useful primitive for performing recommendation, e.g. of contents or contacts to users of online social networks. The stochastic block model has been introduced by Holland et al. to represent interactions between individuals. It consists of a random graph on nn nodes, each node iN={1,,n}i\in\mathcal{N}=\{1,\ldots,n\} being assigned a type σi\sigma_{i} from some fixed set Σ\Sigma. Conditionally on node types, edge (i,j)(i,j) is present with probability p(σi,σj)p(\sigma_{i},\sigma_{j}) independently of other edges, for some matrix of probabilities (p(σ,σ))(p(\sigma,\sigma^{\prime})).

It constitutes an adequate testbed for community detection. Indeed the performance of candidate detection schemes, captured by the fraction of nodes ii for which estimated types σ^i\hat{\sigma}_{i} and true types σi\sigma_{i} coincide, can be compared and analysed on instances of the stochastic block model. Such analyses can in turn suggest new schemes.

Recently Decelle et al. conjectured the existence of a phase transition in the sparse regime where the graph’s average degree is O(1)O(1). Specifically, they predicted that for parameters below a certain threshold, no estimates σ^i\hat{\sigma}_{i} of node types existed that would be positively correlated with true types σi\sigma_{i}, while above the threshold, belief propagation algorithms could determine estimates σ^i\hat{\sigma}_{i} achieving such a positive correlation. Their conjecture is formulated on a simple symmetric instance of the stochastic block model featuring two node types {+1,1}\{+1,-1\}. The phenomenon appears more general though: Heimlicher et al. extended the conjecture to the more general setup of labeled stochastic block models.

The study of this phenomenon is important for two reasons. First, by localizing precisely the transition point below which no useful signal is present in the observations, one thus characterizes how much subsampling of the original graph can be performed before all information is lost. Second, algorithms leading to estimates σ^i\hat{\sigma}_{i} that achieve positive correlation all the way down to the transition are expected to constitute more robust approaches than alternatives which would fail before the transition. It is therefore important to determine such algorithms.

The negative part of the conjecture has been proven by Mossel, Neeman and Sly . Essentially they established that existence of estimates σ^i\hat{\sigma}_{i} positively correlated with true types σi\sigma_{i} would imply feasibility of a reconstruction problem on a random tree model describing the local statistics of the original random graph. However by results of Evans et al. such reconstruction is infeasible below the conjectured transition point.

Until now, positive results in the sparse case did not apply down to the transition point. The best results to date (see ) relied on Coja-Oghlan , showing that spectral clustering applied to the adjacency matrix, suitably trimmed by removal of high degree nodes, yields positively correlated estimates. However this does not apply down to the conjectured threshold.

This limitation stems from the following fact. Spectral methods perform well on matrices enjoying a spectral separation property, namely the spectrum should comprise a few large eigenvalues whose associated eigenvectors reflect the sought structure and all other eigenvalues should be negligible. The prototype of such separation is the Ramanujan property, according to which dd-regular graphs have the second eigenvalue λ\lambda no larger than 2d12\sqrt{d-1} in absolute value. Friedman established that random dd-regular graphs almost satisfy this, in that for them λ2d1+o(1)|\lambda|\leq 2\sqrt{d-1}+o(1). Erdős-Rényi graphs with average degree dd are such that λO(d)|\lambda|\leq O(\sqrt{d}), provided d=Ω(logn)d=\Omega(\log n) (see Feige and Ofek ), but such Ramanujan-like separation is lost for smaller dd. This lack of separation inherently limits the power of spectral methods in the sparse case.

2 Main results

We focus on the stochastic block model in Decelle et al. . The graph is denoted G\mathcal{G}, node types (or spins) σi\sigma_{i} are uniformly and i.i.d. drawn from {1,+1}\{-1,+1\}. An edge is present between any two nodes ii,jj with probability a/na/n if σi=σj\sigma_{i}=\sigma_{j}, and b/nb/n if σi=σj\sigma_{i}=-\sigma_{j}, constants aa and bb being the model parameters. The conjectured transition point is specified by quantity τ=(ab)2/[2(a+b)]\tau=(a-b)^{2}/[2(a+b)]: for τ<1\tau<1 it is known that positively correlated detection is impossible; we set out to prove that it is feasible for τ>1\tau>1.

We shall make use of the notations α:=(a+b)/2\alpha:=(a+b)/2, β:=(ab)/2\beta:=(a-b)/2. The detectability condition τ>1\tau>1 can be restated as

the empirical overlap between the true and estimated spins defined as

converges in probability to the set {r,+r}\{-r,+r\} for some strictly positive constant r>0r>0 as nn\to\infty.

3 Paper organization

Proof structure

Before we describe the steps used to establish this, let us verify how it implies Theorem 1.1. Note that since E(X)=1{\mathbf{E}}(X)=1, writing

we see that inequality P(Xx)P(Xx)>0{\mathbf{P}}(X\geq x)-{\mathbf{P}}(-X\geq x)>0 must hold on a set of xx’s of positive Lebesgue measure. Since the points xx at which the distribution of either XX or X-X has an atom is at most countable, there thus exists an xx at which neither distribution has an atom, and the desired inequality P(Xx)P(Xx)>0{\mathbf{P}}(X\geq x)-{\mathbf{P}}(-X\geq x)>0 holds. Letting t=x/E(X2)t=x/\sqrt{{\mathbf{E}}(X^{2})} and r=P(Xx)P(Xx)r={\mathbf{P}}(X\geq x)-{\mathbf{P}}(-X\geq x) we readily have by (4) that the empirical overlap in (3) must converge to {r,+r}\{-r,+r\}.

Theorem 2.1 will follow from the combination of two analyses. Let Aˉ\bar{A} denote the expectation of the graph’s adjacency matrix conditional on the spin vector σ\sigma, that is

The first analysis establishes the following

They are close (in a sense made precise in Section 4) to the corresponding quantities (B(t)e)i(B^{(t)}e)_{i}, (B(t)σ)i(B^{(t)}\sigma)_{i}, and are easier to analyze. In particular, they enjoy a quasi-deterministic growth property:

This, combined with Theorem 2.2, yields the key intermediate step:

Matrix expansion and spectral radii bounds

Our aim in this section is to establish Theorem 2.2. Denoting ξij\xi_{ij} the indicator of edge (i,j)(i,j)’s presence in G\mathcal{G} we can write

where Aˉ\bar{A} is as in (5). We then have the expansion:

With these notations at hand, one obtains from (15):

since we chose kk so that 2kϵ>12k\epsilon>1 and the last term is polylogarithmic in nn. This establishes (7).

and this last bound decays to zero as a power of nn by the condition 2kϵ>12k\epsilon>1 and the fact that the last term in the product is polylogarithmic in nn. This completes the proof of Theorem 2.2.

Local Analysis: structure of expanded neighborhoods

For any k0k\geq 0, the number of nodes with spin ±\pm at distance kk (respectively k\leq k) of node ii is denoted Uk±(i)U^{\pm}_{k}(i) (respectively, Uk±(i)U^{\pm}_{\leq k}(i)). We thus have

We shall omit indices ii when considering quantities related to a fixed node ii. In the remainder of the section we condition on the spins σ\sigma of all nodes. We denote n±n_{\pm} as the number of nodes with spin ±\pm.

For fixed iNi\in\mathcal{N} it is readily seen that, conditionally on Fk1:=σ(Ut+,Ut,tk1){\mathcal{F}}_{k-1}:=\sigma(U^{+}_{t},U^{-}_{t},t\leq k-1), we have:

Theorem (2.3) is established based on these characterizations by extensive use of Chernoff bounds for binomial variables. Its proof is deferred to the Appendix.

The next technical result establishes approximate independence of neighborhoods of distinct nodes. It is instrumental in Section 4.3 e.g. in establishing weak laws of large numbers on the fraction of nodes satisfying a given property.

We first state how to transport the deterministic growth controls (11) of Theorem 2.3 to vectors B(m1)eB^{(m-1)}e and B(m1)σB^{(m-1)}\sigma, a key step in the proof of Theorem 2.4. One has the following

Proof is in the Appendix, together with that of the following Corollary:

(of Theorem 2.4). Using identity (6), write for unit norm xx:

Using the bounds (26,27), the right-hand side is no larger than

By the previous inequalities (10,24,25) and the row sum bound, we have that

We now state two Lemmas which will allow to establish Theorem 4.1.

Using these, we now establish the following

3 Coupling with Poisson tree growth process

Introduce the stochastic process {Vt±}t0\{V^{\pm}_{t}\}_{t\geq 0} defined by

where Gt1=σ(V±k,kt1){\mathcal{G}}_{t-1}=\sigma(V^{\pm_{k}},k\leq t-1). We then have the following

The proof given in the Appendix relies on the Stein-Chen method for Poisson approximation.

where Vt±V^{\pm}_{t} is as defined in (31). We then have the following

The two processes {Mt}\{M_{t}\}, {Δt}\{\Delta_{t}\} are Gt{\mathcal{G}}_{t}-martingales. Process {Mt}\{M_{t}\} is uniformly integrable under Condition α>1\alpha>1. Under Condition β2>α\beta^{2}>\alpha, process {Δt}\{\Delta_{t}\} is also uniformly integrable.

Under α<β2\alpha<\beta^{2} the martingale {Δt}\{\Delta_{t}\} converges almost surely to a unit mean random variable Δ\Delta_{\infty}. Moreover this random variable has a finite variance 1/(β2/α1)1/(\beta^{2}/\alpha-1) to which the variance of Δt\Delta_{t} converges. It further holds that EΔt2Δ20{\mathbf{E}}|\Delta^{2}_{t}-\Delta_{\infty}^{2}|\to 0 as tt\to\infty.

Together these properties allow to establish the following

One has the following convergence in probability

For each tt that is an atom of neither Δ\Delta_{\infty}’s or Δ-\Delta_{\infty}’s distribution, the following convergence in probability holds

To convey the main ideas of the proof (deferred to the Appendix), we now indicate how to establish a property similar to (34), namely for a continuous bounded function gg we establish convergence in probability

The expectation of the sum in the left-hand side reads

where iji\neq j are two fixed nodes with spin ±\pm. By the coupling lemma 4.1 it holds that

It follows that the variance of the empirical average in (39) goes to zero as nn\to\infty. Its announced convergence in probability to (1/2)Eg(±Δ)(1/2){\mathbf{E}}g(\pm\Delta_{\infty}) then follows by Tchebitchev’s inequality.

Theorems 4.1 and 4.2 readily imply Theorem 2.1.

Conclusions

acknowledgements: The author gratefully acknowledges stimulating discussions on the topic with Marc Lelarge and Charles Bordenave.

References

Appendix A Proof of Proposition 3.1

We bound the expectation of the corresponding sum as follows. Let vv (respectively, ee) be the number of nodes (respectively, edges) traversed by a particular circuit. The quantity c=ev+1c=e-v+1 is the so-called “tree excess”, counting the number of edges that are traversed while not being part of the tree consisting of edges whose first traversal strictly augments the number of spanned nodes.

We represent the corresponding circuit as follows.

We number nodes by the order in which they are met by the circuit, starting with node 1.

a path using only edges already used in the circuit, and lying on the tree of new node discoveries

a cycle edge connecting the end of the two previous steps to a node already spanned. Such a cycle edge may have already been traversed by the circuit one or several times.

Given the tree previously spanned, and the current position on it, the first part of the sequence is characterized by the node label of its end: indeed, since on this subsequence we enforce the condition that the paths are simple, back-tracking is forbidden. Hence there is only one path on the tree going from the origin to the destination. We thus represent the first part by the number of the destination node if this part is non-empty, and by zero otherwise.

For a given number of nodes vv and edges ee, the number of corresponding nodes in {1,,n}\{1,\ldots,n\} is upper-bounded by nvn^{v}. For a given edge present with multiplicity m{1,,2k}m\in\{1,\ldots,2k\}, the corresponding expectation is zero if m=1m=1, and for m2m\geq 2, we have

Appendix B Proof of Proposition 3.2

Thus we have the upper bound on the number of valid circuit labels with vv nodes and ee edges:

Appendix C Proof of Theorem 2.3

The following inequality is easily verified to hold for any non-negative UU, VV, aa, bb, nn such that a/n,b/n1a/n,b/n\leq 1, and will be instrumental in the sequel:

Next lemma is the key ingredient to establish Theorem 2.3.

where MM denotes the matrix (a/2  b/2,b/2  a/2)(a/2\;b/2,b/2\;a/2).

Recall that conditionally on Ft1\mathcal{F}_{t-1} the random variables Ut+U^{+}_{t} and UtU^{-}_{t} are independent, distributed according to

Let TT be the first instant tt for which UtKlog(n)U_{t}\geq K\log(n), for some KK to be specified.

By definition of TT, necessarily UT1<Klog(n)U_{T-1}<K\log(n). Thus

The mean of the Binomial distribution in the right-hand side of the above is equivalent to (ab)(1/2)Klog(n)(a\vee b)(1/2)K\log(n) and less than κlog(n)\kappa\log(n) for κ=(ab)K\kappa=(a\vee b)K. Hence by Chernoff’s inequality, for h(x):=xlog(x)x+1h(x):=x\log(x)-x+1,

Take KK^{\prime} so that κh(K/2κ)>2\kappa h(K^{\prime}/2\kappa)>2. The right-hand side of the above is then no larger than n2n^{-2}.

Thus properties (10) clearly hold for tTt\leq T. We now establish that they hold with sufficiently large probability for larger tt.

Conditional on FT\mathcal{F}_{T}, the binomial distribution of UT+1±U^{\pm}_{T+1} has mean

Using the inequalities (41) we obtain that this mean lies in the interval

For a given ϵ>0\epsilon>0, we can choose KK sufficiently large so that

It follows that UT+1±U^{\pm}_{T+1} admits a relative deviation from its conditional mean by ϵ\epsilon with probability at most n2n^{-2}.

and consider the events At:={Ut±[1ϵt,1+ϵt]aUt1±+bUt12}\mathcal{A}_{t}:=\{U^{\pm}_{t}\in[1-\epsilon_{t},1+\epsilon_{t}]\frac{aU^{\pm}_{t-1}+bU^{\mp}_{t-1}}{2}\}. Conditionally on AT,,At\mathcal{A}_{T},\ldots,\mathcal{A}_{t}, the vector Ut=(Ut+,Ut)U_{t}=(U^{+}_{t},U^{-}_{t}) verifies the announced inequality (42). Given that α\alpha is the spectral radius of MM, it follows from this condition that Ut±(1O(ϵ))αtTKlog(n)U^{\pm}_{t}\geq(1-O(\epsilon))\alpha^{t-T}K^{\prime\prime}\log(n). We then check that Chernoff’s bound applies to show that the condition holds at step tt with high enough probability. It suffices to ensure that

Using (42), we readily have for t,tTt,t^{\prime}\leq T, with t>tt>t^{\prime}:

A similar lower bound holds with ϵs-\epsilon_{s} in place of +ϵs+\epsilon_{s}. Setting t=Tt^{\prime}=T in the upper bound, since ST=O(log(n)S_{T}=O(\log(n), the upper bound (10) follows for StS_{t}, as s=T+1t(1+ϵs)=O(1)\prod_{s=T+1}^{t}(1+\epsilon_{s})=O(1).

It readily follows that (11) holds for StS_{t}.

Consider now DtD_{t}. Using (42) again, we have

Since Ss=O(log(n)αsT)S_{s}=O(\log(n)\alpha^{s-T}), DT=O(log(N)|D_{T}|=O(\log(N) and ϵs=O(α(sT)/2)\epsilon_{s}=O(\alpha^{-(s-T)/2}), we obtain for t=Tt^{\prime}=T:

where we have used the assumption that β2>α\beta^{2}>\alpha to bound u>0βuαu/2\sum_{u>0}\beta^{-u}\alpha^{u/2}. Property (10) thus holds for DtD_{t}.

Finally, the right-hand side of (43) is of order

Since for t<Tt^{\prime}<T we readily have Dt=O(log(n)D_{t^{\prime}}=O(\log(n) by definition of TT, property (11) follows for DtD_{t}. ∎

Appendix D Proof of Lemma 4.2

There are two ways for creating cycles within the distance kk-neighborhood of ii: an edge may be present between two nodes at distance k1k-1 of ii, or two nodes at distance k1k-1 may be connected to the same node at distance kk of ii. The number of edges of the first type is stochastically dominated by Bin(Sk12,ab/n)\hbox{Bin}(S_{k-1}^{2},a\vee b/n). Its expected number conditionally on Ωk1(i)\Omega_{k-1}(i), defined as

As for the second type of cycles, its number is stochastically dominated by

Appendix E Proof of Lemma 4.3

Appendix F Proof of Corollary 4.1

Using the bound (25) for iBi\in{\mathcal{B}}, we can bound the first summation, using Cauchy-Schwarz’s inequality by

By Cauchy-Schwarz again, this is no larger than

Appendix G Proof of Lemma 4.6

We assume that σi=+\sigma_{i}=+, the case σ=\sigma=- being similar. Introduce the events

where constant CC is as in Theorem 2.3. As established in the proof of Theorem 2.3, the probability of each Ωk\Omega_{k} is 1o(n2)1-o(n^{-2}).

Let us evaluate, conditionally on Fk1{\mathcal{F}}_{k-1} and on Ωk1\Omega_{k-1} the variation distance between (Uk+,Uk)(U^{+}_{k},U^{-}_{k}) and a pair of (conditionally on Fk1{\mathcal{F}}_{k-1}) independent random variables with respective distributions

The Stein-Chen method enables to bound the variation distance between a Bin(n,λ/n)\hbox{Bin}(n,\lambda/n) and a Poi(λ)\hbox{Poi}(\lambda) random variables by nmin(1,λ1)(λ/n)2λ/nn\min(1,\lambda^{-1})(\lambda/n)^{2}\leq\lambda/n. Furthermore, two Poisson random variables with respective parameters λ\lambda, λ\lambda^{\prime} have variation distance at most λλ|\lambda-\lambda^{\prime}|. This entails the bounds

Appendix H Proof of Lemma 4.7

It readily follows that both processes {Mt}\{M_{t}\}, {Δt}\{\Delta_{t}\} are martingales. To establish uniform integrability we shall show that both processes have uniformly bounded variance. To that end we use the conditional variance formula

and the fact that the variance of a Poisson random variable equals its mean. Thus

This yields by the conditional variance formula

Since Var(M0)=0\hbox{Var}(M_{0})=0, it follows by induction that

The latter is uniformly bounded for α>1\alpha>1 hence the uniform integrability of {Mt}\{M_{t}\} under this condition.

It thus follows by Var(Δ0)=0\hbox{Var}(\Delta_{0})=0 and induction that

thus establishing uniform integrability of martingale {Δt}\{\Delta_{t}\}. ∎

Appendix I Proof of Corollary 4.2

Convergence almost surely and in L1L_{1} is guaranteed under uniform integrability by the martingale convergence theorem (). Finiteness of the limiting variable’s variance under uniform bounds on the variance is also standard; it follows from Fatou’s lemma. Convergence of the variances is established as follows. The limiting variable satisfies a distributional equation given by

where the Δi\Delta_{i}, Δi\Delta^{\prime}_{i} are i.i.d. and distributed as Δ\Delta. The only solution for the variance of Δ\Delta, apart from the degenerate solution , is then readily seen to be 1/(β2/α1)1/(\beta^{2}/\alpha-1), which is indeed the limit of the variance of Δt\Delta_{t}. The L1L_{1}-convergence of Δt2\Delta^{2}_{t} to Δ2\Delta_{\infty}^{2} is then a direct consequence of Scheffé’s lemma. ∎

Appendix J Proof of Theorem 4.2

Let us now consider the second moment of the empirical sum:

We break it into two terms, the first being

Using Lemma 4.6 and Theorem 2.3, using similar arguments as before we can bound this term by

which clearly goes to zero as nn\to\infty.

The convergence in probability (33) follows.

We now turn to establishing (34). We shall only consider the case of sign +, the other being handled similarly. Fix some arbitrarily small δ>0\delta>0. Because τ\tau is a continuity point of the distribution of Δ\Delta_{\infty}, we can find two bounded Lipschitz-continuous functions ff, gg such that

we have that this empirical sum differs from the simpler one

The same argument can be applied to gg, eventually leading to the convergence in probability

As δ\delta is arbitrary, this establishes (34).

Pick again an arbitrary δ>0\delta>0, two pairs of Lipschitz-continuous functions f±f_{\pm} and g±g_{\pm} such that

The difference (n+n)/n(n_{+}-n_{-})/n is of order 1/n1/\sqrt{n} and thus vanishes. We upper-bound the remaining terms by

Letting KK denote the Lipschitz-continuity constant for both g+g_{+} and ff_{-}, this last display differs from

Because of the assumed convergence in probability limnxy=0\lim_{n\to\infty}||x-y||=0, the first error term necessarily tends to zero in probability by Cauchy-Schwarz inequality. The second term is dealt with as mentioned in the proof of the previous lemma. Finally, using the coupling lemmas 4.6 and 4.1, by evaluating the first and second moments of (49), we obtain the convergence in probability

The latter term is then an upper bound on the lim sup\limsup of the empirical overlap. By the same approach, we obtain a lower bound of

on the lim inf\liminf of the overlap. These upper and lower bounds differ by at most 2δ2\delta, and differ from P(Δt)P(Δt){\mathbf{P}}(\Delta_{\infty}\geq t)-{\mathbf{P}}(\Delta_{\infty}\leq-t) by at most δ\delta. Since δ\delta is arbitrary, this establishes the announced convergence in probability of the empirical overlap to quantity xx where

is strictly positive by our choice of tt. ∎

Appendix K Proof of Lemma 4.4

Appendix L Proof of Lemma 4.5

To establish the lower bound of (29), note that by Cauchy-Schwarz,

The lower bound in (30) is established similarly, from the inequality

We control the magnitude of this quantity in the tree model; using coupling we will then transpose the corresponding estimates to the original scenario.

Let then T{\mathcal{T}} denote a branching process with offspring Poi(α)\hbox{Poi}(\alpha). The process of spins is then constructed by sampling uniformly the root’s spin, and then propagating spins in a Markovian fashion with transition matrix (a/(a+b)b(a+b),b(a+b),a(a+b))(a/(a+b)b(a+b),b(a+b),a(a+b)) that is α1M\alpha^{-1}M. Its eigenvalues are thus (1,β/α)(1,\beta/\alpha).

We evaluate its second moment conditionally on T{\mathcal{T}} by writing X2X^{2} as

We will use this formula, and further distinguish nodes jj^{\prime} according to their distance 2(d+dτ)2(d+d^{\prime}-\tau) for τ=0,,2(dd)\tau=0,\ldots,2(d\wedge d^{\prime}). This yields

Note now that with high probability, we have the following evaluations

By coupling (techniques of Theorem 4.2 involving Tchebitchev inequality, based on the bounds of Theorem 2.3 and Lemmas 4.6 and 4.1) we thus have that with high probability,