Optimal Cluster Recovery in the Labeled Stochastic Block Model

Se-Young Yun, Alexandre Proutiere

Introduction

Community detection consists in extracting (a few) groups of similar items from a large global population, and has applications in a wide spectrum of disciplines including social sciences, biology, computer science, and statistical physics. The communities or clusters of items are inferred from the observed pair-wise similarities between items, which, most often, are represented by a graph whose vertices are items and edges are pairs of items known to share similar features.

Over the last few years, we have seen remarkable progresses for the problem of cluster recovery under the SBM (see for an exhaustive literature review), highlighting its scientific relevance and richness. Most recent work on the SBM aimed at characterizing the set of parameters (i.e., the probabilities p(i,j)p(i,j) that there exists an edge between nodes in clusters ii and jj for 1i,jK1\leq i,j\leq K) such that some qualitative recovery objectives can or cannot be met. For sparse scenarios where the average degree of items in the graph is O(1)O(1), parameters under which it is possible to extract clusters positively correlated with the true clusters have been identified . When the average degree of the graph is ω(1)\omega(1), one may predict the set of parameters allowing a cluster recovery with a vanishing (as nn grows large) proportion of misclassified items , but one may also characterize parameters for which an asymptotically exact cluster reconstruction can be achieved .

In this paper, we address the finer and more challenging question of determining, under the general LSBM, the minimal number of misclassified items given the parameters of the model. Specifically, for any given s=o(n)s=o(n), our goal is to identify the set of parameters such that it is possible to devise a clustering algorithm with at most ss misclassified items. Of course, if we achieve this goal, we shall recover all the aforementioned results on the SBM.

We first derive a tight lower bound on the average number of misclassified items when the latter is o(n)o(n). Note that such a bound was unknown even for the SBM .

These theorems indicate that under the LSBM with parameters satisfying (A1) and (A2), the number of misclassified items scales at least as nexp(nD(α,p)(1+o(1))n\exp(-nD(\alpha,p)(1+o(1)) under any clustering algorithm, irrespective of its complexity. They further establish that the Spectral Partition algorithm reaches this fundamental performance limit under the additional condition (A3). We note that the SP algorithm runs in polynomial time, i.e., it requires O(n2pˉlog(n))O(n^{2}\bar{p}\log(n)) floating-point operations.

We further establish a necessary and sufficient condition on the parameters of the LSBM for the existence of a clustering algorithm recovering the clusters exactly with high probability. Deriving such a condition was also open .

Assume that (A1) and (A2) hold. If there exists a clustering algorithm that does not misclassify any item with high probability, then the parameters (α,p)(\alpha,p) of the LSBM satisfy: liminfnnD(α,p)log(n)1\lim\inf_{n\to\infty}\frac{nD(\alpha,p)}{\log(n)}\geq 1. If this condition holds, then under (A3), the SP algorithm recovers the clusters exactly with high probability.

The paper is organized as follows. Section 2 presents the related work and example of application of our results. In Section 3, we sketch the proof of Theorem 1, which leverages change-of-measure and coupling arguments. We present in Section 4 the Spectral Partition algorithm, and analyze its performance (we outline the proof of Theorem 2). All results are proved in details in the supplementary material.

Related Work and Applications

Cluster recovery in the SBM has attracted a lot of attention recently. We summarize below existing results, and compare them to ours. Results are categorized depending on the targeted level of performance. First, we consider the notion of detectability, the lowest level of performance requiring that the extracted clusters are just positively correlated with the true clusters. Second, we look at asymptotically accurate recovery, stating that the proportion of misclassified items vanishes as nn grows large. Third, we present existing results regarding exact cluster recovery, which means that no item is misclassified. Finally, we report recent work whose objective, like ours, is to characterize the optimal cluster recovery rate.

Detectability. Necessary and sufficient conditions for detectability have been studied for the binary symmetric SBM (i.e., L=1L=1, K=2K=2, α1=α2\alpha_{1}=\alpha_{2}, p(1,1,1)=p(2,2,1)=ξp(1,1,1)=p(2,2,1)=\xi, and p(1,2,1)=p(2,1,1)=ζp(1,2,1)=p(2,1,1)=\zeta). In the sparse regime where ξ,ζ=o(1)\xi,\zeta=o(1), and for the binary symmetric SBM, the main focus has been on identifying the phase transition threshold (a condition on ξ\xi and ζ\zeta) for detectability: It was conjectured in that if n(ξζ)<2n(ξ+ζ)n(\xi-\zeta)<\sqrt{2n(\xi+\zeta)} (i.e., under the threshold), no algorithm can perform better than a simple random assignment of items to clusters, and above the threshold, clusters can partially be recovered. The conjecture was recently proved in (necessary condition), and (sufficient condition). The problem of detectability has been also recently studied in for the asymmetric SBM with more than two clusters of possibly different sizes. Interestingly, it is shown that in most cases, the phase transition for detectability disappears.

The present paper is not concerned with conditions for detectability. Indeed detectability means that only a strictly positive proportion of items can be correctly classified, whereas here, we impose that the proportion of misclassified items vanishes as nn grows large.

Asymptotically accurate recovery. A necessary and sufficient condition for asymptotically accurate recovery in the SBM (with any number of clusters of different but linearly increasing sizes) has been derived in and . Using our notion of divergence specialized to the SBM, this condition is nD(α,p)=ω(1)nD(\alpha,p)=\omega(1). Our results are more precise since the minimal achievable number of misclassified items is characterized, and apply to a broader setting since they are valid for the generic LSBM.

Asymptotically exact recovery. Conditions for exact cluster recovery in the SBM have been also recently studied. provide a necessary and sufficient condition for asymptotically exact recovery in the binary symmetric SBM. For example, it is shown that when ξ=alog(n)n\xi={a\log(n)\over n} and ζ=blog(n)n\zeta={b\log(n)\over n} for a>ba>b, clusters can be recovered exactly if and only if a+b2ab1{a+b\over 2}-\sqrt{ab}\geq 1. In , the authors consider a more general SBM corresponding to our LSBM with L=1L=1. They define CH-divergence as:

and show that minijD+(α,p(i),p(j))>1\min_{i\neq j}D_{+}(\alpha,p(i),p(j))>1 is a necessary and sufficient condition for asymptotically exact reconstruction. The following claim, proven in the supplementary material, relates D+D_{+} to DL+D_{L+}.

When pˉ=o(1)\bar{p}=o(1), we have for all i,ji,j:

Thus, the results in are obtained by applying Theorem 3 and Claim 4.

Again from this claim, the results derived in are obtained by applying Theorem 3 and Claim 5.

Optimal recovery rate. In , the authors consider the binary SBM in the sparse regime where the average degree of items in the graph is O(1)O(1), and identify the minimal number of misclassified items for very specific intra- and inter-cluster edge probabilities ξ\xi and ζ\zeta. Again the sparse regime is out of the scope of the present paper. are concerned with the general SBM corresponding to our LSBM with L=1L=1, and with regimes where asympotically accurate recovery is possible. The authors first characterize the optimal recovery rate in a minimax framework. More precisely, they consider a (potentially large) set of possible parameters (α,p)(\alpha,p), and provide a lower bound on the expected number of misclassified items for the worst parameters in this set. Our lower bound (Theorem 1) is more precise as it is model-specific, i.e., we provide the minimal expected number of misclassified items for a given parameter (α,p)(\alpha,p) (and for a more general class of models). Then the authors propose a clustering algorithm, with time complexity O(n3log(n))O(n^{3}\log(n)), and achieving their minimax recovery rate. In comparison, our algorithm yields an optimal recovery rate O(n2pˉlog(n))O(n^{2}\bar{p}\log(n)) for any given parameter (α,p)(\alpha,p), exhibits a lower running time, and applies to the generic LSBM.

2 Applications

We provide here a few examples of application of our results, illustrating their versatility. In all examples, f(n)f(n) is a function such that f(n)=ω(1)f(n)=\omega(1), and a,ba,b are fixed real numbers such that a>ba>b.

The binary SBM. Consider the binary SBM where the average item degree is Θ(f(n))\Theta(f(n)), and represented by a LSBM with parameters L=1L=1, K=2K=2, α=(α1,1α1)\alpha=(\alpha_{1},1-\alpha_{1}), p(1,1,1)=p(2,2,1)=af(n)np(1,1,1)=p(2,2,1)=\frac{af(n)}{n}, and p(1,2,1)=p(2,1,1)=bf(n)np(1,2,1)=p(2,1,1)=\frac{bf(n)}{n}. From Theorems 1 and 2, the optimal number of misclassified vertices scales as nexp(g(α1,a,b)f(n)(1+o(1)))n\exp(-g(\alpha_{1},a,b)f(n)(1+o(1))) when α11/2\alpha_{1}\leq 1/2 (w.l.o.g.) and where

It can be easily checked that g(α1,a,b)g(1/2,a,b)=12(ab)2g(\alpha_{1},a,b)\geq g(1/2,a,b)=\frac{1}{2}(\sqrt{a}-\sqrt{b})^{2} (letting λ=12\lambda=\frac{1}{2}). The worst case is hence obtained when the two clusters are of equal sizes. When f(n)=log(n)f(n)=\log(n), we also note that the condition for asymptotically exact recovery is g(α1,a,b)1g(\alpha_{1},a,b)\geq 1.

Recovering a single hidden community. As in , consider a random graph model with a hidden community consisting of αn\alpha n vertices, edges between vertices belonging the hidden community are present with probability af(n)n\frac{af(n)}{n}, and edges between other pairs are present with probability bf(n)n\frac{bf(n)}{n}. This is modeled by a LSBM with parameters K=2K=2, L=1L=1, α1=α\alpha_{1}=\alpha, p(1,1,1)=af(n)np(1,1,1)=\frac{af(n)}{n}, and p(1,2,1)=p(2,1,1)=p(2,2,1)=bf(n)np(1,2,1)=p(2,1,1)=p(2,2,1)=\frac{bf(n)}{n}. The minimal number of misclassified items when searching for the hidden community scales as nexp(h(α,a,b)f(n)(1+o(1)))n\exp(-h(\alpha,a,b)f(n)(1+o(1))) where

When f(n)=log(n)f(n)=\log(n), the condition for asymptotically exact recovery of the hidden community is h(α,a,b)1h(\alpha,a,b)\geq 1.

Optimal sampling for community detection under the SBM. Consider a dense binary symmetric SBM with intra- and inter-cluster edge probabilities aa and bb. In practice, to recover the clusters, one might not be able to observe the entire random graph, but sample its vertex (here item) pairs as considered in . Assume for instance that any pair of vertices is sampled with probability δf(n)n\frac{\delta f(n)}{n} for some fixed δ>0\delta>0, independently of other pairs. We can model such scenario using a LSBM with three labels, namely ×\times, 0 and 1, corresponding to the absence of observation (the vertex pair is not sampled), the observation of the absence of an edge and of the presence of an edge, respectively, and with parameters for all i,j{1,2}i,j\in\{1,2\}, p(i,j,×)=1δf(n)np(i,j,\times)=1-\frac{\delta f(n)}{n}, p(1,1,1)=p(2,2,1)=aδf(n)np(1,1,1)=p(2,2,1)=a\frac{\delta f(n)}{n}, and p(1,2,1)=p(2,1,1)=bδf(n)np(1,2,1)=p(2,1,1)=b\frac{\delta f(n)}{n}. The minimal number of misclassified vertices scales as nexp(l(δ,a,b)f(n)(1+o(1)))n\exp(-l(\delta,a,b)f(n)(1+o(1))) where l:=δ(1ab(1a)(1b)).l:=\delta(1-\sqrt{ab}-\sqrt{(1-a)(1-b)}). When f(n)=log(n)f(n)=\log(n), the condition for asymptotically exact recovery is l(α,a+,a,b+,b)1l(\alpha,a_{+},a_{-},b_{+},b_{-})\geq 1.

Signed networks. Signed networks are used in social sciences to model positive and negative interactions between individuals. These networks can be represented by a LSBM with three possible labels, namely 0, + and -, corresponding to the absence of interaction, positive and negative interaction, respectively. Consider such LSBM with parameters: K=2K=2, α1=α2\alpha_{1}=\alpha_{2}, p(1,1,+)=p(2,2,+)=a+f(n)np(1,1,+)=p(2,2,+)=\frac{a_{+}f(n)}{n}, p(1,1,)=p(2,2,)=af(n)np(1,1,-)=p(2,2,-)=\frac{a_{-}f(n)}{n}, p(1,2,+)=p(2,1,+)=b+f(n)np(1,2,+)=p(2,1,+)=\frac{b_{+}f(n)}{n}, and p(1,2,)=p(2,1,)=bf(n)np(1,2,-)=p(2,1,-)=\frac{b_{-}f(n)}{n}, for some fixed a+,a,b+,ba_{+},a_{-},b_{+},b_{-} such that a+>b+a_{+}>b_{+} and a<ba_{-}<b_{-}. The minimal number of misclassified individuals here scales as nexp(m(α,a+,a,b+,b)f(n)(1+o(1)))n\exp(-m(\alpha,a_{+},a_{-},b_{+},b_{-})f(n)(1+o(1))) where

When f(n)=log(n)f(n)=\log(n), the condition for asymptotically exact recovery is l(α,a+,a,b+,b)1l(\alpha,a_{+},a_{-},b_{+},b_{-})\geq 1.

Fundamental Limits: Change of Measures through Coupling

Construction of ψ\psi. Let (i,j)=argmini,j:i<jDL+(α,p(i),p(j))(i^{\star},j^{\star})=\arg\min_{i,j:i<j}D_{L+}(\alpha,p(i),p(j)), and let vv^{\star} denote the smallest item index that belongs to cluster ii^{\star} or jj^{\star}. If both Vi\mathcal{V}_{i^{\star}} and Vj\mathcal{V}_{j^{\star}} are empty, we define v=nv^{\star}=n. Let qPK×(L+1)q\in{\cal P}^{K\times(L+1)} such that: D(α,p)=k=1KαkKL(q(k),p(i,k))=k=1KαkKL(q(k),p(j,k)).D(\alpha,p)=\sum_{k=1}^{K}\alpha_{k}KL(q(k),p(i^{\star},k))=\sum_{k=1}^{K}\alpha_{k}KL(q(k),p(j^{\star},k)). The existence of such qq is proved in Lemma 7 in the supplementary material. Now to define the stochastic model Ψ\Psi, we couple the generation of labels under Φ\Phi and Ψ\Psi as follows.

1. We first generate the random clusters V1,,VK\mathcal{V}_{1},\ldots,\mathcal{V}_{K} under Φ\Phi, and extract ii^{\star}, jj^{\star}, and vv^{\star}. The clusters generated under Ψ\Psi are the same as those generated under Φ\Phi. For any vVv\in\mathcal{V}, we denote by σ(v)\sigma(v) the cluster of item vv.

The Spectral Partition Algorithm and its Optimality

In this section, we sketch the proof of Theorem 2. To this aim, we present the Spectral Partition (SP) algorithm and analyze its performance. The SP algorithm consists in two parts, and its detailed pseudo-code is presented at the beginning of the supplementary document (see Algorithm 1).

Assume that (A1) and (A2) hold, and that npˉ=ω(1)n\bar{p}=\omega(1). After Step 4 (spectral decomposition) in the SP algorithm, with high probability, K^=K\hat{K}=K and there exists a permutation γ\gamma of {1,,K}\{1,\ldots,K\} such that: k=1KVkSγ(k)=O(log(npˉ)2pˉ).\left|\cup_{k=1}^{K}\mathcal{V}_{k}\setminus S_{\gamma(k)}\right|=O\left(\frac{\log(n\bar{p})^{2}}{\bar{p}}\right).

Finally, we establish that if the clusters provided after the first part of the SP algorithm are asymptotically accurate, then after log(n)\log(n) improvement iterations, there is no misclassified items in HH. To that aim, we denote by E(t)\mathcal{E}^{(t)} the set of misclassified items after the tt-th iteration, and show that with high probability, for all tt, E(t+1)HE(t)H1npˉ\frac{|\mathcal{E}^{(t+1)}\cap H|}{|\mathcal{E}^{(t)}\cap H|}\leq\frac{1}{\sqrt{n\bar{p}}}. This completes the proof of Theorem 2, since after log(n)\log(n) iterations, the only misclassified items are those in VH\mathcal{V}\setminus H.

References

Appendix A The SP Algorithm

Appendix B Properties of the divergence D​(α,p)𝐷𝛼𝑝D(\alpha,p) and related quantities

In this section, we prove the two claims of Section 2, as well as other results on the divergence D(α,p)D(\alpha,p) that will be instrumental in the proofs of Theorems.

DL+(p(i),p(j))D_{L+}(p(i),p(j)) is the minimum of the objective function of the following convex optimization problem:

When we put (9) onto (8) and use the approximation limx0log(1+x)=x\lim_{x\to 0}\log(1+x)=x (again using pˉ=o(1)\bar{p}=o(1)),

Therefore, the minimum value of (6) is equivalent to

B.2 Proof of Claim 5

Now, since 1+x=1+x2(1+o(1))\sqrt{1+x}=1+\frac{x}{2}(1+o(1)) and log(1+x)=x(1+o(1))\log(1+x)=x(1+o(1)) when x=o(1)x=o(1),

B.3 Other properties

Let (i,j)=argmini,jDL+(p(i),p(j))(i^{\star},j^{\star})=\arg\min_{i,j}D_{L+}(p(i),p(j)) and i<ji^{\star}<j^{\star}. Then, there exists qPK×(L+1)q\in\mathcal{P}^{K\times(L+1)} such that

Proof. We check by contradiction that such a qq exists. Indeed, assume that

Then there exists k0k_{0} such that KL(q(k0),p(i,k0))>KL(q(k0),p(j,k0))KL(q(k_{0}),p(i^{\star},k_{0}))>KL(q(k_{0}),p(j^{\star},k_{0})). Observe that by positivity of the KLKL divergence, q(k0)p(i,k0)q(k_{0})\neq p(i^{\star},k_{0}). Hence by continuity of the KLKL divergence, we can construct qq^{\prime} such that q(k)=q(k)q(k)=q^{\prime}(k) for all kk0k\neq k_{0}, and such that: KL(q(k0),p(i,k0))ϵ<KL(q(k0),p(i,k0))<KL(q(k0),p(i,k0))KL(q(k_{0}),p(i^{\star},k_{0}))-\epsilon<KL(q^{\prime}(k_{0}),p(i^{\star},k_{0}))<KL(q(k_{0}),p(i^{\star},k_{0})) and KL(q(k0),p(j,k0))<KL(q(k0),p(j,k0))+ϵKL(q^{\prime}(k_{0}),p(j^{\star},k_{0}))<KL(q(k_{0}),p(j^{\star},k_{0}))+\epsilon for some 0<ϵ<(KL(q(k0),p(i,k0))KL(q(k0),p(j,k0)))/20<\epsilon<(KL(q(k_{0}),p(i^{\star},k_{0}))-KL(q(k_{0}),p(j^{\star},k_{0})))/2. With this choice of qq^{\prime}, we get:

which contradicts the definition of D(α,p)D(\alpha,p). \blacksquare

Proof. Let (i,j)=argmini,jDL+(α,p(i),p(j))(i^{\star},j^{\star})=\arg\min_{i,j}D_{L+}(\alpha,p(i),p(j)) and i<ji^{\star}<j^{\star}. From Lemma 7, there exists qq satisfying that

Under condition (A1), when pˉ=o(1)\bar{p}=o(1), limsupnD(α,p)ηpˉL1.\lim\sup_{n\to\infty}\frac{D(\alpha,p)}{\eta\bar{p}L}\leq 1.

Proof. From the definition of D(α,p)D(\alpha,p), for any iji\neq j,

where we use log(1+x)=x(1+o(1))\log(1+x)=x(1+o(1)) when x=o(1)x=o(1). \blacksquare

Appendix C Proof of Theorem 1

Coupling and the perturbed stochastic model Ψ\Psi. Let (i,j)=argmini,j:i<jDL+(p(i),p(j))(i^{\star},j^{\star})=\arg\min_{i,j:i<j}D_{L+}(p(i),p(j)), and let vv^{\star} denote the smallest node index that belongs to cluster ii^{\star} or jj^{\star}. If both Vi\mathcal{V}_{i^{\star}} and Vj\mathcal{V}_{j^{\star}} are empty, we define v=nv^{\star}=n. Let qK×(L+1)q\in^{K\times(L+1)} satisfy:

There exists such a qq from Lemma 7. Now to define the perturbed stochastic model Ψ\Psi, we couple the generation of labels under Φ\Phi and Ψ\Psi as follows.

We first generate construct the random clusters V1,,VK\mathcal{V}_{1},\ldots,\mathcal{V}_{K} under Φ\Phi, and extract ii^{\star}, jj^{\star}, and vv^{\star}. The clusters generated under Ψ\Psi are the same as those generated under Φ\Phi. For any vVv\in\mathcal{V}, we denote by σ(v)\sigma(v) the cluster of node vv.

Let π\pi denote a clustering algorithm with output (V^k)1kK(\hat{\mathcal{V}}_{k})_{1\leq k\leq K}, and let E=1kKV^kVk\mathcal{E}=\bigcup_{1\leq k\leq K}\hat{\mathcal{V}}_{k}\setminus\mathcal{V}_{k} be the set of misclassified nodes under π\pi. Note that in general in our proofs, we always assume without loss of generality that 1kKV^kVk1kKV^γ(k)Vk|\bigcup_{1\leq k\leq K}\hat{\mathcal{V}}_{k}\setminus\mathcal{V}_{k}|\leq|\bigcup_{1\leq k\leq K}\hat{\mathcal{V}}_{\gamma(k)}\setminus\mathcal{V}_{k}| for any permutation γ\gamma, so that the set of misclassified nodes is really E\mathcal{E}. We denote by επ(n)=E\varepsilon^{\pi}(n)=|\mathcal{E}|. Since under Φ\Phi, nodes are interchangeable (remember that nodes are assigned to the various clusters in an i.i.d. manner), we have:

where the last inequality is obtained from the fact that we cannot distinguish between vv^{\star} and any other vVσ(v)v\in\mathcal{V}_{\sigma(v^{\star})}. Indeed,

Furthermore, since under the stochastic model Ψ\Psi, the observed labels do not depend on whether vv^{\star} belongs to cluster ii^{\star} or jj^{\star}, we have:

Combining (17), (22), and (27), we conclude that:

In addition, from Chebyshev’s inequality,

Hence from condition (A1), (32), and the definition of Q\mathcal{Q},

where the last inequlaity stems from the fact that 2KL(q(i),p(σ(v,i)))logηKL(q(j),p(σ(v,j)))2KL(q(i),p(\sigma(v^{\star},i)))\log\eta\geq KL(q(j),p(\sigma(v^{\star},j))) for all ii and jj from condition (A1).

To derive the above inequality, we have used:

Appendix D Performance of the SP Algorithm – Proof of Theorem 2

For every vVv\in\mathcal{V} and c1c\geq 1, we have

where we derive the last inequality choosing θ=2\theta=2. \blacksquare

The proof of Lemma 11 relies on arguments used in the spectral analysis of random graphs, see and .

For all vVkv\in{\mathcal{V}}_{k} and D0D\geq 0,

Proof. Let X\mathcal{X} be a set of K×(L+1)K\times(L+1) matrices such that

where (a)(a) stems from the following inequality:

D.2 Part 1 of the SP algorithm – Proof of Theorem 6

Recall that A^=U^V^=U^U^AΓ\hat{A}=\hat{U}\hat{V}=\hat{U}\hat{U}^{\top}A_{\Gamma} and A^uA^v=V^uV^v\|\hat{A}_{u}-\hat{A}_{v}\|=\|\hat{V}_{u}-\hat{V}_{v}\|. We can bound the number of misclassified items as follows:

with high probability, every item pair uu and vv satisfies that when σ(v)\sigma(v) represents the cluster of vv and Mv,ΓM_{v,\Gamma} denotes the column vector of MΓM_{\Gamma} on vv,

(40) suggests that if vv is misclassified by Algorithm 2, then we should have:

from (39) and (41), with high probability,

Proof of (39). First observe that from the definition of Γ\Gamma,

where the first inequality stems from Lemma 10 and Markov inequality. Therefore, with high probability,

Proof of (41). Define the following sets:

Γ(k=1KIk)A^MΓF2minvΓ(k=1KIk)A^vMΓk2=O(log(npˉ)3pˉ)|\Gamma\setminus(\cup_{k=1}^{K}\mathcal{I}_{k})|\leq\frac{\|\hat{A}-M_{\Gamma}\|_{F}^{2}}{\min_{v\in\Gamma\setminus(\cup_{k=1}^{K}I_{k})}\|\hat{A}_{v}-M_{\Gamma}^{k}\|^{2}}=O\left(\frac{\log(n\bar{p})^{3}}{\bar{p}}\right);

From the properties of Ik\mathcal{I}_{k} and O\mathcal{O}, we state the following results.

since every w(k=1KIk)w\in(\cup_{k=1}^{K}\mathcal{I}_{k}) is outside of QvQ_{v} (i.e., wΓ(k=1KIk)w\in\Gamma\setminus(\cup_{k=1}^{K}I_{k}) is necessary for wQvw\in Q_{v});

since αk\alpha_{k} is a constant for all kk and Γ(k=1KIk)Γ=o(1)\frac{|\Gamma\setminus(\cup_{k=1}^{K}\mathcal{I}_{k})|}{|\Gamma|}=o(1) from (ii), with high probability,

The properties (ii), (iii), and (iv) and (51) imply that

where mkm_{k} is the kk-th largest value among {I1,,IK}\{|\mathcal{I}_{1}|,\dots,|\mathcal{I}_{K}|\} ;

since IkVk(ΓO)αkn(1o(1))|\mathcal{I}_{k}|\geq|V_{k}\cap(\Gamma\setminus\mathcal{O})|\geq\alpha_{k}n(1-o(1)) from (ii) and (iii),

D.3 Proof of Theorem 2

From Chernoff bound, with high probability,

In what follows, we hence just prove the theorem assuming that (54) holds.

Let HH be the largest set of items vVv\in\mathcal{V} satisfying:

e(v,VH)2log(npˉ)2.e(v,\mathcal{V}\setminus H)\leq 2\log(n\bar{p})^{2}.

(H1) regularizes degrees, (H2) means that vHv\in H is correctly classified when using the log-likelihood estimate, and (H3) means that vv does not share too many labels with items outside HH.

The proof of the theorem follows from the following propositions. The first provides an upper bound of VH|\mathcal{V}\setminus H|, and the second provides the rate at which our estimated clusters improve in each iteration when we restrict our attention to items in HH.

When nD(α,p)npˉlog(npˉ)3log(n/s)+log(n/s)nD(\alpha,p)-\frac{n\bar{p}}{\log(n\bar{p})^{3}}\geq\log(n/s)+\sqrt{\log(n/s)}, VHs|\mathcal{V}\setminus H|\leq s with high probability.

If k=1K(Sk(0)Vk)H+VH=O(1/pˉ){|\bigcup_{k=1}^{K}(S^{(0)}_{k}\setminus\mathcal{V}_{k})\cap H|+|\mathcal{V}\setminus H|}=O(1/\bar{p}), with high probability, the following statement holds

From Proposition 14, after log(n)\log(n) iterations (remember that npˉ=ω(1)n\bar{p}=\omega(1), so when nn is large enough 1/npˉe21/\sqrt{n\bar{p}}\leq e^{-2}), no item in HH can be misclassified with high probability. Hence the number of misclassified items cannot exceed VHs|\mathcal{V}\setminus H|\leq s, nD(α,p)npˉlog(npˉ)3log(n/s)+log(n/s)nD(\alpha,p)-\frac{n\bar{p}}{\log(n\bar{p})^{3}}\geq\log(n/s)+\sqrt{\log(n/s)}. The proof is completed by remarking that if the previous condition on D(α,p)D(\alpha,p) holds, then

where we used D(α,p)=Ω(pˉ)D(\alpha,p)=\Omega(\bar{p}) from condition (A2) and Lemma 8. \blacksquare

We compute the number of items satisfying (H1), (H2), and (H3) in (55), (56), and Lemma 15, respectively.

Number of items satisfying (H1): From Lemma 10, we get:

Number of items satisfying (H2): We shall prove that when vv satisfies (H1), vv satisfies (H2) as well with probability at least

To this aim, we first establish that if vv satisfies

then vv satisfies (H2). Indeed, assume that (57) holds, then

i=1KαinKL(μ(v,Vi),p(k,i))(1+log(n)2n)i=1KViKL(μ(v,Vi),p(k,i))<nD(α,p)\sum_{i=1}^{K}\alpha_{i}nKL(\mu(v,\mathcal{V}_{i}),p(k,i))\leq\left(1+\frac{\log(n)^{2}}{\sqrt{n}}\right)\sum_{i=1}^{K}|\mathcal{V}_{i}|KL(\mu(v,\mathcal{V}_{i}),p(k,i))<nD(\alpha,p), since Viαinnlog(n)||\mathcal{V}_{i}|-\alpha_{i}n|\leq\sqrt{n}\log(n) and (57) holds;

i=1KαinKL(μ(v,Vi),p(j,i))nD(α,p)\sum_{i=1}^{K}\alpha_{i}nKL(\mu(v,\mathcal{V}_{i}),p(j,i))\geq nD(\alpha,p), since max{i=1KαiKL(μ(v,Vi),p(j,i)),i=1KαiKL(μ(v,Vi),p(k,i))}D(α,p)\max\left\{\sum_{i=1}^{K}\alpha_{i}KL(\mu(v,\mathcal{V}_{i}),p(j,i)),\sum_{i=1}^{K}\alpha_{i}KL(\mu(v,\mathcal{V}_{i}),p(k,i))\right\}\geq D(\alpha,p) and i=1KαiKL(μ(v,Vi),p(k,i))<D(α,p)\sum_{i=1}^{K}\alpha_{i}KL(\mu(v,\mathcal{V}_{i}),p(k,i))<D(\alpha,p);

i=1KViKL(μ(v,Vi),p(j,i))(1log(n)2n)nD(α,p)\sum_{i=1}^{K}|\mathcal{V}_{i}|KL(\mu(v,\mathcal{V}_{i}),p(j,i))\geq\left(1-\frac{\log(n)^{2}}{\sqrt{n}}\right)nD(\alpha,p), from ii) and the fact that Viαinnlog(n)||\mathcal{V}_{i}|-\alpha_{i}n|\leq\sqrt{n}\log(n);

Hence vv satisfies (H2). It remains to evaluate the probability of the event (57), which is done by applying Lemma 12 and proves (56).

Number of items satisfying (H3): From (55), (56), and the Markov inequality, we deduce that with probability at least 1exp(log(n/s))1-\exp\left(-\sqrt{\log(n/s)}\right), the number of items that do not satisfy either (H1) or (H2) is less than s/3s/3 when nD(α,p)npˉlog(npˉ)3log(n/s)+log(n/s)nD(\alpha,p)-\frac{n\bar{p}}{\log(n\bar{p})^{3}}\geq\log(n/s)+\sqrt{\log(n/s)}, since

where we have used Lemma 9 for the last inequality. Lemma 15 allows us to complete the proof of Proposition. \blacksquare

When the number of items that do not satisfy either (H1) or (H2) is less than s/3s/3, VHs|\mathcal{V}\setminus H|\leq s, with high probability.

Proof. Let e(S,S)=vSe(S,S)e(S,S)=\sum_{v\in S}e(S,S). Next we prove the following intermediate claim: there is no subset SVS\subset\mathcal{V} such that e(S,S)slog(npˉ)2e(S,S)\geq s\log(n\bar{p})^{2} and S=s|S|=s with high probability. For any subset SVS\in\mathcal{V} such that S=s,|S|=s, by Markov inequality,

where, in the last two inequalities, we have set t=npˉlognpˉt=\frac{n\bar{p}}{\log n\bar{p}} and used the fact that: nsexp(npˉlognpˉ),\frac{n}{s}\geq\exp(\frac{n\bar{p}}{\log n\bar{p}}), which comes from the assumptions made in the theorem. Since the number of subsets SVS\subset\mathcal{V} with size ss is (ns)(ens)s,{{n}\choose{s}}\leq(\frac{en}{s})^{s}, from (65), we deduce:

Therefore, by Markov inequality, we can conclude that there is no SVS\subset\mathcal{V} such that e(S,S)slog(npˉ)2e(S,S)\geq s\log(n\bar{p})^{2} and S=s|S|=s with high probability.

To conclude the proof of the lemma, we build the following sequence of sets. Let Z1Z_{1} denote the set of items that do not satisfy at least one of (H1) and (H2). Let {Z(t)V}1tt\{Z(t)\subset\mathcal{V}\}_{1\leq t\leq t^{\star}} be generated as follows:

For t1t\geq 1, Z(t)=Z(t1){vt}Z(t)=Z(t-1)\cup\{v_{t}\} if there exists vtVv_{t}\in\mathcal{V} such that e(vt,Z(t1))>2log(npˉ)2e(v_{t},Z(t-1))>2\log(n\bar{p})^{2} and vtZ(t1)v_{t}\notin Z(t-1). If such an item does not exist, the sequence ends.

The sequence ends after the construction of Z(t)Z(t^{\star}). We show that if we assume that the cardinality of items that do not satisfy (H3) is strictly larger than s/2s/2, then one the set of the sequence {Z(t)V}1tt\{Z(t)\subset\mathcal{V}\}_{1\leq t\leq t^{\star}} contradicts the claim we just proved.

Assume that the number of items do not satisfy (H3) is strictly larger than s/2s/2, then these items will be at some point added to the sets Z(t)Z(t), and by definition, each of these node contributes with more than 2log(npˉ)22\log(n\bar{p})^{2} in e(Z(t),Z(t))e(Z(t),Z(t)). Hence if starting from Z1Z_{1}, we add s/2s/2 items not satisfying (H3), we get a set Z(t)Z(t) of cardinality less than s/3+s/2s/3+s/2 and such that e(Z(t),Z(t))>slog(npˉ)2e(Z(t),Z(t))>s\log(n\bar{p})^{2}. We can further add arbitrary items to Z(t)Z(t) so that it becomes of cardinality ss, and the obtained set contradicts the claim. \blacksquare

D.3.2 Proof of Proposition 14

Recall that {Sj(t)}1jK\{S^{(t)}_{j}\}_{1\leq j\leq K} is the partition after the tt-th improvement iteration. Also recall that with loss of generality, we assume that the set of misclassified items in HH after the tt-th step is E(t)=(k(Sk(t)Vk))H\mathcal{E}^{(t)}=\left(\cup_{k}(S_{k}^{(t)}\setminus\mathcal{V}_{k})\right)\cap H (it should be defined through an appropriate permutation γ\gamma of {1,,K}\{1,\ldots,K\} by E(t)=(k(Sk(t)Vγ(k)))H\mathcal{E}^{(t)}=(\cup_{k}(S_{k}^{(t)}\setminus\mathcal{V}_{\gamma(k)}))\cap H, but we omit γ\gamma). With this notational convention, we can define Ejk(t)=(Sj(t)Vk)H\mathcal{E}_{jk}^{(t)}=(S^{(t)}_{j}\cap\mathcal{V}_{k})\cap H and E(t)=j,k:jkEjk(t)\mathcal{E}^{(t)}=\bigcup_{j,k:j\neq k}\mathcal{E}_{jk}^{(t)}. At each improvement step, items move to the most likely cluster (according to the log-likelihood defined in the SP algorithm). Thus, for all ii,

Therefore, from the above inequalities, we conclude that

Next we prove all the steps of the previous analysis.

From (77) and (78), with high probability,

Since {Sk(0)}1kKS\{S^{(0)}_{k}\}_{1\leq k\leq K}\in\mathcal{S}, from the above inequality,

We now devote to the remaining part of (74). Since E(0)=O(log(npˉ)2pˉ)|\mathcal{E}^{(0)}|=O\left(\frac{\log(n\bar{p})^{2}}{\bar{p}}\right) from Theorem 6,

From (74), (79) and (80), with high probability,

Proof of (69): Since logp(j,i,0)p(k,i,0)=O(pˉ)\log\frac{p(j,i,0)}{p(k,i,0)}=O(\bar{p}) for all i,j,ki,j,k and E(t)=O(log(npˉ)2/pˉ)|\mathcal{E}^{(t)}|=O(\log(n\bar{p})^{2}/\bar{p}),

where the last inequality stems from (H3), i.e., from e(v,VH)2log(npˉ)2e(v,\mathcal{V}\setminus H)\leq 2\log(n\bar{p})^{2} when vHv\in H.

Proof of (70): Since E(t+1)H\mathcal{E}^{(t+1)}\subset H and every vHv\in H satisfies (H2), every vEjk(i+1)v\in\mathcal{E}_{jk}^{(i+1)} satisfies:

Appendix E Proof of Theorem 3

The positive result is obtained by applying Theorem 2 to s=12s=\frac{1}{2}. When liminfnnD(α,p)log(n)1\lim\inf_{n\to\infty}\frac{nD(\alpha,p)}{\log(n)}\geq 1, SP algorithm find clusters exactly with high probability. Thus, it suffices to show the negative result.

We prove the negative part by contradiction. Consider a maximum a posteriori (MAP) estimation with full parameter information. When we observe a labeld information AA, the MAP estimates the clusters as follows:

Let ε\mboxMAP\varepsilon^{\mbox{MAP}} denote the number of misclassified nodes by the MAP estimation. From the definition of the MAP estimation, for any clustring algorithm π\pi, we have

Thus, in what follows, we show that when liminfnnD(α,p)log(n)<1\lim\inf_{n\to\infty}\frac{nD(\alpha,p)}{\log(n)}<1, the MAP estimation is failed to find the exact clusters with high probability.

We start by Lemma 16 which finds a large deviation inequality for edge connections.

Assume that there exists a constant η>0\eta>0 such that nD(α,p)log(n)<1η\frac{nD(\alpha,p)}{\log(n)}<1-\eta. Let (i,j)=argmini,j:i<jDL+(p(i),p(j))(i^{\star},j^{\star})=\arg\min_{i,j:i<j}D_{L+}(p(i),p(j)) (i.e., it is the hardest case to discriminate cluster ii^{\star} and cluster jj^{\star}). When nn\to\infty, one can easily check using the continuity of the KL divergence that there exists x{\bm{x}}^{\star} such that when e(v)=xe(v)=\bm{x}^{\star},

Let vVev^{\star}\in\mathcal{V}_{e} be a node in Ve\mathcal{V}_{e}. We denote by Φ\Phi the original partition and define a slightly modified partition Ψ\Psi as follows:

Then, Ψ\Psi is a more likely partition than Φ\Phi from (83), i.e.,

which means that the MAP estimator does not select the exact partition when Ve\mathcal{V}_{e} is not empty. Therefore, from (82), every clustering algorithm π\pi has the error probability that

when there exists a constant η>0\eta>0 such that nD(α,p)log(n)<1η\frac{nD(\alpha,p)}{\log(n)}<1-\eta. \blacksquare