Optimal Cluster Recovery in the Labeled Stochastic Block Model

Se-Young Yun, Alexandre Proutiere

Introduction

Community detection consists in extracting (a few) groups of similar items from a large global population, and has applications in a wide spectrum of disciplines including social sciences, biology, computer science, and statistical physics. The communities or clusters of items are inferred from the observed pair-wise similarities between items, which, most often, are represented by a graph whose vertices are items and edges are pairs of items known to share similar features.

Over the last few years, we have seen remarkable progress on the problem of cluster recovery under the SBM (see for an exhaustive literature review), highlighting its scientific relevance and richness. Most recent work on the SBM aimed at characterizing the set of parameters (i.e., the probabilities $p(i,j)$ that there exists an edge between nodes in clusters $i$ and $j$ for $1\leq i,j\leq K$) such that some qualitative recovery objectives can or cannot be met. For sparse scenarios where the average degree of items in the graph is $O(1)$, parameters under which it is possible to extract clusters positively correlated with the true clusters have been identified. When the average degree of the graph is $\omega(1)$, one may predict the set of parameters allowing a cluster recovery with a vanishing (as $n$ grows large) proportion of misclassified items, but one may also characterize parameters for which an asymptotically exact cluster reconstruction can be achieved.

In this paper, we address the finer and more challenging question of determining, under the general LSBM, the minimal number of misclassified items given the parameters of the model. Specifically, for any given $s=o(n)$ , our goal is to identify the set of parameters such that it is possible to devise a clustering algorithm with at most $s$ misclassified items. Of course, if we achieve this goal, we shall recover all the aforementioned results on the SBM.

We first derive a tight lower bound on the average number of misclassified items when the latter is $o(n)$. Note that such a bound was unknown even for the SBM.

These theorems indicate that under the LSBM with parameters satisfying (A1) and (A2), the number of misclassified items scales at least as $n\exp(-nD(\alpha,p)(1+o(1)))$ under any clustering algorithm, irrespective of its complexity. They further establish that the Spectral Partition (SP) algorithm reaches this fundamental performance limit under the additional condition (A3). We note that the SP algorithm runs in polynomial time, i.e., it requires $O(n^{2}\bar{p}\log(n))$ floating-point operations.

We further establish a necessary and sufficient condition on the parameters of the LSBM for the existence of a clustering algorithm recovering the clusters exactly with high probability. Deriving such a condition was also open.

Assume that (A1) and (A2) hold. If there exists a clustering algorithm that does not misclassify any item with high probability, then the parameters $(\alpha,p)$ of the LSBM satisfy: $\lim\inf_{n\to\infty}\frac{nD(\alpha,p)}{\log(n)}\geq 1$ . If this condition holds, then under (A3), the SP algorithm recovers the clusters exactly with high probability.
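As a rough numerical illustration of the scaling $n\exp(-nD(\alpha,p))$ and of the exact-recovery condition above, the following sketch evaluates both quantities at a finite $n$. It assumes that the value of $D(\alpha,p)$ has already been computed and is passed as a plain number; the function names are hypothetical, the $(1+o(1))$ factor is dropped, and since the results are asymptotic the output is only a heuristic at finite $n$.

```python
import math


def misclassified_scale(n: int, divergence: float) -> float:
    """Leading-order scale n * exp(-n * D), ignoring the (1 + o(1)) factor."""
    return n * math.exp(-n * divergence)


def exact_recovery_ratio(n: int, divergence: float) -> float:
    """Finite-n proxy n * D / log(n) for the liminf condition (>= 1 is needed)."""
    return n * divergence / math.log(n)


if __name__ == "__main__":
    n = 10_000
    # Parametrize the divergence as D = c * log(n) / n for a few constants c.
    for c in (0.5, 1.0, 2.0):
        d = c * math.log(n) / n
        print(c, misclassified_scale(n, d), exact_recovery_ratio(n, d))
```

With $D(\alpha,p)=c\log(n)/n$, the scale equals $n^{1-c}$: it stays above one item when $c<1$ and drops below one item when $c>1$, which matches the threshold at 1 in the condition above.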

The paper is organized as follows. Section 2 presents the related work and examples of applications of our results. In Section 3, we sketch the proof of Theorem 1, which leverages change-of-measure and coupling arguments. We present in Section 4 the Spectral Partition algorithm, and analyze its performance (we outline the proof of Theorem 2). All results are proved in detail in the supplementary material.

Related Work and Applications

Cluster recovery in the SBM has attracted a lot of attention recently. We summarize below existing results, and compare them to ours. Results are categorized depending on the targeted level of performance. First, we consider the notion of detectability, the lowest level of performance requiring that the extracted clusters are just positively correlated with the true clusters. Second, we look at asymptotically accurate recovery, stating that the proportion of misclassified items vanishes as $n$ grows large. Third, we present existing results regarding exact cluster recovery, which means that no item is misclassified. Finally, we report recent work whose objective, like ours, is to characterize the optimal cluster recovery rate.

Detectability. Necessary and sufficient conditions for detectability have been studied for the binary symmetric SBM (i.e., $L=1$ , $K=2$ , $\alpha_{1}=\alpha_{2}$ , $p(1,1,1)=p(2,2,1)=\xi$ , and $p(1,2,1)=p(2,1,1)=\zeta$ ). In the sparse regime where $\xi,\zeta=o(1)$ , and for the binary symmetric SBM, the main focus has been on identifying the phase transition threshold (a condition on $\xi$ and $\zeta$ ) for detectability: It was conjectured in that if $n(\xi-\zeta)<\sqrt{2n(\xi+\zeta)}$ (i.e., under the threshold), no algorithm can perform better than a simple random assignment of items to clusters, and above the threshold, clusters can partially be recovered. The conjecture was recently proved in (necessary condition), and (sufficient condition). The problem of detectability has been also recently studied in for the asymmetric SBM with more than two clusters of possibly different sizes. Interestingly, it is shown that in most cases, the phase transition for detectability disappears.
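The conjectured threshold for the binary symmetric SBM can be checked numerically. The sketch below is a minimal illustration with a hypothetical function name; it only evaluates the inequality quoted above and says nothing about how clusters would actually be recovered.

```python
import math


def above_detectability_threshold(n: int, xi: float, zeta: float) -> bool:
    """Return True when n(xi - zeta) > sqrt(2 n (xi + zeta)).

    xi and zeta are the intra- and inter-cluster edge probabilities of the
    binary symmetric SBM. False means the parameters are at or under the
    threshold quoted in the text.
    """
    return n * (xi - zeta) > math.sqrt(2 * n * (xi + zeta))


if __name__ == "__main__":
    n = 100_000
    # Sparse regime: xi = a / n and zeta = b / n, so the test reads (a - b)^2 > 2 (a + b).
    print(above_detectability_threshold(n, 5 / n, 1 / n))  # 16 > 12
    print(above_detectability_threshold(n, 3 / n, 1 / n))  # 4 < 8
```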

The present paper is not concerned with conditions for detectability. Indeed detectability means that only a strictly positive proportion of items can be correctly classified, whereas here, we impose that the proportion of misclassified items vanishes as $n$ grows large.

Asymptotically accurate recovery. A necessary and sufficient condition for asymptotically accurate recovery in the SBM (with any number of clusters of different but linearly increasing sizes) has been derived in and . Using our notion of divergence specialized to the SBM, this condition is $nD(\alpha,p)=\omega(1)$ . Our results are more precise since the minimal achievable number of misclassified items is characterized, and apply to a broader setting since they are valid for the generic LSBM.

Asymptotically exact recovery. Conditions for exact cluster recovery in the SBM have been also recently studied. provide a necessary and sufficient condition for asymptotically exact recovery in the binary symmetric SBM. For example, it is shown that when $\xi={a\log(n)\over n}$ and $\zeta={b\log(n)\over n}$ for $a>b$ , clusters can be recovered exactly if and only if ${a+b\over 2}-\sqrt{ab}\geq 1$ . In , the authors consider a more general SBM corresponding to our LSBM with $L=1$ . They define CH-divergence as:

and show that $\min_{i\neq j}D_{+}(\alpha,p(i),p(j))>1$ is a necessary and sufficient condition for asymptotically exact reconstruction. The following claim, proven in the supplementary material, relates $D_{+}$ to $D_{L+}$ .

When $\bar{p}=o(1)$ , we have for all $i,j$ :

Thus, the results in are obtained by applying Theorem 3 and Claim 4.

Again from this claim, the results derived in are obtained by applying Theorem 3 and Claim 5.

Optimal recovery rate. In , the authors consider the binary SBM in the sparse regime where the average degree of items in the graph is $O(1)$, and identify the minimal number of misclassified items for very specific intra- and inter-cluster edge probabilities $\xi$ and $\zeta$. Again the sparse regime is out of the scope of the present paper. are concerned with the general SBM corresponding to our LSBM with $L=1$, and with regimes where asymptotically accurate recovery is possible. The authors first characterize the optimal recovery rate in a minimax framework. More precisely, they consider a (potentially large) set of possible parameters $(\alpha,p)$, and provide a lower bound on the expected number of misclassified items for the worst parameters in this set. Our lower bound (Theorem 1) is more precise as it is model-specific, i.e., we provide the minimal expected number of misclassified items for a given parameter $(\alpha,p)$ (and for a more general class of models). Then the authors propose a clustering algorithm, with time complexity $O(n^{3}\log(n))$, and achieving their minimax recovery rate. In comparison, our algorithm achieves the optimal recovery rate for any given parameter $(\alpha,p)$, exhibits a lower running time of $O(n^{2}\bar{p}\log(n))$, and applies to the generic LSBM.

2 Applications

We provide here a few examples of application of our results, illustrating their versatility. In all examples, $f(n)$ is a function such that $f(n)=\omega(1)$ , and $a,b$ are fixed real numbers such that $a>b$ .

The binary SBM. Consider the binary SBM where the average item degree is $\Theta(f(n))$ , and represented by a LSBM with parameters $L=1$ , $K=2$ , $\alpha=(\alpha_{1},1-\alpha_{1})$ , $p(1,1,1)=p(2,2,1)=\frac{af(n)}{n}$ , and $p(1,2,1)=p(2,1,1)=\frac{bf(n)}{n}$ . From Theorems 1 and 2, the optimal number of misclassified vertices scales as $n\exp(-g(\alpha_{1},a,b)f(n)(1+o(1)))$ when $\alpha_{1}\leq 1/2$ (w.l.o.g.) and where

It can be easily checked that $g(\alpha_{1},a,b)\geq g(1/2,a,b)=\frac{1}{2}(\sqrt{a}-\sqrt{b})^{2}$ (letting $\lambda=\frac{1}{2}$ ). The worst case is hence obtained when the two clusters are of equal sizes. When $f(n)=\log(n)$ , we also note that the condition for asymptotically exact recovery is $g(\alpha_{1},a,b)\geq 1$ .
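The properties of $g$ stated above can be checked numerically. The expression for $g$ is not reproduced in the text, so the sketch below uses an assumed form, obtained by specializing a Chernoff-Hellinger-type divergence to the binary SBM: $g(\alpha_{1},a,b)=\max_{\lambda\in[0,1]}\alpha_{1}h(\lambda)+(1-\alpha_{1})h(1-\lambda)$ with $h(\lambda)=\lambda a+(1-\lambda)b-a^{\lambda}b^{1-\lambda}$. Both this form and the function names are assumptions made for illustration, and the maximization is a plain grid search.

```python
import math


def g(alpha1: float, a: float, b: float, grid: int = 2000) -> float:
    """Grid-search the ASSUMED expression of g over lambda in [0, 1]."""
    best = 0.0
    for i in range(grid + 1):
        lam = i / grid
        h_lam = lam * a + (1 - lam) * b - a**lam * b ** (1 - lam)
        h_one_minus_lam = (1 - lam) * a + lam * b - a ** (1 - lam) * b**lam
        best = max(best, alpha1 * h_lam + (1 - alpha1) * h_one_minus_lam)
    return best


def exact_recovery_binary(alpha1: float, a: float, b: float) -> bool:
    """Condition g >= 1 for asymptotically exact recovery when f(n) = log(n)."""
    return g(alpha1, a, b) >= 1.0


if __name__ == "__main__":
    a, b = 4.0, 1.0
    print(g(0.5, a, b), 0.5 * (math.sqrt(a) - math.sqrt(b)) ** 2)
    print(g(0.3, a, b) >= g(0.5, a, b))
    print(exact_recovery_binary(0.5, 9.0, 1.0), exact_recovery_binary(0.5, a, b))
```

Under this assumed form, the balanced case $\alpha_{1}=1/2$ reproduces $\frac{1}{2}(\sqrt{a}-\sqrt{b})^{2}$, and unbalanced clusters give a value at least as large, in line with the remark that equal cluster sizes are the worst case.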

Recovering a single hidden community. As in , consider a random graph model with a hidden community consisting of $\alpha n$ vertices, edges between vertices belonging to the hidden community are present with probability $\frac{af(n)}{n}$, and edges between other pairs are present with probability $\frac{bf(n)}{n}$. This is modeled by a LSBM with parameters $K=2$, $L=1$, $\alpha_{1}=\alpha$, $p(1,1,1)=\frac{af(n)}{n}$, and $p(1,2,1)=p(2,1,1)=p(2,2,1)=\frac{bf(n)}{n}$. The minimal number of misclassified items when searching for the hidden community scales as $n\exp(-h(\alpha,a,b)f(n)(1+o(1)))$ where

When $f(n)=\log(n)$ , the condition for asymptotically exact recovery of the hidden community is $h(\alpha,a,b)\geq 1$ .

Optimal sampling for community detection under the SBM. Consider a dense binary symmetric SBM with intra- and inter-cluster edge probabilities $a$ and $b$. In practice, to recover the clusters, one might not be able to observe the entire random graph, but sample its vertex (here item) pairs as considered in . Assume for instance that any pair of vertices is sampled with probability $\frac{\delta f(n)}{n}$ for some fixed $\delta>0$, independently of other pairs. We can model such a scenario using a LSBM with three labels, namely $\times$, 0 and 1, corresponding to the absence of observation (the vertex pair is not sampled), the observation of the absence of an edge and of the presence of an edge, respectively, and with parameters for all $i,j\in\{1,2\}$, $p(i,j,\times)=1-\frac{\delta f(n)}{n}$, $p(1,1,1)=p(2,2,1)=a\frac{\delta f(n)}{n}$, and $p(1,2,1)=p(2,1,1)=b\frac{\delta f(n)}{n}$. The minimal number of misclassified vertices scales as $n\exp(-l(\delta,a,b)f(n)(1+o(1)))$ where $l(\delta,a,b):=\delta(1-\sqrt{ab}-\sqrt{(1-a)(1-b)}).$ When $f(n)=\log(n)$, the condition for asymptotically exact recovery is $l(\delta,a,b)\geq 1$.
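Since $l(\delta,a,b)$ is linear in $\delta$, the condition $l(\delta,a,b)\geq 1$ translates directly into a minimal sampling budget. The sketch below, with hypothetical function names, evaluates the rate $\delta(1-\sqrt{ab}-\sqrt{(1-a)(1-b)})$ given above and the smallest $\delta$ meeting the exact-recovery condition when $f(n)=\log(n)$.

```python
import math


def sampling_rate_exponent(delta: float, a: float, b: float) -> float:
    """l(delta, a, b) = delta * (1 - sqrt(a b) - sqrt((1 - a)(1 - b)))."""
    return delta * (1 - math.sqrt(a * b) - math.sqrt((1 - a) * (1 - b)))


def min_sampling_budget(a: float, b: float) -> float:
    """Smallest delta such that l(delta, a, b) >= 1 (case f(n) = log(n))."""
    unit = sampling_rate_exponent(1.0, a, b)
    if unit <= 0:
        # The bracket vanishes when a == b: the clusters are indistinguishable.
        raise ValueError("a and b must differ")
    return 1.0 / unit


if __name__ == "__main__":
    # With a = 0.9 and b = 0.1 the bracket equals 1 - 0.3 - 0.3 = 0.4.
    print(min_sampling_budget(0.9, 0.1))
```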

Signed networks. Signed networks are used in social sciences to model positive and negative interactions between individuals. These networks can be represented by a LSBM with three possible labels, namely 0, + and -, corresponding to the absence of interaction, positive and negative interaction, respectively. Consider such LSBM with parameters: $K=2$ , $\alpha_{1}=\alpha_{2}$ , $p(1,1,+)=p(2,2,+)=\frac{a_{+}f(n)}{n}$ , $p(1,1,-)=p(2,2,-)=\frac{a_{-}f(n)}{n}$ , $p(1,2,+)=p(2,1,+)=\frac{b_{+}f(n)}{n}$ , and $p(1,2,-)=p(2,1,-)=\frac{b_{-}f(n)}{n}$ , for some fixed $a_{+},a_{-},b_{+},b_{-}$ such that $a_{+}>b_{+}$ and $a_{-}<b_{-}$ . The minimal number of misclassified individuals here scales as $n\exp(-m(\alpha,a_{+},a_{-},b_{+},b_{-})f(n)(1+o(1)))$ where

When $f(n)=\log(n)$, the condition for asymptotically exact recovery is $m(\alpha,a_{+},a_{-},b_{+},b_{-})\geq 1$.

Fundamental Limits: Change of Measures through Coupling

Construction of $\Psi$. Let $(i^{\star},j^{\star})=\arg\min_{i,j:i<j}D_{L+}(\alpha,p(i),p(j))$, and let $v^{\star}$ denote the smallest item index that belongs to cluster $i^{\star}$ or $j^{\star}$. If both $\mathcal{V}_{i^{\star}}$ and $\mathcal{V}_{j^{\star}}$ are empty, we define $v^{\star}=n$. Let $q\in{\cal P}^{K\times(L+1)}$ be such that: $D(\alpha,p)=\sum_{k=1}^{K}\alpha_{k}KL(q(k),p(i^{\star},k))=\sum_{k=1}^{K}\alpha_{k}KL(q(k),p(j^{\star},k)).$ The existence of such a $q$ is proved in Lemma 7 in the supplementary material. Now to define the stochastic model $\Psi$, we couple the generation of labels under $\Phi$ and $\Psi$ as follows.

1. We first generate the random clusters $\mathcal{V}_{1},\ldots,\mathcal{V}_{K}$ under $\Phi$ , and extract $i^{\star}$ , $j^{\star}$ , and $v^{\star}$ . The clusters generated under $\Psi$ are the same as those generated under $\Phi$ . For any $v\in\mathcal{V}$ , we denote by $\sigma(v)$ the cluster of item $v$ .

The Spectral Partition Algorithm and its Optimality

In this section, we sketch the proof of Theorem 2. To this aim, we present the Spectral Partition (SP) algorithm and analyze its performance. The SP algorithm consists of two parts, and its detailed pseudo-code is presented at the beginning of the supplementary document (see Algorithm 1).

Assume that (A1) and (A2) hold, and that $n\bar{p}=\omega(1)$ . After Step 4 (spectral decomposition) in the SP algorithm, with high probability, $\hat{K}=K$ and there exists a permutation $\gamma$ of $\{1,\ldots,K\}$ such that: $\left|\cup_{k=1}^{K}\mathcal{V}_{k}\setminus S_{\gamma(k)}\right|=O\left(\frac{\log(n\bar{p})^{2}}{\bar{p}}\right).$
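The quantity bounded above is a misclassification count taken up to a relabeling $\gamma$ of the clusters. A minimal sketch of how such a count can be evaluated on small examples is given below; the function name is hypothetical, and the brute-force search over all $K!$ permutations is an illustrative choice that is only practical for small $K$.

```python
from itertools import permutations


def misclassified_count(true_clusters, estimated_clusters):
    """Minimum over permutations gamma of |union_k V_k minus S_gamma(k)|.

    Both arguments are lists of K sets of item indices.
    """
    k_clusters = len(true_clusters)
    if len(estimated_clusters) != k_clusters:
        raise ValueError("both partitions must have the same number of clusters")
    best = None
    for gamma in permutations(range(k_clusters)):
        errors = sum(
            len(true_clusters[k] - estimated_clusters[gamma[k]])
            for k in range(k_clusters)
        )
        if best is None or errors < best:
            best = errors
    return best


if __name__ == "__main__":
    truth = [{0, 1, 2}, {3, 4, 5}]
    estimate = [{2, 3, 4, 5}, {0, 1}]  # labels swapped, item 2 misplaced
    print(misclassified_count(truth, estimate))
```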

Finally, we establish that if the clusters provided after the first part of the SP algorithm are asymptotically accurate, then after $\log(n)$ improvement iterations, there are no misclassified items in $H$. To that aim, we denote by $\mathcal{E}^{(t)}$ the set of misclassified items after the $t$-th iteration, and show that with high probability, for all $t$, $\frac{|\mathcal{E}^{(t+1)}\cap H|}{|\mathcal{E}^{(t)}\cap H|}\leq\frac{1}{\sqrt{n\bar{p}}}$. This completes the proof of Theorem 2, since after $\log(n)$ iterations, the only misclassified items are those in $\mathcal{V}\setminus H$.
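The geometric contraction explains why $\log(n)$ iterations suffice: the bound on $|\mathcal{E}^{(t)}\cap H|$ is multiplied by at most $1/\sqrt{n\bar{p}}$ at each iteration, and a count bounded by a number below 1 must be 0. The following sketch, with a hypothetical function name and illustrative numbers, counts how many iterations the bound needs to fall below 1.

```python
import math


def iterations_until_clean(initial_errors: float, n: int, p_bar: float) -> int:
    """Number of contraction steps until the error bound drops below 1.

    Applies the worst-case ratio 1 / sqrt(n * p_bar) at each step.
    """
    ratio = 1.0 / math.sqrt(n * p_bar)
    if ratio >= 1.0:
        raise ValueError("the contraction requires n * p_bar > 1")
    bound, steps = float(initial_errors), 0
    while bound >= 1.0:
        bound *= ratio
        steps += 1
    return steps


if __name__ == "__main__":
    n, p_bar = 10**6, 1e-4  # n * p_bar is about 100, so the ratio is about 0.1
    print(iterations_until_clean(500, n, p_bar), math.log(n))
```

In this example the bound goes from 500 to roughly 50, 5, and 0.5, so three iterations are enough, well below $\log(n)\approx 13.8$.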


Appendix A The SP Algorithm

Appendix B Properties of the divergence $D(\alpha,p)$ and related quantities

In this section, we prove the two claims of Section 2, as well as other results on the divergence $D(\alpha,p)$ that will be instrumental in the proofs of the theorems.

$D_{L+}(p(i),p(j))$ is the minimum of the objective function of the following convex optimization problem:

Substituting (9) into (8) and using the approximation $\log(1+x)=x(1+o(1))$ as $x\to 0$ (again using $\bar{p}=o(1)$ ),

Therefore, the minimum value of (6) is equivalent to

B.2 Proof of Claim 5

Now, since $\sqrt{1+x}=1+\frac{x}{2}(1+o(1))$ and $\log(1+x)=x(1+o(1))$ when $x=o(1)$ ,

B.3 Other properties

Let $(i^{\star},j^{\star})=\arg\min_{i,j}D_{L+}(p(i),p(j))$ and $i^{\star}<j^{\star}$ . Then, there exists $q\in\mathcal{P}^{K\times(L+1)}$ such that

Proof. We check by contradiction that such a $q$ exists. Indeed, assume that

Then there exists $k_{0}$ such that $KL(q(k_{0}),p(i^{\star},k_{0}))>KL(q(k_{0}),p(j^{\star},k_{0}))$ . Observe that by positivity of the $KL$ divergence, $q(k_{0})\neq p(i^{\star},k_{0})$ . Hence by continuity of the $KL$ divergence, we can construct $q^{\prime}$ such that $q(k)=q^{\prime}(k)$ for all $k\neq k_{0}$ , and such that: $KL(q(k_{0}),p(i^{\star},k_{0}))-\epsilon<KL(q^{\prime}(k_{0}),p(i^{\star},k_{0}))<KL(q(k_{0}),p(i^{\star},k_{0}))$ and $KL(q^{\prime}(k_{0}),p(j^{\star},k_{0}))<KL(q(k_{0}),p(j^{\star},k_{0}))+\epsilon$ for some $0<\epsilon<(KL(q(k_{0}),p(i^{\star},k_{0}))-KL(q(k_{0}),p(j^{\star},k_{0})))/2$ . With this choice of $q^{\prime}$ , we get:

which contradicts the definition of $D(\alpha,p)$ . $\blacksquare$

Proof. Let $(i^{\star},j^{\star})=\arg\min_{i,j}D_{L+}(\alpha,p(i),p(j))$ and $i^{\star}<j^{\star}$ . From Lemma 7, there exists $q$ satisfying that

Under condition (A1), when $\bar{p}=o(1)$ , $\lim\sup_{n\to\infty}\frac{D(\alpha,p)}{\eta\bar{p}L}\leq 1.$

Proof. From the definition of $D(\alpha,p)$ , for any $i\neq j$ ,

where we use $\log(1+x)=x(1+o(1))$ when $x=o(1)$ . $\blacksquare$

Appendix C Proof of Theorem 1

Coupling and the perturbed stochastic model $\Psi$. Let $(i^{\star},j^{\star})=\arg\min_{i,j:i<j}D_{L+}(p(i),p(j))$, and let $v^{\star}$ denote the smallest node index that belongs to cluster $i^{\star}$ or $j^{\star}$. If both $\mathcal{V}_{i^{\star}}$ and $\mathcal{V}_{j^{\star}}$ are empty, we define $v^{\star}=n$. Let $q\in\mathcal{P}^{K\times(L+1)}$ satisfy:

There exists such a $q$ from Lemma 7. Now to define the perturbed stochastic model $\Psi$ , we couple the generation of labels under $\Phi$ and $\Psi$ as follows.

We first generate the random clusters $\mathcal{V}_{1},\ldots,\mathcal{V}_{K}$ under $\Phi$, and extract $i^{\star}$, $j^{\star}$, and $v^{\star}$. The clusters generated under $\Psi$ are the same as those generated under $\Phi$. For any $v\in\mathcal{V}$, we denote by $\sigma(v)$ the cluster of node $v$.

Let $\pi$ denote a clustering algorithm with output $(\hat{\mathcal{V}}_{k})_{1\leq k\leq K}$ , and let $\mathcal{E}=\bigcup_{1\leq k\leq K}\hat{\mathcal{V}}_{k}\setminus\mathcal{V}_{k}$ be the set of misclassified nodes under $\pi$ . Note that in general in our proofs, we always assume without loss of generality that $|\bigcup_{1\leq k\leq K}\hat{\mathcal{V}}_{k}\setminus\mathcal{V}_{k}|\leq|\bigcup_{1\leq k\leq K}\hat{\mathcal{V}}_{\gamma(k)}\setminus\mathcal{V}_{k}|$ for any permutation $\gamma$ , so that the set of misclassified nodes is really $\mathcal{E}$ . We denote by $\varepsilon^{\pi}(n)=|\mathcal{E}|$ . Since under $\Phi$ , nodes are interchangeable (remember that nodes are assigned to the various clusters in an i.i.d. manner), we have:

where the last inequality is obtained from the fact that we cannot distinguish between $v^{\star}$ and any other $v\in\mathcal{V}_{\sigma(v^{\star})}$ . Indeed,

Furthermore, since under the stochastic model $\Psi$ , the observed labels do not depend on whether $v^{\star}$ belongs to cluster $i^{\star}$ or $j^{\star}$ , we have:

Combining (17), (22), and (27), we conclude that:

In addition, from Chebyshev’s inequality,

Hence from condition (A1), (32), and the definition of $\mathcal{Q}$ ,

where the last inequality stems from the fact that $2KL(q(i),p(\sigma(v^{\star}),i))\log\eta\geq KL(q(j),p(\sigma(v^{\star}),j))$ for all $i$ and $j$ from condition (A1).

To derive the above inequality, we have used:

Appendix D Performance of the SP Algorithm – Proof of Theorem 2

For every $v\in\mathcal{V}$ and $c\geq 1$ , we have

where we derive the last inequality choosing $\theta=2$ . $\blacksquare$

The proof of Lemma 11 relies on arguments used in the spectral analysis of random graphs, see and .

For all $v\in{\mathcal{V}}_{k}$ and $D\geq 0$ ,

Proof. Let $\mathcal{X}$ be a set of $K\times(L+1)$ matrices such that

where $(a)$ stems from the following inequality:

D.2 Part 1 of the SP algorithm – Proof of Theorem 6

Recall that $\hat{A}=\hat{U}\hat{V}=\hat{U}\hat{U}^{\top}A_{\Gamma}$ and $\|\hat{A}_{u}-\hat{A}_{v}\|=\|\hat{V}_{u}-\hat{V}_{v}\|$ . We can bound the number of misclassified items as follows:

with high probability, every item pair $u$ and $v$ satisfies that when $\sigma(v)$ represents the cluster of $v$ and $M_{v,\Gamma}$ denotes the column vector of $M_{\Gamma}$ on $v$ ,

(40) suggests that if $v$ is misclassified by Algorithm 2, then we should have:

from (39) and (41), with high probability,

Proof of (39). First observe that from the definition of $\Gamma$ ,

where the first inequality stems from Lemma 10 and Markov inequality. Therefore, with high probability,

Proof of (41). Define the following sets:

$|\Gamma\setminus(\cup_{k=1}^{K}\mathcal{I}_{k})|\leq\frac{\|\hat{A}-M_{\Gamma}\|_{F}^{2}}{\min_{v\in\Gamma\setminus(\cup_{k=1}^{K}\mathcal{I}_{k})}\|\hat{A}_{v}-M_{\Gamma}^{k}\|^{2}}=O\left(\frac{\log(n\bar{p})^{3}}{\bar{p}}\right)$ ;

From the properties of $\mathcal{I}_{k}$ and $\mathcal{O}$ , we state the following results.

since every $w\in(\cup_{k=1}^{K}\mathcal{I}_{k})$ is outside of $Q_{v}$ (i.e., $w\in\Gamma\setminus(\cup_{k=1}^{K}\mathcal{I}_{k})$ is necessary for $w\in Q_{v}$ );

since $\alpha_{k}$ is a constant for all $k$ and $\frac{|\Gamma\setminus(\cup_{k=1}^{K}\mathcal{I}_{k})|}{|\Gamma|}=o(1)$ from (ii), with high probability,

The properties (ii), (iii), and (iv) and (51) imply that

where $m_{k}$ is the $k$ -th largest value among $\{|\mathcal{I}_{1}|,\dots,|\mathcal{I}_{K}|\}$ ;

since $|\mathcal{I}_{k}|\geq|\mathcal{V}_{k}\cap(\Gamma\setminus\mathcal{O})|\geq\alpha_{k}n(1-o(1))$ from (ii) and (iii),

D.3 Proof of Theorem 2

From Chernoff bound, with high probability,

In what follows, we hence just prove the theorem assuming that (54) holds.

Let $H$ be the largest set of items $v\in\mathcal{V}$ satisfying:

$e(v,\mathcal{V}\setminus H)\leq 2\log(n\bar{p})^{2}.$

(H1) regularizes degrees, (H2) means that $v\in H$ is correctly classified when using the log-likelihood estimate, and (H3) means that $v$ does not share too many labels with items outside $H$ .
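To make the log-likelihood classification behind (H2) concrete, the sketch below assigns an item to the cluster maximizing $\sum_{i}\sum_{\ell}e(v,\mathcal{V}_{i},\ell)\log p(k,i,\ell)$, where $e(v,\mathcal{V}_{i},\ell)$ counts the items of $\mathcal{V}_{i}$ sharing label $\ell$ with $v$. The exact criterion used in (H2) and in the SP algorithm is not reproduced in the text, so this form, the data layout, and the function name are assumptions made for illustration.

```python
import math


def most_likely_cluster(v, clusters, labels, p):
    """Index k maximizing sum_i sum_l e(v, V_i, l) * log p[k][i][l].

    clusters: list of K lists of item indices.
    labels:   labels[v][w] is the observed label of the pair (v, w).
    p:        p[k][i][l] is the probability of label l between an item of
              cluster k and an item of cluster i (ASSUMED known and > 0).
    """
    num_labels = len(p[0][0])
    # counts[i][l] = number of items w in cluster i, w != v, with label l.
    counts = []
    for cluster in clusters:
        row = [0] * num_labels
        for w in cluster:
            if w != v:
                row[labels[v][w]] += 1
        counts.append(row)
    best_k, best_ll = -1, -math.inf
    for k in range(len(clusters)):
        ll = sum(
            c * math.log(p[k][i][l])
            for i, row in enumerate(counts)
            for l, c in enumerate(row)
            if c > 0
        )
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k


if __name__ == "__main__":
    clusters = [[0, 1, 2], [3, 4, 5]]
    same = lambda u, w: (u < 3) == (w < 3)
    labels = [[1 if u != w and same(u, w) else 0 for w in range(6)] for u in range(6)]
    p = [[[0.2, 0.8], [0.8, 0.2]], [[0.8, 0.2], [0.2, 0.8]]]
    print([most_likely_cluster(v, clusters, labels, p) for v in range(6)])
```

In this toy example every item shares label 1 with its own cluster and label 0 with the other one, so each item is assigned to its true cluster.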

The proof of the theorem follows from the following propositions. The first provides an upper bound of $|\mathcal{V}\setminus H|$ , and the second provides the rate at which our estimated clusters improve in each iteration when we restrict our attention to items in $H$ .

When $nD(\alpha,p)-\frac{n\bar{p}}{\log(n\bar{p})^{3}}\geq\log(n/s)+\sqrt{\log(n/s)}$ , $|\mathcal{V}\setminus H|\leq s$ with high probability.

If ${|\bigcup_{k=1}^{K}(S^{(0)}_{k}\setminus\mathcal{V}_{k})\cap H|+|\mathcal{V}\setminus H|}=O(1/\bar{p})$ , with high probability, the following statement holds

From Proposition 14, after $\log(n)$ iterations (remember that $n\bar{p}=\omega(1)$, so when $n$ is large enough $1/\sqrt{n\bar{p}}\leq e^{-2}$ ), no item in $H$ can be misclassified with high probability. Hence the number of misclassified items cannot exceed $|\mathcal{V}\setminus H|\leq s$ when $nD(\alpha,p)-\frac{n\bar{p}}{\log(n\bar{p})^{3}}\geq\log(n/s)+\sqrt{\log(n/s)}$. The proof is completed by remarking that if the previous condition on $D(\alpha,p)$ holds, then

where we used $D(\alpha,p)=\Omega(\bar{p})$ from condition (A2) and Lemma 8. $\blacksquare$

We compute the number of items satisfying (H1), (H2), and (H3) in (55), (56), and Lemma 15, respectively.

Number of items satisfying (H1): From Lemma 10, we get:

Number of items satisfying (H2): We shall prove that when $v$ satisfies (H1), $v$ satisfies (H2) as well with probability at least

To this aim, we first establish that if $v$ satisfies

then $v$ satisfies (H2). Indeed, assume that (57) holds, then

$\sum_{i=1}^{K}\alpha_{i}nKL(\mu(v,\mathcal{V}_{i}),p(k,i))\leq\left(1+\frac{\log(n)^{2}}{\sqrt{n}}\right)\sum_{i=1}^{K}|\mathcal{V}_{i}|KL(\mu(v,\mathcal{V}_{i}),p(k,i))<nD(\alpha,p)$ , since $||\mathcal{V}_{i}|-\alpha_{i}n|\leq\sqrt{n}\log(n)$ and (57) holds;

$\sum_{i=1}^{K}\alpha_{i}nKL(\mu(v,\mathcal{V}_{i}),p(j,i))\geq nD(\alpha,p)$ , since $\max\left\{\sum_{i=1}^{K}\alpha_{i}KL(\mu(v,\mathcal{V}_{i}),p(j,i)),\sum_{i=1}^{K}\alpha_{i}KL(\mu(v,\mathcal{V}_{i}),p(k,i))\right\}\geq D(\alpha,p)$ and $\sum_{i=1}^{K}\alpha_{i}KL(\mu(v,\mathcal{V}_{i}),p(k,i))<D(\alpha,p)$ ;

$\sum_{i=1}^{K}|\mathcal{V}_{i}|KL(\mu(v,\mathcal{V}_{i}),p(j,i))\geq\left(1-\frac{\log(n)^{2}}{\sqrt{n}}\right)nD(\alpha,p)$ , from ii) and the fact that $||\mathcal{V}_{i}|-\alpha_{i}n|\leq\sqrt{n}\log(n)$ ;

Hence $v$ satisfies (H2). It remains to evaluate the probability of the event (57), which is done by applying Lemma 12 and proves (56).

Number of items satisfying (H3): From (55), (56), and the Markov inequality, we deduce that with probability at least $1-\exp\left(-\sqrt{\log(n/s)}\right)$ , the number of items that do not satisfy either (H1) or (H2) is less than $s/3$ when $nD(\alpha,p)-\frac{n\bar{p}}{\log(n\bar{p})^{3}}\geq\log(n/s)+\sqrt{\log(n/s)}$ , since

where we have used Lemma 9 for the last inequality. Lemma 15 allows us to complete the proof of Proposition 13. $\blacksquare$

When the number of items that do not satisfy either (H1) or (H2) is less than $s/3$ , $|\mathcal{V}\setminus H|\leq s$ , with high probability.

Proof. Let $e(S,S)=\sum_{v\in S}e(v,S)$. Next we prove the following intermediate claim: with high probability, there is no subset $S\subset\mathcal{V}$ such that $e(S,S)\geq s\log(n\bar{p})^{2}$ and $|S|=s$. For any subset $S\subset\mathcal{V}$ such that $|S|=s,$ by Markov's inequality,

where, in the last two inequalities, we have set $t=\frac{n\bar{p}}{\log n\bar{p}}$ and used the fact that: $\frac{n}{s}\geq\exp(\frac{n\bar{p}}{\log n\bar{p}}),$ which comes from the assumptions made in the theorem. Since the number of subsets $S\subset\mathcal{V}$ with size $s$ is ${{n}\choose{s}}\leq(\frac{en}{s})^{s},$ from (65), we deduce:

Therefore, by Markov inequality, we can conclude that there is no $S\subset\mathcal{V}$ such that $e(S,S)\geq s\log(n\bar{p})^{2}$ and $|S|=s$ with high probability.

To conclude the proof of the lemma, we build the following sequence of sets. Let $Z_{1}$ denote the set of items that do not satisfy at least one of (H1) and (H2). Let $\{Z(t)\subset\mathcal{V}\}_{1\leq t\leq t^{\star}}$ be generated as follows, starting from $Z(0)=Z_{1}$:

For $t\geq 1$ , $Z(t)=Z(t-1)\cup\{v_{t}\}$ if there exists $v_{t}\in\mathcal{V}$ such that $e(v_{t},Z(t-1))>2\log(n\bar{p})^{2}$ and $v_{t}\notin Z(t-1)$ . If such an item does not exist, the sequence ends.

The sequence ends after the construction of $Z(t^{\star})$. We show that if we assume that the number of items that do not satisfy (H3) is strictly larger than $s/2$, then one of the sets in the sequence $\{Z(t)\subset\mathcal{V}\}_{1\leq t\leq t^{\star}}$ contradicts the claim we just proved.

Assume that the number of items that do not satisfy (H3) is strictly larger than $s/2$. Then these items will at some point be added to the sets $Z(t)$, and by definition, each of these items contributes more than $2\log(n\bar{p})^{2}$ to $e(Z(t),Z(t))$. Hence if starting from $Z_{1}$, we add $s/2$ items not satisfying (H3), we get a set $Z(t)$ of cardinality less than $s/3+s/2$ and such that $e(Z(t),Z(t))>s\log(n\bar{p})^{2}$. We can further add arbitrary items to $Z(t)$ so that it becomes of cardinality $s$, and the obtained set contradicts the claim. $\blacksquare$
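The sequence $Z(t)$ used in this proof is a greedy closure: starting from an initial set, one keeps adding any item that shares more than a threshold number of labels with the current set. A minimal sketch of this construction follows. It assumes that $e(v,Z)$ counts the items of $Z$ with which $v$ shares a non-zero label, represented here by a neighbor dictionary; the function name and the toy graph are hypothetical.

```python
def grow_until_stable(initial_set, neighbors, threshold):
    """Greedy construction of Z(t): add items v with e(v, Z) > threshold.

    neighbors: dict mapping each item to the set of items it shares a
               non-zero label with (ASSUMED representation of e).
    Returns the final set, reached when no item outside it qualifies.
    """
    current = set(initial_set)
    while True:
        candidate = next(
            (
                v
                for v in neighbors
                if v not in current and len(neighbors[v] & current) > threshold
            ),
            None,
        )
        if candidate is None:
            return current
        current.add(candidate)


if __name__ == "__main__":
    neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
    # Item 2 has two neighbors in {0, 1}; item 3 never has more than one.
    print(sorted(grow_until_stable({0, 1}, neighbors, threshold=1)))
```

In the proof the threshold is $2\log(n\bar{p})^{2}$ and the initial set is $Z_{1}$; the final set does not depend on the order in which qualifying items are added, since adding items can only increase $e(v,Z)$ for the remaining ones.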

D.3.2 Proof of Proposition 14

Recall that $\{S^{(t)}_{j}\}_{1\leq j\leq K}$ is the partition after the $t$-th improvement iteration. Also recall that without loss of generality, we assume that the set of misclassified items in $H$ after the $t$-th step is $\mathcal{E}^{(t)}=\left(\cup_{k}(S_{k}^{(t)}\setminus\mathcal{V}_{k})\right)\cap H$ (it should be defined through an appropriate permutation $\gamma$ of $\{1,\ldots,K\}$ by $\mathcal{E}^{(t)}=(\cup_{k}(S_{k}^{(t)}\setminus\mathcal{V}_{\gamma(k)}))\cap H$, but we omit $\gamma$ ). With this notational convention, we can define $\mathcal{E}_{jk}^{(t)}=(S^{(t)}_{j}\cap\mathcal{V}_{k})\cap H$ and $\mathcal{E}^{(t)}=\bigcup_{j,k:j\neq k}\mathcal{E}_{jk}^{(t)}$. At each improvement step, items move to the most likely cluster (according to the log-likelihood defined in the SP algorithm). Thus, for all $i$ ,

Therefore, from the above inequalities, we conclude that

Next we prove all the steps of the previous analysis.

From (77) and (78), with high probability,

Since $\{S^{(0)}_{k}\}_{1\leq k\leq K}\in\mathcal{S}$ , from the above inequality,

We now turn to the remaining part of (74). Since $|\mathcal{E}^{(0)}|=O\left(\frac{\log(n\bar{p})^{2}}{\bar{p}}\right)$ from Theorem 6,

From (74), (79) and (80), with high probability,

Proof of (69): Since $\log\frac{p(j,i,0)}{p(k,i,0)}=O(\bar{p})$ for all $i,j,k$ and $|\mathcal{E}^{(t)}|=O(\log(n\bar{p})^{2}/\bar{p})$ ,

where the last inequality stems from (H3), i.e., from $e(v,\mathcal{V}\setminus H)\leq 2\log(n\bar{p})^{2}$ when $v\in H$ .

Proof of (70): Since $\mathcal{E}^{(t+1)}\subset H$ and every $v\in H$ satisfies (H2), every $v\in\mathcal{E}_{jk}^{(t+1)}$ satisfies:

Appendix E Proof of Theorem 3

The positive result is obtained by applying Theorem 2 with $s=\frac{1}{2}$: when $\lim\inf_{n\to\infty}\frac{nD(\alpha,p)}{\log(n)}\geq 1$, the SP algorithm finds the clusters exactly with high probability. Thus, it suffices to show the negative result.

We prove the negative part by contradiction. Consider a maximum a posteriori (MAP) estimator with full knowledge of the parameters. When we observe the labels $A$, the MAP estimator estimates the clusters as follows:

Let $\varepsilon^{\mbox{MAP}}$ denote the number of nodes misclassified by the MAP estimator. From the definition of the MAP estimator, for any clustering algorithm $\pi$, we have

Thus, in what follows, we show that when $\lim\inf_{n\to\infty}\frac{nD(\alpha,p)}{\log(n)}<1$, the MAP estimator fails to find the exact clusters with high probability.

We start with Lemma 16, which provides a large deviation inequality for edge connections.

Assume that there exists a constant $\eta>0$ such that $\frac{nD(\alpha,p)}{\log(n)}<1-\eta$ . Let $(i^{\star},j^{\star})=\arg\min_{i,j:i<j}D_{L+}(p(i),p(j))$ (i.e., it is the hardest case to discriminate cluster $i^{\star}$ and cluster $j^{\star}$ ). When $n\to\infty$ , one can easily check using the continuity of the KL divergence that there exists ${\bm{x}}^{\star}$ such that when $e(v)=\bm{x}^{\star}$ ,

Let $v^{\star}\in\mathcal{V}_{e}$ be a node in $\mathcal{V}_{e}$ . We denote by $\Phi$ the original partition and define a slightly modified partition $\Psi$ as follows:

Then, $\Psi$ is a more likely partition than $\Phi$ from (83), i.e.,

which means that the MAP estimator does not select the exact partition when $\mathcal{V}_{e}$ is not empty. Therefore, from (82), the error probability of every clustering algorithm $\pi$ satisfies

when there exists a constant $\eta>0$ such that $\frac{nD(\alpha,p)}{\log(n)}<1-\eta$ . $\blacksquare$