Community Detection in Degree-Corrected Block Models

Chao Gao, Zongming Ma, Anderson Y. Zhang, Harrison H. Zhou

Introduction

In many fields such as social science, neuroscience and computer science, it has become increasingly important to process and make inference on relational data. The analysis of network data, a prevalent form of relational data, becomes an important topic for statistics and machine learning. One central problem of network data analysis is community detection: to partition the nodes in a network into subsets. A meaningful partition of nodes can often uncover interesting information that is not apparent in a complicated network.

One way to accommodate degree heterogeneity is to introduce a set of degree-correction parameters {θi:i=1,,n}\left\{\theta_{i}:i=1,\dots,n\right\}, one for each node, which can be interpreted as the popularity or importance of a node in the network. Then one could revise the edge distributions to Aij=Ajiind.Bern(θiθjBz(i)z(j))A_{ij}=A_{ji}\stackrel{{\scriptstyle ind.}}{{\sim}}\text{Bern}(\theta_{i}\theta_{j}B_{z(i)z(j)}) for all i>ji>j, and this gives rise to the Degree-Corrected Block Models (DCBMs) . In a DCBM, within the same community, a node with a larger value of degree-correction parameter is expected to have more connections than that with a smaller value. On the other hand, SBMs are special cases of DCBMs in which the degree-correction parameters are all equal. Empirically, the larger class of DCBMs is able to provide possibly much better fits to many real world network datasets . Since the proposal of the model, there have been various methods proposed for community detection in DCBMs, including but not limited to spectral clustering and modularity based approaches . On the theoretical side, provides an information-theoretic characterization of the impossibility region of community detection for DCBMs with two clusters, and sufficient conditions have been given in for strongly and weakly consistent community detection. However, two fundamental statistical questions remain unanswered:

What are the fundamental limits of community detection in DCBMs?

Once we know these limits, can we achieve them adaptively via some polynomial time algorithm?

The answer to the first question can provide important benchmarks for comparing existing approaches and for developing new procedures. The answer to the second question can lead to new practical methodologies with theoretically justified optimality. The present paper is dedicated to provide answers to these two questions.

Our main contributions are two-folded. First, we carefully formulate community detection in DCBMs as a decision-theoretic problem and then work out its asymptotic minimax risks with sharp constant in the exponent under certain regularity conditions. For example, let kk be a fixed constant. Suppose there are kk communities all about the same size n/kn/k and the average within community and between community edge probabilities are pp and qq respectively with p>qp>q and p/q=O(1)p/q=O(1), then under mild regularity conditions, the minimax risk under the loss function that counts the proportion of misclassified nodes takes the form

as nn\to\infty whenever it converges to zero and the maximum expected node degree scales at a sublinear rate with nn. The general fundamental limits to be presented in Section 2 allow the community sizes to differ and the number of communities kk to grow to infinity with nn. To the best our knowledge, this is the first minimax risk result for community detection in DCBMs. The minimax risk (1) has an intuitive form. In particular, the ithi{{}^{\rm th}} term in the summation can be understood as the probability of the ithi{{}^{\rm th}} node being misclassified. When θi\theta_{i} is larger, the chance of the node being misclassified gets smaller as it has more edges and hence more information of its community membership is available in the network. The term n/kn/k is roughly the community size. Since the community detection problem can be reduced to a hypothesis testing problem with n/kn/k as its effective sample size, a larger kk implies a more difficult problem. Furthermore, (pq)2(\sqrt{p}-\sqrt{q})^{2} reflects the degree of separation among the kk clusters. Note that pp and qq are the average within and between community edge probabilities and so (pq)2(\sqrt{p}-\sqrt{q})^{2} measures the difference of edge densities within and between communities. If the clusters are more separated in the sense that the within and between community edge densities differ more, the chance of each node being misclassified becomes smaller. When the degree-correction parameters are all equal to one and p=o(1)p=o(1), the expression in (1) reduces to the minimax risk of community detection in SBMs in .

In addition, we investigate computationally feasible algorithms for adaptively achieving minimax optimal performance. In particular, we propose a polynomial time two-stage algorithm. In the first stage, we obtain a relatively crude community assignment via a weighted kk-medians procedure on a low-rank approximation to the adjacency matrix. Working with a low-rank approximation (as opposed to the leading eigenvectors of the adjacency matrix) enables us to avoid common eigen-gap conditions needed to establish weak consistency for spectral clustering methods. Based on result of the first stage, the second stage applies a local optimization to improve on the community assignment of each network node. Theoretically, we show that it can adaptively achieve asymptotic minimax optimal performance for a large collection of parameter spaces. The empirical effectiveness of the algorithm is illustrated by simulation.

Connection to previous work

The present paper is connected to a number of papers on community detection in DCBMs and SBMs.

It is connected to the authors’ previous work on minimax community detection in SBMs . However, the involvement of degree-correction parameters poses significant new challenges. For the study of fundamental limits, especially minimax lower bounds, the fundamental two-point testing problem in DCBMs compares two product probability distributions with different marginals, while in SBMs, the two product distributions can be divided to two equal sized blocks within which the marginals are the same. Consequently, a much more refined Cramér–Chernoff argument is needed to establish the desired bound. In addition, to establish matching minimax upper bounds, the analysis of the maximum likelihood estimators is technically more challenging than that in due to the presence of degree-correction parameters and the wide range in which they can take values. In particular, we use a new folding argument to obtain the desired bounds. For adaptive estimation, the degree-correction parameters further increase the number of nuisance parameters. As a result, although we still adopt a “global-to-local” two-stage strategy to construct the algorithm, neither stage of the proposed algorithm in the present paper can be borrowed from the algorithm proposed in . We will give more detailed comments on the first stage below. For the second stage, the penalized neighbor voting approach in requires estimation of degree-correction parameters with high accuracy and hence is infeasible. We propose a new normalized neighbor voting procedure to avoid estimating θi\theta_{i}’s.

The first stage of the proposed algorithm is connected to the literature on spectral clustering, especially . The novelty in our proposal is that we cluster the rows of a low-rank approximation to the adjacency matrix directly as opposed to the rows of the matrix containing the leading eigenvectors of the adjacency matrix. As a result, the new spectral clustering algorithm does not require any eigen-gap condition to achieve consistency.

Organization

After a brief introduction to common notation, the rest of the paper is organized as follows. Section 2 presents the decision-theoretic formulation of community detection in DCBMs and derives matching asymptotic minimax lower and upper bounds under appropriate conditions. Given the fundamental limits obtained, we propose in Section 3 a polynomial time two-stage algorithm and study when a version of it can adaptively achieve minimax optimal rates of convergence. The finite sample performance of the proposed algorithm is examined in Section 4 on simulated data examples. Some proofs of the main results are given in Section 5 with additional proofs deferred to the appendices.

Notation

Fundamental Limits

In this section, we present fundamental limits of community detection in DCBMs. We shall first define an appropriate parameter space and a loss function. A characterization of asymptotic minimax risks then follows.

Recall that a random graph of size nn generated by a DCBM has its adjacency matrix AA satisfying Aii=0A_{ii}=0 for all i[n]i\in[n] and

We are mostly interested in the behavior of minimax risks over a sequence of such parameter spaces as nn tends to infinity and the key model parameters θ,p,q,k\theta,p,q,k scale with nn in some appropriate way. On the other hand, we take β1\beta\geq 1 as an absolute constant and require the (slack) parameter δ\delta to be an o(1)o(1) sequence throughout the paper.

Therefore, BuuB_{uu} and BuvB_{uv} can be understood as the (approximate) average connectivity within the uthu{{}^{\rm th}} community and between the uthu{{}^{\rm th}} and the vthv{{}^{\rm th}} communities, respectively. Under this interpretation, pp can be seen as a lower bound on the within community connectivities and qq an upper bound on the between community connectivities. We require the assumption p>qp>q to ensure that the model is “assortative” in an average sense. Finally, we also require the individual community sizes to be contained in the interval [n/(βk)1,βn/k+1][n/(\beta k)-1,\beta n/k+1]. In other words, the community sizes are assumed to be of the same order.

As for the loss function, we use the following misclassification proportion that has been previously used in the investigation of community detection in stochastic block models :

where H(,)H(\cdot,\cdot) is the Hamming distance defined as H(z1,z2)=i[n]1{z1(i)z2(i)}H(z_{1},z_{2})=\sum_{i\in[n]}{\mathbf{1}_{\left\{{z_{1}(i)\neq z_{2}(i)}\right\}}} and Πk\Pi_{k} is the set of all permutations on [k][k]. Here, the minimization over all permutations is necessary since we are only interested in comparing the partitions resulting from zz and z^\hat{z} and so the actual labels used in defining the partitions should be inconsequential.

2 Minimax Risks

Now we study the minimax risk of the problem

In particular, we characterize the asymptotic behavior of (5) as a function of n,θ,p,q,kn,\theta,p,q,k and β\beta. The key information-theoretic quantity that governs the minimax risk of community detection is II, which is defined through

Note that II depends on nn not only directly but also through θ\theta, pp, qq and kk.

Given any parameter space Pn(θ,p,q,k,β;δ){\mathcal{P}}_{n}(\theta,p,q,k,\beta;\delta), we can define the following estimator:

If there is a tie, we break it arbitrarily. The estimator (7) is the maximum likelihood estimator for a special case of DCBM where Buu=pB_{uu}=p and Buv=qB_{uv}=q for all uv[k]u\neq v\in[k]. In other cases, the objective function in (7) is a misspecified likelihood function. For any sequences {an}\{a_{n}\} and {bn}\{b_{n}\}, we write an=Ω(bn)a_{n}=\Omega(b_{n}) if anCbna_{n}\geq Cb_{n} for some absolute constant C>0C>0 for all n1n\geq 1. The following theorem characterizes the asymptotic behavior of the risk bounds for the estimator (7).

Consider any sequence {Pn(θ,p,q,k,β;δ)}n=1\{{\mathcal{P}}_{n}(\theta,p,q,k,\beta;\delta)\}_{n=1}^{\infty} such that as nn\to\infty, II\to\infty, p>qp>q, θ=o(n/k)\|\theta\|_{\infty}=o(n/k), mini[n]θi=Ω(1)\min_{i\in[n]}\theta_{i}=\Omega(1) and logk=o(min(I,logn))\log k=o(\min(I,\log{n})). When k3k\geq 3, further assume β[1,5/3)\beta\in[1,\sqrt{5/3}). Then the estimator in (7) satisfies

Before proceeding, we briefly discuss the conditions in Theorem 1. First, the condition mini[n]θi=Ω(1)\min_{i\in[n]}\theta_{i}=\Omega(1) requires that all θi\theta_{i}’s are at least of constant order. One should note that this condition does not rule out the possibility that maxiθiminiθi\max_{i}\theta_{i}\gg\min_{i}\theta_{i}, and so a great extent of degree variation, even within the same community, is allowed. Next, logk=o(logn)\log k=o(\log n) requires that the number of communities kk, if it diverges to infinity, grows at a sub-polynomial rate with the number of nodes nn. Furthermore, β[1,5/3)\beta\in[1,\sqrt{5/3}) is a technical condition that we need for a combinatorial argument in the proof to go through when k3k\geq 3. Informed readers might find the result in Theorem 1 in parallel to that in . However, due to the presence of degree-correction parameters, the proof of Theorem 1 is significantly different from that of the corresponding result in . For example, a new folding argument is employed to deal with degree heterogeneity.

Minimax lower bounds

When k=2k=2, there exists a disjoint partition C1,C2\mathcal{C}_{1},\mathcal{C}_{2} of [n][n], such that C1=n/2|\mathcal{C}_{1}|={\left\lfloor{n/2}\right\rfloor}, C2{n/2,n/2+1}|\mathcal{C}_{2}|\in\{{\left\lfloor{n/2}\right\rfloor},{\left\lfloor{n/2}\right\rfloor}+1\} and Cu1iCuθi(1δ/4,1+δ/4)|\mathcal{C}_{u}|^{-1}\sum_{i\in\mathcal{C}_{u}}\theta_{i}\in(1-\delta/4,1+\delta/4) for u=1,2u=1,2.

When k3k\geq 3, there exists a disjoint partition {Cu}u[k]\{\mathcal{C}_{u}\}_{u\in[k]} of [n][n], such that C1C2...Ck|\mathcal{C}_{1}|\leq|\mathcal{C}_{2}|\leq...\leq|\mathcal{C}_{k}|, C1=C2=n/(βk)|\mathcal{C}_{1}|=|\mathcal{C}_{2}|={\left\lfloor{n/(\beta k)}\right\rfloor} and Cu1iCuθi(1δ/4,1+δ/4)|\mathcal{C}_{u}|^{-1}\sum_{i\in\mathcal{C}_{u}}\theta_{i}\in(1-\delta/4,1+\delta/4) for all u[k]u\in[k].

We note that the condition is only on θ\theta (as opposed to the parameter space) and the actually communities in the data generating model need not coincide with the partition that occurs in the statement of the condition.

With the foregoing definition, we have the following result.

Consider any sequence {Pn(θ,p,q,k,β;δ)}n=1\{{\mathcal{P}}_{n}(\theta,p,q,k,\beta;\delta)\}_{n=1}^{\infty} such that as nn\to\infty, II\to\infty, 1<p/q=O(1)1<p/q=O(1), pθ2=o(1)p\|\theta\|_{\infty}^{2}=o(1), logk=o(I)\log k=o(I), log(1/δ)=o(I)\log(1/\delta)=o(I) and θ\theta satisfies Condition N. Then

Compared with the conditions in Theorem 1, the conditions of Theorem 2 are slightly different. The condition 1<p/q=O(1)1<p/q=O(1) ensures that the smallest average within community connectivity is of the same order as (albeit larger than) the largest average between community connectivity. Such an assumption is typical in the statistical literature on block models. The condition θ2p=o(1)\|{\theta}\|^{2}_{\infty}p=o(1) ensures that the maximum expected node degree scales at a sublinear rate with the network size nn. Furthermore, when k=O(1)k=O(1), the condition logk=o(I)\log k=o(I) can be dropped because it is equivalent to II\rightarrow\infty, which in turn is necessary for the minimax risk to converge to zero.

Combining both theorems, we have the minimax risk of the problem.

Under the conditions of Theorems 1 and 2, we have

where o(1)o(1) stands for a sequence whose absolute values tend to zero as nn tends to infinity.

Setting β=1\beta=1 in Corollary 1 leads to the minimax result (1) in the introduction. We refer to Section 1 for the meanings of the terms in II.

When θ=1n\theta=1_{n}, the foregoing minimax risk reduces to the corresponding result for stochastic block models in the sparse regime where q<p=o(1)q<p=o(1). In this case, (6) implies that the minimax risk is

Note that when q<p=o(1)q<p=o(1), the Rényi divergence of order 12\frac{1}{2} used in the minimax risk expression in is equal to (1+o(1))(pq)2(1+o(1))(\sqrt{p}-\sqrt{q})^{2}.

An Adaptive and Computationally Feasible Procedure

Theorem 1 shows that the minimax rate can be achieved by the estimator (7) obtained via combinatorial optimization which is not computationally feasible. Moreover, the procedure depends on the knowledge of the parameters θ\theta, pp and qq. These features make it not applicable in practical situations. In this section, we introduce a two-stage algorithm for community detection in DCBMs which is not only computationally feasible but also adaptive over a wide range of unknown parameter values. We show that the procedure achieves minimax optimal rates under certain regularity conditions.

The proposed algorithm consists of an initialization stage and a refinement stage.

To explain the rationale behind our proposal, with slight abuse of notation, let P=(Pij)n×nP=(P_{ij})\in^{n\times n}, where for all i,j[n]i,j\in[n], Pij=Pji=θiθjBz(i)z(j)P_{ij}=P_{ji}=\theta_{i}\theta_{j}B_{z(i)z(j)}. Except for the diagonal entries, PP is the same as in (3). For any i[n]i\in[n], let PiP_{i} denote the ithi{{}^{\rm th}} row of PP. Then for all ii such that z(i)=uz(i)=u, we observe that

are all equal. Thus, there are exactly kk different vectors that the normalized row vectors {θi1Pi}i=1n\{\theta_{i}^{-1}P_{i}\}_{i=1}^{n} can be. Moreover, which one of the kk vectors the ithi{{}^{\rm th}} normalized row vector equals is determined solely by its community label z(i)z(i). This observation suggests one can design a reasonable community detection procedure by clustering the sample counterparts of the vectors {θ11P1,θ21P2,...,θn1Pn}\{\theta_{1}^{-1}P_{1},\theta_{2}^{-1}P_{2},...,\theta_{n}^{-1}P_{n}\}, which leads us to the proposal of Algorithm 1.

In Algorithm 1, Steps 1 and 2 aim to find an estimator P^\hat{P} of PP by solving a low rank approximation problem. Then, in Step 3, we can use P^i11P^i\|{\hat{P}_{i}}\|_{1}^{-1}\hat{P}_{i} as a surrogate for θi1Pi\theta_{i}^{-1}P_{i}. Finally, Step 4 performs a weighted kk-median clustering procedure applied on the row vectors of the n×kn\times k matrix [P^111P^1P^n11P^n]\begin{bmatrix}\|{\hat{P}_{1}}\|_{1}^{-1}\hat{P}_{1}\\ \cdots\\ \|{\hat{P}_{n}}\|_{1}^{-1}\hat{P}_{n}\end{bmatrix}.

Refinement: normalized network neighbor counts

As we shall show later, the error rate of Algorithm 1 decays polynomially with respect to the key quantity II defined in (6). To achieve the desired exponential decay rate with respect to II as in the minimax rate, we need to further refine the community assignments obtained from Algorithm 1.

To this end, we propose a prototypical refinement procedure in Algorithm 2. The algorithm determines a possibly new community label for the ithi{{}^{\rm th}} node by counting the number of neighbors that the ithi{{}^{\rm th}} node has in each community normalized by the corresponding community size, and then picking the label of the community that maximizes the normalized counts. If there is a tie, we break it in an arbitrary way.

To see the rationale behind Algorithm 2, let us consider a simplified version of the problem. Suppose k=2k=2, n=2m+1n=2m+1 for some integer m1m\geq 1, B11=B22=pB_{11}=B_{22}=p and B12=B21=qB_{12}=B_{21}=q. Moreover, let us assume that the community labels of the first 2m2m nodes are such that z(i)=1z(i)=1 for i=1,,mi=1,\dots,m and z(i)=2z(i)=2 for i=m+1,,2mi=m+1,\dots,2m. The label of the last node z(n)z(n) remains to be determined from the data. When {z(i):i=1,,2m}\left\{z(i):i=1,\dots,2m\right\} are the truth, the determination of the label for the nthn{{}^{\rm th}} node reduces to the following testing problem:

The hypotheses H0H_{0} and H1H_{1} are joint distributions of {An,i}i[n1]\{A_{n,i}\}_{i\in[n-1]} in the two cases z(n)=1z(n)=1 and z(n)=2z(n)=2, respectively. For this simple vs. simple testing problem, the Neyman–Pearson lemma dictates that the likelihood ratio test is optimal. However, it is not a satisfying answer for our goal, since the likelihood ratio test needs to use the values of the unknown parameters pp, qq and θ\theta. While it is possible to obtain sufficiently accurate estimators for pp and qq, it is hard to do so for θ\theta, especially when the network is sparse. In summary, the dependence of the likelihood ratio test on nuisance parameters makes it impossible to apply in practice. To overcome this difficulty, we propose to consider a simple test which

As we shall show later in Lemma 2 and Lemma 3, this simple procedure achieves the optimal testing error exponent. It is worthwhile to point out that it does not require any knowledge of pp, qq or θ\theta, and hence the procedure is adaptive. A detailed study of the testing problem (10) is given in Section 5.1.

Inspired by the foregoing discussion, when k=2k=2 and the two community sizes are different, we propose to normalize the counts in (11) by the community sizes. Moreover, when there are more than two communities, we propose to perform pairwise comparison based on the foregoing (normalized) test statistic for each pair of community labels, which becomes the procedure in (9) as long as we replace the unknown truth zz with an initial estimator z^0\hat{z}^{0}. For a good initial estimator such as the one output by Algorithm 1, the refinement can lead to minimax optimal errors in misclassification proportion for a large collection of parameter spaces.

2 Performance Guarantees

The parameter α\alpha is assumed to be a constant no smaller than 11 that does not change with nn. By studying the proofs of Theorem 2 and Theorem 1, the minimax lower and upper bounds do not change for the slightly smaller parameter space Pn(θ,p,q,k,β;δ,α)\mathcal{P}_{n}^{\prime}(\theta,p,q,k,\beta;\delta,\alpha). Therefore, the rate exp((1+o(1))I)\exp(-(1+o(1))I) still serves as a benchmark for us to develop theoretically justifiable algorithms for the parameter space Pn(θ,p,q,k,β;δ,α)\mathcal{P}_{n}^{\prime}(\theta,p,q,k,\beta;\delta,\alpha).

As a first step, we provide the following high probability error bound for Algorithm 1.

Assume δ=o(1)\delta=o(1), 1<p/q=O(1)1<p/q=O(1) and θ=o(n/k)\|{\theta}\|_{\infty}=o(n/k). Let τ=C1(npθ2+1)\tau=C_{1}(np\|{\theta}\|_{\infty}^{2}+1) for some sufficiently large constant C1>0C_{1}>0 in Algorithm 1. Then, there exist some constants C,C>0C^{\prime},C>0, such that for any generative model in Pn(θ,p,q,k,β;δ,α)\mathcal{P}_{n}^{\prime}(\theta,p,q,k,\beta;\delta,\alpha), we have with probability at least 1n(1+C)1-n^{-(1+C^{\prime})},

Lemma 1 provides a uniform high probability bound for the sum of θi\theta_{i}’s of the nodes which are assigned wrong labels. Before discussing the implication of this result, we give two remarks.

Algorithm 1 applies a weighted kk-medians procedure on the matrix P^\hat{P} instead of its leading eigenvectors. This is the main difference between Algorithm 1 and many traditional spectral clustering algorithms. As a result, we avoid any eigengap assumption that is imposed to prove consistency results for spectral clustering algorithms .

The extra (1+ϵ)(1+\epsilon) slack that we allow in Algorithm 1 is also reflected in the error bound.

The following corollary exemplifies how the result of Lemma 1 can be specialized into a high probability bound for the loss function (4) with a rate depending on II under some stronger conditions. These conditions, especially miniθi=Ω(1)\min_{i}\theta_{i}=\Omega(1), can be relaxed in Theorem 4 stated in the next paragraph.

Error rate for the refinement stage

As exemplified in Corollary 2, the convergence rate for the initialization step is typically only polynomial in II as opposed to the exponential rate in the minimax rate. Thus, there is room for improvement. In what follows, we show that a specific way of applying Algorithm 2 on the output of Algorithm 1 leads to significant performance enhancement in terms of misclassification proportion. To this end, let us first state in Algorithm 3 the combined algorithm for which we are able to establish the improved error bounds. Here and after, for any i[n]i\in[n], let Ai{0,1}(n1)×(n1)A_{-i}\in\{0,1\}^{(n-1)\times(n-1)} be the submatrix of AA obtained from removing the ithi{{}^{\rm th}} row and column of AA.

The last step (12) of Algorithm 3 constructs a final label estimator z^\hat{z} from z^10,z^20,...,z^n0\hat{z}_{-1}^{0},\hat{z}_{-2}^{0},...,\hat{z}_{-n}^{0}. Since the labels given by z^10,z^20,...,z^n0\hat{z}_{-1}^{0},\hat{z}_{-2}^{0},...,\hat{z}_{-n}^{0} are only comparable after certain permutations in Πk\Pi_{k}, we need this extra step to resolve this issue.

Algorithm 3 is a theoretically justifiable version for combining Algorithm 1 and Algorithm 2. In order to obtain a rate-optimal label assignment for the ithi{{}^{\rm th}} node, we first apply the initial clustering procedure in Algorithm 1 on the sub-network consisting of the remaining n1n-1 nodes and the edges among them. Then, one applies the refinement procedure in Algorithm 2 to assign a label for the ithi{{}^{\rm th}} node. The independence between the initialization and the refinement stages facilitates the technical arguments in the proof. However, in practice, one can simply apply Algorithm 1 followed by Algorithm 2. The numerical difference from Algorithm 3 is negligible in all the data examples we have examined.

In the special case where the community sizes are almost equal, we can show that z^\hat{z} output by Algorithm 3 achieves the minimax rate.

Under the conditions of Lemma 1, we further assume β=1\beta=1, k=O(1)k=O(1), miniθi=Ω(1)\min_{i}\theta_{i}=\Omega(1), δ=o(pqp)\delta=o(\frac{p-q}{p}), θ2pn1\|{\theta}\|_{\infty}^{2}p\geq n^{-1}, and (1+ϵ)θp3/2n(pq)2=o(1)\frac{(1+\epsilon)\|{\theta}\|_{\infty}p^{3/2}}{\sqrt{n}(p-q)^{2}}=o(1). Then there is a sequence η=o(1)\eta=o(1) such that the output z^\hat{z} of Algorithm 3 satisfies

Theorem 3 shows that when the community sizes are almost equal, the minimax rate exp((1+o(1))I)\exp(-(1+o(1))I) can be achieved within polynomial time. We note that the conditions that we need here are stronger than those of Theorem 1. When k=O(1)k=O(1), ϵ=O(1)\epsilon=O(1) and Ω(1)=miniθiθ=O(1)\Omega(1)=\min_{i}\theta_{i}\leq\|{\theta}\|_{\infty}=O(1), Theorem 1 only requires II\rightarrow\infty, which is equivalent to p1/2n(pq)=o(1)\frac{p^{1/2}}{\sqrt{n}(p-q)}=o(1), while Theorem 3 requires p3/2n(pq)2=o(1)\frac{p^{3/2}}{\sqrt{n}(p-q)^{2}}=o(1). Whether the extra factor ppq\frac{p}{p-q} can be removed from the assumptions is an interesting problem to investigate in the future.

We now state a general high probability error bound for Algorithm 3. To introduce this result, we define another information-theoretic quantity. For any t(0,1)t\in(0,1), define

By Jensen’s inequality, it is straightforward to verify that Jt(p,q)0J_{t}(p,q)\geq 0 and Jt(p,q)=0J_{t}(p,q)=0 if and only if p=qp=q. As a special case, when t=12t=\frac{1}{2}, we have

For a given z[k]nz\in[k]^{n}, let n(1)...n(k)n_{(1)}\leq...\leq n_{(k)} be the order statistics of community sizes {nu(z):u=1,,k}\{n_{u}(z):u=1,\dots,k\}. Then, we define the quantity JJ by through

with t=n(1)n(1)+n(2)t^{*}=\frac{n_{(1)}}{n_{(1)}+n_{(2)}}. With the foregoing definitions, the following theorem gives a general error bound for Algorithm 3.

Under the conditions of Lemma 1, we further assume that δ=o(pqp)\delta=o(\frac{p-q}{p}), θ2pn1\|{\theta}\|_{\infty}^{2}p\geq n^{-1},

Then there is a sequence η=o(1)\eta=o(1) such that the output z^\hat{z} of Algorithm 3 satisfies

Theorem 4 gives a general error bound for the performance of Algorithm 3. It is easy to check that the conditions (16) and (17) are satisfied under the settings of Theorem 3. Therefore, Theorem 3 is a special case of Theorem 4. Theorem 4 shows that Algorithm 3 converges at the rate exp((1+o(1))J)\exp(-(1+o(1))J). According to the properties of Jt(p,q)J_{t}(p,q) stated in Appendix B, one can show that when n(1)=(1+o(1))n(2)n_{(1)}=(1+o(1))n_{(2)}, J=(1+o(1))IJ=(1+o(1))I, and that in general

Using this relation, we can state the convergence rate in Theorem 4 using the quantity II.

Under the conditions of Theorem 4, there is a sequence η=o(1)\eta=o(1) such that the output z^\hat{z} of Algorithm 3 satisfies

Therefore, when k3k\geq 3, the minimax rate exp((1+o(1))I)\exp(-(1+o(1))I) is achieved by Algorithm 3. The only situation where the minimax rate is not achieved by Algorithm 3 is when k=2k=2 and β>1\beta>1. For this case, there is an extra β1\beta^{-1} factor on the exponent of the convergence rate.

If we further assume that minijBij=Ω(q)\min_{i\neq j}B_{ij}=\Omega(q), a careful examination of the proofs shows that we can improve the term k5/2k^{5/2} in the conclusion of Lemma 1 and in the conditions (16) and (17) to k3/2k^{3/2}. Since it is unclear what the optimal power exponent for kk is in these circumstances, we do not pursue it explicitly in this paper.

Numerical Results

In this section, we present numerical experiments on simulated datasets generated from DCBMs. In particular, we compare the performance of two versions of our algorithm with two state-of-the-art methods: SCORE and CMM in two different scenarios. On simulated data examples, both versions of our algorithm outperformed SCORE in terms of misclassification proportion. The performance of CMM was comparable to our algorithm. However, our algorithm not only demonstrated slightly better accuracy on the simulated datasets when compared with CMM, but it also enjoys the advantage of easy implementation, fast computation and scalability to networks of large sizes since it does not involve convex programming.

We compare misclassification proportions of the following five algorithmsThe numerical performance of Algorithm 3 was indistinguishable from that of the second algorithm in the list in all the experiments conducted.:

The weighted kk-medians procedure in Algorithm 1;

Refinement of the output of Algorithm 1 by Algorithm 2;

Iterate Algorithm 2 1010 times after initialization by Algorithm 1.

We conducted the experiments with 100100 independent repetitions and summarize the performances of the five algorithms through boxplots of misclassification proportions. Fig. 1 shows that our refinement step (Algorithm 2) significantly improved the performance of the initialization step (Algorithm 1). Moreover, it helped to further reduce the error if we apply the refinement step for a few more iterations. Among the five algorithms, our proposed procedures give the best performance. The CMM algorithm performed slightly worse than our procedures with refinement, but was better than Algorithm 1 and SCORE.

Scenario 2

As in Scenario 1, we compare the performance of the five algorithms over 100100 independent repetitions. Fig. 2 shows the boxplots of the misclassification proportions. The overall message is similar to Scenario 1, except that CMM performed almost as good as our procedures with refinement, but all of them outperformed Algorithm 1 and SCORE. Despite its comparable performance on this task, CMM required noticeably longer running time than our procedures on the simulated datasets due to the involvement of convex programming. Therefore, its scalability to large networks is more limited.

Proofs

This section presents proofs for some main results of the paper. In Section 5.1, we first study a fundamental testing problem for community detection in DCBMs. The theoretical results for this testing problem are critical tools to prove minimax lower and upper bounds for community detection. We then state the proof of Theorem 1. Proofs of the other main results are deferred to the appendices.

As a prelude to all proofs, we consider the hypothesis testing problem (10) that not only is fundamental to the study of minimax risk but also motivates the proposal of Algorithm 2. To paraphrase the problem, suppose X=(X1,,Xm,Xm+1,,X2m)X=(X_{1},\dots,X_{m},X_{m+1},\dots,X_{2m}) have independent Bernoulli entries. Given 1p>q01\geq p>q\geq 0 and θ0,θ1,,θ2m>0\theta_{0},\theta_{1},\dots,\theta_{2m}>0 such that

We are interested in understanding the minimum possible Type I+II error of testing

In particular, we are interested in the asymptotic behavior of the error for a sequence of such testing problems in which pp, qq and the θi\theta_{i}’s scale with mm as mm\to\infty. First, we have the following lower bound result.

Suppose that as mm\to\infty, 1<p/q=O(1)1<p/q=O(1) and pmax0i2mθi2=o(1)p\max_{0\leq i\leq 2m}\theta_{i}^{2}=o(1). If θ0m(pq)2\theta_{0}m(\sqrt{p}-\sqrt{q})^{2}\to\infty,

Otherwise if θ0m(pq)2=O(1)\theta_{0}m(\sqrt{p}-\sqrt{q})^{2}=O(1), there exists a constant c(0,1)c\in(0,1) such that

According to Neyman-Pearson lemma, the optimal testing procedure is the likelihood ratio test. However, such a test depends on the values of {θi}i=12m,p,q\{\theta_{i}\}_{i=1}^{2m},p,q, and is not appropriate in practice. We consider an alternative test

This test is simple, but achieves the optimal error bound.

For the testing function defined above, we have

Combining Lemma 2 and Lemma 3, we find that the minimax testing error for the problem (19) is e(1+o(1))θ0m(pq)2e^{-(1+o(1))\theta_{0}m(\sqrt{p}-\sqrt{q})^{2}}. This explains why the minimax rate for community detection in DCBM takes the form of e(1+o(1))Ie^{-(1+o(1))I} in Corollary 1. Moreover, the simple testing function (20) serves as a critical component in Algorithm 2. The fact that (20) can achieve the optimal testing error exponent in Lemma 3 explains why our algorithm for community detection can achieve the minimax rate when the community sizes are equal (Theorem 3).

In order for Lemma 2 to be applied to lower bounding the performance of community detection in DCBM, we need a version of Lemma 2 that can handle approximately equal sizes. To be specific, suppose X=(X1,,Xm,Xm+1,,Xm+m1)X=(X_{1},\dots,X_{m},\\ X_{m+1},\dots,X_{m+m_{1}}) have independent Bernoulli entries. Given 1p>q01\geq p>q\geq 0 and θ0,θ1,,θm+m1>0\theta_{0},\theta_{1},\dots,\theta_{m+m_{1}}>0 such that

When mm and m1m_{1} are approximately equal, we are interested in understanding the minimum possible Type I+II error of testing

The setting of Lemma 2 is a special case where m=m1m=m_{1} and δ=0\delta=0.

Suppose that as mm\to\infty, 1<p/q=O(1)1<p/q=O(1), pmax0i2mθi2=o(1)p\max_{0\leq i\leq 2m}\theta_{i}^{2}=o(1), δ=o(1)\delta=o(1) and mm11=o(1)\left|\frac{m}{m_{1}}-1\right|=o(1). If θ0m(pq)2\theta_{0}m(\sqrt{p}-\sqrt{q})^{2}\to\infty,

Otherwise, there exists a constant c(0,1)c\in(0,1) such that

2 Proof of Theorem 1

In the last step, we supply the bounds obtained in the second step to (23) to establish the theorem. Indeed, the bounds we derive in the second step decay geometrically once mm is larger than some critical value which depends on the rate of the error bounds. Thus, we divide the final arguments here according to three different regimes of error rates.

After a brief introduction to some additional notation, we carry out these three steps in order. We denote nmin=minu[k]{i:z(i)=u}n_{\min}=\min_{u\in[k]}|\{i:z(i)=u\}|, nmax=maxu[k]{i:z(i)=u}n_{\max}=\max_{u\in[k]}|\{i:z(i)=u\}| and θmin=mini[n]θi\theta_{\min}=\min_{i\in[n]}\theta_{i}. Note that nminn/(βk)n_{\min}\geq n/(\beta k), nmaxβn/kn_{\max}\leq\beta n/k and θmin=Ω(1)\theta_{\min}=\Omega(1). For any t(0,1)t\in(0,1), we define

where the second inequality is by Jensen inequality.

Similarly, when z(i)z(j)z(i)\neq z(j) we also have

Let M2M\geq 2 be a large enough constant integer to be determined later. For each Cu\mathcal{C}_{u} we decompose it as Cu=Cu+Cu\mathcal{C}_{u}=\mathcal{C}_{u}^{+}\cup\mathcal{C}_{u}^{-} such that

Due to the approximate normalization of degree-correction parameters, for sufficiently large values of nn,

Since Cu+(M1)Cu|\mathcal{C}_{u}^{+}|\leq(M-1)|\mathcal{C}_{u}^{-}|, we can define a mapping τu:CuCu\tau_{u}:\mathcal{C}_{u}\rightarrow\mathcal{C}_{u}^{-} such that its restriction on Cu\mathcal{C}_{u}^{-} is identity. Moreover, we could require that for any iCui\in\mathcal{C}_{u}^{-}, τu1(i)M|\tau_{u}^{-1}(i)|\leq M. Let τ\tau be the mapping from [n][n] to u=1kCu\cup_{u=1}^{k}\mathcal{C}_{u}^{-} such that the restriction of τ\tau on Cu{\mathcal{C}_{u}} is τu\tau_{u}. The main reason for introducing τ\tau is to deal with the range of values the degree-correction parameters can take. The right side of (25) shows that the desired bounds depend crucially on quantities of the form iSθi\sum_{i\in S}\theta_{i} for some set SS. For any set SS, the sum iSθi\sum_{i\in S}\theta_{i} is not necessarily upper bounded by a constant multiple of the size of the set S|S|. However, by (27), we can always upper bound iSθτ(i)\sum_{i\in S}\theta_{\tau(i)} by a constant multiple of S|S|. This gives us a way to relate the probability bounds and the number of misclassified nodes. Such a point can be seen more clearly as we go to explicit calculation below.

Here, the first inequality comes from direct application of (25) and the union bound. Since θ=o(n/k)=o(nmin)\|\theta\|_{\infty}=o(n/k)=o(n_{\min}) and MM is a constant, we have iΓθi=o(nmin)\sum_{i\in\Gamma}\theta_{i}=o(n_{\min}). This, together with the monotonicity of the function x(1x)x(1-x) when xx is in the right neighborhood of zero, implies the second inequality. The third inequality is due to (27). Since MM is a constant and M/nminM/n_{\min} can be upper bounded by η\eta for large values of nn, we further have

In this case, we cannot directly apply the argument in case 1 since we can no longer guarantee that iΓθi=o(nmin)\sum_{i\in\Gamma}\theta_{i}=o(n_{\min}) and so the second inequality of the last display no longer holds. To proceed, we can further bound the rightmost side of (25) by B1×B2B_{1}\times B_{2}, where

In what follows, we focus on upper bounding B1B_{1} and the same upper bound holds for B2B_{2} by essentially repeating the arguments. For B1B_{1}, we have

Fig. 3 illustrates why the inequality in (30) holds. Furthermore, we notice that

To further bound the right side of (31), recall that θmin=miniθi=Ω(1)\theta_{\min}=\min_{i}\theta_{i}=\Omega(1). Then

Here the last inequality holds since ΓuΓm=o(nmin)12Cu|\Gamma_{u}|\leq|\Gamma|\leq m^{\prime}=o(n_{\min})\leq\frac{1}{2}|\mathcal{C}_{u}|. Together with the property of the function x(1x),xx(1-x),x\in, when M5θminM\geq\frac{5}{\theta_{\min}}, we have

Here, the first inequality holds since iτu(Γu)θiτu(Γu)maxiCuθi2(M1Cu+1)12Cuθmin\sum_{i\in\tau_{u}(\Gamma_{u})}\theta_{i}\leq|\tau_{u}(\Gamma_{u})|\max_{i\in\mathcal{C}_{u}^{-}}\theta_{i}\leq 2(M^{-1}|\mathcal{C}_{u}|+1)\leq\frac{1}{2}|\mathcal{C}_{u}|\theta_{\min}. The second inequality is due to (27), nminnβk1n_{\min}\geq\frac{n}{\beta k}-1 and the fact τu(Γu)Γuηn/k|\tau_{u}(\Gamma_{u})|\leq|\Gamma_{u}|\leq\eta n/k in the current case. Thus

Since for any iCui\in\mathcal{C}_{u}^{-}, τu1(i)M|\tau_{u}^{-1}(i)|\leq M, we obtain that τ(Γ)m/M|\tau(\Gamma)|\geq m/M, and so we have

Here, the second inequality is based on counting and the details are as follows. Note that each term in iτ(Γ)\prod_{i\in\tau(\Gamma)} is a product of at least m/Mm/M terms. First, there are at most (mm/M){m\choose m/M} of sets τ(Γ)\tau(\Gamma) that map to the same m/Mm/M-product. Then, there are at most (mMm){mM\choose m} of sets Γ\Gamma that map to the same τ(Γ)\tau(\Gamma) (recall that for any iCui\in\mathcal{C}_{u}^{-}, τu1(i)<M|\tau_{u}^{-1}(i)|<M). For each m/Mm/M-product, it appear at most (m/M)!(m/M)! times from the expansion of nRδ+2ηnR_{\delta+2\eta}, and that explains the existence of 1/(m/M)!1/(m/M)!. The last inequality holds since \binom{n}{m}\leq\big{(}\frac{en}{m}\big{)}^{m} and n!2πnn+1/2enn!\geq\sqrt{2\pi}n^{n+1/2}e^{-n}.

In each Γu,u,1u,uk\Gamma_{u,u^{\prime}},\forall 1\leq u,u^{\prime}\leq k we can find an arbitrary subset Γu,uΓu,u\Gamma^{\prime}_{u,u^{\prime}}\subset\Gamma_{u,u^{\prime}} such that Γu,u=ηΓu,u|\Gamma^{\prime}_{u,u^{\prime}}|=\eta|\Gamma_{u,u^{\prime}}|. In this way Γu[k]uuΓu,u\Gamma^{\prime}\triangleq\cup_{u\in[k]}\cup_{u^{\prime}\neq u}\Gamma^{\prime}_{u,u^{\prime}} satisfies Γ=ηΓηm|\Gamma^{\prime}|=\eta|\Gamma|\leq\eta m and iΓu,uθi2Γu,u2ηCu\sum_{i\in\Gamma^{\prime}_{u,u^{\prime}}}\theta_{i}\leq 2|\Gamma^{\prime}_{u,u^{\prime}}|\leq 2\eta|\mathcal{C}_{u}|.

Together with the property of the function x(1x),xx(1-x),x\in, we have

Thus with the same bound on B2B_{2} we get

Note we have an extra 1/21/2 factor inside the exponent compared with Case 2. Since for each Γ\Gamma we can find a subset Γ\Gamma^{\prime} with Γ=ητ(Γ)ηm/M|\Gamma^{\prime}|=\eta|\tau(\Gamma)|\geq\eta m/M satisfying the above inequality, we have

where the last equality is due to Cauchy-Schwarz.

As we have pointed out in the proof outline, we shall combine (23) with (33) in this step to finish the proof. To this end, we divide the argument into three cases according to different possible growth rates of Rδ+2ηR_{\delta+2\eta}.

for some constant C1>1C_{1}>1 where the last inequality is due to the properties of power series. For m>mm>m^{\prime} we have

We are to show that the above ratio is upper bounded by eme^{-m}. This is because e2Mn/(η2m/M)η3e2M2ke^{2}Mn/(\eta^{2}m/M)\leq\eta^{-3}e^{2}M^{2}k since mηn/km\geq\eta n/k and e2MnRδ+2η/(η2m/M)1/(2(kM)Mη3n)e^{2}MnR_{\delta+2\eta}/(\eta^{2}m/M)\leq 1/(2(kM)^{M}\eta^{3}n) since nRδ+2η1/(2(eM)M)nR_{\delta+2\eta}\leq 1/(2(eM)^{M}). Then for some constant C2>0C_{2}>0 we have

where the last inequality is due to the fact that knηk\leq n^{\eta}. By the property of power series we have

for some constant C3>0C_{3}>0. Finally by Jensen’s inequality and the assumption logk=o(I)\log k=o(I),

To obtain the last display, we divide both sides of (23) by nn, replace all the mm’s in front of the probabilities in the summation by nn and then upper bound the first m0m_{0} probabilities by one. To further bound the right side of the last display, we have

Since m02Mlognm_{0}\geq 2M\log n, we have 2m/M22lognm0/n2^{-m/M}\leq 2^{-2\log n}\leq m_{0}/n. Thus

Since Rδ+2ηexp((1δ2η)I)R_{\delta+2\eta}\leq\exp(-(1-\delta-2\eta)I) by Jensen’s inequality, logk=o(I)\log k=o(I) and η1=o(I)\eta^{-1}=o(I), we have η3(ekM)Mη+3Rδ+2η1/21/2\eta^{-3}(ekM)^{\frac{M}{\eta}+3}R^{1/2}_{\delta+2\eta}\leq 1/2. Then

Appendix A Additional Proofs of Main Results

Denote m=ηnm^{\prime}=\eta n for some η=o(1)\eta=o(1) satisfying η1=o(I)\eta^{-1}=o(I). We define τ\tau exactly the same way as in Section 5.2. We use the notation

Recall the constant MM used in Section 5.2. Case 1: 1mM1\leq m\leq M. By (34), we have

Using the argument in Section 5.2, we have

Case 3: m>mm>m^{\prime}. Under this scenario, we can take an arbitrary subset Γτ(Γ)\Gamma^{\prime}\subset\tau(\Gamma) such that Γ=ηm/M|\Gamma^{\prime}|=\eta m/M, which leads to iΓθi2ηn\sum_{i\in\Gamma^{\prime}}\theta_{i}\leq 2\eta n. Recall niΓθi=iΓcθiΓcθminnθmin/2n-\sum_{i\in\Gamma}\theta_{i}=\sum_{i\in\Gamma^{c}}\theta_{i}\geq|\Gamma^{c}|\theta_{\min}\geq n\theta_{\min}/2. Using (34), together with the property of x(1x)x(1-x) for xx\in, we have

By the argument used in Section 5.2, we have

A.2 Proof of Theorem 2

We only state the proof for the case k3k\geq 3. The proof for the case k=2k=2 can be derived using essentially the same argument. For a label vector, recall the notation nu(z)={i[n]:z(i)=u}n_{u}(z)=|\{i\in[n]:z(i)=u\}|. Under Condition N, there exists a z[k]nz^{*}\in[k]^{n} such that n1(z)n2(z)n3(z)nk(z)n_{1}(z^{*})\leq n_{2}(z^{*})\leq n_{3}(z^{*})\leq\cdots\leq n_{k}(z^{*}) with n1(z)=n2(z)=n/(βk)n_{1}(z^{*})=n_{2}(z^{*})={\left\lfloor{n/(\beta k)}\right\rfloor}, and that (nu(z))1i:z(i)=uθi(1δ4,1+δ4)(n_{u}(z^{*}))^{-1}{\sum_{i:z^{*}(i)=u}\theta_{i}}\in(1-\frac{\delta}{4},1+\frac{\delta}{4}) for all u[k]u\in[k].

11^{\circ} As a first step, we define a community detection problem on a subset of the parameter space such that we can avoid the complication of label permutation. To this end, given zz^{*}, for each u[k]u\in[k], let Tu{i:z(i)=u}T_{u}\subset\{i:z^{*}(i)=u\} with cardinality nu(z)δn4k2β{\left\lceil{n_{u}(z^{*})-\frac{\delta n}{4k^{2}\beta}}\right\rceil} collect the indices of the largest θi\theta_{i}’s in {θi:z(i)=u}\{\theta_{i}:{z(i)=u}\}. Let T=u=1kTuT=\cup_{u=1}^{k}T_{u}. Define

Since zZz^{*}\in Z^{*}, the latter is not empty. By the definition of TT and Condition N, maxiTcθi\max_{i\in T^{c}}\theta_{i} is bounded by a constant. Thus, for any zz such that z(i)=z(i)z(i)=z^{*}(i) for all iTi\in T, we have

Therefore, we can define a smaller parameter space Pn0=Pn0(θ,p,q,k,β;δ)Pn(θ,p,q,k,β;δ){\mathcal{P}}_{n}^{0}={\mathcal{P}}_{n}^{0}(\theta,p,q,k,\beta;\delta)\subset{\mathcal{P}}_{n}(\theta,p,q,k,\beta;\delta) where

22^{\circ} We now turn to lower bounding the rightmost side of (36), which relies crucially on our previous discussion in Section 5.1. To this end, observe that

for some constant c>0c>0. Here, ave\operatorname*{\operatorname{ave}} stands for arithmetic average. The first inequality holds since minimax risk is lower bounded by Bayes risk. The second inequality is due to the fact that for any zZz\in Z^{*}, z(i)=z(i)z(i)=z^{*}(i) for all iTi\in T, and so infimum can be taken over all z^\hat{z} with z^(i)=z(i)\hat{z}(i)=z^{*}(i) for iTi\in T. The last inequality holds because Tccδnk|T^{c}|\geq c\frac{\delta n}{k} for some constant cc by its definition.

Note that for any uvu\neq v, there is a 1-to-1 correspondence between the elements in ZuZ^{*}_{u} and ZvZ^{*}_{v}. In particular, for each zZuz\in Z^{*}_{u}, there exists a unique zZvz^{\prime}\in Z^{*}_{v} such that z(i)=z(i)z(i)=z^{\prime}(i) for all i1i\neq 1. Thus, we can simultaneously index all {Zu}u=1k\{Z^{*}_{u}\}_{u=1}^{k} by the second to the last coordinates of the zz vectors contained in them. We use z1z_{-1} to indicate the subvector in [k]n1[k]^{n-1} excluding the first coordinate and collect all the different z1z_{-1}’s into a set Z1Z_{-1}. Then we have

Note that by the definition of zz^{*} and ZZ^{*}, it is guaranteed that for either (1,z1)(1,z_{-1}) or (2,z1)(2,z_{-1}), n1n21=o(1)|\frac{n_{1}}{n_{2}}-1|=o(1). Therefore, we can apply Lemma 4 to bound from below each term in the summation of the rightmost side of the last display by exp((1+η)θ1nβk(pq)2)\exp\left(-(1+\eta)\theta_{1}\frac{n}{\beta k}(\sqrt{p}-\sqrt{q})^{2}\right) for some η=o(1)\eta=o(1). Together with (36) – (38), this implies that

Here, the first inequality is simple algebra. The second inequality holds since TcT^{c} only contains within each community defined by zz^{*} the nodes with the smallest θi\theta_{i}’s. The third inequality is a direct application of Jensen’s inequality, and the last equality holds since logk=o(I)\log k=o(I) and log1δ=o(I)\log\frac{1}{\delta}=o(I). This completes the proof.

A.3 Proofs of Lemma 1 and Corollary 2

We now prove Lemma 1 and Corollary 2, which characterize the performance of Algorithm 1. To prove Lemma 1, we need two auxiliary lemmas, whose proofs will be given in Appendix C. In the rest of this part, we let P=(Pij)=(θiθjBz(i)z(j))P=(P_{ij})=(\theta_{i}\theta_{j}B_{z(i)z(j)}) for notational convenience.

The following lemma characterizes the connection between measure on misclassification and geometry of the point cloud. The result is not tied to any specific clustering algorithm or choice of norm.

Under the settings of Lemma 1, let τ=C1(npθ2+1)\tau=C_{1}\left(np\|{\theta}\|_{\infty}^{2}+1\right) for some sufficiently large C1>0C_{1}>0 in Algorithm 1. Then, for any constant C>0C^{\prime}>0, there exists some C>0C>0 only depending on C1,CC_{1},C^{\prime} and α\alpha such that

with probability at least 1n(1+C)1-n^{-(1+C^{\prime})} uniformly over Pn(θ,p,q,k,β;δ,α){\mathcal{P}}_{n}^{\prime}(\theta,p,q,k,\beta;\delta,\alpha).

under the conditions δ=o(1)\delta=o(1) and θ=o(n/k)\|{\theta}\|_{\infty}=o(n/k).

Note that Pˉi=Pˉj\bar{P}_{i}=\bar{P}_{j} when z(i)=z(j)z(i)=z(j). Our first task is to lower bound PˉiPˉj1\|{\bar{P}_{i}-\bar{P}_{j}}\|_{1} when z(i)z(j)z(i)\neq z(j), which serves as the separation condition among different clusters. For any ii and jj such that z(i)=uv=z(j)z(i)=u\neq v=z(j), we assume Pi1/θiPj1/θj\|{P_{i}}\|_{1}/\theta_{i}\leq\|{P_{j}}\|_{1}/\theta_{j} without loss of generality. Then,

Here, (41) holds since Pj1/θjPi1/θiBuup\frac{\|{P_{j}}\|_{1}/\theta_{j}}{\|{P_{i}}\|_{1}/\theta_{i}}B_{uu}\geq p and BuvqB_{uv}\leq q, and (42) is due to (40). By switching ii and jj, the foregoing argument also works for the case where Pi1/θi>Pj1/θj\|{P_{i}}\|_{1}/\theta_{i}>\|{P_{j}}\|_{1}/\theta_{j}. Hence,

In order to bound iSθi\sum_{i\in S}\theta_{i}, we first derive a bound for iSP^i1\sum_{i\in S}\|{\hat{P}_{i}}\|_{1}. That is,

where (A.3) uses the definition of SS, (47) is by the inequality (45), and (48) is by the inequality x11xy11y12xy1x1y1\|{\|{x}\|_{1}^{-1}x-\|{y}\|_{1}^{-1}y}\|_{1}\leq\frac{2\|{x-y}\|_{1}}{\|{x}\|_{1}\vee\|{y}\|_{1}} which in turn is due to the triangle inequality.

Now we are ready to bound iSθi\sum_{i\in S}\theta_{i} as

where (50) is by the inequality (40), (51) is due to the triangle inequality, (52) uses (49) and Cauchy-Schwarz, and (53) holds since α,β,k1\alpha,\beta,k\geq 1.

We now turn to bounding iS0θi\sum_{i\in S_{0}}\theta_{i}. To this end, simple algebra leads to

where (54) is by the inequality (40), (55) uses the definition of S0S_{0} and (56) is due to the Cauchy-Schwarz inequality.

Combining the bounds in (53), (56) and (44), we have

where we have absorbed α\alpha and β\beta into the constant CC. By Lemma 6, we have

with probability at least 1n(1+C)1-n^{-(1+C^{\prime})}. This completes the proof. ∎

A.4 Proofs of Theorem 3, Theorem 4 and Corollary 3

Now we are going to give proofs of Theorem 3, Theorem 4 and Corollary 3. Note that both Theorem 3 and Corollary 3 are direct consequences of Theorem 4. The main argument in the proof of Theorem 4 is the following lemma.

Suppose 1<p/q=O(1)1<p/q=O(1) and δ=o(pqp)\delta=o\left(\frac{p-q}{p}\right). If there exist two sequences γ1=o(pqkp)\gamma_{1}=o\left(\frac{p-q}{kp}\right) and γ2=o(pqk2p)\gamma_{2}=o\left(\frac{p-q}{k^{2}p}\right), a constant C1>0C_{1}>0 and permutations {πi}i[n]Πk\{\pi_{i}\}_{i\in[n]}\subset\Pi_{k} such that

uniformly for all probability distributions in Pn(θ,p,q,k,β;δ,α)\mathcal{P}_{n}^{\prime}(\theta,p,q,k,\beta;\delta,\alpha). Then, we have for all i[n]i\in[n],

uniformly for all probability distributions in Pn(θ,p,q,k,β;δ,α)\mathcal{P}_{n}^{\prime}(\theta,p,q,k,\beta;\delta,\alpha), where η=o(1)\eta=o(1).

Define independent random variables XiBernoulli(θ1θiq)X_{i}\sim\text{Bernoulli}(\theta_{1}\theta_{i}q), YiBernoulli(θ1θip)Y_{i}\sim\text{Bernoulli}(\theta_{1}\theta_{i}p), and ZiBernoulli(θ1θip)Z_{i}\sim\text{Bernoulli}(\theta_{1}\theta_{i}p) for all i[n]i\in[n]. Then, a stochastic order argument bounds the right hand side of (59) by

Using Chernoff bound, for any λ>0\lambda>0, we upper bound (60) by

We are going to give bounds for the four terms (64), (64), (64) and (64), respectively. On the event E1E_{1},

which is a bound for (64). A similar argument leads to a bound (64), which is

Finally, we need a bound for (64). With the current choice of λ\lambda,

where τ=n1n1+n2\tau=\frac{n_{1}}{n_{1}+n_{2}} and τ^=m1m1+m2\hat{\tau}=\frac{m_{1}}{m_{1}+m_{2}}. We will give a bound for Jτ(p,q)Jτ^(p,q)\left|J_{\tau}(p,q)-J_{\hat{\tau}}(p,q)\right|. Since τJτ(p,q)=12(pq)ptq1tlogpq12pq+12plogplogqpq\left|\frac{\partial}{\partial\tau}J_{\tau}(p,q)\right|=\frac{1}{2}\left|(p-q)-p^{t}q^{1-t}\log\frac{p}{q}\right|\leq\frac{1}{2}|p-q|+\frac{1}{2}p|\log p-\log q|\leq|p-q|, we have Jτ(p,q)Jτ^(p,q)pqττ^βkγ2(pq)\left|J_{\tau}(p,q)-J_{\hat{\tau}}(p,q)\right|\leq|p-q||\tau-\hat{\tau}|\leq\beta k\gamma_{2}(p-q). Hence, we have a bound for (64), which is

Combining the above bounds for (64), (64), (64) and (64), we have

By the property of Jt(p,q)J_{t}(p,q) stated in Lemma 11, Jn1n1+n2(p,q)(4β2)1(pq)2pJ_{\frac{n_{1}}{n_{1}+n_{2}}}(p,q)\geq(4\beta^{2})^{-1}\frac{(p-q)^{2}}{p}. Then, under the assumptions γ1=o(pqp)\gamma_{1}=o\left(\frac{p-q}{p}\right), γ2=o(pqpk)\gamma_{2}=o\left(\frac{p-q}{pk}\right) and δ=o(k(pq)p)\delta=o\left(\frac{k(p-q)}{p}\right), we have

Now let us use the original notation and apply the above argument for each node, which leads to the bound

for all [n]\in[n]. By the property of Jt(p,q)J_{t}(p,q) stated in Lemma 8, minuv[(nu+nv)Jnunu+nv(p,q)]=(n(1)+n(2))Jt(p,q)\min_{u\neq v}\left[(n_{u}+n_{v})J_{\frac{n_{u}}{n_{u}+n_{v}}}(p,q)\right]=(n_{(1)}+n_{(2)})J_{t^{*}}(p,q), with tt^{*} specified by (15). Thus, the proof is complete. ∎

It is sufficient to check that the initial clustering step satisfies (58) with γ1=o(pqkp)\gamma_{1}=o\left(\frac{p-q}{kp}\right) and γ2=o(pqk2p)\gamma_{2}=o\left(\frac{p-q}{k^{2}p}\right). This can be done using the bound in Lemma 1 under the assumptions (16) and (17). Note that the nn initial clustering results {z^i0}\{\hat{z}_{-i}^{0}\} may not correspond to the same permutation. This problem can be taken care of by the consensus step (12). Details of the argument are referred to the proof of Theorem 2 in . ∎

Theorem 3 is a direct implication of Theorem 4 by observing I=JI=J when β=1\beta=1. The fact that (n(1)+n(2))Jt(p,q)2n(1)J1/2(p,q)2nβk(pq)2(n_{(1)}+n_{(2)})J_{t^{*}}(p,q)\geq 2n_{(1)}J_{1/2}(p,q)\geq\frac{2n}{\beta k}(\sqrt{p}-\sqrt{q})^{2} by Lemma 10 implies the result for the case k3k\geq 3 in Corollary 3. For k=2k=2, observe that

This implies the result for k=2k=2 in Corollary 3. ∎

In this section, we study the quantity Jt(p,q)J_{t}(p,q) defined in (13). We will state some lemmas about some useful properties of Jt(p,q)J_{t}(p,q) that we have used in the paper. Recall that for p,q,t(0,1)p,q,t\in(0,1),

Given p,q(0,1)p,q\in(0,1), let f(x1,x2)=x1p+x2q(x1+x2)px1x1+x2qx2x1+x2f(x_{1},x_{2})=x_{1}p+x_{2}q-(x_{1}+x_{2})p^{\frac{x_{1}}{x_{1}+x_{2}}}q^{\frac{x_{2}}{x_{1}+x_{2}}} where x1,x2>0x_{1},x_{2}>0. Then the function ff is increasing in terms of x1x_{1} and x2x_{2}, respectively.

By differentiating ff against x1x_{1} we get

Thus limx1f(x1,x2)x1=0\lim_{x_{1}\rightarrow\infty}\frac{\partial f(x_{1},x_{2})}{\partial x_{1}}=0. Moreover,

Therefore, f(x1,x2)x10\frac{\partial f(x_{1},x_{2})}{\partial x_{1}}\geq 0 for all x1,x2>0x_{1},x_{2}>0. This shows f(x1,x2)f(x_{1},x_{2}) is increasing with respect to x1x_{1}. Similarly we can prove that f(x1,x2)f(x_{1},x_{2}) is also an increasing function in terms of x2x_{2}. ∎

For any 0<q<p<10<q<p<1 and 0<t120<t\leq\frac{1}{2}, we have

Define S(t)=12(Jt(p,q)J1t(p,q))=(2t1)(pq)(q(pq)tp(qqp)t)S(t)=\frac{1}{2}\left(J_{t}(p,q)-J_{1-t}(p,q)\right)=(2t-1)(p-q)-\left(q(\frac{p}{q})^{t}-p(\frac{q}{qp})^{t}\right). Then, we have

Since S(0)=S(1/2)=0S(0)=S(1/2)=0, we have S(t)0S(t)\leq 0 for all t(0,1/2]t\in(0,1/2]. ∎

For any 0<q<p<10<q<p<1 and 0<x1x20<x_{1}\leq x_{2}, we have

The first inequality 2x1J1/2(p,q)(x1+x2)Jx1x1+x2(p,q)2x_{1}J_{1/2}(p,q)\leq(x_{1}+x_{2})J_{\frac{x_{1}}{x_{1}+x_{2}}}(p,q) is a consequence of Lemma 8 and x1x2x_{1}\leq x_{2}. Since (t)2Jt(p,q)=2ptq1tlog2(pq)0\left(\frac{\partial}{\partial t}\right)^{2}J_{t}(p,q)=-2p^{t}q^{1-t}\log^{2}\left(\frac{p}{q}\right)\leq 0, Jt(p,q)J_{t}(p,q) is concave in tt. Thus,

When t(0,1/2]t\in(0,1/2], Jt(p,q)12(Jt(p,q)+J1t(p,q))J_{t}(p,q)\leq\frac{1}{2}\left(J_{t}(p,q)+J_{1-t}(p,q)\right) according to Lemma 9. Thus, Jt(p,q)J1/2(p,q)J_{t}(p,q)\leq J_{1/2}(p,q), which leads to the second inequality (x1+x2)Jx1x1+x2(p,q)(x1+x2)J1/2(p,q)(x_{1}+x_{2})J_{\frac{x_{1}}{x_{1}+x_{2}}}(p,q)\leq(x_{1}+x_{2})J_{1/2}(p,q) by the assumption x1x2x_{1}\leq x_{2}. ∎

Moreover, if max(p/q,q/p)M\max(p/q,q/p)\leq M, then we have

Without loss of generality, let p>qp>q. We first consider the case 0<t1/20<t\leq 1/2. By Lemma 10, we have

Let x1x1+x2=t\frac{x_{1}}{x_{1}+x_{2}}=t, and we have

Now we consider the case 1/2t<11/2\leq t<1. Let s=1ts=1-t. By Lemma 9 and (68), we have

Combine (68) and (69), and we can derive (66) for p>qp>q. A symmetric argument leads to the same result for p<qp<q. When, p=qp=q, the result trivially holds. Thus, the proof for (66) is complete.

for some ξ\xi between xx and 11. Thus, using the condition that max(p/q,q/p)M\max(p/q,q/p)\leq M, we have

Then, we can derive (67) by the fact that t(1t)min(t,1t)t(1-t)\leq\min(t,1-t). ∎

Appendix C Proofs of Auxiliary Results

Note that (19) is a simple vs. simple hypothesis testing problem. By the Neyman–Pearson lemma, the optimal test is the likelihood ratio test ϕ\phi which rejects H0H_{0} if

To establish the desired bound for this quantity, we employ below a refined version of the Cramer–Chernoff argument [26, Proposition 14.23]. To this end, for any fixed t>0t>0, define independent random variables {Wi}i=12m\{W_{i}\}_{i=1}^{2m} by

To obtain the desired lower bound, we set tt to be the minimizer of i=12mBi\prod_{i=1}^{2m}B_{i}. Since the minimizer is a stationary point, it satisfies

For any t,a,b(0,1)t,a,b\in(0,1), recall the definition of Jt(a,b)J_{t}(a,b) in (13). By Lemma 11, we have

where CC only depends on the ratio a/ba/b. Therefore, under the condition ab=o(1)a\asymp b=o(1), (71) implies

for some η=o(1)\eta=o(1) independent of tt. Using (72), under the assumption that 1<p/q=O(1)1<p/q=O(1), we have

Similar bounds hold for WiW_{i}, i=m+1,...,2mi=m+1,...,2m. Thus, we obtain that

for sufficiently large values of mm. This completes the proof when θ0m(pq)2\theta_{0}m(\sqrt{p}-\sqrt{q})^{2}\rightarrow\infty.

When θ0m(pq)2=O(1)\theta_{0}m(\sqrt{p}-\sqrt{q})^{2}=O(1), then we have

where we have set et=p/qe^{t}=\sqrt{p/q}. The same bound can be established for PH1(1ϕ)P_{H_{1}}(1-\phi). ∎

The proof is very similar to that of Lemma 2. Therefore, we only sketch the difference. Without loss of generality, let θ1θ2...θm\theta_{1}\geq\theta_{2}\geq...\geq\theta_{m}, θm+1θm+2...θm+m1\theta_{m+1}\geq\theta_{m+2}\geq...\geq\theta_{m+m_{1}}, and mm1m\leq m_{1}. Then, we have

where Hˉ0\bar{H}_{0} and Hˉ1\bar{H}_{1} correspond to the following two hypotheses.

Bounding infϕ(PHˉ0ϕ+PHˉ1(1ϕ))\inf_{\phi}\left(P_{\bar{H}_{0}}\phi+P_{\bar{H}_{1}}(1-\phi)\right) is handled by the proof of Lemma 2 except that we do not have the relation (18) exactly. This slightly change the derivation of (73), as we will illustrate below. By the definition of Jt(,)J_{t}(\cdot,\cdot), we have

for some η=o(1)\eta=o(1). The last inequality uses Lemma 11 and the fact that δ=o(1)\delta=o(1). Since the term Cηθ0m(pq)2C\eta\theta_{0}m(\sqrt{p}-\sqrt{q})^{2} is of smaller order compared with the targeted exponent, the desired result can be derived following the remaining proof of Lemma 2. ∎

Following , we divide the sets {Cu}u[k]\{\mathcal{C}_{u}\}_{u\in[k]} into three groups. Define

Apply singular value decomposition to KK and we get K=l=12kλlululTK=\sum_{l=1}^{2k}\lambda_{l}u_{l}u_{l}^{T}. Then,

By Lemma 5 of , Tτ(A)PopCnαpθ2+1\|T_{\tau}(A)-P\|_{\rm op}\leq C\sqrt{n\alpha p\|{\theta}\|_{\infty}^{2}+1} with probability at least 1nC1-n^{-C^{\prime}}, where the constant CC^{\prime} can be made arbitrarily large. Hence,

with probability at least 1nC1-n^{-C^{\prime}} Moreover,

Using (74), the proof is complete by absorbing α\alpha into the constant. ∎

References