Achieving Optimal Misclassification Proportion in Stochastic Block Model

Chao Gao, Zongming Ma, Anderson Y. Zhang, Harrison H. Zhou

Introduction

Network data analysis has become one of the leading topics in statistics. In fields such as physics, computer science, social science and biology, one observes a network among a large number of subjects of interest such as particles, computers, people, etc. The observed network can be modeled as an instance of a random graph and the goal is to infer structures of the underlying generating process. A structure of particular interest is community: there is a partition of the graph nodes in some suitable sense so that each node belongs to a community. Starting with the proposal of a series of methodologies , we have seen a tremendous literature devoted to algorithmic solutions to uncovering community structure and great advances have also been made in recent years on the theoretical understanding of the problem in terms of statistical consistency and thresholds for detection and exact recoveries. See, for instance, , among others. In spite of the great efforts exerted on this “community detection” problem, its state-of-the-art solution has not yet reached the comparable level of maturity as what statisticians have achieved in other high dimensional problems such as nonparametric estimation , high dimensional regression and covariance matrix estimation , etc. In these more well-established problems, not only do we know the fundamental statistical limits, we also have computationally feasible algorithms to achieve them. The major goal of the present paper is to serve as a step towards such maturity in network data analysis by proposing a computationally feasible algorithm for community detection in stochastic block model with provable statistical optimality.

To describe network data with community structure, we focus on the stochastic block model (SBM) proposed by . Let A{0,1}n×nA\in\left\{0,1\right\}^{n\times n} be the symmetric adjacency matrix of an undirected random graph generated according to an SBM with kk communities. Then the diagonal entries of AA are all zeros and each Auv=AvuA_{uv}=A_{vu} for u>vu>v is an independent Bernoulli random variable with mean Puv=Bσ(u)σ(v)P_{uv}=B_{\sigma(u)\sigma(v)} for some symmetric connectivity matrix Bk×kB\in^{k\times k} and some label function σ:[n][k]\sigma:[n]\rightarrow[k], where for any positive integer mm, [m]={1,,m}[m]=\left\{1,\dots,m\right\}. In other word, if the uthu{{}^{\rm th}} node and the vthv{{}^{\rm th}} node belong to the ithi{{}^{\rm th}} and the jthj{{}^{\rm th}} community respectively, then σ(u)=i\sigma(u)=i, σ(v)=j\sigma(v)=j and there is an edge connecting uu and vv with probability BijB_{ij}. Community detection then refers to the problem of estimating the label function σ\sigma subject to a permutation of the community labels {1,,k}\left\{1,\dots,k\right\}. A natural loss function for such an estimation problem is the proportion of wrong labels (subject to a permutation of the label set [k][k]), which we shall refer to as misclassification proportion from here on.

In ground breaking works by Mossel et al. and Massoulié , the authors established sharp threshold for the regimes in which it is possible and impossible to achieve a misclassification proportion strictly less than 12\frac{1}{2} when k=2k=2 and both communities are of the same size (so that it is better than random guess), which solved the conjecture in that was only justified in physics rigor. For some recent progress on the general case of fixed kk and possibly unequal sized communities, see . On the other hand, Abbe et al. , Mossel et al. and Hajek et al. established the necessary and sufficient condition for ensuring zero misclassification proportion (usually referred to as “strong consistency” in the literature) with high probability when k=2k=2 and community sizes are equal, and was later generalized to a larger set of fixed kk by . Arguably, what is of more interest to statisticians is the intermediate regime between the above two cases, namely when the misclassification proportion is vanishing as the number of nodes grows but not exactly zero. This is usually called the regime of “weak consistency” in the network literature.

To achieve weak (and strong) consistency, statisticians have proposed various methods. One popular approach is spectral clustering which is motivated by the observation that the rank of the n×nn\times n matrix P=(Puv)=(Bσ(u)σ(v))P=(P_{uv})=(B_{\sigma(u)\sigma(v)}) is at most kk and its leading eigenvectors contain information of the community structure. The application of spectral clustering on network data goes back to , and its performance under the stochastic block model has been investigated by , among others. To further improve the performance, various ways for refining spectral clustering have been proposed, such as those in which lead to strong consistency or convergence rates that are exponential in signal-to-noise ratio, while studied the problem of minimizing a non-vanishing misclassification proportion. However, in the regime of weak consistency, these refinement methods are not guaranteed to attain the optimal misclassification proportion to be introduced below. Another important line of research is devoted to the investigation of likelihood-based methods, which was initiated by and later extended to more general settings by . To tackle the intractability of optimizing the likelihood function, an EM algorithm using pseudo-likelihood was proposed by . Another way to overcome the intractability of the maximum likelihood estimator (MLE) is by convex relaxation. Various semi-definite relaxations were studied by , and the aforementioned sharp threshold for strong consistency in was indeed achieved by semi-definite programming. Recently, Zhang and Zhou established the minimax risk for misclassification proportion in SBM under weak conditions, which is of the form

if all kk communities are of equal sizes, where II^{*} is the minimum Rényi divergence of order 12\frac{1}{2} of the within and the between community edge distributions. See Theorem 1 below for a more general and precise statement of the minimax risk. Unfortunately, Zhang and Zhou used MLE for achieving the risk in (1) which was hence computationally intractable. Moreover, none of the spectral clustering based method or tractable variants of the likelihood based method has a known error bound that matches (1) with the sharp constant 1+o(1)1+o(1) on the exponent.

The main contribution of the current paper lies in the proposal of a computationally feasible algorithm that provably achieves the optimal misclassification proportion established in adaptively under weak regularity conditions. It covers the cases of both finite and diverging number of communities and both equal and unequal community sizes and achieves both weak and strong consistency in the respective regimes. In addition, the algorithm is guaranteed to compute in polynomial time even when the number of communities diverges with the number of nodes. Since the error bound of the algorithm matches the optimal misclassification proportion in under weak conditions, it achieves various existing detection boundaries in the literature. For instance, for any fixed number of communities, the procedure is weakly consistent under the necessary and sufficient condition of , and strongly consistent under the necessary and sufficient condition of . Moreover, it could match the optimal misclassification proportion in even when kk diverges. To the best of our limited knowledge, this is the first polynomial-time algorithm that achieves minimax optimal performance. In other words, the proposed procedure enjoys both statistical and computational efficiency.

The core of the algorithm is a refinement scheme for community detection motivated by penalized maximum likelihood estimation. As long as there exists an initial estimator that satisfies a certain weak consistency criterion, the refinement scheme is able to obtain an improved estimator that achieves the optimal misclassification proportion in (1) with high probability. The key to achieve the bound in (1) is to optimize the local penalized likelihood function for each node separately. This local optimization step is completely data-driven and has a closed form solution, and hence can be computed very efficiently. The additional penalty term is indispensable as it plays a key rule in ensuring the optimal performance when the community sizes are unequal and when the within community and/or between community edge probabilities are unequal.

To obtain a qualified initial estimator, we show that both spectral clustering and its normalized variant could satisfy the desired condition needed for subsequent refinement, though the refinement scheme works for any other method satisfying a certain weak consistency condition. Note that spectral clustering can be considered as a global method, and hence our two-stage algorithm runs in a “from global to local” fashion. In essence, with high probability, the global stage pinpoints a local neighborhood in which we shall search for solution to each local penalized maximum likelihood problem, and the subsequent local stage finds the desired solution. From this viewpoint, one can also regard our approach as an “optimization after localization” procedure. Historically, this idea played a key role in the development of the renowned one-step efficient estimator . It has also led to recent progress in non-convex optimization and localized gradient descent techniques for finding optimal solutions to high dimensional statistical problems. Examples include but are not limited to high-dimensional linear regression , sparse PCA , sparse CCA , phase retrieval and high dimensional EM algorithm . A closely related idea has also found success in the development of confidence intervals for regression coefficients in high dimensional linear regression. See, for instance, and the references therein. Last but not least, even when viewed as a “spectral clustering plus refinement” procedure, our method distinguishes itself from other such methods in the literature by provably achieving the minimax optimal performance over a wide range of parameter configurations.

The rest of the paper is organized as follows. Section 2 formally sets up the community detection problem and presents the two-stage algorithm. The theoretical guarantees for the proposed method are given in Section 3, followed by numerical results demonstrating its competitive performance on both simulated and real datasets in Sections 4 and 5. A discussion on the results in the current paper and possible directions for future investigation is included in Section 6. Section 7 presents the proofs of main results with some technical details deferred to the appendix.

Problem formulation and methodology

In this section, we give a precise formulation of the community detection problem and present a new method for it. The method consists of two stages: initialization and refinement. We shall first introduce the second stage, which is the main algorithm of the paper. It clusters the network data by performing a node-wise penalized neighbor voting based on some initial community assignment. Then, we will discuss several candidates for the initialization step including a new greedy algorithm for clustering the leading eigenvectors of the adjacency matrix or of the graph Laplacian that is tailored specifically for stochastic block model. Theoretical guarantees for the algorithms introduced in the current section will be presented in Section 3.

Recall that a stochastic block model is completely characterized by a symmetric connectivity matrix Bk×kB\in^{k\times k} and a label vector σ[k]n\sigma\in[k]^{n}. One widely studied parameter space of SBM is

where β1\beta\geq 1 is an absolute constant. This parameter space Θ0(n,k,a,b,β)\Theta_{0}(n,k,a,b,\beta) contains all SBMs in which the within community connection probabilities are all equal to ana\over n and the between community connection probabilities are all equal to bnb\over n. In the special case of β=1\beta=1, all communities are of nearly equal sizes.

Assuming equal within and equal between connection probabilities can be restrictive. Thus, we also introduce the following larger parameter space

Throughout the paper, we treat β1\beta\geq 1 and α1\alpha\geq 1 as absolute constants, while k,a,bk,a,b and λ\lambda should be viewed as functions of the number of nodes nn which can vary as nn grows. Moreover, we assume 0<bn<an1ϵ0<\frac{b}{n}<\frac{a}{n}\leq 1-\epsilon throughout the paper for some numeric constant ϵ(0,1)\epsilon\in(0,1). Thus, the parameter space Θ(n,k,a,b,λ,β;α)\Theta(n,k,a,b,\lambda,\beta;\alpha) requires that the within community connection probabilities are bounded from below by ana\over n and the connection probabilities between any two communities are bounded from above by bnb\over n. In addition, it requires that the sizes of different communities are comparable. In order to guarantee that Θ(n,k,a,b,λ,β;α)\Theta(n,k,a,b,\lambda,\beta;\alpha) is a larger parameter space than Θ0(n,k,a,b,β)\Theta_{0}(n,k,a,b,\beta), we always require λ\lambda to be positive and sufficiently small such that

According to Proposition 1 in the appendix, a sufficient condition for (4) is λab2βk\lambda\leq\frac{a-b}{2\beta k}. We assume (4) throughout the rest of the paper.

The labels on the nn nodes induce a community structure [n]=i=1kCi[n]=\cup_{i=1}^{k}\mathcal{C}_{i}, where Ci={u[n]:σ(u)=i}\mathcal{C}_{i}=\left\{u\in[n]:\sigma(u)=i\right\} is the ithi{{}^{\rm th}} community with size ni=Cin_{i}=|\mathcal{C}_{i}|. Our goal is to reconstruct this partition, or equivalently, to estimate the label of each node modulo any permutation of label symbols. Therefore, a natural error measure is the misclassification proportion defined as

where SkS_{k} stands for the symmetric group on [k][k] consisting of all permutations of [k][k].

2 Main algorithm

We are now ready to present the main method of the paper – a refinement algorithm for community detection in stochastic block model motivated by penalized local maximum likelihood estimation.

Indeed, for any SBM in the parameter space Θ0(n,k,a,b,1)\Theta_{0}(n,k,a,b,1) with equal community size, the MLE for σ\sigma is

which is a combinatorial optimization problem and hence is computationally intractable. However, node-wise optimization of (6) has a simple closed form solution. Suppose the values of {σ(u)}u=2n\{\sigma(u)\}_{u=2}^{n} are known and we want to estimate σ(1)\sigma(1). Then, (6) reduces to

For each i[k]i\in[k], the quantity {v1:σ(v)=i}A1v\sum_{\{v\neq 1:\sigma(v)=i\}}A_{1v} is exactly the number of neighbors that the first node has in the ithi{{}^{\rm th}} community. Therefore, the most likely label for the first node is the one it has the most connections with when all communities are of equal sizes. In practice, we do not know any label in advance. However, we may estimate the labels of all but the first node by first applying a community detection algorithm σ0\sigma^{0} on the subnetwork excluding the first node and its associated edges, the adjacency matrix of which is denoted by A1A_{-1} since it is the (n1)×(n1)(n-1)\times(n-1) submatrix of AA with its first row and first column removed. Once we estimate the remaining labels, we can apply (7) to estimate σ(1)\sigma(1) but with {σ(v)}v=2n\{\sigma(v)\}_{v=2}^{n} replaced with the estimated labels.

For any u[n]u\in[n], let AuA_{-u} denote the (n1)×(n1)(n-1)\times(n-1) submatrix of AA with its uthu{{}^{\rm th}} row and uthu{{}^{\rm th}} column removed. Given any community detection algorithm σ0\sigma^{0} which is able to cluster any graph on n1n-1 nodes into kk categories, we present the precise description of our refinement scheme in Algorithm 1.

The algorithm works in two consecutive steps. The first step carries out the foregoing heuristics on a node by node basis. For each fixed node uu, we first leave the node out and apply the available community detection algorithm σ0\sigma^{0} on the remaining n1n-1 nodes and the edges among them (as summarized in the matrix Au{0,1}(n1)×(n1)A_{-u}\in\left\{0,1\right\}^{(n-1)\times(n-1)}) to obtain an initial community assignment vector σu0\sigma^{0}_{u}. For convenience, we make σu0\sigma^{0}_{u} an nn-vector by fixing σu0(u)=0\sigma^{0}_{u}(u)=0, though applying σ0\sigma^{0} on AuA_{-u} does not give any community assignment for uu. We then assign the label of the uthu{{}^{\rm th}} node according to (10), which is essentially (7) with σ\sigma replaced with σu0\sigma^{0}_{u} except for the additional penalty term. The additional penalty term is added to ensure the optimal performance even when both the diagonal and the off-diagonal entries of the connectivity matrix BB are allowed to take different values and the community sizes are not necessarily equal. To determine the penalty parameter ρu\rho_{u} in an adaptive way as spelled out in (11) – (12), we first estimate the connectivity matrix BB based on AuA_{-u} in (8) – (9). After we obtain the community assignment for uu, we organize the assignment for all nn vertices into an nn-vector σ^u\widehat{\sigma}_{u}. We call this step “penalized neighbor voting” since the first term on the RHS of (10) counts the number of neighbors of uu in each (estimated) community while the second term is a penalty term proportional to the size of each (estimated) community.

Once we complete the above procedure for each of the nn nodes, we obtain nn vectors σ^u[k]n\widehat{\sigma}_{u}\in[k]^{n}, u=1,,nu=1,\dots,n, and turn to the second step of the algorithm. The basic idea behind the second step is to obtain a unified community assignment by assembling {σ^u(u):u[n]}\{\widehat{\sigma}_{u}(u):u\in[n]\} and the immediate hurdle is that each σ^u\widehat{\sigma}_{u} is only determined up to a permutation of the community labels. Thus, the second step aims to find the right permutations by (13) before we assemble the σ^u(u)\widehat{\sigma}_{u}(u)’s. We call this step “consensus” since we are essentially looking for a consensus on the community labels for nn possibly different community assignments, under the assumption that all of them are close to the ground truth up to some permutation.

3 Initialization via spectral methods

Another important issue in spectral clustering lies in the subsequent clustering method used to cluster the eigenvectors. A popular choice is kk-means clustering. However, finding the global solution to the kk-means problem is NP-hard . Kumar et al. proposed a polynomial time algorithm for achieving (1+ϵ)(1+\epsilon) approximation to the kk-means problem for any fixed kk, which was utilized in to establish consistency for spectral clustering under stochastic block model with fixed number of communities. However, a closer look at the complexity bound suggests that the smallest possible ϵ\epsilon is proportional to kk. Thus, applying the algorithm and the associated bound in directly in our settings can lead to inferior error bounds when kk\to\infty as nn\to\infty. To address this issue under stochastic block model, we propose a greedy clustering algorithm in Algorithm 2 inspired by the fact that the clustering centers under stochastic block model are well separated from each other on the population level. It is straightforward to check that the complexity of Algorithm 2 is polynomial in nn.

Last but not least, we would like to emphasize that one needs not limit the initialization algorithm to the spectral methods introduced in this section. As Theorem 2 below shows, Algorithm 1 works for any initialization method that satisfies a weak consistency condition.

Theoretical properties

Before stating the theoretical properties of the proposed method, we first review the minimax rate in , which will be used as the optimality benchmark. The minimax risk is governed by the following critical quantity,

which is the Rényi divergence of order 12\frac{1}{2} between \mboxBern(an)\mbox{Bern}\left(\frac{a}{n}\right) and \mboxBern(bn)\mbox{Bern}\left(\frac{b}{n}\right), i.e., Bernoulli distributions with success probabilities an\frac{a}{n} and bn\frac{b}{n} respectively. Recall that 0<bn<an1ϵ0<\frac{b}{n}<\frac{a}{n}\leq 1-\epsilon is assumed throughout the paper. It can be shown that I(ab)2naI^{*}\asymp\frac{(a-b)^{2}}{na}. Moreover, when an=o(1)\frac{a}{n}=o(1),

where H2(P,Q)=12(dPdQ)2H^{2}(P,Q)=\frac{1}{2}\int(\sqrt{{\rm d}P}-\sqrt{{\rm d}Q})^{2} is the squared Hellinger distance between two distributions PP and QQ. The minimax rate for the parameter spaces (2) and (3) under the loss function (5) is given in the following theorem.

When (ab)2aklogk\frac{(a-b)^{2}}{ak\log k}\rightarrow\infty, we have

for both Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta) and Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha) with any λab2βk\lambda\leq\frac{a-b}{2\beta k} and any β[1,5/3)\beta\in[1,\sqrt{5/3}), where η=ηn0\eta=\eta_{n}\rightarrow 0 is some sequence tending to as nn\rightarrow\infty.

The assumption β[1,5/3)\beta\in[1,\sqrt{5/3}) is needed in for some technical reason. Here, the parameter β\beta enters the minimax rates when k3k\geq 3 since the worst case is essentially when one has two communities of size nβk\frac{n}{\beta k}, while for k=2k=2, the worst case is essentially two communities of size n2\frac{n}{2}. For all other results in this paper, we allow β\beta to be an arbitrary constant no less than 11.

To this end, let us show that the two-stage algorithm proposed in Section 2 achieves the optimal misclassification proportion. The essence of the two-stage algorithm lies in the refinement scheme described in Algorithm 1. As long as any initialization step satisfies a certain weak consistency criterion, the refinement step directly leads to a solution with optimal misclassification proportion. To be specific, the initialization step needs to satisfy the following condition.

There exist constants C0,δ>0C_{0},\delta>0 and a positive sequence γ=γn\gamma=\gamma_{n} such that

Under Condition 1, we have the following upper bounds regarding the performance of the proposed refinement scheme.

Suppose as nn\to\infty, (ab)2aklogk\frac{(a-b)^{2}}{ak\log k}\to\infty, aba\asymp b and Condition 1 is satisfied for

and Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta). Then there is a sequence η0\eta\to 0 such that

If in addition Condition 1 is satisfied for γ\gamma satisfying both (16) and

and Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha), then the conclusion in (17) continues to hold for Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha).

Theorem 2 assumes aba\asymp b. The case when aba\asymp b may not hold is considered in Section 6. Compared with Theorem 1, the upper bounds (17) achieved by Algorithm 1 is minimax optimal. The condition (16) for the parameter space Θ0(n,k,a,b,β)\Theta_{0}(n,k,a,b,\beta) is very mild. When k=O(1)k=O(1), it reduces to γ=o(1)\gamma=o(1) and simply means that the initialization should be weakly consistent at any rate. For kk\rightarrow\infty, it implies that the misclassification proportion within each community converges to zero. Note that if the initialization step gives wrong labels to all nodes in one particular community, then the misclassification proportion is at least 1/k1/k. The condition (16) rules out this situation. For the parameter space Θ(n,k,a,b,λ,β;α)\Theta(n,k,a,b,\lambda,\beta;\alpha), an extra condition (18) is required. This is because estimating the connectivity matrix BB in Θ(n,k,a,b,λ,β;α)\Theta(n,k,a,b,\lambda,\beta;\alpha) is harder than in Θ0(n,k,a,b,β)\Theta_{0}(n,k,a,b,\beta). In other words, if we do not pursue adaptive estimation, (18) is not needed.

Theorem 2 is an adaptive result without assuming the knowledge of aa and bb. When these two parameters are known, we can directly use aa and bb in (11) of Algorithm 1. By scrutinizing the proof of Theorem 2, the conditions (16) and (18) can be weakened as γ=o(k1)\gamma=o(k^{-1}) in this case.

Given the results of Theorem 2, it remains to check the initialization step via spectral clustering satisfies Condition 1. For matrix P=(Puv)=(Bσ(u)σ(v))P=(P_{uv})=(B_{\sigma(u)\sigma(v)}) with (B,σ)(B,\sigma) belonging to either Θ0(n,k,a,b,β)\Theta_{0}(n,k,a,b,\beta) or Θ(n,k,a,b,λ,β;α)\Theta(n,k,a,b,\lambda,\beta;\alpha), we use λk\lambda_{k} to denote λk(P)\lambda_{k}(P). Define the average degree by

Assume eaC1be\leq a\leq C_{1}b for some constant C1>0C_{1}>0 and

for some sufficiently small c(0,1)c\in(0,1). Consider USC(τ)(\tau) with a sufficiently small constant μ>0\mu>0 in Algorithm 2 and τ=C2dˉ\tau=C_{2}\bar{d} for some sufficiently large constant C2>0C_{2}>0. For any constant C>0C^{\prime}>0, there exists some C>0C>0 only depending on C,C1,C2C^{\prime},C_{1},C_{2} and μ\mu such that

with probability at least 1nC1-n^{-C^{\prime}}. If kk is fixed, the same conclusion holds without assuming aC1ba\leq C_{1}b.

Theorem 3 improves the error bound for spectral clustering in . While requires the assumption a>Clogna>C\log n, our result also holds for a=o(logn)a=o(\log n). A result close to ours is that by , but their clustering step is different from Algorithm 2. Moreover, the conclusion of Theorem 3 holds with probability 1nC1-n^{-C^{\prime}} for an arbitrary large CC^{\prime}, which is critical because the initialization step needs to satisfy Condition 1 for the subsequent refinement step to work. On the other hand, the bound in is stated with probability 1o(1)1-o(1).

When k=O(1)k=O(1), Theorem 2 and Theorem 3 jointly imply the following result.

Consider Algorithm 1 initialized by σ0\sigma^{0} with USC(τ)(\tau) for τ=Cdˉ\tau=C\bar{d}, where CC is a sufficiently large constant. Suppose as nn\rightarrow\infty, k=O(1)k=O(1), (ab)2a\frac{(a-b)^{2}}{a}\rightarrow\infty and aba\asymp b. Then, there exists a sequence η0\eta\rightarrow 0 such that

where the parameter space is Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta).

Compared with Theorem 1, the proposed procedure achieves the minimax rate under the condition (ab)2a\frac{(a-b)^{2}}{a}\rightarrow\infty and aba\asymp b. When k=O(1)k=O(1), the condition (ab)2a\frac{(a-b)^{2}}{a}\rightarrow\infty is necessary and sufficient for weak consistency in view of Theorem 1. More general results including the case of kk\rightarrow\infty are stated and discussed in Section 6.

The following theorem characterizes the misclassification rate of normalized spectral clustering.

Assume eaC1be\leq a\leq C_{1}b for some constant C1>0C_{1}>0 and

for some sufficiently small c(0,1)c\in(0,1). Consider NSC(τ)(\tau) with a sufficiently small constant μ>0\mu>0 in Algorithm 2 and τ=C2dˉ\tau=C_{2}\bar{d} for some sufficiently large constant C2>0C_{2}>0. Then, for any constant C>0C^{\prime}>0, there exists some C>0C>0 only depending on C,C1,C2C^{\prime},C_{1},C_{2} and μ\mu such that

with probability at least 1nC1-n^{-C^{\prime}}. If kk is fixed, the same conclusion holds without assuming aC1ba\leq C_{1}b.

A slightly different regularization of normalized spectral clustering is studied by only for the dense regime, while Theorem 4 holds under both dense and sparse regimes. Moreover, our result also improves that of due to our tighter bound on L(Aτ)L(Pτ)op\|L(A_{\tau})-L(P_{\tau})\|_{\rm op} in Lemma 7 below. We conjecture that the loga\log a factor in both the assumption and the bound of Theorem 4 can be removed.

Note that Theorem 3 and Theorem 4 are stated in terms of the quantity λk\lambda_{k}. We may specialize the results into the parameter spaces defined in (2) and (3). By Proposition 1, λkab2βk\lambda_{k}\geq\frac{a-b}{2\beta k} for Θ0(n,k,a,b,β)\Theta_{0}(n,k,a,b,\beta) and λkλ\lambda_{k}\geq\lambda for Θ(n,k,a,b,λ,β;α)\Theta(n,k,a,b,\lambda,\beta;\alpha). The implications of Theorem 3 and Theorem 4 and their uses as the initialization step for Algorithm 1 are discussed in full details in Section 6.

Numerical results

In this section we present the performance of the proposed algorithm on simulated datasets. The experiments cover three different scenarios: (1) dense network with communities of equal sizes; (2) dense network with communities of unequal sizes; and (3) sparse network. Recall the definition of dˉ\bar{d} in (19). For each setting, we report results of Algorithm 1 initialized with four different approaches: USC()(\infty), USC(2dˉ)(2\bar{d}), NSC(0)(0) and NSC(dˉ)(\bar{d}), the description of which can all be found in Section 2.3. For all these spectral clustering methods, Algorithm 2 was used to cluster the leading eigenvectors. The constant μ\mu in the critical radius definition was set to be 0.50.5 in all the results reported here. For each setting, the results are based on 100 independent draws from the underlying stochastic block model.

To achieve faster running time, we also ran a simplified version of Algorithm 1. Instead of obtaining nn different initializers {σu}u[n]\{\sigma_{u}\}_{u\in[n]} to refine each node separately, the simplified algorithm refines all the nodes with a single initialization on the whole network. Thus, the running time can be reduced roughly by a factor of nn. Simulation results below suggest that the simplified version achieves similar performances to that of Algorithm 1 in all the settings we have considered. For the precise description of the simplified algorithm, we refer readers to Algorithm 3 in the appendix.

In this setting, we generate networks with 2500 nodes and 10 communities, each of which consists of 250 nodes, and we set Bii=0.48B_{ii}=0.48 for all ii and Bij=0.32B_{ij}=0.32 for all iji\neq j. Figure 1 shows the boxplots of the number of misclassified nodes. The first four boxplots correspond to the four different spectral clustering methods, in the order of USC()(\infty), USC(2dˉ)(2\bar{d}), NSC(0)(0) and NSC(dˉ)(\bar{d}). The middle four correspond to the results achieved by applying the simplified refinement scheme with these four initialization methods, and the last four show the results of Algorithm 1 with these four initialization methods. Regardless of the initialization method, Algorithm 1 or its simplified version reduces the number of misclassified nodes from around 30 to around 5.

Imbalanced case

In this setting, we generate networks with 2000 nodes and 4 communities, the sizes of which are 200,400,600200,400,600 and 800800, respectively. The connectivity matrix is

Hence, the within-community edge probability is no smaller than 0.45 while the between-community edge probability is no greater than 0.35, and the underlying SBM is inhomogeneous. Figure 2 shows the boxplots of the number of misclassified nodes obtained by different initialization methods and their refinements, and the boxplots are presented in the same order as those in Figure 1. Similarly, we can see refinement significantly reduces the error.

Sparse case

In this setting we consider a much sparser stochastic block model than the previous two cases. In particular, each simulated network has 4000 nodes, divided into 10 communities all of size 400. We set all Bii=0.032B_{ii}=0.032 and all Bij=0.005B_{ij}=0.005 when iji\neq j. The average degree of each node in the network is around 30. Figure 3 shows the boxplots of the number of misclassified nodes obtained by different initialization methods and their refinements, and the boxplots are presented in the same order as those in Figure 1. Compared with either USC or NSC initialization, refinement reduces the number of misclassified nodes by 50%.

Summary

In all three simulation settings, for all four initialization approaches considered, the refinement scheme in Algorithm 1 (and its simplified version) was able to significantly reduce the number of misclassified nodes, which is in agreement with the theoretical properties presented in Section 3.

Real data example

We now compare the results of our algorithm and some existing methods on a political blog dataset . Each node in this network represents a blog about US politics and a pair of nodes is connected if one blog contains a link to the other. There were 1490 nodes to start with, each labeled liberal or conservative. In what follows, we consider only the 1222 nodes located in the largest connected component of the network. This pre-processing step is the same as what was done in . After pre-processing, the network has 586 liberal blogs and 636 conservative ones which naturally form two communities. As shown in the right panel of Figure 4, nodes are more likely to be connected if they have the same political ideology.

Table 1 summarizes the results of Algorithm 1 and its simplified version on this network with four different initialization methods, as well as the performances of directly applying the four methods on the dataset. The average degree of the network dˉ\bar{d} is 27, which is used as the tuning parameter for regularized NSC. For regularized USC, we set τ\tau equals to twice the average degree, leading to the removal of 196 most connected nodes. The result of directly applying any of the four spectral clustering based initializations was unsatisfactory with at least 30% nodes misclassified. Despite the unsatisfactory performance of the initializers, Algorithm 1 and its simplified version are able to significantly reduce the number of misclassified nodes except for the case of NSC(0)(0), and the performance of the two are close to each other regardless of the initialization method.

An interesting observation is that if we apply the refinement scheme multiple times, the number of misclassified nodes keeps decreasing until convergence and the further reduction of misclassification proportion compared to a single refinement can be sizable. Figure 5 plots the numbers of misclassified nodes for multiple iterations of refinement via the simplified version of Algorithm 1. We are able to achieve 61, 58 or 63 misclassified nodes out of 1222 depending on which initialization method is used. For the three initialization methods included in the figure, the number of misclassified nodes converges within several iterations. NSC with τ=0\tau=0 is not included in Figure 5 due to the relatively inferior initialization, but its error also converges to around 60/1222 after 20 iterations. For comparison, state-of-the-art method such as SCORE was reported to achieve a comparable error of 58/1222. It is worth noting that SCORE was designed under the setting of degree-corrected stochastic block model, which fits the current dataset better than SBM due to the presence of hubs and low-degree nodes. The regularized spectral clustering implemented by , which was also designed under the degree-corrected stochastic block model, was reported to have an error of (80±2)/1222(80\pm 2)/1222. The semi-definite programming method by achieved 63/1222.

To summarize, our algorithm leads to significant performance improvement over several popular spectral clustering based methods on the political blog dataset. With repeated refinements, it demonstrates competitive performance even when compared with methods designed for models that better fit the current dataset.

Discussion

In this section, we discuss a few important issues related to the methodology and theory we have presented in the previous sections.

In Section 3, we established upper bounds on misclassification proportion under the assumption of aba\asymp b. The following theorem shows that slightly weaker upper bounds can be obtained even when aba\asymp b does not hold. To state the result, recall that we assume throughout the paper an1ϵ\frac{a}{n}\leq 1-\epsilon for some numeric constant ϵ(0,1)\epsilon\in(0,1).

Suppose as nn\to\infty, (ab)2aklogk\frac{(a-b)^{2}}{ak\log k}\to\infty and Condition 1 is satisfied for γ\gamma satisfying (16) and Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta). Then for some positive constants cϵc_{\epsilon} and CϵC_{\epsilon} that depend only on ϵ\epsilon, for any sufficiently small constant ϵ0(0,cϵ)\epsilon_{0}\in(0,c_{\epsilon}), if we replace the definition of tut_{u}’s in (11) with

where II^{*} is defined as in (14). In particular, we can set Cϵ=1032ϵϵ2log2ϵC_{\epsilon}=\frac{10}{3}\frac{2-\epsilon}{\frac{\epsilon}{2}\log\frac{2}{\epsilon}} and cϵ=min(110Cϵ,ϵ2ϵ)c_{\epsilon}=\min(\frac{1}{10C_{\epsilon}},\frac{\epsilon}{2-\epsilon}).

If in addition Condition 1 is satisfied for γ\gamma satisfying both (16) and (18) and Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha), then the same conclusion holds for Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha).

Compared with the conclusion (17) in Theorem 2, the vanish sequence η\eta in the exponent of the upper bound is replaced by Cϵϵ0C_{\epsilon}\epsilon_{0}, which is guaranteed to be smaller than min(0.1,2log(2/ϵ))\min(0.1,\frac{2}{\log(2/\epsilon)}) and can be driven to be arbitrarily small by decreasing ϵ0\epsilon_{0}. To achieve this, the tut_{u}’s used in defining the penalty parameters in the penalized neighbor voting step need to be truncated at the value log1ϵ0/2\log\frac{1}{\epsilon_{0}/2}.

2 Implications of the results

We now discuss some implications of the results in Theorems 2 – 5.

When using USC as initialization for Algorithm 1, we obtain the following results by combining Theorem 2, Theorem 3 and Theorem 5. Recall that dˉ\bar{d} is the average degree of nodes in AA defined in (19).

Consider Algorithm 1 initialized by σ0\sigma^{0} with USC(τ)(\tau) with τ=Cdˉ\tau=C\bar{d} for some sufficiently large constant C>0C>0. If as nn\rightarrow\infty, aba\asymp b and

then there is a sequence η0\eta\rightarrow 0 such that (17) holds with Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta). If as nn\rightarrow\infty, aba\asymp b and

then (17) holds for Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha). If for either parameter space, aba\asymp b may not hold but kk is fixed and (24) or (25) holds respectively, then (23) holds as long as tut_{u} is replaced by (22) in Algorithm 1.

regardless of the behavior of ab\frac{a}{b}. On the other hand, Theorem 1 shows that it is impossible to achieve strong consistency if

When an=o(1)\frac{a}{n}=o(1), nI=(1+o(1))(ab)2nI^{*}=(1+o(1))(\sqrt{a}-\sqrt{b})^{2} and so one can replace nI{nI^{*}} in (26) – (27) with (ab)2{(\sqrt{a}-\sqrt{b})^{2}}. In the literature, Abbe et al. , Mossel et al. and Hajek et al. obtained comparable strong consistency results via efficient algorithms for the special case of two communities of equal sizes, i.e., k=2k=2 and β=1\beta=1. Under the additional assumption of ablogna\asymp b\asymp{\log n}, Hajek et al. later achieved the result via efficient algorithm for the case of fixed kk and β=1\beta=1, and Abbe and Sandon investigated the case of fixed kk and β1\beta\geq 1. In comparison, our result holds for any fixed kk and any β1\beta\geq 1 without assuming ablogna\asymp b\asymp\log n. In the weak consistency regime, in terms of misclassification proportion, for the special case of k=2k=2 and β=1\beta=1, Yun and Proutiere achieved the optimal rate for Θ0(n,2,a,b,1)\Theta_{0}(n,2,a,b,1) when ababa\asymp b\asymp a-b, while the error bounds in other papers are typically off by a constant multiplier on the exponent. In comparison, Theorem 6 provides optimal results (17) and near optimal results (23) for a much broader class of models under much weaker conditions. Last but not least, our algorithm can provably achieve strong consistency and minimax optimal performance even for growing kk, which to our limited knowledge, is the first in the literature.

The performance of Algorithm 1 initialized by NSC can be summarized as the following theorem by combining Theorem 2, Theorem 4 and Theorem 5. In this case, the sufficient condition for achieving minimax optimal performance is slightly stronger than when USC is used for initialization.

Consider Algorithm 1 initialized by σ0\sigma^{0} with NSC(τ)(\tau) with τ=Cdˉ\tau=C\bar{d} for some sufficiently large constant C>0C>0. If as nn\rightarrow\infty, aba\asymp b and

then there is a sequence η0\eta\rightarrow 0 such that (17) holds with Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta). If as nn\rightarrow\infty, aba\asymp b and

then (17) holds for Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha). If for either parameter space, aba\asymp b may not hold but kk is fixed and (28) or (29) holds respectively, then (23) holds as long as tut_{u} is replaced by (22) in Algorithm 1.

Last but not least, we would like to point out that when the key parameters aa and bb are known, we can obtain the desired performance guarantee under weaker conditions as summarized in the following theorem.

Suppose a,ba,b are known. Consider Algorithm 1 initialized by σ0\sigma^{0} with USC(τ\tau) with τ=Ca\tau=Ca for some sufficiently large constant C>0C>0 and a^u=a\widehat{a}_{u}=a, b^u=b\widehat{b}_{u}=b in (9) for all u[n]u\in[n]. If as nn\to\infty, aba\asymp b and

then there is a sequence η0\eta\to 0 such that (17) holds with Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta). If as nn\to\infty, aba\asymp b and

then (17) holds with Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha). If for either parameter space without assuming aba\asymp b, (30) or (31) holds respectively, then (23) holds if in addition tut_{u} is replaced by (22).

If instead NSC(τ\tau) is used for initialization with τ=Ca\tau=Ca for some sufficiently large constant C>0C>0, then the above conclusions hold if we replace (30) with (ab)2ak3loga\frac{(a-b)^{2}}{ak^{3}\log a}\to\infty and (31) with λ2akloga\frac{\lambda^{2}}{ak\log a}\to\infty, respectively.

3 Potential future research problems

In simulation studies, we experimented a simplified version of Algorithm 1 (with precise description as Algorithm 3 in appendix) and showed that it provided similar performance to Algorithm 1 on simulated datasets. Moreover, for the political blog data, we showed that iterative application of this simplified refinement scheme kept driving down the number of misclassified nodes till convergence. It is of great interest to see if comparable theoretical results to Theorem 2 could be established for the simplified and/or the iterative version, and if the iterative version converges to a local optimum of certain objective function for community detection. Though answering these intriguing questions is beyond the scope of the current paper, we think it can serve as an interesting future research problem.

Data-driven choice of k𝑘k

The knowledge of kk is assumed and is used in both methodology and theory of the present paper. Date-driven choice of kk is of both practical importance and contemporary research interest, and researchers have proposed various ways to achieve this goal for stochastic block model, including cross-validation , Tracy–Widom test , information criterion , likelihood ratio test , etc. Whether these methods are optimal and whether it is possible to select kk in a statistically optimal way remains an important open problem.

More general models

The results in this paper cover a large range of parameter spaces for stochastic block models and we show the competitive performance of the proposed algorithm both in theory and on numerical examples. Despite its popularity, stochastic block model has its own limits for modeling network data. Therefore, an important future research direction is to design computationally feasible algorithms that can achieve statistically optimal performance for more general network models, such as degree-corrected stochastic block models.

Proofs of main results

The main result of the paper, Theorem 2, is proved in Section 7.1. Theorem 3 and Theorem 4 are proved in Section 7.2 and Section 7.3 respectively. The proofs of the remaining results, together with some auxiliary lemmas, are given in the appendix.

We first state a lemma that guarantees the accuracy of parameter estimation in Algorithm 1.

Let Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha). Suppose as nn\to\infty, (ab)2ak\frac{(a-b)^{2}}{ak}\to\infty and Condition 1 holds with γ\gamma satisfying (16) and (18). Then there is a sequence η0\eta\to 0 as nn\to\infty and a constant C>0C>0 such that

For Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta), the conclusion (32) continues to hold even when the assumption (18) is dropped.

11^{\circ} Let Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha). For any community assignments σ1\sigma_{1} and σ2\sigma_{2}, define

Fix any (B,σ)Θ(B,\sigma)\in\Theta and u[n]u\in[n]. Define event

Let Ci{\mathcal{C}}_{i}^{\prime} be any deterministic subset of [n][n] such that (35) holds with C~iu\widetilde{\mathcal{C}}_{i}^{u} replaced by Ci{\mathcal{C}}_{i}^{\prime}. By definition, there are at most

different subsets with this property where C1>0C_{1}>0 is an absolute constant. Let Ei{\mathcal{E}}_{i}^{\prime} be the edges within Ci{\mathcal{C}}_{i}^{\prime}. Then Ei|{\mathcal{E}}_{i}^{\prime}| consists of independent Bernoulli random variables, where at least (1βγk)2(1-\beta\gamma k)^{2} proportion of them follow the Bern(Bii)(B_{ii}) distribution, at most (βγk)2(\beta\gamma k)^{2} proportion that are stochastically smaller than Bern(αan)(\frac{\alpha a}{n}) and stochastically larger than Bern(an)(\frac{a}{n}), and at most 2βγk2\beta\gamma k proportion are stochastically smaller than Bern(bn)(\frac{b}{n}). Therefore, we obtain that

Note that the LHS is (1(2+o(1))βγk)Bii(1-(2+o(1))\beta\gamma k)B_{ii}. On the other hand, under condition (18), the RHS is attained at t=0t=0 and equals BiiB_{ii} exactly. Thus, we conclude that

for some η0\eta^{\prime}\to 0 that depends only on a,k,α,βa,k,\alpha,\beta and γ\gamma, where the last inequality is due to (18).

On the other hand, by Bernstein’s inequality, for any t>0t>0,

where we the second inequality holds since logxx\frac{\log x}{x} is monotone decreasing as xx increases and so γlogγ11nlogn\gamma\log\gamma^{-1}\geq\frac{1}{n}\log n for any γ1n\gamma\geq\frac{1}{n}, which is the case of most interest since γ<1n\gamma<\frac{1}{n} leads to γ=0\gamma=0 and so the initialization is already perfect. Even when γ=0\gamma=0, we can still continue to the following arguments by replacing every γ\gamma with 1n\frac{1}{n} and all the steps continue to hold. Thus, we obtain that for positive constant Cα,β,δC_{\alpha,\beta,\delta} that depends only on α,β\alpha,\beta and δ\delta,

Thus, with probability at least 1exp{C1γnlogγ1}n(3+δ)1-\exp\left\{-C_{1}\gamma n\log\gamma^{-1}\right\}n^{-(3+\delta)},

where η0\eta^{\prime}\to 0 depends only on a,k,α,β,γa,k,\alpha,\beta,\gamma and δ\delta. Here, the last inequality holds since

where akab\sqrt{ak}\ll a-b since (ab)2ak\frac{(a-b)^{2}}{ak}\to\infty and kγlogγ1=O(1)k\gamma\log\gamma^{-1}=O(1), and

We combine (37) and (39) and apply the union bound to obtain that for a sequence η0\eta\to 0 that depends only on a,k,α,β,γa,k,\alpha,\beta,\gamma and δ\delta, with probability at least 1n(3+δ)1-n^{-(3+\delta)}

The proof for BijB_{ij} estimation is analogous and hence is omitted. A final union bound on i,j[k]i,j\in[k] leads to the desired claim since all the constants and vanishing sequences in the above analysis depend only on a,b,k,α,β,γa,b,k,\alpha,\beta,\gamma and δ\delta, but not on uu, BB or σ\sigma.

22^{\circ} If Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta), then condition (18) on γ\gamma is no longer needed. This is because (36) can be replaced by

where the LHS equals an(1βγk(1+o(1)))abn=an+o(abn)\frac{a}{n}-(1-\beta\gamma k(1+o(1)))\frac{a-b}{n}=\frac{a}{n}+o(\frac{a-b}{n}) and the RHS equals an\frac{a}{n}. Thus, no additional condition is needed to guarantee (37) in the foregoing arguments. This completes the proof. ∎

The next two lemmas establish the desired error bound for the node-wise refinement.

Let Θ0\Theta_{0} be defined as in (2) and k2k\geq 2. Suppose as nn\to\infty, (ab)2ak\frac{(a-b)^{2}}{ak}\to\infty and aba\asymp b. If there exists two sequences γ=o(1/k)\gamma=o(1/k) and η=o(1)\eta=o(1), constants C,δ>0C,\delta>0 and permutations {πu}u=1nSk\left\{\pi_{u}\right\}_{u=1}^{n}\subset S_{k} such that

Then for σ^u(u)\widehat{\sigma}_{u}(u) defined as in (10) with ρ=ρu\rho=\rho_{u} in (12), there is a sequence η=o(1)\eta^{\prime}=o(1) such that for k=2k=2,

In what follows, let EuE_{u} denote the event in (41). For the sake of brevity, we let p=a/np=a/n, q=b/nq=b/n, p^u=a^u/n\widehat{p}_{u}=\widehat{a}_{u}/n and q^u=b^u/n\widehat{q}_{u}=\widehat{b}_{u}/n. Moreover, let σu=πu(σ)\sigma_{u}=\pi_{u}(\sigma), ni={v:σu(v)=i}n_{i}=|\{v:\sigma_{u}(v)=i\}|, mi={v:σu0(v)=i}m_{i}=|\{v:\sigma^{0}_{u}(v)=i\}| and mi={v:σu0(v)=σu(v)=i}m_{i}^{\prime}=|\{v:\sigma^{0}_{u}(v)=\sigma_{u}(v)=i\}|. Without loss of generality, let σu(u)=1\sigma_{u}(u)=1.

Now we bound each plp_{l}. By the independence structure and Chernoff bound, we have

We are going to give bounds for the terms in (45) and (45) respectively. Before doing that, we need some preparatory inequalities. Define tt^{*} through the equation

for some constant C2>0C_{2}>0. Therefore, for the term in (45), on the event EuE_{u},

By (46), the term in (49) is upper bounded by

By (47), the term in (49) is upper bounded by

Now we provide an upper bound for (45). By (47), on EuE_{u},

When k=2k=2, minl1(n1+nl2)=n2\min_{l\neq 1}\left(\frac{n_{1}+n_{l}}{2}\right)=\frac{n}{2}, and when k3k\geq 3, minl1(n1+nl2)nβk\min_{l\neq 1}\left(\frac{n_{1}+n_{l}}{2}\right)\geq\frac{n}{\beta k}. Thus, the proof is complete. ∎

Let Θ\Theta be defined as in (3) and k2k\geq 2. Suppose as nn\to\infty, (ab)2ak\frac{(a-b)^{2}}{ak}\to\infty and aba\asymp b. If there exists two sequences γ=o(abak)\gamma=o\left(\frac{a-b}{ak}\right) and η=o(1)\eta=o(1), constants C,δ>0C,\delta>0 and permutations {πu}u=1nSk\left\{\pi_{u}\right\}_{u=1}^{n}\subset S_{k} such that (41) holds. Then for σ^u(u)\widehat{\sigma}_{u}(u) defined as in (10) with ρ=ρu\rho=\rho_{u} in (12), the conclusions of Lemma 2 continue to hold.

The proof is similar to that of Lemma 2 and we use the same notation as there. First, we give a bound for plp_{l} defined in (42). Let Xj\mboxBern(q)X_{j}\sim\mbox{Bern}(q), Yj\mboxBern(p)Y_{j}\sim\mbox{Bern}(p) and Zj\mboxBern(αp)Z_{j}\sim\mbox{Bern}(\alpha p), j1j\geq 1, be mutually independent. Then, a stochastic order argument gives

Note that the term in (54) is the same as that in (45), and thus it can be upper bounded by (50) as before. To bound for (54), observe that by (47),

Thus, under the assumption γ=o(pqkp)\gamma=o\left(\frac{p-q}{kp}\right), the term (54) is bounded by exp(o(1)n1+nl2I)\exp\left(o(1)\frac{n_{1}+n_{l}}{2}I^{*}\right). The remaining proof is the same as that of Lemma 2. ∎

Finally, we need a lemma to justify the consensus step in Algorithm 1.

For any community assignments σ\sigma and σ\sigma^{\prime}: [n][k][n]\to[k], such that for some constant C1C\geq 1

Thus, what remains to be shown is that ξSk\xi\in S_{k}, i.e., ξ(l1)ξ(l2)\xi(l_{1})\neq\xi(l_{2}) for any l1l2l_{1}\neq l_{2}. To this end, note that if for some l1l2l_{1}\neq l_{2}, ξ(l1)=ξ(l2)\xi(l_{1})=\xi(l_{2}), then there would exist some l0[k]l_{0}\in[k] such that for any l[k]l\in[k], ξ(l)l0\xi(l)\neq l_{0}, and so

This is in contradiction to the second last display, and hence ξSk\xi\in S_{k}. This completes the proof. ∎

Let Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha), and fix any (B,σ)Θ(B,\sigma)\in\Theta. For any u[n]u\in[n], by Condition 1 and the fact that σu0\sigma^{0}_{u} and σ^u\widehat{\sigma}_{u} differ only at the community assignment of uu, for γ=γ+1/n\gamma^{\prime}=\gamma+1/n, there exists some πuSk\pi_{u}\in S_{k} such that

In addition, (56) implies with probability at least 1Cn(1+δ)1-Cn^{-(1+\delta)}, we have

When k3k\geq 3, Lemma 1, (16) and (18) imply that the condition of Lemma 3 is satisfied, which in turn implies that for a sequence η=o(1)\eta^{\prime}=o(1),

where the last inequality holds since nIk(ab)2ak\frac{\,nI^{*}}{k}\asymp\frac{(a-b)^{2}}{ak}\to\infty. Thus, Markov’s inequality leads to

If (k1)exp{(1η)nIβk}n(1+δ/2)(k-1)\exp\left\{-(1-\eta)\frac{nI^{*}}{\beta k}\right\}\geq n^{-(1+\delta/2)}, then

If (k1)exp{(1η)nIβk}<n(1+δ/2)(k-1)\exp\left\{-(1-\eta)\frac{nI^{*}}{\beta k}\right\}<n^{-(1+\delta/2)}, then

Here, the second last inequality holds since η>η\eta>\eta^{\prime} and so (k1)exp{(1η)nI/(βk)}<(k1)exp{(1η)nI/(βk)}<n(1+δ/2)(k-1)\exp\left\{-(1-\eta^{\prime})nI^{*}/(\beta k)\right\}<(k-1)\exp\left\{-(1-\eta)nI^{*}/(\beta k)\right\}<n^{-(1+\delta/2)}. We complete the proof for the case of Θ(n,k,a,b,λ,β;α)\Theta(n,k,a,b,\lambda,\beta;\alpha) and k3k\geq 3 by noting that (k1)exp{(1η)nIβk}=exp{(1η)nIβk}(k-1)\exp\left\{-(1-\eta)\frac{nI^{*}}{\beta k}\right\}=\exp\left\{-(1-\eta^{\prime\prime})\frac{nI^{*}}{\beta k}\right\} for another sequence η=o(1)\eta^{\prime\prime}=o(1) under the assumption (ab)2aklogk\frac{(a-b)^{2}}{ak\log k}\rightarrow\infty and no constant or sequence in the foregoing arguments involves B,σB,\sigma or uu. When Θ=Θ(n,k,a,b,λ,β;α)\Theta=\Theta(n,k,a,b,\lambda,\beta;\alpha) and k=2k=2, the foregoing arguments continue to hold with β\beta and kk replaced with 11 and 22 respectively.

When Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta), we can run the foregoing arguments with Lemma 3 replaced by Lemma 2 to reach the conclusion in (17), which does not require condition (18). This completes the proof. ∎

2 Proof of Theorem 3

Consider a symmetric adjacency matrix A{0,1}n×nA\in\{0,1\}^{n\times n} and a symmetric matrix Pn×nP\in^{n\times n} satisfying Auu=0A_{uu}=0 for all u[n]u\in[n] and AuvBernoulli(Puv)A_{uv}\sim\text{Bernoulli}(P_{uv}) independently for all u>vu>v. For any C>0C^{\prime}>0, there exists some C>0C>0 such that

with probability at least 1nC1-n^{-C^{\prime}} uniformly over τ[C1(npmax+1),C2(npmax+1)]\tau\in[C_{1}(np_{\max}+1),C_{2}(np_{\max}+1)] for some sufficiently large constants C1,C2C_{1},C_{2}, where pmax=maxuvPuvp_{\max}=\max_{u\geq v}P_{uv}.

For P=(Puv)=(Bσ(u)σ(v))P=(P_{uv})=(B_{\sigma(u)\sigma(v)}), we have SVD P=UΛUTP=U\Lambda U^{T}, where

with Δ=diag(n1,...,nk)\Delta=\text{diag}(\sqrt{n_{1}},...,\sqrt{n_{k}}), Z{0,1}n×kZ\in\{0,1\}^{n\times k} is a matrix with exactly one nonzero entry in each row at (i,σ(i))(i,\sigma(i)) taking value 11 and WO(k,k)W\in O(k,k).

and observe that ZΔ1O(n,k)Z\Delta^{-1}\in O(n,k). Apply SVD to the matrix ΔBΔT=WΛWT\Delta B\Delta^{T}=W\Lambda W^{T} for some WO(k,k)W\in O(k,k), and then we have P=UΛUTP=U\Lambda U^{T} with U=ZΔ1WO(k,k)U=Z\Delta^{-1}W\in O(k,k). ∎

where V=ZΔ1W2O(n,k)V=Z\Delta^{-1}W_{2}\in O(n,k) for some W2O(k,k)W_{2}\in O(k,k). Combining (60), Lemma 5 and the conclusion τ[C1a,C2a]\tau\in[C_{1}a,C_{2}a], we have

with probability at least 1nC1-n^{-C^{\prime}}. The definition of VV implies that

By definition, TiTj=T_{i}\cap T_{j}=\varnothing when iji\neq j, and we also have

where the last inequality is by (61). After rearrangement, we have

In other words, most nodes are close to the centers and are in the set (63). Note that the sets {Ti}i[k]\{T_{i}\}_{i\in[k]} are disjoint. Suppose there is some i[k]i\in[k] such that Ti<σ1(i)(i[k]Ti)c|T_{i}|<|\sigma^{-1}(i)|-\left|\left(\cup_{i\in[k]}T_{i}\right)^{c}\right|, we have i[k]Ti=i[k]Ti<n(i[k]Ti)c=i[k]Ti\left|\cup_{i\in[k]}T_{i}\right|=\sum_{i\in[k]}|T_{i}|<n-\left|\left(\cup_{i\in[k]}T_{i}\right)^{c}\right|=\left|\cup_{i\in[k]}T_{i}\right|, which is impossible. Thus, the cardinality of TiT_{i} for each i[k]i\in[k] is lower bounded as

where the last inequality above is by the assumption (20). Intuitively speaking, except for a negligible proportion, most data points in {U^u}u[n]\{\widehat{U}_{u*}\}_{u\in[n]} are very close to the population centers {Qi}i[k]\{Q_{i*}\}_{i\in[k]}. Since the centers are at least 2kβn\sqrt{\frac{2k}{\beta n}} away from each other and {Ti}i[k]\{T_{i}\}_{i\in[k]} and {C^i}i[k]\{\widehat{C}_{i}\}_{i\in[k]} are both defined through the critical radius r=μknr=\mu\sqrt{\frac{k}{n}} for a small μ\mu, each C^i\widehat{C}_{i} should intersect with only one TiT_{i} (see Figure 6). We claim that there exists some permutation π\pi of the set [k][k], such that for C^i\widehat{C}_{i} defined in Algorithm 2,

In what follows, we first establish the result of Theorem 3 by assuming (66). The proof of (66) will be given in the end. Note that for any iji\neq j, Tπ(i)C^j=T_{\pi(i)}\cap\widehat{\mathcal{C}}_{j}=\varnothing, which is deduced from the fact that C^jTπ(j)\widehat{\mathcal{C}}_{j}\cap T_{\pi(j)}\neq\varnothing and the definition of C^j\widehat{\mathcal{C}}_{j}. Therefore, Tπ(i)C^jcT_{\pi(i)}\subset\widehat{\mathcal{C}}_{j}^{c} for all jij\neq i. Combining with the fact that Tπ(i)C^icC^icT_{\pi(i)}\cap\widehat{\mathcal{C}}_{i}^{c}\subset\widehat{\mathcal{C}}_{i}^{c}, we get Tπ(i)C^ic(i[k]C^i)cT_{\pi(i)}\cap\widehat{\mathcal{C}}_{i}^{c}\subset(\cup_{i\in[k]}\widehat{\mathcal{C}}_{i})^{c}. Therefore,

Since TiTj=T_{i}\cap T_{j}=\varnothing for iji\neq j, we deduce from (67) that

By definition, C^iC^j=\widehat{\mathcal{C}}_{i}\cap\widehat{\mathcal{C}}_{j}=\varnothing for iji\neq j, we deduce from (66) that

Since for any ui[k](C^iTπ(i))u\in\cup_{i\in[k]}(\widehat{\mathcal{C}}_{i}\cap T_{\pi(i)}), we have σ^(u)=i\widehat{\sigma}(u)=i when σ(u)=π(i)\sigma(u)=\pi(i), the mis-classification rate is bounded as

where the last inequality is from (70) and (64). This proves the desired conclusion.

Finally, we are going to establish the claim (66) to close the proof. We use mathematical induction. For i=1i=1, it is clear that C^1maxi[k]Ti|\widehat{\mathcal{C}}_{1}|\geq\max_{i\in[k]}|T_{i}| holds by the definition of C^1\widehat{\mathcal{C}}_{1}. Suppose C^1Ti=\widehat{\mathcal{C}}_{1}\cap T_{i}=\varnothing for all i[k]i\in[k], and then we must have

where the last inequality is by (65). This contradicts (64) under the assumption (20). Therefore, there must be a π(1)\pi(1) such that C^1Tπ(1)\widehat{\mathcal{C}}_{1}\cap T_{\pi(1)}\neq\varnothing and C^1Tπ(1)|\widehat{\mathcal{C}}_{1}|\geq|T_{\pi(1)}|. Moreover,

where the last inequality is because Tπ(1)T_{\pi(1)} is the only set in {Ti}i[k]\{T_{i}\}_{i\in[k]} that intersects C^1\widehat{\mathcal{C}}_{1} by the definitions. By (64), we get

Now suppose (66) and (71) are true for i=1,...,l1i=1,...,l-1. Because of the sizes of {C^i}i[l1]\{\widehat{\mathcal{C}}_{i}\}_{i\in[l-1]} and the fact that {Ti}i[k]\{T_{i}\}_{i\in[k]} are mutually exclusive, we have

Therefore, for the set SS in the current step, i[k]\i=1l1{π(i)}TiS\cup_{i\in[k]\backslash\cup_{i=1}^{l-1}\{\pi(i)\}}T_{i}\subset S. By the definition of C^l\widehat{\mathcal{C}}_{l}, we have C^lmaxi[k]\i=1l1{π(i)}Tin2βk|\widehat{\mathcal{C}}_{l}|\geq\max_{i\in[k]\backslash\cup_{i=1}^{l-1}\{\pi(i)\}}|T_{i}|\geq\frac{n}{2\beta k}. Suppose C^lTπ(i)\widehat{\mathcal{C}}_{l}\cap T_{\pi(i)}\neq\varnothing for some i=1,...,l1i=1,...,l-1. Then, this Tπ(i)T_{\pi(i)} is the only set in {Ti}i[k]\{T_{i}\}_{i\in[k]} that intersects C^l\widehat{\mathcal{C}}_{l} by their definitions. This implies that

Since C^lC^π(i)=\widehat{\mathcal{C}}_{l}\cap\widehat{\mathcal{C}}_{\pi(i)}=\varnothing, C^lTπ(i)C^icTπ(i)|\widehat{\mathcal{C}}_{l}\cap T_{\pi(i)}|\leq|\widehat{\mathcal{C}}_{i}^{c}\cap T_{\pi(i)}| is bounded by (71). Together with (64), we have

which contradicts C^ln2βk|\widehat{\mathcal{C}}_{l}|\geq\frac{n}{2\beta k} under the assumption (20). Therefore, we must have C^lTπ(i)=\widehat{\mathcal{C}}_{l}\cap T_{\pi(i)}=\varnothing for all i=1,...,l1i=1,...,l-1. Now suppose C^lTπ(i)=\widehat{\mathcal{C}}_{l}\cap T_{\pi(i)}=\varnothing for all i[k]i\in[k], we must have

which contradicts (64). Hence, C^lTπ(l)\widehat{C}_{l}\cap T_{\pi(l)}\neq\varnothing for some π(l)[k]\i=1l1{π(i)}\pi(l)\in[k]\backslash\cup_{i=1}^{l-1}\{\pi(i)\}, and (66) is established for i=li=l. Moreover, (71) can also be established for i=li=l by the same argument that is used to prove (71) for i=1i=1. The proof is complete. ∎

3 Proof of Theorem 4

Define Pτ=P+τn11TP_{\tau}=P+\frac{\tau}{n}\mathbf{1}\mathbf{1}^{T}. The proof of the following lemma is given in the appendix.

Consider a symmetric adjacency matrix A{0,1}n×nA\in\{0,1\}^{n\times n} and a symmetric matrix Pn×nP\in^{n\times n} satisfying Auu=0A_{uu}=0 for all u[n]u\in[n] and AuvBernoulli(Puv)A_{uv}\sim\text{Bernoulli}(P_{uv}) independently for all u>vu>v. For any C>0C^{\prime}>0, there exists some C>0C>0 such that

with probability at least 1nC1-n^{-C^{\prime}} uniformly over τ[C1(npmax+1),C2(npmax+1)]\tau\in[C_{1}(np_{\max}+1),C_{2}(np_{\max}+1)] for some sufficiently large constants C1,C2C_{1},C_{2}, where pmax=maxuvPuvp_{\max}=\max_{u\geq v}P_{uv}.

Consider P=(Puv)=(Bσ(u)σ(v))P=(P_{uv})=(B_{\sigma(u)\sigma(v)}). Let the SVD of the matrix L(Pτ)L(P_{\tau}) be L(Pτ)=UΣUTL(P_{\tau})=U\Sigma U^{T}, with UO(n,k)U\in O(n,k) and Σ=diag(σ1,...,σk)\Sigma=\text{diag}(\sigma_{1},...,\sigma_{k}). For V=UWV=UW with any WO(r,r)W\in O(r,r), we have VuVv=1nu+1nv\left\|{V_{u*}-V_{v*}}\right\|=\sqrt{\frac{1}{n_{u}}+\frac{1}{n_{v}}} when σ(u)σ(v)\sigma(u)\neq\sigma(v) and Vu=VvV_{u*}=V_{v*} when σ(u)=σ(v)\sigma(u)=\sigma(v). Moreover, σkλk2τ\sigma_{k}\geq\frac{\lambda_{k}}{2\tau} as long as τnpmax\tau\geq np_{\max}.

The first part is Lemma 1 in . Define dˉv=u[n]Puv\bar{d}_{v}=\sum_{u\in[n]}P_{uv} and Dˉτ=diag(dˉ1+τ,...,dˉn+τ)\bar{D}_{\tau}=\text{diag}(\bar{d}_{1}+\tau,...,\bar{d}_{n}+\tau). Then, we have L(Pτ)=Dˉτ1/2PτDˉτ1/2L(P_{\tau})=\bar{D}_{\tau}^{-1/2}P_{\tau}\bar{D}_{\tau}^{-1/2}. Note that PτP_{\tau} has an SBM structure so that it has rank at most kk, and the kthk{{}^{\rm th}} eigenvalue of PτP_{\tau} is lower bounded by λk\lambda_{k}. Thus, we have

Observe that maxu[n]dˉunpmaxτ\max_{u\in[n]}\bar{d}_{u}\leq np_{\max}\leq\tau, and the proof is complete. ∎

As is shown in the proof of Theorem 3, τ[C1a,C2a]\tau\in[C_{1}a,C_{2}a] for some large C1,C2C_{1},C_{2} with probability at least 1eCn1-e^{-C^{\prime}n}. By Davis–Kahan’s sin-theta theorem , we have U^UWFC1kσkL(Aτ)L(Pτ)op\|\widehat{U}-UW\|_{\rm F}\leq C_{1}\frac{\sqrt{k}}{\sigma_{k}}\|L(A_{\tau})-L(P_{\tau})\|_{\rm op} for some WO(r,r)W\in O(r,r) and some constant C1>0C_{1}>0. Let V=UWV=UW and apply Lemma 7 and Lemma 8, we have

with probability at least 1nC1-n^{-C^{\prime}}. Note that by Lemma 8, VV satisfies (62). Replace (61) by (72), and follow the remaining proof of Theorem 3, the proof is complete. ∎

References

Appendix A A simplified version of Algorithm 1

Appendix B Proofs of Theorem 5

Let us consider Θ=Θ0(n,k,a,b,β)\Theta=\Theta_{0}(n,k,a,b,\beta) and the case of Θ(n,k,a,b,λ,β;α)\Theta(n,k,a,b,\lambda,\beta;\alpha) is similar except that the condition (18) is needed to establish the counterpart of Lemma 3. The proof essentially follows the same steps as those in the proof of Theorem 2. First, we note that Lemma 1 continues to hold since it does not need the assumption of a/ba/b being bounded. Thus, the first job is to establish the counterpart of Lemma 2 with η\eta^{\prime} replaced with Cϵ2ϵ03C_{\epsilon}\frac{2\epsilon_{0}}{3}. As before, let p=a/np=a/n and q=b/nq=b/n.

To this end, we first proceed in the same way to obtain (42) – (52). Without loss of generality, let us consider the case where t>log2ϵ0t^{*}>\log\frac{2}{\epsilon_{0}} and tu=log2ϵ0t_{u}=\log\frac{2}{\epsilon_{0}} since otherwise we can essentially repeat the proof of Theorem 2. Note that this implies ab>(2ϵ0)2\frac{a}{b}>(\frac{2}{\epsilon_{0}})^{2}. In this case, with the new tut_{u} in (22), we have on the event EuE_{u},

To see this, we first note that for any x,y(0,1)x,y\in(0,1) and sufficient small constant c0>0c_{0}>0, if yx(1c0)yy\geq x\geq(1-c_{0})y and yx1y1\frac{y-x}{1-y}\leq 1, then

where Cy=2y(1y)log(1y)C^{\prime}_{y}=\frac{2y}{-(1-y)\log(1-y)}. When ab>(2ϵ0)2\frac{a}{b}>(\frac{2}{\epsilon_{0}})^{2} and tu=log2ϵ0t_{u}=\log\frac{2}{\epsilon_{0}}, we have I=log(1x)I^{\prime}=-\log(1-x) for

Thus, for any ϵ0(0,cϵ)\epsilon_{0}\in(0,c_{\epsilon}), 1ϵ2yx(12ϵ0)y1-\frac{\epsilon}{2}\geq y\geq x\geq(1-2\epsilon_{0})y and yx1y1\frac{y-x}{1-y}\leq 1, and we apply the inequality in the third last display to obtain (73).

Thus, the term in (49) is upper bounded by

On the other hand, since etu11|e^{-t_{u}}-1|\leq 1, etu1|e^{t_{u}}-1| is bounded and pqp1\frac{p-q}{p}\asymp 1, the term in (49) continues to be bounded by

Moreover, by the same argument as in Lemma 2, (51) continues to hold. Thus, we can replace (52) as

and when k=2k=2, we can replace β\beta by 11 in the last display.

When k3k\geq 3, given the last display and (58), we have

Thus, the assumption that (ab)2aklogk\frac{(a-b)^{2}}{ak\log k}\rightarrow\infty and Markov’s inequality leads to

If (k1)exp{(1Cϵ5ϵ06)nIβk}n(1+δ/2)(k-1)\exp\left\{-(1-C_{\epsilon}\frac{5\epsilon_{0}}{6})\frac{nI^{*}}{\beta k}\right\}\geq n^{-(1+\delta/2)}, then

If (k1)exp{(1Cϵ5ϵ06)nIβk}<n(1+δ/2)(k-1)\exp\left\{-(1-C_{\epsilon}\frac{5\epsilon_{0}}{6})\frac{nI^{*}}{\beta k}\right\}<n^{-(1+\delta/2)}, then

Here, the second last inequality holds since (k1)exp{(1Cϵ2ϵ03)nIβk}<exp{(1Cϵ5ϵ06)nIβk}<n(1+δ/2)(k-1)\exp\left\{-(1-C_{\epsilon}\frac{2\epsilon_{0}}{3})\frac{nI^{*}}{\beta k}\right\}<\exp\left\{-(1-C_{\epsilon}\frac{5\epsilon_{0}}{6})\frac{nI^{*}}{\beta k}\right\}<n^{-(1+\delta/2)}. We complete the proof for the case of k3k\geq 3 by noting that no constant or sequence in the foregoing arguments involves B,σB,\sigma or uu. When k=2k=2, we run the foregoing arguments with β\beta replaced by 11 to obtain the desired claim. ∎

Appendix C Proofs of Theorems 6, 7 and 8

For SBM in the space Θ0(n,k,a,b,β)\Theta_{0}(n,k,a,b,\beta) satisfying n2βkn\geq 2\beta k, we have λkabβk\lambda_{k}\geq\frac{a-b}{\beta k}.

where v1=(1n1T,0n2T,...,0nkT)Tv_{1}=(\mathbf{1}_{n_{1}}^{T},\mathbf{0}_{n_{2}}^{T},...,\mathbf{0}_{n_{k}}^{T})^{T}, v1=(0n1T,1n2T,0n3T,...,0nkT)Tv_{1}=(\mathbf{0}_{n_{1}}^{T},\mathbf{1}_{n_{2}}^{T},\mathbf{0}_{n_{3}}^{T},...,\mathbf{0}_{n_{k}}^{T})^{T},…, vk=(0n1T,...,0nk1T,1nkT)Tv_{k}=(\mathbf{0}_{n_{1}}^{T},...,\mathbf{0}_{n_{k-1}}^{T},\mathbf{1}_{n_{k}}^{T})^{T}. Note that {vi}i=1k\{v_{i}\}_{i=1}^{k} are orthogonal to each other, and therefore

By Weyl’s inequality (Theorem 4.3.1 of ),

Let us first consider Θ0(n,k,a,b,β)\Theta_{0}(n,k,a,b,\beta). By Theorem 3 and Proposition 1, the misclassification proportion is bounded by Ck2a(ab)2C\frac{k^{2}a}{(a-b)^{2}} under the condition k3a(ab)2c\frac{k^{3}a}{(a-b)^{2}}\leq c for some small cc. Thus, Condition 1 holds when k3a(ab)2=o(1)\frac{k^{3}a}{(a-b)^{2}}=o(1), which leads to the desired conclusion in view of Theorem 2 and Theorem 5. The proof of the space Θ(n,k,a,b,λ,β;α)\Theta(n,k,a,b,\lambda,\beta;\alpha) follows the same argument. ∎

The proof is the same as that of Theorem 6. ∎

When the parameters aa and bb are known, we can use τ=Ca\tau=Ca for some sufficiently large C>0C>0 for both USC(τ)(\tau) and NSC(τ)(\tau). Then, the results of Theorem 3 and Theorem 4 hold without assuming aC1ba\leq C_{1}b or fixed kk. Moreover, a^u\widehat{a}_{u} and b^u\widehat{b}_{u} in (11) and (22) can be replaced by aa and bb. Then, the conditions (16) and (18) in Theorem 2 and Theorem 5 can be weakened as γ=o(k1)\gamma=o(k^{-1}) because the we do not need to establish Lemma 1 anymore. Combining Theorem 2, Theorem 3, Theorem 4 and Theorem 5, we obtain the desired results. ∎

Appendix D Proofs of Lemma 5 and Lemma 7

The following lemma is Corollary A.1.10 in .

For independent Bernoulli random variables XuBern(pu)X_{u}\sim\text{Bern}(p_{u}) and p=1nu[n]pup=\frac{1}{n}\sum_{u\in[n]}p_{u}, we have

Consider any adjacency matrix A{0,1}n×nA\in\{0,1\}^{n\times n} for an undirected graph. Suppose maxu[n]v[n]Auvγ\max_{u\in[n]}\sum_{v\in[n]}A_{uv}\leq\gamma and for any S,T[n]S,T\subset[n], one of the following statements holds with some constant C>0C>0:

e(S,T)STγnC\frac{e(S,T)}{|S||T|\frac{\gamma}{n}}\leq C,

e(S,T)log(e(S,T)STγn)CTlognTe(S,T)\log\left(\frac{e(S,T)}{|S||T|\frac{\gamma}{n}}\right)\leq C|T|\log\frac{n}{|T|},

where e(S,T)e(S,T) is the number of edges connecting SS and TT. Then, (u,v)HxuAuvyvCγ\sum_{(u,v)\in H}x_{u}A_{uv}y_{v}\leq C^{\prime}\sqrt{\gamma} uniformly over all unit vectors x,yx,y, where H={(u,v):xuyvγ/n}H=\{(u,v):|x_{u}y_{v}|\geq\sqrt{\gamma}/n\} and C>0C^{\prime}>0 is some constant.

The following lemma is critical for proving both theorems.

For any τ>C(1+npmax)\tau>C(1+np_{\max}) with some sufficiently large C>0C>0, we have

with probability at least 1eCn1-e^{-C^{\prime}n} for some constant C>0C^{\prime}>0.

Applying union bound, the probability that the number of nodes with degree at least τ\tau is greater than ξn\xi n is

where the last inequality is by choosing ξ=τ1\xi=\tau^{-1}. Therefore, with probability at least 1eCn1-e^{-C^{\prime}n}, the number of nodes with degree at least τ\tau is bounded by τ1n\tau^{-1}n. ∎

Given τ>0\tau>0, define the subset J={u[n]:duτ}J=\{u\in[n]:d_{u}\leq\tau\}. Then for any C>0C^{\prime}>0, there is some C>0C>0 such that

with probability at least 1nC1-n^{-C^{\prime}}.

The idea of the proof follows the argument in . By definition,

Define L={(u,v):xuyv(τ+pmaxn)/n}L=\{(u,v):|x_{u}y_{v}|\leq(\sqrt{\tau}+\sqrt{p_{\max}n})/n\} and H={(u,v):xuyv(τ+pmaxn)/n}H=\{(u,v):|x_{u}y_{v}|\geq(\sqrt{\tau}+\sqrt{p_{\max}n})/n\}, then we have

A discretization argument in implies that

To bound the second part supx,ySn1(u,v)HJ×Jxu(AuvPuv)yv\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J}x_{u}(A_{uv}-P_{uv})y_{v}, we are going to bound supx,ySn1(u,v)HJ×JxuAuvyv\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J}x_{u}A_{uv}y_{v} and supx,ySn1(u,v)HJ×JxuPuvyv\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J}x_{u}P_{uv}y_{v} separately. By the definition of HH,

To bound supx,ySn1(u,v)HJ×JxuAuvyv\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J}x_{u}A_{uv}y_{v}, it is sufficient to check the conditions of Lemma 10 for the graph AJJA_{JJ}. By definition, its degree is bounded by τ\tau. Following the argument of , the two conditions of Lemma 10 hold with γ=τ+npmax\gamma=\tau+np_{\max} with probability at least 1nC1-n^{-C^{\prime}}. Thus, supx,ySn1(u,v)HJ×JxuAuvyvC(τ+npmax)\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J}x_{u}A_{uv}y_{v}\leq C(\sqrt{\tau}+\sqrt{np_{\max}}) with probability at least 1nC1-n^{-C^{\prime}}. Hence, the proof is complete. ∎

where Tτ(P)T_{\tau}(P) is the matrix obtained by zeroing out the uthu{{}^{\rm th}} row and column of PP with duτd_{u}\geq\tau. Let J={u[n]:duτ}J=\{u\in[n]:d_{u}\leq\tau\}, and then Tτ(A)Tτ(P)op=AJJPJJop\|T_{\tau}(A)-T_{\tau}(P)\|_{\rm op}=\|A_{JJ}-P_{JJ}\|_{\rm op}, whose bound has been established in Lemma 12. By Lemma 11, Jcn/τ|J^{c}|\leq n/\tau with high probability. This implies Tτ(P)PopTτ(P)PF2nJcpmax22npmaxτ\|T_{\tau}(P)-P\|_{\rm op}\leq\|T_{\tau}(P)-P\|_{\rm F}\leq\sqrt{2n|J^{c}|p_{\max}^{2}}\leq\frac{\sqrt{2}np_{\max}}{\sqrt{\tau}}. Taking τ[C1(1+npmax),C2(1+npmax)]\tau\in[C_{1}(1+np_{\max}),C_{2}(1+np_{\max})], the proof is complete. ∎

Now let us prove Lemma 7. The following lemma, which controls the degree, is Lemma 7.1 in .

For any C>0C^{\prime}>0, there exists some C>0C>0 such that with probability at least 1nC1-n^{-C^{\prime}}, there exists a subset J[n]J\subset[n] satisfying nJn2e(npmax+1)n-|J|\leq\frac{n}{2e(np_{\max}+1)} and

Using this lemma, together with Lemma 11 and Lemma 12, we are able to prove the following result, which improves the bound in Theorem 7.2 of .

For any C>0C^{\prime}>0, there exists some C>0C>0 such that with probability at least 1nC1-n^{-C^{\prime}}, there exists a subset J[n]J\subset[n] satisfying nJn/dn-|J|\leq n/d and

Let us use the notation dv=u[n]Auvd_{v}=\sum_{u\in[n]}A_{uv} in the proof. Define the set J1={v[n]:dvC1d}J_{1}=\left\{v\in[n]:d_{v}\leq C_{1}d\right\} for some sufficiently large constant C1>0C_{1}>0. Using Lemma 11 and Lemma 12, with probability at least 1nC1-n^{-C^{\prime}}, we have

Let J2J_{2} be the subset in Lemma 13, and then with probability at least 1nC1-n^{-C^{\prime}}, J2J_{2} satisfies

Define J=J1J2J=J_{1}\cap J_{2}. By (79) and (81), we have

Define dˉv=u[n]Puv\bar{d}_{v}=\sum_{u\in[n]}P_{uv}. Then,

Define Dτ=diag(d1+τ,...,dn+τ)D_{\tau}=\text{diag}(d_{1}+\tau,...,d_{n}+\tau) and Dˉτ=diag(dˉ1+τ,...,dˉn+τ)\bar{D}_{\tau}=\text{diag}(\bar{d}_{1}+\tau,...,\bar{d}_{n}+\tau). We introduce the notation

Recall that d=npmax+1d=np_{\max}+1. Following the proof of Theorem 8.4 in , it can be shown that with probability at least 1nC1-n^{-C^{\prime}}, for any J[n]J\subset[n] such that nJn/dn-|J|\leq n/d,

where the first term on the right side of the inequality above is bounded in Lemma 14 by choosing an appropriate JJ. Hence, with probability at least 12nC1-2n^{-C^{\prime}},

Choosing τ[C1(1+npmax),C2(1+npmax)]\tau\in[C_{1}(1+np_{\max}),C_{2}(1+np_{\max})], the proof is complete. ∎