Recovering communities in the general stochastic block model without knowing the parameters

Emmanuel Abbe, Colin Sandon

Introduction

This paper studies the problem of recovering communities in the general stochastic block model with linear size communities, for constant and slowly diverging degree regimes. In contrast to [AS15], this paper does not require knowledge of the SBM parameters. In particular, the problem of learning the model parameters is solved when average degrees are diverging. We next provide some motivations on the problem and further background on the model.

Detecting communities (or clusters) in graphs is a fundamental problem in networks, computer science and machine learning. This applies to a large variety of complex networks (e.g., social and biological networks) as well as to data sets engineered as networks via similarly graphs, where one often attempts to get a first impression on the data by trying to identify groups with similar behavior. In particular, finding communities allows one to find like-minded people in social networks [GN02, NWS], to improve recommendation systems [LSY03, XWZ+14], to segment or classify images [SM97, SHB07], to detect protein complexes [CY06, MPN+99], to find genetically related sub-populations [PSD00, JTZ04], or discover new tumor subclasses [SPT+01].

While a large variety of community detection algorithms have been deployed in the past decades, the understanding of the fundamental limits of community detection has only appeared more recently, in particular for the SBM [Co10, DKMZ11, Mas14, MNS14, ABH14, MNSb, YC14, AS15]. The SBM is a canonical model for community detection [HLL83, WBB76, FMW85, WW87, BC09, KN11, BCLS87, DF89, Bop87, JS98, CK99, CI01, SN97, McS01, BC09, RCY11, CWA12, CSX12], where $n$ vertices are partitioned into $k$ communities of relative size $p_{i}$ , $i\in[k]$ , and pairs of nodes in communities $i$ and $j$ connect independently with probability $W_{i,j}$ .

Recently the SBM came back to the center of the attention at both the practical level, due to extensions allowing overlapping communities [ABFX08] that have proved to fit well real data sets in massive networks [GB13], and at the theoretical level due to new phase transition phenomena [Co10, DKMZ11, Mas14, MNS14, ABH14, MNSb]. The latter works focus exclusively on the SBM with two symmetric communities, i.e., each community is of the same size and the connectivity in each community is identical. Denoting by $p$ the intra- and $q$ the extra-cluster probabilities, most of the results are concerned with two figure of merits: (i) recovery (also called exact recovery or strong consistency), which investigates the regimes of $p$ and $q$ for which there exists an algorithm that recovers with high probability the two communities completely [BCLS87, DF89, Bop87, JS98, CK99, CI01, SN97, McS01, BC09, RCY11, CWA12, CSX12, Vu14, YC14], (ii) detection, which investigates the regimes for which there exists an algorithm that recovers with high probability a positively correlated partition [Co10, DKMZ11, MNS12, Mas14, MNS14].

The sharp threshold for exact recovery was obtained in [ABH14, MNSb], showing[MNSb] generalizes this to $a,b=\Theta(1)$ . that for $p=a\log(n)/n$ , $q=b\log(n)/n$ , $a,b>0$ , exact recovery is solvable if and only if $\sqrt{a}-\sqrt{b}\geq 2$ , with efficient algorithms achieving the threshold provided in [ABH14, MNSb]. In addition, [ABH14] introduces an SDP proved to achieve the threshold in [BH14, Ban15], while [YP14] shows that a spectral algorithm also achieves the threshold. Prior to these, the sharp threshold for detection was obtained in [Mas14, MNS14], showing that detection is solvable (and so efficiently) if and only if $(a-b)^{2}>2(a+b)$ , when $p=a/n$ , $q=b/n$ , settling a conjecture made in [DKMZ11] and improving on [Co10].

Besides the detection and the recovery properties, one may ask about the partial recovery of the communities, studied in [MNSa, GV14, Vu14, CRV15, AS15]. Of particular interest to this paper is the case of almost exact recovery (also called weak consistency), where only a vanishing fraction of the nodes is allowed to be misclassified. For two-symmetric communities, [MNSb] shows that almost exact recovery is possible if and only if $n(p-q)^{2}/(p+q)$ diverges, generalized in [AS15] for general SBMs.

In the next section, we discuss the results for the general SBM of interest in this paper and the problem of learning the model parameters. We conclude this section by providing motivations on the problem of achieving the threshold with an efficient and universal algorithm.

Threshold phenomena have long been studied in fields such as information theory (e.g., Shannon’s capacity) and constraint satisfaction problems (e.g., the SAT threshold). In particular, the quest of achieving the threshold has generated major algorithmic developments in these fields (e.g., LDPC codes, polar codes, survey propagation to name a few). Likewise, identifying thresholds in community detection models is key to benchmark and guide the development of clustering algorithms. Most reasonable algorithms may succeed in some regimes, while in others they may be doomed to fail due to computational barriers. However, it is particularly crucial to develop benchmarks that do not depend sensitively on the knowledge of the model parameters. A natural question is hence whether one can solve the various recovery problems in the SBM without having access to the parameters. This paper answers this question by the affirmative for the exact and almost exact recovery of the communities.

Most of the previous works are concerned with the SBM having symmetric communities (mainly 2 or sometimes $k$ ), with the exception of [Vu14] which provides some achievability results for the general SBM.[GV14] also study variations of the $k$ -symmetric model. Recently, [AS15] studied the fundamental limits for the general SBM, with results as follows (where SBM $(n,p,W)$ is the SBM with community prior $p$ and connectivity matrix $W$ ).

The algorithm Sphere-comparison is proposed that solves partial recovery with exponential accuracy and quasi-linear complexity when the SNR diverges, solving in particular almost exact recovery.

Note that for $k$ symmetric clusters, SNR reduces to $\frac{(a-b)^{2}}{k(a+(k-1)b)}$ , which is the quantity of interest for detection [DKMZ11, MNS12]. Moreover, the SNR must diverge to ensure almost exact recovery in the symmetric case [AS15]. The following is an important consequence of the previous theorem, as it shows that Sphere-comparison achieves almost exact recovery when the entries of $Q$ are scaled.

II. Exact recovery in the general SBM. The second result in [AS15] is for the regime where the connectivity matrix scales as $\log(n)Q/n$ , $Q$ fixed, where it is shown that exact recovery has a sharp threshold characterized by the divergence function

named the CH-divergence in [AS15]. Specifically, if all pairs of columns in $\operatorname{diag}(p)Q$ are at $D_{+}$ -distance at least 1 from each other, then exact recovery is solvable in the general SBM. This provides in particular an operational meaning to a new divergence function analog to the KL-divergence in the channel coding theorem (see Section 2.3 in [AS15]). Moreover, an algorithm (Degree-profiling) is developed that solves exact recovery down to the $D_{+}$ limit in quasi-linear time, showing that exact recovery has no informational to computational gap (as opposed to the conjectures made for detection with more than 4 communities [DKMZ11]). The following gives a more general statement characterizing which subset of communities can be extracted — see Definition 3 for formal definitions.

In particular, exact recovery is information-theoretically solvable in SBM $(n,p,Q\log(n)/n)$ if and only if $\min_{i,j\in[k],i\neq j}D_{+}((PQ)_{i}||(PQ)_{j})\geq 1$ . (ii) The Degree-profiling algorithm (see [AS15]) recovers the finest partition that can be recovered with probability $1-o_{n}(1)$ and runs in $o(n^{1+\epsilon})$ time for all $\epsilon>0$ . In particular, exact recovery is efficiently solvable whenever it is information-theoretically solvable.

In summary, exact or almost exact recovery is closed for the general SBM (and detection is closed for 2 symmetric communities). However this is for the case where the parameters of the SBM are assumed to be known, and with linear-size communities.

2 Estimating the parameters

For the estimation of the parameter, some results are known for two-symmetric communities. In the logarithmic degree regime, since the SDP is agnostic to the parameters (it is a relaxation of the min-bisection), and the parameters can be estimated by recovering the communities [ABH14, BH14, Ban15]. For the constant-degree regime, [MNS12] shows that the parameters can be estimated above the threshold by counting cycles (which is efficiently approximated by counting non-backtracking walks). These are however for a fixed number of communities, namely 2. We also became recently aware of a parallel work [BCS15], which considers private graphon estimation (including SBMs). In particular, for the logarithmic degree regime, [BCS15] obtains a procedure to estimate parameters of graphons in an appropriate version of the $L_{2}$ norm. This procedure is however not efficient.

For the general SBM, the results of [AS15] allow to find communities efficiently, however these rely on the knowledge of the parameters. Hence, a major open problem is to understand if these results can be extended without such a knowledge.

Results

Agnostic algorithms are developed for the constant and diverging node degrees. These afford optimal accuracy scaling for large node degrees and achieve the CH-divergence limit for logarithmic node degrees in quasi-linear time. In particular, these solve the parameter estimation problems for SBM $(n,p,\omega(1)Q)$ without knowing the number of communities. An example with real data is provided in Section 4.

The general stochastic block model $\text{SBM}(n,p,W)$ is a random graph ensemble defined on the vertex-set $V=[n]$ , where each vertex $v\in V$ is assigned independently a hidden (or planted) label $\sigma_{v}$ in $[k]$ under a probability distribution $p=(p_{1},\dots,p_{k})$ on $[k]$ , and each (unordered) pair of nodes $(u,v)\in V\times V$ is connected independently with probability $W_{\sigma_{u},\sigma_{v}}$ , where $W_{\sigma_{u},\sigma_{v}}$ is specified by a symmetric $k\times k$ matrix $W$ with entries in $ $. Note that$ G\sim\text{SBM}(n,p,W) $denotes a random graph drawn under this model, without the hidden (or planted) clusters (i.e., the labels$ \sigma_{v}$ ) revealed. The goal is to recover these labels by observing only the graph.

This paper focuses on $p$ independent of $n$ (the communities have linear size), $W$ dependent on $n$ such that the average node degrees are either constant or logarithmically growing, and $k$ fixed. These assumptions on $p$ and $k$ could be relaxed, for example to slowly growing $k$ , but we leave this for future work. As discussed in the introduction, the above regimes for $W$ are both motivated by applications, as networks are typically sparse [LLDM08, Str01] though the average degrees may not be small, and by the fact that interesting mathematical phenomena take place in these regimes. For convenience, we attribute specific notations for the model in these regimes:

(Partial recovery.) An algorithm recovers or detects communities in $\text{SBM}(n,p,W)$ with an accuracy of $\alpha\in$ , if it outputs a labelling of the nodes $\{\sigma^{\prime}(v),v\in V\}$ , which agrees with the true labelling $\sigma$ on a fraction $\alpha$ of the nodes with probability $1-o_{n}(1)$ . The agreement is maximized over relabellings of the communities.

(Exact recovery.) Exact recovery is solvable in $\text{SBM}(n,p,W)$ for a community partition $[k]=\sqcup_{s=1}^{t}A_{s}$ , where $A_{s}$ is a subset of $[k]$ , if there exists an algorithm that takes $G\sim SBM(n,p,W)$ and assigns to each node in $G$ an element of $\{A_{1},\dots,A_{t}\}$ that contains its true communityThis is again up to relabellings of the communities. with probability $1-o_{n}(1)$ . Exact recovery is solvable in $\text{SBM}(n,p,W)$ if it is solvable for the partition of $[k]$ into $k$ singletons, i.e., all communities can be recovered.

The problem is solvable information-theoretically if there exists an algorithm that solves it, and efficiently if the algorithm runs in polynomial-time in $n$ . Note that exact recovery for the partition $[k]=\{i\}\sqcup([k]\setminus\{i\})$ is equivalent to extracting community $i$ . In general, recovering a partition $[k]=\sqcup_{s=1}^{t}A_{s}$ is equivalent to merging the communities that are in a common subset $A_{s}$ and recovering the merged communities. Note also that exact recovery in $\text{SBM}(n,p,W)$ requires the graph not to have vertices of degree in multiple communities with high probability (i.e., connectivity in the symmetric case). Therefore, for exact recovery, we focus below on $W=\frac{\ln(n)}{n}Q$ where $Q$ is fixed.

2 Partial recovery

Our main result in the Appendix (Theorme 6) applies to SBM $(n,p,Q/n)$ with arbitrary $Q$ . We provided here a specific instance which is easier to parse.

Note that a vertex in community $i$ has degree with probability exponential in $c$ , and there is no way to differentiate between vertices of degree from different communities. So, an error rate that decreases exponentially with $c$ is optimal. The above gives in particular the parameter estimation in the case $c=\omega(1)$ (see also Lemma 17 in the Appendix).

The general result in the Appendix yields the following refined results in the $k$ -block symmetric case.

Consider the $k$ -block symmetric case. In other words, $p_{i}=\frac{1}{k}$ for all $i$ , and $Q_{i,j}$ is $\alpha$ if $i=j$ and $\beta$ otherwise. The vector whose entries are all $1$ s is an eigenvector of $PQ$ with eigenvalue $\frac{\alpha+(k-1)\beta}{k}$ , and every vector whose entries add up to is an eigenvector of $PQ$ with eigenvalue $\frac{\alpha-\beta}{k}$ . So, $\lambda=\frac{\alpha+(k-1)\beta}{k}$ and $\lambda^{\prime}=\frac{\alpha-\beta}{k}$ and $\frac{(\lambda^{\prime})^{2}}{\lambda}=\frac{(a-b)^{2}}{k(a+(k-1)\beta)}.$ Then, as long as $\frac{160}{9}k(\alpha+(k-1)\beta)^{7}<(\alpha-\beta)^{8}$ and $4k(\alpha+(k-1)\beta)^{3}<(\alpha-\beta)^{4}$ , there exist a constant $c>0$ such that Agnostic-sphere-comparison( $G,1/k$ ) detects communities with an accuracy of $1-O(e^{-c(\alpha-\beta)^{2}/(\alpha+(k-1)\beta)})$ for sufficiently large $(\alpha-\beta)^{2}/(\alpha+(k-1)\beta)$ .

We refer to Section 4 for an example of implementation with real data.

3 Exact recovery

We next show that this can be achieved without knowing the parameters. Recall that the finest partition is the largest partition of $[k]$ that ensure (19).

The proof assumes that the entries of $Q$ are non-zero, see Remark 1 for zero entries. To achieve this result we rely on a two step procedure. First an algorithm is developed to recover all but a vanishing fraction of nodes — this is the main focus of our partial recovery result — and then a procedure is used to “clean up” the leftover graphs using the node degrees of the preliminary classification. This turns out to be much more efficient than aiming for an algorithm that directly achieves exact recovery. We already used this technique in [AS15], but here we also deal with the difficulties resulting from not knowing the SBM’s parameters.

Proof Techniques and Algorithms

For any vertex $v$ , let $N_{r}(v)$ be the set of all vertices with shortest path to $v$ of length $r$ . We also refer to the vector with $i$ -th entry equal to the number of vertices in $N_{r}(v)$ that are in community $i$ as $N_{r}(v)$ . If there are multiple graphs that $v$ could be considered a vertex in, let $N_{r[G]}(v)$ be the set of all vertices with shortest paths in $G$ to $v$ of length $r$ .

One could probably determine $PQ$ and $e_{\sigma}$ given the values of $(PQ)^{r}e_{\sigma_{v}}$ for a few different $r$ , but using $N_{r}(v)$ to approximate that would require knowing how many of the vertices in $N_{r}(v)$ are in each community. So, we attempt to get information relating to how many vertices in $N_{r}(v)$ are in each community by checking how it connects to $N_{r^{\prime}}(v^{\prime})$ for some vertex $v^{\prime}$ and integer $r^{\prime}$ . The obvious way to do this would be to compute the cardinality of their intersection. Unfortunately, whether a given vertex in community $i$ is in $N_{r}(v)$ is not independent of whether it is in $N_{r^{\prime}}(v^{\prime})$ , which causes the cardinality of $|N_{r}(v)\cap N_{r^{\prime}}(v^{\prime})|$ to differ from what one would expect badly enough to disrupt plans to use it for approximations.

In order to get around this, we randomly assign every edge in $G$ to a set $E$ with probability $c$ . We hence define the following.

Note that $E$ and $G\backslash E$ are disjoint; however, $G$ is sparse enough that even if they were generated independently a given pair of vertices would have an edge between them in both with probability $O(\frac{1}{n^{2}})$ . So, $E$ is approximately independent of $G\backslash E$ . Thus, for any $v_{1}\in N_{r[G/E]}(v)$ and $v_{2}\in N_{r^{\prime}[G/E]}(v^{\prime})$ , $(v_{1},v_{2})\in E$ with a probability of approximately $cQ_{\sigma_{v_{1}},\sigma_{v_{2}}}/n$ . As a result,

Let $\lambda_{1},...,\lambda_{h}$ be the distinct eigenvalues of $PQ$ , ordered so that $|\lambda_{1}|\geq|\lambda_{2}|\geq...\geq|\lambda_{h}|\geq 0$ . Also define $h^{\prime}$ so that $h^{\prime}=h$ if $\lambda_{h}\neq 0$ and $h^{\prime}=h-1$ if $\lambda_{h}=0$ . If $W_{i}$ is the eigenspace of $PQ$ corresponding to the eigenvalue $\lambda_{i}$ , and $P_{W_{i}}$ is the projection operator on to $W_{i}$ , then

where the final equality holds because for all $i\neq j$ ,

and since $\lambda_{i}\neq\lambda_{j}$ , this implies that $P_{W_{i}}(e_{\sigma_{v}})\cdot P^{-1}P_{W_{j}}(e_{\sigma_{v^{\prime}}})=0$ . In order to simplify the terminology,

Let $\zeta_{i}(v\cdot v^{\prime})=P_{W_{i}}(e_{\sigma_{v}})\cdot P^{-1}P_{W_{i}}(e_{\sigma_{v^{\prime}}})$ for all $i$ , $v$ , and $v^{\prime}$ .

Equation (14) is dominated by the $\lambda_{1}^{r+r^{\prime}+1}$ term, so getting good estimate of the $\lambda_{2}^{r+r^{\prime}+1}$ through $\lambda_{h^{\prime}}^{r+r^{\prime}+1}$ terms requires cancelling it out somehow. As a start, if $\lambda_{1}>\lambda_{2}>\lambda_{3}$ then

Note that the left hand side of this expression is equal to $\det{}\begin{vmatrix}N_{r,r^{\prime}[E]}(v\cdot v^{\prime})&N_{r+1,r^{\prime}[E]}(v\cdot v^{\prime})\\ N_{r+1,r^{\prime}[E]}(v\cdot v^{\prime})&N_{r+2,r^{\prime}[E]}(v\cdot v^{\prime})\end{vmatrix}$ . More generally, in order to get an expression that can be used to estimate the $\lambda_{i}$ and $\zeta_{i}(v\cdot v^{\prime})$ , we consider the determinant of the following.

Let $M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})$ be the $m\times m$ matrix such that $M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})_{i,j}=N_{r+i+j,r^{\prime}[E]}(v\cdot v^{\prime})$ for each $i$ and $j$ .

To the degree that approximation 8 holds and $c$ is small, each column of $M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})$ is a linear combination of the vectors

with coefficients that depend only on $\{\lambda_{1},...,\lambda_{h}\}$ . So, by linearity of the determinant in one column, $\det(M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime}))$ is a linear combination of these vector’s wedge products with coefficients that are independent of $r$ and $r^{\prime}$ . By antisymmetry of wedge products, only the products that use $m$ different such vectors contribute to the determinant, and the products involving the eigenvalues of highest magnitude will dominate. As a result, there exist constants $\gamma(\lambda_{1},...,\lambda_{m})$ and $\gamma^{\prime}(\lambda_{1},...,\lambda_{m})$ such that

if $|\lambda_{m}|=|\lambda_{m+1}|$ . These facts suggest the following plan for estimating the eigenvalues corresponding to a graph. First, pick several vertices at random. Then, use the fact that $|N_{r[G\backslash E]}(v)|\approx((1-c)\lambda_{1})^{r}$ for any good vertex $v$ to estimate $\lambda_{1}$ . Next, use the formulas above about $\det(M_{m,r,r^{\prime}[E]}(v\cdot v))$ to get an approximation of $h^{\prime}$ and all of $PQ$ ’s eigenvalues for each selected vertex. Finally, take the median of these estimates.

Now, note that whether or not $|\lambda_{m}|=|\lambda_{m+1}|$ , we have

This fact can be used in combination with estimates of $PQ$ ’s eigenvalues to approximate $\zeta_{i}(v\cdot v^{\prime})$ for arbitrary $v$ , $v^{\prime}$ , and $i$ .

Of course, this requires $r$ and $r^{\prime}$ to be large enough that

is large relative to the error terms for all $i\leq h^{\prime}$ . At a minimum, that requires that $|(1-c)\lambda_{i}|^{r+r^{\prime}+1}=\omega(n)$ .

On a different note, for any $v$ and $v^{\prime}$ ,

with equality for all $i$ if and only if $\sigma_{v}=\sigma_{v^{\prime}}$ , so sufficiently good approximations of $\zeta_{i}(v\cdot v),\zeta_{i}(v\cdot v^{\prime})$ and $\zeta_{i}(v^{\prime}\cdot v^{\prime})$ can be used to determine which pairs of vertices are in the same community.

One could generate a reasonable classification based solely on this method of comparing vertices. However, that would require computing $N_{r,r^{\prime}[E]}(v\cdot v)$ for every vertex in the graph with fairly large $r+r^{\prime}$ , which would be slow. Instead, we use the fact that for any vertices $v$ , $v^{\prime}$ , and $v^{\prime\prime}$ with $\sigma_{v}=\sigma_{v^{\prime}}\neq\sigma_{v^{\prime\prime}}$ ,

for all $i$ , and the inequality is strict for at least one $i$ . So, subtracting $\zeta_{i}(v\cdot v)$ from both sides gives us that

for all $i$ , and the inequality is still strict for at least one $i$ .

So, given a representative vertex in each community, we can determine which of them a given vertex, $v$ , is in the same community as without needing to know the value of $\zeta_{i}(v\cdot v)$ .

This runs fairly quickly if $\zeta_{i}(v\cdot v^{\prime})$ is approximated using $N_{r,r^{\prime}[E]}(v^{\prime}\cdot v)$ such that $r$ is large and $r^{\prime}$ is small because the algorithm only requires focusing on $|N_{r^{\prime}}(v)|$ vertices. This leads to the following plan for partial recovery. First, randomly select a set of vertices that is large enough to contain at least one vertex from each community with high probability. Next, compare all of the selected vertices in an attempt to determine which of them are in the same communities. Then, pick one in each community. After that, use the algorithm referred to above to attempt to determine which community each of the remaining vertices is in. As long as there actually was at least one vertex from each community in the initial set and none of the approximations were particularly bad, this should give a reasonably accurate classification.

The risk that this randomly gives a bad classification due to a bad set of initial vertices can be mitigated as follows. First, repeat the previous classification procedure several times. Assuming that the procedure gives a good classification more often than not, the good classifications should comprise a set that contains more than half the classifications and which has fairly little difference between any two elements of the set. Furthermore, any such set would have to contain at least one good classification, so none of its elements could be too bad. So, find such a set and average its classifications together. This completes the Agnostic-Sphere-comparison-algorithm. We refer to Section 6 for a detailed version.

2 Exact recovery and the Agnostic-degree-profiling algorithm

This is proved in [AS15]. In the case of unknown parameters, the algorithmic approach is largely unchanged, adding a step where the best known classification is used to estimate $P$ and $Q$ prior to any step in which vertices are classified based on their neighbors. The analysis of the algorithm requires however some careful handling.

First, it is necessary to prove that given a labelling of the graph’s vertices with an error rate of $x$ , one can compute approximations of $P$ and $Q$ that are within $O(x+\log(n)/\sqrt{n})$ of their true values with probability $1-o(1)$ . Secondly, one needs to modify the robust degree profiling lemma to show that attempting to determine vertices’ communities based on estimates of $p$ and $Q$ that are off by at most $\delta$ , $p^{\prime}$ and $Q^{\prime}$ , and a classification of its neighbors that has an error rate of $\delta$ classifies the vertices with an error rate only $e^{O(\delta\log n)}$ times higher than it would be given accurate values of $p$ and $Q$ and accurate classifications of the vertices’ neighbors. Combining these yields the conclusion that any errors in the estimates of the SBM’s parameters do not disrupt vertex classification any worse than the errors in the preliminary classifications already were.

The Agnostic-degree-profiling algorithm. The inputs are $(G,\gamma)$ , where $G$ is a graph, and $\gamma\in$ (see Theorem 7 for how to set $\gamma$ ).

The algorithm outputs an assignment of each vertex to one of the groups of communities $\{A_{1},\dots,A_{t}\}$ , where $A_{1},\dots,A_{t}$ is the partition of $[k]$ in to the largest number of subsets such that $D_{+}((pQ)_{i},(pQ)_{j})\geq 1$ for all $i,j$ in $[k]$ that are in different subsets (this is called the “finest partition”). It does the following:

(1) Define the graph $g^{\prime}$ on the vertex set $[n]$ by selecting each edge in $g$ independently with probability $\gamma$ , and define the graph $g^{\prime\prime}$ that contains the edges in $g$ that are not in $g^{\prime}$ .

(2) Run Agnostic-sphere-comparison on $g^{\prime}$ to obtain the preliminary classification $\sigma^{\prime}\in[k]^{n}$ (see Section 7.1.)

(3) Determine the size of each alleged community, and the edge density between each pair of alleged communities.

(4) For each node $v\in[n]$ , determine in which community node $v$ is most likely to belong to based on its degree profile computed from the preliminary classification $\sigma^{\prime}$ (see Section 7.2.2), and call it $\sigma^{\prime\prime}_{v}$

(5) Use $\sigma^{\prime\prime}_{v}$ to get new estimates of $p$ and $Q$ .

(6) For each node $v\in[n]$ , determine in which group $A_{1},\dots,A_{t}$ node $v$ is most likely to belong to based on its degree profile computed from the preliminary classification $\sigma^{\prime\prime}$ (see Section 7.2.2).

An example with real data

We have tested a simplified version of our algorithm on the data from “The political blogosphere and the 2004 US Election” [AG05], which contains a list of political blogs that were classified as liberal or conservative, and links between the blogs.

The algorithm we used has a few major modifications relative to our standard algorithm. First of all, instead of using $N_{r,r^{\prime}}(v\cdot v^{\prime})$ as its basic tool for comparing vertices, it uses a different measure, $N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime})$ which is defined as the fraction of pairs of an edge leaving the ball of radius $r$ centered on $v$ and an edge leaving the ball of radius $r^{\prime}$ centered on $v^{\prime}$ which hit the same vertex but are not the same edge. Making the measure a fraction of the pairs rather than a count of pairs was necessary to prevent $N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime})$ from being massively dependent on the degrees of $v$ and $v^{\prime}$ , which would have resulted in the increased variance in vertex degree obscuring the effects of $\sigma_{v}$ and $\sigma_{v^{\prime}}$ on $N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime})$ . The other changes to the definition make the measure somewhat less reliable, but it is still useable as long as the average degree is fairly high and $v\neq v^{\prime}$ .

Secondly, the version of Vertex-comparison-algorithm we used simply concludes that two vertices, $v$ and $v^{\prime}$ , are in different communities if $N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime})$ is below average and the same community otherwise. This is reasonable because of the following facts. For one thing, because the normalization converts the dominant term to a constant, $N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime})$ is approximately affine in $(\lambda_{2}/\lambda_{1})^{r+r^{\prime}}\zeta_{2}(v\cdot v^{\prime})/n$ . Also, as a result of the symmetry between communities, $\zeta_{2}(v\cdot v)$ is the same for all $v$ . So, $\zeta_{2}(v\cdot v)-2\zeta_{2}(v\cdot v^{\prime})+\zeta_{2}(v^{\prime}\cdot v^{\prime})$ is also affine in $\zeta_{2}(v\cdot v^{\prime})$ . Furthermore, there are only two possible values of $\zeta_{2}(v\cdot v^{\prime})$ by symmetry, and $\lambda_{2}>0$ , so $\zeta_{2}(v\cdot v)-2\zeta_{2}(v\cdot v^{\prime})+\zeta_{2}(v^{\prime}\cdot v^{\prime})>0$ iff $(\lambda_{2}/\lambda_{1})^{r+r^{\prime}}\zeta_{2}(v\cdot v^{\prime})/n$ is below average. Finally, because both communities have the same average degree, $\zeta_{1}(v,v^{\prime})$ is independent of $v$ and $v^{\prime}$ so $\zeta_{1}(v\cdot v)-2\zeta_{1}(v\cdot v^{\prime})+\zeta_{1}(v^{\prime}\cdot v^{\prime})$ is always . The version of Vertex-classification-algorithm we used is comparably modified.

Finally, our algorithm generates reference vertices by repeatedly picking two vertices at random and comparing them. If it concludes that they are in different communities and they both have above-average degree, it accepts them as reference vertices; otherwise it tries again. Requiring above-average degree is useful because a higher degree vertex is less likely to have its neighborhood distorted by a couple of atypical neighbors.

Out of $40$ trials, the resulting algorithm gave a reasonably good classification $37$ times. Each of these classified all but $56$ to $67$ of the $1222$ vertices in the graph’s main component correctly. The state-of-the-art described in [CG15] gives a lowest value at 58, with the best algorithms around $60$ , while algorithms regularized spectral methods such as the one in [QR13] obtain about 80 errors.

Open problems

The current result should also extend directly to a slowly growing number of communities (e.g., up to logarithmic). It would be interesting to extend the current approach to smaller sized communities or larger numbers of communities (watching the complexity scaling with the number of communities), as well as more general models with corrected-degrees, labeled-edges, or overlapping communities (though linear-sized overlapping communities can be treated with the approach of [AS15]).

References

The Agnostic-sphere-comparison algorithm in details

Recall the following motivation and definitions.

In order to get around this, we randomly assign every edge in $G$ to a set $E$ with probability $c$ . We hence define the following.

where the final equality holds because for all $i\neq j$ ,

and since $\lambda_{i}\neq\lambda_{j}$ , this implies that $P_{W_{i}}(e_{\sigma_{v}})\cdot P^{-1}P_{W_{j}}(e_{\sigma_{v^{\prime}}})=0$ . In order to simplify the terminology,

Let $\zeta_{i}(v\cdot v^{\prime})=P_{W_{i}}(e_{\sigma_{v}})\cdot P^{-1}P_{W_{i}}(e_{\sigma_{v^{\prime}}})$ for all $i$ , $v$ , and $v^{\prime}$ .

Let $M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})$ be the $m\times m$ matrix such that $M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})_{i,j}=N_{r+i+j,r^{\prime}[E]}(v\cdot v^{\prime})$ for each $i$ and $j$ .

To the degree that approximation 8 holds and $c$ is small, each column of $M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})$ is a linear combination of the vectors

if $|\lambda_{m}|=|\lambda_{m+1}|$ . In the later case, $\lambda_{m}=-\lambda_{m+1}$ , so either $\gamma(\lambda_{1},...,\lambda_{m})\lambda_{m}^{r+r^{\prime}+1}$ and $\gamma^{\prime}(\lambda_{1},...,\lambda_{m})\lambda_{m+1}^{r+r^{\prime}+1}$ have the same sign or $\gamma(\lambda_{1},...,\lambda_{m})\lambda_{m}^{r+r^{\prime}+2}$ and $\gamma^{\prime}(\lambda_{1},...,\lambda_{m})\lambda_{m}^{r+r^{\prime}+2}$ have the same sign. In any of these cases,

This suggests the following algorithm for finding $PQ$ ’s eigenvalues.

The Basic-Eigenvalue-approximation-algorithm The inputs are $(E,c,v)$ , where $v$ is a vertex of the graph, $c\in(0,1)$ , and $E$ is a subset of $G$ ’s edges.

The algorithm ouputs a claim about how many distinct nonzero eigenvalues $PQ$ has and a list of approximations of them.

(1) Compute $N_{r[G\backslash E]}(v)$ for each $r$ until $|N_{r[G\backslash E]}(v)|>\sqrt{n}$ , and then set $\lambda_{1}^{\prime\prime}=\sqrt[2r]{n}/(1-c)$ .

(2) Set $r=r^{\prime}=\frac{2}{3}\log n/\log((1-c)\lambda^{\prime\prime}_{1})-\sqrt{\ln n}$ . Then, compute

until an $m$ is found for which this expression is less than $((1-c)\lambda^{\prime\prime}_{1})^{3/4}+\frac{1}{\sqrt{\ln n}}$ . Then, set $h^{\prime\prime}=m-1$ .

unless $|\det(M_{i,r+1,r^{\prime}[E]}(v\cdot v^{\prime}))|<\sqrt{|\det(M_{i,r,r^{\prime}[E]}(v\cdot v^{\prime}))|\cdot|\det(M_{i,r+2,r^{\prime}[E]}(v\cdot v^{\prime}))|}$ , in which case set

Repeat this for each $i\leq h^{\prime\prime}$

(4) Next, for each $i<h^{\prime\prime}$ , if $||\lambda^{\prime}_{i}|-|\lambda^{\prime}_{i+1}||<\frac{1}{\ln n}$ then set $\lambda^{\prime}_{i}=|\lambda^{\prime}_{i}|$ and $\lambda^{\prime}_{i+1}=-|\lambda^{\prime}_{i+1}|$ . For each $i\leq h^{\prime\prime}$ such that $||\lambda^{\prime}_{i}|-|\lambda^{\prime}_{i+1}||\geq\frac{1}{\ln n}$ and $||\lambda^{\prime}_{i-1}|-|\lambda^{\prime}_{i}||\geq\frac{1}{\ln n}$ set

(5) Return $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$

The risk when using this algorithm is that if the set of edges in $v$ ’s immediate neighborhood is sufficiently atypical it may not work correctly. This can be solved by repeating it for several vertices and taking the median estimates.

The Improved-Eigenvalue-approximation-algorithm The input is $c\in(0,1)$

The algorithm ouputs a claim about how many distinct nonzero eigenvalues $PQ$ has and a list of approximations of them.

(1) Create a set of edges $E$ , that each of $G$ ’s edges is independently assigned to with probability $c$ .

(2) Randomly select $\sqrt{\ln n}$ of $G$ ’s vertices, $v$ , $v$ ,…, $v[\sqrt{\ln n}]$ .

(3) Run Basic-Eigenvalue-approximation-algorithm(E,c,v[i]) for each $i\leq\sqrt{\ln n}$ , stopping the algorithm prematurely if it takes more than $O(n\sqrt{\ln n})$ time.

(4) Return $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ where $h^{\prime\prime}$ and $\lambda_{i}$ are the median outputs of the executions of Basic-Eigenvalue-approximation-algorithm for each $i$ .

Now, note that whether or not $|\lambda_{m}|=|\lambda_{m+1}|$ , we have

This fact can be used in combination with estimates of $PQ$ ’s eigenvalues to approximate $\zeta_{i}(v\cdot v^{\prime})$ for arbitrary $v$ , $v^{\prime}$ , and $i$ as follows.

The algorithm outputs $(z_{1}(v\cdot v^{\prime}),...,z_{h^{\prime\prime}}(v\cdot v^{\prime}))$ such that $z_{i}(v\cdot v^{\prime})\approx\zeta_{i}(v\cdot v^{\prime})$ for all $i$ .

(1) For each $i\leq h^{\prime\prime}$ , set

(2) Return $(z_{1}(v\cdot v^{\prime}),...,z_{h^{\prime\prime}}(v\cdot v^{\prime}))$ .

Of course, this requires $r$ and $r^{\prime}$ to be large enough that

is large relative to the error terms for all $i\leq h^{\prime}$ . At a minimum, that requires that $|(1-c)\lambda_{i}|^{r+r^{\prime}+1}=\omega(n)$ , so

because otherwise the graph will start running out of vertices before one gets $r$ steps away from $v$ or $r^{\prime}$ steps away from $v^{\prime}$ .

Furthermore, for any $v$ and $v^{\prime}$ ,

The Vertex-comparison-algorithm. The inputs are $(v,v^{\prime},r,r^{\prime},E,x,c,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}))$ , where $v,v^{\prime}$ are two vertices, $r,r^{\prime}$ are positive integers, $E$ is a subset of $G$ ’s edges, $x$ is a positive real number, $c$ is a real number between and $1$ , and $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ are real numbers.

The algorithm outputs a decision on whether $v$ and $v^{\prime}$ are in the same community or not. It proceeds as follows.

(1) Run Vertex-product-approximation-algorithm(v,v’,r,r’,E,c, $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ ),Vertex-product-approximation-algorithm(v,v,r,r’,E,c, $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ ), and Vertex-product-approximation- algorithm(v’,v’,r,r’,E,c, $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ ).

(2) If $\exists i:z_{i}(v\cdot v)-2z_{i}(v\cdot v^{\prime})+z_{i}(v^{\prime}\cdot v^{\prime})>5(2x(\min p_{j})^{-1/2}+x^{2})$ then conclude that $v$ and $v^{\prime}$ are in different communities. Otherwise, conclude that $v$ and $v^{\prime}$ are in the same community.

One could generate a reasonable classification based solely on this method of comparing vertices (with an appropriate choice of the parameters, as later detailed). However, that would require computing $N_{r,r^{\prime}[E]}(v\cdot v)$ for every vertex in the graph with fairly large $r+r^{\prime}$ , which would be slow. Instead, we use the fact that for any vertices $v$ , $v^{\prime}$ , and $v^{\prime\prime}$ with $\sigma_{v}=\sigma_{v^{\prime}}\neq\sigma_{v^{\prime\prime}}$ ,

for all $i$ , and the inequality is strict for at least one $i$ . So, subtracting $\zeta_{i}(v\cdot v)$ from both sides gives us that

for all $i$ , and the inequality is still strict for at least one $i$ .

So, given a representative vertex in each community, we can determine which of them a given vertex, $v$ , is in the same community as without needing to know the value of $\zeta_{i}(v\cdot v)$ as follows.

The Vertex-classification-algorithm. The inputs are $(v[],v^{\prime},r,r^{\prime},E,c,\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}))$ , where $v[]$ is a list of vertices, $v^{\prime}$ is a vertex, $r,r^{\prime}$ are positive integers, $E$ is a subset of $G$ ’s edges, $c$ is a real number between and $1$ , and $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ are real numbers. It is assumed that $z_{i}(v[\sigma]\cdot v[\sigma])$ has already been computed for each $i$ and $\sigma$ .

The algorithm is supposed to output $\sigma$ such that $v^{\prime}$ is in the same community as $v[\sigma]$ . It works as follows.

(1) Run Vertex-product-approximation-algorithm( $v[\sigma]$ ,v’,r,r’,E,c, $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ ) for each $\sigma$ .

(2) Find a $\sigma$ that minimizes the value of

and conclude that $v^{\prime}$ is in the same community as $v[\sigma]$ .

This runs fairly quickly if $r$ is large and $r^{\prime}$ is small because the algorithm only requires focusing on $N_{r^{\prime}}(v^{\prime})$ vertices. This leads to the following plan for partial recovery. First, randomly select a set of vertices that is large enough to contain at least one vertex from each community with high probability. Next, compare all of the selected vertices in an attempt to determine which of them are in the same communities. Then, pick one anchor vertex in each community. After that, use the algorithm above to attempt to determine which community each of the remaining vertices is in. As long as there actually was at least one vertex from each community in the initial set and none of the approximations were particularly bad, this should give a reasonably accurate classification.

The Unreliable-graph-classification-algorithm. The inputs are $(G,c,m,\epsilon,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}))$ , where $G$ is a graph, $c$ is a real number between and $1$ , $m$ is a positive integer, $\epsilon$ is a real number between and $1$ , $x$ is a positive real number, and $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ are real numbers.

The algorithm outputs an alleged list of communities for $G$ . It works as follows.

(1) Randomly assign each edge in $G$ to $E$ independently with probability $c$ .

(2) Randomly select $m$ vertices in $G$ , $v,...,v[m-1]$ .

(3) Set $r=(1-\frac{\epsilon}{3})\log n/\log((1-c)\lambda^{\prime}_{1})-\sqrt{\ln n}$ and $r^{\prime}=\frac{2\epsilon}{3}\cdot\log n/\log((1-c)\lambda^{\prime}_{1})$

(4) Compute $N_{r^{\prime\prime}[G\backslash E]}(v[i])$ for each $r^{\prime\prime}<r+2h^{\prime\prime}+3$ and $0\leq i<m$ .

(5) Run vertex-comparison-algorithm(v[i],v[j],r,r’,E,x, $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ ) for every $i$ and $j$

(6) If these give consistent results, randomly select one alleged member of each community $v^{\prime}[\sigma]$ . Otherwise, fail.

(7) For every $v^{\prime\prime}$ in the graph, compute $N_{r^{\prime\prime}[G\backslash E]}(v^{\prime\prime})$ for each $r^{\prime\prime}\leq r^{\prime}$ . Then, run Vertex-classification-algorithm(v’[],v”, r,r’,E, $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ ) in order to get a hypothesized classification of $v^{\prime\prime}$

So, the overall Agnostic-sphere-comparison-algorithm starts by estimating $PQ$ ’s eigenvalues. Then, it uses those estimates to pick appropriate values of $x$ and $\epsilon$ for the Unreliable-graph-classification-algorithm. Finally, it runs it several times and combines the resulting classifications as explained above. The only inputs it requires are the graph itself and some $\delta>0$ such that $p_{i}\geq\delta$ for all $i$ .

The Reliable-graph-classification-algorithm (i.e., Agnostic Sphere comparison). The inputs are $(G,m,\delta,T(n))$ , where $G$ is a graph, $m$ is a positive integer, $\delta$ is a positive real number, and $T$ is a function from the positive integers to itself.

The algorithm outputs an alleged list of communities for $G$ . It works as follows.

(1) Run Improved-Eigenvalue-approximation-algorithm(.1) in order to compute $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$

(2) Let $\lambda^{\prime\prime}_{1}=\lambda^{\prime}_{1}+2\ln^{-3/2}(n)$ , $\lambda^{\prime\prime}_{h^{\prime\prime}}=\lambda^{\prime}_{h^{\prime\prime}}-2\ln^{-3/2}(n)$ , and $k^{\prime}=\lfloor 1/\delta\rfloor$

(3) Let $x$ be the smallest rational number of minimal numerator such that

(4) Let $\epsilon$ be the smallest rational number of the form $\frac{1}{z}$ or $1-\frac{1}{z}$ such that $(2(\lambda^{\prime\prime}_{1})^{3}/(\lambda^{\prime\prime}_{h^{\prime\prime}})^{2})^{1-\epsilon/3}<\lambda^{\prime\prime}_{1}$ and $(1+\epsilon/3)>\log(\lambda^{\prime\prime}_{1})/\log((\lambda^{\prime\prime}_{h^{\prime\prime}})^{2}/2\lambda^{\prime\prime}_{1})$

(5) Let $c$ be the largest unit reciprocal less than $1/9$ such that all of the following hold:

(6) Run Unreliable-graph-classification-algorithm $(G,c,m,\epsilon,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}))$ $T(n)$ times and record the resulting classifications.

(7) Find the smallest $y^{\prime\prime}$ such that there exists a set of more than half of the classifications no two of which have more than $y^{\prime\prime}$ disagreement, and discard all classifications not in the set. In this step, define the disagreement between two classifications as the minimum disagreement over all bijections between their communities.

(8) For every vertex in $G$ , randomly pick one of the remaining classifications and assert that it is in the community claimed by that classification, where a community from one classification is assumed to correspond to the community it has the greatest overlap with in each other classification.

(9) Return the resulting combined classification.

If the conditions of theorem $2$ are satisfied, then there exists $\delta$ such that

of $G$ ’s vertices correctly with probability $1-o(1)$ and it runs in $O(n^{1+\epsilon})$ time, for appropriate $C$ and $\rho=\sqrt{\lambda_{h^{\prime}}^{2}/4\lambda_{1}}$ .

Appendix

Considering the way $\delta$ , $\epsilon$ , $x$ , and $x^{\prime}$ scale when $Q$ is multiplied by a scalar yields the following corollary.

If instead of having constant average degree, one has an average degree which increases as $n$ increases, one can slowly reduce $b$ , $\delta$ , and $\epsilon$ as $n$ increases, leading to the following corollary.

These corollaries are important as they show that if the entries of the connectivity matrix $Q$ are amplified by a coefficient growing with $n$ , almost exact recovery is achieved by (Agnostic-sphere-comparison) without parameter knowledge.

1.2 Proof of Theorem 6

Proving Theorem 6 will require establishing some terminology. First, let $\lambda_{1},...,\lambda_{h}$ be the distinct eigenvalues of $PQ$ , ordered so that $|\lambda_{1}|\geq|\lambda_{2}|\geq...\geq|\lambda_{h}|\geq 0$ and if $|\lambda_{i}|=|\lambda_{i+1}|$ then $\lambda_{i}>0>\lambda_{i+1}$ . Also define $h^{\prime}$ so that $h^{\prime}=h$ if $\lambda_{h}\neq 0$ and $h^{\prime}=h-1$ if $\lambda_{h}=0$ . In addition to this, let $d$ be the largest sum of a column of $PQ$ .

For any vertex $v$ , let $N_{r}(v)$ be the set of all vertices with shortest path to $v$ of length $r$ . If there are multiple graphs that $v$ could be considered a vertex in, let $N_{r[G^{\prime}]}(v)$ be the set of all vertices with shortest paths in $G^{\prime}$ to $v$ of length $r$ .

We also typically refer to $\overrightarrow{N_{r[G^{\prime}]}(v)}$ as simply $N_{r[G^{\prime}]}(v)$ , as the context will make it clear whether the expression refers to a set or vector.

Note that since any such $w$ can be written as a linear combination of the $e_{i}$ , $v$ is $(R,x)$ -good if $|e_{i}\cdot N_{r+1}(v)-e_{i}\cdot PQN_{r}(v)|\leq\frac{x\lambda_{h^{\prime}}}{2}\left(\frac{\lambda_{h^{\prime}}^{2}}{2\lambda_{1}}\right)^{r}\sqrt{p_{i}/k}$ for all $1\leq i\leq k$ and $0\leq r<R$ .

First, note that for any eigenvector of $PQ$ , $w$ , and $r<R$ ,

Thus, for any $r\leq R$ , it must be the case that

Now, define $w_{1}$ ,…, $w_{h}$ such that $PQw_{i}=\lambda_{i}w_{i}$ for each $i$ and $p=\sum_{i=1}^{h}w_{i}$ . For any $i,j$ ,

If $i\neq j$ , then $\lambda_{i}\neq\lambda_{j}$ , so this implies that $w_{i}\cdot P^{-1}w_{j}=0$ . It follows from this that

Therefore, for any $r\leq R$ , we have that

The following two lemmas are proved in [AS15].

Note that if $N_{r[G\backslash E]}(v)$ and $N_{r^{\prime}[G\backslash E]}(v^{\prime})$ have already been computed, $N_{r,r^{\prime}[E]}(v\cdot v^{\prime})$ can be computed by means of the following algorithm, where $E[v]=\{v^{\prime}:(v,v^{\prime})\in E\}$

Roughly speaking, for each $v_{1}\in N_{r[G\backslash E]}(v)$ and $v_{2}\in N_{r^{\prime}[G\backslash E]}(v^{\prime})$ , $(v_{1},v_{2})\in E$ with probability $cQ_{\sigma_{v_{1}},\sigma_{v_{2}}}/n$ . This is complicated by the facts that $(v_{1},v_{1})$ is never in $E$ and no edge is in $G\backslash E$ and $E$ . However, this changes the expected value of $N_{r,r^{\prime}[E]}(v\cdot v^{\prime})$ given $G\backslash E$ by at most a constant unless $G$ has more than double its expected number of edges, something that happens with probability $o(1)$ . Furthermore, whether $(v_{1},v_{2})$ is in $E$ is independent of whether $(v_{1}^{\prime},v_{2}^{\prime})$ is in $E$ unless $(v_{1}^{\prime},v_{2}^{\prime})=(v_{1},v_{2})$ or $(v_{1}^{\prime},v_{2}^{\prime})=(v_{2},v_{1})$ . So, the variance of $N_{r,r^{\prime}[E]}(v\cdot v^{\prime})$ is proportional to its expected value, which is

$N_{r,r^{\prime}[E]}(v\cdot v^{\prime})$ is within $\log n$ standard deviations of its expected value with probability $1-o(1)$ , which completes the proof. ∎

Note that if $\overrightarrow{v}$ is an eigenvector of $(1-c)PQ$ , $\sqrt{P}Q\overrightarrow{v}$ is an eigenvector of the symmetric matrix $(1-c)\sqrt{P}Q\sqrt{P}$ . So, since eigenvectors of a symmetric matrix with different eigenvalues are orthogonal, we have

where we temporarily adopt the convention that if $i>h^{\prime}$ , $\lambda_{i}=\lambda_{h^{\prime}}/\sqrt{2}$ and $w_{i}(S)=0$ for all $S$ .

Alternately, if $m=h^{\prime}+1$ then with probability $1-o(1)$ , either $v$ is $(r+2m+1,x)$ -bad, $v^{\prime}$ is $(r^{\prime}+1,x)$ -bad, or

For each $1\leq l\leq 2m$ and $1\leq i\leq h$ , let

Next, for each $1\leq i\leq h$ and $0\leq l\leq m$ , let $u_{l}(i)$ be the column vector thats $j$ th entry is $x_{l+j}(i)$ for $1\leq j\leq m$ . Also, for each $1\leq l\leq m$ , let $u_{l}(h+1)$ be the length $m$ column vector thats $j$ th entry is $N_{r+l+j,r^{\prime}[E]}(v\cdot v^{\prime})-\sum_{i=1}^{h}x_{l+j}(i)$ for $1\leq j\leq m$ . Note that for each $1\leq l\leq m$ , the $l$ th column of $M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})$ is $\sum_{i=1}^{h+1}u_{l}(i)$ . So,

If $m\leq h$ and $i$ is some permutation of the integers from $1$ to $m$ , then

The $j$ th column of this matrix is proportional to $\frac{c}{n}w_{j}(N_{r[G\backslash E]}(v))\cdot Qw_{j}(N_{r^{\prime}[G\backslash E]}(v^{\prime}))$ , so there exists some $\gamma=\gamma(\{(1-c)\lambda_{j}\},m)$ such that the sum of all such terms is

Alternately, the sum of all such terms is equal to

That can only hold for all such $i$ if $\sum_{l=1}^{m}u^{\prime}_{l}(1-c)^{l}\lambda_{j}^{l}=0$ for all $1\leq j\leq m$ , and that can only be the case if $u^{\prime}=0$ . Therefore, the determinant is nonzero unless $x_{l}(0)=0$ for some $l$ , which implies that $\gamma\neq 0$ . If $m<h$ then by similar logic, there exists $\gamma^{\prime}=\gamma^{\prime}(\{(1-c)\lambda_{i}\},m)$ such that the sum of all terms for which $i$ is a permutation of the integers from $1$ to $m-1$ and $m+1$ is

for all $i$ . Similarly, if $v^{\prime}$ is $(r^{\prime}+1,x)$ -good then

for all $i$ . If both hold, then $|x_{l}(i)|\leq\frac{c}{n}(1-c)^{r+r^{\prime}+l}|\lambda_{i}|^{r+r^{\prime}+l+1}((\min p_{j})^{-1/2}+x)^{2}$ for all $i$ . Furthermore, for any $l$ and $j$ ,

In other words, under these circumstances $|x_{l}(i)|$ is upper bounded by a constant multiple of $\frac{c}{n}((1-c)|\lambda_{i}|)^{r+r^{\prime}}$ if $v$ and $v^{\prime}$ are both good, and every entry of $u_{l}(h+1)$ has a magnitude that is upper bounded by a constant multiple of $((1-c)\lambda_{1})^{(r+r^{\prime})/2}\log n/\sqrt{n}+\frac{c}{n}((1-c)^{2}\lambda_{h^{\prime}}^{2}/2)^{(r+r^{\prime})/2}$ . Either way, every entry of $u_{l}(i)$ is upper bounded by a constant multiple of $\frac{c}{n}((1-c)|\lambda_{i}|)^{r+r^{\prime}}\log n$ .

Alternately, recall that if $v$ is $(r+2m+1,x)$ -good then for any $r^{\prime\prime}<r+2m+1$ and $i\leq h$ ,

Also, for fixed values of $N_{r^{\prime\prime\prime}}(v)$ for all $r^{\prime\prime\prime}\leq r^{\prime\prime}\leq r+2m+1$ , each element of $N_{r^{\prime\prime}+1[G\backslash E]}(v)$ has a variance of $O(|N_{r^{\prime\prime}[G\backslash E]}(v)|)=O((1-c)\lambda_{1}^{r^{\prime\prime}})$ . So,

with probability $1-o(1)$ for all $l,j\leq m$ and $i\leq h$ . This implies that if $v$ is $(r+2m+1,x)$ -good and $v^{\prime}$ is $(r^{\prime}+1,x)$ -good then

with probability $1-o(1)$ . There are only $m^{m}$ possible choices of $i$ , so with probability $1-o(1)$ , either $v$ is $(r+2m+1,x)$ -bad, $v^{\prime}$ is $(r^{\prime}+1,x)$ -bad, or

While this is a helpful result, it turns out to be more useful to have an expression that links the determinant to $w_{i}(N_{r[G\backslash E]}(v))\cdot Qw_{i}(N_{r^{\prime}[G\backslash E]}(v^{\prime}))$ for fixed values of $r$ and $r^{\prime}$ . So, we have the following:

Combining these inequalities with the determinant lemma yields the desired result. ∎

However, in order to know this in a useful sense it is necessary to prove that these terms are large relative to the error terms. In order to do that, we need the following:

Every entry in $(PQ)^{k}$ is positive, so its eigenvector of largest eigenvalue is unique up to multiplication, and its entries all have the same sign. That means that it has a unique multiple, $w$ , such that $w\cdot P^{-1}w=1$ and $w_{1}>0$ . In this case all of $w$ ’s entries must be positive, and since $(PQ)^{k}w=\lambda_{1}^{k}w$ , it must be the case that $PQw=\lambda_{1}w$ .

Given any $(\sqrt{\ln n},\min(P^{-1}w)_{j}/2)$ -good vertex $v$ , and any $r<\sqrt{\ln n}$ ,

For any eigenvector $w^{\prime\prime}$ with eigenvalue $\lambda_{i^{\prime}}$ and $w^{\prime\prime}\cdot P^{-1}w^{\prime\prime}=1$ ,

$\lambda_{1}>\lambda_{2}$ , so there exists a constant $r_{0}$ such that for any $r>r_{0}$ ,

for all $j$ . For $r_{0}<r\leq\sqrt{\ln n}$ , $1\leq j\leq k$ , and a fixed value of $N_{r}(v)$ , the probability distribution of $N_{r+1}(v)_{j}$ is within $o(1)$ of a poisson distribution with expected value $(PQN_{r}(v))_{j}$ . Furthermore, $N_{r+1}(v)_{j^{\prime}}$ has negligible dependence on $N_{r+1}(v)_{j}$ for all $j^{\prime}\neq j$ . So, $w^{\prime}\cdot P^{-1}N_{r+1}(v)$ has an expected value of $\lambda_{i}w^{\prime}\cdot P^{-1}N_{r}(v)+o(1)$ and a standard deviation of

In particular, this means that if there exists $r_{0}<r<2\ln\ln n/\ln(\lambda_{i}^{2}/\lambda_{1})$ such that $|w^{\prime}\cdot P^{-1}N_{r}(v)|>\lambda_{1}^{r/2}\ln\ln\ln\ln n$ then with probability $1-o(1)$ ,

Furthermore, since the probability distribution of $w^{\prime}\cdot P^{-1}N_{r}(v)$ for a fixed value of $N_{r-1}(v)$ is a sum of constant multiples of poisson distributions with expected values of $\Omega(\lambda_{1}^{r})$ , it must be the case that $|w^{\prime}\cdot P^{-1}N_{r}(v)|>\lambda_{1}^{r/2}\ln\ln\ln\ln n$ with probability $e^{-O(\ln^{2}\ln\ln\ln n)}=\omega(1/\ln\ln n)$ . Therefore, there exists $r_{0}<r<2\ln\ln n/\ln(\lambda_{i}^{2}/\lambda_{1})$ such that this holds with probability $1-o(1)$ , and the lemma follows. ∎

This also implies that for any $v$ , $v^{\prime}$ , and $i$ , either $v$ is $(\sqrt{\ln n},\min(P^{-1}w)_{i}/2)$ -bad, $v^{\prime}$ is $(\sqrt{\ln n},\min(P^{-1}w)_{i}/2)$ -bad, or $|w_{i}(N_{\sqrt{\ln n}}(v))\cdot P^{-1}w_{i}(N_{\sqrt{\ln n}}(v^{\prime}))|\geq\lambda_{i}^{2\sqrt{\ln n}}/\ln^{2}n$ with probability $1-o(1)$ , since the degree of dependence between $N_{\sqrt{\ln n}}(v)$ and $N_{\sqrt{\ln n}}(v^{\prime})$ is negligible. This allows me to attempt to approximate $PQ$ ’s eigenvalues as follows.

With probability $1-o(1)$ , either $v$ is $(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1}),x)$ -bad or Basic-Eigenvalue-approximation-algorithm(E,c,v) runs in $O(n)$ time and returns $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ such that $h^{\prime}=h^{\prime\prime}$ and $|\lambda^{\prime}_{i}-\lambda_{i}|<\ln^{-3/2}(n)$ for all $i$ .

If $v$ is $(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1},x)$ -good, there exists a constant $r$ such that for any $r<r^{\prime\prime}<\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1})$ ,

So, the minimum $r^{\prime\prime}$ such that $|N_{r^{\prime\prime}[G\backslash E]}(v)|>\sqrt{n}$ is within a constant of $\log n/2\log((1-c)\lambda_{1})$ . That means that $|\lambda_{1}-\lambda^{\prime\prime}_{1}|$ is upper bounded by a constant multiple of $1/\log n$ , and that $r=r^{\prime}$ is within a constant of the value it would have if $\lambda^{\prime\prime}_{1}$ were $\lambda_{1}$ . This also implies that if $n$ is sufficiently large it is less than $\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1})-2h^{\prime}-3$ .

For $m\leq h^{\prime}$ , $r^{\prime\prime}\in\{r,r+1\}$ , and $v\in G$ if $|\lambda_{i}|\neq|\lambda_{i+1}|$ then

with probability $1-o(1)$ . If $|\lambda_{i}|=|\lambda_{i+1}|$ then $\lambda_{i}=-\lambda_{i+1}$ , so either $\gamma((1-c)\lambda_{m})^{2r-2\sqrt{\ln n}}w_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))$ has the same sign as $\gamma^{\prime}((1-c)\lambda_{m+1})^{2r-2\sqrt{\ln n}}w_{m+1}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m+1}(N_{\sqrt{\ln n}[G\backslash E]}(v))$ or $\gamma((1-c)\lambda_{m})^{2r+1-2\sqrt{\ln n}}w_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))$ has the same sign as $\gamma^{\prime}((1-c)\lambda_{m+1})^{2r+1-2\sqrt{\ln n}}w_{m+1}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m+1}(N_{\sqrt{\ln n}[G\backslash E]}(v)))$ . Either way, we have that

with probability $1-o(1)$ . Furthermore, by lemma $8$ , these bounds are within a factor of $O(\ln^{2}n)$ of each other with probability $1-o(1)$ .

is within a factor of $\ln^{3}(n)$ of $|((1-c)\lambda_{m})^{2r_{1}-2\sqrt{\ln n}}w_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))|$ with probability $1-o(1)$ , and thus that

with probability $1-o(1)$ . $|(1-c)\lambda_{m}|\geq\sqrt{(4(1-c)^{3}\lambda_{1}^{3}}$ , so with probability $1-o(1)$ , this expression is not less than $((1-c)\lambda^{\prime\prime}_{1})^{3/4}+\frac{1}{\sqrt{\ln n}}$ .

If $m=h^{\prime}+1$ , $r^{\prime\prime}\in\{r,r+1\}$ , and $v\in G$ , then with probability $1-o(1)$ either $v$ is $(r^{\prime\prime}+2m+1,x)$ -bad or

Therefore, with probability $1-o(1)$ , either $v$ is $(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1}),x)$ -bad or $h^{\prime\prime}=h^{\prime}$ .

By the previous two lemmas, with probability $1-o(1)$ either $v$ is $(r+2m+4,x)$ -bad, or

for all $m\leq h^{\prime}+1$ and $r\leq r^{\prime\prime}<r+4$ .

Assume that the later holds. If $|\lambda_{m+1}|<|\lambda_{m}|$ then if $n$ is sufficiently large the $\gamma^{\prime}$ term will be negligable relative to the $\gamma$ term, while if $|\lambda_{m+1}|=|\lambda_{m}|$ increasing $r^{\prime\prime}$ by two would multiply both of these terms by $(1-c)^{2}\lambda_{m}^{2}$ . Either way, we have that for any $k\leq h^{\prime}$ and $r\leq r^{\prime\prime}\leq r+1$ ,

as long as $n$ is sufficiently large, and

We assumed that $n$ is large, so that can only fail if $|\lambda_{m}|=|\lambda_{m+1}|$ . In this case, $\lambda_{m}=-\lambda_{m+1}$ , so increasing $m$ by one would result in both terms having the same sign, at which point the condition is satisfied. So,

for any $i\leq h^{\prime}$ . For any $i<h^{\prime}$ , if $|\lambda_{i}|>|\lambda_{i+1}|$ , then for sufficiently large $n$ , $|\lambda^{\prime}_{i}|-|\lambda^{\prime}_{i+1}|$ will be greater than $\frac{1}{\ln n}$ . On the other hand, if $|\lambda_{i}|=|\lambda_{i+1}|$ and $n$ is sufficiently large, $|\lambda^{\prime}_{i}|-|\lambda^{\prime}_{i+1}|$ will be less than $\frac{1}{\ln n}$ . So, the algorithm will suceed at determining which eigenvalues have the same absolute values and assign $\lambda^{\prime}_{i}$ the same sign as $\lambda_{i}$ for each $i$ . So, $|\lambda^{\prime}_{i}-\lambda_{i}|<\ln^{-3/2}(n)$ as desired. Assuming this all works, the slowest part of the algorithm is computing expressions of the form $N_{r^{\prime\prime},r^{\prime\prime\prime}[E]}(v\cdot v)$ , each such expression can be computed in $O(n)$ time, and only a constant number of them need to be computed. So, the algorithm runs in $O(n)$ time. ∎

So, this algorithm sometimes works, but it fails if $v$ is bad. This risk can be mitigated by using multiple vertices as follows.

then Improved-Eigenvalue-approximation-algorithm(c) returns $(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})$ such that $h^{\prime}=h^{\prime\prime}$ and $|\lambda^{\prime}_{i}-\lambda_{i}|<\ln^{-3/2}(n)$ for all $i$ with probability $1-o(1)$ .

Generating $E$ takes $O(n)$ time, picking $v[i]$ takes $o(n)$ time, each execution of Basic-Eigenvalue-approximation-algorithm(c) runs in $O(n\sqrt{\log n})$ time, and combining their outputs takes $o(n)$ time. So, this algorithm runs in $O(n\log n)$ time.

Assuming the conditions are satisfied, there exists $y<\frac{1}{2}$ such that $1-y$ of $G$ ’s vertices are $(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1},x)$ -good with probability $1-o(1)$ . So, with probability $1-o(1)$ , the majority of the selected vertices of $G$ are $(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1}),x)$ -good and the majority of the executions of Basic-Eigenvalue-approximation-algorithm give good output. If this happens, then the median value of $h^{\prime\prime}$ is $h^{\prime}$ , and for each $1\leq i\leq h^{\prime}$ , the median value of $\lambda^{\prime}_{i}$ is within $\ln^{-3/2}(n)$ of $\lambda_{i}$ for each $i$ , as desired. ∎

Given approximations of $PQ$ ’s eigenvalues, one can attempt to approximate $N_{i}(\{v\})\cdot P^{-1}N_{i}(\{v^{\prime}\})$ as follows.

The slowest part of the algorithm is computing the expressions of the form $N_{r^{\prime\prime},r^{\prime}[E]}(v\cdot v^{\prime})$ . Each of these can be computed in $O(E[|N_{r^{\prime}[G\backslash E]}(v^{\prime})|])=O(((1-c)\lambda_{1})^{r^{\prime}})$ average time, and the number of these that need to be computed is constant in $n$ . So, this algorithm runs in $O(((1-c)\lambda_{1})^{r^{\prime}})$ average time.

With probability $1-o(1)$ either $v$ is $(r+2m+4,x)$ -bad, $v^{\prime}$ is $(r^{\prime}+1,x)$ -bad, or

for all $m\leq h^{\prime}+1$ and $r\leq r^{\prime\prime}<r+4$ .

for any $i\leq h$ with probability $1-o(1)$ . So,

with probability $1-o(1)$ . That implies that

By goodness of $v$ and $v^{\prime}$ , this is less than or equal to

For any two vertices in different communities, $v$ and $v^{\prime}$ , the fact that $Q$ ’s rows are distinct implies that $Q(\overrightarrow{\{v\}}-\overrightarrow{\{v^{\prime}\}})\neq 0$ . So, $w_{i}(\{v\})\neq w_{i}(\{v^{\prime}\})$ for some $1\leq i\leq h^{\prime}$ . That means that for any two vertices $v$ and $v^{\prime}$ ,

for all $1\leq i\leq h^{\prime}$ , with equality for all $i$ if and only if $v$ and $v^{\prime}$ are in the same community. This also implies that given a vertex $v$ , another vertex in the same community $v^{\prime}$ , and a vertex in a different community $v^{\prime\prime}$ ,

for all $1\leq i\leq h^{\prime}$ and the inequality is strict for at least one $i$ . This suggests the following algorithms for classifying vertices.

Assuming that each of $G$ ’s edges was independently assigned to $E$ with probability $c$ , this algorithm runs in $O(((1-c)\lambda_{1})^{\max(r,r^{\prime})})$ average time. Furthermore, if each execution of Vertex-product-approximation-algorithm succeeds and $13(2x(\min p_{j})^{-1/2}+x^{2})$ is less than the minimum nonzero value of $(w_{i}(\{v\})-w_{i}(\{v^{\prime}\}))\cdot P^{-1}(w_{i}(\{v\})-w_{i}(\{v^{\prime}\}))$ then the algorithm returns the correct result with probability $1-o(1)$ .

The slowest step of the algorithm is using Vertex-product-approximation-algorithm. This runs in an average time of $O(((1-c)\lambda_{1})^{\max(r,r^{\prime})})$ and must be done $3$ times. If each execution of Vertex-product-approximation-algorithm succeeds then with probability $1-o(1)$ the $z_{i}$ are all within $\frac{6}{5}(2x(\min p_{j})^{-1/2}+x^{2})$ of the products they seek to approximate, in which case

which is true for some $i$ if and only if $v$ and $v^{\prime}$ are in different communities. ∎

Assuming that $E$ was generated properly, this algorithm runs in $O(((1-c)\lambda_{1})^{r^{\prime}})$ average time. Let $x>0$ and assume that each execution of Vertex-product-approximation-algorithm succeeds (including the previous ones to compute $z_{i}(v[\sigma]\cdot v[\sigma])$ ), and that $v[]$ contains exactly one vertex from each community. Also, assume that $13(2x(\min p_{j})^{-1/2}+x^{2})$ is less than the minimum nonzero value of $(w_{i}(\{v\})-w_{i}(\{v^{\prime}\}))\cdot P^{-1}(w_{i}(\{v\})-w_{i}(\{v^{\prime}\}))$ . Then this algorithm classifies $v^{\prime}$ correctly with probability $1-o(1)$ .

Again, the slowest step of the algorithm is running Vertex-product-approximation-algorithm. This runs in an average time of $O(((1-c)\lambda_{1})^{r^{\prime}})$ and must be done $k$ times. If the conditions given above are satisifed, then each $z_{i}$ is within $\frac{21}{20}(2x(\min p_{j})^{-1/2}+x^{2})$ of the product it seeks to approximate with probability $1-o(1)$ . If this is the case, then

for all $i$ and $\sigma^{\prime}$ if $v^{\prime}$ is in the same community as $v[\sigma]$ , and

for all $i$ and $\sigma$ iff $v^{\prime}$ is in the same community as $v[\sigma]$ . So,

iff $v^{\prime}$ is in the same community as $v[\sigma]$ . Therefore, the algorithm returns the correct result with probability $1-o(1)$ . ∎

At this point, we can finally start giving algorithms for classifying a graph’s vertices.

Let $x^{\prime}>0$ , and assume that all of the following hold:

Let $y=2ke^{-\frac{x^{2}(1-c)\lambda_{h^{\prime}}^{2}\min p_{i}}{16\lambda_{1}k^{3/2}((\min p_{i})^{-1/2}+x)}}/\left(1-e^{-\frac{x^{2}(1-c)\lambda_{h^{\prime}}^{2}\min p_{i}}{16\lambda_{1}k^{3/2}((\min p_{i})^{-1/2}+x)}\cdot((\frac{(1-c)\lambda_{h^{\prime}}^{4}}{4\lambda_{1}^{3}})-1)}\right)$

and $y^{\prime}=2ke^{-\frac{x^{\prime 2}(1-c)\lambda_{h^{\prime}}^{2}\min p_{i}}{16\lambda_{1}k^{3/2}((\min p_{i})^{-1/2}+x^{\prime})}}/\left(1-e^{-\frac{x^{\prime 2}(1-c)\lambda_{h^{\prime}}^{2}\min p_{i}}{16\lambda_{1}k^{3/2}((\min p_{i})^{-1/2}+x^{\prime})}\cdot((\frac{(1-c)\lambda_{h^{\prime}}^{4}}{4\lambda_{1}^{3}})-1)}\right)$

This algorithm runs in $O(m^{2}n^{1-\frac{\epsilon}{3}}+n^{1+\frac{2}{3}\epsilon})$ time. Furthermore, with probability $1-o(1)$ , $G$ is such that Unreliable-graph-classification-algorithm $(G,c,m,\epsilon,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}))$ has at least a

chance of classifying at least $1-y^{\prime}$ of $G$ ’s vertices correctly.

Let $r_{0}=(1-\frac{\epsilon}{3})\log n/\log((1-c)\lambda_{1})$ . There exists $y^{\prime\prime}<y$ such that if these conditions hold, then with probability $1-o(1)$ , at least $1-y^{\prime\prime}$ of $G$ ’s vertices are $(r_{0},x)$ -good and the number of vertices in $G$ in community $\sigma$ is within $\sqrt{n}\log n$ of $p_{\sigma}n$ for all $\sigma$ . If this is the case, then for sufficiently large $n$ , it is at least $1-k(1-\min p_{i})^{m}-my$ likely that every one of the $m$ randomly selected vertices is $(r_{0},x)$ -good and at least one is selected from each community.

If $v[i]$ is $(r_{0},x)$ -good for all $i$ , then with probability $1-o(1)$ , vertex-comparison-algorithm $(v[i],v[j],r,r^{\prime},E,x,c,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}))$ determines whether or not $v[i]$ and $v[j]$ are in the same community correctly for every $i$ and $j$ , allowing the algorithm to pick one member of each community. If that happens, then the algorithm will classify each $(r^{\prime}+h^{\prime},x^{\prime})$ -good vertex correctly with probability $1-o(1)$ . So, as long as the initial selection of $v[]$ is good, the algorithm classifies at least $1-y^{\prime}$ of the graph’s vertices correctly with probability $1-o(1)$ .

Generating $E$ and $v[]$ takes $O(n)$ time. Computing $N_{r^{\prime\prime}[G\backslash E]}(v[i])$ for all $r^{\prime\prime}\leq r+2h^{\prime}+3$ takes $O(m|\cup_{r^{\prime\prime}}N_{r^{\prime\prime}[G\backslash E]}(v[i]))=O(mn)$ time, and computing $N_{r^{\prime\prime}[G\backslash E]}(v^{\prime})$ for all $r^{\prime\prime}\leq r^{\prime}$ and $v^{\prime}\in G$ takes

time. Once these have been computed, running Vertex-comparison- algorithm $(v[i],v[j],r,r^{\prime},E,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}))$ for every $i$ and $j$ takes $O(m^{2}\cdot((1-c)\lambda_{1})^{r})=O(m^{2}n^{1-\frac{\epsilon}{3}})$ time, at which point an alleged member of each community can be found in $O(m^{2})$ time. Running Vertex-classification-algorithm $(v^{\prime}[],v^{\prime\prime},r,r^{\prime},E,c,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}))$ for every $v^{\prime\prime}\in G$ takes $O(n\cdot((1-c)\lambda_{1})^{r^{\prime}})=O(n^{1+\frac{2}{3}\epsilon})$ time. So, the overall algorithm runs in $O(m^{2}n^{1-\frac{\epsilon}{3}}+n^{1+\frac{2}{3}\epsilon})$ average time. ∎

So, this algorithm can sometimes give a vertex classification that is nontrivially better than that obtained by guessing. However, it has an assymptotically nonzero failure rate and requires too much information about the graph’s parameters. In order to get around that, we combine the results of multiple executions of the algorithm and add in a parameter analysis procedure as follows.

Assume that there exist $x$ , $x^{\prime}$ , and $\epsilon$ such that $x$ is either a unit reciprocal or an integer, $\epsilon$ is a rational number of the form $\frac{1}{z}$ or $1-\frac{1}{z}$ , and all of the following hold:

With probability $1-o(1)$ , Reliable-graph-classification-algorithm(G,m, $\delta$ , T(n)) runs in $O(m^{2}n^{1-\frac{\epsilon}{3}}T(n)+n^{1+\frac{2}{3}\epsilon}T(n))$ time and classifies at least $1-3y^{\prime}$ of $G$ ’s vertices correctly, where

With probability $1-o(1)$ , Improved-Eigenvalue-approximation-algorithm gives output such that $h^{\prime\prime}=h^{\prime}$ and $|\lambda_{i}-\lambda^{\prime}_{i}|\leq\ln^{-3/2}(n)$ for each $i$ . Assuming that this holds, the algorithm finds the largest $x$ and $\epsilon$ that satisfy the conditions above and the largest unit reciprocal less than $1/9$ , $c$ , that satisifes the conditions for Unreliable-graph-classification-algorithm $(G,c,m,\epsilon,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}))$ to have a greater than $1/2$ success rate for any $k\leq\lfloor 1/\delta\rfloor$ with probability $1-o(1)$ . Since its success rate is greater than $1/2$ and $T(n)=\omega(1)$ , more than half of the executions of Unreliable-graph-classification-algorithm give classifications with error $y^{\prime}$ or less with probability $1-o(1)$ . That means that $y^{\prime\prime}\leq 2y^{\prime}$ and at least one of the classifications in the selected set must have error $y^{\prime}$ or less. Thus, all of the selected classifications have error $3y^{\prime}$ or less. The requirement that $\min p_{i}>4y^{\prime}$ ensures that the bijection between any two of these classifications’ communities that minimizes disagreeement is the identity. Therefore, this algorithm classifies at least $1-3y^{\prime}$ of the vertices correctly, as desired.

Assuming this all works correctly, Improved-Eigenvalue-approximation-algorithm runs in $O(n\ln(n))$ time. Finding $x$ , $\epsilon$ , and $c$ takes constant time, and running Unreliable-graph-classification-algorithm $T(n)$ times takes $O(m^{2}n^{1-\frac{\epsilon}{3}}T(n)+n^{1+\frac{2}{3}\epsilon}T(n))$ . Computing the degree of agreement between each pair of classifications takes $O(n\ln^{2}(n))$ time. $T(n)<\ln n$ , so the brute force algorithm finds $y^{\prime\prime}$ and the corresponding set in $O(n)$ time. Combining the classifications takes $O(n)$ time. Therefore, this whole algorithm runs in $O(m^{2}n^{1-\frac{\epsilon}{3}}T(n)+n^{1+\frac{2}{3}\epsilon}T(n))$ time with probability $1-o(1)$ , as desired. ∎

If the conditions hold, Reliable-graph-classification- algorithm $(G,\ln(4\lfloor 1/\delta\rfloor)/\delta,\delta,\ln(n))$ has the desired properties by the previous lemma. ∎

2 Exact recovery

Recall also that exact recovery is solvable for a community partition $[k]=\sqcup_{s=1}^{t}A_{s}$ , if there exists an algorithm that assigns to each node in $G$ an element of $\{A_{1},\dots,A_{t}\}$ that contains its true communityUp to a relabelling of the communities. with probability $1-o_{n}(1)$ . Exact recovery is solvable in $SBM(n,p,W)$ if it is solvable for the partition of $[k]$ into $k$ singletons, i.e., all communities can be recovered.

is an $f$ -divergence. For $t=1/2$ , i.e., the gap between the arithmetic and geometric means, we have

which is the Hellinger divergence (or distance), and the maximization over $t$ of the part $\sum_{x}\mu(x)^{t}\nu(x)^{1-t}$ is the exponential of the Chernoff divergence. We refer to Section 8.3 for further discussions on $D_{+}$ . Note also that we will often evaluate $D_{+}$ as $D_{+}(x,y)$ where $x,y$ are vectors instead of measures.

We next present our main theorem for exact recovery. We first provide necessary and sufficient conditions for exact recovery of partitions, and then provide an agnostic algorithm that solves exact recovery efficiently, more precisely, in quasi-linear time.

Note that the second item in the theorem implies that Agnostic-degree-profiling solves exact recovery efficiently whenever the parameters $p$ and $Q$ allow for exact recovery to be solvable.

If $Q_{ij}=0$ for some $i$ and $j$ then the results above still hold, except that if for all $i$ and $j$ in different subsets of the partition,

but there exist $i$ and $j$ in different subsets of the partition such that $D_{+}((PQ)_{i},(PQ)_{j})=1$ and $((PQ)_{i,k}\cdot(PQ)_{j,k}\cdot((PQ)_{i,k}-(PQ)_{j,k})=0$ for all $k$ , then the optimal algorithm will have an assymptotically constant failure rate. The recovery algorithm also needs to be modified to accomodate ’s in $Q$ .

Agnostic-degree-profiling algorithm. Inputs: a graph $G=([n],E)$ , and a splitting parameter $\gamma\in$ (see Theorem 7 for the choice of $\gamma$ ). Output: Each node $v\in[n]$ is assigned a community-list $A(v)\in\{A_{1},\dots,A_{t}\}$ , where $A_{1},\dots,A_{t}$ is intended to be the partition of $[k]$ in to the largest number of subsets such that $D_{+}((pQ)_{i},(pQ)_{j})\geq 1$ for all $i,j$ in $[k]$ that are in different subsets. Algorithm: (1) Define the graph $G^{\prime}$ on the vertex set $[n]$ by selecting each edge in $G$ independently with probability $\gamma$ , and define the graph $G^{\prime\prime}$ that contains the edges in $G$ that are not in $G^{\prime}$ . (2) Run Agnostic-sphere-comparison on $G^{\prime}$ to obtain the preliminary classification $\sigma^{\prime}\in[k]^{n}$ (see Section 7.1 and Corollary 2.) (3) Estimate $p$ and $Q$ based on the alleged communities’ sizes and the edge densities between them. (4) For each node $v\in[n]$ , determine in which community node $v$ is most likely to belong to based on its degree profile computed from the preliminary classification $\sigma^{\prime}$ and the estimates of $p$ and $Q$ (see Section 7.2.2), and call it $\sigma^{\prime\prime}_{v}$ (5) Re-estimate $p$ and $Q$ based on the sizes of the communities claimed by $\sigma^{\prime\prime}$ and the edge densities between them. (6) For each node $v\in[n]$ , determine in which group $A_{1},\dots,A_{t}$ node $v$ is most likely to belong to based on its degree profile computed from the preliminary classification $\sigma^{\prime\prime}$ and the new estimates of $P$ and $Q$ (see Section 7.2.2).

2.2 Testing degree profiles

In this section, we consider the problem of deciding which community a node in the SBM belongs to based on its degree profile. The notions are the same as in [AS15], we repeat them to ease the reading and explain how the Agnostic-degree-profiling algorithm works.

hence, by the additivity of Poisson distribution and the triangular inequality,

We will rely on a simple one-sided bound (see (44)) to approximate our events under the Poisson measure.

In other words, $D$ has independent Poisson entries with different means. We use the following notation to summarize the above setting:

Our goal is to infer the value of $H$ by observing a realization of $D$ . To minimize the error probability given a realization of $D$ , we must pick the most likely hypothesis conditioned on this realization, i.e.,

which is the Maximum A Posteriori (MAP) decoding rule.Ties can be broken arbitrarily. To resolve this maximization, we can proceed to a tournament of $k-1$ pairwise comparisons of the hypotheses. Each comparison allows us to eliminate one candidate for the maxima, i.e.,

The error probability $P_{e}$ of this decoding rule is then given by,

where $D_{+}(c_{1},c_{2})$ is the CH-divergence as defined in (16).

In other words, the CH-divergence provides the error exponent for deciding among multivariate Poisson distributions. We did not find this result in the literature, but found a similar result obtained by Verdú [Ver86], who shows that the Hellinger distance (the special case with $t=1/2$ instead of the maximization over $t$ ) appears in the error exponent for testing Poisson point-processes, although [Ver86] does not investigate the exact error exponent. This lemma was proven in [AS15].

This lemma together with previous bounds on $P_{e}$ imply that if $D_{+}(c_{i},c_{j})>1$ for all $i\neq j$ , the true hypothesis is correctly recovered with probability $o(1/n)$ . However, it may be that $D_{+}(c_{i},c_{j})>1$ only for a subset of $(i,j)$ -pairs. What can we then infer? While we may not recover the true value of $H$ with probability $o(1/n)$ , we may narrow down the search within a subset of possible hypotheses with that probability of error.

Testing composite multivariate Poisson distributions. We now consider the previous setting, but we are no longer interested in determining the true hypothesis, but in deciding between two (or more) disjoint subsets of hypotheses. Under hypothesis 1, the distribution of $D$ belongs to a set of possible distributions, namely $\mathcal{P}(\lambda_{i})$ where $i\in A$ , and under hypothesis 2, the distribution of $D$ belongs to another set of distributions, namely $\mathcal{P}(\lambda_{i})$ where $i\in B$ . Note that $A$ and $B$ are disjoint subsets such that $A\cup B=[k]$ . In short,

Moreover, applying bounds on the minima of two sums,

The same reasoning can be applied to the problem of deciding whether a given node belongs to a group of communities, with more than two groups. Also, for any $p$ and $p^{\prime}$ such that $|p_{j}-p^{\prime}_{j}|<\ln n/\sqrt{n}$ for each $j$ , $Q$ , $\gamma(n)$ , and $i$ ,

So, the error rate for any algorithm that classifies vertices based on their degree profile in a graph drawn from a sparse SBM is at most $O(1/n^{2})$ more than twice what it would be if the probability distribution of degree profiles really was the poisson distribution.

In summary, we have proved the following.

So, this hypothesis testing procedure has an error probability of at most

With probability $1-o(1)$ , the size of every community is within $\sqrt{\ln n/n}$ of its expected value, and the edge density between each pair of communities is within $\sqrt{n}\ln n$ of its expected value. Also, there exists a constant $c^{\prime}$ such that with probability $1-o(1)$ , no vertex has degree more than $c^{\prime}\ln n$ . Assuming this is the case, the misclassification of vertices can not change the apparent number of edges between any two communities by more than $c^{\prime}\delta n\ln n$ , and it can not change the apparent size of any community by more than $\delta n$ . Thus, $|p^{\prime}_{i}-p_{i}|\leq\sqrt{\ln n/n}+\delta$ and

This lets us prove the following “robust” version of this lemma to prove Theorem 7.

First, note that by the previous lemma, with probability $1-o(1)$ , $G$ is such that every element of the estimates of $p$ and $Q$ is within $c\delta+\sqrt{\ln n/n}$ of its true value. Choose $c_{3}$ such that every vertex in the graph has degree less than $c_{3}\ln n$ with probability $1-o(1)$ . If this holds, then the probability of misclassifying a vertex based on its true degree profile and the estimates of its parameters is at most

based on the previous bounds on the probability that classifying a vertex based on its degree profile fails.

The key observation is that $v$ ’s $m$ th neighbor had at least a $\min_{i,j}(p_{i}q_{i,j})/\sum p_{i^{\prime}}q_{i^{\prime},j}$ chance of actually being in community $\sigma$ for each $\sigma$ , so its probability of being reported as being in community $\sigma$ is at most $1+c^{\prime}_{1}\delta$ times the probability that it actually is. So, the probability that its reported degree profile is bad is at most $(1+c^{\prime}_{1}\delta)^{|N_{1}(v)|}$ times the probability that its actual degree profile is bad. The conclusion follows. ∎

2.3 Proof of Theorem 7

We prove the possibility result here. The converse was proven in [AS15] for known parameters, hence also applies here.

The idea behind Claim 1 is contained in Lemma 16. However, there are several technical steps that need to be handled:

The graphs $G^{\prime}$ and $G^{\prime\prime}$ obtained in step 1 of the algorithm are correlated, since an edge cannot be both in $G^{\prime}$ and $G^{\prime\prime}$ . However, this effect can be discarded since two independent versions would share edges with low enough probability.

The classification in step 2 using Agnostic-sphere-comparison has a vanishing fraction of vertices which are wrongly labeled, and the SBM parameters are unknown. This requires using the robust version of Lemma 16, namely Lemma 18.

In the case where $D_{+}((PQ)_{i},(PQ)_{j})=1$ a more careful classification is needed as carried in steps 3 and 4 of the algorithm.

With probability $1-O(1/n)$ , no vertex in the graph has degree greater than $c_{3}\ln n$ . Assuming that this holds, no vertex’s set of neighbors in $G^{\prime\prime}$ is more than

times as likely to occur as it would be if $G^{\prime\prime}$ were independent of $G^{\prime}$ . So, the fact that they are not has negligible impact on the algorithm’s error rate. Now, let

By Lemma 18, if the classification in step $2$ has an error rate of at most $\delta$ , then the classification in step $3$ has an error rate of

observing that if $\sigma^{\prime}_{v}\neq\sigma_{v}$ the error rate of $\sigma^{\prime\prime}_{v^{\prime}}$ for $v^{\prime}$ adjacent to $v$ is at worst multiplied by a constant. That in turn ensures that the final classification has an error rate of at most