Recovering communities in the general stochastic block model without knowing the parameters

Emmanuel Abbe, Colin Sandon

Introduction

This paper studies the problem of recovering communities in the general stochastic block model with linear size communities, for constant and slowly diverging degree regimes. In contrast to [AS15], this paper does not require knowledge of the SBM parameters. In particular, the problem of learning the model parameters is solved when average degrees are diverging. We next provide some motivations on the problem and further background on the model.

Detecting communities (or clusters) in graphs is a fundamental problem in networks, computer science and machine learning. This applies to a large variety of complex networks (e.g., social and biological networks) as well as to data sets engineered as networks via similarly graphs, where one often attempts to get a first impression on the data by trying to identify groups with similar behavior. In particular, finding communities allows one to find like-minded people in social networks [GN02, NWS], to improve recommendation systems [LSY03, XWZ+14], to segment or classify images [SM97, SHB07], to detect protein complexes [CY06, MPN+99], to find genetically related sub-populations [PSD00, JTZ04], or discover new tumor subclasses [SPT+01].

While a large variety of community detection algorithms have been deployed in the past decades, the understanding of the fundamental limits of community detection has only appeared more recently, in particular for the SBM [Co10, DKMZ11, Mas14, MNS14, ABH14, MNSb, YC14, AS15]. The SBM is a canonical model for community detection [HLL83, WBB76, FMW85, WW87, BC09, KN11, BCLS87, DF89, Bop87, JS98, CK99, CI01, SN97, McS01, BC09, RCY11, CWA12, CSX12], where nn vertices are partitioned into kk communities of relative size pip_{i}, i[k]i\in[k], and pairs of nodes in communities ii and jj connect independently with probability Wi,jW_{i,j}.

Recently the SBM came back to the center of the attention at both the practical level, due to extensions allowing overlapping communities [ABFX08] that have proved to fit well real data sets in massive networks [GB13], and at the theoretical level due to new phase transition phenomena [Co10, DKMZ11, Mas14, MNS14, ABH14, MNSb]. The latter works focus exclusively on the SBM with two symmetric communities, i.e., each community is of the same size and the connectivity in each community is identical. Denoting by pp the intra- and qq the extra-cluster probabilities, most of the results are concerned with two figure of merits: (i) recovery (also called exact recovery or strong consistency), which investigates the regimes of pp and qq for which there exists an algorithm that recovers with high probability the two communities completely [BCLS87, DF89, Bop87, JS98, CK99, CI01, SN97, McS01, BC09, RCY11, CWA12, CSX12, Vu14, YC14], (ii) detection, which investigates the regimes for which there exists an algorithm that recovers with high probability a positively correlated partition [Co10, DKMZ11, MNS12, Mas14, MNS14].

The sharp threshold for exact recovery was obtained in [ABH14, MNSb], showing[MNSb] generalizes this to a,b=Θ(1)a,b=\Theta(1). that for p=alog(n)/np=a\log(n)/n, q=blog(n)/nq=b\log(n)/n, a,b>0a,b>0, exact recovery is solvable if and only if ab2\sqrt{a}-\sqrt{b}\geq 2, with efficient algorithms achieving the threshold provided in [ABH14, MNSb]. In addition, [ABH14] introduces an SDP proved to achieve the threshold in [BH14, Ban15], while [YP14] shows that a spectral algorithm also achieves the threshold. Prior to these, the sharp threshold for detection was obtained in [Mas14, MNS14], showing that detection is solvable (and so efficiently) if and only if (ab)2>2(a+b)(a-b)^{2}>2(a+b), when p=a/np=a/n, q=b/nq=b/n, settling a conjecture made in [DKMZ11] and improving on [Co10].

Besides the detection and the recovery properties, one may ask about the partial recovery of the communities, studied in [MNSa, GV14, Vu14, CRV15, AS15]. Of particular interest to this paper is the case of almost exact recovery (also called weak consistency), where only a vanishing fraction of the nodes is allowed to be misclassified. For two-symmetric communities, [MNSb] shows that almost exact recovery is possible if and only if n(pq)2/(p+q)n(p-q)^{2}/(p+q) diverges, generalized in [AS15] for general SBMs.

In the next section, we discuss the results for the general SBM of interest in this paper and the problem of learning the model parameters. We conclude this section by providing motivations on the problem of achieving the threshold with an efficient and universal algorithm.

Threshold phenomena have long been studied in fields such as information theory (e.g., Shannon’s capacity) and constraint satisfaction problems (e.g., the SAT threshold). In particular, the quest of achieving the threshold has generated major algorithmic developments in these fields (e.g., LDPC codes, polar codes, survey propagation to name a few). Likewise, identifying thresholds in community detection models is key to benchmark and guide the development of clustering algorithms. Most reasonable algorithms may succeed in some regimes, while in others they may be doomed to fail due to computational barriers. However, it is particularly crucial to develop benchmarks that do not depend sensitively on the knowledge of the model parameters. A natural question is hence whether one can solve the various recovery problems in the SBM without having access to the parameters. This paper answers this question by the affirmative for the exact and almost exact recovery of the communities.

Most of the previous works are concerned with the SBM having symmetric communities (mainly 2 or sometimes kk), with the exception of [Vu14] which provides some achievability results for the general SBM.[GV14] also study variations of the kk-symmetric model. Recently, [AS15] studied the fundamental limits for the general SBM, with results as follows (where SBM(n,p,W)(n,p,W) is the SBM with community prior pp and connectivity matrix WW).

The algorithm Sphere-comparison is proposed that solves partial recovery with exponential accuracy and quasi-linear complexity when the SNR diverges, solving in particular almost exact recovery.

Note that for kk symmetric clusters, SNR reduces to (ab)2k(a+(k1)b)\frac{(a-b)^{2}}{k(a+(k-1)b)}, which is the quantity of interest for detection [DKMZ11, MNS12]. Moreover, the SNR must diverge to ensure almost exact recovery in the symmetric case [AS15]. The following is an important consequence of the previous theorem, as it shows that Sphere-comparison achieves almost exact recovery when the entries of QQ are scaled.

II. Exact recovery in the general SBM. The second result in [AS15] is for the regime where the connectivity matrix scales as log(n)Q/n\log(n)Q/n, QQ fixed, where it is shown that exact recovery has a sharp threshold characterized by the divergence function

named the CH-divergence in [AS15]. Specifically, if all pairs of columns in diag(p)Q\operatorname{diag}(p)Q are at D+D_{+}-distance at least 1 from each other, then exact recovery is solvable in the general SBM. This provides in particular an operational meaning to a new divergence function analog to the KL-divergence in the channel coding theorem (see Section 2.3 in [AS15]). Moreover, an algorithm (Degree-profiling) is developed that solves exact recovery down to the D+D_{+} limit in quasi-linear time, showing that exact recovery has no informational to computational gap (as opposed to the conjectures made for detection with more than 4 communities [DKMZ11]). The following gives a more general statement characterizing which subset of communities can be extracted — see Definition 3 for formal definitions.

In particular, exact recovery is information-theoretically solvable in SBM(n,p,Qlog(n)/n)(n,p,Q\log(n)/n) if and only if mini,j[k],ijD+((PQ)i(PQ)j)1\min_{i,j\in[k],i\neq j}D_{+}((PQ)_{i}||(PQ)_{j})\geq 1. (ii) The Degree-profiling algorithm (see [AS15]) recovers the finest partition that can be recovered with probability 1on(1)1-o_{n}(1) and runs in o(n1+ϵ)o(n^{1+\epsilon}) time for all ϵ>0\epsilon>0. In particular, exact recovery is efficiently solvable whenever it is information-theoretically solvable.

In summary, exact or almost exact recovery is closed for the general SBM (and detection is closed for 2 symmetric communities). However this is for the case where the parameters of the SBM are assumed to be known, and with linear-size communities.

2 Estimating the parameters

For the estimation of the parameter, some results are known for two-symmetric communities. In the logarithmic degree regime, since the SDP is agnostic to the parameters (it is a relaxation of the min-bisection), and the parameters can be estimated by recovering the communities [ABH14, BH14, Ban15]. For the constant-degree regime, [MNS12] shows that the parameters can be estimated above the threshold by counting cycles (which is efficiently approximated by counting non-backtracking walks). These are however for a fixed number of communities, namely 2. We also became recently aware of a parallel work [BCS15], which considers private graphon estimation (including SBMs). In particular, for the logarithmic degree regime, [BCS15] obtains a procedure to estimate parameters of graphons in an appropriate version of the L2L_{2} norm. This procedure is however not efficient.

For the general SBM, the results of [AS15] allow to find communities efficiently, however these rely on the knowledge of the parameters. Hence, a major open problem is to understand if these results can be extended without such a knowledge.

Results

Agnostic algorithms are developed for the constant and diverging node degrees. These afford optimal accuracy scaling for large node degrees and achieve the CH-divergence limit for logarithmic node degrees in quasi-linear time. In particular, these solve the parameter estimation problems for SBM(n,p,ω(1)Q)(n,p,\omega(1)Q) without knowing the number of communities. An example with real data is provided in Section 4.

The general stochastic block model SBM(n,p,W)\text{SBM}(n,p,W) is a random graph ensemble defined on the vertex-set V=[n]V=[n], where each vertex vVv\in V is assigned independently a hidden (or planted) label σv\sigma_{v} in [k][k] under a probability distribution p=(p1,,pk)p=(p_{1},\dots,p_{k}) on [k][k], and each (unordered) pair of nodes (u,v)V×V(u,v)\in V\times V is connected independently with probability Wσu,σvW_{\sigma_{u},\sigma_{v}}, where Wσu,σvW_{\sigma_{u},\sigma_{v}} is specified by a symmetric k×kk\times k matrix WW with entries in $.Notethat. Note thatG\sim\text{SBM}(n,p,W)denotesarandomgraphdrawnunderthismodel,withoutthehidden(orplanted)clusters(i.e.,thelabelsdenotes a random graph drawn under this model, without the hidden (or planted) clusters (i.e., the labels\sigma_{v}$ ) revealed. The goal is to recover these labels by observing only the graph.

This paper focuses on pp independent of nn (the communities have linear size), WW dependent on nn such that the average node degrees are either constant or logarithmically growing, and kk fixed. These assumptions on pp and kk could be relaxed, for example to slowly growing kk, but we leave this for future work. As discussed in the introduction, the above regimes for WW are both motivated by applications, as networks are typically sparse [LLDM08, Str01] though the average degrees may not be small, and by the fact that interesting mathematical phenomena take place in these regimes. For convenience, we attribute specific notations for the model in these regimes:

(Partial recovery.) An algorithm recovers or detects communities in SBM(n,p,W)\text{SBM}(n,p,W) with an accuracy of α\alpha\in, if it outputs a labelling of the nodes {σ(v),vV}\{\sigma^{\prime}(v),v\in V\}, which agrees with the true labelling σ\sigma on a fraction α\alpha of the nodes with probability 1on(1)1-o_{n}(1). The agreement is maximized over relabellings of the communities.

(Exact recovery.) Exact recovery is solvable in SBM(n,p,W)\text{SBM}(n,p,W) for a community partition [k]=s=1tAs[k]=\sqcup_{s=1}^{t}A_{s}, where AsA_{s} is a subset of [k][k], if there exists an algorithm that takes GSBM(n,p,W)G\sim SBM(n,p,W) and assigns to each node in GG an element of {A1,,At}\{A_{1},\dots,A_{t}\} that contains its true communityThis is again up to relabellings of the communities. with probability 1on(1)1-o_{n}(1). Exact recovery is solvable in SBM(n,p,W)\text{SBM}(n,p,W) if it is solvable for the partition of [k][k] into kk singletons, i.e., all communities can be recovered.

The problem is solvable information-theoretically if there exists an algorithm that solves it, and efficiently if the algorithm runs in polynomial-time in nn. Note that exact recovery for the partition [k]={i}([k]{i})[k]=\{i\}\sqcup([k]\setminus\{i\}) is equivalent to extracting community ii. In general, recovering a partition [k]=s=1tAs[k]=\sqcup_{s=1}^{t}A_{s} is equivalent to merging the communities that are in a common subset AsA_{s} and recovering the merged communities. Note also that exact recovery in SBM(n,p,W)\text{SBM}(n,p,W) requires the graph not to have vertices of degree in multiple communities with high probability (i.e., connectivity in the symmetric case). Therefore, for exact recovery, we focus below on W=ln(n)nQW=\frac{\ln(n)}{n}Q where QQ is fixed.

2 Partial recovery

Our main result in the Appendix (Theorme 6) applies to SBM(n,p,Q/n)(n,p,Q/n) with arbitrary QQ. We provided here a specific instance which is easier to parse.

Note that a vertex in community ii has degree with probability exponential in cc, and there is no way to differentiate between vertices of degree from different communities. So, an error rate that decreases exponentially with cc is optimal. The above gives in particular the parameter estimation in the case c=ω(1)c=\omega(1) (see also Lemma 17 in the Appendix).

The general result in the Appendix yields the following refined results in the kk-block symmetric case.

Consider the kk-block symmetric case. In other words, pi=1kp_{i}=\frac{1}{k} for all ii, and Qi,jQ_{i,j} is α\alpha if i=ji=j and β\beta otherwise. The vector whose entries are all 11s is an eigenvector of PQPQ with eigenvalue α+(k1)βk\frac{\alpha+(k-1)\beta}{k}, and every vector whose entries add up to is an eigenvector of PQPQ with eigenvalue αβk\frac{\alpha-\beta}{k}. So, λ=α+(k1)βk\lambda=\frac{\alpha+(k-1)\beta}{k} and λ=αβk\lambda^{\prime}=\frac{\alpha-\beta}{k} and (λ)2λ=(ab)2k(a+(k1)β).\frac{(\lambda^{\prime})^{2}}{\lambda}=\frac{(a-b)^{2}}{k(a+(k-1)\beta)}. Then, as long as 1609k(α+(k1)β)7<(αβ)8\frac{160}{9}k(\alpha+(k-1)\beta)^{7}<(\alpha-\beta)^{8} and 4k(α+(k1)β)3<(αβ)44k(\alpha+(k-1)\beta)^{3}<(\alpha-\beta)^{4}, there exist a constant c>0c>0 such that Agnostic-sphere-comparison(G,1/kG,1/k) detects communities with an accuracy of 1O(ec(αβ)2/(α+(k1)β))1-O(e^{-c(\alpha-\beta)^{2}/(\alpha+(k-1)\beta)}) for sufficiently large (αβ)2/(α+(k1)β)(\alpha-\beta)^{2}/(\alpha+(k-1)\beta).

We refer to Section 4 for an example of implementation with real data.

3 Exact recovery

We next show that this can be achieved without knowing the parameters. Recall that the finest partition is the largest partition of [k][k] that ensure (19).

The proof assumes that the entries of QQ are non-zero, see Remark 1 for zero entries. To achieve this result we rely on a two step procedure. First an algorithm is developed to recover all but a vanishing fraction of nodes — this is the main focus of our partial recovery result — and then a procedure is used to “clean up” the leftover graphs using the node degrees of the preliminary classification. This turns out to be much more efficient than aiming for an algorithm that directly achieves exact recovery. We already used this technique in [AS15], but here we also deal with the difficulties resulting from not knowing the SBM’s parameters.

Proof Techniques and Algorithms

For any vertex vv, let Nr(v)N_{r}(v) be the set of all vertices with shortest path to vv of length rr. We also refer to the vector with ii-th entry equal to the number of vertices in Nr(v)N_{r}(v) that are in community ii as Nr(v)N_{r}(v). If there are multiple graphs that vv could be considered a vertex in, let Nr[G](v)N_{r[G]}(v) be the set of all vertices with shortest paths in GG to vv of length rr.

One could probably determine PQPQ and eσe_{\sigma} given the values of (PQ)reσv(PQ)^{r}e_{\sigma_{v}} for a few different rr, but using Nr(v)N_{r}(v) to approximate that would require knowing how many of the vertices in Nr(v)N_{r}(v) are in each community. So, we attempt to get information relating to how many vertices in Nr(v)N_{r}(v) are in each community by checking how it connects to Nr(v)N_{r^{\prime}}(v^{\prime}) for some vertex vv^{\prime} and integer rr^{\prime}. The obvious way to do this would be to compute the cardinality of their intersection. Unfortunately, whether a given vertex in community ii is in Nr(v)N_{r}(v) is not independent of whether it is in Nr(v)N_{r^{\prime}}(v^{\prime}), which causes the cardinality of Nr(v)Nr(v)|N_{r}(v)\cap N_{r^{\prime}}(v^{\prime})| to differ from what one would expect badly enough to disrupt plans to use it for approximations.

In order to get around this, we randomly assign every edge in GG to a set EE with probability cc. We hence define the following.

Note that EE and G\EG\backslash E are disjoint; however, GG is sparse enough that even if they were generated independently a given pair of vertices would have an edge between them in both with probability O(1n2)O(\frac{1}{n^{2}}). So, EE is approximately independent of G\EG\backslash E. Thus, for any v1Nr[G/E](v)v_{1}\in N_{r[G/E]}(v) and v2Nr[G/E](v)v_{2}\in N_{r^{\prime}[G/E]}(v^{\prime}), (v1,v2)E(v_{1},v_{2})\in E with a probability of approximately cQσv1,σv2/ncQ_{\sigma_{v_{1}},\sigma_{v_{2}}}/n. As a result,

Let λ1,...,λh\lambda_{1},...,\lambda_{h} be the distinct eigenvalues of PQPQ, ordered so that λ1λ2...λh0|\lambda_{1}|\geq|\lambda_{2}|\geq...\geq|\lambda_{h}|\geq 0. Also define hh^{\prime} so that h=hh^{\prime}=h if λh0\lambda_{h}\neq 0 and h=h1h^{\prime}=h-1 if λh=0\lambda_{h}=0. If WiW_{i} is the eigenspace of PQPQ corresponding to the eigenvalue λi\lambda_{i}, and PWiP_{W_{i}} is the projection operator on to WiW_{i}, then

where the final equality holds because for all iji\neq j,

and since λiλj\lambda_{i}\neq\lambda_{j}, this implies that PWi(eσv)P1PWj(eσv)=0P_{W_{i}}(e_{\sigma_{v}})\cdot P^{-1}P_{W_{j}}(e_{\sigma_{v^{\prime}}})=0. In order to simplify the terminology,

Let ζi(vv)=PWi(eσv)P1PWi(eσv)\zeta_{i}(v\cdot v^{\prime})=P_{W_{i}}(e_{\sigma_{v}})\cdot P^{-1}P_{W_{i}}(e_{\sigma_{v^{\prime}}}) for all ii, vv, and vv^{\prime}.

Equation (14) is dominated by the λ1r+r+1\lambda_{1}^{r+r^{\prime}+1} term, so getting good estimate of the λ2r+r+1\lambda_{2}^{r+r^{\prime}+1} through λhr+r+1\lambda_{h^{\prime}}^{r+r^{\prime}+1} terms requires cancelling it out somehow. As a start, if λ1>λ2>λ3\lambda_{1}>\lambda_{2}>\lambda_{3} then

Note that the left hand side of this expression is equal to detNr,r[E](vv)Nr+1,r[E](vv)Nr+1,r[E](vv)Nr+2,r[E](vv)\det{}\begin{vmatrix}N_{r,r^{\prime}[E]}(v\cdot v^{\prime})&N_{r+1,r^{\prime}[E]}(v\cdot v^{\prime})\\ N_{r+1,r^{\prime}[E]}(v\cdot v^{\prime})&N_{r+2,r^{\prime}[E]}(v\cdot v^{\prime})\end{vmatrix}. More generally, in order to get an expression that can be used to estimate the λi\lambda_{i} and ζi(vv)\zeta_{i}(v\cdot v^{\prime}), we consider the determinant of the following.

Let Mm,r,r[E](vv)M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime}) be the m×mm\times m matrix such that Mm,r,r[E](vv)i,j=Nr+i+j,r[E](vv)M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})_{i,j}=N_{r+i+j,r^{\prime}[E]}(v\cdot v^{\prime}) for each ii and jj.

To the degree that approximation 8 holds and cc is small, each column of Mm,r,r[E](vv)M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime}) is a linear combination of the vectors

with coefficients that depend only on {λ1,...,λh}\{\lambda_{1},...,\lambda_{h}\}. So, by linearity of the determinant in one column, det(Mm,r,r[E](vv))\det(M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})) is a linear combination of these vector’s wedge products with coefficients that are independent of rr and rr^{\prime}. By antisymmetry of wedge products, only the products that use mm different such vectors contribute to the determinant, and the products involving the eigenvalues of highest magnitude will dominate. As a result, there exist constants γ(λ1,...,λm)\gamma(\lambda_{1},...,\lambda_{m}) and γ(λ1,...,λm)\gamma^{\prime}(\lambda_{1},...,\lambda_{m}) such that

if λm=λm+1|\lambda_{m}|=|\lambda_{m+1}|. These facts suggest the following plan for estimating the eigenvalues corresponding to a graph. First, pick several vertices at random. Then, use the fact that Nr[G\E](v)((1c)λ1)r|N_{r[G\backslash E]}(v)|\approx((1-c)\lambda_{1})^{r} for any good vertex vv to estimate λ1\lambda_{1}. Next, use the formulas above about det(Mm,r,r[E](vv))\det(M_{m,r,r^{\prime}[E]}(v\cdot v)) to get an approximation of hh^{\prime} and all of PQPQ’s eigenvalues for each selected vertex. Finally, take the median of these estimates.

Now, note that whether or not λm=λm+1|\lambda_{m}|=|\lambda_{m+1}|, we have

This fact can be used in combination with estimates of PQPQ’s eigenvalues to approximate ζi(vv)\zeta_{i}(v\cdot v^{\prime}) for arbitrary vv, vv^{\prime}, and ii.

Of course, this requires rr and rr^{\prime} to be large enough that

is large relative to the error terms for all ihi\leq h^{\prime}. At a minimum, that requires that (1c)λir+r+1=ω(n)|(1-c)\lambda_{i}|^{r+r^{\prime}+1}=\omega(n).

On a different note, for any vv and vv^{\prime},

with equality for all ii if and only if σv=σv\sigma_{v}=\sigma_{v^{\prime}}, so sufficiently good approximations of ζi(vv),ζi(vv)\zeta_{i}(v\cdot v),\zeta_{i}(v\cdot v^{\prime}) and ζi(vv)\zeta_{i}(v^{\prime}\cdot v^{\prime}) can be used to determine which pairs of vertices are in the same community.

One could generate a reasonable classification based solely on this method of comparing vertices. However, that would require computing Nr,r[E](vv)N_{r,r^{\prime}[E]}(v\cdot v) for every vertex in the graph with fairly large r+rr+r^{\prime}, which would be slow. Instead, we use the fact that for any vertices vv, vv^{\prime}, and vv^{\prime\prime} with σv=σvσv\sigma_{v}=\sigma_{v^{\prime}}\neq\sigma_{v^{\prime\prime}},

for all ii, and the inequality is strict for at least one ii. So, subtracting ζi(vv)\zeta_{i}(v\cdot v) from both sides gives us that

for all ii, and the inequality is still strict for at least one ii.

So, given a representative vertex in each community, we can determine which of them a given vertex, vv, is in the same community as without needing to know the value of ζi(vv)\zeta_{i}(v\cdot v).

This runs fairly quickly if ζi(vv)\zeta_{i}(v\cdot v^{\prime}) is approximated using Nr,r[E](vv)N_{r,r^{\prime}[E]}(v^{\prime}\cdot v) such that rr is large and rr^{\prime} is small because the algorithm only requires focusing on Nr(v)|N_{r^{\prime}}(v)| vertices. This leads to the following plan for partial recovery. First, randomly select a set of vertices that is large enough to contain at least one vertex from each community with high probability. Next, compare all of the selected vertices in an attempt to determine which of them are in the same communities. Then, pick one in each community. After that, use the algorithm referred to above to attempt to determine which community each of the remaining vertices is in. As long as there actually was at least one vertex from each community in the initial set and none of the approximations were particularly bad, this should give a reasonably accurate classification.

The risk that this randomly gives a bad classification due to a bad set of initial vertices can be mitigated as follows. First, repeat the previous classification procedure several times. Assuming that the procedure gives a good classification more often than not, the good classifications should comprise a set that contains more than half the classifications and which has fairly little difference between any two elements of the set. Furthermore, any such set would have to contain at least one good classification, so none of its elements could be too bad. So, find such a set and average its classifications together. This completes the Agnostic-Sphere-comparison-algorithm. We refer to Section 6 for a detailed version.

2 Exact recovery and the Agnostic-degree-profiling algorithm

This is proved in [AS15]. In the case of unknown parameters, the algorithmic approach is largely unchanged, adding a step where the best known classification is used to estimate PP and QQ prior to any step in which vertices are classified based on their neighbors. The analysis of the algorithm requires however some careful handling.

First, it is necessary to prove that given a labelling of the graph’s vertices with an error rate of xx, one can compute approximations of PP and QQ that are within O(x+log(n)/n)O(x+\log(n)/\sqrt{n}) of their true values with probability 1o(1)1-o(1). Secondly, one needs to modify the robust degree profiling lemma to show that attempting to determine vertices’ communities based on estimates of pp and QQ that are off by at most δ\delta, pp^{\prime} and QQ^{\prime}, and a classification of its neighbors that has an error rate of δ\delta classifies the vertices with an error rate only eO(δlogn)e^{O(\delta\log n)} times higher than it would be given accurate values of pp and QQ and accurate classifications of the vertices’ neighbors. Combining these yields the conclusion that any errors in the estimates of the SBM’s parameters do not disrupt vertex classification any worse than the errors in the preliminary classifications already were.

The Agnostic-degree-profiling algorithm. The inputs are (G,γ)(G,\gamma), where GG is a graph, and γ\gamma\in (see Theorem 7 for how to set γ\gamma).

The algorithm outputs an assignment of each vertex to one of the groups of communities {A1,,At}\{A_{1},\dots,A_{t}\}, where A1,,AtA_{1},\dots,A_{t} is the partition of [k][k] in to the largest number of subsets such that D+((pQ)i,(pQ)j)1D_{+}((pQ)_{i},(pQ)_{j})\geq 1 for all i,ji,j in [k][k] that are in different subsets (this is called the “finest partition”). It does the following:

(1) Define the graph gg^{\prime} on the vertex set [n][n] by selecting each edge in gg independently with probability γ\gamma, and define the graph gg^{\prime\prime} that contains the edges in gg that are not in gg^{\prime}.

(2) Run Agnostic-sphere-comparison on gg^{\prime} to obtain the preliminary classification σ[k]n\sigma^{\prime}\in[k]^{n} (see Section 7.1.)

(3) Determine the size of each alleged community, and the edge density between each pair of alleged communities.

(4) For each node v[n]v\in[n], determine in which community node vv is most likely to belong to based on its degree profile computed from the preliminary classification σ\sigma^{\prime} (see Section 7.2.2), and call it σv\sigma^{\prime\prime}_{v}

(5) Use σv\sigma^{\prime\prime}_{v} to get new estimates of pp and QQ.

(6) For each node v[n]v\in[n], determine in which group A1,,AtA_{1},\dots,A_{t} node vv is most likely to belong to based on its degree profile computed from the preliminary classification σ\sigma^{\prime\prime} (see Section 7.2.2).

An example with real data

We have tested a simplified version of our algorithm on the data from “The political blogosphere and the 2004 US Election” [AG05], which contains a list of political blogs that were classified as liberal or conservative, and links between the blogs.

The algorithm we used has a few major modifications relative to our standard algorithm. First of all, instead of using Nr,r(vv)N_{r,r^{\prime}}(v\cdot v^{\prime}) as its basic tool for comparing vertices, it uses a different measure, Nr,r(vv)N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime}) which is defined as the fraction of pairs of an edge leaving the ball of radius rr centered on vv and an edge leaving the ball of radius rr^{\prime} centered on vv^{\prime} which hit the same vertex but are not the same edge. Making the measure a fraction of the pairs rather than a count of pairs was necessary to prevent Nr,r(vv)N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime}) from being massively dependent on the degrees of vv and vv^{\prime}, which would have resulted in the increased variance in vertex degree obscuring the effects of σv\sigma_{v} and σv\sigma_{v^{\prime}} on Nr,r(vv)N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime}). The other changes to the definition make the measure somewhat less reliable, but it is still useable as long as the average degree is fairly high and vvv\neq v^{\prime}.

Secondly, the version of Vertex-comparison-algorithm we used simply concludes that two vertices, vv and vv^{\prime}, are in different communities if Nr,r(vv)N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime}) is below average and the same community otherwise. This is reasonable because of the following facts. For one thing, because the normalization converts the dominant term to a constant, Nr,r(vv)N^{\prime}_{r,r^{\prime}}(v\cdot v^{\prime}) is approximately affine in (λ2/λ1)r+rζ2(vv)/n(\lambda_{2}/\lambda_{1})^{r+r^{\prime}}\zeta_{2}(v\cdot v^{\prime})/n. Also, as a result of the symmetry between communities, ζ2(vv)\zeta_{2}(v\cdot v) is the same for all vv. So, ζ2(vv)2ζ2(vv)+ζ2(vv)\zeta_{2}(v\cdot v)-2\zeta_{2}(v\cdot v^{\prime})+\zeta_{2}(v^{\prime}\cdot v^{\prime}) is also affine in ζ2(vv)\zeta_{2}(v\cdot v^{\prime}). Furthermore, there are only two possible values of ζ2(vv)\zeta_{2}(v\cdot v^{\prime}) by symmetry, and λ2>0\lambda_{2}>0, so ζ2(vv)2ζ2(vv)+ζ2(vv)>0\zeta_{2}(v\cdot v)-2\zeta_{2}(v\cdot v^{\prime})+\zeta_{2}(v^{\prime}\cdot v^{\prime})>0 iff (λ2/λ1)r+rζ2(vv)/n(\lambda_{2}/\lambda_{1})^{r+r^{\prime}}\zeta_{2}(v\cdot v^{\prime})/n is below average. Finally, because both communities have the same average degree, ζ1(v,v)\zeta_{1}(v,v^{\prime}) is independent of vv and vv^{\prime} so ζ1(vv)2ζ1(vv)+ζ1(vv)\zeta_{1}(v\cdot v)-2\zeta_{1}(v\cdot v^{\prime})+\zeta_{1}(v^{\prime}\cdot v^{\prime}) is always . The version of Vertex-classification-algorithm we used is comparably modified.

Finally, our algorithm generates reference vertices by repeatedly picking two vertices at random and comparing them. If it concludes that they are in different communities and they both have above-average degree, it accepts them as reference vertices; otherwise it tries again. Requiring above-average degree is useful because a higher degree vertex is less likely to have its neighborhood distorted by a couple of atypical neighbors.

Out of 4040 trials, the resulting algorithm gave a reasonably good classification 3737 times. Each of these classified all but 5656 to 6767 of the 12221222 vertices in the graph’s main component correctly. The state-of-the-art described in [CG15] gives a lowest value at 58, with the best algorithms around 6060, while algorithms regularized spectral methods such as the one in [QR13] obtain about 80 errors.

Open problems

The current result should also extend directly to a slowly growing number of communities (e.g., up to logarithmic). It would be interesting to extend the current approach to smaller sized communities or larger numbers of communities (watching the complexity scaling with the number of communities), as well as more general models with corrected-degrees, labeled-edges, or overlapping communities (though linear-sized overlapping communities can be treated with the approach of [AS15]).

References

The Agnostic-sphere-comparison algorithm in details

Recall the following motivation and definitions.

For any vertex vv, let Nr(v)N_{r}(v) be the set of all vertices with shortest path to vv of length rr. We also refer to the vector with ii-th entry equal to the number of vertices in Nr(v)N_{r}(v) that are in community ii as Nr(v)N_{r}(v). If there are multiple graphs that vv could be considered a vertex in, let Nr[G](v)N_{r[G]}(v) be the set of all vertices with shortest paths in GG to vv of length rr.

One could probably determine PQPQ and eσe_{\sigma} given the values of (PQ)reσv(PQ)^{r}e_{\sigma_{v}} for a few different rr, but using Nr(v)N_{r}(v) to approximate that would require knowing how many of the vertices in Nr(v)N_{r}(v) are in each community. So, we attempt to get information relating to how many vertices in Nr(v)N_{r}(v) are in each community by checking how it connects to Nr(v)N_{r^{\prime}}(v^{\prime}) for some vertex vv^{\prime} and integer rr^{\prime}. The obvious way to do this would be to compute the cardinality of their intersection. Unfortunately, whether a given vertex in community ii is in Nr(v)N_{r}(v) is not independent of whether it is in Nr(v)N_{r^{\prime}}(v^{\prime}), which causes the cardinality of Nr(v)Nr(v)|N_{r}(v)\cap N_{r^{\prime}}(v^{\prime})| to differ from what one would expect badly enough to disrupt plans to use it for approximations.

In order to get around this, we randomly assign every edge in GG to a set EE with probability cc. We hence define the following.

Note that EE and G\EG\backslash E are disjoint; however, GG is sparse enough that even if they were generated independently a given pair of vertices would have an edge between them in both with probability O(1n2)O(\frac{1}{n^{2}}). So, EE is approximately independent of G\EG\backslash E. Thus, for any v1Nr[G/E](v)v_{1}\in N_{r[G/E]}(v) and v2Nr[G/E](v)v_{2}\in N_{r^{\prime}[G/E]}(v^{\prime}), (v1,v2)E(v_{1},v_{2})\in E with a probability of approximately cQσv1,σv2/ncQ_{\sigma_{v_{1}},\sigma_{v_{2}}}/n. As a result,

Let λ1,...,λh\lambda_{1},...,\lambda_{h} be the distinct eigenvalues of PQPQ, ordered so that λ1λ2...λh0|\lambda_{1}|\geq|\lambda_{2}|\geq...\geq|\lambda_{h}|\geq 0. Also define hh^{\prime} so that h=hh^{\prime}=h if λh0\lambda_{h}\neq 0 and h=h1h^{\prime}=h-1 if λh=0\lambda_{h}=0. If WiW_{i} is the eigenspace of PQPQ corresponding to the eigenvalue λi\lambda_{i}, and PWiP_{W_{i}} is the projection operator on to WiW_{i}, then

where the final equality holds because for all iji\neq j,

and since λiλj\lambda_{i}\neq\lambda_{j}, this implies that PWi(eσv)P1PWj(eσv)=0P_{W_{i}}(e_{\sigma_{v}})\cdot P^{-1}P_{W_{j}}(e_{\sigma_{v^{\prime}}})=0. In order to simplify the terminology,

Let ζi(vv)=PWi(eσv)P1PWi(eσv)\zeta_{i}(v\cdot v^{\prime})=P_{W_{i}}(e_{\sigma_{v}})\cdot P^{-1}P_{W_{i}}(e_{\sigma_{v^{\prime}}}) for all ii, vv, and vv^{\prime}.

Equation (14) is dominated by the λ1r+r+1\lambda_{1}^{r+r^{\prime}+1} term, so getting good estimate of the λ2r+r+1\lambda_{2}^{r+r^{\prime}+1} through λhr+r+1\lambda_{h^{\prime}}^{r+r^{\prime}+1} terms requires cancelling it out somehow. As a start, if λ1>λ2>λ3\lambda_{1}>\lambda_{2}>\lambda_{3} then

Note that the left hand side of this expression is equal to detNr,r[E](vv)Nr+1,r[E](vv)Nr+1,r[E](vv)Nr+2,r[E](vv)\det{}\begin{vmatrix}N_{r,r^{\prime}[E]}(v\cdot v^{\prime})&N_{r+1,r^{\prime}[E]}(v\cdot v^{\prime})\\ N_{r+1,r^{\prime}[E]}(v\cdot v^{\prime})&N_{r+2,r^{\prime}[E]}(v\cdot v^{\prime})\end{vmatrix}. More generally, in order to get an expression that can be used to estimate the λi\lambda_{i} and ζi(vv)\zeta_{i}(v\cdot v^{\prime}), we consider the determinant of the following.

Let Mm,r,r[E](vv)M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime}) be the m×mm\times m matrix such that Mm,r,r[E](vv)i,j=Nr+i+j,r[E](vv)M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})_{i,j}=N_{r+i+j,r^{\prime}[E]}(v\cdot v^{\prime}) for each ii and jj.

To the degree that approximation 8 holds and cc is small, each column of Mm,r,r[E](vv)M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime}) is a linear combination of the vectors

with coefficients that depend only on {λ1,...,λh}\{\lambda_{1},...,\lambda_{h}\}. So, by linearity of the determinant in one column, det(Mm,r,r[E](vv))\det(M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime})) is a linear combination of these vector’s wedge products with coefficients that are independent of rr and rr^{\prime}. By antisymmetry of wedge products, only the products that use mm different such vectors contribute to the determinant, and the products involving the eigenvalues of highest magnitude will dominate. As a result, there exist constants γ(λ1,...,λm)\gamma(\lambda_{1},...,\lambda_{m}) and γ(λ1,...,λm)\gamma^{\prime}(\lambda_{1},...,\lambda_{m}) such that

if λm=λm+1|\lambda_{m}|=|\lambda_{m+1}|. In the later case, λm=λm+1\lambda_{m}=-\lambda_{m+1}, so either γ(λ1,...,λm)λmr+r+1\gamma(\lambda_{1},...,\lambda_{m})\lambda_{m}^{r+r^{\prime}+1} and γ(λ1,...,λm)λm+1r+r+1\gamma^{\prime}(\lambda_{1},...,\lambda_{m})\lambda_{m+1}^{r+r^{\prime}+1} have the same sign or γ(λ1,...,λm)λmr+r+2\gamma(\lambda_{1},...,\lambda_{m})\lambda_{m}^{r+r^{\prime}+2} and γ(λ1,...,λm)λmr+r+2\gamma^{\prime}(\lambda_{1},...,\lambda_{m})\lambda_{m}^{r+r^{\prime}+2} have the same sign. In any of these cases,

This suggests the following algorithm for finding PQPQ’s eigenvalues.

The Basic-Eigenvalue-approximation-algorithm The inputs are (E,c,v)(E,c,v), where vv is a vertex of the graph, c(0,1)c\in(0,1), and EE is a subset of GG’s edges.

The algorithm ouputs a claim about how many distinct nonzero eigenvalues PQPQ has and a list of approximations of them.

(1) Compute Nr[G\E](v)N_{r[G\backslash E]}(v) for each rr until Nr[G\E](v)>n|N_{r[G\backslash E]}(v)|>\sqrt{n}, and then set λ1=n2r/(1c)\lambda_{1}^{\prime\prime}=\sqrt[2r]{n}/(1-c).

(2) Set r=r=23logn/log((1c)λ1)lnnr=r^{\prime}=\frac{2}{3}\log n/\log((1-c)\lambda^{\prime\prime}_{1})-\sqrt{\ln n}. Then, compute

until an mm is found for which this expression is less than ((1c)λ1)3/4+1lnn((1-c)\lambda^{\prime\prime}_{1})^{3/4}+\frac{1}{\sqrt{\ln n}}. Then, set h=m1h^{\prime\prime}=m-1.

unless det(Mi,r+1,r[E](vv))<det(Mi,r,r[E](vv))det(Mi,r+2,r[E](vv))|\det(M_{i,r+1,r^{\prime}[E]}(v\cdot v^{\prime}))|<\sqrt{|\det(M_{i,r,r^{\prime}[E]}(v\cdot v^{\prime}))|\cdot|\det(M_{i,r+2,r^{\prime}[E]}(v\cdot v^{\prime}))|}, in which case set

Repeat this for each ihi\leq h^{\prime\prime}

(4) Next, for each i<hi<h^{\prime\prime}, if λiλi+1<1lnn||\lambda^{\prime}_{i}|-|\lambda^{\prime}_{i+1}||<\frac{1}{\ln n} then set λi=λi\lambda^{\prime}_{i}=|\lambda^{\prime}_{i}| and λi+1=λi+1\lambda^{\prime}_{i+1}=-|\lambda^{\prime}_{i+1}|. For each ihi\leq h^{\prime\prime} such that λiλi+11lnn||\lambda^{\prime}_{i}|-|\lambda^{\prime}_{i+1}||\geq\frac{1}{\ln n} and λi1λi1lnn||\lambda^{\prime}_{i-1}|-|\lambda^{\prime}_{i}||\geq\frac{1}{\ln n} set

(5) Return (λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})

The risk when using this algorithm is that if the set of edges in vv’s immediate neighborhood is sufficiently atypical it may not work correctly. This can be solved by repeating it for several vertices and taking the median estimates.

The Improved-Eigenvalue-approximation-algorithm The input is c(0,1)c\in(0,1)

The algorithm ouputs a claim about how many distinct nonzero eigenvalues PQPQ has and a list of approximations of them.

(1) Create a set of edges EE, that each of GG’s edges is independently assigned to with probability cc.

(2) Randomly select lnn\sqrt{\ln n} of GG’s vertices, vv, vv,…, v[lnn]v[\sqrt{\ln n}].

(3) Run Basic-Eigenvalue-approximation-algorithm(E,c,v[i]) for each ilnni\leq\sqrt{\ln n}, stopping the algorithm prematurely if it takes more than O(nlnn)O(n\sqrt{\ln n}) time.

(4) Return (λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}) where hh^{\prime\prime} and λi\lambda_{i} are the median outputs of the executions of Basic-Eigenvalue-approximation-algorithm for each ii.

Now, note that whether or not λm=λm+1|\lambda_{m}|=|\lambda_{m+1}|, we have

This fact can be used in combination with estimates of PQPQ’s eigenvalues to approximate ζi(vv)\zeta_{i}(v\cdot v^{\prime}) for arbitrary vv, vv^{\prime}, and ii as follows.

The algorithm outputs (z1(vv),...,zh(vv))(z_{1}(v\cdot v^{\prime}),...,z_{h^{\prime\prime}}(v\cdot v^{\prime})) such that zi(vv)ζi(vv)z_{i}(v\cdot v^{\prime})\approx\zeta_{i}(v\cdot v^{\prime}) for all ii.

(1) For each ihi\leq h^{\prime\prime}, set

(2) Return (z1(vv),...,zh(vv))(z_{1}(v\cdot v^{\prime}),...,z_{h^{\prime\prime}}(v\cdot v^{\prime})).

Of course, this requires rr and rr^{\prime} to be large enough that

is large relative to the error terms for all ihi\leq h^{\prime}. At a minimum, that requires that (1c)λir+r+1=ω(n)|(1-c)\lambda_{i}|^{r+r^{\prime}+1}=\omega(n), so

because otherwise the graph will start running out of vertices before one gets rr steps away from vv or rr^{\prime} steps away from vv^{\prime}.

Furthermore, for any vv and vv^{\prime},

with equality for all ii if and only if σv=σv\sigma_{v}=\sigma_{v^{\prime}}, so sufficiently good approximations of ζi(vv),ζi(vv)\zeta_{i}(v\cdot v),\zeta_{i}(v\cdot v^{\prime}) and ζi(vv)\zeta_{i}(v^{\prime}\cdot v^{\prime}) can be used to determine which pairs of vertices are in the same community as follows.

The Vertex-comparison-algorithm. The inputs are (v,v,r,r,E,x,c,(λ1,...,λh))(v,v^{\prime},r,r^{\prime},E,x,c,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})), where v,vv,v^{\prime} are two vertices, r,rr,r^{\prime} are positive integers, EE is a subset of GG’s edges, xx is a positive real number, cc is a real number between and 11, and (λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}) are real numbers.

The algorithm outputs a decision on whether vv and vv^{\prime} are in the same community or not. It proceeds as follows.

(1) Run Vertex-product-approximation-algorithm(v,v’,r,r’,E,c,(λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})),Vertex-product-approximation-algorithm(v,v,r,r’,E,c,(λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})), and Vertex-product-approximation- algorithm(v’,v’,r,r’,E,c,(λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})).

(2) If i:zi(vv)2zi(vv)+zi(vv)>5(2x(minpj)1/2+x2)\exists i:z_{i}(v\cdot v)-2z_{i}(v\cdot v^{\prime})+z_{i}(v^{\prime}\cdot v^{\prime})>5(2x(\min p_{j})^{-1/2}+x^{2}) then conclude that vv and vv^{\prime} are in different communities. Otherwise, conclude that vv and vv^{\prime} are in the same community.

One could generate a reasonable classification based solely on this method of comparing vertices (with an appropriate choice of the parameters, as later detailed). However, that would require computing Nr,r[E](vv)N_{r,r^{\prime}[E]}(v\cdot v) for every vertex in the graph with fairly large r+rr+r^{\prime}, which would be slow. Instead, we use the fact that for any vertices vv, vv^{\prime}, and vv^{\prime\prime} with σv=σvσv\sigma_{v}=\sigma_{v^{\prime}}\neq\sigma_{v^{\prime\prime}},

for all ii, and the inequality is strict for at least one ii. So, subtracting ζi(vv)\zeta_{i}(v\cdot v) from both sides gives us that

for all ii, and the inequality is still strict for at least one ii.

So, given a representative vertex in each community, we can determine which of them a given vertex, vv, is in the same community as without needing to know the value of ζi(vv)\zeta_{i}(v\cdot v) as follows.

The Vertex-classification-algorithm. The inputs are (v[],v,r,r,E,c,λ1,...,λh))(v[],v^{\prime},r,r^{\prime},E,c,\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})), where v[]v[] is a list of vertices, vv^{\prime} is a vertex, r,rr,r^{\prime} are positive integers, EE is a subset of GG’s edges, cc is a real number between and 11, and (λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}) are real numbers. It is assumed that zi(v[σ]v[σ])z_{i}(v[\sigma]\cdot v[\sigma]) has already been computed for each ii and σ\sigma.

The algorithm is supposed to output σ\sigma such that vv^{\prime} is in the same community as v[σ]v[\sigma]. It works as follows.

(1) Run Vertex-product-approximation-algorithm(v[σ]v[\sigma],v’,r,r’,E,c,(λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})) for each σ\sigma.

(2) Find a σ\sigma that minimizes the value of

and conclude that vv^{\prime} is in the same community as v[σ]v[\sigma].

This runs fairly quickly if rr is large and rr^{\prime} is small because the algorithm only requires focusing on Nr(v)N_{r^{\prime}}(v^{\prime}) vertices. This leads to the following plan for partial recovery. First, randomly select a set of vertices that is large enough to contain at least one vertex from each community with high probability. Next, compare all of the selected vertices in an attempt to determine which of them are in the same communities. Then, pick one anchor vertex in each community. After that, use the algorithm above to attempt to determine which community each of the remaining vertices is in. As long as there actually was at least one vertex from each community in the initial set and none of the approximations were particularly bad, this should give a reasonably accurate classification.

The Unreliable-graph-classification-algorithm. The inputs are (G,c,m,ϵ,x,(λ1,...,λh))(G,c,m,\epsilon,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})), where GG is a graph, cc is a real number between and 11, mm is a positive integer, ϵ\epsilon is a real number between and 11, xx is a positive real number, and (λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}) are real numbers.

The algorithm outputs an alleged list of communities for GG. It works as follows.

(1) Randomly assign each edge in GG to EE independently with probability cc.

(2) Randomly select mm vertices in GG, v,...,v[m1]v,...,v[m-1].

(3) Set r=(1ϵ3)logn/log((1c)λ1)lnnr=(1-\frac{\epsilon}{3})\log n/\log((1-c)\lambda^{\prime}_{1})-\sqrt{\ln n} and r=2ϵ3logn/log((1c)λ1)r^{\prime}=\frac{2\epsilon}{3}\cdot\log n/\log((1-c)\lambda^{\prime}_{1})

(4) Compute Nr[G\E](v[i])N_{r^{\prime\prime}[G\backslash E]}(v[i]) for each r<r+2h+3r^{\prime\prime}<r+2h^{\prime\prime}+3 and 0i<m0\leq i<m.

(5) Run vertex-comparison-algorithm(v[i],v[j],r,r’,E,x,(λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})) for every ii and jj

(6) If these give consistent results, randomly select one alleged member of each community v[σ]v^{\prime}[\sigma]. Otherwise, fail.

(7) For every vv^{\prime\prime} in the graph, compute Nr[G\E](v)N_{r^{\prime\prime}[G\backslash E]}(v^{\prime\prime}) for each rrr^{\prime\prime}\leq r^{\prime}. Then, run Vertex-classification-algorithm(v’[],v”, r,r’,E,(λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})) in order to get a hypothesized classification of vv^{\prime\prime}

The risk that this randomly gives a bad classification due to a bad set of initial vertices can be mitigated as follows. First, repeat the previous classification procedure several times. Assuming that the procedure gives a good classification more often than not, the good classifications should comprise a set that contains more than half the classifications and which has fairly little difference between any two elements of the set. Furthermore, any such set would have to contain at least one good classification, so none of its elements could be too bad. So, find such a set and average its classifications together.

So, the overall Agnostic-sphere-comparison-algorithm starts by estimating PQPQ’s eigenvalues. Then, it uses those estimates to pick appropriate values of xx and ϵ\epsilon for the Unreliable-graph-classification-algorithm. Finally, it runs it several times and combines the resulting classifications as explained above. The only inputs it requires are the graph itself and some δ>0\delta>0 such that piδp_{i}\geq\delta for all ii.

The Reliable-graph-classification-algorithm (i.e., Agnostic Sphere comparison). The inputs are (G,m,δ,T(n))(G,m,\delta,T(n)), where GG is a graph, mm is a positive integer, δ\delta is a positive real number, and TT is a function from the positive integers to itself.

The algorithm outputs an alleged list of communities for GG. It works as follows.

(1) Run Improved-Eigenvalue-approximation-algorithm(.1) in order to compute (λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})

(2) Let λ1=λ1+2ln3/2(n)\lambda^{\prime\prime}_{1}=\lambda^{\prime}_{1}+2\ln^{-3/2}(n), λh=λh2ln3/2(n)\lambda^{\prime\prime}_{h^{\prime\prime}}=\lambda^{\prime}_{h^{\prime\prime}}-2\ln^{-3/2}(n), and k=1/δk^{\prime}=\lfloor 1/\delta\rfloor

(3) Let xx be the smallest rational number of minimal numerator such that

(4) Let ϵ\epsilon be the smallest rational number of the form 1z\frac{1}{z} or 11z1-\frac{1}{z} such that (2(λ1)3/(λh)2)1ϵ/3<λ1(2(\lambda^{\prime\prime}_{1})^{3}/(\lambda^{\prime\prime}_{h^{\prime\prime}})^{2})^{1-\epsilon/3}<\lambda^{\prime\prime}_{1} and (1+ϵ/3)>log(λ1)/log((λh)2/2λ1)(1+\epsilon/3)>\log(\lambda^{\prime\prime}_{1})/\log((\lambda^{\prime\prime}_{h^{\prime\prime}})^{2}/2\lambda^{\prime\prime}_{1})

(5) Let cc be the largest unit reciprocal less than 1/91/9 such that all of the following hold:

(6) Run Unreliable-graph-classification-algorithm(G,c,m,ϵ,x,(λ1,...,λh))(G,c,m,\epsilon,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})) T(n)T(n) times and record the resulting classifications.

(7) Find the smallest yy^{\prime\prime} such that there exists a set of more than half of the classifications no two of which have more than yy^{\prime\prime} disagreement, and discard all classifications not in the set. In this step, define the disagreement between two classifications as the minimum disagreement over all bijections between their communities.

(8) For every vertex in GG, randomly pick one of the remaining classifications and assert that it is in the community claimed by that classification, where a community from one classification is assumed to correspond to the community it has the greatest overlap with in each other classification.

(9) Return the resulting combined classification.

If the conditions of theorem 22 are satisfied, then there exists δ\delta such that

of GG’s vertices correctly with probability 1o(1)1-o(1) and it runs in O(n1+ϵ)O(n^{1+\epsilon}) time, for appropriate CC and ρ=λh2/4λ1\rho=\sqrt{\lambda_{h^{\prime}}^{2}/4\lambda_{1}}.

Appendix

Considering the way δ\delta, ϵ\epsilon, xx, and xx^{\prime} scale when QQ is multiplied by a scalar yields the following corollary.

If instead of having constant average degree, one has an average degree which increases as nn increases, one can slowly reduce bb, δ\delta, and ϵ\epsilon as nn increases, leading to the following corollary.

These corollaries are important as they show that if the entries of the connectivity matrix QQ are amplified by a coefficient growing with nn, almost exact recovery is achieved by (Agnostic-sphere-comparison) without parameter knowledge.

1.2 Proof of Theorem 6

Proving Theorem 6 will require establishing some terminology. First, let λ1,...,λh\lambda_{1},...,\lambda_{h} be the distinct eigenvalues of PQPQ, ordered so that λ1λ2...λh0|\lambda_{1}|\geq|\lambda_{2}|\geq...\geq|\lambda_{h}|\geq 0 and if λi=λi+1|\lambda_{i}|=|\lambda_{i+1}| then λi>0>λi+1\lambda_{i}>0>\lambda_{i+1}. Also define hh^{\prime} so that h=hh^{\prime}=h if λh0\lambda_{h}\neq 0 and h=h1h^{\prime}=h-1 if λh=0\lambda_{h}=0. In addition to this, let dd be the largest sum of a column of PQPQ.

For any vertex vv, let Nr(v)N_{r}(v) be the set of all vertices with shortest path to vv of length rr. If there are multiple graphs that vv could be considered a vertex in, let Nr[G](v)N_{r[G^{\prime}]}(v) be the set of all vertices with shortest paths in GG^{\prime} to vv of length rr.

We also typically refer to Nr[G](v)\overrightarrow{N_{r[G^{\prime}]}(v)} as simply Nr[G](v)N_{r[G^{\prime}]}(v), as the context will make it clear whether the expression refers to a set or vector.

Note that since any such ww can be written as a linear combination of the eie_{i}, vv is (R,x)(R,x)-good if eiNr+1(v)eiPQNr(v)xλh2(λh22λ1)rpi/k|e_{i}\cdot N_{r+1}(v)-e_{i}\cdot PQN_{r}(v)|\leq\frac{x\lambda_{h^{\prime}}}{2}\left(\frac{\lambda_{h^{\prime}}^{2}}{2\lambda_{1}}\right)^{r}\sqrt{p_{i}/k} for all 1ik1\leq i\leq k and 0r<R0\leq r<R.

First, note that for any eigenvector of PQPQ, ww, and r<Rr<R,

Thus, for any rRr\leq R, it must be the case that

Now, define w1w_{1},…, whw_{h} such that PQwi=λiwiPQw_{i}=\lambda_{i}w_{i} for each ii and p=i=1hwip=\sum_{i=1}^{h}w_{i}. For any i,ji,j,

If iji\neq j, then λiλj\lambda_{i}\neq\lambda_{j}, so this implies that wiP1wj=0w_{i}\cdot P^{-1}w_{j}=0. It follows from this that

Therefore, for any rRr\leq R, we have that

The following two lemmas are proved in [AS15].

Note that if Nr[G\E](v)N_{r[G\backslash E]}(v) and Nr[G\E](v)N_{r^{\prime}[G\backslash E]}(v^{\prime}) have already been computed, Nr,r[E](vv)N_{r,r^{\prime}[E]}(v\cdot v^{\prime}) can be computed by means of the following algorithm, where E[v]={v:(v,v)E}E[v]=\{v^{\prime}:(v,v^{\prime})\in E\}

Roughly speaking, for each v1Nr[G\E](v)v_{1}\in N_{r[G\backslash E]}(v) and v2Nr[G\E](v)v_{2}\in N_{r^{\prime}[G\backslash E]}(v^{\prime}), (v1,v2)E(v_{1},v_{2})\in E with probability cQσv1,σv2/ncQ_{\sigma_{v_{1}},\sigma_{v_{2}}}/n. This is complicated by the facts that (v1,v1)(v_{1},v_{1}) is never in EE and no edge is in G\EG\backslash E and EE. However, this changes the expected value of Nr,r[E](vv)N_{r,r^{\prime}[E]}(v\cdot v^{\prime}) given G\EG\backslash E by at most a constant unless GG has more than double its expected number of edges, something that happens with probability o(1)o(1). Furthermore, whether (v1,v2)(v_{1},v_{2}) is in EE is independent of whether (v1,v2)(v_{1}^{\prime},v_{2}^{\prime}) is in EE unless (v1,v2)=(v1,v2)(v_{1}^{\prime},v_{2}^{\prime})=(v_{1},v_{2}) or (v1,v2)=(v2,v1)(v_{1}^{\prime},v_{2}^{\prime})=(v_{2},v_{1}). So, the variance of Nr,r[E](vv)N_{r,r^{\prime}[E]}(v\cdot v^{\prime}) is proportional to its expected value, which is

Nr,r[E](vv)N_{r,r^{\prime}[E]}(v\cdot v^{\prime}) is within logn\log n standard deviations of its expected value with probability 1o(1)1-o(1), which completes the proof. ∎

Note that if v\overrightarrow{v} is an eigenvector of (1c)PQ(1-c)PQ, PQv\sqrt{P}Q\overrightarrow{v} is an eigenvector of the symmetric matrix (1c)PQP(1-c)\sqrt{P}Q\sqrt{P}. So, since eigenvectors of a symmetric matrix with different eigenvalues are orthogonal, we have

where we temporarily adopt the convention that if i>hi>h^{\prime}, λi=λh/2\lambda_{i}=\lambda_{h^{\prime}}/\sqrt{2} and wi(S)=0w_{i}(S)=0 for all SS.

Alternately, if m=h+1m=h^{\prime}+1 then with probability 1o(1)1-o(1), either vv is (r+2m+1,x)(r+2m+1,x)-bad, vv^{\prime} is (r+1,x)(r^{\prime}+1,x)-bad, or

For each 1l2m1\leq l\leq 2m and 1ih1\leq i\leq h, let

Next, for each 1ih1\leq i\leq h and 0lm0\leq l\leq m, let ul(i)u_{l}(i) be the column vector thats jjth entry is xl+j(i)x_{l+j}(i) for 1jm1\leq j\leq m. Also, for each 1lm1\leq l\leq m, let ul(h+1)u_{l}(h+1) be the length mm column vector thats jjth entry is Nr+l+j,r[E](vv)i=1hxl+j(i)N_{r+l+j,r^{\prime}[E]}(v\cdot v^{\prime})-\sum_{i=1}^{h}x_{l+j}(i) for 1jm1\leq j\leq m. Note that for each 1lm1\leq l\leq m, the llth column of Mm,r,r[E](vv)M_{m,r,r^{\prime}[E]}(v\cdot v^{\prime}) is i=1h+1ul(i)\sum_{i=1}^{h+1}u_{l}(i). So,

If mhm\leq h and ii is some permutation of the integers from 11 to mm, then

The jjth column of this matrix is proportional to cnwj(Nr[G\E](v))Qwj(Nr[G\E](v))\frac{c}{n}w_{j}(N_{r[G\backslash E]}(v))\cdot Qw_{j}(N_{r^{\prime}[G\backslash E]}(v^{\prime})), so there exists some γ=γ({(1c)λj},m)\gamma=\gamma(\{(1-c)\lambda_{j}\},m) such that the sum of all such terms is

Alternately, the sum of all such terms is equal to

That can only hold for all such ii if l=1mul(1c)lλjl=0\sum_{l=1}^{m}u^{\prime}_{l}(1-c)^{l}\lambda_{j}^{l}=0 for all 1jm1\leq j\leq m, and that can only be the case if u=0u^{\prime}=0. Therefore, the determinant is nonzero unless xl(0)=0x_{l}(0)=0 for some ll, which implies that γ0\gamma\neq 0. If m<hm<h then by similar logic, there exists γ=γ({(1c)λi},m)\gamma^{\prime}=\gamma^{\prime}(\{(1-c)\lambda_{i}\},m) such that the sum of all terms for which ii is a permutation of the integers from 11 to m1m-1 and m+1m+1 is

for all ii. Similarly, if vv^{\prime} is (r+1,x)(r^{\prime}+1,x)-good then

for all ii. If both hold, then xl(i)cn(1c)r+r+lλir+r+l+1((minpj)1/2+x)2|x_{l}(i)|\leq\frac{c}{n}(1-c)^{r+r^{\prime}+l}|\lambda_{i}|^{r+r^{\prime}+l+1}((\min p_{j})^{-1/2}+x)^{2} for all ii. Furthermore, for any ll and jj,

In other words, under these circumstances xl(i)|x_{l}(i)| is upper bounded by a constant multiple of cn((1c)λi)r+r\frac{c}{n}((1-c)|\lambda_{i}|)^{r+r^{\prime}} if vv and vv^{\prime} are both good, and every entry of ul(h+1)u_{l}(h+1) has a magnitude that is upper bounded by a constant multiple of ((1c)λ1)(r+r)/2logn/n+cn((1c)2λh2/2)(r+r)/2((1-c)\lambda_{1})^{(r+r^{\prime})/2}\log n/\sqrt{n}+\frac{c}{n}((1-c)^{2}\lambda_{h^{\prime}}^{2}/2)^{(r+r^{\prime})/2}. Either way, every entry of ul(i)u_{l}(i) is upper bounded by a constant multiple of cn((1c)λi)r+rlogn\frac{c}{n}((1-c)|\lambda_{i}|)^{r+r^{\prime}}\log n.

Alternately, recall that if vv is (r+2m+1,x)(r+2m+1,x)-good then for any r<r+2m+1r^{\prime\prime}<r+2m+1 and ihi\leq h,

Also, for fixed values of Nr(v)N_{r^{\prime\prime\prime}}(v) for all rrr+2m+1r^{\prime\prime\prime}\leq r^{\prime\prime}\leq r+2m+1, each element of Nr+1[G\E](v)N_{r^{\prime\prime}+1[G\backslash E]}(v) has a variance of O(Nr[G\E](v))=O((1c)λ1r)O(|N_{r^{\prime\prime}[G\backslash E]}(v)|)=O((1-c)\lambda_{1}^{r^{\prime\prime}}). So,

with probability 1o(1)1-o(1) for all l,jml,j\leq m and ihi\leq h. This implies that if vv is (r+2m+1,x)(r+2m+1,x)-good and vv^{\prime} is (r+1,x)(r^{\prime}+1,x)-good then

with probability 1o(1)1-o(1). There are only mmm^{m} possible choices of ii, so with probability 1o(1)1-o(1), either vv is (r+2m+1,x)(r+2m+1,x)-bad, vv^{\prime} is (r+1,x)(r^{\prime}+1,x)-bad, or

While this is a helpful result, it turns out to be more useful to have an expression that links the determinant to wi(Nr[G\E](v))Qwi(Nr[G\E](v))w_{i}(N_{r[G\backslash E]}(v))\cdot Qw_{i}(N_{r^{\prime}[G\backslash E]}(v^{\prime})) for fixed values of rr and rr^{\prime}. So, we have the following:

Combining these inequalities with the determinant lemma yields the desired result. ∎

However, in order to know this in a useful sense it is necessary to prove that these terms are large relative to the error terms. In order to do that, we need the following:

Every entry in (PQ)k(PQ)^{k} is positive, so its eigenvector of largest eigenvalue is unique up to multiplication, and its entries all have the same sign. That means that it has a unique multiple, ww, such that wP1w=1w\cdot P^{-1}w=1 and w1>0w_{1}>0. In this case all of ww’s entries must be positive, and since (PQ)kw=λ1kw(PQ)^{k}w=\lambda_{1}^{k}w, it must be the case that PQw=λ1wPQw=\lambda_{1}w.

Given any (lnn,min(P1w)j/2)(\sqrt{\ln n},\min(P^{-1}w)_{j}/2)-good vertex vv, and any r<lnnr<\sqrt{\ln n},

For any eigenvector ww^{\prime\prime} with eigenvalue λi\lambda_{i^{\prime}} and wP1w=1w^{\prime\prime}\cdot P^{-1}w^{\prime\prime}=1,

λ1>λ2\lambda_{1}>\lambda_{2}, so there exists a constant r0r_{0} such that for any r>r0r>r_{0},

for all jj. For r0<rlnnr_{0}<r\leq\sqrt{\ln n}, 1jk1\leq j\leq k, and a fixed value of Nr(v)N_{r}(v), the probability distribution of Nr+1(v)jN_{r+1}(v)_{j} is within o(1)o(1) of a poisson distribution with expected value (PQNr(v))j(PQN_{r}(v))_{j}. Furthermore, Nr+1(v)jN_{r+1}(v)_{j^{\prime}} has negligible dependence on Nr+1(v)jN_{r+1}(v)_{j} for all jjj^{\prime}\neq j. So, wP1Nr+1(v)w^{\prime}\cdot P^{-1}N_{r+1}(v) has an expected value of λiwP1Nr(v)+o(1)\lambda_{i}w^{\prime}\cdot P^{-1}N_{r}(v)+o(1) and a standard deviation of

In particular, this means that if there exists r0<r<2lnlnn/ln(λi2/λ1)r_{0}<r<2\ln\ln n/\ln(\lambda_{i}^{2}/\lambda_{1}) such that wP1Nr(v)>λ1r/2lnlnlnlnn|w^{\prime}\cdot P^{-1}N_{r}(v)|>\lambda_{1}^{r/2}\ln\ln\ln\ln n then with probability 1o(1)1-o(1),

Furthermore, since the probability distribution of wP1Nr(v)w^{\prime}\cdot P^{-1}N_{r}(v) for a fixed value of Nr1(v)N_{r-1}(v) is a sum of constant multiples of poisson distributions with expected values of Ω(λ1r)\Omega(\lambda_{1}^{r}), it must be the case that wP1Nr(v)>λ1r/2lnlnlnlnn|w^{\prime}\cdot P^{-1}N_{r}(v)|>\lambda_{1}^{r/2}\ln\ln\ln\ln n with probability eO(ln2lnlnlnn)=ω(1/lnlnn)e^{-O(\ln^{2}\ln\ln\ln n)}=\omega(1/\ln\ln n). Therefore, there exists r0<r<2lnlnn/ln(λi2/λ1)r_{0}<r<2\ln\ln n/\ln(\lambda_{i}^{2}/\lambda_{1}) such that this holds with probability 1o(1)1-o(1), and the lemma follows. ∎

This also implies that for any vv, vv^{\prime}, and ii, either vv is (lnn,min(P1w)i/2)(\sqrt{\ln n},\min(P^{-1}w)_{i}/2)-bad, vv^{\prime} is (lnn,min(P1w)i/2)(\sqrt{\ln n},\min(P^{-1}w)_{i}/2)-bad, or wi(Nlnn(v))P1wi(Nlnn(v))λi2lnn/ln2n|w_{i}(N_{\sqrt{\ln n}}(v))\cdot P^{-1}w_{i}(N_{\sqrt{\ln n}}(v^{\prime}))|\geq\lambda_{i}^{2\sqrt{\ln n}}/\ln^{2}n with probability 1o(1)1-o(1), since the degree of dependence between Nlnn(v)N_{\sqrt{\ln n}}(v) and Nlnn(v)N_{\sqrt{\ln n}}(v^{\prime}) is negligible. This allows me to attempt to approximate PQPQ’s eigenvalues as follows.

With probability 1o(1)1-o(1), either vv is (23logn/log((1c)λ1),x)(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1}),x)-bad or Basic-Eigenvalue-approximation-algorithm(E,c,v) runs in O(n)O(n) time and returns (λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}) such that h=hh^{\prime}=h^{\prime\prime} and λiλi<ln3/2(n)|\lambda^{\prime}_{i}-\lambda_{i}|<\ln^{-3/2}(n) for all ii.

If vv is (23logn/log((1c)λ1,x)(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1},x)-good, there exists a constant rr such that for any r<r<23logn/log((1c)λ1)r<r^{\prime\prime}<\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1}),

So, the minimum rr^{\prime\prime} such that Nr[G\E](v)>n|N_{r^{\prime\prime}[G\backslash E]}(v)|>\sqrt{n} is within a constant of logn/2log((1c)λ1)\log n/2\log((1-c)\lambda_{1}). That means that λ1λ1|\lambda_{1}-\lambda^{\prime\prime}_{1}| is upper bounded by a constant multiple of 1/logn1/\log n, and that r=rr=r^{\prime} is within a constant of the value it would have if λ1\lambda^{\prime\prime}_{1} were λ1\lambda_{1}. This also implies that if nn is sufficiently large it is less than 23logn/log((1c)λ1)2h3\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1})-2h^{\prime}-3.

For mhm\leq h^{\prime}, r{r,r+1}r^{\prime\prime}\in\{r,r+1\}, and vGv\in G if λiλi+1|\lambda_{i}|\neq|\lambda_{i+1}| then

with probability 1o(1)1-o(1). If λi=λi+1|\lambda_{i}|=|\lambda_{i+1}| then λi=λi+1\lambda_{i}=-\lambda_{i+1}, so either γ((1c)λm)2r2lnnwm(Nlnn[G\E](v))Qwm(Nlnn[G\E](v))\gamma((1-c)\lambda_{m})^{2r-2\sqrt{\ln n}}w_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v)) has the same sign as γ((1c)λm+1)2r2lnnwm+1(Nlnn[G\E](v))Qwm+1(Nlnn[G\E](v))\gamma^{\prime}((1-c)\lambda_{m+1})^{2r-2\sqrt{\ln n}}w_{m+1}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m+1}(N_{\sqrt{\ln n}[G\backslash E]}(v)) or γ((1c)λm)2r+12lnnwm(Nlnn[G\E](v))Qwm(Nlnn[G\E](v))\gamma((1-c)\lambda_{m})^{2r+1-2\sqrt{\ln n}}w_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v)) has the same sign as γ((1c)λm+1)2r+12lnnwm+1(Nlnn[G\E](v))Qwm+1(Nlnn[G\E](v)))\gamma^{\prime}((1-c)\lambda_{m+1})^{2r+1-2\sqrt{\ln n}}w_{m+1}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m+1}(N_{\sqrt{\ln n}[G\backslash E]}(v))). Either way, we have that

with probability 1o(1)1-o(1). Furthermore, by lemma 88, these bounds are within a factor of O(ln2n)O(\ln^{2}n) of each other with probability 1o(1)1-o(1).

is within a factor of ln3(n)\ln^{3}(n) of ((1c)λm)2r12lnnwm(Nlnn[G\E](v))Qwm(Nlnn[G\E](v))|((1-c)\lambda_{m})^{2r_{1}-2\sqrt{\ln n}}w_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))\cdot Qw_{m}(N_{\sqrt{\ln n}[G\backslash E]}(v))| with probability 1o(1)1-o(1), and thus that

with probability 1o(1)1-o(1). (1c)λm(4(1c)3λ13|(1-c)\lambda_{m}|\geq\sqrt{(4(1-c)^{3}\lambda_{1}^{3}}, so with probability 1o(1)1-o(1), this expression is not less than ((1c)λ1)3/4+1lnn((1-c)\lambda^{\prime\prime}_{1})^{3/4}+\frac{1}{\sqrt{\ln n}}.

If m=h+1m=h^{\prime}+1, r{r,r+1}r^{\prime\prime}\in\{r,r+1\}, and vGv\in G, then with probability 1o(1)1-o(1) either vv is (r+2m+1,x)(r^{\prime\prime}+2m+1,x)-bad or

Therefore, with probability 1o(1)1-o(1), either vv is (23logn/log((1c)λ1),x)(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1}),x)-bad or h=hh^{\prime\prime}=h^{\prime}.

By the previous two lemmas, with probability 1o(1)1-o(1) either vv is (r+2m+4,x)(r+2m+4,x)-bad, or

for all mh+1m\leq h^{\prime}+1 and rr<r+4r\leq r^{\prime\prime}<r+4.

Assume that the later holds. If λm+1<λm|\lambda_{m+1}|<|\lambda_{m}| then if nn is sufficiently large the γ\gamma^{\prime} term will be negligable relative to the γ\gamma term, while if λm+1=λm|\lambda_{m+1}|=|\lambda_{m}| increasing rr^{\prime\prime} by two would multiply both of these terms by (1c)2λm2(1-c)^{2}\lambda_{m}^{2}. Either way, we have that for any khk\leq h^{\prime} and rrr+1r\leq r^{\prime\prime}\leq r+1,

as long as nn is sufficiently large, and

We assumed that nn is large, so that can only fail if λm=λm+1|\lambda_{m}|=|\lambda_{m+1}|. In this case, λm=λm+1\lambda_{m}=-\lambda_{m+1}, so increasing mm by one would result in both terms having the same sign, at which point the condition is satisfied. So,

for any ihi\leq h^{\prime}. For any i<hi<h^{\prime}, if λi>λi+1|\lambda_{i}|>|\lambda_{i+1}|, then for sufficiently large nn, λiλi+1|\lambda^{\prime}_{i}|-|\lambda^{\prime}_{i+1}| will be greater than 1lnn\frac{1}{\ln n}. On the other hand, if λi=λi+1|\lambda_{i}|=|\lambda_{i+1}| and nn is sufficiently large, λiλi+1|\lambda^{\prime}_{i}|-|\lambda^{\prime}_{i+1}| will be less than 1lnn\frac{1}{\ln n}. So, the algorithm will suceed at determining which eigenvalues have the same absolute values and assign λi\lambda^{\prime}_{i} the same sign as λi\lambda_{i} for each ii. So, λiλi<ln3/2(n)|\lambda^{\prime}_{i}-\lambda_{i}|<\ln^{-3/2}(n) as desired. Assuming this all works, the slowest part of the algorithm is computing expressions of the form Nr,r[E](vv)N_{r^{\prime\prime},r^{\prime\prime\prime}[E]}(v\cdot v), each such expression can be computed in O(n)O(n) time, and only a constant number of them need to be computed. So, the algorithm runs in O(n)O(n) time. ∎

So, this algorithm sometimes works, but it fails if vv is bad. This risk can be mitigated by using multiple vertices as follows.

then Improved-Eigenvalue-approximation-algorithm(c) returns (λ1,...,λh)(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}}) such that h=hh^{\prime}=h^{\prime\prime} and λiλi<ln3/2(n)|\lambda^{\prime}_{i}-\lambda_{i}|<\ln^{-3/2}(n) for all ii with probability 1o(1)1-o(1).

Generating EE takes O(n)O(n) time, picking v[i]v[i] takes o(n)o(n) time, each execution of Basic-Eigenvalue-approximation-algorithm(c) runs in O(nlogn)O(n\sqrt{\log n}) time, and combining their outputs takes o(n)o(n) time. So, this algorithm runs in O(nlogn)O(n\log n) time.

Assuming the conditions are satisfied, there exists y<12y<\frac{1}{2} such that 1y1-y of GG’s vertices are (23logn/log((1c)λ1,x)(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1},x)-good with probability 1o(1)1-o(1). So, with probability 1o(1)1-o(1), the majority of the selected vertices of GG are (23logn/log((1c)λ1),x)(\frac{2}{3}\cdot\log n/\log((1-c)\lambda_{1}),x)-good and the majority of the executions of Basic-Eigenvalue-approximation-algorithm give good output. If this happens, then the median value of hh^{\prime\prime} is hh^{\prime}, and for each 1ih1\leq i\leq h^{\prime}, the median value of λi\lambda^{\prime}_{i} is within ln3/2(n)\ln^{-3/2}(n) of λi\lambda_{i} for each ii, as desired. ∎

Given approximations of PQPQ’s eigenvalues, one can attempt to approximate Ni({v})P1Ni({v})N_{i}(\{v\})\cdot P^{-1}N_{i}(\{v^{\prime}\}) as follows.

The slowest part of the algorithm is computing the expressions of the form Nr,r[E](vv)N_{r^{\prime\prime},r^{\prime}[E]}(v\cdot v^{\prime}). Each of these can be computed in O(E[Nr[G\E](v)])=O(((1c)λ1)r)O(E[|N_{r^{\prime}[G\backslash E]}(v^{\prime})|])=O(((1-c)\lambda_{1})^{r^{\prime}}) average time, and the number of these that need to be computed is constant in nn. So, this algorithm runs in O(((1c)λ1)r)O(((1-c)\lambda_{1})^{r^{\prime}}) average time.

With probability 1o(1)1-o(1) either vv is (r+2m+4,x)(r+2m+4,x)-bad, vv^{\prime} is (r+1,x)(r^{\prime}+1,x)-bad, or

for all mh+1m\leq h^{\prime}+1 and rr<r+4r\leq r^{\prime\prime}<r+4.

for any ihi\leq h with probability 1o(1)1-o(1). So,

with probability 1o(1)1-o(1). That implies that

By goodness of vv and vv^{\prime}, this is less than or equal to

For any two vertices in different communities, vv and vv^{\prime}, the fact that QQ’s rows are distinct implies that Q({v}{v})0Q(\overrightarrow{\{v\}}-\overrightarrow{\{v^{\prime}\}})\neq 0. So, wi({v})wi({v})w_{i}(\{v\})\neq w_{i}(\{v^{\prime}\}) for some 1ih1\leq i\leq h^{\prime}. That means that for any two vertices vv and vv^{\prime},

for all 1ih1\leq i\leq h^{\prime}, with equality for all ii if and only if vv and vv^{\prime} are in the same community. This also implies that given a vertex vv, another vertex in the same community vv^{\prime}, and a vertex in a different community vv^{\prime\prime},

for all 1ih1\leq i\leq h^{\prime} and the inequality is strict for at least one ii. This suggests the following algorithms for classifying vertices.

Assuming that each of GG’s edges was independently assigned to EE with probability cc, this algorithm runs in O(((1c)λ1)max(r,r))O(((1-c)\lambda_{1})^{\max(r,r^{\prime})}) average time. Furthermore, if each execution of Vertex-product-approximation-algorithm succeeds and 13(2x(minpj)1/2+x2)13(2x(\min p_{j})^{-1/2}+x^{2}) is less than the minimum nonzero value of (wi({v})wi({v}))P1(wi({v})wi({v}))(w_{i}(\{v\})-w_{i}(\{v^{\prime}\}))\cdot P^{-1}(w_{i}(\{v\})-w_{i}(\{v^{\prime}\})) then the algorithm returns the correct result with probability 1o(1)1-o(1).

The slowest step of the algorithm is using Vertex-product-approximation-algorithm. This runs in an average time of O(((1c)λ1)max(r,r))O(((1-c)\lambda_{1})^{\max(r,r^{\prime})}) and must be done 33 times. If each execution of Vertex-product-approximation-algorithm succeeds then with probability 1o(1)1-o(1) the ziz_{i} are all within 65(2x(minpj)1/2+x2)\frac{6}{5}(2x(\min p_{j})^{-1/2}+x^{2}) of the products they seek to approximate, in which case

which is true for some ii if and only if vv and vv^{\prime} are in different communities. ∎

Assuming that EE was generated properly, this algorithm runs in O(((1c)λ1)r)O(((1-c)\lambda_{1})^{r^{\prime}}) average time. Let x>0x>0 and assume that each execution of Vertex-product-approximation-algorithm succeeds (including the previous ones to compute zi(v[σ]v[σ])z_{i}(v[\sigma]\cdot v[\sigma])), and that v[]v[] contains exactly one vertex from each community. Also, assume that 13(2x(minpj)1/2+x2)13(2x(\min p_{j})^{-1/2}+x^{2}) is less than the minimum nonzero value of (wi({v})wi({v}))P1(wi({v})wi({v}))(w_{i}(\{v\})-w_{i}(\{v^{\prime}\}))\cdot P^{-1}(w_{i}(\{v\})-w_{i}(\{v^{\prime}\})). Then this algorithm classifies vv^{\prime} correctly with probability 1o(1)1-o(1).

Again, the slowest step of the algorithm is running Vertex-product-approximation-algorithm. This runs in an average time of O(((1c)λ1)r)O(((1-c)\lambda_{1})^{r^{\prime}}) and must be done kk times. If the conditions given above are satisifed, then each ziz_{i} is within 2120(2x(minpj)1/2+x2)\frac{21}{20}(2x(\min p_{j})^{-1/2}+x^{2}) of the product it seeks to approximate with probability 1o(1)1-o(1). If this is the case, then

for all ii and σ\sigma^{\prime} if vv^{\prime} is in the same community as v[σ]v[\sigma], and

for all ii and σ\sigma iff vv^{\prime} is in the same community as v[σ]v[\sigma]. So,

iff vv^{\prime} is in the same community as v[σ]v[\sigma]. Therefore, the algorithm returns the correct result with probability 1o(1)1-o(1). ∎

At this point, we can finally start giving algorithms for classifying a graph’s vertices.

Let x>0x^{\prime}>0, and assume that all of the following hold:

Let y=2kex2(1c)λh2minpi16λ1k3/2((minpi)1/2+x)/(1ex2(1c)λh2minpi16λ1k3/2((minpi)1/2+x)(((1c)λh44λ13)1))y=2ke^{-\frac{x^{2}(1-c)\lambda_{h^{\prime}}^{2}\min p_{i}}{16\lambda_{1}k^{3/2}((\min p_{i})^{-1/2}+x)}}/\left(1-e^{-\frac{x^{2}(1-c)\lambda_{h^{\prime}}^{2}\min p_{i}}{16\lambda_{1}k^{3/2}((\min p_{i})^{-1/2}+x)}\cdot((\frac{(1-c)\lambda_{h^{\prime}}^{4}}{4\lambda_{1}^{3}})-1)}\right)

and y=2kex2(1c)λh2minpi16λ1k3/2((minpi)1/2+x)/(1ex2(1c)λh2minpi16λ1k3/2((minpi)1/2+x)(((1c)λh44λ13)1))y^{\prime}=2ke^{-\frac{x^{\prime 2}(1-c)\lambda_{h^{\prime}}^{2}\min p_{i}}{16\lambda_{1}k^{3/2}((\min p_{i})^{-1/2}+x^{\prime})}}/\left(1-e^{-\frac{x^{\prime 2}(1-c)\lambda_{h^{\prime}}^{2}\min p_{i}}{16\lambda_{1}k^{3/2}((\min p_{i})^{-1/2}+x^{\prime})}\cdot((\frac{(1-c)\lambda_{h^{\prime}}^{4}}{4\lambda_{1}^{3}})-1)}\right)

This algorithm runs in O(m2n1ϵ3+n1+23ϵ)O(m^{2}n^{1-\frac{\epsilon}{3}}+n^{1+\frac{2}{3}\epsilon}) time. Furthermore, with probability 1o(1)1-o(1), GG is such that Unreliable-graph-classification-algorithm(G,c,m,ϵ,x,(λ1,...,λh))(G,c,m,\epsilon,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})) has at least a

chance of classifying at least 1y1-y^{\prime} of GG’s vertices correctly.

Let r0=(1ϵ3)logn/log((1c)λ1)r_{0}=(1-\frac{\epsilon}{3})\log n/\log((1-c)\lambda_{1}). There exists y<yy^{\prime\prime}<y such that if these conditions hold, then with probability 1o(1)1-o(1), at least 1y1-y^{\prime\prime} of GG’s vertices are (r0,x)(r_{0},x)-good and the number of vertices in GG in community σ\sigma is within nlogn\sqrt{n}\log n of pσnp_{\sigma}n for all σ\sigma. If this is the case, then for sufficiently large nn, it is at least 1k(1minpi)mmy1-k(1-\min p_{i})^{m}-my likely that every one of the mm randomly selected vertices is (r0,x)(r_{0},x)-good and at least one is selected from each community.

If v[i]v[i] is (r0,x)(r_{0},x)-good for all ii, then with probability 1o(1)1-o(1), vertex-comparison-algorithm(v[i],v[j],r,r,E,x,c,(λ1,...,λh))(v[i],v[j],r,r^{\prime},E,x,c,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})) determines whether or not v[i]v[i] and v[j]v[j] are in the same community correctly for every ii and jj, allowing the algorithm to pick one member of each community. If that happens, then the algorithm will classify each (r+h,x)(r^{\prime}+h^{\prime},x^{\prime})-good vertex correctly with probability 1o(1)1-o(1). So, as long as the initial selection of v[]v[] is good, the algorithm classifies at least 1y1-y^{\prime} of the graph’s vertices correctly with probability 1o(1)1-o(1).

Generating EE and v[]v[] takes O(n)O(n) time. Computing Nr[G\E](v[i])N_{r^{\prime\prime}[G\backslash E]}(v[i]) for all rr+2h+3r^{\prime\prime}\leq r+2h^{\prime}+3 takes O(mrNr[G\E](v[i]))=O(mn)O(m|\cup_{r^{\prime\prime}}N_{r^{\prime\prime}[G\backslash E]}(v[i]))=O(mn) time, and computing Nr[G\E](v)N_{r^{\prime\prime}[G\backslash E]}(v^{\prime}) for all rrr^{\prime\prime}\leq r^{\prime} and vGv^{\prime}\in G takes

time. Once these have been computed, running Vertex-comparison- algorithm(v[i],v[j],r,r,E,x,(λ1,...,λh))(v[i],v[j],r,r^{\prime},E,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})) for every ii and jj takes O(m2((1c)λ1)r)=O(m2n1ϵ3)O(m^{2}\cdot((1-c)\lambda_{1})^{r})=O(m^{2}n^{1-\frac{\epsilon}{3}}) time, at which point an alleged member of each community can be found in O(m2)O(m^{2}) time. Running Vertex-classification-algorithm(v[],v,r,r,E,c,(λ1,...,λh))(v^{\prime}[],v^{\prime\prime},r,r^{\prime},E,c,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})) for every vGv^{\prime\prime}\in G takes O(n((1c)λ1)r)=O(n1+23ϵ)O(n\cdot((1-c)\lambda_{1})^{r^{\prime}})=O(n^{1+\frac{2}{3}\epsilon}) time. So, the overall algorithm runs in O(m2n1ϵ3+n1+23ϵ)O(m^{2}n^{1-\frac{\epsilon}{3}}+n^{1+\frac{2}{3}\epsilon}) average time. ∎

So, this algorithm can sometimes give a vertex classification that is nontrivially better than that obtained by guessing. However, it has an assymptotically nonzero failure rate and requires too much information about the graph’s parameters. In order to get around that, we combine the results of multiple executions of the algorithm and add in a parameter analysis procedure as follows.

Assume that there exist xx, xx^{\prime}, and ϵ\epsilon such that xx is either a unit reciprocal or an integer, ϵ\epsilon is a rational number of the form 1z\frac{1}{z} or 11z1-\frac{1}{z}, and all of the following hold:

With probability 1o(1)1-o(1), Reliable-graph-classification-algorithm(G,m,δ\delta, T(n)) runs in O(m2n1ϵ3T(n)+n1+23ϵT(n))O(m^{2}n^{1-\frac{\epsilon}{3}}T(n)+n^{1+\frac{2}{3}\epsilon}T(n)) time and classifies at least 13y1-3y^{\prime} of GG’s vertices correctly, where

With probability 1o(1)1-o(1), Improved-Eigenvalue-approximation-algorithm gives output such that h=hh^{\prime\prime}=h^{\prime} and λiλiln3/2(n)|\lambda_{i}-\lambda^{\prime}_{i}|\leq\ln^{-3/2}(n) for each ii. Assuming that this holds, the algorithm finds the largest xx and ϵ\epsilon that satisfy the conditions above and the largest unit reciprocal less than 1/91/9, cc, that satisifes the conditions for Unreliable-graph-classification-algorithm(G,c,m,ϵ,x,(λ1,...,λh))(G,c,m,\epsilon,x,(\lambda^{\prime}_{1},...,\lambda^{\prime}_{h^{\prime\prime}})) to have a greater than 1/21/2 success rate for any k1/δk\leq\lfloor 1/\delta\rfloor with probability 1o(1)1-o(1). Since its success rate is greater than 1/21/2 and T(n)=ω(1)T(n)=\omega(1), more than half of the executions of Unreliable-graph-classification-algorithm give classifications with error yy^{\prime} or less with probability 1o(1)1-o(1). That means that y2yy^{\prime\prime}\leq 2y^{\prime} and at least one of the classifications in the selected set must have error yy^{\prime} or less. Thus, all of the selected classifications have error 3y3y^{\prime} or less. The requirement that minpi>4y\min p_{i}>4y^{\prime} ensures that the bijection between any two of these classifications’ communities that minimizes disagreeement is the identity. Therefore, this algorithm classifies at least 13y1-3y^{\prime} of the vertices correctly, as desired.

Assuming this all works correctly, Improved-Eigenvalue-approximation-algorithm runs in O(nln(n))O(n\ln(n)) time. Finding xx, ϵ\epsilon, and cc takes constant time, and running Unreliable-graph-classification-algorithm T(n)T(n) times takes O(m2n1ϵ3T(n)+n1+23ϵT(n))O(m^{2}n^{1-\frac{\epsilon}{3}}T(n)+n^{1+\frac{2}{3}\epsilon}T(n)). Computing the degree of agreement between each pair of classifications takes O(nln2(n))O(n\ln^{2}(n)) time. T(n)<lnnT(n)<\ln n, so the brute force algorithm finds yy^{\prime\prime} and the corresponding set in O(n)O(n) time. Combining the classifications takes O(n)O(n) time. Therefore, this whole algorithm runs in O(m2n1ϵ3T(n)+n1+23ϵT(n))O(m^{2}n^{1-\frac{\epsilon}{3}}T(n)+n^{1+\frac{2}{3}\epsilon}T(n)) time with probability 1o(1)1-o(1), as desired. ∎

If the conditions hold, Reliable-graph-classification- algorithm(G,ln(41/δ)/δ,δ,ln(n))(G,\ln(4\lfloor 1/\delta\rfloor)/\delta,\delta,\ln(n)) has the desired properties by the previous lemma. ∎

2 Exact recovery

Recall also that exact recovery is solvable for a community partition [k]=s=1tAs[k]=\sqcup_{s=1}^{t}A_{s}, if there exists an algorithm that assigns to each node in GG an element of {A1,,At}\{A_{1},\dots,A_{t}\} that contains its true communityUp to a relabelling of the communities. with probability 1on(1)1-o_{n}(1). Exact recovery is solvable in SBM(n,p,W)SBM(n,p,W) if it is solvable for the partition of [k][k] into kk singletons, i.e., all communities can be recovered.

is an ff-divergence. For t=1/2t=1/2, i.e., the gap between the arithmetic and geometric means, we have

which is the Hellinger divergence (or distance), and the maximization over tt of the part xμ(x)tν(x)1t\sum_{x}\mu(x)^{t}\nu(x)^{1-t} is the exponential of the Chernoff divergence. We refer to Section 8.3 for further discussions on D+D_{+}. Note also that we will often evaluate D+D_{+} as D+(x,y)D_{+}(x,y) where x,yx,y are vectors instead of measures.

We next present our main theorem for exact recovery. We first provide necessary and sufficient conditions for exact recovery of partitions, and then provide an agnostic algorithm that solves exact recovery efficiently, more precisely, in quasi-linear time.

Note that the second item in the theorem implies that Agnostic-degree-profiling solves exact recovery efficiently whenever the parameters pp and QQ allow for exact recovery to be solvable.

If Qij=0Q_{ij}=0 for some ii and jj then the results above still hold, except that if for all ii and jj in different subsets of the partition,

but there exist ii and jj in different subsets of the partition such that D+((PQ)i,(PQ)j)=1D_{+}((PQ)_{i},(PQ)_{j})=1 and ((PQ)i,k(PQ)j,k((PQ)i,k(PQ)j,k)=0((PQ)_{i,k}\cdot(PQ)_{j,k}\cdot((PQ)_{i,k}-(PQ)_{j,k})=0 for all kk, then the optimal algorithm will have an assymptotically constant failure rate. The recovery algorithm also needs to be modified to accomodate ’s in QQ.

Agnostic-degree-profiling algorithm. Inputs: a graph G=([n],E)G=([n],E), and a splitting parameter γ\gamma\in (see Theorem 7 for the choice of γ\gamma). Output: Each node v[n]v\in[n] is assigned a community-list A(v){A1,,At}A(v)\in\{A_{1},\dots,A_{t}\}, where A1,,AtA_{1},\dots,A_{t} is intended to be the partition of [k][k] in to the largest number of subsets such that D+((pQ)i,(pQ)j)1D_{+}((pQ)_{i},(pQ)_{j})\geq 1 for all i,ji,j in [k][k] that are in different subsets. Algorithm: (1) Define the graph GG^{\prime} on the vertex set [n][n] by selecting each edge in GG independently with probability γ\gamma, and define the graph GG^{\prime\prime} that contains the edges in GG that are not in GG^{\prime}. (2) Run Agnostic-sphere-comparison on GG^{\prime} to obtain the preliminary classification σ[k]n\sigma^{\prime}\in[k]^{n} (see Section 7.1 and Corollary 2.) (3) Estimate pp and QQ based on the alleged communities’ sizes and the edge densities between them. (4) For each node v[n]v\in[n], determine in which community node vv is most likely to belong to based on its degree profile computed from the preliminary classification σ\sigma^{\prime} and the estimates of pp and QQ (see Section 7.2.2), and call it σv\sigma^{\prime\prime}_{v} (5) Re-estimate pp and QQ based on the sizes of the communities claimed by σ\sigma^{\prime\prime} and the edge densities between them. (6) For each node v[n]v\in[n], determine in which group A1,,AtA_{1},\dots,A_{t} node vv is most likely to belong to based on its degree profile computed from the preliminary classification σ\sigma^{\prime\prime} and the new estimates of PP and QQ (see Section 7.2.2).

2.2 Testing degree profiles

In this section, we consider the problem of deciding which community a node in the SBM belongs to based on its degree profile. The notions are the same as in [AS15], we repeat them to ease the reading and explain how the Agnostic-degree-profiling algorithm works.

hence, by the additivity of Poisson distribution and the triangular inequality,

We will rely on a simple one-sided bound (see (44)) to approximate our events under the Poisson measure.

In other words, DD has independent Poisson entries with different means. We use the following notation to summarize the above setting:

Our goal is to infer the value of HH by observing a realization of DD. To minimize the error probability given a realization of DD, we must pick the most likely hypothesis conditioned on this realization, i.e.,

which is the Maximum A Posteriori (MAP) decoding rule.Ties can be broken arbitrarily. To resolve this maximization, we can proceed to a tournament of k1k-1 pairwise comparisons of the hypotheses. Each comparison allows us to eliminate one candidate for the maxima, i.e.,

The error probability PeP_{e} of this decoding rule is then given by,

where D+(c1,c2)D_{+}(c_{1},c_{2}) is the CH-divergence as defined in (16).

In other words, the CH-divergence provides the error exponent for deciding among multivariate Poisson distributions. We did not find this result in the literature, but found a similar result obtained by Verdú [Ver86], who shows that the Hellinger distance (the special case with t=1/2t=1/2 instead of the maximization over tt) appears in the error exponent for testing Poisson point-processes, although [Ver86] does not investigate the exact error exponent. This lemma was proven in [AS15].

This lemma together with previous bounds on PeP_{e} imply that if D+(ci,cj)>1D_{+}(c_{i},c_{j})>1 for all iji\neq j, the true hypothesis is correctly recovered with probability o(1/n)o(1/n). However, it may be that D+(ci,cj)>1D_{+}(c_{i},c_{j})>1 only for a subset of (i,j)(i,j)-pairs. What can we then infer? While we may not recover the true value of HH with probability o(1/n)o(1/n), we may narrow down the search within a subset of possible hypotheses with that probability of error.

Testing composite multivariate Poisson distributions. We now consider the previous setting, but we are no longer interested in determining the true hypothesis, but in deciding between two (or more) disjoint subsets of hypotheses. Under hypothesis 1, the distribution of DD belongs to a set of possible distributions, namely P(λi)\mathcal{P}(\lambda_{i}) where iAi\in A, and under hypothesis 2, the distribution of DD belongs to another set of distributions, namely P(λi)\mathcal{P}(\lambda_{i}) where iBi\in B. Note that AA and BB are disjoint subsets such that AB=[k]A\cup B=[k]. In short,

Moreover, applying bounds on the minima of two sums,

The same reasoning can be applied to the problem of deciding whether a given node belongs to a group of communities, with more than two groups. Also, for any pp and pp^{\prime} such that pjpj<lnn/n|p_{j}-p^{\prime}_{j}|<\ln n/\sqrt{n} for each jj, QQ, γ(n)\gamma(n), and ii,

So, the error rate for any algorithm that classifies vertices based on their degree profile in a graph drawn from a sparse SBM is at most O(1/n2)O(1/n^{2}) more than twice what it would be if the probability distribution of degree profiles really was the poisson distribution.

In summary, we have proved the following.

So, this hypothesis testing procedure has an error probability of at most

With probability 1o(1)1-o(1), the size of every community is within lnn/n\sqrt{\ln n/n} of its expected value, and the edge density between each pair of communities is within nlnn\sqrt{n}\ln n of its expected value. Also, there exists a constant cc^{\prime} such that with probability 1o(1)1-o(1), no vertex has degree more than clnnc^{\prime}\ln n. Assuming this is the case, the misclassification of vertices can not change the apparent number of edges between any two communities by more than cδnlnnc^{\prime}\delta n\ln n, and it can not change the apparent size of any community by more than δn\delta n. Thus, pipilnn/n+δ|p^{\prime}_{i}-p_{i}|\leq\sqrt{\ln n/n}+\delta and

This lets us prove the following “robust” version of this lemma to prove Theorem 7.

First, note that by the previous lemma, with probability 1o(1)1-o(1), GG is such that every element of the estimates of pp and QQ is within cδ+lnn/nc\delta+\sqrt{\ln n/n} of its true value. Choose c3c_{3} such that every vertex in the graph has degree less than c3lnnc_{3}\ln n with probability 1o(1)1-o(1). If this holds, then the probability of misclassifying a vertex based on its true degree profile and the estimates of its parameters is at most

based on the previous bounds on the probability that classifying a vertex based on its degree profile fails.

The key observation is that vv’s mmth neighbor had at least a mini,j(piqi,j)/piqi,j\min_{i,j}(p_{i}q_{i,j})/\sum p_{i^{\prime}}q_{i^{\prime},j} chance of actually being in community σ\sigma for each σ\sigma, so its probability of being reported as being in community σ\sigma is at most 1+c1δ1+c^{\prime}_{1}\delta times the probability that it actually is. So, the probability that its reported degree profile is bad is at most (1+c1δ)N1(v)(1+c^{\prime}_{1}\delta)^{|N_{1}(v)|} times the probability that its actual degree profile is bad. The conclusion follows. ∎

2.3 Proof of Theorem 7

We prove the possibility result here. The converse was proven in [AS15] for known parameters, hence also applies here.

The idea behind Claim 1 is contained in Lemma 16. However, there are several technical steps that need to be handled:

The graphs GG^{\prime} and GG^{\prime\prime} obtained in step 1 of the algorithm are correlated, since an edge cannot be both in GG^{\prime} and GG^{\prime\prime}. However, this effect can be discarded since two independent versions would share edges with low enough probability.

The classification in step 2 using Agnostic-sphere-comparison has a vanishing fraction of vertices which are wrongly labeled, and the SBM parameters are unknown. This requires using the robust version of Lemma 16, namely Lemma 18.

In the case where D+((PQ)i,(PQ)j)=1D_{+}((PQ)_{i},(PQ)_{j})=1 a more careful classification is needed as carried in steps 3 and 4 of the algorithm.

With probability 1O(1/n)1-O(1/n), no vertex in the graph has degree greater than c3lnnc_{3}\ln n. Assuming that this holds, no vertex’s set of neighbors in GG^{\prime\prime} is more than

times as likely to occur as it would be if GG^{\prime\prime} were independent of GG^{\prime}. So, the fact that they are not has negligible impact on the algorithm’s error rate. Now, let

By Lemma 18, if the classification in step 22 has an error rate of at most δ\delta, then the classification in step 33 has an error rate of

observing that if σvσv\sigma^{\prime}_{v}\neq\sigma_{v} the error rate of σv\sigma^{\prime\prime}_{v^{\prime}} for vv^{\prime} adjacent to vv is at worst multiplied by a constant. That in turn ensures that the final classification has an error rate of at most