Convexified Modularity Maximization for Degree-corrected Stochastic Block Models

Yudong Chen, Xiaodong Li, Jiaming Xu

Introduction

Detecting communities/clusters in networks and graphs is an important subroutine in many applications across computer, social and natural sciences and engineering. A standard framework for studying community detection in a statistical setting is the stochastic block model (SBM) proposed in Holland et al. . Also known as the planted partition model in the computer science literature [Condon and Karp, 2001], SBM is a random graph model for generating networks from a set of underlying clusters. The statistical task is to accurately recover the underlying true clusters given a single realization of the random graph.

The versatility and analytic tractability of SBM have made it arguably the most popular model for studying community detections. It however falls short of abstracting a key aspect of real-world networks. In particular, an unrealistic assumption of SBM is that within each community, the degree distributions of each node are the same. In empirical network data sets, however, the degree distributions are often highly inhomogeneous across nodes, sometimes exhibiting a heavy tail behavior with some nodes having very high degrees (so-called hubs). At the same time, sparsely connected nodes with small degrees are also common in real networks. To overcome this shortcoming of the SBM, the degree-corrected stochastic block model (DCSBM) was introduced in the literature to allow for degree heterogeneity within communities, thereby providing a more flexible and accurate model of real-world networks [Dasgupta et al., 2004; Karrer and Newman, 2011].

A number of community detection methods have been proposed based on DCSBM, such as model-based methods and spectral methods. Model-based methods include profile likelihood maximization and modularity maximization [Newman, 2006; Karrer and Newman, 2011]. Although these methods enjoy certain statistical guarantees [Zhao et al., 2012], they often involve optimization over all possible partitions, which is computationally intractable. Recent work in Amini et al. ; Le et al. [2015+] discusses efficient solvers, but the theoretical guarantees are only established under restricted settings such as those with two communities. Spectral methods, which estimate the communities based on the eigenvectors of the graph adjacency matrix and its variants, are computationally fast. Statistical guarantees are derived for spectral methods under certain settings (see, e.g., Dasgupta et al. ; Coja-Oghlan and Lanka ; Chaudhuri et al. ; Qin and Rohe ; Lei and Rinaldo ; Jin ; Gulikers et al. ), but numerical validation on synthetic and real data has not been as thorough. One notable exception is the SCORE method proposed in Jin , which achieved one of the best known clustering performance on the political blogs dataset from Adamic and Glance . Spectral methods are also known to suffer from inconsistency in sparse graphs [Krzakala et al., 2013] as well as sensitivity to outliers [Cai and Li, 2015]. We discuss other related work in details in Section 5.

In this paper, we seek for a clustering algorithm that is computationally feasible, has strong statistical performance guarantees under DCSBM, and provides competitive empirical performance. Our approach makes use of the robustness and computational power of convex optimization. Under the standard SBM, convex optimization has been proven to be statistically efficient under a broad range of model parameters, including the size and number of communities as well as the sparsity of the network; see e.g. Chen et al. ; Chen and Xu ; Guédon and Vershynin ; Ames and Vavasis ; Oymak and Hassibi . Moreover, a significant advantage of convex methods is their robustness against arbitrary outlier nodes, as is established in the theoretical framework in Cai and Li . There, it is also observed that their convex optimization approach leads to state-of-the-art misclassification rates in the political blogs dataset, in which the node degrees are highly heterogeneous. These observations motivate us to study whether strong theoretical guarantees under DCSBM can be established for convex optimization-based methods.

Problem setup and algorithms

In this section, we set up the community detection problem under DCSBM, and describe our algorithms based on convexified modularity maximization and weighted kk-median clustering.

2 Generalized modularity maximization

where di:=j=1nAijd_{i}:=\sum_{j=1}^{n}A_{ij} is the degree of node ii, and L=12i=1ndiL=\frac{1}{2}\sum_{i=1}^{n}d_{i} is the total number of edges in the graph. The modularity maximization approach to community detection is based on finding a partition Ym\bm{Y}_{m} that optimizes Q(Y)Q(\bm{Y}):

This standard form of modularity maximization is known to suffer from a “resolution limit” and cannot detect small clusters [Fortunato and Barthelemy, 2007]. To address this issue, several authors have proposed to replace the normalization factor 12L\frac{1}{2L} by a tuning parameter λ\lambda [Reichardt and Bornholdt, 2006; Lancichinetti and Fortunato, 2011], giving rise to the following generalized formulation of modularity maximization:

While modularity maximization enjoys several desirable statistical properties under SBM and DCSBM [Zhao et al., 2012; Amini et al., 2013], the associated optimization problems (2.2) and (2.3) are not computationally feasible due to the combinatorial constraint, which limits the practical applications of these formulations. In practice, modularity maximization is often used as a guidance for designing heuristic algorithms.

Here we take a more principled approach to computational feasibility while maintaining provable statistical guarantees: we develop a tractable convex surrogate for the above combinatorial optimization problems, whose solution is then refined by a novel weighted kk-median algorithm.

3 Convex relaxation

Introducing the degree vector d=(d1,,dn)\bm{d}=(d_{1},\ldots,d_{n})^{\top}, we can rewrite the generalized modularity maximization problem (2.3) in matrix form as

Besides being positive semidefinite, a partition matrix Y\bm{Y} also satisfies the linear constraints 0Yij10\leq Y_{ij}\leq 1 and Yii=1Y_{ii}=1 for all i,j[n]i,j\in[n]. Using these properties of partition matrices, we obtain the following convexification of the modularity optimization problem (2.4):

Here J\bm{J} is the n×nn\times n matrix with all entries equal to 11. Implementation of the formulation (2.6) requires choosing an appropriate tuning parameter λ\lambda. We will discuss the theoretical range for λ\lambda for consistent clustering in Section 3, and empirical choice of λ\lambda in Section 4. As our convexification is based on the generalized version (2.3) of modularity maximization, it is capable of detecting small clusters, even when the number of clusters rr grows with nn, as is shown later.

4 Explicit clustering via weighted k𝑘k-median

Specifically, our weighted kk-median procedure consists of two steps. First, we multiply the columns of Y^\widehat{\bm{Y}} by the corresponding degrees to obtain the matrix W^:=Y^D\widehat{\bm{W}}:=\widehat{\bm{Y}}\bm{D}, where D:=diag(d)=diag(d1,,dn)\bm{D}:={\rm diag}(\bm{d})={\rm diag}(d_{1},\ldots,d_{n}), which is the diagonal matrix formed by the entries of d\bm{d}. Clustering is performed on the row vectors of W^\widehat{\bm{W}} instead of Y^\widehat{\bm{Y}}. Note that if we consider the ii-th row of Y^\widehat{\bm{Y}} as a vector of nn features for node ii, then the rows of W^\widehat{\bm{W}} can be thought of as vectors of weighted features.

Additionally, we require that the center vectors x1,,xr\bm{x}_{1},\ldots,\bm{x}_{r} are chosen from the row vectors of W^\widehat{\bm{W}} (these centers are sometimes called medoids).

where Rows(Z)\text{Rows}(\bm{Z}) denotes the collection of row vectors of a matrix Z\bm{Z}, and Z1\|\bm{Z}\|_{1} denotes the sum of the absolute values of all entries of Z\bm{Z}.

As the solution Y^\widehat{\bm{Y}} to the convex relaxation (2.6) and the approximate solution Ψˇ\widecheck{\bm{\Psi}} to the kk-median problem (2.7) can both be computed in polynomial-time, our algorithm is computationally tractable. In the next section, we turn to the statistical aspect and show that the clustering induced by Y^\widehat{\bm{Y}} and Ψˇ\widecheck{\bm{\Psi}} is close to the true underlying clusters, under some mild and interpretable conditions of DCSBM.

Theoretical results

In this section, we provide theoretical results characterizing the statistical properties of our algorithm. We show that under mild conditions of DCSBM, the difference between the convex relaxation solution Y^\widehat{\bm{Y}} and the true partition matrix Y\bm{Y}^{*}, and the difference between the approximate kk-median clustering Ψˇ\widecheck{\bm{\Psi}} and the true clustering Ψ\bm{\Psi}^{*}, are well bounded. When additional conditions hold, we further show that Y^\widehat{\bm{Y}} perfectly recovers the true clusters. Our results are non-asymptotic in nature, valid for any scaling of nn, rr, θ\bm{\theta} and B\bm{B} etc.

In the literature of community detection by convex optimization under standard SBM, it is often assumed that the minimum within-cluster edge density is greater than the maximum cross-cluster edge density, i.e.,

See for example Chen et al. ; Oymak and Hassibi ; Ames and Vavasis ; Cai and Li ; Guédon and Vershynin . This requirement (3.1) can be directly extended to the DCSBM setting, leading to the condition

Under DCSBM, however, this condition would often be overly restrictive, particularly when the degree parameters {θi}\{\theta_{i}\} are imbalanced with some of them being very small. In particular, this condition is highly sensitive to the minimum value θmin:=min1inθi,\theta_{\min}:=\min_{1\leq i\leq n}\theta_{i}, which is unnecessary since the community memberships of nodes with larger θi\theta_{i} may still be recoverable.

Here we instead consider a version of the density gap condition that is much milder and more appropriate for DCSBM. For each cluster index 1ar1\leq a\leq r, define the quantities

Therefore, the quantity HaH_{a} controls the average degree of the nodes in the aa-th cluster. With this notation, we impose the condition that

We refer to the condition (3.4) as the degree-corrected density gap condition. This condition can be viewed as the “average” version of (3.2), as it depends on the aggregate quantity HaH_{a} associated with each cluster aa rather than the θi\theta_{i}’s of individual nodes in the cluster — in particular, the condition (3.4) is robust against small θmin\theta_{\min}. This condition plays a key role throughout our theoretical analysis, for both approximate and exact cluster recovery under DCSBM.

To gain intuition on the new degree-corrected density gap condition (3.4), consider the following sub-class of DCSBM with symmetric/balanced clusters.

We say that a DCSBM obeys a F(n,r,p,q,g)\mathcal{F}(n,r,p,q,g)-model, if Baa=pB_{aa}=p for all a=1,,ra=1,\ldots,r, Bab=qB_{ab}=q for all 1a<br1\leq a<b\leq r, and G1=G2==Gr=gG_{1}=G_{2}=\cdots=G_{r}=g.

In a F(n,r,p,q,g)\mathcal{F}(n,r,p,q,g)-model, the true clusters are balanced in terms of the connectivity matrix B\bm{B} and the sum of the degree heterogeneity parameters (rather than the cluster size). Under this model, straightforward calculation gives Ha=((r1)q+p)gH_{a}=((r-1)q+p)g for all a=1ra=1\ldots r. The degree-corrected density gap condition (3.4) then reduces to p>qp>q, i.e., the classical density gap condition (3.1).

2 Theory of approximate clustering

Also recall our definitions of HaH_{a} and GaG_{a} in equation (3.3). Furthermore, for each 1ar1\leq a\leq r and iCai\in C^{*}_{a}, define the quantity fi:=θiHaf_{i}:=\theta_{i}H_{a}, which corresponds approximately to the expected degree of node ii and satisfies f1=a,bBabGaGb\|\bm{f}\|_{1}=\sum_{a,b}B_{ab}G_{a}G_{b}.

Under DCSBM, assume that the degree-corrected density gap condition (3.4) holds. Moreover, suppose that the tuning parameter λ\lambda in the convex relaxation (2.6) satisfies

for some number δ>0\delta>0. Then with probability at least 0.992(e/2)2n0.99-2(e/2)^{-2n}, the solution Y^\widehat{\bm{Y}} to (2.6) satisfies the bound

We prove this claim in Section 7.1. The bound (3.6) holds with probability close to one. Notably, it is insensitive to θmin\theta_{\min} as should be expected, because community memberships of nodes with relatively large θi\theta_{i} are still recoverable. In contrast, the error bounds of several existing methods, such as that of SCORE method in [Jin, 2015, eq. (2.15), (2.16)], depend on θmin\theta_{\min} crucially.

Under the F(n,r,p,q,g)\mathcal{F}(n,r,p,q,g)-model, recall that Ha(p+(r1)q)gH_{a}\equiv(p+(r-1)q)g and density gap condition (3.4) becomes p>qp>q. Moreover, the constraint (3.5) for δ\delta and λ\lambda becomes

Note that the first inequality above is the same as the standard density gap condition imposed in, for example, Chen et al. ; Chen and Xu ; Cai and Li . Furthermore, the vector f\bm{f} satisfies f1=r(p+(r1)q)g2r2pg2\|\bm{f}\|_{1}=r(p+(r-1)q)g^{2}\leq r^{2}pg^{2}. Substituting these expressions into the bound (3.6), we obtain the following corollary for the symmetric DCSBM setting.

Under the F(n,r,p,q,g)\mathcal{F}(n,r,p,q,g)-model of DCSBM, if the condition (3.7) holds for the density gap and tuning parameter, then with probability at least 0.992(e/2)2n0.99-2(e/2)^{-2n}, the solution Y^\widehat{\bm{Y}} to the convex relaxation (2.6) satisfies the bound

Note that if pq=c\frac{p}{q}=c for an absolute constant cc, then the bound (3.8) takes the simpler form YY^1,θn+rgnpδ\|\bm{Y}^{*}-\widehat{\bm{Y}}\|_{1,\bm{\theta}}\lesssim\frac{n+rg\sqrt{np}}{\delta}. If θi=1\theta_{i}=1 for all nodes ii, the F(n,r,p,q,g)\mathcal{F}(n,r,p,q,g)-model reduces to the standard SBM with equal community size. If we assume r=O(1)r=O(1) additionally, and note that g=n/rg=n/r and let δ=pq4\delta=\frac{p-q}{4}, then the error bound (3.8) becomes

This bound matches the error bounds proved in [Guédon and Vershynin, 2015, Theorem 1.3].

Let Sr\mathcal{S}_{r} denote the set of all r×rr\times r permutation matrices. The set of misclassified nodes with respect to a permutation matrix ΠSr\bm{\Pi}\in\mathcal{S}_{r} is defined as

With this definition, we have the following theorem that quantifies the misclassification rate of approximate kk-median solution Ψˇ\widecheck{\bm{\Psi}}.

Under the F(n,r,p,q,g)\mathcal{F}(n,r,p,q,g)-model, assume that the parameters δ\delta and λ\lambda satisfy (3.7). Then with probability at least 0.992(e/2)2n0.99-2(e/2)^{-2n}, the approximate kk-median solution Ψˇ\widecheck{\bm{\Psi}} satisfies the bound

We prove this claim in Section 7.2. Extension to the general DCSBM setting is straightforward.

If we let Πθ\bm{\Pi}_{\bm{\theta}} be a minimizer of the LHS of (3.9) and Eθ:=E(Πθ)\mathcal{E}_{\bm{\theta}}:=\mathcal{E}(\bm{\Pi}_{\bm{\theta}}), then the quantity iEθθi\sum_{i\in\mathcal{E}_{\bm{\theta}}}\theta_{i} is the number of misclassified nodes weighted by their degree heterogeneity parameters {θi}\{\theta_{i}\}. Theorem 2 controls this weighted quantity, which is natural as nodes with smaller θi\theta_{i} are harder to cluster and thus less controlled in (3.9). Notably, the bound given in (3.9) is applicable even in the sparse graph regime with bounded average degrees, i.e., p,q=O(1/n)p,q=O(1/n). For example, suppose that p=a/np=a/n and q=b/nq=b/n for two fixed constants a>ba>b, r=O(1)r=O(1) and gng\asymp n; if (ab)/a(a-b)/\sqrt{a} is sufficiently large, then, with the choice δ(ab)/n\delta\asymp(a-b)/n, the right hand side of (3.9) can be an arbitrarily small constant times nn. In comparison, conventional spectral methods are known to be inconsistent in this sparse regime [Krzakala et al., 2013]. While this difficulty is alleviated under SBM by the use of regularization or non-backtracking matrices (e.g., Le and Vershynin ; Bordenave et al. ), rigorous justification and numerical validation under DCSBM have not been well explored.

It is sometimes desirable to have a direct (unweighted) bound on the number misclassified nodes. Suppose that Π0Sr\bm{\Pi}_{0}\in\mathcal{S}_{r} is a permutation matrix that minimizes E(Π)|\mathcal{E}(\bm{\Pi})|, and let E0:=E(Π0)\mathcal{E}_{0}:=\mathcal{E}(\bm{\Pi}_{0}). A bound on the unweighted misclassification error E0|\mathcal{E}_{0}| can be easily derived from the general weighted bound (3.9). For example, combining (3.9) with the AM-HM inequality iEθθiEθ2iEθ1/θi\sum_{i\in\mathcal{E}_{\bm{\theta}}}\theta_{i}\geq\frac{|\mathcal{E}_{\bm{\theta}}|^{2}}{\sum_{i\in\mathcal{E}_{\bm{\theta}}}1/\theta_{i}}, we obtain that

Another bound on E0|\mathcal{E}_{0}|, which is applicable even when some θi\theta_{i}’s are zero, can be derived as follows: we pick any number τ>0\tau>0 and use the inequality (3.9) to get

This simple bound is already quite useful, for example in standard SBM with θi1\theta_{i}\equiv 1, p1np\geq\frac{1}{n} and rr equal-sized clusters. In this case, setting τ=0.9\tau=0.9 in (3.11) yields that the number of misclassified nodes satisfies E0r2nppq.|\mathcal{E}_{0}|\lesssim\frac{r^{2}\sqrt{np}}{p-q}. When r=2r=2, this bound is consistent with those in Guédon and Vershynin , but our result is more general as it applies to more clusters r3r\geq 3.

3 Theory of perfect clustering

In this section, we show that under an additional condition on the minimum degree heterogeneity parameter θmin=min1inθj\theta_{\min}=\min_{1\leq i\leq n}\theta_{j}, the solution Y^\widehat{\bm{Y}} to the convex relaxation perfectly recovers the true partition matrix Y\bm{Y}^{*}. In this case the true clusters can be extracted easily from Y^\widehat{\bm{Y}} without using the kk-median procedure.

For the purpose of studying perfect clustering, we consider a setting of DCSBM with Baa=pB_{aa}=p for all a=1,,ra=1,\ldots,r, and Bab=qB_{ab}=q for all 1a<br1\leq a<b\leq r. Under this setup, the degree-corrected density gap condition (3.5) becomes

Recalling the definition of GaG_{a} in (3.3), we further define Gmin:=min1arGa.G_{\min}:=\min_{1\leq a\leq r}G_{a}. The following theorem characterizes when perfect clustering is guaranteed.

Suppose that the degree-corrected density gap condition (3.12) is satisfied for some number δ>0\delta>0 and tuning parameter λ\lambda, and that

for some sufficiently large absolute constant C0C_{0}. Then with probability at least 110n11-10n^{-1}, the solution Y^\widehat{\bm{Y}} to the convex relaxation (2.6) is unique and equals Y\bm{Y}^{*}.

The condition (3.13) depends on the minimum values GminG_{\min} and θmin\theta_{\min}. Such dependence is necessary for perfect clustering, as clusters and nodes with overly small GaG_{a} and θi\theta_{i} will have too few edges and are not recoverable. In comparison, the approximate recovery results in Theorem 1 are not sensitive to either θmin\theta_{\min} or GminG_{\min}, as should be expected. Valid for the more general DCSBM, Theorem 3 significantly generalizes the existing theory for standard SBM on perfect clustering by SDP in the literature (see, e.g., Chen et al. ; Chen and Xu ; Cai and Li ). Taking nn\to\infty, Theorem 3 guarantees that the probability of perfect clustering converges to one, thereby implying the convex relaxation approach is strongly consistent in the sense of Zhao et al. .

In the special case of standard SBM with θi=1,i[n]\theta_{i}=1,\forall i\in[n], the density gap lower bound (3.13) simplifies to

Numerical results

In this section, we provide numerical results on both synthetic and real datasets, which corroborate our theoretical findings. Our convexified modularity maximization approach is found to empirically outperform state-of-the-art methods in several settings.

The convexified modularity maximization problem (2.6) is a semidefinite program (SDP), and can be solved efficiently by a range of general and specialized algorithms. Here we use the alternating direction method of multipliers (ADMM) suggested in Cai and Li . To specify the ADMM solver, we need some notations as follows. For any two n×nn\times n matrices X\bm{X} and Y\bm{Y}, let max{X,Y}\max\{\bm{X},\bm{Y}\} be the matrix whose (i,j)(i,j)-th entry is given by max{Xij,Yij}\max\{X_{ij},Y_{ij}\}; the matrix min{X,Y}\min\{\bm{X},\bm{Y}\} is similarly defined. For a symmetric matrix X\bm{X} with an eigenvalue decomposition X=UΣU\bm{X}=\bm{U}\bm{\Sigma}\bm{U}^{\top}, let (X)+:=Umax{Σ,0}U(\bm{X})_{+}:=\bm{U}\max\{\bm{\Sigma},\bm{0}\}\bm{U}^{\top}, and let (X)I(\bm{X})_{\bm{I}} be the matrix obtained by setting all the diagonal entries of X\bm{X} to 11. Recall that J\bm{J} denotes the n×nn\times n all-one matrix. The ADMM algorithm for solving (2.6) with the dual update step size equal to 11, is given as Algorithm 1.

After obtaining the solution Y^\widehat{\bm{Y}} to the convex relaxation, we extract an explicit clustering using the weighted kk-median procedure described in (2.7) with k=rk=r, where the number of major clusters rr is assumed known. Our complete community detection algorithm, Convexified Modularity Maximization (CMM), is summarized in Algorithm 2. In our experiments, the weighted kk-median problem is solved by an iterative greedy procedure that optimizes alternatively over the variables Ψ\bm{\Psi} and X\bm{X} in (2.7), with 100100 random initializations.

We provide experiment results on synthetic data generated from DCSBM. For each node i[n]i\in[n], the degree heterogeneity parameter θi\theta_{i} is sampled independently from a Pareto(α,β)(\alpha,\beta) distribution with the density function f(xα,β)=αβαxα+11{xβ}f(x|\alpha,\beta)=\frac{\alpha\beta^{\alpha}}{x^{\alpha+1}}{\mathbf{1}_{\left\{{x\geq\beta}\right\}}}, where α\alpha and β\beta are called the shape and scale parameters, respectively. We consider different values of the shape parameter, and choose the scale parameter accordingly so that the expectation of each θi\theta_{i} is fixed at 11. Note that the variability of the θi\theta_{i}’s decreases with the shape parameter α\alpha. Given the degree heterogeneity parameters {θi}\{\theta_{i}\} and two numbers 0q<p10\leq q<p\leq 1, a graph is generated from DCSBM, with the edge probability between nodes iCai\in C^{*}_{a} and jCbj\in C^{*}_{b} being min(1,θiθjBab)\min(1,\theta_{i}\theta_{j}B_{ab}) and Baa=p,Bab=q,1abrB_{aa}=p,B_{ab}=q,\forall 1\leq a\neq b\leq r.

We applied our CMM approach in Algorithm 2 to the resulting graph, and recorded the misclassification rate E(Π0)/n|{\mathcal{E}}(\bm{\Pi}_{0})|/n (cf. the discussion after Theorem 2). For comparison, we also applied the SCORE algorithm in Jin and the OCCAM algorithm in Zhang et al. , which are reported to have state-of-the-art empirical performance on DCSBM in the existing literature. The SCORE algorithm performs kk-means on the top-22 to top-rr eigenvectors of the adjacency matrix normalized element-wise by the top-11 eigenvector. OCCAM is a type of regularized spectral kk-median algorithm; it can produce non-overlapping clusters and its regularization parameter is given explicitly in Zhang et al. . For all kk-means/medians procedures used in the experiments, we took k=rk=r and used 100100 random initializations.

Fig. 1 shows the misclassification rates of CMM (solid lines), SCORE (dash lines) and OCCAM (individual markers) for various settings of nn, pp, qq, cluster size and the shape parameter for θ\bm{\theta}. We see that the misclassification rate of CMM grows as the degree parameters {θi}\{\theta_{i}\} becomes more heterogeneous (smaller values of the shape parameter), and as the graph becomes sparser, which is consistent with the prediction of Theorem 2. Moreover, our approach has consistently lower misclassification rates than SCORE and OCCAM, with SCORE and OCCAM exhibiting similar performance.

2 Political blog network dataset

We next test the empirical performance of CMM (Algorithm 2), SCORE and OCCAM on the US political blog network dataset from Adamic and Glance . This dataset consists of 19090 hyperlinks (directed edges) between 1490 political blogs collected in the year 2005. The political leaning (liberal versus conservative) of each blog has been labeled manually based on blog directories, incoming and outgoing links and posts around the time of the 2004 presidential election. We treat these labels as the true memberships of r=2r=2 communities. We ignore the edge direction, and focus on the largest connected component with n=1222n=1222 nodes and 16,71416,714 edges, represented by the adjacency matrix A\bm{A}. This graph has high degree variation: the maximum degree is 351351 while the mean degree is around 2727. CMM, SCORE and OCCAM misclassify 6262, 5858 and 6565 nodes, respectively, out of 12221222 nodes on this dataset. The SCORE method has the best known error rate on the political blogs dataset in the literature [Jin, 2015], and we see that our CMM approach is comparable to the state of the art. Panel (a) in Fig. 2 shows the adjacency matrix A\bm{A} with rows and columns sorted according to the true community labels. The output of ADMM Algorithm 1 for solving the convex relaxation (2.6) is shown in Fig. 2 (b). The partition matrix corresponding to the output of the weighted kk-median step in Algorithm 2 is shown in Fig. 2 (c).

3 Facebook dataset

In this section, we consider the Facebook network dataset from Traud et al. , and compare the empirical performance of our CMM approach with the SCORE and OCCAM methods. The Facebook network dataset consists of 100 US universities and all the “friendship” links between the users within each university, recorded on one particular day in September 2005. The dataset also contains several node attributes such as the gender, dorm, graduation year and academic major of each user. Here we report results on the friendship networks of two universities: Simmons College and Caltech.

The Simmons College network contains 1518 nodes and 32988 undirected edges. The subgraph induced by nodes with graduation year between 2006 and 2009 has a largest connected component with 1137 nodes and 24257 undirected edges, which we shall focus on. It is observed in Traud et al. that the community structure of the Simmons College network exhibits a strong correlation with the graduation year — students in the same year are more likely to be friends. Panel (a) of Fig. 3 shows this largest component with nodes colored according to their graduation year.

We applied the CMM (Algorithm 2), SCORE and OCCAM methods to partition the largest component into r=4r=4 clusters. In Panels (b)–(d) of Fig. 3 the clustering results of these three methods are shown as the node colors. In Fig. 4 we also provide the confusion matrices of the clustering results against the graduation years; the (i,j)(i,j)-th entry of a confusion matrix represents the number of nodes that are from graduation year i+2005i+2005 but assigned to cluster jj by the algorithm. We see that our CMM approach produced a partition more correlated with the actual graduation years. In fact, if we treat the graduation years as the ground truth cluster labels, then CMM misclassified 12.04%12.04\% of the nodes, whereas SCORE and OCCAM have higher misclassification rates of 23.57%23.57\% and 22.43%22.43\%, respectively. A closer investigation of Fig. 3 and Fig. 4 shows that CMM was better in distinguishing between the nodes of year 2006 and 2007.

3.2 Caltech network

In this section, we provide experiment results on the Caltech network. This network has 769 nodes and 16656 undirected edges. We consider the subgraph induced by nodes with known dorm attributes, and focus on its largest connected component, which consists of 590 nodes and 12822 edges. The community structure is highly correlated with which of the 88 dorms a user is from, as observed in Traud et al. .

We applied CMM, SCORE and OCCAM to partition this largest component into r=8r=8 clusters. With the dorms as the ground truth cluster labels, CMM misclassified 21.02%21.02\% of the nodes, whereas SCORE and OCCAM had higher misclassification rates of 31.02%31.02\% and 32.03%32.03\%, respectively. The confusion matrices of these methods are shown in Fig. 5. We see that dorm 3 was difficult to recover and largely missed by all three methods, but our CMM algorithm better identified the other dorms.

Related work

In this section, we discuss prior results that are related to our work. Existing community detection methods for DCSBM include model-based methods and spectral methods. In model-based methods such as profile likelihood and modularity maximization [Newman, 2006], one fits the model parameters to the observed network based on the likelihood functions or modularity functions determined by the statistical structure of DCSBM. In Karrer and Newman , the maximum likelihood estimator is used to infer the unknown model parameters θ\bm{\theta} and B\bm{B}. These estimates are then plugged into the log likelihood function, which leads to a quality function for community partitions. An estimate of the community structure is obtained by maximizing this quality function using a greedy heuristic algorithm. No provable theoretical guarantee is known for this greedy algorithm, and one usually needs to run the algorithm with many random initial points to achieve good performance. The work in Zhao et al. discusses profile likelihood methods for DCSBM and the closely related modularity maximization approach. Under the assumption that the number of clusters is fixed, strong consistency is proved when the average degree is Ω(logn)\Omega(\log n), and weak consistency when it is Ω(1)\Omega(1). However, directly solving the maximization problems is computationally infeasible, as it involves searching over all possible partitions. Numerically, these optimization problems are solved heuristically using Tabu search and spectral decomposition without theoretical guarantees. The algorithm proposed in Amini et al. involves finding an initial clustering using spectral methods, then iteratively updating the labels via maximizing conditional pseudo likelihood, which is done using the EM algorithm in each step of iteration. After simplifying the iterations into one E-step, they establish guaranteed consistency when there are only two clusters. The work in Le et al. [2015+] proposes to approximate the profile likelihood functions, modularity functions or other criterions using surrogates defined in a 22-dimensional subspace constructed by spectral dimension reduction. Thanks to the convexity of the surrogate functions, the search complexity is polynomial. The method and theory are however only applicable when there are two communities.

Spectral methods for community detection have attracted interest from diverse communities including computer science, applied math, statistics, and machine learning; see e.g. Rohe et al. and the references therein for results of spectral clustering under SBM. The seminal work of Dasgupta et al. on DCSBM (proposed under the name of Extended Planted Partition model) considered a spectral method similar to that in McSherry . One major drawback is that the knowledge of θ\bm{\theta} is required in both the theory and algorithm. In the algorithm proposed in Coja-Oghlan and Lanka , the adjacency matrix is first normalized by the node degrees and then thresholded entrywise, after which spectral clustering is applied. Strong consistency is proved for the setting with a fixed number of clusters. In Chaudhuri et al. , a modified spectral clustering method was proposed using a regularized random-walk graph Laplacian, and strong consistency is established under the assumption that the average degree grows at least as Ω(n)\Omega(\sqrt{n}). A different spectral clustering approach based on regularized graph Laplacians is considered in Qin and Rohe . Their theoretical bound on the misclassified rates depends on the eigenvectors of the graph Laplacian, which is a still random object. Spectral clustering based on unmodified adjacency matrices and degree-normalized adjacency matrices are analyzed in Lei and Rinaldo and Gulikers et al. , which prove rigorous error rate results but did not provide numerical validation on either synthetic or real data.

It is observed in Jin that spectral clustering based directly the adjacency matrix (or their normalized version) often result in inconsistent clustering in real data, such as the political blogs dataset Adamic and Glance , a popular benchmark for testing community detection approaches. To address this issue, a new spectral clustering algorithm called SCORE is proposed in Jin . Specifically, the second to rr-th leading eigenvectors are divided by the first leading eigenvector elementwisely, and spectral clustering is applied to the resulting ratio matrix. In their theoretical results, an implicit assumption is that the number of communities rr is bounded by a constant, as implied by the condition (2.14) in Jin . In comparison, our convexified modularity maximization approach works for growing rr both theoretically and empirically. As illustrated in Section 4.1, our method exhibited better performance on both the synthetic and real datasets considered there, especially when r3r\geq 3.

Discussion and future work

The proposed method also enjoys good empirical performance, as was demonstrated on both synthetic data and real-world networks. On these datasets our method was observed to have performance comparable to, and sometimes better than, the state-of-the-art spectral clustering methods, particularly when there are more than two communities.

A future direction important in both theory and practice, is to consider networks with overlapping communities, where a node may belong to multiple communities simultaneously. To accommodate such a setting several extensions of SBM have been introduced in the literature. For example, Zhang et al. proposed a spectral algorithm based on the Overlapping Continuous Community Assignment Model (OCCAM). As our CMM method is shown to be an attractive alternative to spectral methods for DCSBM, it will be interesting to extend CMM to allow for both overlapping communities and heterogeneous degrees. Another direction of interest is to develop a general theory of optimal misclassification rates for DCSBM along the lines of Gao et al. ; Zhang and Zhou .

Proofs

Recall that d\bm{d} and f\bm{f} are the vectors of node degrees and their expectations, respectively, where

for each iCai\in C^{\ast}_{a}, a=1,,ra=1,\ldots,r. A key step in our proofs is to appropriately control the deviation of the degrees from their expectation. This is done in the following lemma.

Under DCSBM, with probability at least 0.990.99, we have

By Markov’s inequality, with probability 0.9950.995, there holds

We control the terms S1S_{1}, S2S_{2} and S3S_{3} separately below.

Similarly, for each pair iCai\in C_{a}^{*} and jCbj\in C_{b}^{*} with 1a<br1\leq a<b\leq r, we have

Combining the last two inequalities, we obtain the bound

By Grothendieck’s inequality [Grothendieck, 1953; Lindenstrauss and Pełczyński, 1968] we have

where KGK_{G} is Grothendieck’s constant known to satisfy KG1.783K_{G}\leq 1.783. Since λmin1arBaaHa2\lambda\leq\min_{1\leq a\leq r}\frac{B_{aa}}{H_{a}^{2}}, applying Lemma 1 on ffdd1\|\bm{f}\bm{f}^{\top}-\bm{d}\bm{d}^{\top}\|_{\infty\to 1} ensures that with probability at least 0.990.99,

It follows from Grothendieck’s inequality that

The norm on the last RHS can be expressed as

For each fixed pair of sign vectors x,y{±1}n\bm{x},\bm{y}\in\{\pm 1\}^{n}, Bernstein’s inequality ensures that for each t>0t>0, with probability at most 2et2e^{-t} there holds the inequality

where σ2:=i<jvarAij12a,b=1rBabGaGb=12f1.\sigma^{2}:=\sum_{i<j}\mathsf{var}A_{ij}\leq\frac{1}{2}\sum_{a,b=1}^{r}B_{ab}G_{a}G_{b}=\frac{1}{2}\|\bm{f}\|_{1}. Setting t=2nt=2n and applying the union bound over all sign vectors, we obtain that with probability at most 2(e/2)2n2(e/2)^{-2n},

It follows that with probability at least 12(e/2)2n1-2(e/2)^{-2n},

Putting together the bounds for S1S_{1}, S2S_{2} and S3S_{3}, we conclude that with probability at least 0.992(e/2)2n0.99-2(e/2)^{-2n}, the bound (3.6) holds.

2 Proof of Theorem 2

Recall that (Ψ,X)(\overline{\bm{\Psi}},\overline{\bm{X}}) is the exact optimal solution to the weighted kk-median problem (2.7), (Ψˇ,Xˇ)(\widecheck{\bm{\Psi}},\widecheck{\bm{X}}) is the approximate solution, and W^:=Y^D\widehat{\bm{W}}:=\widehat{\bm{Y}}\bm{D} is the column-weighted version of the solution Y^\widehat{\bm{Y}} to the convex program (2.6). The last constraint in (2.7) ensures that the row vectors of X\overline{\bm{X}} and Xˇ\widecheck{\bm{X}} are subsets of the row vectors of W^\widehat{\bm{W}}. If we define the matrices W:=Ψ  X\overline{\bm{W}}:=\overline{\bm{\Psi}}\;\overline{\bm{X}} and Wˇ:=Ψˇ  Xˇ\widecheck{\bm{W}}:=\widecheck{\bm{\Psi}}\;\widecheck{\bm{X}}, then the row vectors of W\overline{\bm{W}} and Wˇ\widecheck{\bm{W}} are also subsets of the row vectors of W^\widehat{\bm{W}}. For any matrix M\bm{M}, we let Mi\bm{M}_{i\bullet} denote the ii-th row vector of M\bm{M}, and Mj\bm{M}_{\bullet j} the jj-th column vector of M\bm{M}. At a high level, we prove Theorem 2 by translating the upper bound on the weighted error Y^Y1,θ\|\widehat{\bm{Y}}-\bm{Y}^{*}\|_{1,\bm{\theta}}, given in (3.8) in Corollary 1, to an upper bound on the weighted misclassification rate defined in Definition 3. This is done in three steps.

which implies that D(W~W^)1D(W^W)1+D(W~W)12D(W^W)1\|\bm{D}(\widetilde{\bm{W}}-\widehat{\bm{W}})\|_{1}\leq\|\bm{D}(\widehat{\bm{W}}-\bm{W}^{*})\|_{1}+\|\bm{D}(\widetilde{\bm{W}}-\bm{W}^{*})\|_{1}\leq 2\|\bm{D}(\widehat{\bm{W}}-\bm{W}^{*})\|_{1}. Since (Ψ,X~)(\bm{\Psi}^{\ast},\widetilde{\bm{X}}) is feasible to the optimization (2.7), we have

For each j[n]j\in[n], if dj>0d_{j}>0, then it follows from the above definition that Wˇij=Yˇijdj\widecheck{W}_{ij}=\widecheck{Y}_{ij}d_{j}. Suppose dj=0d_{j}=0; because Rows(Wˇ)Rows(W^)\text{Rows}(\widecheck{\bm{W}})\subseteq\text{Rows}(\widehat{\bm{W}}) and W^=Y^D\widehat{\bm{W}}\bm{=}\widehat{\bm{Y}}\bm{D}, for each i[n]i\in[n], there exists an index i[n]i^{\prime}\in[n] such that

Putting together, we conclude that Wˇ=YˇD\widecheck{\bm{W}}=\widecheck{\bm{Y}}\bm{D}. In view of the bound (7.1) and the definitions W^:=Y^D\widehat{\bm{W}}:=\widehat{\bm{Y}}\bm{D} and W:=YD\bm{W}^{*}:=\bm{Y}^{*}\bm{D}, we get that

The bound in (7.3) is weighted by the empirical degrees d\bm{d}. Our next step is to convert this bound into one that is weighted by the population quantity f\bm{f}. Recall that Rows(Wˇ)Rows(W^)\text{Rows}(\widecheck{\bm{W}})\subseteq\text{Rows}(\widehat{\bm{W}}) and W^=Y^D\widehat{\bm{W}}\bm{=}\widehat{\bm{Y}}\bm{D}. If dj>0d_{j}>0, then for any i[n]i\in[n], there exists an ii^{\prime} such that

Since Y^\widehat{\bm{Y}} is feasible to the convex relaxation (2.6), we have 0Y^J\bm{0}\leq\widehat{\bm{Y}}\leq\bm{J}. It follows that 0YˇJ\bm{0}\leq\widecheck{\bm{Y}}\leq\bm{J} and hence YˇY1\|\widecheck{\bm{Y}}-\bm{Y}^{*}\|_{\infty}\leq 1. Setting M:=ffTddT1M:=\|\bm{f}\bm{f}^{T}-\bm{d}\bm{d}^{T}\|_{1}, we observe that any matrix Z\bm{Z} satisfies the bound

where Z1,f\|\bm{Z}\|_{1,\bm{f}} and Z1,d\|\bm{Z}\|_{1,\bm{d}} are defined in the same fashion as Z1,θ\|\bm{Z}\|_{1,\bm{\theta}} given in Definition 2. Therefore, the bound (7.3) implies that

To bound the second term above, we apply Lemma 1 to get that with probability at least 0.99,

Also note that under the F(n,r,p,q,g)\mathcal{F}(n,r,p,q,g)-model,

which implies that fi=θih,i[n]f_{i}=\theta_{i}h,\forall i\in[n] and

Combining the last three equations gives the following bound on the second term of (7.4):

We can control the first term in (7.4) using the bound (3.8) in Corollary 1. Putting together, straightforward calculation yields the inequality

Recall that diag(θ){\rm diag}(\bm{\theta}) is the n×nn\times n diagonal matrix whose diagonal entries are correspondingly the entries of θ\bm{\theta}. For each a=1,,ra=1,\ldots,r, define the set of node indices

and let S:=a=1rSaS:=\bigcup_{a=1}^{r}S_{a}. It follows from (7.6) that

Consider the set Ta:=Ca\SaT_{a}:=C_{a}^{*}\backslash S_{a} for each a=1,,ra=1,\ldots,r. There are three cases for each TaT_{a}. In the first case, Ta=T_{a}=\emptyset, and we denote by R1R_{1} the collection of all such indices aa. In the second case, TaT_{a}\neq\emptyset and Ψˇi=Ψˇj\bm{\widecheck{\Psi}}_{i\bullet}=\bm{\widecheck{\Psi}}_{j\bullet} for all i,jTai,j\in T_{a}. We say that these TaT_{a}’s are pure, and denote by R2R_{2} the collection of all such indices aa. Finally, we set R3:={1,,r}\(R1R2)R_{3}:=\{1,\ldots,r\}\backslash(R_{1}\cup R_{2}); for each aR3a\in R_{3}, we say that TaT_{a} is impure since there exist i,jTai,j\in T_{a} such that ΨˇiΨˇj\bm{\widecheck{\Psi}}_{i\bullet}\neq\bm{\widecheck{\Psi}}_{j\bullet}.

For each aR1a\in R_{1}, we have Sa=CaS_{a}=C_{a}^{*}, which implies that

For each pair a,bR2R3a,b\in R_{2}\cup R_{3} with aba\neq b, by definition we know that TaT_{a}\neq\emptyset and TbT_{b}\neq\emptyset. Then for each pair iTaCai\in T_{a}\subseteq C_{a}^{*}, jTbCbj\in T_{b}\subseteq C_{b}^{*}, we have

whence YˇiYˇj\bm{\widecheck{Y}}_{i\bullet}\neq\bm{\widecheck{Y}}_{j\bullet}. We claim that this implies ΨˇiΨˇj\bm{\widecheck{\Psi}}_{i\bullet}\neq\bm{\widecheck{\Psi}}_{j\bullet}. Suppose that this claim is not true. For each k[n]k\in[n], if dk=0d_{k}=0, then Yˇik=Yˇjk=0\widecheck{Y}_{ik}=\widecheck{Y}_{jk}=0 in view of the definition of Yˇ\widecheck{\bm{Y}} in (7.2); if dk>0d_{k}>0, then the definition Wˇ=ΨˇXˇ\widecheck{\bm{W}}=\widecheck{\bm{\Psi}}\widecheck{\bm{X}} implies that

Therefore, we have Yˇi=Yˇj\bm{\widecheck{Y}}_{i\bullet}=\bm{\widecheck{Y}}_{j\bullet}, which is a contradiction.

In conclusion, we have proved that for each pair a,bR2R3a,b\in R_{2}\cup R_{3} with aba\neq b and each pair iTai\in T_{a}, jTbj\in T_{b}, we have ΨˇiΨˇj\bm{\widecheck{\Psi}}_{i\bullet}\neq\bm{\widecheck{\Psi}}_{j\bullet}. Moreover, since for each aR2a\in R_{2}, the set TaT_{a} is pure by definition, there exists a permutation matrix ΠSr\bm{\Pi}\in\mathcal{S}_{r} such that for all iaR2Tai\in\bigcup_{a\in R_{2}}T_{a}, there holds (ΨˇΠ)i=Ψi(\widecheck{\bm{\Psi}}\bm{\Pi})_{i\bullet}=\bm{\Psi}^{*}_{i\bullet}. Recalling Definition 3, we conclude that the set (aR3Ta)S(\bigcup_{a\in R_{3}}T_{a})\bigcup S contains the misclassified node set with respect to Π\bm{\Pi}. It follows that

The matrix Ψˇ\widecheck{\bm{\Psi}} consists of at most rr distinct row vectors. Because R2R_{2} is pure and R3R_{3} is impure by definition, we have the inequality

Applying the bounds (7.9), (7.10), (7.8) and (7.7) in order, we obtain

thereby proving the inequality (3.9) in the theorem.

3 Proof of Theorem 3

Let G:=i=1nθi=θ1G:=\sum_{i=1}^{n}\theta_{i}=\|\bm{\theta}\|_{1} and we have for all 1ar1\leq a\leq r,

By the assumption (3.13) and the fact that δ<p,\delta<p,

We now turn to the proof of Theorem 3. In the proof we make use of several technical lemmas given in the Appendix.

Recall that fi=θiHa,f_{i}=\theta_{i}H_{a}, for each iCai\in C^{\ast}_{a}, a=1,,ra=1,\ldots,r. The following lemma, complementing Lemma 1, characterizes the relationship between the degrees d1,,dnd_{1},\ldots,d_{n} and the population quantities f1,,fnf_{1},\ldots,f_{n}.

With probability at least 12n21-\frac{2}{n^{2}}, for all i=1,,ni=1,\ldots,n,

If we further assume that the condition (3.13) holds with some large enough numerical constant C0C_{0}, there holds for all i=1,,ni=1,\ldots,n

We prove this lemma in Section 7.3.1 to follow.

Recall the decomposition Y=Ψ(Ψ)\bm{Y}^{*}=\bm{\Psi}^{\ast}(\bm{\Psi}^{\ast})^{\top}, where

To establish the theorem, it suffices to show for any feasible solution Y\bm{Y} with YY\bm{Y}\neq\bm{Y}^{\ast},

Let ϵ=δ10.\epsilon=\frac{\delta}{10}. We propose to decompose Δ(Y)\Delta(\bm{Y}) as

Below we establish lower bounds for the terms S1S_{1}, S2S_{2}, S3S_{3} and S4S_{4} respectively.

We plan to construct an n×nn\times n diagonal matrix D\bm{D}, such that with high probability,

Such a diagonal matrix D\bm{D} implies that with high probability,

where the step (a)(a) follows diag(Y)=diag(Y)=In{\rm diag}(\bm{Y}^{*})={\rm diag}(\bm{Y})=\bm{I}_{n}, (b)(b) follows from the definition of Λ\bm{\Lambda}, (c)(c) holds due to Y0\bm{Y}\succeq\bm{0} and Λ0\bm{\Lambda}\succeq\bm{0}, and the last equality follows because ΛΨ=0\bm{\Lambda}\bm{\Psi}^{\ast}=\bm{0}.

In what follows, we will show how to construct explicitly the diagonal matrix D\bm{D} satisfying the condition (7.16) with high probability. Notice that the condition (7.16) is equivalent to that with high probability, for all 1ar1\leq a\leq r,

The last inequality is due to the fact that λd(a)d(a)+ϵθ(a)θ(a)pθ(a)θ(a)\lambda\bm{d}_{(a)}\bm{d}_{(a)}^{\top}+\epsilon\bm{\theta}_{(a)}\bm{\theta}_{(a)}^{\top}-p\bm{\theta}_{(a)}\bm{\theta}_{(a)}^{\top} has at most one negative eigenvalue. Therefore, to establish (7.16), we only need to prove

Then Λ1=Λ11+Λ12\bm{\Lambda}_{1}=\bm{\Lambda}_{11}+\bm{\Lambda}_{12}. By Weyl’s inequality [Horn and Johnson, 2013], to prove Λ10\bm{\Lambda}_{1}\succ\bm{0}, we only need to show that

Applying Lemma 5 in the appendix, we can prove that with probability at least 11n21-\frac{1}{n^{2}},

for some numerical constant C1C_{1}. Moreover,

where aija_{ij} denotes the (i,j)(i,j)-th entry of A.\bm{A}. By Chernoff’s inequality, for each iCai\in C^{\ast}_{a}, with probability at least 11n31-\frac{1}{n^{3}},

By the bound (7.13) and the separation condition (3.12), with probability at least 12n21-\frac{2}{n^{2}} there holds the bound

Then by the fact ϵ=δ10\epsilon=\frac{\delta}{10}, the expression (7.21) and the union bound, we conclude that with probability at least 13n21-\frac{3}{n^{2}},

Therefore, to guarantee (7.19), it suffices to let

For any (i,j)Ca×Ca(i,j)\in C^{\ast}_{a}\times C^{\ast}_{a} for some a=1,,ra=1,\ldots,r, Yij=1YijY^{*}_{ij}=1\geq Y_{ij}. This implies that the matrix YwYw\bm{Y}_{w}^{*}-\bm{Y}_{w} is entrywise nonnegative, and thus

For any (i,j)Ca×Cb(i,j)\in C^{\ast}_{a}\times C^{\ast}_{b} with aba\neq b, by the bound (7.13), with probability at least 12n21-\frac{2}{n^{2}} there holds

The separation condition (3.12) implies that λ>q+δHaHb\lambda>\frac{q+\delta}{H_{a}H_{b}}, so with probability at least 12n21-\frac{2}{n^{2}},

imply that with probability at least 12n21-\frac{2}{n^{2}},

Moreover, we know for any (i,j)Ca×Cb(i,j)\in C^{\ast}_{a}\times C^{\ast}_{b} with aba\neq b, Yij=0Y^{*}_{ij}=0. Therefore, with probability at least 12n21-\frac{2}{n^{2}},

It follows from Lemma 3 in the Appendix that with probability at least 11n21-\frac{1}{n^{2}},

Notice that UU+PT(WW)\bm{U}\bm{U}^{\top}+\mathcal{P}_{T^{\perp}}\left(\frac{\bm{W}}{\|\bm{W}\|}\right) is a subgradient of X\|\bm{X}\|_{*} at X=YΘ12\bm{X}=\bm{Y}^{\ast}\circ\bm{\Theta}^{\frac{1}{2}}. Hence, we have

Below we bound the term PT(W),Θ12\|\mathcal{P}_{T}(\bm{W})\|_{\infty,\Theta^{-\frac{1}{2}}}. From the definition of PT\mathcal{P}_{T},

We bound UUW,Θ12\|\bm{U}\bm{U}^{\top}\bm{W}\|_{\infty,\bm{\Theta}^{-\frac{1}{2}}} below. Notice that (UUW)ij=0(\bm{U}\bm{U}^{\top}\bm{W})_{ij}=0 if ii and jj are from the same cluster. Thus, to bound the term (UUW)ij(\bm{U}\bm{U}^{\top}\bm{W})_{ij}, it suffices to consider the case where ii belongs to cluster kk and jj belongs to a different cluster kk^{\prime} for kkk^{\prime}\neq k, Recall CkC^{\ast}_{k} is the set of users in cluster kk. Then

which is the weighted average of independent random variables. By Bernstein’s inequality, with probability at least 1n31-n^{-3},

Then with probability at least 1n11-n^{-1},

Similarly we bound WUU,Θ12\|\bm{W}\bm{U}\bm{U}^{\top}\|_{\infty,\bm{\Theta}^{-\frac{1}{2}}} and UUWUU,Θ12\|\bm{U}\bm{U}^{\top}\bm{W}\bm{U}\bm{U}^{\top}\|_{\infty,\bm{\Theta}^{-\frac{1}{2}}}. Therefore, with probability at least 13n11-3n^{-1},

where the last inequality follows from the theorem assumption (3.13). Substituting the bounds (7.25) and (7.26) into the inequality (7.24), we get that with probability at least 14n11-4n^{-1},

Combining the bounds for SiS_{i} with i=1,2,3,4i=1,2,3,4, we conclude that with probability at least 110n11-10n^{-1},

thereby proving that Δ(Y)>0\Delta(Y)>0 for any feasible YY\bm{Y}\neq\bm{Y}^{\ast}.

3.1 Proof of Lemma 2

The equation (7.12) can be obtained straightforwardly by Chernoff’s inequality. To prove (7.13), we only need to establish for all i=1,,ni=1,\ldots,n,

Therefore, the assumption (3.13) implies that

Therefore, as long as C0C_{0} is large enough, we have

Since δ<p\delta<p, for sufficiently large C0C_{0}, we have that 12lognfi15\sqrt{\frac{12\log n}{f_{i}}}\leq\frac{1}{5}, and this implies

The bound (7.27) can then be deduced from (7.28). ∎

Acknowledgment

Y. Chen was supported by the School of Operations Research and Information Engineering at Cornell University. X. Li was supported by a startup fund from the Statistics Department at University of California, Davis. J. Xu was supported by the National Science Foundation under Grant CCF 14-09106, IIS-1447879, OIS 13-39388, and CCF 14-23088, and Strategic Research Initiative on Big-Data Analytics of the College of Engineering at the University of Illinois, DOD ONR Grant N00014-14-1-0823, and Simons Foundation Grant 328025.

Appendix A Supporting lemmas

In this section we state several additional technical lemmas concerning random matrices. These lemmas are used in the proof of our main theorems.

Recall that \circ denotes the element-wise product between matrices.

Let W\bm{W}^{\prime} denote an independent copy of W\bm{W}. Let M=(Mij)\bm{M}=(\bm{M}_{ij}) denote an n×nn\times n zero-diagonal symmetric matrix whose entries are Rademacher and independent from W\bm{W} and W\bm{W}^{\prime}. We apply the usual symmetrization arguments:

where (a)(a) follow from the Jensen’s inequality, (b)(b) follows because WW\bm{W}-\bm{W}^{\prime} has the same distribution as (WW)M(\bm{W}-\bm{W}^{\prime})\circ\bm{M}, and (c)(c) follow from the triangle inequality.

Notice that ξij\xi_{ij} has a symmetric distribution with zero mean and unit variance. Let X\bm{X} denote the random matrix with Xij=Xij=ξijbij.\bm{X}_{ij}=\bm{X}_{ij}=\xi_{ij}b_{ij}. Then one can check that WM\bm{W}\circ\bm{M} has the same distribution as X\bm{X}. Notice that X1/θmin\|\bm{X}\|_{\infty}\leq 1/\theta_{\min}. It follows from [Bandeira and van Handel, 2015+, Corollary 3.6] that there exists some absolute constant c1>0c_{1}>0 such that

Since the entries of W1/θmin\|\bm{W}\|_{\infty}\leq 1/\theta_{\min}, Talagrand’s concentration inequality for 1-Lipschitz convex functions (see, e.g., [Tao, 2012, Theorem 2.1.13]) yields the bound

for some absolute constants c2,c3c_{2},c_{3}, which implies that for any c>0c>0, there exists c>0c^{\prime}>0, such that

Consider a finite sequence {Xk}\{\bm{X}_{k}\} of independent, random, self-adjoint matrices with dimension dd. Assume that

If the norm of the total variance satisfies

then, the following inequality holds for all t0t\geq 0

Let A=(Aij)1i,jn\bm{A}=(A_{ij})_{1\leq i,j\leq n} be a symmetric random matrix whose diagonal entries are all zeros. Moreover, suppose AijA_{ij}, 1i<jn1\leq i<j\leq n are independent zero-mean random variables satisfying AijR|A_{ij}|\leq R and Var(Aij)σ2\textrm{Var}(A_{ij})\leq\sigma^{2}. Then, with probability at least 12n41-\frac{2}{n^{4}}, we have

For each pair (i,j):1i<jn(i,j):1\leq i<j\leq n, let Xij\bm{X}_{ij} be the matrix whose (i,j)(i,j) and (j,i)(j,i) entries are both AijA_{ij}, whereas other entires are zeros. Then we have

References