Concentration and regularization of random graphs

Can M. Le, Elizaveta Levina, Roman Vershynin

Introduction

Many classical and modern results in probability theory, starting from the Law of Large Numbers, can be expressed as concentration of random objects about their expectations. The objects studied most are sums of independent random variables, martingales, nice functions on product probability spaces and metric measure spaces. For a panoramic exposition of concentration phenomena in modern probability theory and related fields, the reader is referred to the books .

This paper studies concentration properties of random graphs. The first step of such study should be to decide how to interpret the statement that a random graph $G$ concentrates near its expectation. To do this, it will be useful to look at the graph $G$ through the lens of the matrices classically associated with $G$ , namely the adjacency and Laplacian matrices.

Concentration of random graphs interpreted this way, and also of general random matrices, has been studied in several communities, in particular in random matrix theory, combinatorics and network science.

We will study random graphs generated from an inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$ , where edges are formed independently with given probabilities $p_{ij}$ , see . This is a generalization of the classical Erdös-Rényi model $G(n,p)$ where all edge probabilities $p_{ij}$ equal $p$ . Many popular graph models arise as special cases of $G(n,(p_{ij}))$ , such as the stochastic block model, a benchmark model in the analysis of networks discussed in Section 1.7, and random subgraphs of given graphs.

Often, the question of interest is estimating some features of the probability matrix $(p_{ij})$ from random graphs drawn from $G(n,(p_{ij}))$ . Concentration of adjacency matrix and Laplacian matrix around their expectations, when it holds, guarantees that such features can be recovered. As an example of this use of our concentration results, we will show that if $(p_{ij})$ has a block structure, the blocks can be accurately estimated from a single realization of $G(n,(p_{ij}))$ even when the average vertex degree is bounded.

The cleanest concentration results are available for the classical Erdös-Rényi model $G(n,p)$ in the dense regime. In terms of the expected degree $d=pn$ , we have with high probability that

The lower bound on density in (1.1) can be essentially relaxed all the way down to $d=\Omega(\log n)$ . Thus, with high probability we have

This result was proved in based on the method developed by J. Kahn and E. Szemeredi . More generally, (1.2) holds for any inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$ with maximal expected degree $d=\max_{i}\sum_{j}p_{ij}$ . This generalization can be deduced from a recent result of S. Bandeira and R. van Handel [4, Corollary 3.6], while a weaker bound $O(\sqrt{d\log n})$ follows from concentration inequalities for sums of independent random matrices . Alternatively, an argument in can be used to prove (1.2) for a somewhat larger but still useful value

see . The same can be obtained by using Seginer’s bound on random matrices . As we will see shortly, our paper provides an alternative and completely different approach to general concentration results like (1.2).

2. Sparse graphs do not concentrate

In the sparse regime, where the expected degree $d$ is bounded, concentration breaks down. According to , a random graph from $G(n,p)$ satisfies with high probability that

What exactly makes the norm $A$ abnormally large in the sparse regime? The answer is, the vertices with too high degrees. In the dense case where $d\gg\log n$ , all vertices typically have approximately the same degrees $(1+o(1))d$ . This no longer happens in the sparser regime $d\ll\log n$ ; the degrees do not cluster tightly about the same value anymore. There are vertices with too high degrees; they are captured by the second inequality in (1.4). Even a single high-degree vertex can blow up the norm of the adjacency matrix. Indeed, since the norm of $A$ is bounded below by the Euclidean norm of each of its rows, we have $\|A\|\geq\sqrt{d(A)}$ .

3. Regularization enforces concentration

If high-degree vertices destroy concentration, can we “tame” these vertices? One proposal would be to remove these vertices from the graph altogether. U. Feige and E. Ofek showed that this works for $G(n,p)$ – the removal of the high degree vertices enforces concentration. Indeed, if we drop all vertices with degrees, say, larger than $2d$ , the the remaining part of the graph satisfies

with high probability, where $A^{\prime}$ denotes the adjacency matrix of the new graph. The argument in is based on the method developed by J. Kahn and E. Szemeredi . It extends to the inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$ with $d$ defined in (1.3), see . As we will see, our paper provides an alternative and completely different approach to such results.

Although the removal of high degree vertices solves the concentration problem, such solution is not ideal, since those vertices are in some sense the most important ones. In real-world networks, the vertices with highest degrees are “hubs” that hold the network together. Their removal would cause the network to break down into disconnected components, which leads to a considerable loss of structural information.

Would it be possible to regularize the graph in a more gentle way – instead of removing the high-degree vertices, reduce the weights of their edges just enough to keep the degrees bounded by $O(d)$ ? The main result of our paper states that this is true. Let us first state this result informally; Theorem 2.1 provides a more general and formal statement.

Consider a random graph from the inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$ , and let $d=\max_{ij}np_{ij}$ . For all high degree vertices of the graph (say, those with degrees larger than $2d$ ), reduce the weights of the edges incident to them in an arbitrary way, but so that all degrees of the new (weighted) graph become bounded by $2d$ . Then, with high probability, the adjacency matrix $A^{\prime}$ of the new graph concentrates:

4. Examples of graph regularization

The regularization procedure in Theorem 1.1 is very flexible. Depending on how one chooses the weights, one can obtain as partial cases several results we summarized earlier, as well as some new ones.

Remove all high degree vertices. If we remove all vertices with degrees larger than $2d$ , we recover another result of U. Feige and E. Ofek (1.5), which states that the removal of the high degree vertices enforces concentration.

Remove just enough edges from high-degree vertices. Instead of removing the high-degree vertices with all of their edges, we can remove just enough edges to make all degrees bounded by $2d$ . This milder regularization still produces the concentration bound (1.5).

5. Concentration of Laplacian

So far, we have looked at random graphs through the lens of their adjacency matrices. A different matrix that captures the geometry of a graph is the (symmmetric, normalized) Laplacian matrix, defined as

Here $I$ is the identity matrix and $D=\operatorname{diag}(d_{i})$ is the diagonal matrix with degrees $d_{i}=\sum_{j=1}^{n}A_{ij}$ on the diagonal. The reader is referred to for an introduction to graph Laplacians and their role in spectral graph theory. Here we mention just two basic facts: the spectrum of $\mathcal{L}(A)$ is a subset of $$, and the smallest eigenvalue is always zero.

Concentration of Laplacians of random graphs has been studied in . Just like the adjacency matrix, the Laplacian is known to concentrate in the dense regime where $d=\Omega(\log n)$ , and it fails to concentrate in the sparse regime. However, the obstructions to concentration are opposite. For the adjacency matrices, as we mentioned, the trouble is caused by high-degree vertices. For the Laplacian, the problem lies with low-degree vertices. In particular, for $d=o(\log n)$ the graph is likely to have isolated vertices; they produce multiple zero eigenvalues of $\mathcal{L}(A)$ , which are easily seen to destroy the concentration.

In analogy to our discussion of adjacency matrices, we can try to regularize the graph to “tame” the low-degree vertices in various ways, for example remove the low-degree vertices, connect them to some other vertices, artificially increase the degrees $d_{i}$ in the definition (1.6) of Laplacian, and so on. Here we will focus on the following simple way of regularization proposed in and analyzed in . Choose $\tau>0$ and add the same number $\tau/n$ to all entries of the adjacency matrix $A$ , thereby replacing it with

in the definition (1.6) of the Laplacian. This regularization raises all degrees $d_{i}$ to $d_{i}+\tau$ . If we choose $\tau\sim d$ , the regularized graph does not have low-degree vertices anymore.

The following consequence of Theorem 1.1 shows that such regularization indeed forces Laplacian to concentrate. Here we state this result informally; Theorem 4.1 provides a more formal statement.

Consider a random graph from the inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$ , and let $d=\max_{ij}np_{ij}$ . Choose a number $\tau\sim d$ . Then, with high probability, the regularized Laplacian $\mathcal{L}(A_{\tau})$ concentrates:

We will deduce this result from Theorem 1.1 in Section 4. Theorem 1.2 is an improvement upon a bound in that had an extra $\log d$ factor, and it was conjectured there that the logarithmic factor is not needed. Theorem 1.2 confirms this conjecture.

6. A numerical experiment

To conclude our discussion of various ways to regularize sparse graphs, let us illustrate the effect of regularization by a numerical experiment. Consider an inhomogeneous Erdös-Rényi graph with $n=1000$ vertices, $90\%$ of which have expected degrees $7$ and $10\%$ percent have expected degrees $35$ . We then regularize the graph by reducing the weights of edges proportionally to the excess of degrees – just as we described in Section 1.4 item 4, except that we use the overall average degree (approximately $10$ ) instead of $d$ (which results in a more severe weight reduction suitable for our illustration purpose).

Figure 1 shows the histogram of the spectrum of $A$ (left) and $A^{\prime}$ (right). As we can see, the high degree vertices lead to the long tails in the histogram of the eigenvalues, and regularization shrinks these tails toward the bulk.

7. Application: community detection in networks

Concentration of random graphs has an important application to statistical analysis of networks, in particular to the problem of community detection. A common way of modeling communities in networks is the stochastic block model , which is a special case of the inhomogeneous Erdös-Rényi model considered in this paper. For the purpose of this example, we focus on the simplest version of the stochastic block model $G(n,\frac{a}{n},\frac{b}{n})$ , also known as the balanced planted partition model, defined as follows. The set of vertices is divided into two subsets (communities) of size $n/2$ each. Edges between vertices are drawn independently with probability $a/n$ if they are in the same community and with probability $b/n$ otherwise.

The community detection problem is to recover the community labels of vertices from a single realization of the random graph model. A large literature exists on both the recovery algorithms and the theory establishing when a recovery is possible . There are methods that perform better than a random guess (i.e. the fraction of misclassified vertices is bounded away from $0.5$ as $n\to\infty$ with high probability) under the condition

and no method can perform better than a random guess if this condition is violated.

Moreover, strong consistency, or exact recovery (labeling all vertices correctly with high probability) is possible when the expected degree $(a+b)/2$ is of order $\log n$ or larger and $a$ and $b$ are sufficiently separated, see . Weak consistency (the fraction of mislabeled vertices going to 0 with high probability) is achievable if and only if

see . Many of these results hold in the non-asymptotic regime, for graphs of fixed size $n$ . Thus, for any $\varepsilon>0$ there exists $C_{\varepsilon}$ (which only depends on $\varepsilon$ ) such that one can recover communities up to $\varepsilon n$ mislabeled vertices as long as

In particular, recovery of communities is possible even for very sparse graphs – those with bounded expected degrees. Several types of algorithms are known to succeed in this regume, including non-backtracking walks , spectral methods and methods based on semidefinite programming .

As an application of the new concentration results, we show that the regularized spectral clustering , one of the simplest most popular algorithms for community detection, can recover communities in the sparse regime. In general, spectral clustering works by computing the leading eigenvectors of either the adjacency matrix or the Laplacian or their regularized versions, and running the $k$ -means clustering algorithm on these eigenvectors to recover the node labels. In the simple case of the model $G(n,\frac{a}{n},\frac{b}{n})$ , one can simply assign nodes to communities based on the sign (positive or negative) of the corresponding entries of the eigenvector $v_{2}(\mathcal{L}(A_{\tau}))$ corresponding to the second smallest eigenvalue of regularized Laplacian matrix $\mathcal{L}(A_{\tau})$ (or the regularized adjacency matrix $A^{\prime}$ ).

Before stating our result, let us quote a simple version of the Davis-Kahan theorem perturbation theorem (see e.g. [5, Theorem VII.3.2]).

Let $X,Y$ be symmetric matrices such that the second smallest eigenvalues of $X$ and $Y$ have multiplicity one and they are of distance at least $\delta>0$ from the remaining eigenvalues of $X$ and $Y$ . Denote by $x$ and $y$ the eigenvectors of $X$ and $Y$ corresponding to the second largest eigenvalues of $X$ and $Y$ , respectively. Then

Let $\varepsilon>0$ and $r\geq 1$ . Let $A$ be the adjacency matrix drawn from the stochastic block model $G(n,\frac{a}{n},\frac{b}{n})$ . Assume that

where $C_{\varepsilon}=Cr^{4}\varepsilon^{-2}$ and $C$ is an appropriately large absolute constant. Choose $\tau$ to be the average degree of the graph, i.e. $\tau=(d_{1}+\cdots+d_{n})/n$ where $d_{i}$ are vertex degrees. Then with probability at least $1-e^{-r}$ , we have

In particular, the signs of the entires of $v_{2}(\mathcal{L}(A_{\tau}))$ correctly estimate the partition into the two communities, up to at most $\varepsilon n$ misclassified vertices.

8. Organization of the paper

In Section 2, we state a formal version of Theorem 1.1. We show there how to deduce this result from a new decomposition of random graphs, which we state as Theorem 2.6. We prove this decomposition theorem in Section 3. In Section 4, we state and prove a formal version of Theorem 1.2 about the concentration of the Laplacian. We conclude the paper with Section 5 where we propose some questions for further investigation.

Acknowledgement

The authors are grateful to Ramon van Handel for several insightful comments on the preliminary version of this paper.

Full version of Theorem 1.1, and reduction to a graph decomposition

In this section we state a more general and quantitative version of Theorem 1.1, and we reduce it to a new form of graph decomposition, which can be of interest on its own.

Consider a random graph from the inhomogeneous Erdös-Rényi model $G(n,(p_{ij}))$ , and let $d=\max_{ij}np_{ij}$ . For any $r\geq 1$ , the following holds with probability at least $1-n^{-r}$ . Consider any subset consisting of at most $10n/d$ vertices, and reduce the weights of the edges incident to those vertices in an arbitrary way. Let $d^{\prime}$ be the maximal degree of the resulting graph. Then the adjacency matrix $A^{\prime}$ of the new (weighted) graph satisfies

In this result and in the rest of the paper, $C,C_{1},C_{2},\ldots$ denote absolute constants whose values may be different from line to line.

The subset of $10n/d$ vertices in Theorem 2.1 can be completely arbitrary. So let us choose the high-degree vertices, say those with degrees larger than $2d$ . There are at most $10n/d$ such vertices with high probability; this follows by an easy calculation, and also from Lemma 3.5. Thus we immediately deduce Theorem 1.1.

If we do not reduce weights of any edges and $d$ is bounded, then the upper bound in Theorem 2.1 is tight (up to a constant depending on $r$ ). This is because of (1.4), which states that the adjacency matrix does not concentrate in the sparse regime without regularization.

One may wonder if Theorem 2.1 can be proved by developing an $\epsilon$ -net argument similar to the method of J. Kahn and E. Szemeredi and its versions . Although we can not rule out such possibility, we do not see how this method could handle a general regularization. The reader familiar with the method can easily notice an obstacle. The contribution of the so-called light couples becomes hard to control when one changes, and even reduces, the individual entries of $A$ (the weights of edges).

We will develop an alternative and somewhat simpler approach, which will be able to handle a general regularization of random graphs. It sheds light on the specific structure of graphs that enables concentration. We are going to identify this structure through a graph decomposition in the next section. But let us pause briefly to mention the following useful reduction.

In other words, we may prove Theorem 2.1 for directed inhomogeneous Erdös-Rényi graphs, where edges between any vertices and in any direction appear indepednently with probabilities $p_{ij}$ . In the rest of the argument, we will only work with such random directed graphs.

In this section, we reduce Theorem 2.1 to the following decomposition of inhomogeneous Erdös-Rényi directed random graphs. This decomposition may have an independent interest. Throughout the paper, we denote by $B_{\mathcal{N}}$ the matrix which coincides with a matrix $B$ on a subset of edges $\mathcal{N}\subset[n]\times[n]$ and has zero entries elsewhere.

Consider a random directed graph from the inhomogeneous Erdös-Rényi model, and let $d$ be as in (1.3). For any $r\geq 1$ , the following holds with probability at least $1-3n^{-r}$ . One can decompose the set of edges $[n]\times[n]$ into three classes $\mathcal{N}$ , $\mathcal{R}$ and $\mathcal{C}$ so that the following properties are satisfied for the adjacency matrix $A$ .

Each row of $A_{\mathcal{R}}$ and each column of $A_{\mathcal{C}}$ has at most $32r$ ones.

Moreover, $\mathcal{R}$ intersects at most $n/d$ columns, and $\mathcal{C}$ intersects at most $n/d$ rows of $[n]\times[n]$ .

Figure 2 illustrates a possible decomposition Theorem 2.6 can provide. The edges in $\mathcal{N}$ form a big “core” where the graph concentrates well even without regularization. The edges in $\mathcal{R}$ and $\mathcal{C}$ can be thought of (at least heuristically) as those attached to high-degree vertices.

We will prove Theorem 2.6 in Section 3; let us pause to deduce Theorem 2.1 from it.

2. Deduction of Theorem 2.1

To make this argument rigorous, let us start with the simple calculation we just mentioned.

Let $x$ be a vector with $\|x\|_{2}=1$ . Using Cauchy-Schwarz inequality and the assumptions, we have

Since $x$ is arbitrary, this completes the proof. ∎

We are ready to formally deduce the main part of Theorem 2.1 from Theorem 2.6; we defer the “moreover” part to Section 3.6.

We will bound the spectral norm of each of the three terms separately.

Step 1. Deviation on $\mathcal{N}$ . Let us further decompose

Further, a simple restriction property implies that

Indeed, restricting a matrix onto a product subset of $[n]\times[n]$ can only reduce its norm. Although the set of reweighted edges $\mathcal{E}$ is not a product subset, it can be decomposed into two product subsets:

where $I$ is the subset of vertices incident to the edges in $\mathcal{E}$ . Then (2.4) holds; right hand side of that inequality is bounded by $Cr^{3/2}\sqrt{d}$ by Theorem 2.6. Thus we handled the first term in (2.3).

To bound the second term in (2.3), we can use another restriction property that states that the norm of the matrix with non-negative entries can only reduce by restricting onto any subset of $[n]\times[n]$ (whether a product subset or not). This yields

so we handled the second term in (2.3). Recalling that the first term there is bounded by $Cr^{3/2}\sqrt{d}$ , we conclude that $\|(A-A^{\prime})_{\mathcal{N}}\|\leq 2Cr^{3/2}\sqrt{d}$ .

Returning to (2.2), we recall that the first term in the right hand is bounded by $Cr^{3/2}\sqrt{d}$ , and we just bounded the second term by $2Cr^{3/2}\sqrt{d}$ . Hence

Step 2. Deviation on $\mathcal{R}$ and $\mathcal{C}$ . By triangle inequality, we have

Simplifying this inequality, we complete the proof of the main part of Theorem 2.1. ∎

Proof of Decomposition Theorem 2.6

We will construct the decomposition in Theorem 2.6 by an iterative procedure. The first and crucial step is to find a big blockIn this paper, by block we mean a product set $I\times J$ with arbitrary index subsets $I,J\subset[n]$ . These subsets are not required to be intervals of successive integers. $\mathcal{N}^{\prime}\subset[n]\times[n]$ of size at least $(n-n/d)\times n/2$ on which $A$ concentrates, i.e.

Repeating this argument for the transpose, we obtain another block $\mathcal{N}^{\prime\prime}$ of size at least $n/2\times(n-n/d)$ where the graph concentrates as well. So the graph concentrates on $\mathcal{N}_{0}:=\mathcal{N}^{\prime}\cup\mathcal{N}^{\prime\prime}$ . The “core” $\mathcal{N}_{0}$ will form the first part of the class $\mathcal{N}$ we are constructing.

It remains to control the graph on the complement of $\mathcal{N}_{0}$ . That set of edges is quite small; it can be described as a union of a block $\mathcal{C}_{0}$ with $n/d$ rows, a block $\mathcal{R}_{0}$ with $n/d$ columns and an exceptional $n/2\times n/2$ block; see Figure 3(b) for illustration. We may consider $\mathcal{C}_{0}$ and $\mathcal{R}_{0}$ as the first parts of the future classes $\mathcal{C}$ and $\mathcal{R}$ we are constructing.

Indeed, since $\mathcal{C}_{0}$ has so few rows, the expected number of ones in each column of $\mathcal{C}_{0}$ is bounded by $1$ . For simplicity, let us think that all columns of $\mathcal{C}_{0}$ have $O(1)$ ones as desired. (In the formal argument, we will add the bad columns to the exceptional block.) Of course, the block $\mathcal{R}_{0}$ can be handled similarly.

At this point, we decomposed $[n]\times[n]$ into $\mathcal{N}_{0}$ , $\mathcal{R}_{0}$ , $\mathcal{C}_{0}$ and an exceptional $n/2\times n/2$ block. Now we repeat the process for the exceptional block, constructing $\mathcal{N}_{1}$ , $\mathcal{R}_{1}$ , and $\mathcal{C}_{1}$ there, and so on. Figure 3(c) illustrates this process. At the end, we choose $\mathcal{N}$ , $\mathcal{R}$ and $\mathcal{C}$ to be the unions of the blocks $\mathcal{N}_{i}$ , $\mathcal{R}_{i}$ and $\mathcal{C}_{i}$ respectively.

Two precautions have to be taken in this argument. First, we need to make concentration on the core blocks $\mathcal{N}_{i}$ better at each step, so that the sum of those error bounds would not depend of the total number of steps. This can be done with little effort, with the help of the exponential decrease of the size of the blocks $\mathcal{N}_{i}$ . Second, we have a control of the sizes but not locations of the exceptional blocks. Thus to be able to carry out the decomposition argument inside an exceptional block, we need to make the argument valid uniformly over all blocks of that size. This will require us to be delicate with probabilistic arguments, so we can take a union bound over such blocks.

2. Grothendieck-Pietsch factorization

As we mentioned in the previous section, our proof of Theorem 2.6 is based on Grothendieck-Pietsch factorization. This general and well known result in functional analysis has already been used in a similar probabilistic context, see [26, Proposition 15.11].

To compare the two norms, one can start with the obvious inequality

Both parts of this inequality are optimal, so there is an unavoidable slack between the upper and lower bounds. However, Grothendieck-Pietsch factorization allows us to tighten the inequality by changing $B$ sightly. The next two results offer two ways to change $B$ – introduce weights and pass to a sub-matrix.

Let $B$ be a $k\times m$ real matrix. Then there exist positive weights $\mu_{j}$ with $\sum_{j=1}^{m}\mu_{j}=1$ such that

where $D_{\mu}=\operatorname{diag}(\mu_{j})$ denotes the $m\times m$ diagonal matrix with weights $\mu_{j}$ on the diagonal.

This result is a known combination of the Little Grothendieck Theorem (see [41, Corollary 10.10] and ) and Pietsch Factorization (see [41, Theorem 9.2]). In an explicit form, a version of this result can be found e.g. in [26, Proposition 15.11]. The weights $\mu_{j}$ can be computed algorithmically, see .

The following related version of Grothendieck-Pietsch’s factorization can be especially useful in probabilistic contexts, see [26, Proposition 15.11]. Here and in the rest of the paper, we denote by $B_{I\times J}$ the sub-matrix of a matrix $B$ with rows indexed by a subset $I$ and columns indexed by a subset $J$ .

Let $B$ be a $k\times m$ real matrix and $\delta>0$ . Then there exists $J\subseteq[m]$ with $|J|\geq(1-\delta)m$ such that

Consider the weights $\mu_{i}$ given by Theorem 3.1, and choose $J$ to consist of the indices $j$ satisfying $\mu_{j}\leq 1/(\delta m)$ . Since $\sum_{j}\mu_{j}=1$ , the set $J$ must contain at least $(1-\delta)m$ indices as claimed. Furthermore, the diagonal entries of $(D_{\mu}^{-1/2})_{J\times J}$ are all bounded from below by $\sqrt{\delta m}$ , which yields

On the other hand, by (3.1) the left-hand side of this inequality is bounded by $2\|B\|_{\infty\rightarrow 2}$ . Rearranging the terms, we complete the proof. ∎

3. Concentration on a big block

The lemmas of this and next section should be best understood for $m=n$ , $I=J=[n]$ and $\alpha=1$ . In this case, we are working with the entire adjacency matrix, and trying to make the first step in the iterative procedure. The further steps will require us to handle smaller blocks $I\times J$ ; the parameter $\alpha$ will then become smaller in order to achieve better concentration for smaller blocks.

Let $1\leq m\leq n$ and $\alpha\geq m/n$ . Then for $r\geq 1$ the following holds with probability at least $1-n^{-r}$ . Consider a block $I\times J$ of size $m\times m$ . Let $I^{\prime}$ be the set of indices of the rows of $A_{I\times J}$ that contain at most $\alpha d$ ones. Then

A more useful bond will follow from Bernstein’s inequality. Indeed, $X_{i}$ is a sum of $m$ independent random variables with zero means and variances at most $d/n$ . By Bernstein’s inequality, for any $t>0$ we have

We are now ready to bound the right-hand side of (3.3). By (3.6), the random variable $X_{i}\xi_{i}$ is sub-gaussianFor definitions and basic facts about sub-gaussian random variables, see e.g. . with sub-gaussian norm at most $\sqrt{\alpha d}$ . It follows that $(X_{i}\xi_{i})^{2}$ is sub-exponential with sub-exponential norm at most $2\alpha d$ . Using Bernstein’s inequality for sub-exponential random variables (see Corrollary 5.17 in ), we have

Choosing $\varepsilon:=(10/c)r\log(en/m)$ , we bound this probability by $(en/m)^{-5rm}$ .

Summarizing, we have proved that for fixed $I,J\subseteq[n]$ and $x\in\{-1,1\}^{m}$ , with probability at least $1-(en/m)^{-5rm}$ , the following holds:

Taking a union bound over all possibilities of $m,I,J,x$ and using (3.3), (3.8), we see that the conclusion of the lemma holds with probability at least

Applying Lemma 3.3 followed by Grothendieck-Piesch factorization (Theorem 3.2), we obtain the following.

4. Restricted degrees

The two simple lemmas of this section will help us to handle the part of the adjacency matrix outside the core block constructed in Lemma 3.4. First, we show that almost all rows have at most $O(\alpha d)$ ones, and thus are included in the core block.

Let $1\leq m\leq n$ and $\alpha\geq\sqrt{m/n}$ . Then for $r\geq 1$ the following holds with probability at least $1-n^{-r}$ . Consider a block $I\times J$ of size $m\times m$ . Then all but $m/\alpha d$ rows of $A_{I\times J}$ have at most $8r\alpha d$ ones.

Let $S$ be the number of rows $i$ with $d_{i}>8r\alpha d$ . Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$ . Again, Chernoff’s inequality implies

The second inequality here holds since $e\alpha d\leq p^{-1/2}$ . (To see this, notice that the definition of $p$ and assumption on $\alpha$ imply that $p^{-1/2}=(2\alpha n/m)^{4r\alpha d}\geq 2^{4r\alpha d}$ .)

It remains to take a union bound over all blocks $I\times J$ . We obtain that the conclusion of the lemma holds with probability at least

In the last inequality we used the assumption that $\alpha\geq\sqrt{m/n}$ . The proof is complete. ∎

Next, we handle the block of rows that do have too many ones. We show that most columns of this block have $O(1)$ ones.

Let $1\leq m\leq n$ and $\alpha\geq\sqrt{m/n}$ . Then for $r\geq 1$ the following holds with probability at least $1-n^{-r}$ . Consider a block $I\times J$ of size $k\times m$ with some $k\leq m/\alpha d$ . Then all but $m/4$ columns of $A_{I\times J}$ have at most $32r$ ones.

Let $S$ be the number of columns $j$ with $d_{j}>32r$ . Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$ . Again, Chernoff’s inequality implies

The second inequality here holds since $4e<p^{1/2}$ , which in turn follows by assumption on $\alpha$ .

It remains to take a union bound over all blocks $I\times J$ . It is enough to consider the blocks with largest possible number of columns, thus with $k=\lceil m/\alpha d\rceil$ . We obtain that the conclusion of the lemma holds with probability at least

In the last inequality we used the assumption that $\alpha\geq\sqrt{m/n}$ . The proof is complete. ∎

5. Iterative decomposition: proof of Theorem 2.1

Finally, we combine the tools we developed so far, and we construct an iterative decomposition of the adjacency matrix the way we outline in Section 3.1. Let us set up one step of this procedure, where we consider an $m\times m$ block and decompose almost all of it (everything except an $m/2\times m/2$ block) into classes $\mathcal{N}$ , $\mathcal{R}$ and $\mathcal{C}$ satisfying the conclusion of Theorem 2.6. Once we can do this, we repeat the procedure for the $m/2\times m/2$ block, etc.

Let $1\leq m\leq n$ and $\alpha\geq\sqrt{m/n}$ . Then for $r\geq 1$ the following holds with probability at least $1-3n^{-r}$ . Consider a block $I\times J$ of size $m\times m$ . Then there exists an exceptional sub-block $I_{1}\times J_{1}$ with dimensions at most $m/2\times m/2$ such that the remaining part of the block, that is $(I\times J)\setminus(I_{1}\times J_{1})$ , can be decomposed into three classes $\mathcal{N}$ , $\mathcal{R}\subset(I\setminus I_{1})\times J$ and $\mathcal{C}\subset I\times(J\setminus J_{1})$ so that the following holds.

Each row of $A_{\mathcal{R}}$ and each column of $A_{\mathcal{C}}$ has at most $32r$ ones.

Moreover, $\mathcal{R}$ intersects at most $n/\alpha d$ columns and $\mathcal{C}$ intersects at most $n/\alpha d$ rows of $I\times J$ .

After a permutation of rows and columns, a decomposition of the block stated in Lemma 3.7 can be visualized in Figure 4(c).

Since we are going to use Lemmas 3.4, 3.5 and 3.6, let us fix realization of the random graph that satisfies the conclusion of those three lemmas.

By Lemma 3.5, all but $m/\alpha d$ rows of $A_{I\times J}$ have at most $8r\alpha d$ ones; let us denote by $I^{\prime}$ the set of indices of those rows with at most $8r\alpha d$ ones. Then we can use Lemma 3.4 for the block $I^{\prime}\times J$ and with $\alpha$ replaced by $8r\alpha$ ; the choice of $I^{\prime}$ ensures that all rows have small numbers of ones, as required in that lemma. To control the rows outside $I^{\prime}$ , we may use Lemma 3.6 for $(I\setminus I^{\prime})\times J$ ; as we already noted, this block has at most $m/\alpha d$ rows as required in that lemma. Intersecting the good sets of columns produced by those two lemmas, we obtain a set of at most $m/2$ exceptional columns $J_{1}\subset J$ such that the following holds.

For the block $\mathcal{C}:=(I\setminus I^{\prime})\times(J\setminus J_{1})$ , all columns of $A_{\mathcal{C}}$ have at most $32r$ ones.

Figure 4(a) illustrates the decomposition of the block $I\times J$ into the set of exceptional columns indexed by $J_{1}$ and good sets $\mathcal{N}_{1}$ and $\mathcal{C}$ .

To finish the proof, we apply the above argument to the transpose $A^{\mathsf{T}}$ on the block $J\times I$ . To be precise, we start with the set $J^{\prime}\subset J$ of all but $m/\alpha d$ small columns of $A_{I\times J}$ (those with at most $8r\alpha d$ ones); then we obtain an exceptional set $I_{1}\subset I$ of at most $m/2$ rows; and finally we conclude that $A$ concentrates on the block $\mathcal{N}_{2}:=(I\setminus I_{1})\times J^{\prime}$ and has small rows on the block $\mathcal{R}:=(I\setminus I_{1})\times(J\setminus J^{\prime})$ . Figure 4(b) illustrates this decomposition.

It only remains to combine the decompositions for $A$ and $A^{\mathsf{T}}$ ; Figure 4(c) illustrates a result of the combination. Once we define $\mathcal{N}:=\mathcal{N}_{1}\cup\mathcal{N}_{2}$ , it becomes clear that $\mathcal{N}$ , $\mathcal{R}$ and $\mathcal{C}$ have the required properties.It may happen that an entry ends up in more than one class $\mathcal{N}$ , $\mathcal{R}$ and $\mathcal{C}$ . In such cases, we split the tie arbitrarily. ∎

Let us fix a realization of the random graph that satisfies the conclusion of Lemma 3.7. Applying that lemma for $m=n$ and with $\alpha=1$ , we decompose the set of edges $[n]\times[n]$ into three classes $\mathcal{N}_{0}$ , $\mathcal{C}_{0}$ and $\mathcal{R}_{0}$ plus an $n/2\times n/2$ exceptional block $I_{1}\times J_{1}$ . Apply Lemma 3.7 again, this time for the block $I_{1}\times J_{1}$ , for $m=n/2$ and with $\alpha=\sqrt{1/2}$ . We decompose $I_{1}\times J_{1}$ into $\mathcal{N}_{1}$ , $\mathcal{C}_{1}$ and $\mathcal{R}_{1}$ plus an $n/4\times n/4$ exceptional block $I_{2}\times J_{2}$ .

Repeat this process for $\alpha=\sqrt{m/n}$ where $m$ is the running size of the block; we halve this size at each step, and so we have $\alpha_{i}\leq 2^{-i/2}$ . Figure 3(c) illustrates a decomposition that we may obtain this way. In a finite number of steps (actually, in $O(\log n)$ steps) the exceptional block becomes empty, and the process terminates. At that point we have decomposed the set of edges $[n]\times[n]$ into $\mathcal{N}$ , $\mathcal{R}$ and $\mathcal{C}$ , defined as the union of $\mathcal{N}_{i}$ , $\mathcal{C}_{i}$ and $\mathcal{R}_{i}$ respectively, which we obtained at each step. It is clear that $\mathcal{R}$ and $\mathcal{C}$ satisfy the required properties.

It remains to bound the deviation of $A$ on $\mathcal{N}$ . By construction, $\mathcal{N}_{i}$ satisfies

In the second inequality we used that $\alpha_{i}\leq 2^{-i/2}$ , which forces the series to converge. The proof of Theorem 2.6 is complete. ∎

This strengthening is in fact easy to check. To do so, note that the definition of $d^{\prime}$ was used only once in the proof of Theorem 2.1, namely in Step 2 where we bounded the norms of $A^{\prime}_{\mathcal{R}}$ and $A^{\prime}_{\mathcal{C}}$ . Thus, to obtain the strengthening, it is enough to replace the application of Lemma 2.7 there by the following lemma.

To prove the claim, let $x$ be a vector with $\|x\|_{2}=1$ . Using Cauchy-Schwarz inequality and the assumptions, we have

Since $x$ is arbitrary, this completes the proof. ∎

Concentration of the regularized Laplacian

In this section, we state the following formal version of Theorem 1.2, and we deduce it from concentration of adjacency matrices (Theorem 2.1).

Consider a random graph from the inhomogeneous Erdös-Rényi model, and let $d$ be as in (1.3). Choose a number $\tau>0$ . Then, for any $r\geq 1$ , with probability at least $1-e^{-r}$ one has

Two sources contribute to the deviation of Laplacian – the deviation of the adjacency matrix and the deviation of the degrees. Let us separate and bound them individually.

Here $D_{\tau}=\operatorname{diag}(d_{i}+\tau)$ and $\bar{D}_{\tau}=\operatorname{diag}(\bar{d}_{i}+\tau)$ are the diagonal matrices with degrees of $A_{\tau}$ and $\bar{A}_{\tau}$ on the diagonal, respectively. Using the fact that $A_{\tau}-\bar{A}_{\tau}=A-\bar{A}$ , we can represent the deviation as

Step 2. Bounding $S$ . Let us introduce a diagonal matrix $\Delta$ that should be easier to work with than $D_{\tau}$ . Set $\Delta_{ii}=1$ if $d_{i}\leq 8rd$ and $\Delta_{ii}=d_{i}/\tau+1$ otherwise. Then entries of $\tau\Delta$ are upper bounded by the corresponding entries of $D_{\tau}$ , and so

Here in the first inequality we used that $\Delta_{jj}\geq 1$ and $\sum_{j}A_{ij}=d_{i}$ ; the second inequality follows by definition of $\Delta_{ii}$ . Applying Theorem 2.1, we obtain with probability $1-n^{-r}$ that

To bound $R_{2}$ , we note that by construction of $\Delta$ , the matrices $\bar{A}$ and $\Delta^{-1/2}\bar{A}\Delta^{-1/2}$ coincide on the block $I\times I$ , where $I$ is the set of vertices satisfying $d_{i}\leq 8rd$ . This block is very large – indeed, Lemma 3.5 implies that $|I^{c}|\leq n/d$ with probability $1-n^{-r}$ . Outside this block, i.e. on the small blocks $I^{c}\times[n]$ and $[n]\times I^{c}$ , the entries of $\bar{A}-\Delta^{-1/2}\bar{A}\Delta^{-1/2}$ are bounded by the corresponding entries of $\bar{A}$ , which are all bounded by $d/n$ . Thus, using Lemma 2.7, we have

Substituting the bounds for $R_{1}$ and $R_{2}$ into (4.1), we conclude that

Step 3. Bounding $T$ . Bounding the spectral norm by the Hilbert-Schmidt norm, we get

and $\delta_{ij}=(d_{i}+\tau)(d_{j}+\tau)$ and $\bar{\delta}_{ij}=(\bar{d}_{i}+\tau)(\bar{d}_{j}+\tau)$ . To bound $T_{ij}$ , we note that

Recalling the definition of $\delta_{ij}$ and $\bar{\delta}_{ij}$ and adding and subtracting $(d_{i}+\tau)(\bar{d}_{j}+\tau)$ , we have

So, using the inequality $(a+b)^{2}\leq 2(a^{2}+b^{2})$ and bounding $\bar{d}_{j}+\tau$ by $d+\tau$ , we obtain

Substituting this bound and (4.3) into (4.2) we conclude that

It remains to substitute the bounds for $S$ and $T$ into the inequality $\|E\|\leq\|S\|+\|T\|$ , and simplify the expression. The resulting bound holds with probability at least $1-n^{-r}-n^{-r}-e^{-2r}\geq 1-e^{-r}$ , as claimed. ∎

Further questions

The main point of our paper was that regularization helps sparse graphs to concentrate. We have discussed several kinds of regularization in Section 1.4 and mentioned some more in Section 1.4. We found that any meaningful regularization works, as long as it reduces the too high degrees and increases the too low degrees. Is there an optimal way to regularize a graph? Designing the best “preprocessing” of sparse graphs for spectral algorithms is especially interesting from the applied perspective, i.e. for real world networks.

On the theoretical level, can regularization of sparse graphs produce the same optimal bound $2\sqrt{d}(1+o(1))$ that we saw for dense graphs in (1.1)? Thus, an ideal regularization should bring all parasitic outliers of the spectrum into the bulk. If so, this would lead to a potentially simple spectral clustering algorithm for community detection in networks which matches the theoretical lower bounds. Algorithms with optimal rates exist for this problem , but their analysis is very technical.

2. How exactly concentration depends on regularization?

3. Average expected degree?

Both concentration results of this paper, Theorems 1.1 and 1.2, depend on $d=\max_{ij}np_{ij}$ . Would it be possible to reduce $d$ to the maximal expected degree $d_{ave}=\max_{i}\sum_{j}p_{ij}$ ?

4. From random graphs to random matrices?

Adjacency matrices of random graphs are particular examples of random matrices. Does the phenomenon we described, namely that regularization leads to concentration, apply for general random matrices? Guided by Theorem 1.1, we might expect the following for a broader general class of random matrices $B$ with mean zero independent entries. First, the only reason the spectral norm of $B$ is too large (and that it is determined by outliers of spectrum) could be the existence of a large row or column. Furthermore, it might be possible to reduce the norm of $B$ (and thus bring the outliers into the bulk of spectrum) by regularizing in some way the rows and columns that are too large. For related questions in random matrix theory, see the recent work .