Norms of random matrices: local and global problems

Elizaveta Rebrova, Roman Vershynin

Introduction

2. Random matrices and their norms

with high probability. Note that the order n\sqrt{n} is the best we can generally hope for. Indeed, if the entries of AA have unit variance, then the typical magnitude of the Euclidean norm of a row of AA is n\sim\sqrt{n}, and the operator norm of AA can not be smaller than that. Moreover, by the bounded fourth moment assumption is nearly necessaryFor almost surely convergence of A/n\|A\|/\sqrt{n}, fourth moment is necessary and sufficient , while for convergence in probability the weak fourth moment is necessary and sufficient . for the bound

A number of quantitative and more general versions of these bounds are known .

3. Main results

Now let us postulate nothing at all about the distribution of the i.i.d. entries of AA. It still makes sense to ask: is enforcing the ideal bound (1.1) for random matrices a local or a global problem? That is, can we enforce the bound (1.1) by modifying the entries in a small submatrix of AA? We will show in this paper that this is possible if and only if the entries of AA have zero moment and finite variance. The “if” part is covered by the following theorem.

where CC is a sufficiently large absolute constant.

This shows that the dependence on ε\varepsilon in Theorem 1.1 is almost optimal.

By rescaling, a more general version of Theorem 1.1 holds for any finite variance of the entries. The two main assumptions in this theorem – mean zero and finite variance – are necessary in Theorem 1.1. Without either of them, the problem becomes global in a strong sense: the desired O(n)O(\sqrt{n}) bound can not be achieved even after modifying a large submatrix. This is the content of the following result.

It should be noted that while Theorem 1.1 becomes harder for smaller ε\varepsilon, Theorem 1.3 becomes harder for larger ε\varepsilon, those near 11.

4. What if we remove large entries?

One may naturally wonder what exactly may cause the norm of a mean zero random matrix AA to be too large. A natural guess is that the only troublemakers are a few large entries of AA. Indeed, this is exactly how the necessity of the fourth moment for (1.1) was shown in . So we may ask – can we obtain a result like Theorem 1.1 simply by zeroing out a few largest entries of AA?

The answer is no. A counterexample is a sparse Bernoulli matrix AA, whose i.i.d. entries take values ±n\pm\sqrt{n} with probability 1/2n1/2n each and with probability 11/2n1-1/2n. It is not hard to check that AA is likely to have a row whose norm exceeds cnlog(n)/loglognnc\sqrt{n}\log(n)/\log\log n\gg\sqrt{n}, and consequently we have An\|A\|\gg\sqrt{n}. In other words, without removal of any entries the norm of AA is too large. However, if we are to remove any entries based purely on their magnitudes, we must remove them all. (Recall that all non-zero elements of AA have the same magnitude n\sqrt{n}.) But removal of all nonzero entries of AA is not a local intervention, since such entries can not be placed in a small submatrix (we explained this in Remark 1.2).

Nevertheless, under slightly stronger moment assumptions than in Theorem 1.1, zeroing out a few large entires does bring the norm of AA down. The following result can be quickly deduced by truncation from known bounds on random matrices such as .

We will deduce Proposition 1.4 from a general bound of A. Bandeira and R. van Handel in Section 10.

5. Other related results

Acknowledgement

Si Tang and Antonio Auffinger were first to note a version of Proposition 1.4, which they kindly showed to us along with a proof based on . Our correspondence led us to add Section 1.4. We are thankful to Ramon van Handel who showed us a simple argument that we use here to prove Lemma 3.1. We also would like to thank the referee for the valuable suggestions, which helped to improve the presentation of our paper.

The method

Our approach to Theorem 1.1 utilizes and advances the methods developed recently in and . We will first control the cut norm of AA and then pass to the operator norm using Grothendieck-Pietsch factorization. Let us describe these steps in more detail.

Rather than bounding the operator norm of a random matrix AA directly, we shall compare it with two simpler norms,

The simplest of the three is the 22\to\infty norm. A quick check reveals that it equals the maximum Euclidean norm of the rows AiTA_{i}^{\mathsf{T}} of AA:

The next simplest norm is 2\infty\to 2, which can be conveniently computed as

This norm is equivalent within a constant factor to the cut norm from the computer science literature , where the maximum is taken over {0,1}n\{0,1\}^{n}. The hardest of the three is the operator norm,

To see why the difficulty in bounding these norms rises this way, note that one has to control nn random variables in (2.1), 2n2^{n} random variables in (2.2), and infinitely many random variables in (2.3).

2. Ideal relationships among the norms

How large do we expect the three norms to be for random matrices? For a simple example, let us first consider a Gaussian random matrix AA with i.i.d. N(0,1)N(0,1) entries. Then it is not difficult to check that

Indeed, note that the rows of AA have Euclidean norms n\sqrt{n} on average, so the bound on the 22\to\infty norm follows by union bound and using Gaussian concentration. The bound on the 2\infty\to 2 norm follows from (2.2) by using Gaussian concentration for the normal random vector AxAx and taking the union bound over {1,1}n\{-1,1\}^{n}. The bound on the operator norm is a non-asymptotic version of Bai-Yin’s law, see e.g. [24, Theorem 5.32].

One might wonder if (2.4) holds not only in the Gaussian case but generally for random matrices AA with i.i.d. entries that have zero mean and unit variance. In particular, it would be wonderful if the three norms were always related to each other as follows:

This, however, would be too optimistic to expect, since the bound An\|A\|\lesssim\sqrt{n} can not hold without higher moments assumptions as we mentioned in Section 1.1. Nevertheless, we will obtain a version of (2.5) after removal a small fraction of rows of AA. With high probability, we will be able to find subsets of rows J1J2J3J_{1}\subset J_{2}\subset J_{3} with cardinalities Jiεn|J_{i}|\leq\varepsilon n and such that

where the inequalities hide a factor that depends on ε\varepsilon.

3. A roadmap of the proof

The first step in proving (2.6) is to find a small set J1J_{1} with J1εn|J_{1}|\lesssim\varepsilon n and such that

with high probability. In other words, we would like to bound all rows of AA simultaneously by O(n)O(\sqrt{n}) after removing a few columns of AA. To show this we first focus on one row, where we need to bound a sum of independent random variables (the squares of the row’s entries). In Theorem 4.2 we show how to bound sums of independent random variables almost surely by gently damping the summands. Damping, or reweighting down, is a softer operation than removing entries. It allows us to treat in Section 5 all columns simultaneously without much effort, thus proving (2.7). The argument in this step is similar to the approach proposed recently in . We somewhat simplify the method of and also improve the dependence between the number of removed columns and the resulting 22\to\infty norm; this will ultimately lead to the optimal dependence on ε\varepsilon in Theorem 1.1.

At the next step, we extend J1J_{1} to a bigger set of rows J2J_{2} with J2εn|J_{2}|\lesssim\varepsilon n and so that

Suppose for a moment that we are not concerned about removal of any columns. It is not too hard to show the general bound

for a random matrix AA with independent, mean zero entries; we prove this in Lemma 6.1. However, this bound is not very helpful in our situation. We need to work with the matrix AJ1cA_{J_{1}^{c}} instead of AA, which is not trivial: the removal of the columns in J1J_{1} that we did in the first step made the entries of AJ1cA_{J_{1}^{c}} dependent. In Lemma 6.2, we first prove a variant of (2.9) for AJ1cA_{J_{1}^{c}} under an additional symmetry assumption on the distribution of the entries of AA. Then we manage to remove this assumption with a delicate symmetrization argument, which we develop in the rest of Section 6, with the final result being Theorem 6.6. The general idea of this step, as well as some of our arguments here, are inspired by . However we need to be considerably more careful than in to obtain (2.8) with a logarithmic dependence on ε\varepsilon.

Next, we pass from 2\infty\to 2 norm to the operator norm in Section 7. This is done by using Grothendieck-Pietsch factorization (Theorem 7.1), a result that yields the first inequality in (2.6) for completely arbitrary, even non-random, matrices. This reasoning was recently used in a similar context in .

The argument we just described works under the additional assumption that the entries of AA be O(n)O(\sqrt{n}) almost surely. To be specific, such boundedness assumption is needed to make the damping argument in Step 1 work with mild, logarithmic dependence on ε\varepsilon. The contribution of the entries that are larger than n\sqrt{n} are controlled in Section 8 by showing that there can not be too many of them. The unit variance assumption implies that there are O(1)O(1) such large entries per column on average. This does not mean, of course, that all columns will have O(1)O(1) large entries with high probability; in fact there could be columns with logn/loglogn\sim\log n/\log\log n large entries. But we will check in Lemma 8.1 that the number of such heavy columns is small; removing them will lead to the desired bound O(n)O(\sqrt{n}) on the operator norm for the matrix with large entries. We develop this argument in Proposition 8.4 and Corollary 8.6, and derive the full strength of Theorem 1.1 in Section 8.4.

Theorem 1.3 is proved in Section 9. The paper is concluded with Section 11 where we discuss some further problems.

Acknowledgements

We are thankful to Ramon van Handel who showed us a simple argument that we use here to prove Lemma 3.1 and to Antonio Auffinger for the interesting discussion of 2+ε2+\varepsilon finite moment case.

Preliminaries

Throughout the paper, positive absolute constant are denoted C,C1,c,c1C,C_{1},c,c_{1}, etc. Their values may be different from line to line. We often write aba\lesssim b to indicate that aCba\leq Cb for some absolute constant CC.

The discrete interval {1,2,,n}\{1,2,\ldots,n\} is denoted by [n][n]. If R\mathcal{R} is some subset of indices, R[n]×[n]\mathcal{R}\subset[n]\times[n], let us denote by ARA_{\mathcal{R}} the matrix obtained from AA by replacing the indices in R\mathcal{R} by zero:

We will often consider subsets of columns of the matrix, so when R=J×[n]\mathcal{R}=J\times[n] we use a simplified notation: for J[n]J\subset[n]

where AiA_{i} and AjA^{j} denote the rows and columns of AA.

Recall that the operator norm can be computed as a maximum of the quadratic form:

Taking the maximum over all unit vectors xx and yy, we complete the proof. ∎

3. Concentration

In this paper we make use of good concentration properties of the sums of sub-gaussian (and sub-exponential) random variables, that is, such that grow not faster than standard normal (respectively, exponential) random variables. Recall that by definition a random variable YY is called sub-gaussian if its moments satisfy

for some number M2>0M_{2}>0. The minimal number M2M_{2} is called the sub-gaussian moment of XX, denoted as Yψ2\|Y\|_{\psi_{2}}. Analogously, a random variable is called sub-exponential if

for some number M1>0M_{1}>0. The minimal number M1M_{1} is called the sub-exponential moment of YY, denoted as Yψ1\|Y\|_{\psi_{1}}.

The class of sub-gaussian random variables contains standard normal, Bernoulli, and generally all bounded random variables. The class of sub-exponential random variables is exactly the class of squares of sub-gaussians. See for more information and statements of standard concentration inequalities.

Also we will need a concentration inequality for random permutations from .

The same inequality holds for the sum S=i=1naπ(i)xiS^{\prime}=\sum_{i=1}^{n}a_{\pi(i)}x_{i} as well, since it has the same distribution as SS.

4. Discretization

The following lemma allows us to approximate a general continuous random variable by a sum of independent, scaled Bernoulli random variables. This lemma was originally proved in . Here we give a proof for completeness, and then discuss some particular cases needed for the proof of Theorem 1.1.

Consider a non-negative, continuous random variable XX. There exists a non-negative random variable XX^{\prime} satisfying the following.

XX^{\prime} stochastically dominates XX, i.e.

XX^{\prime} is a sum of scaled, independent Bernoulli random variables:

where qkq_{k} are non-negative numbers and ξk\xi_{k} are independent Ber(2k)\operatorname{Ber}(2^{-k}) random variables.

Set the values qkq_{k} to be the quantiles of the distribution of XX:

(These values are well defined since the cumulative distribution function of XX is continuous by assumption.) By definition, (qk)(q_{k}) is an increasing sequence. Define XX^{\prime} by (3.1).

To check part 1, note that by definition,

almost surely. Taking expectation of both sides, we obtain

Now, using the definition of qkq_{k}, we have

Let us prove part 2. If t[qk,qk+1)t\in[q_{k},q_{k}+1) for some k=0,1,2,k=0,1,2,\ldots, then using the definitions of XX^{\prime} and qkq_{k} we obtain

and the inequality in part 2 follows. If tqt\geq q_{\infty} then, using the continuity of the cumulative distribution of XX, we obtain

and the inequality in part 2 follows again. The proof is complete. ∎

Suppose XMX\leq M almost surely. Then, in the second part of the conclusion of Lemma 3.3, XX can be represented as a finite sum

where qkq_{k} are non-negative numbers, qk[0,M]q_{k}\in[0,M], and ξk\xi_{k} are independent Ber(pk)\operatorname{Ber}(p_{k}) random variables. Here pk=2k1/Mp_{k}=2^{-k}\geq 1/M for k<κk<\kappa and pκ=1/Mp_{\kappa}=1/M.

Stochastic dominance of XX^{\prime} over XX in Lemma 3.3 implies that one can realize the random variables XX and XX^{\prime} on the same probability space so that

Moreover, in the same way we can construct a majorizing collection for any collection of independent random variables. In particular, we can do it for all entries of the matrix AA at once.

Damping a sum of independent random variables

Here we will be interested in a stronger result – that the sum be O(n)O(n) almost surely instead of in expectation. To do this, we will be looking for random weights

To make the damping as gentle as possible, we are looking for largest possible weights WiW_{i}, hopefully very close to 11.

To get started, let us consider the simple case where n=1n=1 and try to damp one random variable.

Let ε(0,1)\varepsilon\in(0,1). There exists a random variable WW taking values in $$ and such that

Fix a level L1L\geq 1 whose value we will choose later, and define

Next, the lower bound in (4.2) holds trivially since W1W\leq 1. For the upper bound, we have

2. Damping a sum of random variables

Now let us address the damping problem for general number nn of random variables, which we described in the beginning of this section. Applying Lemma 4.1 for each random variable XiX_{i}, we get weights WiW_{i} such that

for small ε\varepsilon. We will now considerably improve both these bounds, making only one mild extra assumption that Xi=O(n)X_{i}=O(n) almost surely.

Let X1,,XnX_{1},\ldots,X_{n} be i.i.d. random variables such that

for some K1K\geq 1. Let ε(0,1/2)\varepsilon\in(0,1/2). There exist random variables W1,,WnW_{1},\ldots,W_{n} taking values in $$ and such that

Improvement in the order of nn in (4.4) does not require an extra boundedness assumption, and it was done in previous work . We employ the same ideas as in [19, Lemma 3.3] and obtain better (logarithmic) dependence on ε\varepsilon in (4.3) in trade of the additional assumption mentioned.

Step 1: Bernoulli distribution. Let us first prove the theorem in the partial case where XjX_{j} are scaled Bernoulli random variables. Assume that XjX_{j} can take values qq and , and

Let ν\nu denote the (random) number of nonzero XjX_{j}’s:

Here is how we will define the weights WjW_{j}. If Xj=0X_{j}=0 then clearly there is no need to damp XjX_{j} so put Wj=1W_{j}=1. The same applies if the number ν\nu of non-zero XjX_{j}’s does not significantly exceed its expectation pnpn. Otherwise we damp all terms by the same amount Wjpn/νW_{j}\sim pn/\nu. Formally, we fix some parameter L=L(K,ε)L=L(K,\varepsilon) whose value we will determine later, and set

Let us check (4.3). In the event when νLpn\nu\leq Lpn, we have

Let us now check (4.4). Since the lower bound is trivial, we will only have to check the upper bound. We will again split the calculation into two cases based on the size of ν\nu. If νLpn\nu\leq Lpn then all Wj=1W_{j}=1, so we trivially get

If ν>Lpn\nu>Lpn, then the definition of WjW_{j} gives

Since νBinom(n,p)\nu\sim\operatorname{Binom}(n,p), we have

using a standard consequence of Stirling’s approximation. Thus

provided that L10L\geq 10. Thus we showed that

where in the last step we used the assumption that p1/Knp\geq 1/Kn that we made in (4.5).

Now that we have the bounds (4.6) and (4.7), it is enough to choose

which implies that E1+εE\leq 1+\varepsilon. The proof for the Bernoulli distribution is complete.

Here XjkX_{jk} are independent random variables; each XjkX_{jk} can take values qkq_{k} and , and

The argument will be similar to step 1 of the proof. For each level kk we let νk\nu_{k} denote number of non-zero XjkX_{jk}’s:

Again, for each level kk define the weights WjkW_{jk} like in step 1:

since WjWjkW_{j}\leq W_{jk} by construction. Now, for each level kk, we can use step 1 of the proof, where we showed in (4.6) that

Let us now check (4.4). The lower bound is trivial, and we will only have to check the upper bound. For each level kk, we can use step 1 of the proof, where we showed in (4.7) that

which is true as long as L10L\geq 10. Then, by construction we have

where in the last step we used the inequality 1+xex1+x\leq e^{x}. Recall from (4.8) that the exponents pkp_{k} form a decreasing geometric progression with values 2k2^{-k} until the last (smallest) term of order 1/Kn1/Kn. So this last term dominates the sum k=1κeLpkn\sum_{k=1}^{\kappa}e^{-Lp_{k}n}, and we obtain

Now that we have the bounds (4.10) and (4.11), it is enough to choose

with C\refthm:dampingsum6KC_{\ref{thm: damping sum}}\geq 6K and the right hand side of (4.11) will be bounded by

as claimed. The proof of the theorem is complete. ∎

The 2→∞→22\to\infty norm of random matrices

In this section we prove Theorem 1.1 under the additional assumption that all entries AijA_{ij} of AA are not too large. Specifically, let us assume that

Consider an n×nn\times n random matrix AA with i.i.d. entries AijA_{ij} which have mean zero and at most unit variance and satisfy (5.1). Let ε(0,1/2]\varepsilon\in(0,1/2]. Then with probability at least 1exp(εn)1-\exp(-\varepsilon n), there exists a subset J[n]J\in[n] with cardinality Jεn|J|\leq\varepsilon n such that

We apply Theorem 4.2 for the squares of the elements in each row of AA, i.e. for the random variables (ai12,,ain2)(a_{i1}^{2},\ldots,a_{in}^{2}). This gives us random weights WijW_{ij}\in which satisfy for each i[n]i\in[n] that

To make the same system of weights work for all rows, we define

Then obviously VjWijV_{j}\leq W_{ij} for every ii, and so

We will remove from AA the columns whose weights VjV_{j} are too small, namely those in

as we claimed in the lemma. Indeed, if J>εn|J|>\varepsilon n then using that all VjV_{j}\in we have

But the probability of this event can be bounded by Markov’s inequality:

where in the last bound we used (5.2). This proves (5.3).

It remains to check that all rows BiB_{i} of the matrix B=A[n]×J0cB=A_{[n]\times J_{0}^{c}} are bounded as claimed. We have

Taking the square root of both sides completes the proof. ∎

From 2→∞→22\to\infty norm to ∞→2→2\infty\to 2 norm

In this section we will control the 2\infty\to 2 norm of a random matrix. Our first task is to bound the 2\infty\to 2 norm by the simpler 22\to\infty norm. There are two ways to do this, both of them going back to . The resulting comparison inequalities are interesting in their own right; we state them in Lemmas 6.1 and 6.3. The ultimate result of this section is Theorem 6.6, which gives an optimal bound O(n)O(n) on the 2\infty\to 2 norm of a random matrix after removing a small fraction of columns.

The first method is based on flipping the signs of the entries independently at random. Here is the main result of this section.

Let AA be an n×nn\times n random matrix whose entries are independent, mean zero random variables. Then

Let εij\varepsilon_{ij} be independent Rademacher random variables (which are also independent of AA) and consider the random matrix

A basic symmetrization inequality (see [15, Lemma 6.3]) yields

Condition on AA; the randomness now rests in the random signs (εij)(\varepsilon_{ij}) only. It suffices to show that the conditional expectation satisfies

Fix x{1,1}nx\in\{-1,1\}^{n}. Using independence and (2.1), we get

Moreover, the standard concentration results ([24, Lemma 5.9]) show that each ξi\xi_{i} is a sub-gaussian random variable, and we have

Thus ξi2\xi_{i}^{2} is a sub-exponential random variable (see [24, Lemma 5.9]) and

Applying Bernstein’s concentration inequality [24, Corollary 5.17] together with (6.3) and (6.4), we obtain

where t1t\geq 1 is arbitrary. Integration of these tails implies (6.1). ∎

We will need a minor variation of Lemma 6.1 that can be applied even when some of the columns of AA are removed.

Let AA be an n×nn\times n random matrix whose entries are independent, symmetric random variables. Let J[n]J\subset[n] be a random subset, which is independent of the signs of the entries of AA. Then

So, the only part of Lemma 6.1 that does not work for a matrix with removed columns is the symmetrization part. In the following two sections we will develop the tools to overcome the extra symmetry assumption we have to add in Lemma 6.2.

2. Using random permutations

We just showed how to convert an 2\infty\to 2 bound to a 22\to\infty bound for random matrices by using random signs. Alternatively, one can use random permutations for the same purpose, and obtain the following bound.

Let AA be an n×nn\times n random matrix with i.i.d. entries. Then

where 1=(1,1,,1){\textbf{1}}=(1,1,\ldots,1) denotes the vector whose all coordinates equal 11.

The concentration inequality for random permutations (Lemma 3.2) states that each ξi\xi_{i} is a sub-gaussian random variable, and we have

Just like in the proof of Lemma 6.1, this implies that

Since the expectation is bounded by the ψ1\psi_{1} norm (see e.g. [24, Definition 5.13]), we conclude that

Applying Bernstein’s inequality like in Lemma 6.1, we find that

for all t0t\geq 0. Thus, for any t1t\geq 1 we have with probability at least 1exp(tn)1-\exp(-tn) that

We have already bounded the first sum. As for the second one, the definition of ξ\xi in (6.6) yields

where mm denotes the number of ones in xjx_{j} and AiTA_{i}^{\mathsf{T}} is the ii-th row of AA. Thus

We substitute this and (6.7) into (6.8) and obtain that for any t1t\geq 1,

It remains to recall (6.2) and take a union bound over x{1,1}nx\in\{-1,1\}^{n}. It follows that the inequality

where t1t\geq 1 is arbitrary. Integration of these tails implies (6.5). ∎

It is worthwhile to mention a high-probability version of Lemma 6.3.

Let AA be an n×nn\times n random matrix with i.i.d. entries. Then with probability at least 1en1-e^{-n} we have

where 1=(1,1,,1){\textbf{1}}=(1,1,\ldots,1) denotes the vector whose all coordinates equal 11.

At the end of the proof of Lemma 6.3, we obtained inequality (6.9) which states (for large constant tt) that

with probability at least 1en1-e^{-n}. Note that

deterministically. Indeed, it is easy to check that permutations of the elements of the rows of AA do not affect these two quantities. It follows that

3. Bounding 2→∞→22\to\infty and ∞→2→2\infty\to 2 norms with tiny probability

Recall from Section 2.2 that ideally, we would want

with high probability. But this is too good to be true in our situation, where we assume only two moments for the entries of AA. Nevertheless, we will now show that these bounds still hold, albeit with exponentially small probability.

Let AA be an n×nn\times n random matrix whose entries are i.i.d. random variables with mean zero and at most unit variance. Let δ(0,1/2)\delta\in(0,1/2). Then

with probability at least 12exp(δ2n)\frac{1}{2}\exp(-\delta^{2}n).

We will first bound below the probability of the event

and then use Lemma 6.4 to control A2\|A\|_{\infty\to 2}.

where AiTA_{i}^{\mathsf{T}} denote the rows of AA. Thus Ei=1nEi\mathcal{E}\subset\bigcap_{i=1}^{n}\mathcal{E}_{i} where

are independent events. This reduces the problem to bounding the probability of each event Ei\mathcal{E}_{i} below.

The assumptions on the entries of AA imply that

Using Chebyshev’s inequality, we see that

By independence of the events Ei\mathcal{E}_{i}, this implies

Next we apply Lemma 6.4, which states that the event

It remains to note that by definition of E\mathcal{E} and F\mathcal{F}, the event EF\mathcal{E}\cap\mathcal{F} implies the inequalities in (6.10). ∎

4. Bounding ∞→2→2\infty\to 2 norm with high probability

In the previous section, we were able to prove the optimal bounds

for a random matrix AA, but they only hold with exponentially small probability. We claim that the probability of success can be increased to almost 11 if we are allowed to remove a few columns of AA. We already proved this fact for the 22\to\infty norm in Lemma 5.1. It is time to handle the 2\infty\to 2 norm.

Consider an n×nn\times n random matrix AA with i.i.d. entries AijA_{ij} which have mean zero and at most unit variance and satisfy (5.1). Let ε(0,1/2]\varepsilon\in(0,1/2]. Then with probability at least 12exp(εn)1-2\exp(-\varepsilon n), there exists a subset J[n]J\in[n] with cardinality Jεn|J|\leq\varepsilon n such that

Step 1: Defining the two key events. We will be interested in the two key events that suitably control the 22\to\infty and 2\infty\to 2 norms of a random matrix. Thus, for a random matrix BB and numbers r,K0r,K\geq 0, we define

In terms of these events, we want to show that

for some absolute constant CC^{\prime}. Since the latter event is so likely, intersecting with it would not cause much harm. Indeed, we will show that the bad event

This would finish the proof, since we would then have

Step 2: Symmetrization. As an intermediate step, let us bound the probability of a symmetrized version of B\mathcal{B}, namely the event

and AA^{\prime} is an independent copy of the random matrix AA. We claim that

Step 3. Using the small-probability bounds. The last piece of information we will use is the conclusion of Lemma 6.5 for δ:=1/(2lnε1)\delta:=1/(2\ln\varepsilon^{-1}). It states that the good event

Note in passing that there is no guarantee that this statement would hold for the same constants CC and CC^{\prime} as we chose in the definition of B\mathcal{B} above. However, we can make this happen by adjusting these constants upwards as necessary. The reader can easily check both (6.12) and (6.14) would still hold after such an adjustment.

The event B\mathcal{B} is determined by AA, and G\mathcal{G} is determined by AA^{\prime} only. Thus B\mathcal{B} and G\mathcal{G} are independent, and (6.15) gives

Thus, using (6.12) and (6.14), we conclude that

We have shown (6.11) and thus have completed the proof of the theorem. ∎

From ∞→2→2\infty\to 2 norm to the operator norm: controlling the bounded entries

In Theorem 6.6, we gave an optimal O(n)O(n) bound for the 2\infty\to 2 norm of a random matrix with few removed columns. We will now convert this into an optimal O(n)O(\sqrt{n}) bound for the operator norm. This can be done by applying a form of Grothendieck-Pietsch theorem (see [15, Proposition 15.11]), which has been used recently in [14, section 3.2] in a similar context.

Let BB be a k×mk\times m real matrix and δ>0\delta>0. Then there exists J[m]J\subset[m] with Jδm|J|\leq\delta m such that

Applying Theorem 6.6 followed by Grothendieck-Pietsch theorem, we obtain the following result.

Consider an n×nn\times n random matrix AA with i.i.d. entries AijA_{ij} which have mean zero and at most unit variance and satisfy (5.1). Let ε(0,1]\varepsilon\in(0,1]. Then with probability at least 12exp(εn/2)1-2\exp(-\varepsilon n/2), there exists a subset J[n]J\in[n] with cardinality Jεn|J|\leq\varepsilon n such that

Apply Theorem 6.6 for ε/2\varepsilon/2 instead of ε\varepsilon. We obtain a subset of columns J1[n]J_{1}\subset[n], J1εn/2|J_{1}|\leq\varepsilon n/2, which satistfies

with probability at least 12exp(εn/2)1-2\exp(-\varepsilon n/2).

Next apply Grothendieck-Pietsch Theorem 7.1 for the matrix AJ1cA_{J_{1}^{c}} and for δ=ε/2\delta=\varepsilon/2. We obtain a further subset J2J1cJ_{2}\subset J_{1}^{c}, J2δJ1cεn/2|J_{2}|\leq\delta|J_{1}^{c}|\leq\varepsilon n/2, such that the removal of columns in both J:=J1J2J:=J_{1}\cup J_{2} leads to

In the last inequality, we used the bound (7.1) and that δ=ε/2\delta=\varepsilon/2 and J1cnεn/2n/2|J_{1}^{c}|\geq n-\varepsilon n/2\geq n/2. The proof is complete. ∎

We are ready to prove a partial case of Theorem 1.1, for the matrices whose entries are O(n)O(\sqrt{n}). It follows by applying Lemma 7.2 for AA and ATA^{\mathsf{T}} separately, and then superposing the results.

Apply Lemma 7.2 for AA and ATA^{\mathsf{T}}. We obtain that with probability at least 14exp(εn/2)1-4\exp(-\varepsilon n/2), there exists sets II and JJ with at most εn\varepsilon n indices in each, and such that

We already controlled the first term in (7.2). As for the second term, since adding columns can only increase the operator norm, we have AIc×JAIc×[n]\|A_{I^{c}\times J}\|\leq\|A_{I^{c}\times[n]}\|, which we also bounded in (7.2). The proof is complete. ∎

Controlling the large entries, and completing the proof of Theorem 1.1

In the previous section, we proved a partial case of Theorem 1.1 that controls relatively small entries of AA, those of the order O(n)O(\sqrt{n}). Larger entries will be controlled in this section.

The following general lemma will help us analyze the patterns such large entries can form.

Let BB be an n×nn\times n random matrix whose entries are independent Bernoulli random variables with mean pp. Let ε(0,1/2]\varepsilon\in(0,1/2]. Consider the rows of BB with more than 21pn+2lnε121pn+2\ln\varepsilon^{-1} ones. Then with probability 1exp(εn/2)1-\exp(-\varepsilon n/2), these rows have at most εn\varepsilon n ones altogether.

To see the connection to our original problem, we will later choose the entries of BB to be the indicators of the large entries of AA.

Let K21pnK\geq 21pn be a number to be chosen later. (We will eventually choose KK as 21pn+2lnε121pn+2\ln\varepsilon^{-1} as in the statement of the lemma.) Define the random variables

The quantity of interest is the total number of ones in the heavy rows, and it equals i=1nXi\sum_{i=1}^{n}X_{i}. To control this sum of independent random variables, we can use the standard Bernstein’s trick (commonly called Chernoff’s bound), where we use Markov’s inequality after exponentiation. We obtain

where the last equality follows by independence and identical distribution. Now, by definition of X1X_{1} we have

Substituting this bound into (8.2), we conclude that

if we choose KK so that eKε/2e^{-K}\leq\varepsilon/2. To finish the proof, recall that our argument works if KK satisfies the two conditions: K21pnK\geq 21pn and eKε/2e^{-K}\leq\varepsilon/2. We thus choose K:=21pn+2lnε1K:=21pn+2\ln\varepsilon^{-1} and complete the proof. ∎

Apply Lemma 8.1 for BB and BTB^{\mathsf{T}} with ε/2\varepsilon/2 instead of ε\varepsilon, and take the intersection of the two good events. With the required probability, we obtain a set of εn\varepsilon n bad entries of BB whose removal makes all rows and columns of BB contain at most 21pn+2lnε121pn+2\ln\varepsilon^{-1} ones. It remains to note that these εn\varepsilon n entries can be trivially placed in some εn×εn\varepsilon n\times\varepsilon n submatrix of BB, and deletion of the whole εn×εn\varepsilon n\times\varepsilon n submatrix can only decrease the number of non-zero elements in the rows and columns of the residual part. ∎

It is not difficult to obtain a version of Corollary 8.2 for symmetric random matrices. This version can be interpreted as a statement about Erdös-Rényi random graphs G(n,p)G(n,p), with BB playing the role of the adjacency matrix. It states that with high probability, one can make all degrees of a G(n,p)G(n,p) random graph bounded by O(pn+lnε1)O(pn+\ln\varepsilon^{-1}) after removing the internal edges from a sub-graph with εn\varepsilon n vertices.

2. Moderately large entries

We will use Corollary 8.2 to deduce Theorem 1.1 for matrices with moderately large entries. Namely, we assume here that all entries of AA satisfy

Consider the matrix BB whose elements are indicators of moderately large entries of AA, i.e.

Then BijB_{ij} are i.i.d. Bernoulli random variables with mean

3. Very large entries

Finally, we will need to prove Theorem 1.1 for very large entries – now we assume that all entries of AA satisfy

There are typically very few such entries, as the following simple result shows.

Thus the expected number of non-zero entries in AA is at most εn/25\varepsilon n/25. A standard application of Chernoff’s inequality (see e.g. [25, Chapter 2]) gives

Since a set of εn\varepsilon n indices can be always placed in an εn×εn\varepsilon n\times\varepsilon n submatrix, we can state Lemma 8.5 as follows.

4. Proof of Theorem 1.1

We are going to assemble Proposition 7.3 for the bounded entries of AA, Proposition 8.4 for moderately large entries, and Corollary 8.6 for very large entries. The εn×εn\varepsilon n\times\varepsilon n sub-matrices that appear in these results are possibly different. The following simple lemma will help us to combine them into one.

Let BB be a matrix. Zeroing out any submatrix of BB cannot increase the operator norm more than twice.

The last inequality follows from the fact that zeroing out any subset of rows or columns cannot increase the operator norm. ∎

One may wonder if zeroing out a submatrix can increase the norm at all. Basic simulations show that it actually can.

Decompose AA into a sum of three n×nn\times n matrices with disjoint support,

where BB contains bounded entries of AA – those that satisfy Aijn/2|A_{ij}|\leq\sqrt{n}/2, the matrix MM contains moderately large entries – those for which n/2<Aij5n/ε\sqrt{n}/2<|A_{ij}|\leq 5\sqrt{n/\varepsilon}, and LL contains large entries – those satisfying Aij>5n/ε|A_{ij}|>5\sqrt{n/\varepsilon}.

To bound BB, let us subtract the mean and first bound

The entries of this matrix have zero mean and satisfy

(where we used the moment assumption) and

This proves the conclusion of Theorem 1.1 with 3ε3\varepsilon instead of ε\varepsilon, where ε(0,1/2]\varepsilon\in(0,1/2] is arbitrary. By rescaling, Theorem 1.1 holds also as originally stated. This concludes the proof of Theorem 1.1. ∎

Global problem: proof of Theorem 1.3

In this section we prove Theorem 1.3, which states that either nonzero mean or infinite second moment make it impossible to repair the matrix norm by removing a small submatrix. We will first prove a non-asymptotic version version of this result. Once this is done, an application of Borel-Cantelli Lemma will quickly yield Theorem 1.3.

Consider an n×nn\times n random matrix AA whose entries are i.i.d. random variables that have either nonzero mean or infinite second moment, and let ε(0,1)\varepsilon\in(0,1). Then, for any M>0M>0 there exists n0n_{0} that may depend only on ε\varepsilon, MM and the distribution of the entries, and such that for any n>n0n>n_{0} the following event holds with probability at least 1en1-e^{-n}: every (1ε)n×(1ε)n(1-\varepsilon)n\times(1-\varepsilon)n submatrix AA^{\prime} of AA satisfies

Indeed, modifying an εn×εn\varepsilon n\times\varepsilon n submatrix always leaves some (1ε)n×(1ε)n(1-\varepsilon)n\times(1-\varepsilon)n submatrix AA^{\prime} intact, so we can apply Proposition 9.1 for that submatrix.

Here we will prove the part of Proposition 9.1 about infinite second moment; the case of nonzero mean will be treated in Section 9.2. Let us start with the following lemma which will help us treat a fixed submatrix.

Consider an m×mm\times m random matrix BB whose entries are i.i.d. random variables with infinite second moment. Then, for any M>0M>0 there exists m0m_{0} that may depend only on MM and the distribution of the entries, and such that for any m>m0m>m_{0} we have

with probability at least 1exp(M2m)1-\exp(-M^{2}m).

(This follows easily from Lebesgue’s monotone convergence theorem.)

Consider the matrix Bˉ\bar{B} with entries Bˉij\bar{B}_{ij}. We have

Then we bound the failure probability as follows:

Apply Hoeffding’s inequality for the random variables Bˉij2\bar{B}_{ij}^{2} and use that they are bounded by K2K^{2} by construction. The probability above gets bounded by

If m>2K2/M2=m0m>2K^{2}/M^{2}=m_{0}, this probability can be further bounded by exp(M2m)\exp(-M^{2}m), as claimed. ∎

We can assume without loss of generality that MM is large enough depending on ε\varepsilon. (Indeed, once the conclusion of the proposition holds for one value of MM it automatically holds for all smaller values.)

Apply Lemma 9.2 for an m×mm\times m matrix AnA^{\prime}_{n} with m=(1ε)nm=(1-\varepsilon)n, and then take a union bound over all (nm)2\binom{n}{m}^{2} possible choices of such submatrices. It follows that the conclusion of Proposition 9.1 holds with probability at least

By Stirling’s approximation, we have (nm)(en/m)m\binom{n}{m}\leq(en/m)^{m}. Using this and substituting m=(1ε)nm=(1-\varepsilon)n, we bound the probability below by

If the value of MM is sufficiently large depending on ε\varepsilon, this probability is larger than 1exp(n)1-\exp(-n), as claimed. Proposition 9.1 for infinite second moment is proved. ∎

2. Nonzero mean

Now we will prove the part of Proposition 9.1 about nonzero mean. We can assume here that the second moment of the entries AijA_{ij} is finite, as the opposite case was treated in Section 9.1. As before, we will first focus on one submatrix. In the following lemma we make an extra boundedness assumption, which we will get rid of using truncation later.

Consider an m×mm\times m random matrix BB whose entries are i.i.d. random variables that satisfy

Then, for any M>0M>0 there exists m0m_{0} that may depend only on μ\mu, σ\sigma, KK and MM, and such that for any m>m0m>m_{0} we have

with probability at least 1exp(M2m)1-\exp(-M^{2}m).

(To check this inequality, recall that BxTBx\|B\|\geq x^{\mathsf{T}}Bx for any unit vector xx; use this for the vector xx whose all coordinates equal 1/m1/\sqrt{m}.) Then we can bound the failure probability as follows:

Apply Bernstein’s inequality for the random variables BijB_{ij} and use that they have variance at most σ2\sigma^{2} and are bounded by KmK\sqrt{m} by assumption. The failure probability gets bounded by

If mm is large enough depending μ\mu, σ\sigma, KK and MM, then this probability can be further bounded by exp(M2m)\exp(-M^{2}m), as claimed. ∎

Next, we will use truncation to get rid of the boundedness assumption in Lemma 9.3 and thus prove the following.

Consider an m×mm\times m random matrix BB whose entries are i.i.d. random variables that satisfy

Then, for any M>0M>0 there exists m0m_{0} that may depend only on μ\mu, σ\sigma, KK, MM and the distribution of the entries, and such that for any m>m0m>m_{0} we have

with probability at least 1exp(M2m)1-\exp(-M^{2}m).

Choosing m0m_{0} large enough depending on MM and the distribution of BijB_{ij}, we can make sure that for any mm0m\geq m_{0} the truncated random variables

(This follows easily from Lebesgue’s monotone convergence theorem.)

Let us consider the event that all entries of BB are appropriately bounded:

Suppose for a moment that (9.2) fails, so we have B<Mm\|B\|<M\sqrt{m}. Since the inequality Bmaxi,jBij\|B\|\geq\max_{i,j}|B_{ij}| is always true, the event E\mathcal{E} must hold in this case. This in turn implies that the truncation has no effect on the entries, i.e. Bˉij=Bij\bar{B}_{ij}=B_{ij} for all i,ji,j.

We have shown that in the event of the failure of (9.2), we may automatically assume that the entries of BB are appropriately bounded. Therefore the failure probability satisfies

where Bˉ\bar{B} denotes the matrix with the truncated entries Bˉij\bar{B}_{ij}. It remains to apply Lemma 9.3 for the random matrix Bˉ\bar{B}, noting that truncation may only decrease the second moment. The failure probability gets bounded by exp(M2m)\exp(-M^{2}m), as claimed. ∎

As we mentioned in the beginning of this section, we can assume that the entries BijB_{ij} have finite second moment σ2\sigma^{2}. Then the conclusion of the proposition follows by exact same union bound argument as in the end of Section 9.1 (just use Lemma 9.4 instead of Lemma 9.2 there.) ∎

3. Proof of Theorem 1.3

where the minimum is taken over all (1ε)n×(1ε)n(1-\varepsilon)n\times(1-\varepsilon)n submatrices AnA^{\prime}_{n} of AnA_{n}. As we mentioned below Proposition 9.1, this would imply the conclusion of Theorem 1.3, since modifying an εn×εn\varepsilon n\times\varepsilon n submatrix leaves some (1ε)n×(1ε)n(1-\varepsilon)n\times(1-\varepsilon)n sub-matrix intact.

where the minimum has the same meaning as before. By Proposition 9.1, there exists n0n_{0} such that

We have shown that for any M>0M>0, with probability 11 there exists NN such that

Intersecting these almost sure events for M=1,2,M=1,2,\ldots, we conclude (9.3). Theorem 1.3 is proved. ∎

Removal of large entries under 2+ε2𝜀2+\varepsilon moments

2𝜀2+\varepsilon moments In this section we give a proof of Proposition 1.4 based on a general bound on random matrices due to A. Bandeira and R. van Handel .

Let us call an entry AijA_{ij} large if Aij>R:=n1/2ε/8|A_{ij}|>R:=n^{1/2-\varepsilon/8}, otherwise call the entry small.

We claim that there are very few large entries with high probability, and we can check this by the same argument as in Lemma 8.5. Indeed, the 2+ε2+\varepsilon moment assumption and Chebyshev’s inequality give

where the last inequality follows by our choice of RR. Thus the expected number of large entries is at most n2/n1+ε/8=n1ε/8n^{2}/n^{1+\varepsilon/8}=n^{1-\varepsilon/8}. A standard application of Chernoff’s inequality (see e.g. [25, Chapter 2]) gives

For convenience, let us subtract the mean, and first bound

which is an n×nn\times n random matrix with independent mean zero entries GijG_{ij}. A theorem of A. Bandeira and R. van Handel (see [6, Remark 3.13]) states that for any t>0t>0, we have

(where we used the moment assumption) and

Then, using (10.2) with t=nt=\sqrt{n}, we conclude that

where the last inequality holds due to the definition of RR, if nn is sufficiently large in terms of ε\varepsilon.

where we used the moment assumption and a weaker form of the bound (10.1).

Concluding, it follows from (10.3) and (10.1) that with probability at least exp(nε/5)\exp(-n^{\varepsilon/5}), we have

The proof of Proposition 1.4 is complete. ∎

Further questions

Several extensions of Theorem 1.1 seem plausible.

It is natural to expect a version Theorem 1.1 even if the entries of AA are not identically distributed. Our argument relies on the identical distribution in several places, including discretization arguments (proof of Theorem 4.2) and symmetrization (proofs of Lemmas 6.3 and 6.4).

A version of Theorem 1.1 should hold for symmetric matrices AA with independent entries on and above the diagonal. A simplest way to get this result would be to use Theorem 1.1 to control the parts of AA above and below the diagonal separately, and then combine them. However, for this argument one would need a version of Theorem 1.1 for non-identical distributed entries.

Unlike Feige-Ofek’s result mentioned in Section 1.5, Theorem 1.1 does not indicate what sub-matrix should be removed to improve the norm; it is rather an existential result. It would be nice to have an explicit description of a submatrix to be removed.

It would be good to remove the logarithmic factor lnε1\ln\varepsilon^{-1} from the bound in Theorem 1.1, or to show that this factor is necessary. Such bound would be optimal up to an absolute constant factor.

Finally, while Remark 1.2 states that the dependence on ε\varepsilon in Theorem 1.1 is optimal in general, this dependence might be dramatically improved under a natural boundedness assumption. Namely, suppose that the entries of AA are O(n)O(\sqrt{n}) almost surely. (In fact, most of the proof – until Section 8 – was done under this additional assumption.) In this case, is the dependence of the norm on ε\varepsilon logarithmic in Theorem 1.1, i.e.

In fact, for the partial case of Bernoulli matrices such that np=c0=constnp=c_{0}=const (where pp is a probability of a non-zero entry) this bound can be quickly deduced from Corollary 8.2.

non-zero elements of order O(n)O(\sqrt{n}). Hence,

References