Norms of random matrices: local and global problems

Elizaveta Rebrova, Roman Vershynin

Introduction

2. Random matrices and their norms

with high probability. Note that the order $\sqrt{n}$ is the best we can generally hope for. Indeed, if the entries of $A$ have unit variance, then the typical magnitude of the Euclidean norm of a row of $A$ is $\sim\sqrt{n}$, and the operator norm of $A$ cannot be smaller than that. Moreover, the bounded fourth moment assumption is nearly necessary for the bound (1.1): for almost sure convergence of $\|A\|/\sqrt{n}$, the fourth moment is necessary and sufficient, while for convergence in probability the weak fourth moment is necessary and sufficient.

A number of quantitative and more general versions of these bounds are known.

3. Main results

Now let us postulate nothing at all about the distribution of the i.i.d. entries of $A$. It still makes sense to ask: is enforcing the ideal bound (1.1) for random matrices a local or a global problem? That is, can we enforce the bound (1.1) by modifying the entries in a small submatrix of $A$? We will show in this paper that this is possible if and only if the entries of $A$ have zero mean and finite variance. The “if” part is covered by the following theorem.

where $C$ is a sufficiently large absolute constant.

This shows that the dependence on $\varepsilon$ in Theorem 1.1 is almost optimal.

By rescaling, a more general version of Theorem 1.1 holds for any finite variance of the entries. The two main assumptions in this theorem – mean zero and finite variance – are necessary in Theorem 1.1. Without either of them, the problem becomes global in a strong sense: the desired $O(\sqrt{n})$ bound can not be achieved even after modifying a large submatrix. This is the content of the following result.

It should be noted that while Theorem 1.1 becomes harder for smaller $\varepsilon$ , Theorem 1.3 becomes harder for larger $\varepsilon$ , those near $1$ .

4. What if we remove large entries?

One may naturally wonder what exactly may cause the norm of a mean zero random matrix $A$ to be too large. A natural guess is that the only troublemakers are a few large entries of $A$ . Indeed, this is exactly how the necessity of the fourth moment for (1.1) was shown in . So we may ask – can we obtain a result like Theorem 1.1 simply by zeroing out a few largest entries of $A$ ?

The answer is no. A counterexample is a sparse Bernoulli matrix $A$, whose i.i.d. entries take values $\pm\sqrt{n}$ with probability $1/2n$ each and $0$ with probability $1-1/n$. It is not hard to check that $A$ is likely to have a row whose norm exceeds $c\sqrt{n\log(n)/\log\log n}\gg\sqrt{n}$, and consequently we have $\|A\|\gg\sqrt{n}$. In other words, without removal of any entries the norm of $A$ is too large. However, if we are to remove any entries based purely on their magnitudes, we must remove them all. (Recall that all non-zero elements of $A$ have the same magnitude $\sqrt{n}$.) But removal of all nonzero entries of $A$ is not a local intervention, since such entries cannot be placed in a small submatrix (we explained this in Remark 1.2).
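The counterexample can be explored numerically. The following is a minimal sketch, assuming NumPy; the size `n = 500`, the seed, and the helper name `sparse_bernoulli` are illustrative choices and not part of the paper. It samples a matrix with entries $\pm\sqrt{n}$ with probability $1/2n$ each and zero otherwise, and compares the largest row norm and the operator norm with $\sqrt{n}$.

```python
import numpy as np

def sparse_bernoulli(n, rng):
    """n x n matrix with i.i.d. entries: +-sqrt(n) w.p. 1/(2n) each, 0 otherwise."""
    values = np.array([-np.sqrt(n), 0.0, np.sqrt(n)])
    probs = np.array([1 / (2 * n), 1 - 1 / n, 1 / (2 * n)])
    return rng.choice(values, size=(n, n), p=probs)

n = 500                              # illustrative size
rng = np.random.default_rng(0)       # arbitrary fixed seed
A = sparse_bernoulli(n, rng)

# A row with m non-zero entries has Euclidean norm sqrt(n) * sqrt(m).
max_row_norm = np.linalg.norm(A, axis=1).max()
# Largest singular value; it is never smaller than the largest row norm.
op_norm = np.linalg.norm(A, 2)

print(f"sqrt(n)          = {np.sqrt(n):.2f}")
print(f"largest row norm = {max_row_norm:.2f}")
print(f"operator norm    = {op_norm:.2f}")
```

Since every non-zero entry has the same magnitude, a rule that removes entries by size alone cannot single out the entries of the heaviest rows.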

Nevertheless, under slightly stronger moment assumptions than in Theorem 1.1, zeroing out a few large entries does bring the norm of $A$ down. The following result can be quickly deduced by truncation from known bounds on random matrices such as .

We will deduce Proposition 1.4 from a general bound of A. Bandeira and R. van Handel in Section 10.

5. Other related results

Acknowledgement

Si Tang and Antonio Auffinger were first to note a version of Proposition 1.4, which they kindly showed to us along with a proof based on . Our correspondence led us to add Section 1.4. We are thankful to Ramon van Handel who showed us a simple argument that we use here to prove Lemma 3.1. We also would like to thank the referee for the valuable suggestions, which helped to improve the presentation of our paper.

The method

Our approach to Theorem 1.1 utilizes and advances the methods developed recently in and . We will first control the cut norm of $A$ and then pass to the operator norm using Grothendieck-Pietsch factorization. Let us describe these steps in more detail.

Rather than bounding the operator norm of a random matrix $A$ directly, we shall compare it with two simpler norms,

The simplest of the three is the $2\to\infty$ norm. A quick check reveals that it equals the maximum Euclidean norm of the rows $A_{i}^{\mathsf{T}}$ of $A$:

$$\|A\|_{2\to\infty}=\max_{\|x\|_{2}=1}\|Ax\|_{\infty}=\max_{i\in[n]}\|A_{i}\|_{2}. \qquad (2.1)$$

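The next simplest norm is $\infty\to 2$, which can be conveniently computed as

$$\|A\|_{\infty\to 2}=\max_{\|x\|_{\infty}\leq 1}\|Ax\|_{2}=\max_{x\in\{-1,1\}^{n}}\|Ax\|_{2}. \qquad (2.2)$$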

This norm is equivalent within a constant factor to the cut norm from the computer science literature, where the maximum is taken over $\{0,1\}^{n}$. The hardest of the three is the operator norm,

$$\|A\|=\max_{\|x\|_{2}=1}\|Ax\|_{2}. \qquad (2.3)$$

To see why the difficulty in bounding these norms rises this way, note that one has to control $n$ random variables in (2.1), $2^{n}$ random variables in (2.2), and infinitely many random variables in (2.3).
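To make the comparison concrete, here is a small sketch (assuming NumPy; the function names are hypothetical) that computes the three norms for a small matrix. The $\infty\to 2$ norm is found by brute force over all $2^{n}$ sign vectors, which is only feasible for small $n$ and mirrors the count of random variables mentioned above.

```python
import itertools
import numpy as np

def norm_2_to_inf(A):
    """Maximum Euclidean norm of a row of A."""
    return np.linalg.norm(A, axis=1).max()

def norm_inf_to_2(A):
    """Maximum of ||Ax||_2 over sign vectors x in {-1, 1}^n (brute force)."""
    n = A.shape[1]
    signs = np.array(list(itertools.product((-1.0, 1.0), repeat=n)))
    return np.linalg.norm(signs @ A.T, axis=1).max()

def operator_norm(A):
    """Largest singular value of A."""
    return np.linalg.norm(A, 2)

n = 8                                # small enough for 2^n sign vectors
rng = np.random.default_rng(0)       # arbitrary fixed seed
A = rng.standard_normal((n, n))

r_2inf = norm_2_to_inf(A)
r_inf2 = norm_inf_to_2(A)
r_op = operator_norm(A)
print(r_2inf, r_inf2, r_op)
```

For any matrix these quantities satisfy $\|A\|_{2\to\infty}\leq\|A\|$ and $\|A\|_{\infty\to 2}\leq\sqrt{n}\,\|A\|$, since every sign vector has Euclidean norm $\sqrt{n}$.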

2. Ideal relationships among the norms

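How large do we expect the three norms to be for random matrices? For a simple example, let us first consider a Gaussian random matrix $A$ with i.i.d. $N(0,1)$ entries. Then it is not difficult to check that, with high probability,

$$\|A\|_{2\to\infty}\lesssim\sqrt{n},\qquad\|A\|_{\infty\to 2}\lesssim n,\qquad\|A\|\lesssim\sqrt{n}. \qquad (2.4)$$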

Indeed, note that the rows of $A$ have Euclidean norms $\sqrt{n}$ on average, so the bound on the $2\to\infty$ norm follows by union bound and using Gaussian concentration. The bound on the $\infty\to 2$ norm follows from (2.2) by using Gaussian concentration for the normal random vector $Ax$ and taking the union bound over $\{-1,1\}^{n}$ . The bound on the operator norm is a non-asymptotic version of Bai-Yin’s law, see e.g. [24, Theorem 5.32].
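The Gaussian case is easy to check by simulation. The sketch below assumes NumPy; the size `n = 400`, the seed, and the number of random sign vectors are illustrative choices. Because the $\infty\to 2$ norm cannot be computed by brute force at this size, the code only brackets it: random sign vectors give a lower bound, and $\sqrt{n}\,\|A\|$ gives an upper bound.

```python
import numpy as np

n = 400                               # illustrative size
rng = np.random.default_rng(1)        # arbitrary fixed seed
G = rng.standard_normal((n, n))       # i.i.d. N(0, 1) entries

# 2 -> infinity norm (largest row norm), normalized by sqrt(n).
row_ratio = np.linalg.norm(G, axis=1).max() / np.sqrt(n)

# Operator norm, normalized by sqrt(n).
op_ratio = np.linalg.norm(G, 2) / np.sqrt(n)

# Lower bound on the infinity -> 2 norm from 200 random sign vectors,
# normalized by n. The matching upper bound is sqrt(n) * ||G|| / n = op_ratio.
X = rng.choice([-1.0, 1.0], size=(n, 200))
inf2_lower = np.linalg.norm(G @ X, axis=0).max() / n

print(f"max row norm / sqrt(n)  = {row_ratio:.3f}")
print(f"operator norm / sqrt(n) = {op_ratio:.3f}")
print(f"{inf2_lower:.3f} <= inf->2 norm / n <= {op_ratio:.3f}")
```

All three normalized quantities stay of constant order, in line with the scalings $\sqrt{n}$, $n$ and $\sqrt{n}$ discussed above.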

One might wonder if (2.4) holds not only in the Gaussian case but generally for random matrices $A$ with i.i.d. entries that have zero mean and unit variance. In particular, it would be wonderful if the three norms were always related to each other as follows:

This, however, would be too optimistic to expect, since the bound $\|A\|\lesssim\sqrt{n}$ cannot hold without higher moment assumptions, as we mentioned in Section 1.1. Nevertheless, we will obtain a version of (2.5) after removal of a small fraction of rows of $A$. With high probability, we will be able to find subsets of rows $J_{1}\subset J_{2}\subset J_{3}$ with cardinalities $|J_{i}|\leq\varepsilon n$ and such that

where the inequalities hide a factor that depends on $\varepsilon$ .

3. A roadmap of the proof

The first step in proving (2.6) is to find a small set $J_{1}$ with $|J_{1}|\lesssim\varepsilon n$ and such that

with high probability. In other words, we would like to bound all rows of $A$ simultaneously by $O(\sqrt{n})$ after removing a few columns of $A$ . To show this we first focus on one row, where we need to bound a sum of independent random variables (the squares of the row’s entries). In Theorem 4.2 we show how to bound sums of independent random variables almost surely by gently damping the summands. Damping, or reweighting down, is a softer operation than removing entries. It allows us to treat in Section 5 all columns simultaneously without much effort, thus proving (2.7). The argument in this step is similar to the approach proposed recently in . We somewhat simplify the method of and also improve the dependence between the number of removed columns and the resulting $2\to\infty$ norm; this will ultimately lead to the optimal dependence on $\varepsilon$ in Theorem 1.1.

At the next step, we extend $J_{1}$ to a bigger set of rows $J_{2}$ with $|J_{2}|\lesssim\varepsilon n$ and so that

Suppose for a moment that we are not concerned about removal of any columns. It is not too hard to show the general bound

for a random matrix $A$ with independent, mean zero entries; we prove this in Lemma 6.1. However, this bound is not very helpful in our situation. We need to work with the matrix $A_{J_{1}^{c}}$ instead of $A$ , which is not trivial: the removal of the columns in $J_{1}$ that we did in the first step made the entries of $A_{J_{1}^{c}}$ dependent. In Lemma 6.2, we first prove a variant of (2.9) for $A_{J_{1}^{c}}$ under an additional symmetry assumption on the distribution of the entries of $A$ . Then we manage to remove this assumption with a delicate symmetrization argument, which we develop in the rest of Section 6, with the final result being Theorem 6.6. The general idea of this step, as well as some of our arguments here, are inspired by . However we need to be considerably more careful than in to obtain (2.8) with a logarithmic dependence on $\varepsilon$ .

Next, we pass from $\infty\to 2$ norm to the operator norm in Section 7. This is done by using Grothendieck-Pietsch factorization (Theorem 7.1), a result that yields the first inequality in (2.6) for completely arbitrary, even non-random, matrices. This reasoning was recently used in a similar context in .

The argument we just described works under the additional assumption that the entries of $A$ be $O(\sqrt{n})$ almost surely. To be specific, such a boundedness assumption is needed to make the damping argument in Step 1 work with mild, logarithmic dependence on $\varepsilon$. The contribution of the entries that are larger than $\sqrt{n}$ is controlled in Section 8 by showing that there cannot be too many of them. The unit variance assumption implies that there are $O(1)$ such large entries per column on average. This does not mean, of course, that all columns will have $O(1)$ large entries with high probability; in fact there could be columns with $\sim\log n/\log\log n$ large entries. But we will check in Lemma 8.1 that the number of such heavy columns is small; removing them will lead to the desired bound $O(\sqrt{n})$ on the operator norm for the matrix with large entries. We develop this argument in Proposition 8.4 and Corollary 8.6, and derive the full strength of Theorem 1.1 in Section 8.4.

Theorem 1.3 is proved in Section 9. The paper is concluded with Section 11 where we discuss some further problems.

Acknowledgements

We are thankful to Ramon van Handel who showed us a simple argument that we use here to prove Lemma 3.1 and to Antonio Auffinger for the interesting discussion of $2+\varepsilon$ finite moment case.

Preliminaries

Throughout the paper, positive absolute constants are denoted $C,C_{1},c,c_{1}$, etc. Their values may be different from line to line. We often write $a\lesssim b$ to indicate that $a\leq Cb$ for some absolute constant $C$.

The discrete interval $\{1,2,\ldots,n\}$ is denoted by $[n]$ . If $\mathcal{R}$ is some subset of indices, $\mathcal{R}\subset[n]\times[n]$ , let us denote by $A_{\mathcal{R}}$ the matrix obtained from $A$ by replacing the indices in $\mathcal{R}$ by zero:

We will often consider subsets of columns of the matrix, so when $\mathcal{R}=J\times[n]$ we use a simplified notation: for $J\subset[n]$

where $A_{i}$ and $A^{j}$ denote the rows and columns of $A$ .

Recall that the operator norm can be computed as a maximum of the quadratic form:

Taking the maximum over all unit vectors $x$ and $y$ , we complete the proof. ∎

3. Concentration

In this paper we make use of good concentration properties of sums of sub-gaussian (and sub-exponential) random variables, that is, random variables whose moments grow no faster than those of a standard normal (respectively, exponential) random variable. Recall that by definition a random variable $Y$ is called sub-gaussian if its moments satisfy

for some number $M_{2}>0$. The minimal number $M_{2}$ is called the sub-gaussian moment of $Y$, denoted as $\|Y\|_{\psi_{2}}$. Analogously, a random variable is called sub-exponential if

for some number $M_{1}>0$ . The minimal number $M_{1}$ is called the sub-exponential moment of $Y$ , denoted as $\|Y\|_{\psi_{1}}$ .

The class of sub-gaussian random variables contains standard normal, Bernoulli, and generally all bounded random variables. The class of sub-exponential random variables is exactly the class of squares of sub-gaussians. See for more information and statements of standard concentration inequalities.
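As an illustration of the moment growth behind these definitions, the sketch below evaluates the exact absolute moments of a standard normal and of a standard exponential random variable. It assumes the common normalization in which $(\mathbb{E}|Y|^{p})^{1/p}$ is compared with $\sqrt{p}$ in the sub-gaussian case and with $p$ in the sub-exponential case; this normalization, the restriction to integer $p\leq 100$, and the function names are assumptions made for this example.

```python
import math

def log_abs_moment_normal(p):
    """log E|Y|^p for Y ~ N(0,1), using E|Y|^p = 2^(p/2) * Gamma((p+1)/2) / sqrt(pi)."""
    return 0.5 * p * math.log(2) + math.lgamma((p + 1) / 2) - 0.5 * math.log(math.pi)

def log_moment_exponential(p):
    """log E Y^p for Y ~ Exp(1), using E Y^p = Gamma(p + 1)."""
    return math.lgamma(p + 1)

ps = range(1, 101)  # integer moments only, for illustration

# Largest value of (E|Y|^p)^(1/p) / sqrt(p) for the standard normal.
psi2_normal = max(math.exp(log_abs_moment_normal(p) / p) / math.sqrt(p) for p in ps)

# Largest value of (E Y^p)^(1/p) / p for the standard exponential.
psi1_exponential = max(math.exp(log_moment_exponential(p) / p) / p for p in ps)

print(psi2_normal, psi1_exponential)
```

Both ratios stay bounded as $p$ grows, which is the content of the two definitions for these model distributions.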

Also we will need a concentration inequality for random permutations from .

The same inequality holds for the sum $S^{\prime}=\sum_{i=1}^{n}a_{\pi(i)}x_{i}$ as well, since it has the same distribution as $S$ .

4. Discretization

The following lemma allows us to approximate a general continuous random variable by a sum of independent, scaled Bernoulli random variables. This lemma was originally proved in . Here we give a proof for completeness, and then discuss some particular cases needed for the proof of Theorem 1.1.

Consider a non-negative, continuous random variable $X$ . There exists a non-negative random variable $X^{\prime}$ satisfying the following.

$X^{\prime}$ stochastically dominates $X$ , i.e.

$X^{\prime}$ is a sum of scaled, independent Bernoulli random variables:

where $q_{k}$ are non-negative numbers and $\xi_{k}$ are independent $\operatorname{Ber}(2^{-k})$ random variables.

Set the values $q_{k}$ to be the quantiles of the distribution of $X$ :

(These values are well defined since the cumulative distribution function of $X$ is continuous by assumption.) By definition, $(q_{k})$ is an increasing sequence. Define $X^{\prime}$ by (3.1).

To check part 1, note that by definition,

almost surely. Taking expectation of both sides, we obtain

Now, using the definition of $q_{k}$ , we have

Let us prove part 2. If $t\in[q_{k},q_{k+1})$ for some $k=0,1,2,\ldots$, then using the definitions of $X^{\prime}$ and $q_{k}$ we obtain

and the inequality in part 2 follows. If $t\geq q_{\infty}$ then, using the continuity of the cumulative distribution of $X$ , we obtain

and the inequality in part 2 follows again. The proof is complete. ∎
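The construction can be tried out on a concrete distribution. The sketch below is an illustration only: it assumes NumPy, takes $X$ to be a standard exponential random variable, truncates the sum at 30 levels, and uses the indexing convention $\mathbb{P}\{X>q_{k}\}=2^{-(k+1)}$ with $\xi_{k}\sim\operatorname{Ber}(2^{-k})$ for $k=0,1,2,\ldots$. This indexing is an assumption of the example and may differ from the exact convention of the lemma.

```python
import numpy as np

def dominating_bernoulli_sum(quantile, levels, size, rng):
    """Sample X' = sum_k q_k * xi_k with independent xi_k ~ Ber(2^{-k}).

    Assumed convention: q_k satisfies P(X > q_k) = 2^{-(k+1)}, k = 0..levels-1.
    """
    k = np.arange(levels, dtype=float)
    q = quantile(1.0 - 2.0 ** -(k + 1.0))          # quantiles of X
    xi = rng.random((size, levels)) < 2.0 ** -k    # independent Bernoulli draws
    return xi @ q

def exp_quantile(u):
    """Quantile function of the standard exponential distribution."""
    return -np.log1p(-u)

rng = np.random.default_rng(2)                     # arbitrary fixed seed
X_prime = dominating_bernoulli_sum(exp_quantile, levels=30, size=100_000, rng=rng)

# Compare the empirical tail of X' with the exact tail exp(-t) of X.
t = np.linspace(0.0, 10.0, 101)
tail_X_prime = (X_prime[:, None] >= t[None, :]).mean(axis=0)
tail_X = np.exp(-t)

print("smallest gap P(X' >= t) - P(X >= t):", (tail_X_prime - tail_X).min())
print("E X' (empirical):", X_prime.mean(), " E X = 1")
```

Under this convention the expectation of $X^{\prime}$ equals $4\ln 2$ up to truncation, so the domination costs only a constant factor in the mean.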

Suppose $X\leq M$ almost surely. Then, in the second part of the conclusion of Lemma 3.3, $X^{\prime}$ can be represented as a finite sum

where $q_{k}$ are non-negative numbers, $q_{k}\in[0,M]$ , and $\xi_{k}$ are independent $\operatorname{Ber}(p_{k})$ random variables. Here $p_{k}=2^{-k}\geq 1/M$ for $k<\kappa$ and $p_{\kappa}=1/M$ .

Stochastic dominance of $X^{\prime}$ over $X$ in Lemma 3.3 implies that one can realize the random variables $X$ and $X^{\prime}$ on the same probability space so that

Moreover, in the same way we can construct a majorizing collection for any collection of independent random variables. In particular, we can do it for all entries of the matrix $A$ at once.

Damping a sum of independent random variables

Here we will be interested in a stronger result – that the sum be $O(n)$ almost surely instead of in expectation. To do this, we will be looking for random weights

To make the damping as gentle as possible, we are looking for the largest possible weights $W_{i}$, hopefully very close to $1$.

To get started, let us consider the simple case where $n=1$ and try to damp one random variable.

Let $\varepsilon\in(0,1)$. There exists a random variable $W$ taking values in $[0,1]$ and such that

Fix a level $L\geq 1$ whose value we will choose later, and define

Next, the lower bound in (4.2) holds trivially since $W\leq 1$ . For the upper bound, we have

2. Damping a sum of random variables

Now let us address the damping problem for general number $n$ of random variables, which we described in the beginning of this section. Applying Lemma 4.1 for each random variable $X_{i}$ , we get weights $W_{i}$ such that

for small $\varepsilon$ . We will now considerably improve both these bounds, making only one mild extra assumption that $X_{i}=O(n)$ almost surely.

Let $X_{1},\ldots,X_{n}$ be i.i.d. random variables such that

for some $K\geq 1$. Let $\varepsilon\in(0,1/2)$. There exist random variables $W_{1},\ldots,W_{n}$ taking values in $[0,1]$ and such that

Improvement in the order of $n$ in (4.4) does not require an extra boundedness assumption, and it was done in previous work. We employ the same ideas as in [19, Lemma 3.3] and obtain a better (logarithmic) dependence on $\varepsilon$ in (4.3) at the cost of the additional boundedness assumption.

Step 1: Bernoulli distribution. Let us first prove the theorem in the special case where $X_{j}$ are scaled Bernoulli random variables. Assume that $X_{j}$ can take values $q$ and $0$, and

Let $\nu$ denote the (random) number of nonzero $X_{j}$ ’s:

Here is how we will define the weights $W_{j}$ . If $X_{j}=0$ then clearly there is no need to damp $X_{j}$ so put $W_{j}=1$ . The same applies if the number $\nu$ of non-zero $X_{j}$ ’s does not significantly exceed its expectation $pn$ . Otherwise we damp all terms by the same amount $W_{j}\sim pn/\nu$ . Formally, we fix some parameter $L=L(K,\varepsilon)$ whose value we will determine later, and set

Let us check (4.3). In the event when $\nu\leq Lpn$ , we have

Let us now check (4.4). Since the lower bound is trivial, we will only have to check the upper bound. We will again split the calculation into two cases based on the size of $\nu$ . If $\nu\leq Lpn$ then all $W_{j}=1$ , so we trivially get

If $\nu>Lpn$ , then the definition of $W_{j}$ gives

Since $\nu\sim\operatorname{Binom}(n,p)$ , we have

using a standard consequence of Stirling’s approximation. Thus

provided that $L\geq 10$ . Thus we showed that

where in the last step we used the assumption that $p\geq 1/Kn$ that we made in (4.5).

Now that we have the bounds (4.6) and (4.7), it is enough to choose

which implies that $E\leq 1+\varepsilon$ . The proof for the Bernoulli distribution is complete.
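The damping rule of Step 1 can be sketched in a few lines. The code below is an illustration under stated assumptions: it uses NumPy, and it takes the damped weight to be exactly $Lpn/\nu$ on the non-zero terms when $\nu>Lpn$, which is one natural reading of $W_{j}\sim pn/\nu$; the function name and the numerical values of $q$, $p$, $L$ are hypothetical.

```python
import numpy as np

def damping_weights(X, p, L):
    """Weights for scaled Bernoulli variables X_j in {0, q}.

    No damping if X_j = 0 or if the number nu of non-zero terms is at most
    L*p*n; otherwise every non-zero term gets the weight L*p*n / nu
    (assumed form of the weights).
    """
    n = X.size
    nu = np.count_nonzero(X)
    W = np.ones(n)
    if nu > L * p * n:
        W[X != 0] = L * p * n / nu
    return W

# A deliberately heavy sample: 40 non-zero terms where about p*n = 10 are expected.
q, p, L, n = 2.0, 0.1, 2.0, 100
X = np.zeros(n)
X[:40] = q

W = damping_weights(X, p, L)
damped_sum = float(np.sum(W * X))
print("undamped sum:", X.sum(), " damped sum:", damped_sum, " bound L*p*n*q:", L * p * n * q)
```

With this choice the damped sum equals $q\min(\nu,Lpn)$, so it never exceeds $Lpnq$, and the weights stay equal to $1$ whenever $\nu\leq Lpn$.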

Here $X_{jk}$ are independent random variables; each $X_{jk}$ can take values $q_{k}$ and $0$, and

The argument will be similar to step 1 of the proof. For each level $k$ we let $\nu_{k}$ denote the number of non-zero $X_{jk}$’s:

Again, for each level $k$ define the weights $W_{jk}$ like in step 1:

since $W_{j}\leq W_{jk}$ by construction. Now, for each level $k$ , we can use step 1 of the proof, where we showed in (4.6) that

Let us now check (4.4). The lower bound is trivial, and we will only have to check the upper bound. For each level $k$ , we can use step 1 of the proof, where we showed in (4.7) that

which is true as long as $L\geq 10$ . Then, by construction we have

where in the last step we used the inequality $1+x\leq e^{x}$ . Recall from (4.8) that the exponents $p_{k}$ form a decreasing geometric progression with values $2^{-k}$ until the last (smallest) term of order $1/Kn$ . So this last term dominates the sum $\sum_{k=1}^{\kappa}e^{-Lp_{k}n}$ , and we obtain

Now that we have the bounds (4.10) and (4.11), it is enough to choose

with $C_{4.2}\geq 6K$ and the right hand side of (4.11) will be bounded by

as claimed. The proof of the theorem is complete. ∎

The $2\to\infty$ norm of random matrices

In this section we prove Theorem 1.1 under the additional assumption that all entries $A_{ij}$ of $A$ are not too large. Specifically, let us assume that

Consider an $n\times n$ random matrix $A$ with i.i.d. entries $A_{ij}$ which have mean zero and at most unit variance and satisfy (5.1). Let $\varepsilon\in(0,1/2]$. Then with probability at least $1-\exp(-\varepsilon n)$, there exists a subset $J\subset[n]$ with cardinality $|J|\leq\varepsilon n$ such that

We apply Theorem 4.2 for the squares of the elements in each row of $A$, i.e. for the random variables $(a_{i1}^{2},\ldots,a_{in}^{2})$. This gives us random weights $W_{ij}\in[0,1]$ which satisfy for each $i\in[n]$ that

To make the same system of weights work for all rows, we define

Then obviously $V_{j}\leq W_{ij}$ for every $i$ , and so

We will remove from $A$ the columns whose weights $V_{j}$ are too small, namely those in

as we claimed in the lemma. Indeed, if $|J|>\varepsilon n$ then using that all $V_{j}\in[0,1]$ we have

But the probability of this event can be bounded by Markov’s inequality:

where in the last bound we used (5.2). This proves (5.3).

It remains to check that all rows $B_{i}$ of the matrix $B=A_{[n]\times J_{0}^{c}}$ are bounded as claimed. We have

Taking the square root of both sides completes the proof. ∎
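The passage from row-wise weights to one set of removed columns can be illustrated as follows. This is a sketch only: it assumes NumPy, takes $V_{j}=\min_{i}W_{ij}$ as in the proof, and removes the columns with $V_{j}<1/2$. The threshold $1/2$, the function name, and the randomly generated weights are assumptions for the example; the weights here are not the ones produced by Theorem 4.2.

```python
import numpy as np

def columns_to_remove(W, threshold=0.5):
    """Common column weights V_j = min_i W_ij and the columns with V_j < threshold."""
    V = W.min(axis=0)
    J = np.flatnonzero(V < threshold)
    return V, J

n = 50
rng = np.random.default_rng(3)          # arbitrary fixed seed
A = rng.standard_normal((n, n))
W = rng.random((n, n)) ** 0.1           # placeholder weights in [0, 1], mostly close to 1

V, J = columns_to_remove(W)
keep = np.setdiff1d(np.arange(n), J)

# Squared row norms after removing the columns in J, versus the damped sums.
kept_row_sums = (A[:, keep] ** 2).sum(axis=1)
damped_row_sums = (W * A ** 2).sum(axis=1)

print("removed columns:", J.size)
print("largest ratio kept / damped:", (kept_row_sums / damped_row_sums).max())
```

Every kept column has $V_{j}\geq 1/2$, so each row of the reduced matrix has squared norm at most twice the corresponding damped sum, and the number of removed columns is at most $2\sum_{j}(1-V_{j})$.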

From $2\to\infty$ norm to $\infty\to 2$ norm

In this section we will control the $\infty\to 2$ norm of a random matrix. Our first task is to bound the $\infty\to 2$ norm by the simpler $2\to\infty$ norm. There are two ways to do this, both of them going back to . The resulting comparison inequalities are interesting in their own right; we state them in Lemmas 6.1 and 6.3. The ultimate result of this section is Theorem 6.6, which gives an optimal bound $O(n)$ on the $\infty\to 2$ norm of a random matrix after removing a small fraction of columns.

The first method is based on flipping the signs of the entries independently at random. Here is the main result of this section.

Let $A$ be an $n\times n$ random matrix whose entries are independent, mean zero random variables. Then

Let $\varepsilon_{ij}$ be independent Rademacher random variables (which are also independent of $A$ ) and consider the random matrix

A basic symmetrization inequality (see [15, Lemma 6.3]) yields

Condition on $A$ ; the randomness now rests in the random signs $(\varepsilon_{ij})$ only. It suffices to show that the conditional expectation satisfies

Fix $x\in\{-1,1\}^{n}$ . Using independence and (2.1), we get

Moreover, the standard concentration results ([24, Lemma 5.9]) show that each $\xi_{i}$ is a sub-gaussian random variable, and we have

Thus $\xi_{i}^{2}$ is a sub-exponential random variable (see [24, Lemma 5.9]) and

Applying Bernstein’s concentration inequality [24, Corollary 5.17] together with (6.3) and (6.4), we obtain

where $t\geq 1$ is arbitrary. Integration of these tails implies (6.1). ∎

We will need a minor variation of Lemma 6.1 that can be applied even when some of the columns of $A$ are removed.

Let $A$ be an $n\times n$ random matrix whose entries are independent, symmetric random variables. Let $J\subset[n]$ be a random subset, which is independent of the signs of the entries of $A$ . Then

So, the only part of Lemma 6.1 that does not work for a matrix with removed columns is the symmetrization part. In the following two sections we will develop the tools to overcome the extra symmetry assumption we have to add in Lemma 6.2.

2. Using random permutations

We just showed how to convert an $\infty\to 2$ bound to a $2\to\infty$ bound for random matrices by using random signs. Alternatively, one can use random permutations for the same purpose, and obtain the following bound.

Let $A$ be an $n\times n$ random matrix with i.i.d. entries. Then

where ${\textbf{1}}=(1,1,\ldots,1)$ denotes the vector whose all coordinates equal $1$ .

The concentration inequality for random permutations (Lemma 3.2) states that each $\xi_{i}$ is a sub-gaussian random variable, and we have

Just like in the proof of Lemma 6.1, this implies that

Since the expectation is bounded by the $\psi_{1}$ norm (see e.g. [24, Definition 5.13]), we conclude that

Applying Bernstein’s inequality like in Lemma 6.1, we find that

for all $t\geq 0$ . Thus, for any $t\geq 1$ we have with probability at least $1-\exp(-tn)$ that

We have already bounded the first sum. As for the second one, the definition of $\xi$ in (6.6) yields

where $m$ denotes the number of ones in $x_{j}$ and $A_{i}^{\mathsf{T}}$ is the $i$ -th row of $A$ . Thus

We substitute this and (6.7) into (6.8) and obtain that for any $t\geq 1$ ,

It remains to recall (6.2) and take a union bound over $x\in\{-1,1\}^{n}$ . It follows that the inequality

where $t\geq 1$ is arbitrary. Integration of these tails implies (6.5). ∎

It is worthwhile to mention a high-probability version of Lemma 6.3.

Let $A$ be an $n\times n$ random matrix with i.i.d. entries. Then with probability at least $1-e^{-n}$ we have

where ${\textbf{1}}=(1,1,\ldots,1)$ denotes the vector whose all coordinates equal $1$ .

At the end of the proof of Lemma 6.3, we obtained inequality (6.9) which states (for large constant $t$ ) that

with probability at least $1-e^{-n}$ . Note that

deterministically. Indeed, it is easy to check that permutations of the elements of the rows of $A$ do not affect these two quantities. It follows that

3. Bounding $2\to\infty$ and $\infty\to 2$ norms with tiny probability

Recall from Section 2.2 that ideally, we would want

with high probability. But this is too good to be true in our situation, where we assume only two moments for the entries of $A$ . Nevertheless, we will now show that these bounds still hold, albeit with exponentially small probability.

Let $A$ be an $n\times n$ random matrix whose entries are i.i.d. random variables with mean zero and at most unit variance. Let $\delta\in(0,1/2)$ . Then

with probability at least $\frac{1}{2}\exp(-\delta^{2}n)$ .

We will first bound below the probability of the event

and then use Lemma 6.4 to control $\|A\|_{\infty\to 2}$ .

where $A_{i}^{\mathsf{T}}$ denote the rows of $A$ . Thus $\mathcal{E}\subset\bigcap_{i=1}^{n}\mathcal{E}_{i}$ where

are independent events. This reduces the problem to bounding the probability of each event $\mathcal{E}_{i}$ below.

The assumptions on the entries of $A$ imply that

Using Chebyshev’s inequality, we see that

By independence of the events $\mathcal{E}_{i}$ , this implies

Next we apply Lemma 6.4, which states that the event

It remains to note that by definition of $\mathcal{E}$ and $\mathcal{F}$ , the event $\mathcal{E}\cap\mathcal{F}$ implies the inequalities in (6.10). ∎

4. Bounding $\infty\to 2$ norm with high probability

In the previous section, we were able to prove the optimal bounds

for a random matrix $A$ , but they only hold with exponentially small probability. We claim that the probability of success can be increased to almost $1$ if we are allowed to remove a few columns of $A$ . We already proved this fact for the $2\to\infty$ norm in Lemma 5.1. It is time to handle the $\infty\to 2$ norm.

Consider an $n\times n$ random matrix $A$ with i.i.d. entries $A_{ij}$ which have mean zero and at most unit variance and satisfy (5.1). Let $\varepsilon\in(0,1/2]$. Then with probability at least $1-2\exp(-\varepsilon n)$, there exists a subset $J\subset[n]$ with cardinality $|J|\leq\varepsilon n$ such that

Step 1: Defining the two key events. We will be interested in the two key events that suitably control the $2\to\infty$ and $\infty\to 2$ norms of a random matrix. Thus, for a random matrix $B$ and numbers $r,K\geq 0$ , we define

In terms of these events, we want to show that

for some absolute constant $C^{\prime}$ . Since the latter event is so likely, intersecting with it would not cause much harm. Indeed, we will show that the bad event

This would finish the proof, since we would then have

Step 2: Symmetrization. As an intermediate step, let us bound the probability of a symmetrized version of $\mathcal{B}$ , namely the event

and $A^{\prime}$ is an independent copy of the random matrix $A$ . We claim that

Step 3. Using the small-probability bounds. The last piece of information we will use is the conclusion of Lemma 6.5 for $\delta:=1/(2\ln\varepsilon^{-1})$ . It states that the good event

Note in passing that there is no guarantee that this statement would hold for the same constants $C$ and $C^{\prime}$ as we chose in the definition of $\mathcal{B}$ above. However, we can make this happen by adjusting these constants upwards as necessary. The reader can easily check both (6.12) and (6.14) would still hold after such an adjustment.

The event $\mathcal{B}$ is determined by $A$ , and $\mathcal{G}$ is determined by $A^{\prime}$ only. Thus $\mathcal{B}$ and $\mathcal{G}$ are independent, and (6.15) gives

Thus, using (6.12) and (6.14), we conclude that

We have shown (6.11) and thus have completed the proof of the theorem. ∎

From $\infty\to 2$ norm to the operator norm: controlling the bounded entries

In Theorem 6.6, we gave an optimal $O(n)$ bound for the $\infty\to 2$ norm of a random matrix with few removed columns. We will now convert this into an optimal $O(\sqrt{n})$ bound for the operator norm. This can be done by applying a form of Grothendieck-Pietsch theorem (see [15, Proposition 15.11]), which has been used recently in [14, section 3.2] in a similar context.

Let $B$ be a $k\times m$ real matrix and $\delta>0$ . Then there exists $J\subset[m]$ with $|J|\leq\delta m$ such that

Applying Theorem 6.6 followed by Grothendieck-Pietsch theorem, we obtain the following result.

Consider an $n\times n$ random matrix $A$ with i.i.d. entries $A_{ij}$ which have mean zero and at most unit variance and satisfy (5.1). Let $\varepsilon\in(0,1]$. Then with probability at least $1-2\exp(-\varepsilon n/2)$, there exists a subset $J\subset[n]$ with cardinality $|J|\leq\varepsilon n$ such that

Apply Theorem 6.6 for $\varepsilon/2$ instead of $\varepsilon$. We obtain a subset of columns $J_{1}\subset[n]$, $|J_{1}|\leq\varepsilon n/2$, which satisfies

with probability at least $1-2\exp(-\varepsilon n/2)$ .

Next apply Grothendieck-Pietsch Theorem 7.1 for the matrix $A_{J_{1}^{c}}$ and for $\delta=\varepsilon/2$. We obtain a further subset $J_{2}\subset J_{1}^{c}$, $|J_{2}|\leq\delta|J_{1}^{c}|\leq\varepsilon n/2$, such that the removal of the columns in $J:=J_{1}\cup J_{2}$ leads to

In the last inequality, we used the bound (7.1) and that $\delta=\varepsilon/2$ and $|J_{1}^{c}|\geq n-\varepsilon n/2\geq n/2$ . The proof is complete. ∎

We are ready to prove a special case of Theorem 1.1, for the matrices whose entries are $O(\sqrt{n})$. It follows by applying Lemma 7.2 for $A$ and $A^{\mathsf{T}}$ separately, and then superposing the results.

Apply Lemma 7.2 for $A$ and $A^{\mathsf{T}}$. We obtain that with probability at least $1-4\exp(-\varepsilon n/2)$, there exist sets $I$ and $J$ with at most $\varepsilon n$ indices in each, and such that

We already controlled the first term in (7.2). As for the second term, since adding columns can only increase the operator norm, we have $\|A_{I^{c}\times J}\|\leq\|A_{I^{c}\times[n]}\|$ , which we also bounded in (7.2). The proof is complete. ∎

Controlling the large entries, and completing the proof of Theorem 1.1

In the previous section, we proved a special case of Theorem 1.1 that controls relatively small entries of $A$, those of the order $O(\sqrt{n})$. Larger entries will be controlled in this section.

The following general lemma will help us analyze the patterns such large entries can form.

Let $B$ be an $n\times n$ random matrix whose entries are independent Bernoulli random variables with mean $p$. Let $\varepsilon\in(0,1/2]$. Consider the rows of $B$ with more than $21pn+2\ln\varepsilon^{-1}$ ones. Then with probability at least $1-\exp(-\varepsilon n/2)$, these rows have at most $\varepsilon n$ ones altogether.

To see the connection to our original problem, we will later choose the entries of $B$ to be the indicators of the large entries of $A$ .

Let $K\geq 21pn$ be a number to be chosen later. (We will eventually choose $K$ as $21pn+2\ln\varepsilon^{-1}$ as in the statement of the lemma.) Define the random variables

The quantity of interest is the total number of ones in the heavy rows, and it equals $\sum_{i=1}^{n}X_{i}$ . To control this sum of independent random variables, we can use the standard Bernstein’s trick (commonly called Chernoff’s bound), where we use Markov’s inequality after exponentiation. We obtain

where the last equality follows by independence and identical distribution. Now, by definition of $X_{1}$ we have

Substituting this bound into (8.2), we conclude that

if we choose $K$ so that $e^{-K}\leq\varepsilon/2$ . To finish the proof, recall that our argument works if $K$ satisfies the two conditions: $K\geq 21pn$ and $e^{-K}\leq\varepsilon/2$ . We thus choose $K:=21pn+2\ln\varepsilon^{-1}$ and complete the proof. ∎
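The quantity controlled by the lemma is easy to compute for a sampled matrix. The sketch below assumes NumPy; the helper name `ones_in_heavy_rows` and the parameters $n=1000$, $p=2/n$, $\varepsilon=0.1$ are illustrative choices. With the constants of the lemma, heavy rows are very rare at this size, so the count is typically zero.

```python
import numpy as np

def ones_in_heavy_rows(B, p, eps):
    """Total number of ones in the rows of B that contain more than
    21*p*n + 2*ln(1/eps) ones."""
    n = B.shape[0]
    threshold = 21 * p * n + 2 * np.log(1 / eps)
    row_sums = B.sum(axis=1)
    return int(row_sums[row_sums > threshold].sum())

n, eps = 1000, 0.1                      # illustrative parameters
p = 2 / n
rng = np.random.default_rng(5)          # arbitrary fixed seed
B = (rng.random((n, n)) < p).astype(int)

heavy_total = ones_in_heavy_rows(B, p, eps)
print("ones in heavy rows:", heavy_total, " allowed:", eps * n)

# A hand-made matrix with one heavy row, to see the count at work.
B_toy = np.zeros((100, 100), dtype=int)
B_toy[0, :30] = 1                       # 30 ones, above the threshold 21 + 2*ln(2)
B_toy[1:, 0] = 1                        # every other row has a single one
toy_total = ones_in_heavy_rows(B_toy, p=0.01, eps=0.5)
print("toy example:", toy_total)
```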

Apply Lemma 8.1 for $B$ and $B^{\mathsf{T}}$ with $\varepsilon/2$ instead of $\varepsilon$ , and take the intersection of the two good events. With the required probability, we obtain a set of $\varepsilon n$ bad entries of $B$ whose removal makes all rows and columns of $B$ contain at most $21pn+2\ln\varepsilon^{-1}$ ones. It remains to note that these $\varepsilon n$ entries can be trivially placed in some $\varepsilon n\times\varepsilon n$ submatrix of $B$ , and deletion of the whole $\varepsilon n\times\varepsilon n$ submatrix can only decrease the number of non-zero elements in the rows and columns of the residual part. ∎

It is not difficult to obtain a version of Corollary 8.2 for symmetric random matrices. This version can be interpreted as a statement about Erdős-Rényi random graphs $G(n,p)$ , with $B$ playing the role of the adjacency matrix. It states that with high probability, one can make all degrees of a $G(n,p)$ random graph bounded by $O(pn+\ln\varepsilon^{-1})$ after removing the internal edges from a subgraph with $\varepsilon n$ vertices.

2. Moderately large entries

We will use Corollary 8.2 to deduce Theorem 1.1 for matrices with moderately large entries. Namely, we assume here that all entries of $A$ satisfy

Consider the matrix $B$ whose elements are indicators of moderately large entries of $A$ , i.e.

Then $B_{ij}$ are i.i.d. Bernoulli random variables with mean

3. Very large entries

Finally, we will need to prove Theorem 1.1 for very large entries – now we assume that all entries of $A$ satisfy

There are typically very few such entries, as the following simple result shows.

Thus the expected number of non-zero entries in $A$ is at most $\varepsilon n/25$ . A standard application of Chernoff’s inequality (see e.g. [25, Chapter 2]) gives

Since a set of $\varepsilon n$ indices can be always placed in an $\varepsilon n\times\varepsilon n$ submatrix, we can state Lemma 8.5 as follows.

4. Proof of Theorem 1.1

We are going to assemble Proposition 7.3 for the bounded entries of $A$ , Proposition 8.4 for moderately large entries, and Corollary 8.6 for very large entries. The $\varepsilon n\times\varepsilon n$ sub-matrices that appear in these results are possibly different. The following simple lemma will help us to combine them into one.

Let $B$ be a matrix. Zeroing out any submatrix of $B$ cannot increase the operator norm by more than a factor of two.

The last inequality follows from the fact that zeroing out any subset of rows or columns cannot increase the operator norm. ∎

One may wonder if zeroing out a submatrix can increase the norm at all. Basic simulations show that it actually can.
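A small deterministic instance already exhibits this. The matrix with rows $(1,1)$ and $(1,-1)$ has operator norm $\sqrt{2}$ , while zeroing out its bottom-right $1\times 1$ submatrix gives a matrix of norm $(1+\sqrt{5})/2\approx 1.618$ . The Python sketch below checks this numerically; the helpers `operator_norm` (plain power iteration on $B^{\mathsf{T}}B$ ) and `zero_out_submatrix` are illustrative names, not taken from the paper.

```python
import math

def operator_norm(B, iters=200):
    """Estimate the operator (spectral) norm of B by power iteration on B^T B."""
    rows, cols = len(B), len(B[0])
    v = [1.0 / math.sqrt(cols)] * cols          # unit starting vector
    norm = 0.0
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # w = B v
        norm = math.sqrt(sum(x * x for x in w))                              # ||B v||
        if norm == 0.0:
            return 0.0
        u = [sum(B[i][j] * w[i] for i in range(rows)) for j in range(cols)]  # u = B^T B v
        size = math.sqrt(sum(x * x for x in u))
        v = [x / size for x in u]
    return norm

def zero_out_submatrix(B, row_set, col_set):
    """Copy of B with the entries in row_set x col_set replaced by zero."""
    return [[0.0 if (i in row_set and j in col_set) else B[i][j]
             for j in range(len(B[0]))] for i in range(len(B))]

B = [[1.0, 1.0], [1.0, -1.0]]
before = operator_norm(B)
after = operator_norm(zero_out_submatrix(B, {1}, {1}))
print(before, after)   # the norm grows, but by less than a factor of two
```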

Decompose $A$ into a sum of three $n\times n$ matrices with disjoint support, $A=B+M+L$ ,

where $B$ contains bounded entries of $A$ – those that satisfy $|A_{ij}|\leq\sqrt{n}/2$ , the matrix $M$ contains moderately large entries – those for which $\sqrt{n}/2<|A_{ij}|\leq 5\sqrt{n/\varepsilon}$ , and $L$ contains large entries – those satisfying $|A_{ij}|>5\sqrt{n/\varepsilon}$ .
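The decomposition by magnitude can be sketched in a few lines of Python. The function name `split_by_magnitude` and the small test matrix are hypothetical; the two thresholds $\sqrt{n}/2$ and $5\sqrt{n/\varepsilon}$ are the ones stated above.

```python
import math

def split_by_magnitude(A, eps):
    """Split a square matrix A into three matrices with disjoint supports:
    bounded entries (|a| <= sqrt(n)/2), moderately large entries
    (sqrt(n)/2 < |a| <= 5*sqrt(n/eps)) and very large entries (|a| > 5*sqrt(n/eps))."""
    n = len(A)
    low = math.sqrt(n) / 2
    high = 5 * math.sqrt(n / eps)
    bounded = [[a if abs(a) <= low else 0.0 for a in row] for row in A]
    moderate = [[a if low < abs(a) <= high else 0.0 for a in row] for row in A]
    large = [[a if abs(a) > high else 0.0 for a in row] for row in A]
    return bounded, moderate, large

# Toy 4 x 4 example: with eps = 0.25 the thresholds are 1 and 20.
A_demo = [[0.5, -1.0, 3.0, 25.0],
          [0.0, -20.0, 20.5, 1.5],
          [0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
B_part, M_part, L_part = split_by_magnitude(A_demo, eps=0.25)
print(B_part[0], M_part[0], L_part[0])
```

Every entry of the input lands in exactly one of the three parts, so the three matrices sum back to the original.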

To bound $B$ , let us subtract the mean and first bound

The entries of this matrix have zero mean and satisfy

(where we used the moment assumption) and

This proves the conclusion of Theorem 1.1 with $3\varepsilon$ instead of $\varepsilon$ , where $\varepsilon\in(0,1/2]$ is arbitrary. By rescaling, Theorem 1.1 holds also as originally stated. This concludes the proof of Theorem 1.1. ∎

Global problem: proof of Theorem 1.3

In this section we prove Theorem 1.3, which states that either a nonzero mean or an infinite second moment makes it impossible to repair the matrix norm by removing a small submatrix. We will first prove a non-asymptotic version of this result. Once this is done, an application of the Borel-Cantelli lemma will quickly yield Theorem 1.3.

Consider an $n\times n$ random matrix $A$ whose entries are i.i.d. random variables that have either nonzero mean or infinite second moment, and let $\varepsilon\in(0,1)$ . Then, for any $M>0$ there exists $n_{0}$ that may depend only on $\varepsilon$ , $M$ and the distribution of the entries, and such that for any $n>n_{0}$ the following event holds with probability at least $1-e^{-n}$ : every $(1-\varepsilon)n\times(1-\varepsilon)n$ submatrix $A^{\prime}$ of $A$ satisfies

Indeed, modifying an $\varepsilon n\times\varepsilon n$ submatrix always leaves some $(1-\varepsilon)n\times(1-\varepsilon)n$ submatrix $A^{\prime}$ intact, so we can apply Proposition 9.1 for that submatrix.

Here we will prove the part of Proposition 9.1 about infinite second moment; the case of nonzero mean will be treated in Section 9.2. Let us start with the following lemma which will help us treat a fixed submatrix.

Consider an $m\times m$ random matrix $B$ whose entries are i.i.d. random variables with infinite second moment. Then, for any $M>0$ there exists $m_{0}$ that may depend only on $M$ and the distribution of the entries, and such that for any $m>m_{0}$ we have

with probability at least $1-\exp(-M^{2}m)$ .

(This follows easily from Lebesgue’s monotone convergence theorem.)

Consider the matrix $\bar{B}$ with entries $\bar{B}_{ij}$ . We have

Then we bound the failure probability as follows:

Apply Hoeffding’s inequality for the random variables $\bar{B}_{ij}^{2}$ and use that they are bounded by $K^{2}$ by construction. The probability above gets bounded by

If $m>2K^{2}/M^{2}=m_{0}$ , this probability can be further bounded by $\exp(-M^{2}m)$ , as claimed. ∎

We can assume without loss of generality that $M$ is large enough depending on $\varepsilon$ . (Indeed, once the conclusion of the proposition holds for one value of $M$ it automatically holds for all smaller values.)

Apply Lemma 9.2 for an $m\times m$ matrix $A^{\prime}_{n}$ with $m=(1-\varepsilon)n$ , and then take a union bound over all $\binom{n}{m}^{2}$ possible choices of such submatrices. It follows that the conclusion of Proposition 9.1 holds with probability at least

By Stirling’s approximation, we have $\binom{n}{m}\leq(en/m)^{m}$ . Using this and substituting $m=(1-\varepsilon)n$ , we bound the probability below by

If the value of $M$ is sufficiently large depending on $\varepsilon$ , this probability is larger than $1-\exp(-n)$ , as claimed. Proposition 9.1 for infinite second moment is proved. ∎
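The trade-off in this union bound can be explored numerically. The sketch below assumes that the failure probability is bounded by $\binom{n}{m}^{2}\exp(-M^{2}m)$ , which is what the union bound over $\binom{n}{m}^{2}$ submatrices and the probability in Lemma 9.2 give, and works on the logarithmic scale. The function names and the sample values $n=1000$ , $\varepsilon=0.1$ are illustrative assumptions.

```python
import math

def log_union_bound_failure(n, eps, M):
    """Natural logarithm of C(n, m)^2 * exp(-M^2 * m), where m = round((1 - eps) * n)."""
    m = round((1 - eps) * n)
    return 2 * math.log(math.comb(n, m)) - M * M * m

def smallest_sufficient_M(n, eps, step=0.01):
    """Smallest M on a grid of width `step` for which the bound above is at most exp(-n)."""
    k = 1
    while log_union_bound_failure(n, eps, k * step) > -n:
        k += 1
    return k * step

n, eps = 1000, 0.1
m = round((1 - eps) * n)
# The counting bound C(n, m) <= (e*n/m)^m used in the text, on the log scale.
print(math.log(math.comb(n, m)), "<=", m * (1 + math.log(n / m)))
print(log_union_bound_failure(n, eps, M=2.0))
print(smallest_sufficient_M(n, eps))
```

For these sample values a modest constant $M$ already suffices; as $\varepsilon$ decreases, $m$ approaches $n$ and the combinatorial factor becomes smaller, in line with the remark that the result is harder for $\varepsilon$ near $1$ .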

2. Nonzero mean

Now we will prove the part of Proposition 9.1 about nonzero mean. We can assume here that the second moment of the entries $A_{ij}$ is finite, as the opposite case was treated in Section 9.1. As before, we will first focus on one submatrix. In the following lemma we make an extra boundedness assumption, which we will get rid of using truncation later.

Consider an $m\times m$ random matrix $B$ whose entries are i.i.d. random variables that satisfy

Then, for any $M>0$ there exists $m_{0}$ that may depend only on $\mu$ , $\sigma$ , $K$ and $M$ , and such that for any $m>m_{0}$ we have

with probability at least $1-\exp(-M^{2}m)$ .

(To check this inequality, recall that $\|B\|\geq x^{\mathsf{T}}Bx$ for any unit vector $x$ ; use this for the vector $x$ all of whose coordinates equal $1/\sqrt{m}$ .) Then we can bound the failure probability as follows:

Apply Bernstein’s inequality for the random variables $B_{ij}$ and use that they have variance at most $\sigma^{2}$ and are bounded by $K\sqrt{m}$ by assumption. The failure probability gets bounded by

If $m$ is large enough depending on $\mu$ , $\sigma$ , $K$ and $M$ , then this probability can be further bounded by $\exp(-M^{2}m)$ , as claimed. ∎
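The deterministic ingredient of this proof, the bound $\|B\|\geq x^{\mathsf{T}}Bx=\frac{1}{m}\sum_{i,j}B_{ij}$ for the constant unit vector $x$ , can be checked on examples. The Python sketch below is illustrative only: the function names, the Gaussian entries with mean $0.5$ , and the seed are assumptions made for the example.

```python
import math
import random

def constant_vector_form(B):
    """x^T B x for x = (1/sqrt(m), ..., 1/sqrt(m)), i.e. the sum of all entries divided by m."""
    m = len(B)
    return sum(sum(row) for row in B) / m

def operator_norm(B, iters=100):
    """Lower estimate of the operator norm of B by power iteration on B^T B,
    started from the constant unit vector."""
    rows, cols = len(B), len(B[0])
    v = [1.0 / math.sqrt(cols)] * cols
    norm = 0.0
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # w = B v
        norm = math.sqrt(sum(x * x for x in w))                              # ||B v||
        if norm == 0.0:
            return 0.0
        u = [sum(B[i][j] * w[i] for i in range(rows)) for j in range(cols)]  # u = B^T B v
        size = math.sqrt(sum(x * x for x in u))
        v = [x / size for x in u]
    return norm

rng = random.Random(1)   # arbitrary fixed seed
m, mu = 30, 0.5          # illustrative size and mean
B = [[mu + rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(m)]
print(constant_vector_form(B), "<=", operator_norm(B))
```

For a matrix with nonzero mean $\mu$ the left-hand side is close to $\mu m$ , which eventually exceeds any fixed multiple of $\sqrt{m}$ .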

Next, we will use truncation to get rid of the boundedness assumption in Lemma 9.3 and thus prove the following.

Consider an $m\times m$ random matrix $B$ whose entries are i.i.d. random variables that satisfy

Then, for any $M>0$ there exists $m_{0}$ that may depend only on $\mu$ , $\sigma$ , $K$ , $M$ and the distribution of the entries, and such that for any $m>m_{0}$ we have

with probability at least $1-\exp(-M^{2}m)$ .

Choosing $m_{0}$ large enough depending on $M$ and the distribution of $B_{ij}$ , we can make sure that for any $m\geq m_{0}$ the truncated random variables

(This follows easily from Lebesgue’s monotone convergence theorem.)

Let us consider the event that all entries of $B$ are appropriately bounded:

Suppose for a moment that (9.2) fails, so we have $\|B\|<M\sqrt{m}$ . Since the inequality $\|B\|\geq\max_{i,j}|B_{ij}|$ is always true, the event $\mathcal{E}$ must hold in this case. This in turn implies that the truncation has no effect on the entries, i.e. $\bar{B}_{ij}=B_{ij}$ for all $i,j$ .

We have shown that in the event of the failure of (9.2), we may automatically assume that the entries of $B$ are appropriately bounded. Therefore the failure probability satisfies

where $\bar{B}$ denotes the matrix with the truncated entries $\bar{B}_{ij}$ . It remains to apply Lemma 9.3 for the random matrix $\bar{B}$ , noting that truncation may only decrease the second moment. The failure probability gets bounded by $\exp(-M^{2}m)$ , as claimed. ∎

As we mentioned in the beginning of this section, we can assume that the entries $B_{ij}$ have finite second moment $\sigma^{2}$ . Then the conclusion of the proposition follows by exactly the same union bound argument as at the end of Section 9.1 (just use Lemma 9.4 instead of Lemma 9.2 there). ∎

3. Proof of Theorem 1.3

where the minimum is taken over all $(1-\varepsilon)n\times(1-\varepsilon)n$ submatrices $A^{\prime}_{n}$ of $A_{n}$ . As we mentioned below Proposition 9.1, this would imply the conclusion of Theorem 1.3, since modifying an $\varepsilon n\times\varepsilon n$ submatrix leaves some $(1-\varepsilon)n\times(1-\varepsilon)n$ sub-matrix intact.

where the minimum has the same meaning as before. By Proposition 9.1, there exists $n_{0}$ such that

We have shown that for any $M>0$ , with probability $1$ there exists $N$ such that

Intersecting these almost sure events for $M=1,2,\ldots$ , we conclude (9.3). Theorem 1.3 is proved. ∎

Removal of large entries under $2+\varepsilon$ moments

In this section we give a proof of Proposition 1.4 based on a general bound on random matrices due to A. Bandeira and R. van Handel .

Let us call an entry $A_{ij}$ large if $|A_{ij}|>R:=n^{1/2-\varepsilon/8}$ , otherwise call the entry small.
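The exponent in this choice of $R$ can be sanity-checked numerically. Assume, as a normalization made here for illustration, that $\mathbb{E}|A_{ij}|^{2+\varepsilon}\leq 1$ and $\varepsilon\in(0,1]$ . Then the tail bound $\mathbb{P}(|A_{ij}|>R)\leq R^{-(2+\varepsilon)}=n^{-a}$ holds with $a=(1/2-\varepsilon/8)(2+\varepsilon)=1+\varepsilon/4-\varepsilon^{2}/8$ , and $a\geq 1+\varepsilon/8$ . The helper names in the Python sketch below are hypothetical.

```python
def tail_exponent(eps):
    """Exponent a such that R^{-(2+eps)} = n^{-a} for R = n^{1/2 - eps/8}."""
    return (0.5 - eps / 8) * (2 + eps)

def exponent_margin(eps):
    """a - (1 + eps/8); algebraically this equals eps * (1 - eps) / 8."""
    return tail_exponent(eps) - (1 + eps / 8)

for eps in (0.1, 0.5, 1.0):
    # The margin is nonnegative on (0, 1] and vanishes at eps = 1.
    print(eps, tail_exponent(eps), exponent_margin(eps))
```

The margin $\varepsilon(1-\varepsilon)/8$ is nonnegative exactly when $\varepsilon\leq 1$ , which is where the restriction on $\varepsilon$ enters this step.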

We claim that there are very few large entries with high probability, and we can check this by the same argument as in Lemma 8.5. Indeed, the $2+\varepsilon$ moment assumption and Chebyshev’s inequality give

where the last inequality follows by our choice of $R$ . Thus the expected number of large entries is at most $n^{2}/n^{1+\varepsilon/8}=n^{1-\varepsilon/8}$ . A standard application of Chernoff’s inequality (see e.g. [25, Chapter 2]) gives

For convenience, let us subtract the mean, and first bound

which is an $n\times n$ random matrix with independent mean zero entries $G_{ij}$ . A theorem of A. Bandeira and R. van Handel (see [6, Remark 3.13]) states that for any $t>0$ , we have

(where we used the moment assumption) and

Then, using (10.2) with $t=\sqrt{n}$ , we conclude that

where the last inequality holds due to the definition of $R$ , if $n$ is sufficiently large in terms of $\varepsilon$ .

where we used the moment assumption and a weaker form of the bound (10.1).

Concluding, it follows from (10.3) and (10.1) that with probability at least $\exp(-n^{\varepsilon/5})$ , we have

The proof of Proposition 1.4 is complete. ∎

Further questions

Several extensions of Theorem 1.1 seem plausible.

It is natural to expect a version of Theorem 1.1 even if the entries of $A$ are not identically distributed. Our argument relies on the identical distribution in several places, including discretization arguments (proof of Theorem 4.2) and symmetrization (proofs of Lemmas 6.3 and 6.4).

A version of Theorem 1.1 should hold for symmetric matrices $A$ with independent entries on and above the diagonal. The simplest way to get this result would be to use Theorem 1.1 to control the parts of $A$ above and below the diagonal separately, and then combine them. However, for this argument one would need a version of Theorem 1.1 for non-identically distributed entries.

Unlike Feige-Ofek’s result mentioned in Section 1.5, Theorem 1.1 does not indicate what sub-matrix should be removed to improve the norm; it is rather an existential result. It would be nice to have an explicit description of a submatrix to be removed.

It would be good to remove the logarithmic factor $\ln\varepsilon^{-1}$ from the bound in Theorem 1.1, or to show that this factor is necessary. Such a bound would be optimal up to an absolute constant factor.

Finally, while Remark 1.2 states that the dependence on $\varepsilon$ in Theorem 1.1 is almost optimal in general, this dependence might be dramatically improved under a natural boundedness assumption. Namely, suppose that the entries of $A$ are $O(\sqrt{n})$ almost surely. (In fact, most of the proof – until Section 8 – was done under this additional assumption.) In this case, is the dependence of the norm on $\varepsilon$ logarithmic in Theorem 1.1, i.e.

In fact, for the special case of Bernoulli matrices such that $np=c_{0}=\mathrm{const}$ (where $p$ is the probability of a non-zero entry), this bound can be quickly deduced from Corollary 8.2.

non-zero elements of order $O(\sqrt{n})$ . Hence,
