Norms of random matrices: local and global problems
Elizaveta Rebrova, Roman Vershynin
Introduction
2. Random matrices and their norms
with high probability. Note that the order is the best we can generally hope for. Indeed, if the entries of have unit variance, then the typical magnitude of the Euclidean norm of a row of is , and the operator norm of can not be smaller than that. Moreover, by the bounded fourth moment assumption is nearly necessaryFor almost surely convergence of , fourth moment is necessary and sufficient , while for convergence in probability the weak fourth moment is necessary and sufficient . for the bound
A number of quantitative and more general versions of these bounds are known .
3. Main results
Now let us postulate nothing at all about the distribution of the i.i.d. entries of . It still makes sense to ask: is enforcing the ideal bound (1.1) for random matrices a local or a global problem? That is, can we enforce the bound (1.1) by modifying the entries in a small submatrix of ? We will show in this paper that this is possible if and only if the entries of have zero moment and finite variance. The “if” part is covered by the following theorem.
where is a sufficiently large absolute constant.
This shows that the dependence on in Theorem 1.1 is almost optimal.
By rescaling, a more general version of Theorem 1.1 holds for any finite variance of the entries. The two main assumptions in this theorem – mean zero and finite variance – are necessary in Theorem 1.1. Without either of them, the problem becomes global in a strong sense: the desired bound can not be achieved even after modifying a large submatrix. This is the content of the following result.
It should be noted that while Theorem 1.1 becomes harder for smaller , Theorem 1.3 becomes harder for larger , those near .
4. What if we remove large entries?
One may naturally wonder what exactly may cause the norm of a mean zero random matrix to be too large. A natural guess is that the only troublemakers are a few large entries of . Indeed, this is exactly how the necessity of the fourth moment for (1.1) was shown in . So we may ask – can we obtain a result like Theorem 1.1 simply by zeroing out a few largest entries of ?
The answer is no. A counterexample is a sparse Bernoulli matrix , whose i.i.d. entries take values with probability each and with probability . It is not hard to check that is likely to have a row whose norm exceeds , and consequently we have . In other words, without removal of any entries the norm of is too large. However, if we are to remove any entries based purely on their magnitudes, we must remove them all. (Recall that all non-zero elements of have the same magnitude .) But removal of all nonzero entries of is not a local intervention, since such entries can not be placed in a small submatrix (we explained this in Remark 1.2).
Nevertheless, under slightly stronger moment assumptions than in Theorem 1.1, zeroing out a few large entires does bring the norm of down. The following result can be quickly deduced by truncation from known bounds on random matrices such as .
We will deduce Proposition 1.4 from a general bound of A. Bandeira and R. van Handel in Section 10.
5. Other related results
Acknowledgement
Si Tang and Antonio Auffinger were first to note a version of Proposition 1.4, which they kindly showed to us along with a proof based on . Our correspondence led us to add Section 1.4. We are thankful to Ramon van Handel who showed us a simple argument that we use here to prove Lemma 3.1. We also would like to thank the referee for the valuable suggestions, which helped to improve the presentation of our paper.
The method
Our approach to Theorem 1.1 utilizes and advances the methods developed recently in and . We will first control the cut norm of and then pass to the operator norm using Grothendieck-Pietsch factorization. Let us describe these steps in more detail.
Rather than bounding the operator norm of a random matrix directly, we shall compare it with two simpler norms,
The simplest of the three is the norm. A quick check reveals that it equals the maximum Euclidean norm of the rows of :
The next simplest norm is , which can be conveniently computed as
This norm is equivalent within a constant factor to the cut norm from the computer science literature , where the maximum is taken over . The hardest of the three is the operator norm,
To see why the difficulty in bounding these norms rises this way, note that one has to control random variables in (2.1), random variables in (2.2), and infinitely many random variables in (2.3).
2. Ideal relationships among the norms
How large do we expect the three norms to be for random matrices? For a simple example, let us first consider a Gaussian random matrix with i.i.d. entries. Then it is not difficult to check that
Indeed, note that the rows of have Euclidean norms on average, so the bound on the norm follows by union bound and using Gaussian concentration. The bound on the norm follows from (2.2) by using Gaussian concentration for the normal random vector and taking the union bound over . The bound on the operator norm is a non-asymptotic version of Bai-Yin’s law, see e.g. [24, Theorem 5.32].
One might wonder if (2.4) holds not only in the Gaussian case but generally for random matrices with i.i.d. entries that have zero mean and unit variance. In particular, it would be wonderful if the three norms were always related to each other as follows:
This, however, would be too optimistic to expect, since the bound can not hold without higher moments assumptions as we mentioned in Section 1.1. Nevertheless, we will obtain a version of (2.5) after removal a small fraction of rows of . With high probability, we will be able to find subsets of rows with cardinalities and such that
where the inequalities hide a factor that depends on .
3. A roadmap of the proof
The first step in proving (2.6) is to find a small set with and such that
with high probability. In other words, we would like to bound all rows of simultaneously by after removing a few columns of . To show this we first focus on one row, where we need to bound a sum of independent random variables (the squares of the row’s entries). In Theorem 4.2 we show how to bound sums of independent random variables almost surely by gently damping the summands. Damping, or reweighting down, is a softer operation than removing entries. It allows us to treat in Section 5 all columns simultaneously without much effort, thus proving (2.7). The argument in this step is similar to the approach proposed recently in . We somewhat simplify the method of and also improve the dependence between the number of removed columns and the resulting norm; this will ultimately lead to the optimal dependence on in Theorem 1.1.
At the next step, we extend to a bigger set of rows with and so that
Suppose for a moment that we are not concerned about removal of any columns. It is not too hard to show the general bound
for a random matrix with independent, mean zero entries; we prove this in Lemma 6.1. However, this bound is not very helpful in our situation. We need to work with the matrix instead of , which is not trivial: the removal of the columns in that we did in the first step made the entries of dependent. In Lemma 6.2, we first prove a variant of (2.9) for under an additional symmetry assumption on the distribution of the entries of . Then we manage to remove this assumption with a delicate symmetrization argument, which we develop in the rest of Section 6, with the final result being Theorem 6.6. The general idea of this step, as well as some of our arguments here, are inspired by . However we need to be considerably more careful than in to obtain (2.8) with a logarithmic dependence on .
Next, we pass from norm to the operator norm in Section 7. This is done by using Grothendieck-Pietsch factorization (Theorem 7.1), a result that yields the first inequality in (2.6) for completely arbitrary, even non-random, matrices. This reasoning was recently used in a similar context in .
The argument we just described works under the additional assumption that the entries of be almost surely. To be specific, such boundedness assumption is needed to make the damping argument in Step 1 work with mild, logarithmic dependence on . The contribution of the entries that are larger than are controlled in Section 8 by showing that there can not be too many of them. The unit variance assumption implies that there are such large entries per column on average. This does not mean, of course, that all columns will have large entries with high probability; in fact there could be columns with large entries. But we will check in Lemma 8.1 that the number of such heavy columns is small; removing them will lead to the desired bound on the operator norm for the matrix with large entries. We develop this argument in Proposition 8.4 and Corollary 8.6, and derive the full strength of Theorem 1.1 in Section 8.4.
Theorem 1.3 is proved in Section 9. The paper is concluded with Section 11 where we discuss some further problems.
Acknowledgements
We are thankful to Ramon van Handel who showed us a simple argument that we use here to prove Lemma 3.1 and to Antonio Auffinger for the interesting discussion of finite moment case.
Preliminaries
Throughout the paper, positive absolute constant are denoted , etc. Their values may be different from line to line. We often write to indicate that for some absolute constant .
The discrete interval is denoted by . If is some subset of indices, , let us denote by the matrix obtained from by replacing the indices in by zero:
We will often consider subsets of columns of the matrix, so when we use a simplified notation: for
where and denote the rows and columns of .
Recall that the operator norm can be computed as a maximum of the quadratic form:
Taking the maximum over all unit vectors and , we complete the proof. ∎
3. Concentration
In this paper we make use of good concentration properties of the sums of sub-gaussian (and sub-exponential) random variables, that is, such that grow not faster than standard normal (respectively, exponential) random variables. Recall that by definition a random variable is called sub-gaussian if its moments satisfy
for some number . The minimal number is called the sub-gaussian moment of , denoted as . Analogously, a random variable is called sub-exponential if
for some number . The minimal number is called the sub-exponential moment of , denoted as .
The class of sub-gaussian random variables contains standard normal, Bernoulli, and generally all bounded random variables. The class of sub-exponential random variables is exactly the class of squares of sub-gaussians. See for more information and statements of standard concentration inequalities.
Also we will need a concentration inequality for random permutations from .
The same inequality holds for the sum as well, since it has the same distribution as .
4. Discretization
The following lemma allows us to approximate a general continuous random variable by a sum of independent, scaled Bernoulli random variables. This lemma was originally proved in . Here we give a proof for completeness, and then discuss some particular cases needed for the proof of Theorem 1.1.
Consider a non-negative, continuous random variable . There exists a non-negative random variable satisfying the following.
stochastically dominates , i.e.
is a sum of scaled, independent Bernoulli random variables:
where are non-negative numbers and are independent random variables.
Set the values to be the quantiles of the distribution of :
(These values are well defined since the cumulative distribution function of is continuous by assumption.) By definition, is an increasing sequence. Define by (3.1).
To check part 1, note that by definition,
almost surely. Taking expectation of both sides, we obtain
Now, using the definition of , we have
Let us prove part 2. If for some , then using the definitions of and we obtain
and the inequality in part 2 follows. If then, using the continuity of the cumulative distribution of , we obtain
and the inequality in part 2 follows again. The proof is complete. ∎
Suppose almost surely. Then, in the second part of the conclusion of Lemma 3.3, can be represented as a finite sum
where are non-negative numbers, , and are independent random variables. Here for and .
Stochastic dominance of over in Lemma 3.3 implies that one can realize the random variables and on the same probability space so that
Moreover, in the same way we can construct a majorizing collection for any collection of independent random variables. In particular, we can do it for all entries of the matrix at once.
Damping a sum of independent random variables
Here we will be interested in a stronger result – that the sum be almost surely instead of in expectation. To do this, we will be looking for random weights
To make the damping as gentle as possible, we are looking for largest possible weights , hopefully very close to .
To get started, let us consider the simple case where and try to damp one random variable.
Let . There exists a random variable taking values in $$ and such that
Fix a level whose value we will choose later, and define
Next, the lower bound in (4.2) holds trivially since . For the upper bound, we have
2. Damping a sum of random variables
Now let us address the damping problem for general number of random variables, which we described in the beginning of this section. Applying Lemma 4.1 for each random variable , we get weights such that
for small . We will now considerably improve both these bounds, making only one mild extra assumption that almost surely.
Let be i.i.d. random variables such that
for some . Let . There exist random variables taking values in $$ and such that
Improvement in the order of in (4.4) does not require an extra boundedness assumption, and it was done in previous work . We employ the same ideas as in [19, Lemma 3.3] and obtain better (logarithmic) dependence on in (4.3) in trade of the additional assumption mentioned.
Step 1: Bernoulli distribution. Let us first prove the theorem in the partial case where are scaled Bernoulli random variables. Assume that can take values and , and
Let denote the (random) number of nonzero ’s:
Here is how we will define the weights . If then clearly there is no need to damp so put . The same applies if the number of non-zero ’s does not significantly exceed its expectation . Otherwise we damp all terms by the same amount . Formally, we fix some parameter whose value we will determine later, and set
Let us check (4.3). In the event when , we have
Let us now check (4.4). Since the lower bound is trivial, we will only have to check the upper bound. We will again split the calculation into two cases based on the size of . If then all , so we trivially get
If , then the definition of gives
Since , we have
using a standard consequence of Stirling’s approximation. Thus
provided that . Thus we showed that
where in the last step we used the assumption that that we made in (4.5).
Now that we have the bounds (4.6) and (4.7), it is enough to choose
which implies that . The proof for the Bernoulli distribution is complete.
Here are independent random variables; each can take values and , and
The argument will be similar to step 1 of the proof. For each level we let denote number of non-zero ’s:
Again, for each level define the weights like in step 1:
since by construction. Now, for each level , we can use step 1 of the proof, where we showed in (4.6) that
Let us now check (4.4). The lower bound is trivial, and we will only have to check the upper bound. For each level , we can use step 1 of the proof, where we showed in (4.7) that
which is true as long as . Then, by construction we have
where in the last step we used the inequality . Recall from (4.8) that the exponents form a decreasing geometric progression with values until the last (smallest) term of order . So this last term dominates the sum , and we obtain
Now that we have the bounds (4.10) and (4.11), it is enough to choose
with and the right hand side of (4.11) will be bounded by
as claimed. The proof of the theorem is complete. ∎
The 2→∞→22\to\infty norm of random matrices
In this section we prove Theorem 1.1 under the additional assumption that all entries of are not too large. Specifically, let us assume that
Consider an random matrix with i.i.d. entries which have mean zero and at most unit variance and satisfy (5.1). Let . Then with probability at least , there exists a subset with cardinality such that
We apply Theorem 4.2 for the squares of the elements in each row of , i.e. for the random variables . This gives us random weights which satisfy for each that
To make the same system of weights work for all rows, we define
Then obviously for every , and so
We will remove from the columns whose weights are too small, namely those in
as we claimed in the lemma. Indeed, if then using that all we have
But the probability of this event can be bounded by Markov’s inequality:
where in the last bound we used (5.2). This proves (5.3).
It remains to check that all rows of the matrix are bounded as claimed. We have
Taking the square root of both sides completes the proof. ∎
From 2→∞→22\to\infty norm to ∞→2→2\infty\to 2 norm
In this section we will control the norm of a random matrix. Our first task is to bound the norm by the simpler norm. There are two ways to do this, both of them going back to . The resulting comparison inequalities are interesting in their own right; we state them in Lemmas 6.1 and 6.3. The ultimate result of this section is Theorem 6.6, which gives an optimal bound on the norm of a random matrix after removing a small fraction of columns.
The first method is based on flipping the signs of the entries independently at random. Here is the main result of this section.
Let be an random matrix whose entries are independent, mean zero random variables. Then
Let be independent Rademacher random variables (which are also independent of ) and consider the random matrix
A basic symmetrization inequality (see [15, Lemma 6.3]) yields
Condition on ; the randomness now rests in the random signs only. It suffices to show that the conditional expectation satisfies
Fix . Using independence and (2.1), we get
Moreover, the standard concentration results ([24, Lemma 5.9]) show that each is a sub-gaussian random variable, and we have
Thus is a sub-exponential random variable (see [24, Lemma 5.9]) and
Applying Bernstein’s concentration inequality [24, Corollary 5.17] together with (6.3) and (6.4), we obtain
where is arbitrary. Integration of these tails implies (6.1). ∎
We will need a minor variation of Lemma 6.1 that can be applied even when some of the columns of are removed.
Let be an random matrix whose entries are independent, symmetric random variables. Let be a random subset, which is independent of the signs of the entries of . Then
So, the only part of Lemma 6.1 that does not work for a matrix with removed columns is the symmetrization part. In the following two sections we will develop the tools to overcome the extra symmetry assumption we have to add in Lemma 6.2.
2. Using random permutations
We just showed how to convert an bound to a bound for random matrices by using random signs. Alternatively, one can use random permutations for the same purpose, and obtain the following bound.
Let be an random matrix with i.i.d. entries. Then
where denotes the vector whose all coordinates equal .
The concentration inequality for random permutations (Lemma 3.2) states that each is a sub-gaussian random variable, and we have
Just like in the proof of Lemma 6.1, this implies that
Since the expectation is bounded by the norm (see e.g. [24, Definition 5.13]), we conclude that
Applying Bernstein’s inequality like in Lemma 6.1, we find that
for all . Thus, for any we have with probability at least that
We have already bounded the first sum. As for the second one, the definition of in (6.6) yields
where denotes the number of ones in and is the -th row of . Thus
We substitute this and (6.7) into (6.8) and obtain that for any ,
It remains to recall (6.2) and take a union bound over . It follows that the inequality
where is arbitrary. Integration of these tails implies (6.5). ∎
It is worthwhile to mention a high-probability version of Lemma 6.3.
Let be an random matrix with i.i.d. entries. Then with probability at least we have
where denotes the vector whose all coordinates equal .
At the end of the proof of Lemma 6.3, we obtained inequality (6.9) which states (for large constant ) that
with probability at least . Note that
deterministically. Indeed, it is easy to check that permutations of the elements of the rows of do not affect these two quantities. It follows that
3. Bounding 2→∞→22\to\infty and ∞→2→2\infty\to 2 norms with tiny probability
Recall from Section 2.2 that ideally, we would want
with high probability. But this is too good to be true in our situation, where we assume only two moments for the entries of . Nevertheless, we will now show that these bounds still hold, albeit with exponentially small probability.
Let be an random matrix whose entries are i.i.d. random variables with mean zero and at most unit variance. Let . Then
with probability at least .
We will first bound below the probability of the event
and then use Lemma 6.4 to control .
where denote the rows of . Thus where
are independent events. This reduces the problem to bounding the probability of each event below.
The assumptions on the entries of imply that
Using Chebyshev’s inequality, we see that
By independence of the events , this implies
Next we apply Lemma 6.4, which states that the event
It remains to note that by definition of and , the event implies the inequalities in (6.10). ∎
4. Bounding ∞→2→2\infty\to 2 norm with high probability
In the previous section, we were able to prove the optimal bounds
for a random matrix , but they only hold with exponentially small probability. We claim that the probability of success can be increased to almost if we are allowed to remove a few columns of . We already proved this fact for the norm in Lemma 5.1. It is time to handle the norm.
Consider an random matrix with i.i.d. entries which have mean zero and at most unit variance and satisfy (5.1). Let . Then with probability at least , there exists a subset with cardinality such that
Step 1: Defining the two key events. We will be interested in the two key events that suitably control the and norms of a random matrix. Thus, for a random matrix and numbers , we define
In terms of these events, we want to show that
for some absolute constant . Since the latter event is so likely, intersecting with it would not cause much harm. Indeed, we will show that the bad event
This would finish the proof, since we would then have
Step 2: Symmetrization. As an intermediate step, let us bound the probability of a symmetrized version of , namely the event
and is an independent copy of the random matrix . We claim that
Step 3. Using the small-probability bounds. The last piece of information we will use is the conclusion of Lemma 6.5 for . It states that the good event
Note in passing that there is no guarantee that this statement would hold for the same constants and as we chose in the definition of above. However, we can make this happen by adjusting these constants upwards as necessary. The reader can easily check both (6.12) and (6.14) would still hold after such an adjustment.
The event is determined by , and is determined by only. Thus and are independent, and (6.15) gives
Thus, using (6.12) and (6.14), we conclude that
We have shown (6.11) and thus have completed the proof of the theorem. ∎
From ∞→2→2\infty\to 2 norm to the operator norm: controlling the bounded entries
In Theorem 6.6, we gave an optimal bound for the norm of a random matrix with few removed columns. We will now convert this into an optimal bound for the operator norm. This can be done by applying a form of Grothendieck-Pietsch theorem (see [15, Proposition 15.11]), which has been used recently in [14, section 3.2] in a similar context.
Let be a real matrix and . Then there exists with such that
Applying Theorem 6.6 followed by Grothendieck-Pietsch theorem, we obtain the following result.
Consider an random matrix with i.i.d. entries which have mean zero and at most unit variance and satisfy (5.1). Let . Then with probability at least , there exists a subset with cardinality such that
Apply Theorem 6.6 for instead of . We obtain a subset of columns , , which satistfies
with probability at least .
Next apply Grothendieck-Pietsch Theorem 7.1 for the matrix and for . We obtain a further subset , , such that the removal of columns in both leads to
In the last inequality, we used the bound (7.1) and that and . The proof is complete. ∎
We are ready to prove a partial case of Theorem 1.1, for the matrices whose entries are . It follows by applying Lemma 7.2 for and separately, and then superposing the results.
Apply Lemma 7.2 for and . We obtain that with probability at least , there exists sets and with at most indices in each, and such that
We already controlled the first term in (7.2). As for the second term, since adding columns can only increase the operator norm, we have , which we also bounded in (7.2). The proof is complete. ∎
Controlling the large entries, and completing the proof of Theorem 1.1
In the previous section, we proved a partial case of Theorem 1.1 that controls relatively small entries of , those of the order . Larger entries will be controlled in this section.
The following general lemma will help us analyze the patterns such large entries can form.
Let be an random matrix whose entries are independent Bernoulli random variables with mean . Let . Consider the rows of with more than ones. Then with probability , these rows have at most ones altogether.
To see the connection to our original problem, we will later choose the entries of to be the indicators of the large entries of .
Let be a number to be chosen later. (We will eventually choose as as in the statement of the lemma.) Define the random variables
The quantity of interest is the total number of ones in the heavy rows, and it equals . To control this sum of independent random variables, we can use the standard Bernstein’s trick (commonly called Chernoff’s bound), where we use Markov’s inequality after exponentiation. We obtain
where the last equality follows by independence and identical distribution. Now, by definition of we have
Substituting this bound into (8.2), we conclude that
if we choose so that . To finish the proof, recall that our argument works if satisfies the two conditions: and . We thus choose and complete the proof. ∎
Apply Lemma 8.1 for and with instead of , and take the intersection of the two good events. With the required probability, we obtain a set of bad entries of whose removal makes all rows and columns of contain at most ones. It remains to note that these entries can be trivially placed in some submatrix of , and deletion of the whole submatrix can only decrease the number of non-zero elements in the rows and columns of the residual part. ∎
It is not difficult to obtain a version of Corollary 8.2 for symmetric random matrices. This version can be interpreted as a statement about Erdös-Rényi random graphs , with playing the role of the adjacency matrix. It states that with high probability, one can make all degrees of a random graph bounded by after removing the internal edges from a sub-graph with vertices.
2. Moderately large entries
We will use Corollary 8.2 to deduce Theorem 1.1 for matrices with moderately large entries. Namely, we assume here that all entries of satisfy
Consider the matrix whose elements are indicators of moderately large entries of , i.e.
Then are i.i.d. Bernoulli random variables with mean
3. Very large entries
Finally, we will need to prove Theorem 1.1 for very large entries – now we assume that all entries of satisfy
There are typically very few such entries, as the following simple result shows.
Thus the expected number of non-zero entries in is at most . A standard application of Chernoff’s inequality (see e.g. [25, Chapter 2]) gives
Since a set of indices can be always placed in an submatrix, we can state Lemma 8.5 as follows.
4. Proof of Theorem 1.1
We are going to assemble Proposition 7.3 for the bounded entries of , Proposition 8.4 for moderately large entries, and Corollary 8.6 for very large entries. The sub-matrices that appear in these results are possibly different. The following simple lemma will help us to combine them into one.
Let be a matrix. Zeroing out any submatrix of cannot increase the operator norm more than twice.
The last inequality follows from the fact that zeroing out any subset of rows or columns cannot increase the operator norm. ∎
One may wonder if zeroing out a submatrix can increase the norm at all. Basic simulations show that it actually can.
Decompose into a sum of three matrices with disjoint support,
where contains bounded entries of – those that satisfy , the matrix contains moderately large entries – those for which , and contains large entries – those satisfying .
To bound , let us subtract the mean and first bound
The entries of this matrix have zero mean and satisfy
(where we used the moment assumption) and
This proves the conclusion of Theorem 1.1 with instead of , where is arbitrary. By rescaling, Theorem 1.1 holds also as originally stated. This concludes the proof of Theorem 1.1. ∎
Global problem: proof of Theorem 1.3
In this section we prove Theorem 1.3, which states that either nonzero mean or infinite second moment make it impossible to repair the matrix norm by removing a small submatrix. We will first prove a non-asymptotic version version of this result. Once this is done, an application of Borel-Cantelli Lemma will quickly yield Theorem 1.3.
Consider an random matrix whose entries are i.i.d. random variables that have either nonzero mean or infinite second moment, and let . Then, for any there exists that may depend only on , and the distribution of the entries, and such that for any the following event holds with probability at least : every submatrix of satisfies
Indeed, modifying an submatrix always leaves some submatrix intact, so we can apply Proposition 9.1 for that submatrix.
Here we will prove the part of Proposition 9.1 about infinite second moment; the case of nonzero mean will be treated in Section 9.2. Let us start with the following lemma which will help us treat a fixed submatrix.
Consider an random matrix whose entries are i.i.d. random variables with infinite second moment. Then, for any there exists that may depend only on and the distribution of the entries, and such that for any we have
with probability at least .
(This follows easily from Lebesgue’s monotone convergence theorem.)
Consider the matrix with entries . We have
Then we bound the failure probability as follows:
Apply Hoeffding’s inequality for the random variables and use that they are bounded by by construction. The probability above gets bounded by
If , this probability can be further bounded by , as claimed. ∎
We can assume without loss of generality that is large enough depending on . (Indeed, once the conclusion of the proposition holds for one value of it automatically holds for all smaller values.)
Apply Lemma 9.2 for an matrix with , and then take a union bound over all possible choices of such submatrices. It follows that the conclusion of Proposition 9.1 holds with probability at least
By Stirling’s approximation, we have . Using this and substituting , we bound the probability below by
If the value of is sufficiently large depending on , this probability is larger than , as claimed. Proposition 9.1 for infinite second moment is proved. ∎
2. Nonzero mean
Now we will prove the part of Proposition 9.1 about nonzero mean. We can assume here that the second moment of the entries is finite, as the opposite case was treated in Section 9.1. As before, we will first focus on one submatrix. In the following lemma we make an extra boundedness assumption, which we will get rid of using truncation later.
Consider an random matrix whose entries are i.i.d. random variables that satisfy
Then, for any there exists that may depend only on , , and , and such that for any we have
with probability at least .
(To check this inequality, recall that for any unit vector ; use this for the vector whose all coordinates equal .) Then we can bound the failure probability as follows:
Apply Bernstein’s inequality for the random variables and use that they have variance at most and are bounded by by assumption. The failure probability gets bounded by
If is large enough depending , , and , then this probability can be further bounded by , as claimed. ∎
Next, we will use truncation to get rid of the boundedness assumption in Lemma 9.3 and thus prove the following.
Consider an random matrix whose entries are i.i.d. random variables that satisfy
Then, for any there exists that may depend only on , , , and the distribution of the entries, and such that for any we have
with probability at least .
Choosing large enough depending on and the distribution of , we can make sure that for any the truncated random variables
(This follows easily from Lebesgue’s monotone convergence theorem.)
Let us consider the event that all entries of are appropriately bounded:
Suppose for a moment that (9.2) fails, so we have . Since the inequality is always true, the event must hold in this case. This in turn implies that the truncation has no effect on the entries, i.e. for all .
We have shown that in the event of the failure of (9.2), we may automatically assume that the entries of are appropriately bounded. Therefore the failure probability satisfies
where denotes the matrix with the truncated entries . It remains to apply Lemma 9.3 for the random matrix , noting that truncation may only decrease the second moment. The failure probability gets bounded by , as claimed. ∎
As we mentioned in the beginning of this section, we can assume that the entries have finite second moment . Then the conclusion of the proposition follows by exact same union bound argument as in the end of Section 9.1 (just use Lemma 9.4 instead of Lemma 9.2 there.) ∎
3. Proof of Theorem 1.3
where the minimum is taken over all submatrices of . As we mentioned below Proposition 9.1, this would imply the conclusion of Theorem 1.3, since modifying an submatrix leaves some sub-matrix intact.
where the minimum has the same meaning as before. By Proposition 9.1, there exists such that
We have shown that for any , with probability there exists such that
Intersecting these almost sure events for , we conclude (9.3). Theorem 1.3 is proved. ∎
Removal of large entries under 2+ε2𝜀2+\varepsilon moments
2𝜀2+\varepsilon moments In this section we give a proof of Proposition 1.4 based on a general bound on random matrices due to A. Bandeira and R. van Handel .
Let us call an entry large if , otherwise call the entry small.
We claim that there are very few large entries with high probability, and we can check this by the same argument as in Lemma 8.5. Indeed, the moment assumption and Chebyshev’s inequality give
where the last inequality follows by our choice of . Thus the expected number of large entries is at most . A standard application of Chernoff’s inequality (see e.g. [25, Chapter 2]) gives
For convenience, let us subtract the mean, and first bound
which is an random matrix with independent mean zero entries . A theorem of A. Bandeira and R. van Handel (see [6, Remark 3.13]) states that for any , we have
(where we used the moment assumption) and
Then, using (10.2) with , we conclude that
where the last inequality holds due to the definition of , if is sufficiently large in terms of .
where we used the moment assumption and a weaker form of the bound (10.1).
Concluding, it follows from (10.3) and (10.1) that with probability at least , we have
The proof of Proposition 1.4 is complete. ∎
Further questions
Several extensions of Theorem 1.1 seem plausible.
It is natural to expect a version Theorem 1.1 even if the entries of are not identically distributed. Our argument relies on the identical distribution in several places, including discretization arguments (proof of Theorem 4.2) and symmetrization (proofs of Lemmas 6.3 and 6.4).
A version of Theorem 1.1 should hold for symmetric matrices with independent entries on and above the diagonal. A simplest way to get this result would be to use Theorem 1.1 to control the parts of above and below the diagonal separately, and then combine them. However, for this argument one would need a version of Theorem 1.1 for non-identical distributed entries.
Unlike Feige-Ofek’s result mentioned in Section 1.5, Theorem 1.1 does not indicate what sub-matrix should be removed to improve the norm; it is rather an existential result. It would be nice to have an explicit description of a submatrix to be removed.
It would be good to remove the logarithmic factor from the bound in Theorem 1.1, or to show that this factor is necessary. Such bound would be optimal up to an absolute constant factor.
Finally, while Remark 1.2 states that the dependence on in Theorem 1.1 is optimal in general, this dependence might be dramatically improved under a natural boundedness assumption. Namely, suppose that the entries of are almost surely. (In fact, most of the proof – until Section 8 – was done under this additional assumption.) In this case, is the dependence of the norm on logarithmic in Theorem 1.1, i.e.
In fact, for the partial case of Bernoulli matrices such that (where is a probability of a non-zero entry) this bound can be quickly deduced from Corollary 8.2.
non-zero elements of order . Hence,