Delocalization of eigenvectors of random matrices with independent entries

Mark Rudelson, Roman Vershynin

Introduction

This paper establishes complete delocalization of random matrices with independent entries having variances of the same order of magnitude. For an $n\times n$ matrix $G$ , complete delocalization refers to the situation where all unit eigenvectors $v$ of $G$ have all coordinates of the smallest possible magnitude $n^{-1/2}$ , up to logarithmic corrections. For example, a random matrix $G$ with independent standard normal entries is completely delocalized with high probability. Indeed, by rotation invariance the unit eigenvectors $v$ are uniformly distributed on the sphere $S^{n-1}$ , so with high probability one has $\|v\|_{\infty}=\max_{i\leq n}|v_{i}|=O(\sqrt{\log(n)/n})$ for all $v$ .
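The benchmark $\sqrt{\log(n)/n}$ for a uniformly distributed unit vector can be checked numerically. The following is a minimal sketch, not part of the argument: it samples uniform points on $S^{n-1}$ by normalizing a standard Gaussian vector and reports $\|v\|_{\infty}$ relative to $\sqrt{\log(n)/n}$. The function names, the dimension $n=2000$ and the seed are illustrative assumptions.

```python
import math
import random

def uniform_unit_vector(n, rng):
    """Sample a uniform point on S^{n-1} by normalizing a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def sup_norm_ratio(n, rng):
    """Return ||v||_inf divided by the benchmark sqrt(log(n) / n)."""
    v = uniform_unit_vector(n, rng)
    return max(abs(x) for x in v) / math.sqrt(math.log(n) / n)

rng = random.Random(0)  # fixed seed, chosen arbitrarily
ratios = [sup_norm_ratio(2000, rng) for _ in range(5)]
print(ratios)  # each ratio should be of order one
```

The ratios stay bounded as $n$ grows, which is the sense in which $n^{-1/2}$ is attained up to a logarithmic correction.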

Rotation-invariant ensembles seem to be the only example where delocalization can be obtained easily. Only recently was it proved by L. Erdős et al. that general symmetric and Hermitian random matrices $H$ with independent entries are completely delocalized . These results were later extended by L. Erdős et al. and by Tao and Vu , see also surveys . Very recently, the optimal bound $O(\sqrt{\log n/n})$ was obtained by Vu and Wang for the “bulk” eigenvectors of Hermitian matrices. Delocalization properties with varying degrees of strength and generality were then established for several other symmetric and Hermitian ensembles: band matrices , sparse matrices (adjacency matrices of Erdős-Rényi graphs) , heavy-tailed matrices , and sample covariance matrices .

Despite this recent progress, no delocalization results were available for non-Hermitian random matrices prior to the present work. Similar to the Hermitian case, non-Hermitian random matrices have been successful in describing various physical phenomena, see and the references therein. The distribution of eigenvectors of non-Hermitian random matrices has been studied in physics literature, mostly focusing on correlations of certain eigenvector entries, see .

All previous approaches to delocalization in random matrices were spectral. Delocalization was obtained as a byproduct of local limit laws, which determine eigenvalue distribution on microscopic scales. For example, delocalization for symmetric random matrices was deduced from a local version of Wigner’s semicircle law which controls the number of eigenvalues of $H$ falling in short intervals, even down to intervals where the average number of eigenvalues is logarithmic in the dimension .

In this paper we develop a new approach to delocalization of random matrices, which is geometric rather than spectral. The only spectral properties we rely on are crude bounds on the extreme singular values of random matrices. As a result, the new approach can work smoothly in situations where limit spectral laws are unknown or even impossible. In particular, one does not need to require that the variances of all entries be the same, or even that the matrix of variances be doubly-stochastic (as e.g. ).

The case $\alpha=2$ corresponds to sub-gaussian random variables (standard properties of sub-gaussian random variables can be found in [38, Section 5.2.3]). It is convenient to state and prove the main result for sub-gaussian random variables, and then deduce a similar result for general $\alpha>0$ using a standard truncation argument.

The same conclusion as in Theorem 1.1 holds for a complex matrix $G$ . One just needs to require that both real and imaginary parts of all entries are independent and satisfy the three conditions in Theorem 1.1.

The exponent $9/2$ of the logarithm in Theorem 1.1 is suboptimal, and there are several points in the proof that can be improved. We believe that by taking care of these points, it is possible to improve the exponent to the optimal value $1/2$ . However, such improvements would come at the expense of simplicity of the argument, while in this paper we aim at presenting the most transparent proof. The exponent $3/2$ is probably suboptimal as well.

The proof of Theorem 1.1 shows that $C$ depends polynomially on $K$ , i.e., $C\leq 2K^{C_{0}}$ for some absolute constant $C_{0}$ . This observation allows one to extend Theorem 1.1 to the situation where the entries $G_{ij}$ of $G$ have uniformly bounded $\psi_{\alpha}$ -norms, for any fixed $\alpha>0$ .

Here $C,\beta,\gamma$ depend only on $\alpha>0$ and $M$ .

Let $G$ be a random matrix as in Theorem 1.1, and let $t\geq 2$ and $s\geq 0$ . Then, with probability at least $1-n^{1-t(s+1)}$ , all $(s/\sqrt{n})$ -approximate eigenvectors $v$ of $G$ satisfy

The results of this paper could be extended in several other ways. For instance, it is possible to drop the assumption that all variances of the entries are of the same order and prove a similar theorem for sparse random matrices. One can establish the isotropic delocalization in the sense of . We did not pursue these directions since it would have made the presentation heavier. It is also possible that a version of Theorems 1.1 and 1.6 can be proved for Hermitian matrices. We leave this direction for the future.

Our approach to Theorem 1.6 is based on a dimension reduction argument. If the matrix $G$ has a localized approximate eigenvector, it will be detected from an imbalance of a suitable projection of $G$ . As we shall see, this argument yields a lower bound on $\|(G-zI)v\|_{2}$ that is uniform over all unit vectors $v$ such that $\|v\|_{\infty}\gg 1/\sqrt{n}$ .

Let us describe this strategy in loose terms. We fix $z$ and consider the random matrix $A=G-zI$ ; denote its columns by $A_{j}$ . Consider a projection $P$ whose kernel contains all $A_{j}$ except for $j\in\{j_{0}\}\cup J_{0}$ , where $j_{0}\in[n]$ and $J_{0}\subset[n]\setminus\{j_{0}\}$ are a random uniform index and subset respectively, with $|J_{0}|=l\sim\log^{10}n$ . We call such $P$ a test projection. By its definition, the triangle inequality and the Cauchy-Schwarz inequality, we have for any vector $v$ that

Suppose that $v$ is localized, say $\|v\|_{\infty}>l^{2}/\sqrt{n}$ . Using the randomness of $j_{0}$ and $J_{0}$ , with non-negligible probability (around $1/n$ ) we have $|v_{j_{0}}|=\|v\|_{\infty}>l^{2}/\sqrt{n}$ and $(\sum_{j\in J_{0}}|v_{j}|^{2})^{1/2}\lesssim\sqrt{l/n}$ . On this event, we have shown that

Since the right hand side of (1.2) does not depend on $v$ , we obtained a uniform lower estimate for all localized vectors $v$ .
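The mechanism behind this lower bound can be illustrated numerically. The sketch below rests on simplifying assumptions: $P$ is taken to be the orthogonal projection onto the orthogonal complement of $\operatorname{span}(A_{j})_{j\notin\{j_{0}\}\cup J_{0}}$ , the matrix is real Gaussian, and the sizes are tiny. It checks the inequality $\|Av\|_{2}\geq\|PAv\|_{2}\geq|v_{j_{0}}|\,\|PA_{j_{0}}\|_{2}-(\sum_{j\in J_{0}}|v_{j}|^{2})^{1/2}(\sum_{j\in J_{0}}\|PA_{j}\|_{2}^{2})^{1/2}$ , which follows from the triangle and Cauchy-Schwarz inequalities. All names, and the choice of $j_{0}$ and $J_{0}$ , are hypothetical.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def orthonormal_basis(vectors):
    """Modified Gram-Schmidt: orthonormal basis of span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [a - c * b for a, b in zip(w, u)]
        nw = norm(w)
        if nw > 1e-12:
            basis.append([a / nw for a in w])
    return basis

def project_off(x, basis):
    """Apply P: remove from x its component in span(basis)."""
    w = list(x)
    for u in basis:
        c = dot(w, u)
        w = [a - c * b for a, b in zip(w, u)]
    return w

rng = random.Random(1)
n = 12
cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]  # cols[j] = A_j
j0, J0 = 0, [1, 2, 3]  # hypothetical index and subset
kept = [j0] + J0
basis = orthonormal_basis([cols[j] for j in range(n) if j not in kept])

v = [rng.gauss(0.0, 1.0) for _ in range(n)]
nv = norm(v)
v = [a / nv for a in v]  # unit vector

Av = [sum(v[j] * cols[j][i] for j in range(n)) for i in range(n)]
PA = {j: project_off(cols[j], basis) for j in kept}  # P A_j for the surviving columns
tail_coeff = math.sqrt(sum(v[j] ** 2 for j in J0))
tail_cols = math.sqrt(sum(norm(PA[j]) ** 2 for j in J0))
lower = abs(v[j0]) * norm(PA[j0]) - tail_coeff * tail_cols
print(norm(Av) >= lower)
```

Since $P$ annihilates every column outside $\{j_{0}\}\cup J_{0}$ , only $l+1$ coordinates of $v$ enter the bound, which is what makes it uniform over localized $v$ .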

It remains to estimate the magnitudes of $\|PA_{j}\|_{2}$ for $j\in\{j_{0}\}\cup J_{0}$ . What helps us is that the test projection $P$ can be made independent of the random vectors $A_{j}$ appearing in (1.2). Since $A=G-zI$ , we can represent $A_{j}=G_{j}-ze_{j}$ where $e_{j}$ denote the standard basis vectors. Assume first that $z$ is very close to zero, so $A_{j}\approx G_{j}$ . Then, using concentration of measure we can argue that $\|PA_{j}\|_{2}\approx\|PG_{j}\|_{2}\sim\sqrt{l}$ with high probability (and thus for all $j\in\{j_{0}\}\cup J_{0}$ simultaneously). Substituting this into (1.2) we conclude that the nice lower bound

holds for all localized approximate eigenvectors $v$ corresponding to (approximate) eigenvalues $z$ that are very close to zero.

The challenging part of our argument is for $z$ not close to zero, namely when the diagonal parts $Pe_{j}$ dominate in the representation $PA_{j}=PG_{j}-zPe_{j}$ . Estimating the magnitudes of $\|Pe_{j}\|_{2}$ might be as difficult as the original delocalization problem. However, it turns out that using concentration, it is possible to compare the terms $\|Pe_{j}\|_{2}$ with each other without knowing their magnitudes. This will require a careful construction of a test projection $P$ .

The rest of the paper is organized as follows. In Section 2, we recall some known linear algebraic and probabilistic facts. In Section 3, we rigorously develop the argument that was informally described above. It reduces the delocalization problem to finding a test projection $P$ for which the norms of the columns $Pe_{j}$ have similar magnitudes. In Section 4, we shall develop a helpful tool for estimating $\|Pe_{j}\|_{2}$ , an estimate of the distance between anisotropic random vectors and subspaces. In Section 5, we shall express $\|Pe_{j}\|_{2}$ in terms of such distances, and thus will be able to compare these terms with each other. In Section 6 we deduce Theorem 1.6. Finally, Appendix A contains auxiliary results on the smallest singular values of random matrices.

Notation and preliminaries

We shall work with random variables $\xi$ which satisfy the following assumption.

$\xi$ is either real valued and satisfies

or $\xi$ is complex valued, where $\operatorname*{Re}\xi$ and $\operatorname*{Im}\xi$ are independent random variables each satisfying the three conditions in (2.1).

We will establish the conclusion of Theorem 1.6 for random matrices $G$ with independent entries that satisfy Assumption 2.1. Thus we will simultaneously treat the real case and the complex case discussed in Remark 1.2.

We will regard the parameter $K$ in Assumption 2.1 as a constant, thus $C,C_{1},c,c_{1},\ldots$ will denote positive numbers that may depend on $K$ only; their values may change from line to line.

Without loss of generality, we can assume that $G$ , as well as various other matrices that we will encounter in the proof, have full rank. This can be achieved by a perturbation argument, where one adds to $G$ an independent Gaussian random matrix $G^{\prime}$ whose entries are all independent $N(0,\sigma^{2})$ random variables with sufficiently small $\sigma>0$ . Such perturbation will not affect the proof of Theorem 1.6 since any $\varepsilon$ -approximate eigenvector of $G$ will be a $(2\varepsilon)$ -approximate eigenvector of $G+G^{\prime}$ whenever $\left\|G^{\prime}\right\|<\varepsilon$ .

Here $A^{\dagger}$ denotes the Moore-Penrose pseudoinverse of $A$ , see e.g. . We will need a few elementary properties of singular values.

Let $A$ be an $m\times n$ matrix and $r=\operatorname*{rank}(A)$ .

Appendix A contains estimates of the smallest singular values of random matrices.

Next, we state a concentration property of sub-gaussian random vectors.

Let $A$ be a fixed $m\times n$ matrix. Consider a random vector $X=(X_{1},\ldots,X_{n})$ with independent components $X_{i}$ which satisfy Assumption 2.1.

(Concentration) For any $t\geq 0$ , we have

In both parts, $c=c(K)>0$ is polynomial in $K$ .

This result can be deduced from the Hanson-Wright inequality. For part (ii), this was done in . A modern proof of the Hanson-Wright inequality and deduction of both parts of Theorem 2.3 are discussed in . There $X_{i}$ were assumed to have unit variances; the general case follows by a standard normalization step.
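As a numerical sanity check of the normalization in the unit variance case, the following sketch verifies that $\mathbb{E}\|AX\|_{2}^{2}=\|A\|_{\mathrm{HS}}^{2}$ for a fixed matrix $A$ and a vector $X$ with independent Rademacher coordinates, so that $\|A\|_{\mathrm{HS}}$ is the natural centering for $\|AX\|_{2}$ . It does not test the tail bounds themselves. The matrix sizes, the entry distribution of $A$ and the function names are assumptions made for illustration only.

```python
import random

def hs_norm_sq(A):
    """Squared Hilbert-Schmidt (Frobenius) norm of a matrix given as a list of rows."""
    return sum(a * a for row in A for a in row)

def apply(A, x):
    """Matrix-vector product A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

rng = random.Random(2)
m, n, trials = 5, 200, 2000
A = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]  # fixed matrix

samples = []
for _ in range(trials):
    X = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # independent Rademacher coordinates
    y = apply(A, X)
    samples.append(sum(t * t for t in y))

mean_sq = sum(samples) / trials
print(mean_sq / hs_norm_sq(A))  # empirical ratio E||AX||^2 / ||A||_HS^2
```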

Sub-gaussian concentration paired with a standard covering argument yields the following result on norms of random matrices, see .

The same holds if $B=P$ is an $r\times N$ matrix such that $PP^{*}=I_{r}$ .

Reducing delocalization to the existence of a test projection

We will first try to bound the probability of the following localization event for a random matrix $A$ and parameters $l,W,w>0$ :

In this section, we reduce our task to the existence of a certain linear map $P$ which reduces dimension from $n$ to $\sim l$ , and which we call a test projection.

To this end, given an $m\times n$ matrix $B$ , we shall denote by $B_{j}$ the $j$ -th column of $B$ , and for a subset $J\subseteq[n]$ , we denote by $B_{J}$ the submatrix of $B$ formed by the columns indexed by $J$ . Fix $n$ and $l\leq n$ , and define the set of pairs

We equip $\Lambda$ with the uniform probability measure.

Let $l\leq n$ . Consider an $n\times n$ random matrix $A$ with an arbitrary distribution. Suppose that to each $(j_{0},J_{0})\in\Lambda$ corresponds a number $l^{\prime}\leq n$ and an $l^{\prime}\times n$ matrix $P=P(n,l,A,j_{0},J_{0})$ with the following properties:

$\ker(P)\supseteq\{A_{j}\}_{j\not\in\{j_{0}\}\cup J_{0}}$ .

Let $\alpha,\kappa>0$ . Let $w>0$ and $W=\frac{w}{\kappa l}+\frac{\sqrt{2}}{\alpha}$ . Then we can bound the probability of the localization event (3.1) as follows:

where $\mathcal{B}_{\alpha,\kappa}$ denotes the following balancing event:

Proposition 3.1 states that in order to establish delocalization (as encoded by the complement of the event $\mathcal{L}_{W,w}$ ), it is enough to find a test projection $P$ which satisfies the balancing property $\mathcal{B}_{\alpha,\kappa}$ .

Let $v\in S^{n-1}$ and $(j_{0},J_{0})\in\Lambda$ , and let $P$ be as in the statement. Using the properties (i) and (ii) of $P$ , we have

The event $\mathcal{B}_{\alpha,\kappa}$ will help us balance the norms $\|PA_{j_{0}}\|_{2}$ and $\|PA_{J_{0}}\|$ , while the following elementary lemma will help us balance the coefficients $v_{i}$ .

For a given $v\in S^{n-1}$ and for random $(j_{0},J_{0})\in\Lambda$ , define the event

Let $k_{0}\in[n]$ denote a coordinate for which $|v_{k_{0}}|=\|v\|_{\infty}$ . Then

Conditionally on $j_{0}=k_{0}$ , the distribution of $J_{0}$ is uniform in the set $\{J\subseteq[n]\setminus\{k_{0}\},\,|J|=l-1\}$ . Thus using Chebyshev’s inequality we obtain
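The conditional step can be simulated directly. The sketch below assumes that the event $\mathcal{V}_{v}$ asks for $j_{0}=k_{0}$ together with $\sum_{j\in J_{0}}|v_{j}|^{2}\leq 2l/n$ ; the threshold $2l/n$ is an assumption made here for illustration, chosen so that a first-moment bound gives conditional probability at least $1/2$ . Since $j_{0}=k_{0}$ has probability $1/n$ , this would give $\mathbb{P}(\mathcal{V}_{v})\geq 1/(2n)$ . The test vector $v_{j}\propto j^{-1/2}$ and all names are hypothetical.

```python
import math
import random

def conditional_frequency(v, l, trials, rng):
    """Fraction of draws of J0 (given j0 = k0) with sum_{j in J0} v_j^2 <= 2l/n."""
    n = len(v)
    k0 = max(range(n), key=lambda i: abs(v[i]))  # a coordinate of maximal modulus
    others = [j for j in range(n) if j != k0]
    threshold = 2 * l / n  # assumed form of the tail condition
    hits = 0
    for _ in range(trials):
        J0 = rng.sample(others, l - 1)  # uniform (l-1)-subset avoiding k0
        if sum(v[j] ** 2 for j in J0) <= threshold:
            hits += 1
    return hits / trials

n, l = 1000, 20
raw = [1.0 / math.sqrt(j + 1) for j in range(n)]
scale = math.sqrt(sum(x * x for x in raw))
v = [x / scale for x in raw]  # unit vector with a few heavy coordinates

freq = conditional_frequency(v, l, 2000, random.Random(3))
print(freq, freq / n)  # conditional frequency, and the implied estimate of P(V_v)
```

The expectation of $\sum_{j\in J_{0}}|v_{j}|^{2}$ given $j_{0}=k_{0}$ is $\frac{l-1}{n-1}\sum_{j\neq k_{0}}|v_{j}|^{2}\leq l/n$ , which is why the assumed threshold is exceeded with conditional probability at most $1/2$ .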

Assume that a realization of the random matrix $A$ satisfies

(We will analyze when this event occurs later.) Combining with the conclusion of Lemma 3.2, we see that there exists $(j_{0},J_{0})\in\Lambda$ such that both events $\mathcal{V}_{v}$ and $\mathcal{B}_{\alpha,\kappa}$ hold. Then we can continue estimating $\|Av\|_{2}$ in (3.3) using $\mathcal{V}_{v}$ and $\mathcal{B}_{\alpha,\kappa}$ as follows:

provided the right hand side is non-negative. In particular, if $\|v\|_{\infty}>W\sqrt{l/n}$ where $W=\frac{w}{\kappa l}+\frac{\sqrt{2}}{\alpha}$ , then $\|Av\|_{2}>w/\sqrt{n}$ . Thus the localization event $\mathcal{L}_{W,w}$ must fail.

Let us summarize. We have shown that the localization event $\mathcal{L}_{W,w}$ implies the failure of the event (3.5). The probability of this failure can be estimated using Chebyshev’s inequality and Fubini theorem as follows:

This completes the proof of Proposition 3.1. ∎

We might choose $P$ to be the orthogonal projection with

In reality, $P$ will be a bit more adapted to $A$ . Let us see what it will take to prove the two inequalities defining the balancing event $\mathcal{B}_{\alpha,\kappa}$ in (3.2). The second inequality can be deduced from the small ball probability estimate, Theorem 2.3(ii). Turning to the first inequality, note that

up to a polynomial factor in $|J_{0}|=l-1$ (thus logarithmic in $n$ ). So we need to show that

Since $A=G-zI_{n}$ , the columns $A_{i}$ of $A$ can be expressed as $A_{i}=G_{i}-ze_{i}$ . Thus, informally speaking, our task is to show that with high probability,

The first inequality can be deduced from sub-gaussian concentration, Theorem 2.3. The second inequality in (3.6) is challenging, and most of the remaining work is devoted to validating it. Instead of estimating $\|Pe_{j}\|_{2}$ , we will compare these terms with each other.

Later, in Proposition 5.3, we will relate $\|Pe_{j}\|_{2}$ to distances between anisotropic random vectors and subspaces. We will now digress to develop a general bound on such distances, which may be interesting on its own.

Distances between anisotropic random vectors and subspaces

Let us start with the isotropic case, where the random vectors in question have all independent coordinates. Here one can use Theorem 2.3 to control the distances.

Let $1\leq k\leq n$ . Consider independent random vectors $X,X_{1},X_{2},\ldots,X_{k}$ with independent coordinates satisfying Assumption 2.1. Consider the subspace $E_{k}=\operatorname*{span}(X_{i})_{i=1}^{k}$ . Then

By adding small independent Gaussian perturbations to the vectors $X_{j}$ , we can assume that $\dim(E_{k})=k$ almost surely. We can represent the distance as

In this paper, we will need to control the distances in the more difficult anisotropic case, where all random vectors are transformed by a fixed linear map $D$ . In other words, we will be interested in distances of the form $d(DX,E_{k})$ where $E_{k}$ is the span of the vectors $DX_{1},\ldots,DX_{k}$ . An ideal estimate should look like

where $s_{i}(D)$ are the singular values of $D$ arranged in the non-increasing order. To see why such estimate would make sense, note that in the isotropic case where $D=I_{n}$ the distance is of order $\sqrt{n-k}$ , while for $D$ of rank $k$ or lower, the distance is zero.
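Both extreme cases can be observed in a small experiment. The sketch below is an illustration under stated assumptions: Gaussian coordinates, tiny dimensions, and, for the degenerate case, $D$ chosen as the coordinate projection onto the first $k$ coordinates. It computes $d(DX,E_{k})$ by Gram-Schmidt and compares the isotropic case with $\sqrt{n-k}$ . The helper names are hypothetical.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def distance_to_span(x, vectors):
    """Distance from x to span(vectors), via modified Gram-Schmidt."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [a - c * b for a, b in zip(w, u)]
        nw = math.sqrt(dot(w, w))
        if nw > 1e-12:
            basis.append([a / nw for a in w])
    r = list(x)
    for u in basis:
        c = dot(r, u)
        r = [a - c * b for a, b in zip(r, u)]
    return math.sqrt(dot(r, r))

def gaussian(n, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

rng = random.Random(5)
n, k, trials = 40, 25, 20

# Isotropic case D = I_n: the squared distance should average about n - k.
squares = []
for _ in range(trials):
    X = gaussian(n, rng)
    Xs = [gaussian(n, rng) for _ in range(k)]
    squares.append(distance_to_span(X, Xs) ** 2)
iso_ratio = (sum(squares) / trials) / (n - k)

# Degenerate case: D has rank k, so DX lies in span(DX_1, ..., DX_k).
def apply_D(x):
    return x[:k] + [0.0] * (n - k)

X = gaussian(n, rng)
Xs = [gaussian(n, rng) for _ in range(k)]
degenerate = distance_to_span(apply_D(X), [apply_D(y) for y in Xs])
print(iso_ratio, degenerate)
```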

The following result, based again on Theorem 2.3, establishes a somewhat weaker form of (4.1) with exponentially high probability.

Let $D$ be an $n\times n$ matrix with singular values $s_{i}=s_{i}(D)$ (as usual, arranged in a non-increasing order), and define $\bar{S}_{m}^{2}=\sum_{i>m}s_{i}^{2}$ for $m\geq 0$ . Let $1\leq k\leq n$ . Consider independent random vectors $X,X_{1},X_{2},\ldots,X_{k}$ with independent coordinates satisfying Assumption 2.1. Consider the subspace $E_{k}=\operatorname*{span}(DX_{i})_{i=1}^{k}$ . Then for every $k/2\leq k_{0}<k$ and $k<k_{1}\leq n$ , one has

Here $M=Ck\sqrt{k_{0}}/(k-k_{0})$ and $C=C(K)$ , $c=c(K)>0$ .

It is important that the probability bounds in Theorem 4.2 are exponential in $k_{1}-k$ and $k-k_{0}$ . We will later choose $k\sim l\sim\log^{2}n$ and $k_{0}\approx(1-\delta)k$ , $k_{1}\approx(1+\delta)k$ , where $\delta\sim 1/\log n$ . This will allow us to make the exceptional probabilities in Theorem 4.2 smaller than, say, $n^{-10}$ .

As will be clear from the proof, one can replace the distance $d(DX,E_{k})$ in part (ii) of Theorem 4.2 by the following bigger quantity:

We truncate the singular values of $B$ by defining an $n\times n$ matrix $\bar{B}$ with the same left and right singular vectors as $B$ , and with singular values

Since $s_{i}(\bar{B})\leq s_{i}(B)$ for all $i$ , we have $\bar{B}\bar{B}^{*}\preceq BB^{*}$ in the p.s.d. order, which implies

It remains to bound $\|\bar{B}X\|_{2}$ below. This can be done using Theorem 2.3(ii):

For $i>k_{1}-k$ , Cauchy interlacing theorem yields $s_{i}(\bar{B})=s_{i}(B)\geq s_{i+k}(D)$ , thus

(ii) We truncate the singular value decomposition $D=\sum_{i=1}^{n}s_{i}u_{i}v_{i}^{*}$ by defining

We will estimate these two terms separately.

Using this for $t=\sqrt{k}s_{k_{0}+1}$ , we obtain that with probability at least $1-2\exp(-ck)$ ,

Next, we estimate the first term in (4.4), $d(D_{0}X,E_{k})$ . Our immediate goal is to represent $D_{0}X$ as a linear combination

with some control of the norm of the coefficient vector $a=(a_{1},\ldots,a_{k})$ . To this end, let us consider the singular value decomposition

Thus $P_{0}$ is a $k_{0}\times n$ matrix satisfying $P_{0}P_{0}^{*}=I_{k_{0}}$ . Let $G$ denote the $n\times k$ matrix with columns $X_{1},\ldots,X_{k}$ .

We apply Theorem A.3 for the $k_{0}\times k$ matrix $P_{0}G$ . It states that with probability at least $1-2k\exp(-c(k-k_{0}))$ , we have

Using Lemma 2.2(ii) we can find a coefficient vector $a=(a_{1},\ldots,a_{k})$ such that

Multiplying both sides of (4.8) by $U_{0}\Sigma_{0}$ and recalling that $D_{0}=U_{0}\Sigma_{0}V_{0}^{*}=U_{0}\Sigma_{0}P_{0}$ , we obtain the desired identity (4.6).

Now we have representation (4.6) with a good control of $\|a\|_{2}$ . Then we can estimate the distance as follows:

(Recall that $G$ denotes the $n\times k$ matrix with columns $X_{1},\ldots,X_{k}$ .) Applying Theorem 2.4, we have with probability at least $1-2\exp(-k)$ that

Intersecting this with the event (4.10), we obtain with probability at least $1-6k\exp(-c(k-k_{0}))$ that

Finally, we combine this with the event (4.5) and put into the estimate (4.4). It follows that with probability at least $1-8k\exp(-c(k-k_{0}))$ , one has

Due to our choice of $M$ (in (4.10) and (4.7)), the theorem is proved. (The factor $8$ in the probability estimate can be reduced to $2$ by adjusting $c$ . We will use the same step in later arguments.) ∎

Construction of a test projection

We are now ready to construct a test projection $P$ , which will be used later in Proposition 3.1.

with probability at least $1-2n^{2}\exp(-cl/\log n)$ , one has

In the rest of this section we prove Theorem 5.1.

Consider the $n\times n$ random matrix $A$ with columns $A_{j}$ . Let $\bar{A}$ denote the $(n-l)\times(n-l)$ minor of $A$ obtained by removing the first $l$ rows and columns. By known invertibility results for random matrices, we will see that most singular values of $\bar{A}$ , and thus also of $\bar{A}^{-1}$ , are within a factor $n^{O(1)}$ from each other. Then we will find a somewhat smaller interval (a “spectral window”) in which the singular values of $\bar{A}^{-1}$ are within a constant factor from each other. This is a consequence of the following elementary lemma.

Let $s_{1}\geq s_{2}\geq\cdots\geq s_{n}$ , and define $\bar{S}_{k}^{2}=\sum_{j>k}s_{j}^{2}$ for $k\geq 0$ . Assume that for some $l\leq n$ and $R\geq 1$ , one has

Set $\delta=c/\log R$ . Then there exists $l^{\prime}\in[l/2,l]$ such that

Let us divide the interval $[l/2,l]$ into $1/(8\delta)$ intervals of length $4\delta l$ . Then for at least one of these intervals, the sequence $s_{i}^{2}$ decreases over it by a factor at most $2$ . Indeed, if this were not true, the sequence would decrease by a factor at least $2^{1/(8\delta)}>R$ over $[l/2,l]$ , which would contradict the assumption (5.1). Set $l^{\prime}$ to be the midpoint of the interval we just found, thus

By monotonicity of $s_{i}^{2}$ , this implies the first part of the conclusion (5.2). To see this, note that since $l^{\prime}\leq l$ , we have $l^{\prime}-2\delta l\leq(1-\delta)l^{\prime}\leq(1+\delta)l^{\prime}\leq l^{\prime}+2\delta l$ .

To deduce the second part of (5.2), note that by monotonicity we have

where the very last inequality follows from (5.3). Estimates (5.4) and (5.5) together imply that $\bar{S}_{l^{\prime}-\delta l}^{2}\leq 5\bar{S}_{l^{\prime}+\delta l}^{2}$ . Like in the first part, we finish by monotonicity. ∎
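The pigeonhole step in this proof can be phrased as a short search. The sketch below simplifies the lemma: it assumes only that $s_{i}^{2}$ is positive and non-increasing, that $s_{l/2}^{2}\leq R\,s_{l}^{2}$ , and it uses $m=\lceil\log_{2}R\rceil$ blocks sharing endpoints rather than the $1/(8\delta)$ intervals of the proof. If every block lost more than a factor $2$ , the total loss over $[l/2,l]$ would exceed $2^{m}\geq R$ , so a block losing at most a factor $2$ must exist. The function name and the example sequence are hypothetical.

```python
import math

def find_flat_block(s2, lo, hi, m):
    """Split [lo, hi] into m consecutive blocks sharing endpoints and return
    the first block (a, b) with s2[a] <= 2 * s2[b], or None if there is none."""
    bounds = [lo + (hi - lo) * t // m for t in range(m + 1)]
    for a, b in zip(bounds, bounds[1:]):
        if s2[a] <= 2 * s2[b]:
            return a, b
    return None

# Example: polynomially decaying squared singular values (an arbitrary choice).
s2 = [1.0 / (i + 1) ** 3 for i in range(201)]
lo, hi = 50, 100                   # plays the role of [l/2, l]
R = s2[lo] / s2[hi]                # total decay over the interval
m = math.ceil(math.log2(R))        # enough blocks to force a flat one
block = find_flat_block(s2, lo, hi, m)
print(R, m, block)
```

The midpoint of the returned block plays the role of $l^{\prime}$ in the lemma.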

We shall apply Lemma 5.2 to the singular values of $\bar{A}^{-1}$ , i.e. for

To verify the assumptions of the lemma, we can use known estimates of the extreme singular values of random matrices. By Theorem 2.4 (see Remark 2.5), with probability at least $1-\exp(-n)$ , we have $\|G\|\leq C\sqrt{n}$ , and thus

Further, by Theorem A.2, with probability at least $1-2l\exp(-c(n-2l))$ , one has

(Here we used that $l\leq n/4$ .) Summarizing, with probability at least $1-2n\exp(-cl)$ ,

Let us condition on $\bar{A}$ for which event (5.6) holds. We apply Lemma 5.2 with $R=(C_{1}/c_{1})n$ and thus for

We find $l^{\prime}\in[l/2,l]$ such that (5.2) holds. Note that the value of $l^{\prime}$ depends only on the minor $\bar{A}$ , thus only on $\{A_{j}\}_{j>l}$ , as claimed in Theorem 5.1. Since we have conditioned on $\bar{A}$ , the value of the “spectral window” $l^{\prime}$ is now fixed.

5.2. Construction of $P$

We construct $P$ in two steps. First we define a matrix $Q$ of the same dimensions that satisfies (ii) of the Theorem, and then obtain $P$ by orthogonalization of the rows of $Q$ .

Thus we shall look for an $l^{\prime}\times n$ matrix $Q$ that consists of three blocks of columns:

We require that $Q$ satisfy condition (ii) in Theorem 5.1, i.e. that

We explore this requirement in Section 5.4; for now let us assume that it holds.

Choose $P$ to be an $l^{\prime}\times n$ matrix that satisfies the following two defining properties:

the span of the rows of $P$ is the same as the span of the rows of $Q$ .

One can construct $P$ by Gram-Schmidt orthogonalization of the rows of $Q$ .

Note that the construction of $P$ along with (5.8) implies (i) and (ii) of Theorem 5.1. It remains to estimate $\|Pe_{j}\|_{2}$ thereby proving (iii) of Theorem 5.1.

Let $q_{i}$ denote the rows of $Q$ and $q_{ij}$ denote the entries of $Q$ . Then:

The values of $\|Pe_{i}\|_{2}$ , $i\leq n$ , are determined by $Q$ , and they do not depend on a particular choice of $P$ satisfying its defining properties (a), (b).

For every $l^{\prime}<i\leq l$ , $\|Pe_{i}\|_{2}=0$ .

(i) Any $P,P^{\prime}$ that satisfy the defining properties (a), (b) must satisfy $P^{\prime}=UP$ for some $l^{\prime}\times l^{\prime}$ unitary matrix $U$ . It follows that $\|P^{\prime}e_{i}\|_{2}=\|Pe_{i}\|_{2}$ for all $i$ .

(ii) Let us assume that $i=1$ ; the argument for general $i$ is similar. By part (i), we can construct the rows of $P$ by performing Gram-Schmidt procedure on the rows of $Q$ in any order. We choose the following order: $q_{l^{\prime}},q_{l^{\prime}-1},\ldots,q_{1}$ , and thus construct the rows $p_{l^{\prime}},p_{l^{\prime}-1},\ldots,p_{1}$ of $P$ . This yields

where $p_{ij}$ denote the entries of $P$ .

Next, for each $2\leq j\leq l^{\prime}$ , (5.11) implies that $p_{j}\in\operatorname*{span}(q_{k})_{k\geq 2}=E_{1}$ , and thus the first coordinate of $p_{j}$ equals zero. Using this in (5.12), we conclude that

(iii) is trivial since $Qe_{i}=0$ for all $l^{\prime}<i\leq l$ by the construction of $Q$ , while the rows of $P$ are linear combinations of the rows of $Q$ . ∎
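The argument in part (ii) yields the relation $\|Pe_{i}\|_{2}=q_{ii}/d(q_{i},E_{i})$ with $E_{i}=\operatorname*{span}(q_{k})_{k\neq i}$ , and it can be checked numerically. The sketch below assumes a real matrix $Q$ whose first block is diagonal with $q_{ii}>0$ , whose second block (columns $l^{\prime}<j\leq l$ ) is zero, and whose third block is Gaussian; the Gaussian block is a stand-in and is not the block defined by the kernel requirement. The rows of $P$ are obtained by Gram-Schmidt in the natural order. All sizes and names are illustrative.

```python
import math
import random

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def orthonormal_basis(vectors):
    """Modified Gram-Schmidt: orthonormal basis of span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [a - c * b for a, b in zip(w, u)]
        nw = math.sqrt(dot(w, w))
        if nw > 1e-12:
            basis.append([a / nw for a in w])
    return basis

def distance_to_span(x, vectors):
    """Distance from x to span(vectors)."""
    r = list(x)
    for u in orthonormal_basis(vectors):
        c = dot(r, u)
        r = [a - c * b for a, b in zip(r, u)]
    return math.sqrt(dot(r, r))

rng = random.Random(4)
lp, l, n = 3, 5, 9  # lp stands for l'
Q = []
for i in range(lp):
    row = [0.0] * n
    row[i] = rng.uniform(0.5, 2.0)      # diagonal block, q_ii > 0
    for j in range(l, n):
        row[j] = rng.gauss(0.0, 1.0)    # third block (a stand-in for the true one)
    Q.append(row)

P = orthonormal_basis(Q)                # rows of P: orthonormal, same span as rows of Q
col_norms = [math.sqrt(sum(P[r][i] ** 2 for r in range(lp))) for i in range(n)]
predicted = [Q[i][i] / distance_to_span(Q[i], [Q[k] for k in range(lp) if k != i])
             for i in range(lp)]
print(col_norms[:lp], predicted)
```

The quantities $\|Pe_{i}\|_{2}$ for $l^{\prime}<i\leq l$ vanish in this sketch because the corresponding columns of $Q$ are zero.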

5.4. The kernel requirement (5.8)

In order to estimate the distances $d(q_{i},E_{i})$ defined by the rows of $Q$ , let us explore the condition (5.8) for $Q$ . To express this condition algebraically, let us consider the $n\times(n-l)$ matrix $A^{(l)}$ obtained by removing the first $l$ columns from $A$ . Then (5.8) can be written as

Let us denote the first $l$ rows of $A^{(l)}$ by $B_{i}^{\mathsf{T}}$ , thus

Without loss of generality, we can assume that the matrix $\bar{A}$ is almost surely invertible (see Section 2 for a perturbation argument achieving this). Multiplying both sides of the previous equations by $\bar{A}^{-1}$ , we further rewrite them as

Thus we can choose $Q$ to satisfy the requirement (5.8) by choosing $q_{ii}>0$ arbitrarily and defining $\bar{q}_{i}$ as in (5.15).

5.5. Estimating the distances, and completion of proof of Theorem 5.1

We shall now estimate $\|Pe_{i}\|_{2}$ , $1\leq i\leq l^{\prime}$ , using identities (5.9) and (5.15). By the construction of $Q$ and (5.15) we have

Let us estimate $\|Pe_{1}\|_{2}$ ; the argument for general $\|Pe_{i}\|_{2}$ is similar. By (5.9),

We will use Theorem 4.2 to obtain lower and upper bounds on $d_{1}$ .

We apply Theorem 4.2 in dimension $n-l$ instead of $n$ , and with

Recall here that in (5.7) we selected $\delta=c/\log n$ . Note that by construction (5.14), the vectors $B_{i}$ do not contain the diagonal elements of $A$ , and so their entries have mean zero as required in Theorem 4.2. Applying part (i) of that theorem, we obtain with probability at least $1-2\exp(-c\delta l^{\prime})$ that

Now we apply part (ii) of Theorem 4.2. This time we shall use a sharper bound stated in Remark 4.4. It yields that with probability at least $1-2l^{\prime}\exp(-c\delta l^{\prime})$ , the following holds. There exists $a=(a_{2},\ldots,a_{l^{\prime}})$ such that

We can simplify (5.18). Using (5.2) and monotonicity, we have

Recall that this holds with probability at least $1-2l^{\prime}\exp(-c\delta l^{\prime})$ . On this event, by the construction of $r_{i}$ and using the bound on $a$ in (5.19), we have

5.5.3. Completion of the proof of Theorem 5.1

Combining the events (5.20) and (5.17), we have shown the following. With probability at least $1-4l^{\prime}\exp(-c\delta l^{\prime})$ , the following two-sided estimate holds:

A similar statement can be proved for general $d_{i}$ , $1\leq i\leq l^{\prime}$ . By intersecting these events, we obtain that with probability at least $1-4(l^{\prime})^{2}\exp(-c\delta l^{\prime})$ , all such bounds for $d_{i}$ hold simultaneously. Suppose this indeed occurs. Then by (5.16), we have

We have calculated the conditional probability of (5.21); recall that we conditioned on $\bar{A}$ which satisfies the event (5.6), which itself holds with probability $1-2n\exp(-cl)$ . Thus the unconditional probability of the event (5.21) is at least $1-2n\exp(-cl)-C_{1}(l^{\prime})^{2}\exp(-c\delta l^{\prime})$ . Recalling that $l/2\leq l^{\prime}\leq l\leq n/4$ and $\delta=c/\log n$ , and simplifying this expression, we arrive at the probability bound claimed in Theorem 5.1. Since $M\leq 2\sqrt{l}/\delta$ according to (5.19), the estimate (5.21) yields the first part of (iii) in Theorem 5.1. The second part, stating that $Pe_{i}=0$ for $l^{\prime}<i\leq l$ , was already noted in part (iii) of Proposition 5.3. Thus Theorem 5.1 is proved. ∎

Proof of Theorem 1.6 and Corollary 1.5

Let $G$ be a random matrix from Theorem 1.6. We shall apply Proposition 3.1 for

Let $\alpha=c/(l\log^{3/2}n)$ and $\kappa=c$ . Then, for every fixed $(j_{0},J_{0})\in\Lambda$ , one can find a test projection as required in Proposition 3.1. Moreover,

Without loss of generality, we assume that $j_{0}=1$ and $J_{0}=\{2,\ldots,l\}$ . We apply Theorem 5.1, and choose $l^{\prime}\in[l/2,l]$ and $P$ determined by $\{A_{j}\}_{j>l}$ guaranteed by that theorem. The test projection $P$ automatically satisfies the conditions of Proposition 3.1. Moreover, with probability at least $1-2n^{2}\exp(-cl/\log n)$ , one has

Let us condition on $\{A_{j}\}_{j>l}$ for which the event (6.2) holds; this fixes $l^{\prime}$ and $P$ but leaves $\{A_{j}\}_{j\leq l}$ random as before.

The definition (3.2) of balancing event $\mathcal{B}_{\alpha,\kappa}$ requires us to estimate the norms of

Hence, estimates (6.3) and (6.5) hold simultaneously with probability at least $1-4\exp(-cl)$ . Recall that this concerns conditional probability, where we conditioned on the event (6.2), which itself holds with probability at least $1-2n^{2}\exp(-cl/\log n)$ . Therefore, estimates (6.3) and (6.5) hold simultaneously with (unconditional) probability at least $1-4\exp(-cl)-2n^{2}\exp(-cl/\log n)\geq 1-6n^{2}\exp(-cl/\log n)$ . Together they yield

This is the first part of the event $\mathcal{B}_{\alpha,\kappa}$ . Finally, (6.3) implies that $\|PA_{1}\|_{2}\geq c\sqrt{l}$ , which is the second part of the event $\mathcal{B}_{\alpha,\kappa}$ for $\kappa=c$ . The proof is complete. ∎

Substituting the conclusion of Proposition 6.1 into Proposition 3.1, we obtain:

From this we can readily deduce a slightly stronger version of Theorem 1.6.

Consider a random matrix $G$ as in Theorem 1.6. Let $0\leq s\leq n$ , $s+2\leq l\leq n/4$ and $W=Cl\log^{3/2}n$ . Then the event

Recall that $G$ is nicely bounded with high probability. Indeed, Theorem 2.4 (see Remark 2.5) states that the event

Assume that $\mathcal{E}_{\textrm{norm}}$ holds. Then all $(s/\sqrt{n})$ -approximate eigenvalues of $G$ are contained in the complex disc centered at the origin and with radius $\|G\|+s/\sqrt{n}\leq 2C_{1}\sqrt{n}$ . Let $\{z_{1},\ldots,z_{N}\}$ be a $(1/\sqrt{n})$ -net of this disc such that $N\leq C_{2}n^{2}$ .
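The cardinality bound $N\leq C_{2}n^{2}$ can be seen from an explicit grid. The sketch below is one possible construction, not necessarily the one intended: it takes a square grid of mesh $\sqrt{2}/\sqrt{n}$ , so that every point of the plane is within $1/\sqrt{n}$ of a grid point, and keeps the grid points in a slightly enlarged disc. The value $C_{1}=1$ , the dimension $n=30$ and the function names are assumptions for illustration. Unlike a net contained in the disc, this one may include points just outside it.

```python
import math
import random

def disc_net(radius, eps):
    """Grid points of mesh h = eps * sqrt(2) in the disc of radius `radius + h`.
    Every point of the disc of radius `radius` is within eps of some grid point."""
    h = eps * math.sqrt(2.0)
    m = math.ceil((radius + h) / h)
    net = set()
    for i in range(-m, m + 1):
        for j in range(-m, m + 1):
            if math.hypot(i * h, j * h) <= radius + h:
                net.add(complex(i * h, j * h))
    return net, h

def nearest_net_point(z, h):
    """Round z to the nearest grid point."""
    return complex(round(z.real / h) * h, round(z.imag / h) * h)

n = 30
C1 = 1.0                                # assumed value of the constant in ||G|| <= C1 sqrt(n)
radius = 2 * C1 * math.sqrt(n)
eps = 1 / math.sqrt(n)
net, h = disc_net(radius, eps)
print(len(net), len(net) / n ** 2)      # N and the ratio N / n^2
```

The number of grid points is about $\pi(\text{radius}/h)^{2}=2\pi C_{1}^{2}n^{2}$ , which is consistent with a bound of the form $N\leq C_{2}n^{2}$ .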

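Assume $\mathcal{L}_{W}$ holds, so there exists an $(s/\sqrt{n})$ -approximate eigenvector $v$ of $G$ , with approximate eigenvalue $z$ , such that $\|v\|_{2}=1$ and $\|v\|_{\infty}>W\sqrt{l/n}$ . Choose a point $z_{i}$ in the net closest to $z$ , so $|z-z_{i}|\leq 1/\sqrt{n}$ . Then, by the triangle inequality, $\|(G-z_{i}I)v\|_{2}\leq\|(G-zI)v\|_{2}+|z-z_{i}|\,\|v\|_{2}\leq(s+1)/\sqrt{n}$ .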

This argument shows that $\mathcal{L}_{W}\cap\mathcal{E}_{\textrm{norm}}\subseteq\bigcup_{i=1}^{N}\mathcal{L}_{W}^{(i)}$ , where

Recall that the probability of $\mathcal{E}_{\textrm{norm}}$ is estimated in (6.6), and the probabilities of the events $\mathcal{L}_{W}^{(i)}$ can be bounded using Proposition 6.2 with $w=s+1$ . (Our assumption that $l\geq s+2$ enforces the bound $w<l$ that is needed in Proposition 6.2.) It follows that

Simplifying this bound we complete the proof. ∎

We are going to apply Corollary 6.3 for $l=Ct(s+1)\log^{2}n$ . This is possible as long as $s\leq n$ and $t(s+1)<cn/\log^{2}n$ , since the latter restriction enforces the bound $l\leq n/4$ . In this regime, the conclusion of Theorem 1.6 follows directly from Corollary 6.3.

In the remaining case, where either $s\geq n$ or $t(s+1)>cn/\log^{2}n$ , the right hand side of (1.1) is greater than $\|v\|_{2}$ for an appropriate choice of the constant $C$ . Thus, in this case, the bound (1.1) holds trivially since one always has $\|v\|_{\infty}\leq\|v\|_{2}$ . Theorem 1.6 is proved. ∎

Using a standard truncation argument, we will now deduce Corollary 1.5 for general exponential tail decay. We will first prove the following relaxation of Proposition 6.2.

Here $\beta,\gamma,C,c>0$ depend only on $\alpha$ and $M$ .

Appendix A Invertibility of random matrices

Our delocalization method relied on estimates of the smallest singular values of rectangular random matrices. The method works well provided one has access to estimates that are polynomial in the dimension of the matrix (which sometimes was of order $n$ , and other times of order $l\sim\log^{2}n$ ), and provided the probability of having these estimates is, say, at least $1-n^{-10}$ .

In recent years, significantly sharper bounds than those required by our delocalization method have been proved, see survey . We chose to include weaker bounds in this appendix for two reasons: they hold in somewhat more generality than those recorded in the literature, and their proofs are significantly simpler.

Let $N\geq n$ , and let $A=D+G$ where $D$ is an arbitrary $N\times n$ fixed matrix and $G$ is an $N\times n$ random matrix with independent entries satisfying Assumption 2.1. Then

Using the negative second moment identity (see Lemma A.4]), we have

Union bound yields that with probability at least $1-2n\exp(-c(N-n))$ , we have $d(A_{i},E_{i})\geq c\sqrt{N-n}$ for all $i\leq n$ . Plugging this into (A.2), we conclude that with the same probability, $s_{n}(A)^{-2}\leq c^{-2}n/(N-n)$ . This completes the proof. ∎
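For orientation, the chain of inequalities behind this proof can be sketched as follows. We assume here, as is standard for the negative second moment identity, that $A_{i}$ denote the columns of $A$, that $E_{i}$ is the span of the remaining columns, and that $A$ has full rank; these definitions are not displayed in the present section and are an assumption of this sketch. The identity reads

$$
\sum_{i=1}^{n}s_{i}(A)^{-2}=\sum_{i=1}^{n}d(A_{i},E_{i})^{-2},
$$

and on the event where $d(A_{i},E_{i})\geq c\sqrt{N-n}$ for all $i\leq n$ it yields

$$
s_{n}(A)^{-2}\;\leq\;\sum_{i=1}^{n}s_{i}(A)^{-2}\;=\;\sum_{i=1}^{n}d(A_{i},E_{i})^{-2}\;\leq\;\frac{n}{c^{2}(N-n)},
\qquad\text{that is,}\qquad
s_{n}(A)\;\geq\;c\,\sqrt{\frac{N-n}{n}}.
$$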

Let $A=D+G$ where $D$ is an arbitrary $N\times M$ fixed matrix and $G$ is an $N\times M$ random matrix with independent entries satisfying Assumption 2.1. Then all singular values $s_{n}(A)$ for $1\leq n\leq\min(N,M)$ satisfy the estimate (A.1) with $c=c(K)>0$ .

Recall that $s_{n}(A)\geq s_{n}(A_{0})$ where $A_{0}$ is formed by the first $n$ columns of $A$ . The conclusion follows from Theorem A.1 applied to $A_{0}$ . ∎
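The inequality $s_{n}(A)\geq s_{n}(A_{0})$ can be checked from the Courant–Fischer min-max characterization. The following is a sketch, assuming the singular values are arranged in non-increasing order and writing $F_{0}=\operatorname{span}(e_{1},\ldots,e_{n})$ for the coordinate subspace of the first $n$ coordinates (our notation):

$$
s_{n}(A)=\max_{\dim F=n}\;\min_{x\in F,\;\|x\|_{2}=1}\|Ax\|_{2}
\;\geq\;\min_{x\in F_{0},\;\|x\|_{2}=1}\|Ax\|_{2}
\;=\;s_{n}(A_{0}).
$$

The last equality uses that $A$ restricted to $F_{0}$ acts as the $N\times n$ matrix $A_{0}$, whose smallest singular value is $s_{n}(A_{0})$ because $n\leq N$.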

Let us explain the idea of the proof of Theorem A.3. We need a lower bound for

where $G_{i}$ denote the columns of $G$ . The bound has to be uniform over $x\in S^{m-1}$ . Let $m=(1-\delta)k$ and set $m_{0}=(1-\rho)m$ for a suitably chosen $\rho\ll\delta$ .

The vectors $x$ that lie near the subspace $E^{\perp}$, which has dimension $m-m_{0}=\rho m$, can be controlled by the remaining $k-m_{0}$ vectors $PG_{i}$, since $k-m_{0}\gg m-m_{0}$. Indeed, this is equivalent to controlling the smallest singular value of an $(m-m_{0})\times(k-m_{0})$ random matrix whose columns are $QG_{i}$, where $Q$ projects onto $E^{\perp}$. This is a version of Theorem A.3 for very fat matrices, and it can be proved in a standard way by using $\varepsilon$-nets.

Let $m_{0}\leq m$ . Consider the $m\times m_{0}$ matrix $T_{0}$ formed by the first $m_{0}$ columns of matrix $T=PG$ . Then

This is a minor variant of Theorem A.1; its proof is very similar and is omitted. ∎

There exist $C=C(K)$ , $c=c(K)>0$ such that the following holds. Consider the same situation as in Theorem A.3, except that we assume that $k\geq Cm$ . Then

Lemma A.5 is a minor variation of [38, Theorem 5.39] for $k\geq Cm$ independent sub-gaussian columns, and it can be proved in a similar way (using a standard concentration and covering argument). ∎

Denote $T:=PG$ ; our goal is to bound below the quantity

Let $\varepsilon,\rho\in(0,1/2)$ be parameters, and set $m_{0}=(1-\rho)m$ . We decompose

where $T_{0}$ is the $m\times m_{0}$ matrix that consists of the first $m_{0}$ columns of $T$, and $\bar{T}$ is the $m\times(k-m_{0})$ matrix that consists of the last $k-m_{0}$ columns of $T$. Let $x\in S^{m-1}$. Then

Assume that $s_{m_{0}}(T_{0})>0$ (which will be seen to be a likely event), so $\dim(E)=m_{0}$ .

The argument now splits according to the position of $x$ relative to $E$ . Assume first that $\|P_{E}x\|_{2}\geq\varepsilon$ . Since $\operatorname*{rank}(T_{0})=m_{0}$ , using Lemma 2.2(i) we have

We will later apply Lemma A.4 to bound $s_{m_{0}}(T_{0})$ below.
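The first case can be spelled out as follows. This is a sketch under the assumption that $E$ is the range of $T_{0}$, which is not displayed in this section but is consistent with $\dim(E)=m_{0}$. Since the columns of $T$ are split between $T_{0}$ and $\bar{T}$,

$$
\|T^{*}x\|_{2}^{2}=\|T_{0}^{*}x\|_{2}^{2}+\|\bar{T}^{*}x\|_{2}^{2}.
$$

Moreover, $x-P_{E}x$ is orthogonal to the range of $T_{0}$, so $T_{0}^{*}(x-P_{E}x)=0$, and $T_{0}^{*}$ is bounded below by $s_{m_{0}}(T_{0})$ on $E$. Hence, when $\|P_{E}x\|_{2}\geq\varepsilon$,

$$
\|T^{*}x\|_{2}\;\geq\;\|T_{0}^{*}x\|_{2}\;=\;\|T_{0}^{*}P_{E}x\|_{2}\;\geq\;s_{m_{0}}(T_{0})\,\|P_{E}x\|_{2}\;\geq\;\varepsilon\,s_{m_{0}}(T_{0}).
$$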

Consider now the opposite case, where $\|P_{E}x\|_{2}<\varepsilon$ . There exists $y\in E^{\perp}$ such that $\|x-y\|_{2}\leq\varepsilon$ , and in particular $\|y\|_{2}\geq\|x\|_{2}-\varepsilon\geq 1-\varepsilon>1/2$ . Thus

is an $(m-m_{0})\times(k-m_{0})$ matrix. Since $\|y\|_{2}\geq 1/2$, we have $\|\bar{T}^{*}y\|_{2}\geq\frac{1}{2}s_{m-m_{0}}(\bar{B})$, which together with (A.3) yields

A bit later, we will use Lemma A.5 to bound $s_{m-m_{0}}(\bar{B})$ below.
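The two elementary steps of the second case can be made explicit. As a sketch, one natural choice of $y$ (our choice, not necessarily the one intended in the text) is $y:=P_{E^{\perp}}x$, for which $x-y=P_{E}x$ and therefore $\|x-y\|_{2}<\varepsilon$. The triangle inequality then gives an estimate of the type needed here:

$$
\|T^{*}x\|_{2}\;\geq\;\|\bar{T}^{*}x\|_{2}\;\geq\;\|\bar{T}^{*}y\|_{2}-\|\bar{T}^{*}(x-y)\|_{2}\;\geq\;\|\bar{T}^{*}y\|_{2}-\varepsilon\,\|\bar{T}\|.
$$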

Putting the two cases together, we have shown that

It remains to estimate $s_{m_{0}}(T_{0})$ , $s_{m-m_{0}}(\bar{B})$ and $\|\bar{T}\|$ .

Since $m_{0}=(1-\rho)m$ and $\rho\in(0,1/2)$ , Lemma A.4 yields that with probability at least $1-2m\exp(-c\rho m)$ , we have

Next, we use Lemma A.5 for the $(m-m_{0})\times(k-m_{0})$ matrix $\bar{B}=R\bar{G}$ . Let $\delta\in(0,1)$ be such that $m=(1-\delta)k$ . Since $m_{0}=(1-\rho)m$ , by choosing $\rho=c_{0}\delta$ with a suitable $c_{0}>0$ we can achieve that $k-m_{0}\geq C(m-m_{0})$ to satisfy the dimension requirement in Lemma A.5. Then, with probability at least $1-2\exp(-c\delta k)$ we have

Further, by Theorem 2.4, with probability at least $1-2\exp(-k)$ we have

Putting all these estimates in (A.4), we find that with probability at least $1-2m\exp(-c\rho m)-2\exp(-c\delta k)-2\exp(-k)$ , one has

Now we choose $\varepsilon=c_{1}\sqrt{\delta}$ with a suitable $c_{1}>0$ , and recall that we have chosen $\rho=c_{0}\delta$ . We conclude that $s_{m}(T)\geq c\min\{\delta,\,\sqrt{\delta k}\}=c\delta$ . Since $m=(1-\delta)k$ , the proof of Theorem A.3 is complete. ∎
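The parameter choices in this proof can be verified by a short computation. The following is a sketch, where $C$ is the constant from the dimension requirement of Lemma A.5. Since $m_{0}\leq m=(1-\delta)k$ and $\rho=c_{0}\delta$,

$$
k-m_{0}\;\geq\;k-m\;=\;\delta k,
\qquad
m-m_{0}\;=\;\rho m\;\leq\;\rho k\;=\;c_{0}\delta k,
$$

so $k-m_{0}\geq C(m-m_{0})$ holds whenever $c_{0}\leq 1/C$; taking also $c_{0}\leq 1/2$ keeps $\rho\in(0,1/2)$. Finally, the minimum in the last display is attained at $\delta$ because

$$
\delta\;\leq\;\sqrt{\delta}\;\leq\;\sqrt{\delta k}
\qquad\text{for }\delta\in(0,1),\;k\geq 1.
$$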