Spectral norm of products of random and deterministic matrices

Roman Vershynin

Introduction

This paper grew out of an attempt to understand the class of random matrices with non-independent entries, but which can be factorized through random matrices with independent entries. Equivalently, we are interested in sample covariance matrices of a wide class of random vectors – the linear transformations of vectors with independent entries.

For random matrices with independent and identically distributed entries, the spectral norm is well studied. Let $W$ be an $m\times n$ matrix whose entries are real independent and identically distributed random variables with mean zero, variance $1$ and finite fourth moment. Estimates of the type

are known to hold (and are sharp) in both the limit regime for dimensions increasing to infinity, and the non-limit regime where the dimensions are fixed. The meaning of (1.1) in the limit regime is that, for a family of matrices as above whose dimensions $m$ and $n$ increase to infinity and whose aspect ratio $m/n$ converges to a constant, the ratio $\|W\|/(\sqrt{n}+\sqrt{m})$ converges to $1$ almost surely .

In the non-limit regime, i.e. for arbitrary dimensions $n$ and $m$ , variants of (1.1) were proved by Y. Seginer and R. Latala . If $W$ is an $m\times n$ matrix whose entries are i.i.d. mean zero random variables, then denoting the rows of $W$ by $X_{i}$ and the columns by $Y_{j}$ , the result of Y. Seginer states that

where $C$ is an absolute constant. This estimate is sharp because $\|W\|$ is obviously bounded below by the Euclidean norm of any row and any column of $W$ . Furthermore, if the entries $w_{ij}$ of the matrix $W$ are not necessarily identically distributed, then R. Latala’s result states that

In particular, if $W$ is an $m\times n$ matrix whose entries are independent random variables with mean zero and fourth moments bounded by $1$ , then one can deduce from either Y. Seginer’s or R. Latala’s result that

This is a variant of (1.1) in the non-limit regime.

The fourth moment hypothesis is known to be necessary. Consider again a family of matrices whose dimensions $m$ and $n$ increase to infinity, and whose aspect ratio $m/n$ converges to a constant. If the entries are independent and identically distributed random variables with mean zero and infinite fourth moment, then the upper limit of the ratio $\|W\|/(\sqrt{n}+\sqrt{m})$ is infinite almost surely .

2. The main result

The main result of this paper is an extension of the optimal bound (1.2) to the class of random matrices with non-independent entries, but which can be factored through a matrix with independent entries.

Let $\varepsilon\in(0,1)$ and let $m,n,N$ be positive integers. Consider a random $m\times n$ matrix $W=BA$ , where $A$ is an $N\times n$ random matrix whose entries are independent random variables with mean zero and $(4+\varepsilon)$ -th moment bounded by $1$ , and $B$ is an $m\times N$ non-random matrix such that $\|B\|\leq 1$ . Then

where $C(\varepsilon)$ is a function that depends only on $\varepsilon$ .

1. An important feature of this result is that its conclusion is independent of the dimension $N$ .

2. The proof of Theorem 1.1 yields the stronger estimate

4. Under the stronger subgaussian moment assumption (1.6) on the entries, Theorem 1.1 is easy to prove using standard concentration and an $\varepsilon$ -net argument. In contrast, if only some finite moment is assumed, we do not know any simple proof.

3. The smallest singular value

Our main motivation for Theorem 1.1 was to complete the analysis of the smallest singular value of random rectangular matrices carried out by M. Rudelson and the author in . The smallest singular value $s_{\min}(W)$ of a matrix $W$ can be equivalently described as $s_{\min}(W)=\inf_{x}\|Wx\|_{2}/\|x\|_{2}$ .

Analyzing the smallest singular value is generally harder than analyzing the largest one (the spectral norm). The analogue of (1.1) for the smallest singular value of random $m\times n$ matrices $W$ (for $m>n$ ) is

The optimal limit version of this result proved in holds under exactly the same hypotheses as (1.1) – for i.i.d. entries with mean zero, variance $1$ and finite fourth moment.

Many papers addressed (1.5) for fixed dimensions $n$ , $m$ . Sufficiently tall matrices ( $m\geq Cn$ for sufficiently large $C$ ) were studied in ; extensions to genuinely rectangular matrices ( $m>(1+\varepsilon)n$ for some $\varepsilon>0$ ) were studied in , with gradually improving dependence on $\varepsilon$ . An optimal version of (1.5) for all dimensions was obtained in . All these works put somewhat stronger moment assumptions than the fourth moment of the entries $w_{ij}$ of the matrix $W$ . A convenient assumption is that the entries $w_{ij}$ are subgaussian random variables. This means that all their moments are bounded by the corresponding moments of the standard normal random variable, i.e.

where $M$ is called the subgaussian moment. It was proved in that if the entries of $W$ are i.i.d. mean zero subgaussian random variables with unit variance, then for every $t>0$ one has

where $C,c>0$ depend only on the subgaussian moment $M$ . In particular, for such matrices we have

where $c_{1}>0$ depends only on the desired probability and the subgaussian moment. This result encompasses the case of square matrices where $m=n$ and hence (1.8) yields $s_{\min}(W)\geq c_{2}/\sqrt{n}$ . For Gaussian square matrices this optimal bound was obtained in and ; for general square matrices a weaker bound $n^{-3/2}$ was obtained in and the best bound as above in ; the estimate is shown to be optimal in .

Whether (1.8) holds under weaker moment assumptions was only known in the case of square matrices. It was proved in using (1.2) that (1.8) holds under the fourth moment assumption for square matrices, i.e. for $m=n$ . Whether the same is true for arbitrary rectangular matrices under the fourth moment assumption was left open in . The bottleneck of the argument occurred in Proposition 7.3 on where we needed a correct bound on the spectral norm of a product of a random matrix and a fixed orthogonal projection. Such a bound was easy to get only under the subgaussian hypothesis. Theorem 1.1 of the present paper extends the argument of for random matrices with bounded $(4+\varepsilon)$ -th moment. It follows directly from the argument of and Theorem 1.1.

Let $\varepsilon\in(0,1)$ and $m\geq n$ be positive integers. Let $A$ be a random $m\times n$ matrix whose entries are i.i.d. random variables with mean zero, unit variance and $(4+\varepsilon)$ -th moment bounded by $M$ . Then, for every $\delta>0$ there exist $t>0$ and $n_{0}$ which depend only on $\varepsilon$ , $\delta$ and $M$ , and such that

This result follows by the argument in , where one considers probability estimates conditional on the event that the norm of a product $W$ of a random matrix and a non-random orthogonal projection is small (see [28, Proposition 7.3]).

After this paper was written, two important related results appeared on the universality of the smallest singular value in two extreme regimes – for almost square matrices and for genuinely rectangular matrices. One of these results, by T. Tao and V. Vu works for square and almost square matrices where the the defect $m-n$ is constant. It is valid for matrices with i.i.d. entries with mean zero, unit variance and bounded $C$ -th moment where $C$ is a sufficiently large absolute constant. The result states that the smallest singular value of such $m\times n$ matrices $A$ is asymptotically the same as of the Gaussian matrix $G$ of the same dimensions and with i.i.d. standard normal entries. Specifically,

This universality result, combined with the known asymptotic estimates of the smallest singular value of Gaussian matrices $s_{\min}(G)$ allows one to obtain bounds sharper than in Corollary 1.2. However, the universality result of is only known in the almost square regime $m-n=O(1)$ (and under stronger moment assumptions), while Corollary 1.2 is valid for all dimensions $m\geq n$ .

Another recent universality result was obtained by O. Feldheim and S. Sodin for genuinely rectangular matrices, i.e. with aspect ratio $m/n$ separated from $1$ by a constant, and with subgaussian i.i.d. entries. In particular they proved the inequality

Deviation inequalities (1.7) and (1.10) complement each other – the former is multiplicative (and is valid for arbitrary dimensions) while the latter is additive (and is applicable for genuinely rectangular matrices). Each of these two inequalities clearly has the regime where it is stronger.

4. Outline of the argument

This bound is already independent of the dimension $N$ , but is off by $\sqrt{\log n}$ from being optimal. The logarithmic term is unfortunately a limitation of this method. This term comes from M. Rudelson’s result, Theorem 3.1 below, where it is needed in full generality. It would be useful to understand the situations where the logarithmic term can be removed from M. Rudelson’s theorem. So far, only one such situation is known from where the independent random vectors $X_{j}$ are uniformly distributed in a convex body.

The advantage of almost square matrices is that the magnitude of their entries is easy to control. A simple consequence of the $(4+\varepsilon)$ -th moment hypothesis and Markov’s inequality yields that the entries of $A=(a_{ij})$ satisfy $\max_{i,j}|a_{ij}|\leq\sqrt{n}$ with high probability. Note that the same estimate holds for square matrices ( $N=n$ ) under the fourth moment assumption. So, in regard to the magnitude of entries, almost square matrices are similar to exactly square matrices, for which the desired bound follows from R. Latala’s result (1.2).

This prompts us to construct the proof of Theorem 1.1 for almost square matrices similarly to R. Latala’s argument in , i.e. using fairly standard concentration of measure results in the Gauss space, coupled with delicate constructions of nets. We first decompose $A$ into a sum of matrices which contain entries of similar magnitude. As the magnitude increases, these matrices become sparser. This quickly reduces the problem to random sparse matrices, whose entries are i.i.d. random variables valued in $\{-1,0,1\}$ . The spectral norm of random sparse matrices was studied in as a development of the work of Z. Furedi and J. Komlos . However, we need to bound the spectral norm of the matrix $W=BA$ rather than $A$ . Independence of entries is not available for $W$ , which makes it difficult to use the known combinatorial methods based on the bounding trace of high powers of $W$ .

Acknowledgement

The author is grateful for the referee for careful reading of the manuscript, and for many suggestions which greatly improved the presentation.

Preliminaries

Throughout the paper, the results are stated and proved over the field of real numbers. They are easy to generalize to complex numbers.

We denote by $C,C_{1},c,c_{1}\ldots$ positive absolute constants, and by $C(\varepsilon),C_{1}(\varepsilon),\ldots$ positive quantities that may depend only on the parameter $\varepsilon$ . Their values can change from line to line.

2. Concentration of measure

The method that we carry out in Section 4 uses concentration in the Gauss space in combination with constructions of $\varepsilon$ -nets. Here we recall some basic facts we need.

where $c_{0}\in(0,1)$ is an absolute constant.

As a very restrictive but useful example, Theorem 2.1 implies the following deviation inequality for sums of independent exponential random variables $g_{i}^{2}$ (which can also be derived by the more standard approach via moment generating functions).

Let $d=(d_{1},\ldots,d_{m})$ be a vector of real numbers, and let $g_{1},\ldots,g_{m}$ be independent standard normal random variables. Then, for every $t>0$ we have

Another classical deviation inequality we will need is Bennett’s inequality, see e.g. [9, Theorem 2]:

Let $X_{1},\ldots,X_{N}$ be independent mean zero random variables such that $|X_{i}|\leq 1$ for all $i$ . Consider the sum $S=X_{1}+\cdots+X_{N}$ and let $\sigma^{2}:=\operatorname{Var}(S)$ . Then, for every $t>0$ we have

We will also need M. Talagrand’s concentration inequality for convex Lipschitz funcitons from [31, Theorem 6.6]; see also [18, Corollary 4.10] and the discussion below it.

3. Nets

Consider a subset $U$ of a normed space $X$ , and let $\varepsilon>0$ . Recall that an $\varepsilon$ -net of $U$ is a subset $\mathcal{N}$ of $U$ such that the distance from any point of $U$ to $\mathcal{N}$ is at most $\varepsilon$ . In other words, for every $x\in U$ there exists $y\in\mathcal{N}$ such that $\|x-y\|_{X}\leq\varepsilon$ .

The following estimate follows by a volumetric argument, see e.g. the proof of Lemma 9.5 in .

When computing norms of linear operators, $\varepsilon$ -nets provide a convenient discretization of the problem. We formalize it in the next proposition.

Let $A:X\to Y$ be a linear operator between normed spaces $X$ and $Y$ , and let $\mathcal{N}$ be an $\varepsilon$ -net of either the unit sphere $S(X)$ or the unit ball $B(X)$ of $X$ for some $\varepsilon\in(0,1)$ . Then

We give the proof for an $\varepsilon$ -net of the unit sphere; the case of the unit ball is similar. Every $z\in S(X)$ has the form $z=x+h$ , where $x\in\mathcal{N}$ and $\|h\|_{X}\leq\varepsilon$ . Since $\|A\|=\sup_{z\in S(X)}\|Az\|_{Y}$ , the triangle inequality yields

The last term in the right hand side is bounded by $\varepsilon\|A\|$ . Thus we have shown that

4. Symmetrization

We will use the standard symmetrization technique as was done in ; see more general inequalities in e.g. [19, Section 6.1]. To this end, let the matrices $A=(a_{ij})$ and $B$ be as in Theorem 1.1. Let $A^{\prime}=(a^{\prime}_{ij})$ be an independent copy of $A$ , and let $\varepsilon_{ij}$ be independent symmetric Bernoulli random variables. Then, by Jensen’s inequality,

Therefore, we can assume without loss of generality in Theorem 1.1 that $a_{ij}$ are symmetric random variables. Furthermore, let $g_{ij}$ be independent standard normal random variables. Then, again by Jensen’s inequality,

Conditioning on $a_{ij}$ , we thus reduce the problem to random gaussian matrices.

We will use a similar symmetrization technique several times in our argument. In particular, in the proof of Lemma 3.8 we apply the following observation, which can be deduced from standard symmetrization lemma ( Lemma 6.3) and the contraction principle ( Theorem 4.4). For the reader’s convenience we include a direct proof.

Consider independent mean zero random variables $Z_{ij}$ such that $|Z_{ij}|\leq 1$ , independent symmetric Bernoulli random variables $\varepsilon_{ij}$ , and vectors $x_{ij}$ in some Banach space, where both $i$ and $j$ range in some finite index sets. Then

To be specific, we can assume that both indices $i$ and $j$ range in the interval $\{1,\ldots,n\}$ for some integer $n$ . Let $(Z^{\prime}_{ij})$ denote an independent copy of the sequence of random variables $(Z_{ij})$ . Then $Z_{ij}-Z^{\prime}_{ij}$ are symmetric random variables. We have

is a convex function. Therefore, on the compact convex set $^{n^{2}}$ it attains its maximum on the extreme points, where all $a_{ij}=\pm 1$ . By symmetry, the function takes the same value at each extreme point, which equals

5. Truncation and conditioning

We will need some elementary observations related to truncation and conditioning of random variables.

Let $X$ be a non-negative random variable, and let $M>0$ , $p\geq 1$ . Then

We will also need two elementary conditioning lemmas. In Section 4, we will need to control the maximal magnitude of the entries $M_{0}=\max_{ij}|a_{ij}|$ of the random matrix $A$ . Conditioning on $M_{0}$ will unfortunately destroy the independence of the entries. So, we will instead condition on an event $\{M_{0}\leq t\}$ for fixed $t$ , which will clearly preserve the independence. This conditional argument used in the proof of Corollary 4.11 relies on the following two elementary lemmas.

Let $X$ be a random variable and $K$ be a real number. Then

Let $X$ , $Y$ be non-negative random variables. Assume there exists $K,L>0$ such that one has for every $t\geq 1$ :

Without loss of generality we can assume that $K=1$ by rescaling $X$ to $X/K$ . Thus we have for every $t\geq 1$ :

By (2.3) and Hölder’s inequality, the first term is bounded as

Further terms can be estimated by Cauchy-Schwarz inequality and using (2.3) and the second inequality in (2.2). Indeed,

6. On the deterministic matrix B𝐵B in Theorem 1.1.

We start with two initial observations that will make our proof of Theorem 1.1 more transparent. By adding an appropriate number of zero rows to $B$ or zero columns to $A$ we can assume without loss of generality that $n=m$ , thus $B$ is an $n\times N$ matrix.

where $\|\cdot\|_{\rm{HS}}$ denotes the Hilbert-Schmidt norm. Throughout the argument, we will only have access to the matrix $B$ through inequalities (2.4). This explains Remark 2 following Theorem 1.1, which states that the range space of $B$ is irrelevant as long as we control the spectral and Hilbert-Schmidt norms of $B$ .

Approach via M. Rudelson’s theorem

In particular, for every $t>0$ , with probability at least $1-2me^{-ct^{2}}$ one has

The first estimate is taken from [22, inequality (3.4)]. The second estimate can be easily derived from it using the following elementary lemma:

Suppose a non-negative random variable $X$ satisfies for some $m\geq 1$ that

Next, if $t<\max(1,\sqrt{\log m})$ then by choosing the absolute constant $c>0$ sufficiently small right hand side of (3.1) is larger than $1$ for a sufficiently small absolute constant $c$ . Therefore, for every $t>0$ one has

because if $t<\max(1,\sqrt{\log m})$ then the right hand side of (3.1) is larger than one, which makes the inequality trivial. This completes the proof. ∎

The next lemma is a consequence of M. Rudelson’s Theorem 3.1 and a standard symmetrization argument.

Let $\varepsilon_{1},\ldots,\varepsilon_{n}$ be independent symmetric Bernoulli random variables. By the triangle inequality, the standard symmetrization argument (see e.g. [19, Lemma 6.3]), and the assumption, we have

Now we take expectation with respect to $X_{1},\ldots,X_{n}$ and use Cauchy-Schwarz inequality to get

2. Theorem 1.1 up to a logarithmic term

We now state a version of Theorem 1.1 with a logarithmic factor.

Let $N,n$ be positive integers. Consider an $N\times n$ random matrix $A$ whose entries are independent random variables with mean zero and $4$ -th moment bounded by $1$ . Let $B$ be an $n\times N$ matrix such that $\|B\|\leq 1$ . Then

The proof will need two auxiliary lemmas. Recall that $B_{1},\ldots,B_{N}$ denote the columns of the matrix $B$ .

The estimate on the expectation follows easily from (2.4):

To estimate the variance, we need to compute

By independence and the mean zero assumption, the only nonzero terms in this sum are those for which $i=j;k=l$ or $i=k;j=l$ or $i=l;j=k$ . Therefore

By the fourth moment assumption and using (2.4) we have

This result says that all columns of the matrix $BA$ have norm $O(\sqrt{n})$ with high probability. Since the spectral norm of a matrix is bounded below by the norm of any column, this result is a necessary step in proving our desired estimate $\|BA\|=O(\sqrt{n})$ .

Let us fix $j\in\{1,\ldots,n\}$ and use Lemma 3.5. This gives

Now we use Chebychev’s inequality, which states that for a random variable $Z$ with $\sigma^{2}=\operatorname{Var}(Z)$ and for an arbitrary $k>0$ , one has

Let $t>0$ be arbitrary. Using Chebychev’s inequality along with (3.5) for $Z=\|X_{j}\|_{2}^{2}$ , $k=t\sqrt{n}$ , we obtain

Taking the union bound over all $j=1,\ldots,n$ , we conclude that

This shows that condition (3.2) holds. Lemma 3.3 then gives

Estimating the maximum in the right hand side using Lemma 3.6, we conclude that

3. Tradeoff between the matrix norm and the magnitude of entries

We would like now to gain more control over the logarithmic factor than we have in Proposition 3.4. Our next result establishes a tradeoff between the logarithmic factor and the magnitude of the matrices $A$ , $B$ . It will be used in the proof of Theorem 3.9.

Let $a,b\geq 0$ and $N,n$ be positive integers. Let $A$ be an $N\times n$ matrix whose entries are random independent variables $a_{ij}$ with mean zero and such that

Let $B$ be an $n\times N$ matrix such that $\|B\|\leq 1$ , and whose columns satisfy

The proof will again be based on M. Rudelson’s Theorem 3.1, although this time we use Rudelson’s theorem in a more delicate way:

Under the assumptions of Proposition 3.7, we have

Next, clearly $\mu_{ij}^{2}\leq a^{2}$ , so

where $\varepsilon_{ij}$ denote independent symmetric Bernoulli random variables.

Let $t>0$ . By the second part of M. Rudelson’s Theorem 3.1 and taking the union bound over $n$ random variables, we conclude that, with probability at least $1-2n^{2}e^{-ct^{2}}$ , we have

The second estimate follows from (3.7) and since $\max_{i}\|B_{i}\|_{2}\leq b$ by the hypothesis.

Let $s>0$ be arbitrary. We apply the above estimate for $t$ chosen so that $2n^{2}e^{-ct^{2}}=e^{-s^{2}}$ . This shows that, with probability at least $1-e^{-s^{2}}$ , one has

Putting this into (3.9) and, together with (3.8), back into (3.6), we complete the proof. ∎

which does not depend on the random variables $(g_{ij})$ , has expectation

We condition on the random variables $(a_{ij})$ ; this fixes a value of $K$ .

Consider a $(1/2)$ -net $\mathcal{N}$ of the unit Euclidean sphere $S^{n-1}$ of cardinality $|\mathcal{N}|\leq 5^{n}$ , which exists by Lemma 2.5. Using Proposition 2.6, we have

Fix $x\in\mathcal{N}$ . For every $j=1,\ldots,n$ , the random variable

is a Gaussian random variable with mean zero and variance

(To obtain the first inequality, take the supremum over $x\in S^{n-1}$ ). Therefore, by Corollary 2.2 with $d_{i}=(\operatorname{Var}\langle X_{i},x\rangle)^{1/2}\leq K$ , we have for every $t>0$ :

Let $s>0$ be arbitrary. The previous estimate for $t=sK\sqrt{n}$ gives

Taking the union bound over $x\in\mathcal{N}$ and using (3.11), we obtain

Finally, we take expectation with respect to the random variables $(a_{ij})$ and use (3.10) to conclude that

4. Theorem 1.1 for logarithmically small columns

Our next step is to combine Propositions 3.4 and 3.7 and obtain a weaker version of the main Theorem 1.1 – this time with the correct bound $O(\sqrt{n})$ on the norm, but under the additional assumption that the columns of the matrix $B$ are logarithmically small.

Let $\varepsilon\in(0,1)$ and let $N,n$ be positive integers. Consider an $N\times n$ random matrix $A$ whose entries are independent random variables with mean zero and $(4+\varepsilon)$ -th moment bounded by $1$ . Let $B$ be an $n\times N$ matrix such that $\|B\|\leq 1$ , and whose columns satisfy for some $M\geq 1$ that

By the symmetrization argument described in Section 2, we can assume without loss of generality that all entries $a_{ij}$ of the matrix $A=(a_{ij})$ are symmetric random variables. Let

We decompose every entry of the matrix $A$ according to its absolute value as

The norm of $B\bar{A}$ can be bounded using Proposition 3.7, which we can apply with $a$ as above and $b=M\log^{-\frac{1}{2}-\frac{1}{\varepsilon}}(2n)$ . This gives

where the last inequality follows by our choice of $a$ and $b$ .

Putting the two estimates together, we conclude by the triangle inequality that

The factor $M^{1/2}$ in the conclusion of Theorem 3.9 can easily be improved to about $M^{\varepsilon/2}$ by choosing $a=t\log^{\frac{1}{2\varepsilon}}(2n)$ in the proof and optimizing in $t$ . We will not need this improvement in our argument.

Approach via concentration

In this section, we develop an alternative way to bound the norm of $BA$ , which rests on Gaussian concentration inequalities and elaborate choice of $\varepsilon$ -nets. The main technical result of this section is the following theorem, which, like Theorem 3.9, gives the correct bound $O(\sqrt{n})$ under some boundedness assumptions on the entries of $A$ .

Let $\varepsilon\in(0,1)$ , $M\geq 1$ and let $N\geq n$ be positive integers such that $\log(2N)\leq Mn$ . Consider an $N\times n$ random matrix $A$ whose entries are independent random variables $a_{ij}$ with mean zero and such that

Let $B$ be an $n\times N$ matrix such that $\|B\|\leq 1$ . Then

where $C(\varepsilon)$ depends only on $\varepsilon$ .

1. If the entries $a_{ij}$ have bounded $(4+\varepsilon)$ -th moment, it is easy to check that $\max_{ij}a_{ij}\sim(nN)^{\frac{1}{4+\varepsilon}}$ holds with high probability. Therefore, under the $(4+\varepsilon)$ -th moment assumption, the hypotheses of Theorem 4.1 are satisfied for almost square matrices, i.e. those for which $N\leq n^{1+c\varepsilon}$ . This will quickly yield the main Theorem 1.1 for almost square matrices, see Corollary 4.11 below.

3. Using M. Talagrand’s concentration result, Theorem 2.4, one can also obtains tail bounds for the norm $\|BA\|$ :

Under the assumptions of Theorem 4.1, one has for every $t>0$ :

In particular, one has for every $q\geq 1$ :

Theorem 4.1 will follow from our analysis of sparse matrices. We will decompose the entries $a_{ij}$ according to their magnitude. As the magnitude increases, the moment assumptions will ensure that there will be fewer such entries, i.e. the resulting matrix becomes sparser.

We start with an elementary lemma, which will help us analyze the magnitude of the rows and columns of the matrix $BA$ when $A$ is a sparse matrix.

Let $N,n$ be positive integers. Consider independent random variables $a_{ij}$ , $i=1,\ldots,N$ , $j=1,\ldots,n$ . Let $p\in(0,1]$ , and suppose that

Let $B$ be an $n\times N$ matrix such that $\|B\|\leq 1$ , whose columns are denoted $B_{i}$ . Then

We will only prove inequality (4.2); the proof of inequality (4.1) is similar. By the assumptions, we have

Consider the sums of independent random variables

The above estimates show that for every $j$ we have

Taking the union bound over all $j$ , we conclude that

Now let $s\geq 1$ be arbitrary, and use the last inequality for $t=(np+\log(2n))s$ . We obtain

The estimates in Lemma 4.3 motivate us to consider the class of $N\times n$ matrices $A=(a_{ij})$ whose entries satisfy the following inequalities for some parameters $p\in(0,1]$ and $K\geq 1$ :

2. Concentration for a fixed vector

Our goal will be to estimate the magnitude of $\|BA\|$ for matrices of the form $A=(g_{ij}a_{ij})$ , where $g_{ij}$ are independent standard normal random variables, and $a_{ij}$ are fixed numbers that satisfy conditions (4.4). Such an estimate will be established in Proposition 4.8 below. By the standard symmetrization, the same estimate will hold true if $A=(a_{ij})$ is a random matrix with entries as in Lemma 4.3. This will be done in Corollary 4.9. Finally, Theorem 4.1 will be deduced from this by decomposing the entries of a random matrix according to their magnitude.

Our first step toward this goal is to check the magnitude of $\|BAx\|_{2}$ for a fixed vector $x$ .

Let $N,n$ be positive integers. Consider an $N\times n$ random matrix $A=(g_{ij}a_{ij})$ where $g_{ij}$ are independent standard normal random variables and $a_{ij}$ are numbers that satisfy conditions (4.4). Let $B$ be an $n\times N$ matrix such that $\|B\|\leq 1$ . Then, for every vector $x\in B_{2}^{n}$ we have

Denoting as usual the columns of $B$ by $B_{i}$ , we have

Since $\|x\|_{2}\leq 1$ and using the last condition in (4.4), we have

We will now strengthen Lemma 4.4 into a deviation inequality for $\|BAx\|_{2}$ . This is a simple consequence of the Gaussian concentration, Theorem 2.1. This deviation inequality is universal in that it holds for any vector $x$ ; in the sequel we will need more delicate inequalities that depend on the distribution of the coordinates in $x$ .

Let $A$ and $B$ be matrices as in Lemma 4.4. Then, for every vector $x\in B_{2}^{n}$ and every $t>0$ we have

where $B_{i}$ are the columns of the matrix $B$ . Therefore, the random vector $BAx$ is distributed identically with the random vector

and where $g_{i}$ are independent standard normal random variables. Since all $|a_{ij}|\leq 1$ by conditions (4.4), and $\|x\|_{2}\leq 1$ by the assumptions, we have

Then the Gaussian concentration, Theorem 2.1, gives for every $t>0$ :

where $g=(g_{1},\ldots,g_{N})$ . Since as we noted above, $f(g)$ is distributed identically with $\|BAx\|_{2}$ , Lemma 4.4 completes the proof. ∎

3. Control of sparse vectors

Nevertheless, the bound in Lemma 4.5 can be made uniform over a set of sparse vectors, whose metric entropy is smaller than that of the whole sphere:

Let $A$ and $B$ be matrices as in Lemma 4.4. There exists an absolute constant $c>0$ such that the following holds. Consider the set of vectors

Let $c>0$ be a constant to be determined later, and let $\lambda:=cp/\log(e/p)$ . Then

Using Proposition 2.6 and taking the union bound over all $x\in\mathcal{N}_{J}$ , we obtain

Since there are $\binom{n}{\lfloor\lambda n\rfloor}\leq(e/\lambda)^{\lambda n}$ ways to choose the subset $J$ , by taking the union bound over all $J$ we conclude that

Finally, if the absolute constant $c>0$ in the definition of $\lambda$ is chosen sufficiently small, we have $\lambda\log(e/\lambda)n+2\lambda n\leq c_{0}np$ . Thus the right hand side of (4.6) is at most

4. Control of spread vectors

Although we now have a good control of sparse vectors, they unfortunately comprise a small part of the unit ball $B_{2}^{n}$ . More common but harder to deal with are “spread vectors” – those having many coordinates that are not close to zero. The next result gains control of the spread vectors.

Let $A$ and $B$ be matrices as in Lemma 4.4 with $N\geq n$ . Let $M\geq 2$ . Consider the set of vectors

This time we will need to work with multiple nets to account for different possible distributions of the magnitude of the coordinates of vectors $x\in B_{2,\infty}$ . Since $\|x\|_{\infty}\leq\|x\|_{2}$ , without loss of generality we can assume that $M\leq\sqrt{n}$ .

A standard calculation shows that $\mathcal{N}$ is an $\frac{1}{2}$ -net of $B_{2,\infty}$ in the $B_{2,\infty}$ -norm, i.e. for every $x\in B_{2,\infty}$ there exists $y\in\mathcal{N}$ such that $x-y\in\frac{1}{2}B_{2,\infty}$ . Therefore, by Proposition 2.6,

Fix $x\in\mathcal{N}$ . Since $\|x\|_{2}\leq 1$ , the number of coordinates of $x$ that satisfy $|x_{j}|=h_{k}$ is at most $\lfloor h_{k}^{-2}\rfloor$ , for every $k$ . Decomposing $x$ according to the coordinates whose absolute value is $h_{k}$ , we have by the triangle inequality that

Fix $k$ and assume that $\mathcal{N}_{k}\neq\emptyset$ . Since $h_{k}\leq M/\sqrt{n}$ , we have

To estimate the cardinality of $\mathcal{N}_{k}$ , note that there are at most $\min(m,n)$ ways to choose $\|x\|_{0}:=l$ ; there are $\binom{n}{l}$ ways to choose the support of $x$ ; and there are $2^{l}$ ways to choose the (signs of) nonzero coordinates of $x$ . Hence by Stirling’s approximation and using (4.8), we have

Step 2: control of a fixed vector. Fix $m$ and fix $x\in\mathcal{N}_{k}$ . As we saw in the proof of Lemma 4.5,

and where $g_{i}$ are independent standard normal random variables. Since $x\in\mathcal{N}_{k}$ , we have $\|x\|_{\infty}=h_{k}\leq\frac{1}{\sqrt{m}}$ . This and the second condition in (4.4) yield

Repeating the estimate in the proof of Lemma 4.5, we bound the Lipschitz norm as

Then the Gaussian concentration, Theorem 2.1, gives for every $t>0$ :

where $g=(g_{1},\ldots,g_{N})$ . Since as we noted above, $f(g)$ is distributed identically with $\|BAx\|_{2}$ , Lemma 4.4 yields that

Let $u>0$ be arbitrary. Applying the above estimate for $t=uK\sqrt{np+\log(2N)}$ and using $N\geq n$ we conclude that

Step 3: union bound. Taking the union bound in (4.10) over all $x\in\mathcal{N}_{k}$ and using estimate (4.9) on the cardinality of $\mathcal{N}_{k}$ , we have for all $u>0$ :

Let $s\geq 1$ . We choose $u=C_{1}s\sqrt{\log M}$ , where $C_{1}:=\sqrt{C/c_{0}}$ . Since $u\geq 1$ and $m\geq 1$ , $M\geq 2$ , we obtain from the above estimate that

Putting this back in (4.7), we conclude that

5. Norms of sparse matrices, and proof of Theorem 4.1

Propositions 4.6 and 4.7 together handle all vectors in the unit ball, and yield the following norm estimate:

Let $A$ and $B$ be matrices as in Lemma 4.4 with $N\geq n$ . Then

Let $c$ be the absolute constant as in Proposition 4.6; we can clearly assume that $c\leq 1/4$ . We define

Note that $M\geq 2$ as required in Proposition 4.6.

Fix a vector $x\in B_{2}^{n}$ . We decompose it according to the magnitude of the coordinates, as follows:

Clearly, $\|y\|_{2}\leq\|x\|_{2}\leq 1$ , $\|z\|_{2}\leq\|x\|_{2}\leq 1$ . By Markov’s inequality, we have

Then $y\in B_{2,0}$ as in Proposition 4.6. On the other hand, $\|z\|_{\infty}\leq M/\sqrt{n}$ by definition, so $z\in B_{2,\infty}$ as in Proposition 4.7. Therefore, by Propositions 4.6 and 4.7 we have

Our choice of $M$ and the assumption $N\geq n$ completes the proof. ∎

Finally, a standard symmetrization argument yields the following norm estimate, which we shall use for sparse random matrices.

Let $p\in(0,1]$ and let $N\geq n$ be positive integers. Consider an $N\times n$ random matrix $A$ whose entries are independent random variables $a_{ij}$ with mean zero and such that

Let $B$ be an $n\times N$ matrix such that $\|B\|\leq 1$ . Then

It would be interesting to remove the logarithmic term from this estimate.

By Lemma 4.3, conditions (4.4) hold with some random parameter $K\geq 1$ which only depends on the random variables $(a_{ij})$ and not on $(g_{ij})$ , and which satisfies

Condition on the random variables $(a_{ij})$ . Proposition 4.8 then yields

Therefore, when we remove the conditioning, we obtain by (4.12) that

Then we have a decomposition $A=\sum_{k=0}^{\infty}2^{k}A^{(k)}$ . This sum is actually finite because of the boundedness assumption on $a_{ij}$ . Indeed, we have

where $k_{0}$ is the maximal integer such that

Using Corollary 4.9 for the matrix $A^{(0)}$ and $p=1$ , we obtain

where the last line follows because $\log(2N)\leq Mn$ and $M\geq 1$ by the hypothesis.

Now we fix $1\leq k\leq k_{0}$ . Using the $(2+\varepsilon)$ -th moment assumption, we have by Markov’s inequality that

By the definition of $p_{k}$ and by (4.14), we have

Therefore, $np_{k}+\log(2N)\leq(1+M)np_{k}\leq 2Mnp_{k}$ , so

Using (4.13) and the triangle inequality, then using (4.15) and (4.5), we conclude that

This completes the proof of Theorem 4.1. ∎

6. Almost square matrices

The main application of Theorem 4.1 is for almost square matrices – those for which $N=n^{1+o(1)}$ . The next lemma verifies the hypotheses of Theorem 4.1 for such matrices.

Let $\varepsilon\in(0,1)$ and let $N,n$ be positive integers satisfying $N\leq n^{1+\varepsilon/10}$ . Let $A$ be an $N\times n$ random matrix whose entries are independent random variables with $(4+\varepsilon)$ -th moment bounded by $1$ . Define the random variable $M$ by the equation

By Markov’s inequality, we have for every $i,j$ that

Taking the union bound over all $nN$ random variables $a_{ij}$ , we obtain

The assumption $N\leq n^{1+\varepsilon/10}$ yields that

Therefore, since $\frac{2+\varepsilon/8}{4+\varepsilon}\leq\frac{1}{2+\varepsilon/4}$ and $t\geq 1$ , we have

We are now ready to state and prove a partial case of Theorem 1.1 for almost square matrices.

Let $\varepsilon\in(0,1)$ and let $N,n$ be positive integers satisfying $N\leq n^{1+\varepsilon/10}$ . Let $A$ be an $N\times n$ random matrix whose entries are independent random variables with mean zero and $(4+\varepsilon)$ -th moment bounded by $1$ . Let $B$ be an $n\times N$ matrix such that $\|B\|\leq 1$ . Then

Without loss of generality we may assume that $N\geq n$ by adding an appropriate number of zero rows to $A$ and zero columns to $B$ . Also, using the standard symmetrization, we can assume that the random variables $a_{ij}$ are symmetric. Let $M$ be the random variable as in Lemma 4.10, and let $t\geq 1$ . By the definition, $\{M\leq t\}$ is the product event. Therefore, conditioning on this event (i) preserves the independence of the entries of $A$ ; (ii) makes all these entries bounded as in (4.17); (iii) can only reduce their moments by Lemma 2.9, thus for all $i,j$ we have

Therefore, we can apply Corollary 4.2 conditionally, with $\varepsilon/4$ and with $M$ replaced by $\max(M,10)$ , which gives

Completion of the proof of Theorem 1.1

By adding an appropriate number of zero rows to $B$ or zero columns to $A$ we can assume that $m=n$ , thus $B$ is an $n\times N$ matrix. Consider the exponent

As usual, let $B_{1},\ldots,B_{N}$ be the columns of the matrix $B$ . Consider the subset $I\subset\{1,\ldots,N\}$ of large columns defined as

Here we choose $C_{0}(\varepsilon)$ sufficiently large so that, by (2.4) and Markov’s inequality, we have

Denote by $A_{I}$ the $N_{0}\times n$ submatrix of $A$ whose rows are in $I$ , by $B_{I}$ the $n\times N_{0}$ submatrix of $B$ whose columns are in $I$ (and similarly for $I^{c}$ ). The decomposition $BA=B_{I}A_{I}+B_{I^{c}}A_{I^{c}}$ implies by the triangle inequality that

This splits our problem into two subproblems, one for $I$ and one for $I^{c}$ . Of course, if $I$ or $I^{c}$ is empty then the corresponding matrix is zero and we can skip its estimation.

The matrices $A_{I}$ , $B_{I}$ are almost square, so Corollary 4.11 applies for them, giving

On the other hand, the columns of the matrix $B_{I^{c}}$ are small by the definition of $I$ :

Therefore, Theorem 3.9 applies to the matrices $A_{I^{c}}$ , $B_{I^{c}}$ , which gives

Putting estimates (5.2) and (5.3) into (5.1), we conclude that